1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

2751 commits

Author SHA1 Message Date
Qu Wenruo
bf50aca633 btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
The function btrfs_cleanup_ordered_extents() is only called in error
handling path, and the last caller with a @locked_folio parameter was
removed to fix a bug in the btrfs_run_delalloc_range() error handling.

There is no need to pass @locked_folio parameter anymore.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 16:00:50 +01:00
Qu Wenruo
975a6a8855 btrfs: add extra error messages for delalloc range related errors
All the error handling bugs I hit so far are all -ENOSPC from either:

- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()

Previously when those functions failed, there was no error message at
all, making the debugging much harder.

So here we introduce extra error messages for:

- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed

One example of the new debug error messages is the following one:

  run fstests generic/750 at 2024-12-08 12:41:41
  BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
  BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
  BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
  BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
  BTRFS info (device dm-4): using free-space-tree
  BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
  BTRFS info (device dm-4): checking UUID tree
  BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
  BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
  BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28

Which shows an error from cow_file_range() which is called inside a
nocow write attempt, along with the extra bitmap from
writepage_delalloc().

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:59:35 +01:00
Qu Wenruo
c2b47df81c btrfs: do proper folio cleanup when run_delalloc_nocow() failed
[BUG]
With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
with the following VM_BUG_ON_FOLIO():

  BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
  BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
  page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
  aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
  flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
  page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
  ------------[ cut here ]------------
  kernel BUG at mm/page-writeback.c:2992!
  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G           OE      6.12.0-rc7-custom+ #87
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  pc : folio_clear_dirty_for_io+0x128/0x258
  lr : folio_clear_dirty_for_io+0x128/0x258
  Call trace:
   folio_clear_dirty_for_io+0x128/0x258
   btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
   __process_folios_contig+0x154/0x268 [btrfs]
   extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
   run_delalloc_nocow+0x5f8/0x760 [btrfs]
   btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
   writepage_delalloc+0x230/0x4c8 [btrfs]
   extent_writepage+0xb8/0x358 [btrfs]
   extent_write_cache_pages+0x21c/0x4e8 [btrfs]
   btrfs_writepages+0x94/0x150 [btrfs]
   do_writepages+0x74/0x190
   filemap_fdatawrite_wbc+0x88/0xc8
   start_delalloc_inodes+0x178/0x3a8 [btrfs]
   btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
   shrink_delalloc+0x114/0x280 [btrfs]
   flush_space+0x250/0x2f8 [btrfs]
   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
   process_one_work+0x164/0x408
   worker_thread+0x25c/0x388
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
  ---[ end trace 0000000000000000 ]---

[CAUSE]
The first two lines of extra debug messages show the problem is caused
by the error handling of run_delalloc_nocow().

E.g. we have the following dirtied range (4K blocksize 4K page size):

    0                 16K                  32K
    |//////////////////////////////////////|
    |  Pre-allocated  |

And the range [0, 16K) has a preallocated extent.

- Enter run_delalloc_nocow() for range [0, 16K)
  Which found range [0, 16K) is preallocated, can do the proper NOCOW
  write.

- Enter fallback_to_fow() for range [16K, 32K)
  Since the range [16K, 32K) is not backed by preallocated extent, we
  have to go COW.

- cow_file_range() failed for range [16K, 32K)
  So cow_file_range() will do the clean up by clearing folio dirty,
  unlock the folios.

  Now the folios in range [16K, 32K) is unlocked.

- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
  Which is called with PAGE_START_WRITEBACK to start page writeback.
  But folios can only be marked writeback when it's properly locked,
  thus this triggered the VM_BUG_ON_FOLIO().

Furthermore there is another hidden but common bug that
run_delalloc_nocow() is not clearing the folio dirty flags in its error
handling path.
This is the common bug shared between run_delalloc_nocow() and
cow_file_range().

[FIX]
- Clear folio dirty for range [@start, @cur_offset)
  Introduce a helper, cleanup_dirty_folios(), which
  will find and lock the folio in the range, clear the dirty flag and
  start/end the writeback, with the extra handling for the
  @locked_folio.

- Introduce a helper to clear folio dirty, start and end writeback

- Introduce a helper to record the last failed COW range end
  This is to trace which range we should skip, to avoid double
  unlocking.

- Skip the failed COW range for the error handling

CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:52:17 +01:00
Qu Wenruo
06f3642847 btrfs: do proper folio cleanup when cow_file_range() failed
[BUG]
When testing with COW fixup marked as BUG_ON() (this is involved with the
new pin_user_pages*() change, which should not result new out-of-band
dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW
fixup path.

This BUG_ON() happens just after a failed btrfs_run_delalloc_range():

  BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/extent_io.c:1444!
  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G           OE      6.12.0-rc7-custom+ #86
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  pc : extent_writepage_io+0x2d4/0x308 [btrfs]
  lr : extent_writepage_io+0x2d4/0x308 [btrfs]
  Call trace:
   extent_writepage_io+0x2d4/0x308 [btrfs]
   extent_writepage+0x218/0x330 [btrfs]
   extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
   btrfs_writepages+0x94/0x150 [btrfs]
   do_writepages+0x74/0x190
   filemap_fdatawrite_wbc+0x88/0xc8
   start_delalloc_inodes+0x180/0x3b0 [btrfs]
   btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
   shrink_delalloc+0x114/0x280 [btrfs]
   flush_space+0x250/0x2f8 [btrfs]
   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
   process_one_work+0x164/0x408
   worker_thread+0x25c/0x388
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
  ---[ end trace 0000000000000000 ]---

[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.

Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.

For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:

   0        16K        32K       48K       64K
   |///////////////////////////////////////|
   |#######################################|

Where |///| means page are still dirty, and |###| means the extent io
tree has EXTENT_DELALLOC flag.

- Enter extent_writepage() for page 0

- Enter btrfs_run_delalloc_range() for range [0, 64K)

- Enter cow_file_range() for range [0, 64K)

- Function btrfs_reserve_extent() only reserved one 16K extent
  So we created extent map and ordered extent for range [0, 16K)

   0        16K        32K       48K       64K
   |////////|//////////////////////////////|
   |<- OE ->|##############################|

   And range [0, 16K) has its delalloc flag cleared.
   But since we haven't yet submit any bio, involved 4 pages are still
   dirty.

- Function btrfs_reserve_extent() returns with -ENOSPC
  Now we have to run error cleanup, which will clear all
  EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
  ranges:

   0        16K        32K       48K       64K
   |////////|                              |
   |        |                              |

  Note that range [0, 16K) still has its pages dirty.

- Some time later, writeback is triggered again for the range [0, 16K)
  since the page range still has dirty flags.

- btrfs_run_delalloc_range() will do nothing because there is no
  EXTENT_DELALLOC flag.

- extent_writepage_io() finds page 0 has no ordered flag
  Which falls into the COW fixup path, triggering the BUG_ON().

Unfortunately this error handling bug dates back to the introduction of
btrfs.  Thankfully with the abuse of COW fixup, at least it won't crash
the kernel.

[FIX]
Instead of immediately unlocking the extent and folios, we keep the extent
and folios locked until either erroring out or the whole delalloc range
finished.

When the whole delalloc range finished without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And the involved folios will be properly submitted, with their dirty
flags cleared during submission.

For the error path, it will be a little more complex:

- The range with ordered extent allocated (range (1))
  We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
  flags are cleaned up by
  btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().

  For folios we finish the IO (clear dirty, start writeback and
  immediately finish the writeback) and unlock the folios.

- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
  For both range (2) and range(3) the behavior is not changed.

Now even if cow_file_range() failed halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.

CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:50:12 +01:00
Qu Wenruo
a7858d5c36 btrfs: fix error handling of submit_uncompressed_range()
[BUG]
If we failed to compress the range, or cannot reserve a large enough
data extent (e.g. too fragmented free space), we will fall back to
submit_uncompressed_range().

But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
due to -ENOSPC or any other error.

In that case there are 3 bugs in the error handling:

1) Double freeing for the same ordered extent
   This can lead to crash due to ordered extent double accounting

2) Start/end writeback without updating the subpage writeback bitmap

3) Unlock the folio without clear the subpage lock bitmap

Both bugs 2) and 3) will crash the kernel if the btrfs block size is
smaller than folio size, as the next time the folio gets writeback/lock
updates, subpage will find the bitmap already have the range set,
triggering an ASSERT().

[CAUSE]
Bug 1) happens in the following call chain:

  submit_uncompressed_range()
  |- run_delalloc_cow()
  |  |- cow_file_range()
  |     |- btrfs_reserve_extent()
  |        Failed with -ENOSPC or whatever error
  |
  |- btrfs_clean_up_ordered_extents()
  |  |- btrfs_mark_ordered_io_finished()
  |     Which cleans all the ordered extents in the async_extent range.
  |
  |- btrfs_mark_ordered_io_finished()
     Which cleans the folio range.

The finished ordered extents may not be immediately removed from the
ordered io tree, as they are removed inside a work queue.

So the second btrfs_mark_ordered_io_finished() may find the finished but
not-yet-removed ordered extents, and double free them.

Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover
other ordered extents.

Bugs 2) and 3) are more straightforward, btrfs just calls folio_unlock(),
folio_start_writeback() and folio_end_writeback(), other than the helpers
which handle subpage cases.

[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is
handling the whole range, we should not do the second
btrfs_mark_ordered_io_finished() call.

And for the first btrfs_cleanup_ordered_extents(), we no longer need to
pass the @locked_page parameter, as we are already in the async extent
context, thus will never rely on the error handling inside
btrfs_run_delalloc_range().

So just let the btrfs_clean_up_ordered_extents() handle every folio
equally.

For bug 2) we should not even call
folio_start_writeback()/folio_end_writeback() anymore.
As the error handling protocol, cow_file_range() should clear
dirty flag and start/finish the writeback for the whole range passed in.

For bug 3) just change the folio_unlock() to btrfs_folio_end_lock()
helper.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:36:06 +01:00
Qu Wenruo
72dad8e377 btrfs: fix double accounting race when btrfs_run_delalloc_range() failed
[BUG]
When running btrfs with block size (4K) smaller than page size (64K,
aarch64), there is a very high chance to crash the kernel at
generic/750, with the following messages:
(before the call traces, there are 3 extra debug messages added)

  BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
  BTRFS info (device dm-3): checking UUID tree
  hrtimer: interrupt took 5451385 ns
  BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
  BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
  BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
  CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G           OE      6.13.0-rc1-custom+ #89
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
  lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
  Call trace:
   can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
   can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
   btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
   extent_writepage+0x10c/0x3b8 [btrfs]
   extent_write_cache_pages+0x21c/0x4e8 [btrfs]
   btrfs_writepages+0x94/0x160 [btrfs]
   do_writepages+0x74/0x190
   filemap_fdatawrite_wbc+0x74/0xa0
   start_delalloc_inodes+0x17c/0x3b0 [btrfs]
   btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
   shrink_delalloc+0x11c/0x280 [btrfs]
   flush_space+0x288/0x328 [btrfs]
   btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
   process_one_work+0x228/0x680
   worker_thread+0x1bc/0x360
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  ---[ end trace 0000000000000000 ]---
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
  BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
  CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G        W  OE      6.13.0-rc1-custom+ #89
  Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
  Workqueue:  btrfs_work_helper [btrfs] (btrfs-endio-write)
  pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : process_one_work+0x110/0x680
  lr : worker_thread+0x1bc/0x360
  Call trace:
   process_one_work+0x110/0x680 (P)
   worker_thread+0x1bc/0x360 (L)
   worker_thread+0x1bc/0x360
   kthread+0x100/0x118
   ret_from_fork+0x10/0x20
  Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Oops: Fatal exception
  SMP: stopping secondary CPUs
  SMP: failed to stop secondary CPUs 2-3
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Kernel Offset: 0x275bb9540000 from 0xffff800080000000
  PHYS_OFFSET: 0xffff8fbba0000000
  CPU features: 0x100,00000070,00801250,8201720b

[CAUSE]
The above warning is triggered immediately after the delalloc range
failure, this happens in the following sequence:

- Range [1568K, 1636K) is dirty

   1536K  1568K     1600K    1636K  1664K
   |      |/////////|////////|      |

  Where 1536K, 1600K and 1664K are page boundaries (64K page size)

- Enter extent_writepage() for page 1536K

- Enter run_delalloc_nocow() with locked page 1536K and range
  [1568K, 1636K)
  This is due to the inode having preallocated extents.

- Enter cow_file_range() with locked page 1536K and range
  [1568K, 1636K)

- btrfs_reserve_extent() only reserved two extents
  The main loop of cow_file_range() only reserved two data extents,

  Now we have:

   1536K  1568K        1600K    1636K  1664K
   |      |<-->|<--->|/|///////|      |
               1584K  1596K
  Range [1568K, 1596K) has an ordered extent reserved.

- btrfs_reserve_extent() failed inside cow_file_range() for file offset
  1596K
  This is already a bug in our space reservation code, but for now let's
  focus on the error handling path.

  Now cow_file_range() returned -ENOSPC.

- btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE
  Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
  [1568K, 1636K)

  Function btrfs_cleanup_ordered_extents() normally needs to skip the
  ranges inside the folio, as it will normally be cleaned up by
  extent_writepage().

  Such split error handling is already problematic in the first place.

  What's worse is the folio range skipping itself, which is not taking
  subpage cases into consideration at all, it will only skip the range
  if the page start >= the range start.
  In our case, the page start < the range start, since for subpage cases
  we can have delalloc ranges inside the folio but not covering the
  folio.

  So it doesn't skip the page range at all.
  This means all the ordered extents, both [1568K, 1584K) and
  [1584K, 1596K) will be marked as IOERR.

  And these two ordered extents have no more pending ios, they are marked
  finished, and *QUEUED* to be deleted from the io tree.

- extent_writepage() do error cleanup
  Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).

  Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
  deletion from io tree is async, it may or may not happen at this
  time.

  If the ranges have not yet been removed, we will do double cleaning on
  those ranges, triggering the above ordered extent warnings.

In theory there are other bugs, like the cleanup in extent_writepage()
can cause double accounting on ranges that are submitted asynchronously
(compression for example).

But that's much harder to trigger because normally we do not mix regular
and compression delalloc ranges.

[FIX]
The folio range split is already buggy and not subpage compatible, it
was introduced a long time ago where subpage support was not even considered.

So instead of splitting the ordered extents cleanup into the folio range
and out of folio range, do all the cleanup inside writepage_delalloc().

- Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
  btrfs_run_delalloc_range()

- Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
  failed

  So all ordered extents are only cleaned up by
  btrfs_run_delalloc_range().

- Handle the ranges that already have ordered extents allocated
  If part of the folio already has ordered extent allocated, and
  btrfs_run_delalloc_range() failed, we also need to cleanup that range.

Now we have a concentrated error handling for ordered extents during
btrfs_run_delalloc_range().

Fixes: d1051d6ebf ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 15:26:23 +01:00
Filipe Manana
5c7763312c btrfs: inode: remove unnecessary calls to btrfs_mark_buffer_dirty()
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.

These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:19 +01:00
Filipe Manana
097a7eef61 btrfs: uncollapse transaction aborts during renames
During renames we are grouping transaction aborts that can be due to a
failure of one of several function calls. While this makes the code less
verbose, it makes it harder to debug as we end up not knowing from which
function call we got an error.

So change this to trigger a transaction abort after each function call
failure, so that when we get a transaction abort message we know exactly
which function call failed, helping us to debug issues.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:18 +01:00
Filipe Manana
9e0d43ea4e btrfs: simplify arguments for btrfs_cross_ref_exist()
Instead of passing a root and an objectid which matches an inode number,
pass the inode instead, since the root is always the root associated to
the inode and the objectid is the number of that inode.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Filipe Manana
6c44075524 btrfs: remove no longer needed strict argument from can_nocow_extent()
All callers of can_nocow_extent() now pass a value of false for its
'strict' argument, making it redundant. So remove the argument from
can_nocow_extent() as well as can_nocow_file_extent(),
btrfs_cross_ref_exist() and check_committed_ref(), because this
argument was used just to influence the behavior of check_committed_ref().
Also remove the 'strict' field from struct can_nocow_file_extent_args,
which is now always false as well, as its value is taken from the
argument to can_nocow_extent().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:16 +01:00
Naohiro Aota
7de9ca1f30 btrfs: drop fs_info argument from btrfs_update_space_info_*()
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have
the fs_info in a space_info. So, we can drop fs_info argument from
btrfs_update_space_info_*. There is no behavior change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 14:53:14 +01:00
Linus Torvalds
c059361673 for-6.13-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdw8AgACgkQxWXV+ddt
 WDsL4w/+Ib5WGmd2Rjn1+1X9U5dzrEb+/072UBAhwwaqOOUTlBofeyRSdYqFB0oZ
 aucRMXdXPpVe1xrXsj0WsOZmPsuZT46Eh2ALqqZP5fO1sgBkJ2WmQF0Ei7uypfb+
 abQwiEO2IaMMwt2XgDNzbpZS7oVNGEXHzoHF0R/deL4FoBDNMsbCfRnW+L9++tWU
 dUSpafLhgMMwivJN07VJYwU4ZVXsBhmKv2qI8WpJ5w9kJb1ssN692CvBOVjhuSYd
 A8IMV84dW2KO37fmPqN36QAWotz4mKpv8yrhjJvrix7nAOcXe3TXFUhaFBh1Vmzg
 G5bhkqYcNP6UHT7CIcLZE1mdv6ZAKTp0zSNCh2Uu51+MJL2tIQVjTaUQhbkYLnLN
 9DS2dXz4ksm9ISrjr2tmPe4kgyNQIrp5TCdwXu3CYs+AaU7yKeEBukZ7mXcp/e/W
 TdLKvzPRLMED8mGlFBwg2QbOvcJJ663UW2esyv6DvC61F3tXyiV2RXSC/1qF+RyZ
 FBJvvEevensQlASn1NScuQV+iEQpMo2lMURnRjSG8dGhwMmHpW3wifa2TJDyBzWS
 AH0MriQA9nsYQTkPGPnqr46/BAhFG2vEfVlX20Sk9S0PTBLu8YRy/o2evcV67J8v
 zGaa5pa7fQPbEjRv4Rthdb4R2VIFkZTOtIZSZfjHkPDjtvS7ahU=
 =NwGH
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes that accumulated over the last two weeks, fixing some
  user reported problems:

   - swapfile fixes:
       - conditional reschedule in the activation loop
       - fix race with memory mapped file when activating
       - make activation loop interruptible
       - rework and fix extent sharing checks

   - folio fixes:
       - in send, recheck folio mapping after unlock
       - in relocation, recheck folio mapping after unlock

   - fix waiting for encoded read io_uring requests

   - fix transaction atomicity when enabling simple quotas

   - move COW block trace point before the block gets freed

   - print various sizes in sysfs with correct endianity"

* tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: sysfs: fix direct super block member reads
  btrfs: fix transaction atomicity bug when enabling simple quotas
  btrfs: avoid monopolizing a core when activating a swap file
  btrfs: allow swap activation to be interruptible
  btrfs: fix swap file activation failure due to extents that used to be shared
  btrfs: fix race with memory mapped writes when activating swap file
  btrfs: check folio mapping after unlock in put_file_data()
  btrfs: check folio mapping after unlock in relocate_one_folio()
  btrfs: fix use-after-free when COWing tree bock and tracing is enabled
  btrfs: fix use-after-free waiting for encoded read endios
2024-12-29 09:34:34 -08:00
Filipe Manana
2c8507c63f btrfs: avoid monopolizing a core when activating a swap file
During swap activation we iterate over the extents of a file and we can
have many thousands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:04:48 +01:00
Filipe Manana
9a45022a0e btrfs: allow swap activation to be interruptible
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:04:38 +01:00
Filipe Manana
03018e5d85 btrfs: fix swap file activation failure due to extents that used to be shared
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do a unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).

However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.

The following test script reproduces the bug:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi
   NUM_CLONES=50

   umount $DEV &> /dev/null

   run_test()
   {
        local sync_after_add_reflinks=$1
        local sync_after_remove_reflinks=$2

        mkfs.btrfs -f $DEV > /dev/null
        #mkfs.xfs -f $DEV > /dev/null
        mount $DEV $MNT

        touch $MNT/foo
        chmod 0600 $MNT/foo
   	# On btrfs the file must be NOCOW.
        chattr +C $MNT/foo &> /dev/null
        xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
        mkswap $MNT/foo

        for ((i = 1; i <= $NUM_CLONES; i++)); do
            touch $MNT/foo_clone_$i
            chmod 0600 $MNT/foo_clone_$i
            # On btrfs the file must be NOCOW.
            chattr +C $MNT/foo_clone_$i &> /dev/null
            cp --reflink=always $MNT/foo $MNT/foo_clone_$i
        done

        if [ $sync_after_add_reflinks -ne 0 ]; then
            # Flush delayed refs and commit current transaction.
            sync -f $MNT
        fi

        # Remove the original file and all clones except the last.
        rm -f $MNT/foo
        for ((i = 1; i < $NUM_CLONES; i++)); do
            rm -f $MNT/foo_clone_$i
        done

        if [ $sync_after_remove_reflinks -ne 0 ]; then
            # Flush delayed refs and commit current transaction.
            sync -f $MNT
        fi

        # Now use the last clone as a swap file. It should work since
        # its extent are not shared anymore.
        swapon $MNT/foo_clone_${NUM_CLONES}
        swapoff $MNT/foo_clone_${NUM_CLONES}

        umount $MNT
   }

   echo -e "\nTest without sync after creating and removing clones"
   run_test 0 0

   echo -e "\nTest with sync after creating clones"
   run_test 1 0

   echo -e "\nTest with sync after removing clones"
   run_test 0 1

   echo -e "\nTest with sync after creating and removing clones"
   run_test 1 1

Running the test:

   $ ./test.sh
   Test without sync after creating and removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

   Test with sync after creating clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

   Test with sync after removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764

   Test with sync after creating and removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

Fix this by reworking btrfs_swap_activate() to instead of using extent
maps and checking for shared extents with can_nocow_extent(), iterate
over the inode's file extent items and use the accurate
btrfs_is_data_extent_shared().

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:04:17 +01:00
Filipe Manana
0525064bb8 btrfs: fix race with memory mapped writes when activating swap file
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.

Fix this by acquiring the inode's mmap lock before we flush delalloc.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:03:43 +01:00
Johannes Thumshirn
d29662695e btrfs: fix use-after-free waiting for encoded read endios
Fix a use-after-free in the I/O completion path for encoded reads by
using a completion instead of a wait_queue for synchronizing the
destruction of 'struct btrfs_encoded_read_private'.

Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 21:55:06 +01:00
Linus Torvalds
5a087a6b17 for-6.13-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdYzmoACgkQxWXV+ddt
 WDv5GxAAnCsGctNax89x/VpCDZynRghrkxlzu/4kG/pqxsJyzlgXDFtzHAEewSMs
 MYL+WCZLYpeKB5FpZq98mDJVLGNMG+9wqkx1bH/xy2ajBGZTeQe5pnkXMNlv9U1O
 SX34t8nzOdTCENDnQeRc5I2vTcsQRhgHoVjJkAYdWdhcD9fs6xHKZRe+himlstSn
 46ioKzEKSR3ztEUW4ycPF379g7d4kTR0hkk3pu5Nxe7ER8iq+jNSWXj0mzKg7mpJ
 KxP56VgY0OrsiUcJr2qFZ1hQIp810puaAuM4C1lLgRplECHxtLbP9JvL9Rr7a8Ox
 68tuThyLEpQtR59078jIX3RK6CwVi15rKb/ZkLZkW19TNSAAfM5qrB146hLBUM4T
 16WaiJ0x9lVkH2oYQv8zbNZiqDxPhPUdS/JArNAcQYk9ma+C1hCsxPQ/N5yoWH/C
 OABJddNR83sm4VTXu3Nci1EB8QuEoOuihYO6CdRkJ3PPNDuQiG6gwnoA2zqSihhy
 L5fQaLSWAUsLczarHZrvAi9Y0rfG66QzqGR+A1K/8qMTQ8pSCupd+LfqVa21QpI1
 Awx/wVFzsAm7z9CrnPTRJe+JSlBDQdeXWX7pDhhkXgwbCsMVSf3dbBweCD3o1EiM
 BVI7SfEgImlbatd0QvDp9FcsnEqp90SCi+99U+zZCmQ1SW8CEC0=
 =+DUB
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes. Apart from the one liners and updated bio splitting
  error handling there's a fix for subvolume mount with different flags.
  This was known and fixed for some time but I've delayed it to give it
  more testing.

   - fix unbalanced locking when swapfile activation fails when the
     subvolume gets deleted in the meantime

   - add btrfs error handling after bio_split() calls that got error
     handling recently

   - during unmount, flush delalloc workers at the right time before the
     cleaner thread is shut down

   - fix regression in buffered write folio conversion, explicitly wait
     for writeback as FGP_STABLE flag is currently a no-op on btrfs

   - handle race in subvolume mount with different flags, the conversion
     to the new mount API did not handle the case where multiple
     subvolumes get mounted in parallel, which is a distro use case"

* tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
  btrfs: handle bio_split() errors
  btrfs: properly wait for writeback before buffered write
  btrfs: fix missing snapshot drew unlock when root is dead during swap activation
  btrfs: fix mount failure due to remount races
2024-12-10 18:18:01 -08:00
Filipe Manana
9c803c474c btrfs: fix missing snapshot drew unlock when root is dead during swap activation
When activating a swap file we acquire the root's snapshot drew lock and
then check if the root is dead, failing and returning with -EPERM if it's
dead but without unlocking the root's snapshot lock. Fix this by adding
the missing unlock.

Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-03 20:27:02 +01:00
Linus Torvalds
feffde684a for-6.13-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdPMbMACgkQxWXV+ddt
 WDtFSA/+Kd61BPKaiZFF0yjOsjBOlu8WMHFLO5xa2ZRqV3HUzm6rdO4wUSq3Eyqg
 lrFszPfLA0REZIEdY2rqDKJduk1MZZg6NY7Pvn+/ByGOorM0Ym1BJoydtlUN5o2Y
 AWeaDxs6LBMwRqpai0+AkikufK/jK7QfhHci+Oo2XmOv1C39DQbkXO76SYh/yERt
 CcZNaSjm9DUlLxOOpkefxYvpPW4Uv4NSBfh9aymX/u1VxXqeuMuzPZqwZO+nwl/p
 M1yr9fcbrqh3yKC+JjhD7xmOJM3x4c2PmzSkQTdepOuAlQvuQ/iFD+Zc1YjY0XlT
 Fl938rdTKULgCarR5rdsXqdlFRnOprlgt0J1Pdf+GipTVU0EY3WU343HCz6h8pmG
 F/NPvlCahkEtk1UCguL92NOBeAf0adWhuYfKjkuxYuL5ZTvzsOl2ymF6Vlja0Y39
 VK4exjG4ilDESxweinJe53k3QLDwvUc2h0D291QVoo2X06dhXda8oOVK+6lR5mij
 zDtXqurjlCybT1W7op6RcWMY0TaA8IR3Bo9oU1YbVcctuc86/X8SERv1G8MQtXhh
 tevOVb/cKy7gEw9q73OtrH/J2EAklnsesnpH2sdt7WUEzOQPw9BIwEa/NRbjvoqn
 n2ts5KktBhNjM1s0elHBf7o4+OaTrTlTrp0HelXyONZucPrzDQ4=
 =4mFO
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - add lockdep annotations for io_uring/encoded read integration, inode
   lock is held when returning to userspace

 - properly reflect experimental config option to sysfs

 - handle NULL root in case the rescue mode accepts invalid/damaged tree
   roots (rescue=ibadroot)

 - regression fix of a deadlock between transaction and extent locks

 - fix pending bio accounting bug in encoded read ioctl

 - fix NOWAIT mode when checking references for NOCOW files

 - fix use-after-free in a rb-tree cleanup in ref-verify debugging tool

* tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix lockdep warnings on io_uring encoded reads
  btrfs: ref-verify: fix use-after-free after invalid ref action
  btrfs: add a sanity check for btrfs root in btrfs_search_slot()
  btrfs: don't loop for nowait writes when checking for cross references
  btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y
  btrfs: fix deadlock between transaction commits and extent locks
  btrfs: fix use-after-free in btrfs_encoded_read_endio()
2024-12-03 11:02:17 -08:00
Filipe Manana
7d6872ccbd btrfs: fix deadlock between transaction commits and extent locks
When running a workload with fsstress and duperemove (generic/561) we can
hit a deadlock related to transaction commits and locking extent ranges,
as described below.

Task A hanging during a transaction commit, waiting for all other writers
to complete:

  [178317.334817] INFO: task fsstress:555623 blocked for more than 120 seconds.
  [178317.335693]       Not tainted 6.12.0-rc6-btrfs-next-179+ #1
  [178317.336528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [178317.337673] task:fsstress        state:D stack:0     pid:555623 tgid:555623 ppid:555620 flags:0x00004002
  [178317.337679] Call Trace:
  [178317.337681]  <TASK>
  [178317.337685]  __schedule+0x364/0xbe0
  [178317.337691]  schedule+0x26/0xa0
  [178317.337695]  btrfs_commit_transaction+0x5c5/0x1050 [btrfs]
  [178317.337769]  ? start_transaction+0xc4/0x800 [btrfs]
  [178317.337815]  ? __pfx_autoremove_wake_function+0x10/0x10
  [178317.337819]  btrfs_mksubvol+0x381/0x640 [btrfs]
  [178317.337878]  btrfs_mksnapshot+0x7a/0xb0 [btrfs]
  [178317.337935]  __btrfs_ioctl_snap_create+0x1bb/0x1d0 [btrfs]
  [178317.337995]  btrfs_ioctl_snap_create_v2+0x103/0x130 [btrfs]
  [178317.338053]  btrfs_ioctl+0x29b/0x2a90 [btrfs]
  [178317.338118]  ? kmem_cache_alloc_noprof+0x5f/0x2c0
  [178317.338126]  ? getname_flags+0x45/0x1f0
  [178317.338133]  ? _raw_spin_unlock+0x15/0x30
  [178317.338145]  ? __x64_sys_ioctl+0x88/0xc0
  [178317.338149]  __x64_sys_ioctl+0x88/0xc0
  [178317.338152]  do_syscall_64+0x4a/0x110
  [178317.338160]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [178317.338190] RIP: 0033:0x7f13c28e271b

Which corresponds to line 2361 of transaction.c:

  $ cat -n fs/btrfs/transaction.c
  (...)
  2162  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
  2163  {
  (...)
  2349          spin_lock(&fs_info->trans_lock);
  2350          add_pending_snapshot(trans);
  2351          cur_trans->state = TRANS_STATE_COMMIT_DOING;
  2352          spin_unlock(&fs_info->trans_lock);
  2353
  2354          /*
  2355           * The thread has started/joined the transaction thus it holds the
  2356           * lockdep map as a reader. It has to release it before acquiring the
  2357           * lockdep map as a writer.
  2358           */
  2359          btrfs_lockdep_release(fs_info, btrfs_trans_num_writers);
  2360          btrfs_might_wait_for_event(fs_info, btrfs_trans_num_writers);
  2361          wait_event(cur_trans->writer_wait,
  2362                     atomic_read(&cur_trans->num_writers) == 1);
  (...)

The transaction is in the TRANS_STATE_COMMIT_DOING state and so it's
waiting for all other existing writers to complete and release their
transaction handle.

Task B is running ordered extent completion and blocked waiting to lock an
extent range in an inode's io tree:

  [178317.327411] INFO: task kworker/u48:8:554545 blocked for more than 120 seconds.
  [178317.328630]       Not tainted 6.12.0-rc6-btrfs-next-179+ #1
  [178317.329635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [178317.330872] task:kworker/u48:8   state:D stack:0     pid:554545 tgid:554545 ppid:2      flags:0x00004000
  [178317.330878] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  [178317.330944] Call Trace:
  [178317.330945]  <TASK>
  [178317.330947]  __schedule+0x364/0xbe0
  [178317.330952]  schedule+0x26/0xa0
  [178317.330955]  __lock_extent+0x337/0x3a0 [btrfs]
  [178317.331014]  ? __pfx_autoremove_wake_function+0x10/0x10
  [178317.331017]  btrfs_finish_one_ordered+0x47a/0xaa0 [btrfs]
  [178317.331074]  ? psi_group_change+0x132/0x2d0
  [178317.331078]  btrfs_work_helper+0xbd/0x370 [btrfs]
  [178317.331140]  process_scheduled_works+0xd3/0x460
  [178317.331144]  ? __pfx_worker_thread+0x10/0x10
  [178317.331146]  worker_thread+0x121/0x250
  [178317.331149]  ? __pfx_worker_thread+0x10/0x10
  [178317.331151]  kthread+0xe9/0x120
  [178317.331154]  ? __pfx_kthread+0x10/0x10
  [178317.331157]  ret_from_fork+0x2d/0x50
  [178317.331159]  ? __pfx_kthread+0x10/0x10
  [178317.331162]  ret_from_fork_asm+0x1a/0x30

This extent range locking happens after joining the current transaction,
so task A is waiting for task B to release its transaction handle
(decrementing the transaction's num_writers counter).

Task C while doing a fiemap it tries to join the current transaction:

  [242682.812815] task:pool            state:D stack:0     pid:560767 tgid:560724 ppid:555622 flags:0x00004006
  [242682.812827] Call Trace:
  [242682.812856]  <TASK>
  [242682.812864]  __schedule+0x364/0xbe0
  [242682.812879]  ? _raw_spin_unlock_irqrestore+0x23/0x40
  [242682.812897]  schedule+0x26/0xa0
  [242682.812909]  wait_current_trans+0xd6/0x130 [btrfs]
  [242682.813148]  ? __pfx_autoremove_wake_function+0x10/0x10
  [242682.813162]  start_transaction+0x3d4/0x800 [btrfs]
  [242682.813399]  btrfs_is_data_extent_shared+0xd2/0x440 [btrfs]
  [242682.813723]  fiemap_process_hole+0x2a2/0x300 [btrfs]
  [242682.813995]  extent_fiemap+0x9b8/0xb80 [btrfs]
  [242682.814249]  btrfs_fiemap+0x78/0xc0 [btrfs]
  [242682.814501]  do_vfs_ioctl+0x2db/0xa50
  [242682.814519]  __x64_sys_ioctl+0x6a/0xc0
  [242682.814531]  do_syscall_64+0x4a/0x110
  [242682.814544]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [242682.814556] RIP: 0033:0x7efff595e71b

It tries to join the current transaction, but it can't because the
transaction is in the TRANS_STATE_COMMIT_DOING state, so
join_transaction() returns -EBUSY to start_transaction() and makes it
wait for the current transaction to complete. And while it's waiting
for the transaction to complete, it's holding an extent range locked
in the same inode that task B is operating, which causes a deadlock
between these 3 tasks. The extent range for the inode was locked at
the start of the fiemap operation, early at extent_fiemap().

In short these tasks deadlock because:

1) Task A is waiting for task B to release its transaction handle;

2) Task B is waiting to lock an extent range for an inode while holding a
   transaction handle open;

3) Task C is waiting for the current transaction to complete (for task A
   to finish the transaction commit) while holding the extent range for
   the inode locked, so task B can't progress and release its transaction
   handle.

This results in an ABBA deadlock involving transaction commits and extent
locks. Extent locks are higher level locks, like inode VFS locks, and
should always be acquired before joining or starting a transaction, but
recently commit 2206265f41 ("btrfs: remove code duplication in ordered
extent finishing") accidentally changed btrfs_finish_one_ordered() to do
the transaction join before locking the extent range.

Fix this by making sure that btrfs_finish_one_ordered() always locks the
extent before joining a transaction and add an explicit comment about the
need for this order.

Fixes: 2206265f41 ("btrfs: remove code duplication in ordered extent finishing")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-28 20:46:40 +01:00
Johannes Thumshirn
05b36b04d7 btrfs: fix use-after-free in btrfs_encoded_read_endio()
Shinichiro reported the following use-after free that sometimes is
happening in our CI system when running fstests' btrfs/284 on a TCMU
runner device:

  BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
  Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219

  CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
  Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
  Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6e/0xa0
   ? lock_release+0x708/0x780
   print_report+0x174/0x505
   ? lock_release+0x708/0x780
   ? __virt_addr_valid+0x224/0x410
   ? lock_release+0x708/0x780
   kasan_report+0xda/0x1b0
   ? lock_release+0x708/0x780
   ? __wake_up+0x44/0x60
   lock_release+0x708/0x780
   ? __pfx_lock_release+0x10/0x10
   ? __pfx_do_raw_spin_lock+0x10/0x10
   ? lock_is_held_type+0x9a/0x110
   _raw_spin_unlock_irqrestore+0x1f/0x60
   __wake_up+0x44/0x60
   btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
   btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
   ? lock_release+0x1b0/0x780
   ? trace_lock_acquire+0x12f/0x1a0
   ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
   ? process_one_work+0x7e3/0x1460
   ? lock_acquire+0x31/0xc0
   ? process_one_work+0x7e3/0x1460
   process_one_work+0x85c/0x1460
   ? __pfx_process_one_work+0x10/0x10
   ? assign_work+0x16c/0x240
   worker_thread+0x5e6/0xfc0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x2c3/0x3a0
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x31/0x70
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

  Allocated by task 3661:
   kasan_save_stack+0x30/0x50
   kasan_save_track+0x14/0x30
   __kasan_kmalloc+0xaa/0xb0
   btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
   send_extent_data+0xf0f/0x24a0 [btrfs]
   process_extent+0x48a/0x1830 [btrfs]
   changed_cb+0x178b/0x2ea0 [btrfs]
   btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
   _btrfs_ioctl_send+0x117/0x330 [btrfs]
   btrfs_ioctl+0x184a/0x60a0 [btrfs]
   __x64_sys_ioctl+0x12e/0x1a0
   do_syscall_64+0x95/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

  Freed by task 3661:
   kasan_save_stack+0x30/0x50
   kasan_save_track+0x14/0x30
   kasan_save_free_info+0x3b/0x70
   __kasan_slab_free+0x4f/0x70
   kfree+0x143/0x490
   btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
   send_extent_data+0xf0f/0x24a0 [btrfs]
   process_extent+0x48a/0x1830 [btrfs]
   changed_cb+0x178b/0x2ea0 [btrfs]
   btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
   _btrfs_ioctl_send+0x117/0x330 [btrfs]
   btrfs_ioctl+0x184a/0x60a0 [btrfs]
   __x64_sys_ioctl+0x12e/0x1a0
   do_syscall_64+0x95/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

  The buggy address belongs to the object at ffff888106a83f00
   which belongs to the cache kmalloc-rnd-07-96 of size 96
  The buggy address is located 24 bytes inside of
   freed 96-byte region [ffff888106a83f00, ffff888106a83f60)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
  flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
  page_type: f5(slab)
  raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
  raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
   ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
  >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                              ^
   ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
   ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ==================================================================

Further analyzing the trace and the crash dump's vmcore file shows that
the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
the wait_queue that is in the private data passed to the end_io handler.

Commit 4ff47df40447 ("btrfs: move priv off stack in
btrfs_encoded_read_regular_fill_pages()") moved 'struct
btrfs_encoded_read_private' off the stack.

Before that commit one can see a corruption of the private data when
analyzing the vmcore after a crash:

*(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
	.wait = (wait_queue_head_t){
		.lock = (spinlock_t){
			.rlock = (struct raw_spinlock){
				.raw_lock = (arch_spinlock_t){
					.val = (atomic_t){
						.counter = (int)-2005885696,
					},
					.locked = (u8)0,
					.pending = (u8)157,
					.locked_pending = (u16)40192,
					.tail = (u16)34928,
				},
				.magic = (unsigned int)536325682,
				.owner_cpu = (unsigned int)29,
				.owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
				.dep_map = (struct lockdep_map){
					.key = (struct lock_class_key *)0xffff8881575a3b6c,
					.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
					.name = (const char *)0xffff88815626f100 = "",
					.wait_type_outer = (u8)37,
					.wait_type_inner = (u8)178,
					.lock_type = (u8)154,
				},
			},
			.__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
			.dep_map = (struct lockdep_map){
				.key = (struct lock_class_key *)0xffff8881575a3b6c,
				.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
				.name = (const char *)0xffff88815626f100 = "",
				.wait_type_outer = (u8)37,
				.wait_type_inner = (u8)178,
				.lock_type = (u8)154,
			},
		},
		.head = (struct list_head){
			.next = (struct list_head *)0x112cca,
			.prev = (struct list_head *)0x47,
		},
	},
	.pending = (atomic_t){
		.counter = (int)-1491499288,
	},
	.status = (blk_status_t)130,
}

Here we can see several indicators of in-memory data corruption, e.g. the
large negative atomic values of ->pending or
->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic
0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus
pointer values for ->wait->head.

To fix this, change atomic_dec_return() to atomic_dec_and_test() to fix the
corruption, as atomic_dec_return() is defined as two instructions on
x86_64, whereas atomic_dec_and_test() is defined as a single atomic
operation. This can lead to a situation where counter value is already
decremented but the if statement in btrfs_encoded_read_endio() is not
completely processed, i.e. the 0 test has not completed. If another thread
continues executing btrfs_encoded_read_regular_fill_pages() the
atomic_dec_return() there can see an already updated ->pending counter and
continues by freeing the private data. Continuing in the endio handler the
test for 0 succeeds and the wait_queue is woken up, resulting in a
use-after-free.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-28 20:45:43 +01:00
Linus Torvalds
c14a8a4c04 for-6.13-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc0zT4ACgkQxWXV+ddt
 WDtThRAAhzSSiHcJqTfCL5nHh7w85MNEVw28o1ETgXSYJmx0JOWLE7Znlp2FV7jj
 IbYkFfF2gXJzYvRZkcXB/TAHV9KJG5yZIBZfccbM+9db9f8xkImVKMuqQRXPU41R
 ppSCmqZTeujtt8ucsaJkMpm6pzECKJCJaGOsMJ8fiqKpo89dKO3eGAVboSbpPF4C
 r0YmppiBwSP/cCXQCqWxZRbqPGN+lUgZpIGNRi157kehfmRHlVVJTO1pgqK8PCXb
 uIT09Kulppfez8+1A10CPcniDTyinLik/qLTNlzdWoDBL4iNJMg0A0wsA04AJVf0
 PdOS0REusiv3QcEIO6PefuRFRRfXcSLPpPDUceltJT5O0uM2gUqf2C7dEHXUGU3o
 TdgYlbQpsJWpZ7VGWQDZeGGV04lOPQvu0LGLPgEerUQd5H9ABa0dX8Fn0sPhKsa8
 whpAcdfE4rdNxB2OJFnqQeFq0z3cSjP/rvKlluCmAj97QYI+kiu3QyhemcT1YSC9
 U7n5Ya9IzIYCN3ml54q3hEgyD0IVGGG20GuUmqC9XSP9mrQRC8I1g7v26AiOTrrk
 VhgSdtMmphDxXudifsnYMaQ0Z1QqiUrW1SM/prAEOnBYCo75+HDsTgrq9ithgHoI
 4xz4YXJyMRs18qfTJctXC1wmGuz5plTdQrwarHdNsELN5HEyqX4=
 =aAcf
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Changes outside of btrfs: add io_uring command flag to track a dying
  task (the rest will go via the block git tree).

  User visible changes:

   - wire encoded read (ioctl) to io_uring commands, this can be used on
     itself, in the future this will allow 'send' to be asynchronous. As
     a consequence, the encoded read ioctl can also work in non-blocking
     mode

   - new ioctl to wait for cleaned subvolumes, no need to use the
     generic and root-only SEARCH_TREE ioctl, will be used by "btrfs
     subvol sync"

   - recognize different paths/symlinks for the same devices and don't
     report them during rescanning, this can be observed with LVM or DM

   - seeding device use case change, the sprout device (the one
     capturing new writes) will not clear the read-only status of the
     super block; this prevents accumulating space from deleted
     snapshots

  Performance improvements:

   - reduce lock contention when traversing extent buffers

   - reduce extent tree lock contention when searching for inline
     backref

   - switch from rb-trees to xarray for delayed ref tracking,
     improvements due to better cache locality, branching factors and
     more compact data structures

   - enable extent map shrinker again (prevent memory exhaustion under
     some types of IO load), reworked to run in a single worker thread
     (there used to be problems causing long stalls under memory
     pressure)

  Core changes:

   - raid-stripe-tree feature updates:
       - make device replace and scrub work
       - implement partial deletion of stripe extents
       - new selftests

   - split the config option BTRFS_DEBUG and add EXPERIMENTAL for
     features that are experimental or with known problems so we don't
     misuse debugging config for that

   - subpage mode updates (sector < page):
       - update compression implementations
       - update writepage, writeback

   - continued folio API conversions:
       - buffered writes

   - make buffered write copy one page at a time, preparatory work for
     future integration with large folios, may cause performance drop

   - proper locking of root item regarding starting send

   - error handling improvements

   - code cleanups and refactoring:
       - dead code removal
       - unused parameter reduction
       - lockdep assertions"

* tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits)
  btrfs: send: check for read-only send root under critical section
  btrfs: send: check for dead send root under critical section
  btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()
  btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl()
  btrfs: fix a typo in btrfs_use_zone_append
  btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()
  btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot()
  btrfs: remove hole from struct btrfs_delayed_node
  btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list
  btrfs: add new ioctl to wait for cleaned subvolumes
  btrfs: simplify range tracking in cow_file_range()
  btrfs: remove conditional path allocation in btrfs_read_locked_inode()
  btrfs: push cleanup into btrfs_read_locked_inode()
  io_uring/cmd: let cmds to know about dying task
  btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu()
  btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
  btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()
  btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set
  btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
  btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read()
  ...
2024-11-18 16:37:41 -08:00
Linus Torvalds
56be9aaf98 vfs-6.13.pagecache
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcUQAAKCRCRxhvAZXjc
 onEpAQCUdwIBHpwmSIFvJFA9aNGpbLzi0dDSEIxuWYtp5qVuogD+ImccwqpG3kEi
 Zq9vokdPpB1zbahxKl1mkvBG4G0GFQE=
 =LbP6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagecache updates from Christian Brauner:
 "Cleanup filesystem page flag usage: This continues the work to make
  the mappedtodisk/owner_2 flag available to filesystems which don't use
  buffer heads. Further patches remove uses of Private2. This brings us
  very close to being rid of it entirely"

* tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  migrate: Remove references to Private2
  ceph: Remove call to PagePrivate2()
  btrfs: Switch from using the private_2 flag to owner_2
  mm: Remove PageMappedToDisk
  nilfs2: Convert nilfs_copy_buffer() to use folios
  fs: Move clearing of mappedtodisk to buffer.c
2024-11-18 09:54:32 -08:00
Linus Torvalds
70e7730c2a vfs-6.13.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc
 osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli
 pqlG+38UdATImpfqqWjPbb72sBYcfQg=
 =wLUh
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Fixup and improve NLM and kNFSD file lock callbacks

     Last year both GFS2 and OCFS2 had some work done to make their
     locking more robust when exported over NFS. Unfortunately, part of
     that work caused both NLM (for NFS v3 exports) and kNFSD (for
     NFSv4.1+ exports) to no longer send lock notifications to clients

     This in itself is not a huge problem because most NFS clients will
     still poll the server in order to acquire a conflicted lock

     It's important for NLM and kNFSD that they do not block their
     kernel threads inside filesystem's file_lock implementations
     because that can produce deadlocks. We used to make sure of this by
     only trusting that posix_lock_file() can correctly handle blocking
     lock calls asynchronously, so the lock managers would only setup
     their file_lock requests for async callbacks if the filesystem did
     not define its own lock() file operation

     However, when GFS2 and OCFS2 grew the capability to correctly
     handle blocking lock requests asynchronously, they started
     signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
     for also trusting posix_lock_file() was inadvertently dropped, so
     now most filesystems no longer produce lock notifications when
     exported over NFS

     Fix this by using an fop_flag which greatly simplifies the problem
     and grooms the way for future uses by both filesystems and lock
     managers alike

   - Add a sysctl to delete the dentry when a file is removed instead of
     making it a negative dentry

     Commit 681ce86235 ("vfs: Delete the associated dentry when
     deleting a file") introduced an unconditional deletion of the
     associated dentry when a file is removed. However, this led to
     performance regressions in specific benchmarks, such as
     ilebench.sum_operations/s, prompting a revert in commit
     4a4be1ad3a ("Revert "vfs: Delete the associated dentry when
     deleting a file""). This reintroduces the concept conditionally
     through a sysctl

   - Expand the statmount() system call:

       * Report the filesystem subtype in a new fs_subtype field to
         e.g., report fuse filesystem subtypes

       * Report the superblock source in a new sb_source field

       * Add a new way to return filesystem specific mount options in an
         option array that returns filesystem specific mount options
         separated by zero bytes and unescaped. This allows caller's to
         retrieve filesystem specific mount options and immediately pass
         them to e.g., fsconfig() without having to unescape or split
         them

       * Report security (LSM) specific mount options in a separate
         security option array. We don't lump them together with
         filesystem specific mount options as security mount options are
         generic and most users aren't interested in them

         The format is the same as for the filesystem specific mount
         option array

   - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command

   - Optimize acl_permission_check() to avoid costly {g,u}id ownership
     checks if possible

   - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict()

   - Add synchronous wakeup support for ep_poll_callback.

     Currently, epoll only uses wake_up() to wake up task. But sometimes
     there are epoll users which want to use the synchronous wakeup flag
     to give a hint to the scheduler, e.g., the Android binder driver.
     So add a wake_up_sync() define, and use wake_up_sync() when sync is
     true in ep_poll_callback()

  Fixes:

   - Fix kernel documentation for inode_insert5() and iget5_locked()

   - Annotate racy epoll check on file->f_ep

   - Make F_DUPFD_QUERY associative

   - Avoid filename buffer overrun in initramfs

   - Don't let statmount() return empty strings

   - Add a cond_resched() to dump_user_range() to avoid hogging the CPU

   - Don't query the device logical blocksize multiple times for hfsplus

   - Make filemap_read() check that the offset is positive or zero

  Cleanups:

   - Various typo fixes

   - Cleanup wbc_attach_fdatawrite_inode()

   - Add __releases annotation to wbc_attach_and_unlock_inode()

   - Add hugetlbfs tracepoints

   - Fix various vfs kernel doc parameters

   - Remove obsolete TODO comment from io_cancel()

   - Convert wbc_account_cgroup_owner() to take a folio

   - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add()

   - Reorder struct posix_acl to save 8 bytes

   - Annotate struct posix_acl with __counted_by()

   - Replace one-element array with flexible array member in freevxfs

   - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()"

* tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  statmount: retrieve security mount options
  vfs: make evict() use smp_mb__after_spinlock instead of smp_mb
  statmount: add flag to retrieve unescaped options
  fs: add the ability for statmount() to report the sb_source
  writeback: wbc_attach_fdatawrite_inode out of line
  writeback: add a __releases annoation to wbc_attach_and_unlock_inode
  fs: add the ability for statmount() to report the fs_subtype
  fs: don't let statmount return empty strings
  fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel()
  hfsplus: don't query the device logical block size multiple times
  freevxfs: Replace one-element array with flexible array member
  fs: optimize acl_permission_check()
  initramfs: avoid filename buffer overrun
  fs/writeback: convert wbc_account_cgroup_owner to take a folio
  acl: Annotate struct posix_acl with __counted_by()
  acl: Realign struct posix_acl to save 8 bytes
  epoll: Add synchronous wakeup support for ep_poll_callback
  coredump: add cond_resched() to dump_user_range
  mm/page-writeback.c: Fix comment of wb_domain_writeout_add()
  mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL
  ...
2024-11-18 09:35:30 -08:00
Mark Harmstone
08fdca9eee btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()
Change the control flow of btrfs_encoded_read() so that it doesn't call
free_extent_map() when we know that this has already been done.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Suggested-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:22 +01:00
Haisu Wang
5599f39356 btrfs: simplify range tracking in cow_file_range()
Simplify tracking of the range processed by using cur_alloc_size only to
store the reserved part that may fail to the allocated extent. Remove
the ram_size as well since it is always equal to cur_alloc_size in the
context. Advance the start in normal path until extent allocation
succeeds and keep the start unchanged in the error handling path.

Passed the fstest generic/475 test for a hundred times with quota
enabled. And a modified generic/475 test by removing the sleep time
for a hundred times. About one tenth of the tests do enter the error
handling path due to fail to reserve extent.

Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:22 +01:00
Leo Martins
7c855e16ab btrfs: remove conditional path allocation in btrfs_read_locked_inode()
Remove conditional path allocation from btrfs_read_locked_inode(). Add
an ASSERT(path) to indicate it should never be called with a NULL path.

Call btrfs_read_locked_inode() directly from btrfs_iget(). This causes
code duplication between btrfs_iget() and btrfs_iget_path(), but I
think this is justifiable as it removes the need for conditionally
allocating the path inside of btrfs_read_locked_inode(). This makes the
code easier to reason about and makes it clear who has the
responsibility of allocating and freeing the path.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:22 +01:00
Leo Martins
69673992b1 btrfs: push cleanup into btrfs_read_locked_inode()
Move btrfs_add_inode_to_root() so it can be called from
btrfs_read_locked_inode(), no changes were made to the function.

Move cleanup code from btrfs_iget_path() to btrfs_read_locked_inode.
This improves readability and improves a leaky abstraction. Previously
btrfs_iget_path() had to handle a positive error case as a result of a
call to btrfs_search_slot(), but it makes more sense to handle this
closer to the source of the call.

Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:21 +01:00
Mark Harmstone
34310c442e btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
Add an io_uring command for encoded reads, using the same interface as
the existing BTRFS_IOC_ENCODED_READ ioctl.

btrfs_uring_encoded_read() is an io_uring version of
btrfs_ioctl_encoded_read(), which validates the user input and calls
btrfs_encoded_read() to read the appropriate metadata. If we determine
that we need to read an extent from disk, we call
btrfs_encoded_read_regular_fill_pages() through
btrfs_uring_read_extent() to prepare the bio.

The existing btrfs_encoded_read_regular_fill_pages() is changed so that
if it is passed a valid uring_ctx, rather than waking up any waiting
threads it calls btrfs_uring_read_extent_endio(). This in turn copies
the read data back to userspace, and calls io_uring_cmd_done() to
complete the io_uring command.

Because we're potentially doing a non-blocking read,
btrfs_uring_read_extent() doesn't clean up after itself if it returns
-EIOCBQUEUED. Instead, it allocates a priv struct, populates the fields
there that we will need to unlock the inode and free our allocations,
and defers this to the btrfs_uring_read_finished() that gets called when
the bio completes.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:21 +01:00
Mark Harmstone
68d3b27e05 btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()
Change btrfs_encoded_read_regular_fill_pages() so that the priv struct
is allocated rather than stored on the stack, in preparation for adding
an asynchronous mode to the function.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:21 +01:00
Mark Harmstone
973a432637 btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set
Change btrfs_encoded_read() so that it returns -EAGAIN rather than sleeps
if IOCB_NOWAIT is set in iocb->ki_flags. The conditions that require
sleeping are: inode lock, writeback, extent lock, ordered range.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:21 +01:00
Mark Harmstone
26efd44796 btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
Change the behaviour of btrfs_encoded_read() so that if it needs to read
an extent from disk, it leaves the extent and inode locked and returns
-EIOCBQUEUED. The caller is then responsible for doing the I/O via
btrfs_encoded_read_regular() and unlocking the extent and inode.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:21 +01:00
Mark Harmstone
4bca7412b8 btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read()
iocb->ki_pos isn't used after this function, so there's no point in
changing its value.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:21 +01:00
Anand Jain
d07eaa9995 btrfs: use filemap_get_folio() helper
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B)
instead of __filemap_get_folio(A, B, 0, 0)—no need for the extra
arguments 0, 0.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:19 +01:00
Mark Harmstone
b1c5f6eda2 btrfs: fix wrong sizeof in btrfs_do_encoded_write()
btrfs_do_encoded_write() was converted to use folios in 400b172b8c,
but we're still allocating based on sizeof(struct page *) rather than
sizeof(struct folio *). There's no functional change.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:18 +01:00
David Sterba
590168edbe btrfs: drop unused parameter file_offset from btrfs_encoded_read_regular_fill_pages()
The file_offset parameter used to be passed to encoded read struct but
was removed in commit b665affe93 ("btrfs: remove unused members from
struct btrfs_encoded_read_private").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:16 +01:00
David Sterba
e469da5d84 btrfs: drop unused parameter offset from __cow_file_range_inline()
We don't need offset for inline extents, they always start from 0.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:16 +01:00
David Sterba
372e5f88af btrfs: drop unused parameter inode from read_inline_extent()
We don't need the inode pointer to read inline extent, it's all
accessible from the path pointer.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:16 +01:00
Qu Wenruo
0fcaf926ad btrfs: remove btrfs_set_range_writeback()
The function btrfs_set_range_writeback() was originally a callback for
metadata and data, to mark a range with writeback flag.

Then it was converted into a common function call for both metadata and
data.

From the very beginning, the function had been only called on a full page,
later converted to handle range inside a page.

But it never needed to handle multiple pages, and since commit
8189197425 ("btrfs: refactor __extent_writepage_io() to do
sector-by-sector submission") the function was only called on a
sector-by-sector basis.

This makes the function unnecessary, and can be converted to a simple
btrfs_folio_set_writeback() call instead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:15 +01:00
Johannes Thumshirn
61b4d75e3c btrfs: handle empty list of NOCOW ordered extents with checksum list
Currently we BUG_ON() in btrfs_finish_one_ordered() if we are finishing
an ordered extent that is flagged as NOCOW, but it's checksum list is
not empty.

This is clearly a logic error which we can recover from by aborting the
transaction.

For developer builds which enable CONFIG_BTRFS_ASSERT, also ASSERT()
that the list is empty.

Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:14 +01:00
Shen Lichuan
2144e1f23f btrfs: correct typos in multiple comments across various files
Fix some confusing spelling errors that were currently identified,
the details are as follows:

	block-group.c: 2800: 	uncompressible 	==> incompressible
	extent-tree.c: 3131:	EXTEMT		==> EXTENT
	extent_io.c: 3124: 	utlizing 	==> utilizing
	extent_map.c: 1323: 	ealier		==> earlier
	extent_map.c: 1325:	possiblity	==> possibility
	fiemap.c: 189:		emmitted	==> emitted
	fiemap.c: 197:		emmitted	==> emitted
	fiemap.c: 203:		emmitted	==> emitted
	transaction.h: 36:	trasaction	==> transaction
	volumes.c: 5312:	filesysmte	==> filesystem
	zoned.c: 1977:		trasnsaction	==> transaction

Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:14 +01:00
Johannes Thumshirn
2206265f41 btrfs: remove code duplication in ordered extent finishing
Remove the duplicated transaction joining, block reserve setting and raid
extent inserting in btrfs_finish_ordered_extent().

While at it, also abort the transaction in case inserting a RAID
stripe-tree entry fails.

Suggested-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:13 +01:00
Qu Wenruo
1d2fbb7f1f btrfs: allow compression even if the range is not page aligned
Previously for btrfs with sector size smaller than page size (subpage),
we only allow compression if the range is fully page aligned.

This is to work around the asynchronous submission of compressed range,
which delayed the page unlock and writeback into a workqueue,
furthermore asynchronous submission can lock multiple sector range
across page boundary.

Such asynchronous submission makes it very hard to co-operate with other
regular writes.

With the recent changes to the subpage folio unlock path, now
asynchronous submission of compressed pages can co-operate with regular
submission, so enable sector perfect compression if it's an experimental
build.

The ETA for moving this feature out of experimental is 6.15, and I hope
all remaining corner cases can be exposed before that.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:13 +01:00
Qu Wenruo
a4ef54dbb5 btrfs: make extent_range_clear_dirty_for_io() to handle sector size < page size cases
For btrfs with sector size < page size (e.g. 4K sector size, 64K page
size), and enable the sector perfect compression support, then the
following dirty range can lead to problems:

   0     32K     64K     96K    128K
   |     |///////||//////|    |/|
                              124K

In above case, if we start writeback for that inode, the last dirty
range [124K, 128K) will not be submitted and cause reserved space
leakage:

- Start writeback for page 0
  We find the range [32K, 96K) is suitable for compression, and queue it
  into a workqueue to do the delayed compression and submission.

- Compression happens for range [32K, 96K)
  Function extent_range_clear_dirty_for_io() is called, however it is
  only doing full page handling, not considering any the extra bitmaps
  for subpage cases.

  That function will clear page dirty for both page 0 and page 64K.

- Writeback for the inode is done
  Because page 64K has its dirty flag cleared, it will not be considered
  as a writeback target.

This means the range [124K, 128K) will not be submitted, and reserved
space for it will be leaked.

Fix this problem by using the subpage helper to clear the dirty flag.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Linus Torvalds
9183e033ec for-6.12-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmct+40ACgkQxWXV+ddt
 WDvCtRAAp0rheEu14hpVvWE2//+6u9Gx7Wfjzbj0+o4zBRWdg7BigFxfeb6JsH/E
 2TjuWdcoP/OMV9ghCBQAQxySAPtsxH7skkyNy2UcMk5byBIrNvhw9auP5GXXlrhK
 jSKDD4yfOMb++8LhrLevgTrijNyjLqaKXruw9a1Pmc3gxpdNmnMEySsQaF62o2Sm
 YC3jwi0KpNAhu2qyJ6TnPgd5zf3BTM0JAeuB019IZW4WoeRTOdcPe7S7gqqJwZ+e
 lL0D2/lfIE1lKvLE266Fab4FAQiJV07rozYj25XHiDpqThCxnJVOZCEHasOQ1PRy
 d6j3RmGPqJYAYfQL1L+FH2hsS1BVZfVyCV1V7A/cN+lAffBfnROnf13C3gJ15Nbx
 3lTyjBPQQw2WpfdmeyF3ikbrjZ8AfahChQO+mMnLN7oAWdIwWX5MRB+cwfWTxzA/
 P8upz6HSTpSwy8nXdq264q1KkyCjx0Wv+8iyU7LirN2fCcEchA12HAIaOBeHedgh
 PrGZDqrkZccQQxAvU5H7hQv0hZkGK8qba381oYHO09g72VM6ysuBU7tGrPZrlZYB
 CvYTCwNZ/lqI8ikrcHOyUO1SPR9SaaWej1mWgBJ69ZIfg+ZuMtOMl171DU4S/i2V
 iYgYoN8eCqTQWdaX5kk+3LWmK8fSU7F/KSDtJtT1KxkaSwCacfY=
 =TQzP
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more one-liners that fix some user visible problems:

   - use correct range when clearing qgroup reservations after COW

   - properly reset freed delayed ref list head

   - fix ro/rw subvolume mounts to be backward compatible with old and
     new mount API"

* tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix the length of reserved qgroup to free
  btrfs: reinitialize delayed ref list after deleting it from the list
  btrfs: fix per-subvolume RO/RW flags with new mount API
2024-11-08 07:31:03 -10:00
Haisu Wang
2b084d8205 btrfs: fix the length of reserved qgroup to free
The dealloc flag may be cleared and the extent won't reach the disk in
cow_file_range when errors path. The reserved qgroup space is freed in
commit 30479f31d4 ("btrfs: fix qgroup reserve leaks in
cow_file_range"). However, the length of untouched region to free needs
to be adjusted with the correct remaining region size.

Fixes: 30479f31d4 ("btrfs: fix qgroup reserve leaks in cow_file_range")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07 02:08:29 +01:00
Pankaj Raghav
30dac24e14
fs/writeback: convert wbc_account_cgroup_owner to take a folio
Most of the callers of wbc_account_cgroup_owner() are converting a folio
to page before calling the function. wbc_account_cgroup_owner() is
converting the page back to a folio to call mem_cgroup_css_from_folio().

Convert wbc_account_cgroup_owner() to take a folio instead of a page,
and convert all callers to pass a folio directly except f2fs.

Convert the page to folio for all the callers from f2fs as they were the
only callers calling wbc_account_cgroup_owner() with a page. As f2fs is
already in the process of converting to folios, these call sites might
also soon be calling wbc_account_cgroup_owner() with a folio directly in
the future.

No functional changes. Only compile tested.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:26:54 +01:00
Linus Torvalds
4e46774408 for-6.12-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcZGqsACgkQxWXV+ddt
 WDsZTg//dSLAAswT3dTAvWDt7rGxPhJC1V+gnzWdj/0Q3CUimNaw7zHp1QHtJDFT
 S+3TVvJJ2JDvOnbi3u24s/bL5YdkWcvIyy87oVE4trJMbQPc/E45pkYFqF3TRQAH
 wEYAVEnO9f9WUY+ekxX2XBzhmKp3xol93j9BBHDXOF9kHsDC5lI0D5YVYEhRY8Qu
 D4GcqAhqUj5Hj6I6ppiqO47NCJBNzw1Se9QsgruPpmItRbB0/LYJFhUfevwosTKg
 xVpkRVntFqQjkIdIdBfBv/ZGWfxJyM7K4M49QwLUqUQfxugu7BiGYuEjkBkiy07a
 pZDEOF9s8wUZsxvRVohqKlhL0zHBF+/pAANowYuhKNW1sqKt4GVCdN3V34AbH8ST
 JIbPvC2g1tUzIc2wckE8GO/NnsNR4r3k6iPB53MdCHrIo3jnENOeb9wF0GiVDb6s
 OrhCa3ph2ps80YC1aCnc4Jr/yV2ONebSivnqvHCIUQEZfpjtAc07atm0i946/7nx
 eiBE+9zSVJZB0LoooIVz2I3HX3lRnm3Wwi4nj8U/sNL/IbGaHNCurIETR23NivYP
 yWVql2njwE+yMc8q9YZXs4MBdKsSGP6eGJW3ZwKi6ru2PJdf5eIib+ffvR4bLqXD
 UUDfMyC1esGsQB24sc8wppk97wmmMrdqQUj9WnmNlTP8FUFggvQ=
 =sYxX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - mount option fixes:
     - fix handling of compression mount options on remount
     - reject rw remount in case there are options that don't work
       in read-write mode (like rescue options)

 - fix zone accounting of unusable space

 - fix in-memory corruption when merging extent maps

 - fix delalloc range locking for sector < page

 - use more convenient default value of drop subtree threshold, clean
   more subvolumes without the fallback to marking quotas inconsistent

 - fix smatch warning about incorrect value passed to ERR_PTR

* tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
  btrfs: reject ro->rw reconfiguration if there are hard ro requirements
  btrfs: fix read corruption due to race with extent map merging
  btrfs: fix the delalloc range locking if sector size < page size
  btrfs: qgroup: set a more sane default value for subtree drop threshold
  btrfs: clear force-compress on remount when compress mount option is given
  btrfs: zoned: fix zone unusable accounting for freed reserved extent
2024-10-24 13:04:15 -07:00
Yue Haibing
75f49c3dc7 btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
The ret may be zero in btrfs_search_dir_index_item() and should not
passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to
this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0.

This fixes smatch warnings:

fs/btrfs/dir-item.c:353
  btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR'

Fixes: 9dcbe16fcc ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:10:55 +02:00