The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.
Then callers check @nr_ret value to determine if we need to manually
clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() called thus
PAGECACHE_TAG_DIRTY tag will not be cleared.
Remove this parameter by:
- Moving the btrfs_folio_clear_writeback() call into
__extent_writepage_io()
So that if we didn't submit any IO, then manually call
btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
the page is no longer dirty.
- Use a bool to record if we have submitted any sector
Instead of an int.
- Use subpage compatible helpers to end folio writeback.
This brings no change to the behavior, just for the sake of consistency.
As for the call site inside __extent_writepage(), we're always called
for the whole page, so the existing full page helper
folio_(start|end)_writeback() is totally fine.
For the call site inside extent_write_locked_range(), although we can
have subpage range, folio_start_writeback() will only clear
PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
will still be dirty if there is any subpage dirty range.
Only when the last dirty subpage sector is cleared, the
folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.
So no matter if we call the full page or subpage helper, the result
is still the same, then just use the subpage helpers for consistency.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a folio some in this function, replace all page usage
with the folio and update the function to take the folio as an argument.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_get_extent takes a folio, update __get_extent_map to
take a folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this into read_inline_extent, change it to take a folio and
update the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of a page, use a folio for btrfs_writepage_cow_fixup. We
already have a folio at the only caller, and the fixup worker uses
folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that every function that btrfs_run_delalloc_range calls takes a
folio, update it to take a folio and update the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This mostly uses folios, convert it to take a folio instead and update
the callers to pass in the folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of taking the locked page, take the locked folio so we can pass
that into __process_folios_contig.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that this mostly uses folios, update it to take folios, use the
folios that are passed in, and rename from process_one_page =>
process_one_folio.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This operates mostly on folios, update it to take a folio for the locked
folio instead of the page, rename from __process_pages_contig =>
__process_folios_contig.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of the callers have a folio at this point, update
__unlock_for_delalloc to take a folio so that it's consistent with its
callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also rename lock_delalloc_pages => lock_delalloc_folios in the process,
now that it exclusively works on folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing in a page for locked_page, pass in the folio instead.
We only use the folio itself to validate some range assumptions, and
then pass it into other functions.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We already use a folio heavily in this function, pass the folio in
directly and use it everywhere, only passing the page down to functions
that do not take a folio yet.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only need a folio now, make it take a folio as an argument and update
all of the callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callers and callee's of this now all use folios, update it to take a
folio as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've gotten most of the helpers updated to only take a folio,
update __extent_writepage to only deal in folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using pages for everything, find a folio and use that. This
makes things a bit cleaner as a lot of the functions calls here all take
folios.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage_io uses page everywhere, but a lot of these functions
take a folio. Convert it to use the folio based helpers, and then
change it to take a folio as an argument and update its callers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Willy is wanting to get rid of page->index, convert the writepage
tracepoint to take a folio so we can do folio->index instead of
page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the callers and helpers mostly use folio, convert
btrfs_do_readpage to take a folio, and rename it to btrfs_do_read_folio.
Update all of the page stuff to use the folio based helpers instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callers of this helper are going to be converted to using a folio,
so adjust submit_extent_page to become submit_extent_folio and update it
to use all the relevant folio helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This already uses a folio internally, change it to take a folio as an
argument instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this helper function to set the page range uptodate once we're
done reading it, as well as run fsverity against it. Half of these
functions already take a folio, just rename this to end_folio_read and
then rework it to take a folio instead, and update everything
accordingly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're using the page for everything here. Convert this to use
the folio helpers instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're the only user of readahead_page_batch(). Convert
btrfs_readahead() to use the folio based helpers to do readahead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is only used inside inode.c by compress_file_range(),
so move it to inode.c and unexport it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is only one caller utilizing the @extra_gfp parameter,
alloc_eb_folio_array(). And in that case the extra_gfp is only assigned
to __GFP_NOFAIL.
Rename the @extra_gfp parameter to @nofail to indicate that.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_alloc_folio_array() is only utilized in
btrfs_submit_compressed_read() and no other location, and the only
caller is not utilizing the @extra_gfp parameter.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to is_data_inode() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
We can add const to many parameters, this is for clarity and minor
addition to safety. There are some minor effects, in the assembly
code and .ko measured on release config. This patch does not cover all
possible conversions.
Signed-off-by: David Sterba <dsterba@suse.com>
When extent_write_locked_range() generated an inline extent, it would
set and finish the writeback for the whole page.
Although currently it's safe since subpage disables inline creation,
for the sake of consistency, let it go with subpage helpers to set and
clear the writeback flags.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For subpage + zoned case, the following workload can lead to rsv data
leak at unmount time:
# mkfs.btrfs -f -s 4k $dev
# mount $dev $mnt
# fsstress -w -n 8 -d $mnt -s 1709539240
0/0: fiemap - no filename
0/1: copyrange read - no filename
0/2: write - no filename
0/3: rename - no source filename
0/4: creat f0 x:0 0 0
0/4: creat add id=0,parent=-1
0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
# umount $mnt
The dmesg includes the following rsv leak detection warning (all call
trace skipped):
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info DATA has 268218368 free, is not full
BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
append size of 64K, and the system has 64K page size.
[CAUSE]
I have added several trace_printk() to show the events (header skipped):
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The above lines show our buffered write has dirtied 3 pages of inode
259 of root 5:
704K 768K 832K 896K
I |////I/////////////////I///////////| I
756K 868K
|///| is the dirtied range using subpage bitmaps. and 'I' is the page
boundary.
Meanwhile all three pages (704K, 768K, 832K) have their PageDirty
flag set.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
Then direct IO write starts, since the range [680K, 780K) covers the
beginning part of the above dirty range, we need to writeback the
two pages at 704K and 768K.
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
Now the above 2 lines show that we're writing back for dirty range
[756K, 756K + 64K).
We only writeback 64K because the zoned device has max zone append size
as 64K.
> extent_write_locked_range: r/i=5/259 clear dirty for page=786432
!!! The above line shows the root cause. !!!
We're calling clear_page_dirty_for_io() inside extent_write_locked_range(),
for the page 768K.
This is because extent_write_locked_range() can go beyond the current
locked page, here we hit the page at 768K and clear its page dirt.
In fact this would lead to the desync between subpage dirty and page
dirty flags. We have the page dirty flag cleared, but the subpage range
[820K, 832K) is still dirty.
After the writeback of range [756K, 820K), the dirty flags look like
this, as page 768K no longer has dirty flag set.
704K 768K 832K 896K
I I | I/////////////| I
820K 868K
This means we will no longer writeback range [820K, 832K), thus the
reserved data/metadata space would never be properly released.
> extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432
Now even though we try to start writeback for page 768K, since the
page is not dirty, we completely skip it at extent_write_cache_pages()
time.
> btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0
Now the direct IO finished.
> cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
> extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864
Now we writeback the remaining dirty range, which is [832K, 868K).
Causing the range [820K, 832K) never to be submitted, thus leaking the
reserved space.
This bug only affects subpage and zoned case. For non-subpage and zoned
case, we have exactly one sector for each page, thus no such partial dirty
cases.
For subpage and non-zoned case, we never go into run_delalloc_cow(), and
normally all the dirty subpage ranges would be properly submitted inside
__extent_writepage_io().
[FIX]
Just do not clear the page dirty at all inside extent_write_locked_range().
As __extent_writepage_io() would do a more accurate, subpage compatible
clear for page and subpage dirty flags anyway.
Now the correct trace would look like this:
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The page dirty part is still the same 3 pages.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
And the writeback for the first 64K is still correct.
> cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
> extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152
Now with the fix, we can properly writeback the range [820K, 832K), and
properly release the reserved data/metadata space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function __extent_writepage_io() is designed to find all dirty ranges of
a page, and add the dirty ranges to the bio_ctrl for submission.
It requires all the dirtied ranges to be covered by an ordered extent.
It gets called in two locations, but one call site is not subpage aware:
- __extent_writepage()
It gets called when writepage_delalloc() returned 0, which means
writepage_delalloc() has handled delalloc for all subpage sectors
inside the page.
So this call site is OK.
- extent_write_locked_range()
This call site is utilized by zoned support, and in this case, we may
only run delalloc range for a subset of the page, like this: (64K page
size)
0 16K 32K 48K 64K
|/////| |///////| |
In the above case, if extent_write_locked_range() is only triggered for
range [0, 16K), __extent_writepage_io() would still try to submit
the dirty range of [32K, 48K), then it would not find any ordered
extent for it and triggers various ASSERT()s.
Fix this problem by:
- Introducing @start and @len parameters to specify the range
For the first call site, we just pass the whole page, and the behavior
is not touched, since run_delalloc_range() for the page should have
created all ordered extents for the page.
For the second call site, we avoid touching anything beyond the
range, thus avoiding the dirty range which is not yet covered by any
delalloc range.
- Making btrfs_folio_assert_not_dirty() subpage aware
The only caller is inside __extent_writepage_io(), and since that
caller now accepts a subpage range, we should also check the subpage
range other than the whole page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.
And this is already validated by the validate_extent_map(). Now we can
remove the member.
However there is a special case in btrfs_create_dio_extent() where we
for NOCOW/PREALLOC ordered extents cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the core of the fiemap code lives in extent_io.c, which does
not make any sense because it's not related to extent IO at all (and it
was not as well before the big rewrite of fiemap I did some time ago).
The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
an inode operation.
Since there's a significant amount of fiemap code, move all of it into a
dedicated file, including its entry point inode.c:btrfs_fiemap().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reported by 'gcc -Wcast-qual', the argument from which write_extent_buffer()
reads data to write to the eb should be const. In addition the const
needs to be also added to __write_extent_buffer() local buffers.
All callers of write_eb_member() can now be updated to use const for the
input buffer structure or type.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running 'make W=2' there are a few reports where a variable of the
same name is declared in a nested block. In all the cases we can use the
one declared in the parent block, no problematic cases were found.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Another improper use of __folio_put() in an error path after freshly
allocating pages/folios which returns them with the refcount initialized
to 1. The refactor from __free_pages() -> __folio_put() (instead of
folio_put) removed a refcount decrement found in __free_pages() and
folio_put but absent from __folio_put().
Fixes: 13df3775ef ("btrfs: cleanup metadata page pointer usage")
CC: stable@vger.kernel.org # 6.8+
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the error status of super block write is tracked in page/folio
status bit Error. For that we need to keep the reference for the whole
duration of write and wait.
Count the number of superblock writeback errors in the btrfs_device.
That means we don't need the folio to stay around until it's waited for,
and can avoid the extra call to folio_get/put.
Also remove a mention of PageError in a comment as it's the last mention
of the page Error state.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have the lock_extent tightly coupled with
extent_clear_unlock_delalloc we can add a cached state to
extent_clear_unlock_delalloc and benefit from skipping the extra lookup
when we're doing cow.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to limit the scope of the extent lock to be around operations
that can change in flight. Currently we hold the extent lock through
the entire writepage operation, which isn't really necessary.
We want to protect to make sure nobody has updated DELALLOC. In
find_lock_delalloc_range we must lock the range in order to validate the
contents of our io_tree. However once we've done that we're safe to
unlock the range and continue, as we have the page lock already held for
the range.
We are protected from all operations at this point.
* mmap() - we're holding the page lock, thus are protected.
* buffered writes - again, we're protected because we take the page lock
for the first and last page in our range for buffered writes so we
won't create new delalloc ranges in this area.
* direct IO - we invalidate pagecache before attempting to write a new
area, which requires the page lock, so again are protected once we're
holding the page lock on this range.
Additionally this behavior actually already exists for compressed, we
unlock the range as soon as we start to process the async extents, and
re-lock it during compression. So this is completely safe, and makes
the locking more consistent.
Make this simple by just pushing the extent lock into
btrfs_run_delalloc_range. From there followup patches will push the
lock further down into its users.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently try_release_extent_mapping() as an int return type, but we
use it as a boolean. Its only caller, the release folio callback, also
returns a boolean which corresponds to try_release_extent_mapping()'s
return value. So change its return value type to bool as well as its
helper try_release_extent_state().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At try_release_extent_mapping(), called during the release folio callback
(btrfs_release_folio() callchain), we don't release any extent maps in the
range if the GFP flags don't allow blocking. This behaviour is exaggerated
because:
1) Both searching for extent maps and removing them are not blocking
operations. The only thing that it is the cond_resched() call at the
end of the loop that searches for and removes extent maps;
2) We currently only operate on a single page, so for the case where
block size matches the page size, we can only have one extent map,
and for the case where the block size is smaller than the page size,
we can have at most 16 extent maps.
So it's very unlikely the cond_resched() call will ever block even in the
block size smaller than page size scenario.
So instead of not removing any extent maps at all in case the GFP glags
don't allow blocking, keep removing extent maps while we don't need to
reschedule. This makes it safe for the subpage case and for a future
where we can process folios with a size larger than a page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we don't attempt to release extent maps if the inode has an
i_size that is not greater than 16M. This condition was added way back
in 2008 by commit 70dec8079d ("Btrfs: extent_io and extent_state
optimizations"), without any explanation about it. A quick chat with
Chris on slack revealed that the goal was probably to release the extent
maps for small files only when closing the inode. This however can be
harmful in case we have tons of such files being kept open for very long
periods of time, since we will consume more and more pages for extent
maps.
So remove the condition.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nowadays we have the btrfs_get_fs_generation() to get the current
generation of the filesystem, so there's no need anymore to lock the
transaction spinlock to read it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>