At btrfs_commit_super(), called in a few contexts such as when unmounting
a filesystem, we use btrfs_join_transaction() to catch any running
transaction and then commit it. This will however create a new and empty
transaction in case there's no running transaction or there's a running
transaction with a state >= TRANS_STATE_UNBLOCKED.
As we just want to be sure that any existing transaction is fully
committed, we can use btrfs_attach_transaction_barrier() instead of
btrfs_join_transaction(), therefore avoiding the creation and commit of
empty transactions, which only waste IO and causes rotation of the
precious backup roots.
Example where we create and commit a pointless empty transaction:
$ mkfs.btrfs -f /dev/sdj
$ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
generation 6
$ mount /dev/sdj /mnt/sdj
$ touch /mnt/sdj/foo
# Commit the currently open transaction. Just 'sync' or wait ~30
# seconds for the transaction kthread to commit it.
$ sync
$ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
generation 7
$ umount /mnt/sdj
$ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
generation 8
The transaction with id 8 was pointless, an empty transaction that did
not achieve anything.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When flushing reservations we are using btrfs_join_transaction() to get a
handle for the current transaction and then commit it to try to release
space. However btrfs_join_transaction() has some undesirable consequences:
1) If there's no running transaction, it will create one, and we will
commit it right after. This is unnecessary because it will not release
any space, and it will result in unnecessary IO and rotation of backup
roots in the superblock;
2) If there's a current transaction and that transaction is committing
(its state is >= TRANS_STATE_COMMIT_DOING), it will wait for that
transaction to almost finish its commit (for its state to be >=
TRANS_STATE_UNBLOCKED) and then start and return a new transaction.
We will then commit that new transaction, which is pointless because
all we wanted was to wait for the current (previous) transaction to
fully finish its commit (state == TRANS_STATE_COMPLETED), and by
starting and committing a new transaction we are wasting IO too and
causing unnecessary rotation of backup roots in the superblock.
So improve this by using btrfs_attach_transaction_barrier() instead, which
does not create a new transaction if there's none running, and if there's
a current transaction that is committing, it will wait for it to fully
commit and not create a new transaction.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The range is specified only in two ways, we can simplify the case for
the whole filesystem range as a NULL block group parameter.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we fully clean a subvolume (not only delete its directory,
but fully clean all it's related data and root item), the associated
qgroup would not be removed.
We have "btrfs qgroup clear-stale" to handle such 0 level qgroups.
Change the behavior to automatically removie the qgroup of a fully
cleaned subvolume when possible:
- Full qgroup but still consistent
We can and should remove the qgroup.
The qgroup numbers should be 0, without any rsv.
- Full qgroup but inconsistent
Can happen with drop_subtree_threshold feature (skip accounting
and mark qgroup inconsistent).
We can and should remove the qgroup.
Higher level qgroup numbers will be incorrect, but since qgroup
is already inconsistent, it should not be a problem.
- Squota mode
This is the special case, we can only drop the qgroup if its numbers
are all 0.
This would be handled by can_delete_qgroup(), so we only need to check
the return value and ignore the -EBUSY error.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Currently if one is utilizing "qgroups/drop_subtree_threshold" sysfs,
and a snapshot with level higher than that value is dropped, we will
not be able to delete the qgroup until next qgroup rescan:
uuid=ffffffff-eeee-dddd-cccc-000000000000
wipefs -fa $dev
mkfs.btrfs -f $dev -O quota -s 4k -n 4k -U $uuid
mount $dev $mnt
btrfs subvolume create $mnt/subv1/
for (( i = 0; i < 1024; i++ )); do
xfs_io -f -c "pwrite 0 2k" $mnt/subv1/file_$i > /dev/null
done
sync
btrfs subvolume snapshot $mnt/subv1 $mnt/snapshot
btrfs quota enable $mnt
btrfs quota rescan -w $mnt
sync
echo 1 > /sys/fs/btrfs/$uuid/qgroups/drop_subtree_threshold
btrfs subvolume delete $mnt/snapshot
btrfs subvolume sync $mnt
btrfs qgroup show -prce --sync $mnt
btrfs qgroup destroy 0/257 $mnt
umount $mnt
The final qgroup removal would fail with the following error:
ERROR: unable to destroy quota group: Device or resource busy
[CAUSE]
The above script would generate a subvolume of level 2, then snapshot
it, enable qgroup, set the drop_subtree_threshold, then drop the
snapshot.
Since the subvolume drop would meet the threshold, qgroup would be
marked inconsistent and skip accounting to avoid hanging the system at
transaction commit.
But currently we do not allow a qgroup with any rfer/excl numbers to be
dropped, and this is not really compatible with the new
drop_subtree_threshold behavior.
[FIX]
Only require the strict zero rfer/excl/rfer_cmpr/excl_cmpr for squota
mode. This is due to the fact that squota can never go inconsistent,
and it can have dropped subvolume but with non-zero qgroup numbers for
future accounting.
For full qgroup mode, we only check if there is a subvolume for it.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reported by 'gcc -Wcast-qual', the argument from which write_extent_buffer()
reads data to write to the eb should be const. In addition the const
needs to be also added to __write_extent_buffer() local buffers.
All callers of write_eb_member() can now be updated to use const for the
input buffer structure or type.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was reported by 'gcc -Wcast-qual', the get_unaligned_le8() simply
returns the argument and there's no reason to drop the cast.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was added in c61a16a701 ("Btrfs: fix the confusion between
delalloc bytes and metadata bytes") and removed in 03fe78cc29
("btrfs: use delalloc_bytes to determine flush amount for
shrink_delalloc") where the calculation was reworked to use a
non-constant numbers. This was found by 'make W=2'.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've started to use for-loop local variables and in a few places this
shadows a function variable. Convert a few cases reported by 'make W=2'.
If applicable also change the style to post-increment, that's the
preferred one.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix variable names in two macros where there's a local function variable
of the same name. In subpage_calc_start_bit() it's in several callers,
in btrfs_abort_transaction() it's only in replace_file_extents().
Found by 'make W=2'.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running 'make W=2' there are a few reports where a variable of the
same name is declared in a nested block. In all the cases we can use the
one declared in the parent block, no problematic cases were found.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a VFS inode local pointer and then doing many BTRFS_I()
calls inside btrfs_sync_file(), use a btrfs_inode pointer instead. This
makes everything a bit easier to read and less confusing, allowing to
make some statements shorter.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
instead, as this is generally what we do for internal APIs, making it
more consistent with most of the code base. This will later allow to
help to remove a lot of BTRFS_I() calls in btrfs_sync_file().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
instead, as this is generally what we do for internal APIs, making it
more consistent with most of the code base. This will later allow to
help to remove a lot of BTRFS_I() calls in btrfs_sync_file().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a inode pointer, use a btrfs_inode pointer in the log
context structure, as this is generally what we need and allows for some
internal APIs to take a btrfs_inode instead, making them more consistent
with most of the code base. This will later allow to help to remove a lot
of BTRFS_I() calls in btrfs_sync_file().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_finish_ordered_extent() returns a boolean indicating if
the ordered extent was added to the work queue for completion, but none
of its callers cares about it, so make it return void.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_block_group_root() is declared in disk-io.c; however,
all its callers are in block-group.c. Move it to the latter file and
declare it static.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop the single-use variable bytenr_orig and instead use btrfs_sb_offset()
in the function argument passing.
Fix a stale comment about not automatically fixing a bad primary
superblock from the backup mirror copies. Also, move the comment closer
to where the primary superblock read occurs.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are currently using a cached rb_root (struct rb_root_cached) for the
rb root of struct extent_map_tree. This doesn't offer much of an advantage
here because:
1) It's only advantage over the regular rb_root is that it caches a
pointer to the left most node (first node), so a call to
rb_first_cached() doesn't have to chase pointers until it reaches
the left most node;
2) We only have two scenarios that access left most node with
rb_first_cached():
When dropping all extent maps from an inode, during inode eviction;
When iterating over extent maps during the extent map shrinker;
3) In both cases we keep removing extent maps, which causes deletion of
the left most node so rb_erase_cached() has to call rb_next() to find
out what's the next left most node and assign it to
struct rb_root_cached::rb_leftmost;
4) We can do that ourselves in those two uses cases and stop using a
rb_root_cached rb tree and use instead a regular rb_root rb tree.
This reduces the size of struct extent_map_tree by 8 bytes and, since
this structure is embedded in struct btrfs_inode, it also reduces the
size of that structure by 8 bytes.
So on a 64 bits platform the size of btrfs_inode is reduced from 1032
bytes down to 1024 bytes.
This means we will be able to have 4 inodes per 4K page instead of 3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we name the rb_root member of struct extent_map_tree as 'map',
which is odd and confusing. Since it's a root node, rename it to 'root'.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On 64 bits platforms we don't really need to have a dedicated member (the
objectid field) for the inode's number since we store in the VFS inode's
i_ino member, which is an unsigned long and this type is 64 bits wide on
64 bits platforms. We only need that field in case we are on a 32 bits
platform because the unsigned long type is 32 bits wide on such platforms
See commit 33345d0152 ("Btrfs: Always use 64bit inode number") regarding
this 64/32 bits detail.
The objectid field of struct btrfs_inode is also used to store the ID of
a root for directories that are stubs for unreferenced roots. In such
cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
flag set.
So in order to reduce the size of btrfs_inode structure on 64 bits
platforms we can remove the objectid member and use the VFS inode's i_ino
member instead whenever we need to get the inode number. In case the inode
is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
last_reflink_trans to store the ID of the unreferenced root, since such
inode is a directory and reflinks can't be done against directories.
So remove the objectid fields for 64 bits platforms and alias the
last_reflink_trans field with a name of ref_root_id in a union.
On a release kernel config, this reduces the size of struct btrfs_inode
from 1040 bytes down to 1032 bytes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct btrfs_inode has a key member, named "location", that is
either:
1) The key of the inode's item. In this case the objectid is the number
of the inode;
2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
the case where we have a root that is a snapshot of a subvolume that
points to other subvolumes. In this case the objectid is the ID of
a subvolume inside the snapshotted parent subvolume.
The key is only used to lookup the inode item for the first case, while
for the second it's never used since it corresponds to directory stubs
created with new_simple_dir() and which are marked as dummy, so there's
no actual inode item to ever update. In the second case we only check
the key type at btrfs_ino() for 32 bits platforms and its objectid is
only needed for unlink.
Instead of using a key we can do fine with just the objectid, since we
can generate the key whenever we need it having only the objectid, as
in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
is always 0.
So use only an objectid instead of a full key. This reduces the size of
struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When not using the NO_HOLES feature we always allocate an io tree for an
inode's file_extent_tree. This is wasteful because that io tree is only
used for regular files, so we allocate more memory than needed for inodes
that represent directories or symlinks for example, or for inodes that
correspond to free space inodes.
So improve on this by allocating the io tree only for inodes of regular
files that are not free space inodes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The index_cnt field of struct btrfs_inode is used only for two purposes:
1) To store the index for the next entry added to a directory;
2) For the data relocation inode to track the logical start address of the
block group currently being relocated.
For the relocation case we use index_cnt because it's not used for
anything else in the relocation use case - we could have used other fields
that are not used by relocation such as defrag_bytes, last_unlink_trans
or last_reflink_trans for example (among others).
Since the csum_bytes field is not used for directories, do the following
changes:
1) Put index_cnt and csum_bytes in a union, and index_cnt is only
initialized when the inode is a directory. The csum_bytes is only
accessed in IO paths for regular files, so we're fine here;
2) Use the defrag_bytes field for relocation, since the data relocation
inode is never used for defrag purposes. And to make the naming better,
alias it to reloc_block_group_start by using a union.
This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1056 bytes down to 1048 bytes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use the spinlock inode_lock from struct btrfs_root to
serialize access to two different data structures:
1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
2) The inodes xarray (struct btrfs_root::inodes).
Instead of using our own lock, we can use the spinlock that is part of the
xarray implementation, by using the xa_lock() and xa_unlock() APIs and
using the xarray APIs with the double underscore prefix that don't take
the xarray locks and assume the caller is using xa_lock() and xa_unlock().
So remove the spinlock inode_lock from struct btrfs_root and use the
corresponding xarray locks. This brings 2 benefits:
1) We reduce the size of struct btrfs_root, from 1336 bytes down to
1328 bytes on a 64 bits release kernel config;
2) We reduce lock contention by not using anymore the same lock for
changing two different and unrelated xarrays.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make btrfs_iget_path() simpler and easier to read by avoiding nesting of
if-then-else statements and having an error label to do all the error
handling instead of repeating it a couple times.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a new inode, at btrfs_create_new_inode(), one of the very
last steps is to add the inode to the root's inodes xarray. This often
requires allocating memory which may fail (even though xarrays have a
dedicated kmem_cache which make it less likely to fail), and at that point
we are forced to abort the current transaction (as some, but not all, of
the inode metadata was added to its subvolume btree).
To avoid a transaction abort, preallocate memory for the xarray early at
btrfs_create_new_inode(), so that if we fail we don't need to abort the
transaction and the insertion into the xarray is guaranteed to succeed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
the embedded red black tree node, which after the next patches will allow
to reduce the size of the structure to 1024 bytes, meaning we will be able
to store 4 inodes per 4K page instead of 3 inodes.
This change does a straightforward change to use an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 92081.6 12505547
16 2400000 0 138222.6 13067072
23 3600000 0 148833.1 13290336
43 4800000 0 97864.7 13931248
53 6000000 0 85597.3 14384313
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 93225.1 12571078
16 2400000 0 146720.3 12805007
23 3600000 0 160626.4 13073835
46 4800000 0 116286.2 13802927
53 6000000 0 90087.9 14754892
The test was run with a release kernel config (Debian's default config).
Also capturing the insertion times into the rb tree and into the xarray,
that is measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):
Before this patch, inode_tree_add() execution times:
Count: 5000000
Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000
0.000 - 7.816: 40 |
7.816 - 37.858: 209 |
37.858 - 170.278: 6059 |
170.278 - 753.961: 2754890 #####################################################
753.961 - 3326.728: 2232312 ###########################################
3326.728 - 14667.018: 4366 |
14667.018 - 64652.943: 852 |
64652.943 - 284981.761: 550 |
284981.761 - 1256150.914: 221 |
1256150.914 - 5536887.000: 7 |
After this patch, btrfs_add_inode_to_root() execution times:
Count: 5000000
Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000
0.000 - 7.264: 104 |
7.264 - 33.145: 352 |
33.145 - 140.081: 109606 #
140.081 - 581.930: 4840090 #####################################################
581.930 - 2407.590: 43532 |
2407.590 - 9950.979: 2245 |
9950.979 - 41119.278: 514 |
41119.278 - 169902.616: 155 |
169902.616 - 702018.539: 47 |
702018.539 - 2900652.000: 9 |
Average, percentiles, standard deviation, etc, are all much better.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several hard-to-hit ASSERT()s hit inside raid56.
Unfortunately the ASSERT() expression is a little complex, and except
the ASSERT(), there is nothing to provide any clue.
Considering if race is involved, it's pretty hard to reproduce.
Meanwhile sometimes the dump of the rbio structure can provide some
pretty good clues, it's worth to do the extra multi-line dump for
btrfs raid56 related code.
The dump looks like this:
BTRFS critical (device dm-3): bioc logical=4598530048 full_stripe=4598530048 size=0 map_type=0x81 mirror=0 replace_nr_stripes=0 replace_stripe_src=-1 num_stripes=5
BTRFS critical (device dm-3): nr=0 devid=1 physical=1166147584
BTRFS critical (device dm-3): nr=1 devid=2 physical=1145176064
BTRFS critical (device dm-3): nr=2 devid=4 physical=1145176064
BTRFS critical (device dm-3): nr=3 devid=5 physical=1145176064
BTRFS critical (device dm-3): nr=4 devid=3 physical=1145176064
BTRFS critical (device dm-3): rbio flags=0x0 nr_sectors=80 nr_data=4 real_stripes=5 stripe_nsectors=16 scrubp=0 dbitmap=0x0
BTRFS critical (device dm-3): logical=4598530048
assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1702
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Due to a refactoring introduced by commit 53d9981ca2 ("btrfs: split
btrfs_alloc_ordered_extent to allocation and insertion helpers"), the
function btrfs_alloc_ordered_extent() was renamed to
alloc_ordered_extent(), so the comment at btrfs_remove_ordered_extent()
is no longer very accurate. Update the comment to refer to the new
name "alloc_ordered_extent()".
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix typo in the end IO compression callbacks, from "comprssed" to
"compressed".
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_migrate_to_delayed_refs_rsv() is no longer used.
Its last use was removed in commit 2f6397e448 ("btrfs: don't refill
whole delayed refs block reserve when starting transaction").
So remove the function.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used outside zoned.c, so make it static.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Passing in a 'struct btrfs_io_geometry into handle_ops_on_dev_replace
can reduce the number of arguments by two.
No functional changes otherwise.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ioctls that add relations, create qgroups or set limits start/join
transaction. When quotas are not enabled this is not necessary, there
will be errors reported back anyway but this could be also misleading
and we should really report that quotas are not enabled. For that use
-ENOTCONN.
The helper is meant to do a quick check before any other standard ioctl
checks are done. If quota is disabled meanwhile we still rely on proper
locking inside any active operation changing the qgroup structures.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaG0jUACgkQxWXV+ddt
WDsg4Q//f7xRooomsRaRvIDN4Add17jyZWuhreMhxKKvdtoRzLq/YZUldGJt1XXu
tVpCvKkVvrJt5UY3b1czfA8GTHeOInZSoqyV8ZaxOAcg6cfKrUtLFi1wAKGU6BK/
S9Zz/9lfbjXhG4PnZW7GRFmH0n68p/dmS+kXPVVPhj7CibxuZJtifflK1r6rHw+Y
PDDnMIPqbrdK+xBfR+SEpG+1uIEFvI6SgsAZG6p0mzcbbUINGp7SpU3v2NVuvdIX
tjTlZhz+D2do4dPk3RJ4Z6dpL+VlZnR1F9CZn6SmPcdGWn6mTkywoukJPP62iHYO
b8Xr4j1mKDPcu55OEPWwhYWOvRP4FvltrGB+35xsQoPlKBBKxSMFgNyEkdfvjtKU
qL9qRYJvm8D2OBqdv5BnawKRSouPcR4KPiM9JSGWRcW6vdOiY3Cd1mnAR8EA1LSd
SRYmBlrZIIjOpEeGkEy2MeNKEH9L5B1wAU21VQd+cAlCKBFUmFb1csZ/vCJJ1B34
swedmirNx0lyZTNBGQsDipkxSM0c/0xj/ysG8N41bLvPUM35x0K9xw5ZqBlTXrrg
Lln51s6TeqSzfEvKdDGikBL3dVhT5I6acRUchQU9YSZ3Bqxj3QCTNnOZMZQsCFkp
CMohduRVHiK4LHQTkU++q8bT8xX20kh/1HIJhAjdnDbrlRY4oTU=
=tyXO
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix folio refcounting when releasing them (encoded write, dummy
extent buffer)
- fix out of bounds read when checking qgroup inherit data
- fix how configurable chunk size is handled in zoned mode
- in the ref-verify tool, fix uninitialized return value when checking
extent owner ref and simple quota are not enabled
* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix folio refcount in __alloc_dummy_extent_buffer()
btrfs: fix folio refcount in btrfs_do_encoded_write()
btrfs: fix uninitialized return value in the ref-verify tool
btrfs: always do the basic checks for btrfs_qgroup_inherit structure
btrfs: zoned: fix calc_available_free_space() for zoned mode
Another improper use of __folio_put() in an error path after freshly
allocating pages/folios which returns them with the refcount initialized
to 1. The refactor from __free_pages() -> __folio_put() (instead of
folio_put) removed a refcount decrement found in __free_pages() and
folio_put but absent from __folio_put().
Fixes: 13df3775ef ("btrfs: cleanup metadata page pointer usage")
CC: stable@vger.kernel.org # 6.8+
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The conversion to folios switched __free_page() to __folio_put() in the
error path in btrfs_do_encoded_write().
However, this gets the page refcounting wrong. If we do hit that error
path (I reproduced by modifying btrfs_do_encoded_write to pretend to
always fail in a way that jumps to out_folios and running the fstests
case btrfs/281), then we always hit the following BUG freeing the folio:
BUG: Bad page state in process btrfs pfn:40ab0b
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: nonzero _refcount
Call Trace:
<TASK>
dump_stack_lvl+0x3d/0xe0
bad_page+0xea/0xf0
free_unref_page+0x8e1/0x900
? __mem_cgroup_uncharge+0x69/0x90
__folio_put+0xe6/0x190
btrfs_do_encoded_write+0x445/0x780
? current_time+0x25/0xd0
btrfs_do_write_iter+0x2cc/0x4b0
btrfs_ioctl_encoded_write+0x2b6/0x340
It turns out __free_page() decreases the page reference count while
__folio_put() does not. Switch __folio_put() to folio_put() which
decreases the folio reference count first.
Fixes: 400b172b8c ("btrfs: compression: migrate compression/decompression paths to folios")
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the ref-verify tool, when processing the inline references of an extent
item, we may end up returning with uninitialized return value, because:
1) The 'ret' variable is not initialized if there are no inline extent
references ('ptr' == 'end' before the while loop starts);
2) If we find an extent owner inline reference we don't initialize 'ret'.
So fix these cases by initializing 'ret' to 0 when declaring the variable
and set it to -EINVAL if we find an extent owner inline references and
simple quotas are not enabled (as well as print an error message).
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/59b40ebe-c824-457d-8b24-0bbca69d472b@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reports the following regression detected by KASAN:
BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171
CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922
btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382
create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875
btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075
__btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340
btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422
btrfs_ioctl+0x99e/0xc60
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcbf1992509
RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509
RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660
R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e
</TASK>
And it also pinned it down to commit b5357cb268 ("btrfs: qgroup: do not
check qgroup inherit if qgroup is disabled").
[CAUSE]
That offending commit skips the whole qgroup inherit check if qgroup is
not enabled.
But that also skips the very basic checks like
num_ref_copies/num_excl_copies and the structure size checks.
Meaning if a qgroup enable/disable race is happening at the background,
and we pass a btrfs_qgroup_inherit structure when the qgroup is
disabled, the check would be completely skipped.
Then at the time of transaction commitment, qgroup is re-enabled and
btrfs_qgroup_inherit() is going to use the incorrect structure and
causing the above KASAN error.
[FIX]
Make btrfs_qgroup_check_inherit() only skip the source qgroup checks.
So that even if invalid btrfs_qgroup_inherit structure is passed in, we
can still reject invalid ones no matter if qgroup is enabled or not.
Furthermore we do already have an extra safety inside
btrfs_qgroup_inherit(), which would just ignore invalid qgroup sources,
so even if we only skip the qgroup source check we're still safe.
Reported-by: syzbot+a0d1f7e26910be4dc171@syzkaller.appspotmail.com
Fixes: b5357cb268 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
calc_available_free_space() returns the total size of metadata (or
system) block groups, which can be allocated from unallocated disk
space. The logic is wrong on zoned mode in two places.
First, the calculation of data_chunk_size is wrong. We always allocate
one zone as one chunk, and no partial allocation of a zone. So, we
should use zone_size (= data_sinfo->chunk_size) as it is.
Second, the result "avail" may not be zone aligned. Since we always
allocate one zone as one chunk on zoned mode, returning non-zone size
aligned bytes will result in less pressure on the async metadata reclaim
process.
This is serious for the nearly full state with a large zone size device.
Allowing over-commit too much will result in less async reclaim work and
end up in ENOSPC. We can align down to the zone size to avoid that.
Fixes: cb6cbab790 ("btrfs: adjust overcommit logic when very close to full")
CC: stable@vger.kernel.org # 6.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaC45kACgkQxWXV+ddt
WDt6yRAAkn3n/nkapAvQbtOEIAV9GOc+DYecQXLM+E6m85vsvOBO6OeO/QDfIGvI
ALNE4EEKTkmqk6AOLNX9rQUvo8aOaDXj/9bNZYuVSCzG2hfwLijv6DlRShxuEIE8
kzOisuIZ4w7aL7G7OsUa50j/BLdZ47+iVF/79N+odhdaCDhhK/CcfLbemiLUS9AF
OYkYDyl2WCo9HLduSlVHDWXNUKs7I6/S29UWQpkTKTkmLlMk7rdkgbpjgpKjFxsd
/CuVW5NEGs+4dkV2OdOJ9t+f4qGJ56YuJanvrV3R1bGh+sgDzrcA0kP8318nFzgG
KBYTXjAZoe5RAi4IYfMhSrEExo2JJYFeei0B7Dv3M4IvxLOF7NdMvppBFdODF0A4
20gZ8EgNtZ0sMEafK2WAZSI23sjX0TH+/P3FFKQszxX0fRH3NV6al2drhuBTeIKr
UzxDqeDuqpDHSZskHIMR9nMEymcov5JTg5v2ds64WiT8kr3L23u3j1KO+8UrZ8j4
eB5vYEXE9VJDNPoJQ/qhbqDduzIJ1s+RU8lhEVIUJRIThRk/oHc7YcXgXKqvS6RM
ilJeAzjEhp2t3CQDfx+FJ9jjgi5jhOoCkbITg6iiiExIt++XOXJtL0y2BH514LA5
5jQXoXl/YnukfCaKLNRsxYXCw7JAQxWQiVq3s9g1tG5679zrfNA=
=RkBU
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fixup for a recent fix that prevents an infinite loop during block
group reclaim.
Unfortunately it introduced an unsafe way of updating block group list
and could race with relocation. This could be hit on fast devices when
relocation/balance does not have enough space"
* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix adding block group to a reclaim list and the unused list during reclaim
There is a potential parallel list adding for retrying in
btrfs_reclaim_bgs_work and adding to the unused list. Since the block
group is removed from the reclaim list and it is on a relocation work,
it can be added into the unused list in parallel. When that happens,
adding it to the reclaim list will corrupt the list head and trigger
list corruption like below.
Fix it by taking fs_info->unused_bgs_lock.
[177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
[177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
[177.529][T2585409] ------------[ cut here ]------------
[177.537][T2585409] kernel BUG at lib/list_debug.c:65!
[177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G W 6.10.0-rc5-kts #1
[177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
[177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
[177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
[177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
[177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
[177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
[177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
[177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
[177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
[177.687][T2585409] FS: 0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
[177.700][T2585409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
[177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
[177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
[177.742][T2585409] PKRU: 55555554
[177.748][T2585409] Call Trace:
[177.753][T2585409] <TASK>
[177.759][T2585409] ? __die_body.cold+0x19/0x27
[177.766][T2585409] ? die+0x2e/0x50
[177.772][T2585409] ? do_trap+0x1ea/0x2d0
[177.779][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.788][T2585409] ? do_error_trap+0xa3/0x160
[177.795][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.805][T2585409] ? handle_invalid_op+0x2c/0x40
[177.812][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.820][T2585409] ? exc_invalid_op+0x2d/0x40
[177.827][T2585409] ? asm_exc_invalid_op+0x1a/0x20
[177.834][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.843][T2585409] btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
safe, AFAICS. Since the block group was in the unused list, the used bytes
should be 0 when it was added to the unused list. Then, it checks
block_group->{used,reserved,pinned} are still 0 under the
block_group->lock. So, they should be still eligible for the unused list,
not the reclaim list.
The reason it is safe there it's because because we're holding
space_info->groups_sem in write mode.
That means no other task can allocate from the block group, so while we
are at deleted_unused_bgs() it's not possible for other tasks to
allocate and deallocate extents from the block group, so it can't be
added to the unused list or the reclaim list by anyone else.
The bug can be reproduced by btrfs/166 after a few rounds. In practice
this can be hit when relocation cannot find more chunk space and ends
with ENOSPC.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Fixes: 4eb4e85c4f ("btrfs: retry block group reclaim without infinite loop")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmZ9gV0ACgkQxWXV+ddt
WDsK2hAAmbOArbK5QHprawdOqqvJL46yoGCMba798EjYo53+hO1F8/lb531+zCUI
GDhdIC2mHdRIkARJ8Cde5POUjID1Kv3Rlc0rdHy3nOw38WZmA/+HdkcKzQhsDFSR
/FX9RKSWiu0xl6JdCLh4KkIWE+2m1v1kybhvRHCKb+70iBua1e+OSoM33BeiIhrP
yoFwMwIbzG2CoZOHoobDxUjs9ZMUHm4wH0csJYG9R59Vv7uLBOgpWuQB46iqpoj4
EYR8Sg8PscI7YXa0y8VTP3pdrNMW48IC6jerIAKHUeWRWRoTCh9He+3/E4xjNHxz
3Pm+Aat5QYdsqmE68IbeN5c7QB1YAdUCgoJJJwAFjwe9WtTn8RS9RiMjWIr+VHYm
E3REibQI151p0yHwAl8xPHDiTecmlNisof0eg6gzHdvODm/NYFuFapD+aDxWribX
G63dOa8Fy0h4pwDoF73Rd2YYbtO6tDVSNVIG3bWpPep3r+SI/oC4JMHbmn1aqmqF
c0/ZYVbsx/Hm066l4LCrpgi7kJ8en2zQ8MmcZHHt+2gXe1AyAON7kvRHaEizUCaA
fnLVhQvaOofC4g7DJc1JkwyLc9VF5hMTYUhldoJvqj1wm2qsT/siJgKVvllRGoMs
FU2qlWYaN/fPRylyEySyZPq/sWKHOAZZSOhyWM8SB2nYoBXNkO0=
=qpEp
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix quota root leak after quota disable failure
- fix condition when checking if a zone can be added as free
- allocate inode in NOFS context during logging or tree-log replay
- handle raid-stripe-tree lookup correctly during scrub
* tag 'for-6.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: fix quota root leak after quota disable failure
btrfs: scrub: handle RST lookup error correctly
btrfs: zoned: fix initial free space detection
btrfs: use NOFS context when getting inodes during logging and log replay
If during the quota disable we fail when cleaning the quota tree or when
deleting the root from the root tree, we jump to the 'out' label without
ever dropping the reference on the quota root, resulting in a leak of the
root since fs_info->quota_root is no longer pointing to the root (we have
set it to NULL just before those steps).
Fix this by always doing a btrfs_put_root() call under the 'out' label.
This is a problem that exists since qgroups were first added in 2012 by
commit bed92eae26 ("Btrfs: qgroup implementation and prototypes"), but
back then we missed a kfree on the quota root and free_extent_buffer()
calls on its root and commit root nodes, since back then roots were not
yet reference counted.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/060 with forced RST feature, it would crash the
following ASSERT() inside scrub_read_endio():
ASSERT(sector_nr < stripe->nr_sectors);
Before that, we would have tree dump from
btrfs_get_raid_extent_offset(), as we failed to find the RST entry for
the range.
[CAUSE]
Inside scrub_submit_extent_sector_read() every time we allocated a new
bbio we immediately called btrfs_map_block() to make sure there was some
RST range covering the scrub target.
But if btrfs_map_block() fails, we immediately call endio for the bbio,
while the bbio is newly allocated, it's completely empty.
Then inside scrub_read_endio(), we go through the bvecs to find
the sector number (as bi_sector is no longer reliable if the bio is
submitted to lower layers).
And since the bio is empty, such bvecs iteration would not find any
sector matching the sector, and return sector_nr == stripe->nr_sectors,
triggering the ASSERT().
[FIX]
Instead of calling btrfs_map_block() after allocating a new bbio, call
btrfs_map_block() first.
Since our only objective of calling btrfs_map_block() is only to update
stripe_len, there is really no need to do that after btrfs_alloc_bio().
This new timing would avoid the problem of handling empty bbio
completely, and in fact fixes a possible race window for the old code,
where if the submission thread is the only owner of the pending_io, the
scrub would never finish (since we didn't decrease the pending_io
counter).
Although the root cause of RST lookup failure still needs to be
addressed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a new block group, it calls btrfs_add_new_free_space() to add
the entire block group range into the free space accounting.
__btrfs_add_free_space_zoned() checks if size == block_group->length to
detect the initial free space adding, and proceed that case properly.
However, if the zone_capacity == zone_size and the over-write speed is fast
enough, the entire zone can be over-written within one transaction. That
confuses __btrfs_add_free_space_zoned() to handle it as an initial free
space accounting. As a result, that block group becomes a strange state: 0
used bytes, 0 zone_unusable bytes, but alloc_offset == zone_capacity (no
allocation anymore).
The initial free space accounting can properly be checked by checking
alloc_offset too.
Fixes: 98173255bd ("btrfs: zoned: calculate free space from zone capacity")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>