In the next patch in the series, there will be the need to access the old
name, and its length, of an inode when logging the inode during a rename.
So instead of passing the inode to btrfs_log_new_name() pass the dentry,
because from the dentry we can get the inode, the name and its length.
This will avoid passing 3 new parameters to btrfs_log_new_name() in the
next patch - the name, its length and an index number. This way we end
up passing only 1 new parameter, the index number.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the code that finds and deletes a logged dir entry out of
btrfs_del_dir_entries_in_log() into a helper function. This new helper
function will be used by another patch in the same series, and serves
to avoid having duplicated logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory we are trying to log subdirectories that were
changed in the current transaction and created in a past transaction.
This type of behaviour was introduced by commit 2f2ff0ee5e ("Btrfs:
fix metadata inconsistencies after directory fsync"), to fix some metadata
inconsistencies that in the meanwhile no longer need this behaviour due to
numerous other changes that happened throughout the years.
This behaviour, besides not needed anymore, it's also undesirable because:
1) It's not reliable because it's only triggered for the directories
of dentries (dir items) that happen to be present on a leaf that
was changed in the current transaction. If a dentry that points to
a directory resides on a leaf that was not changed in the current
transaction, then it's not logged, as at log_dir_items() and
log_new_dir_dentries() we use btrfs_search_forward();
2) It's not required by posix or any standard, it's undefined territory.
The only way to guarantee a subdirectory is logged, it to explicitly
fsync it;
Making the behaviour guaranteed would require scanning all directory
items, check which point to a directory, and then fsync each subdirectory
which was modified in the current transaction. This could be very
expensive for large directories with many subdirectories and/or large
subdirectories.
So remove that obsolete logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory, we go over every leaf of the subvolume tree that
was changed in the current transaction and copy all its dir index keys to
the log tree.
That includes copying dir index keys created in past transactions. This is
done mostly for simplicity, as after logging the keys we log an item that
specifies the start and end ranges of the keys we logged. That item is
then used during log replay to figure out which keys need to be deleted -
every key in that range that we find in the subvolume tree and is not in
the log tree, needs to be deleted.
Now that we log only dir index keys, and not dir item keys anymore, when
we remove dentries from a directory (due to unlink and rename operations),
we can get entire leaves that we changed only for deleting old dir index
keys, or that have few dir index keys that are new - this is due to the
fact that the offset for new index keys comes from a monotonically
increasing counter.
We can avoid logging dir index keys from past transactions, and in order
to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
type) when we find gaps between consecutive index keys. This massively
reduces the amount of logged metadata when we have deleted directory
entries, even if it's a small percentage of the total number of entries.
The reduction comes from both less items that are logged and instead of
logging many dir index items (struct btrfs_dir_item), which have a size
of 30 bytes plus a file name, we typically log just a few range items
(struct btrfs_dir_log_item), which take only 8 bytes each.
Even if no entries were deleted from a directory and only new entries
were added, we typically still get a reduction on the amount of logged
metadata, because it's very likely the first leaf that got the new
dir index entries also has several old dir index entries.
So change the logging logic to not log dir index keys created in past
transactions and log a range item for every gap it finds between each
pair of consecutive index keys, to ensure deletions are tracked and
replayed on log replay.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
The following test was run on a branch without this patchset and on a
branch with the first three patches applied:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=1000000
NUM_FILE_DELETES=10000
MKFS_OPTIONS="-O no-holes -R free-space-tree"
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
echo
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config),
and the results were the following for various values of NUM_FILES and
NUM_FILE_DELETES:
** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
dir fsync took 585 ms after deleting 10000 files
** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
dir fsync took 34 ms after deleting 10000 files (-94.2%)
** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
dir fsync took 50 ms after deleting 1000 files
** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
dir fsync took 7 ms after deleting 1000 files (-86.0%)
** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
dir fsync took 9 ms after deleting 100 files
** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 **
dir fsync took 5 ms after deleting 100 files (-44.4%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before we start to log dir index keys from a leaf, we check if there is a
previous index key, which normally is at the end of a leaf that was not
changed in the current transaction. Then we log that key and set the start
of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of
that key. This is to ensure that if there were deleted index keys between
that key and the first key we are going to log, those deletions are
replayed in case we need to replay to the log after a power failure.
However we really don't need to log that previous key, we can just set the
start of the logged range to that key's offset plus 1. This achieves the
same and avoids logging one dir index key.
The same logic is performed when we finish logging the index keys of a
leaf and we find that the next leaf has index keys and was not changed in
the current transaction. We are logging the first key of that next leaf
and use its offset as the end of range we log. This is just to ensure that
if there were deleted index keys between the last index key we logged and
the first key of that next leaf, those index keys are deleted if we end
up replaying the log. However that is not necessary, we can avoid logging
that first index key of the next leaf and instead set the end of the
logged range to match the offset of that index key minus 1.
So avoid logging those index keys at the boundaries and adjust the start
and end offsets of the logged ranges as described above.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
Performance test results are listed in the changelog of patch 3/4.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ->write and ->wait fields of struct walk_control, used for log trees,
are not used since 2008, more specifically since commit d0c803c404
("Btrfs: Record dirty pages tree-log pages in an extent_io tree") and
since commit d0c803c404 ("Btrfs: Record dirty pages tree-log pages in
an extent_io tree"). So just remove them, along with the function
btrfs_write_tree_block(), which is also not used anymore after removing
the ->write member.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During log replay, whenever we need to check if a name (dentry) exists in
a directory we do searches on the subvolume tree for inode references or
or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY
keys as well, before kernel 5.17). However when during log replay we
unlink a name, through btrfs_unlink_inode(), we may not delete inode
references and dir index keys from a subvolume tree and instead just add
the deletions to the delayed inode's delayed items, which will only be
run when we commit the transaction used for log replay. This means that
after an unlink operation during log replay, if we attempt to search for
the same name during log replay, we will not see that the name was already
deleted, since the deletion is recorded only on the delayed items.
We run delayed items after every unlink operation during log replay,
except at unlink_old_inode_refs() and at add_inode_ref(). This was due
to an overlook, as delayed items should be run after evert unlink, for
the reasons stated above.
So fix those two cases.
Fixes: 0d836392ca ("Btrfs: fix mount failure after fsync due to hard link recreation")
Fixes: 1f250e929a ("Btrfs: fix log replay failure after unlink and link combination")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a full fsync, if we have prealloc extents beyond (or at) eof,
and the leaves that contain them were not modified in the current
transaction, we end up not logging them. This results in losing those
extents when we replay the log after a power failure, since the inode is
truncated to the current value of the logged i_size.
Just like for the fast fsync path, we need to always log all prealloc
extents starting at or beyond i_size. The fast fsync case was fixed in
commit 471d557afe ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay") but it missed the full fsync path. The problem
exists since the very early days, when the log tree was added by
commit e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize
synchronous operations").
Example reproducer:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with many file extent items, so that they span
# several leaves of metadata, even if the node/page size is 64K. Use
# direct IO and not fsync/O_SYNC because it's both faster and it avoids
# clearing the full sync flag from the inode - we want the fsync below
# to trigger the slow full sync code path.
$ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo
# Now add two preallocated extents to our file without extending the
# file's size. One right at i_size, and another further beyond, leaving
# a gap between the two prealloc extents.
$ xfs_io -c "falloc -k 16M 1M" /mnt/foo
$ xfs_io -c "falloc -k 20M 1M" /mnt/foo
# Make sure everything is durably persisted and the transaction is
# committed. This makes all created extents to have a generation lower
# than the generation of the transaction used by the next write and
# fsync.
sync
# Now overwrite only the first extent, which will result in modifying
# only the first leaf of metadata for our inode. Then fsync it. This
# fsync will use the slow code path (inode full sync bit is set) because
# it's the first fsync since the inode was created/loaded.
$ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo
# Extent list before power failure.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x0
3: [32768..34815]: 2172928..2174975 2048 0x800
4: [34816..40959]: hole 6144
5: [40960..43007]: 2174976..2177023 2048 0x801
<power fail>
# Mount fs again, trigger log replay.
$ mount /dev/sdc /mnt
# Extent list after power failure and log replay.
$ xfs_io -c "fiemap -v" /mnt/foo
/mnt/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..7]: 2178048..2178055 8 0x0
1: [8..16383]: 26632..43007 16376 0x0
2: [16384..32767]: 2156544..2172927 16384 0x1
# The prealloc extents at file offsets 16M and 20M are missing.
So fix this by calling btrfs_log_prealloc_extents() when we are doing a
full fsync, so that we always log all prealloc extents beyond eof.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the recent changes made by commit c2e3930529 ("btrfs: clear
extent buffer uptodate when we fail to write it") and its followup fix,
commit 651740a502 ("btrfs: check WRITE_ERR when trying to read an
extent buffer"), we can now end up not cleaning up space reservations of
log tree extent buffers after a transaction abort happens, as well as not
cleaning up still dirty extent buffers.
This happens because if writeback for a log tree extent buffer failed,
then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
when trying to free the log tree with free_log_tree(), which iterates
over the tree, we can end up getting an -EIO error when trying to read
a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
immediately as we can not iterate over the entire tree.
In that case we never update the reserved space for an extent buffer in
the respective block group and space_info object.
When this happens we get the following traces when unmounting the fs:
[174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
[174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
[174957.399379] ------------[ cut here ]------------
[174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.407523] Modules linked in: btrfs overlay dm_zero (...)
[174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.429717] Code: 21 48 8b bd (...)
[174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
[174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
[174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
[174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
[174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
[174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
[174957.439317] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.440402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.443948] Call Trace:
[174957.444264] <TASK>
[174957.444538] btrfs_free_block_groups+0x255/0x3c0 [btrfs]
[174957.445238] close_ctree+0x301/0x357 [btrfs]
[174957.445803] ? call_rcu+0x16c/0x290
[174957.446250] generic_shutdown_super+0x74/0x120
[174957.446832] kill_anon_super+0x14/0x30
[174957.447305] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.447890] deactivate_locked_super+0x31/0xa0
[174957.448440] cleanup_mnt+0x147/0x1c0
[174957.448888] task_work_run+0x5c/0xa0
[174957.449336] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.449934] syscall_exit_to_user_mode+0x16/0x40
[174957.450512] do_syscall_64+0x48/0xc0
[174957.450980] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.451605] RIP: 0033:0x7f328fdc4a97
[174957.452059] Code: 03 0c 00 f7 (...)
[174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.460193] </TASK>
[174957.460534] irq event stamp: 0
[174957.461003] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.463147] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.466323] ---[ end trace bc7ee0c490bce3af ]---
[174957.467282] ------------[ cut here ]------------
[174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.470066] Modules linked in: btrfs overlay dm_zero (...)
[174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.488050] Code: 00 00 00 ad de (...)
[174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
[174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
[174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
[174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
[174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
[174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
[174957.499438] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.500990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.506167] Call Trace:
[174957.506654] <TASK>
[174957.507047] close_ctree+0x301/0x357 [btrfs]
[174957.507867] ? call_rcu+0x16c/0x290
[174957.508567] generic_shutdown_super+0x74/0x120
[174957.509447] kill_anon_super+0x14/0x30
[174957.510194] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.511123] deactivate_locked_super+0x31/0xa0
[174957.511976] cleanup_mnt+0x147/0x1c0
[174957.512610] task_work_run+0x5c/0xa0
[174957.513309] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.514231] syscall_exit_to_user_mode+0x16/0x40
[174957.515069] do_syscall_64+0x48/0xc0
[174957.515718] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.516688] RIP: 0033:0x7f328fdc4a97
[174957.517413] Code: 03 0c 00 f7 d8 (...)
[174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.529404] </TASK>
[174957.529843] irq event stamp: 0
[174957.530256] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.532075] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
[174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
[174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
[174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
[174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
[174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
[174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
[174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
This also means that in case we have log tree extent buffers that are
still dirty, we can end up not cleaning them up in case we find an
extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
we have no way for iterating over the rest of the tree.
This issue is very often triggered with test cases generic/475 and
generic/648 from fstests.
The issue could almost be fixed by iterating over the io tree attached to
each log root which keeps tracks of the range of allocated extent buffers,
log_root->dirty_log_pages, however that does not work and has some
inconveniences:
1) After we sync the log, we clear the range of the extent buffers from
the io tree, so we can't find them after writeback. We could keep the
ranges in the io tree, with a separate bit to signal they represent
extent buffers already written, but that means we need to hold into
more memory until the transaction commits.
How much more memory is used depends a lot on whether we are able to
allocate contiguous extent buffers on disk (and how often) for a log
tree - if we are able to, then a single extent state record can
represent multiple extent buffers, otherwise we need multiple extent
state record structures to track each extent buffer.
In fact, my earlier approach did that:
https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
However that can cause a very significant negative impact on
performance, not only due to the extra memory usage but also because
we get a larger and deeper dirty_log_pages io tree.
We got a report that, on beefy machines at least, we can get such
performance drop with fsmark for example:
https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
2) We would be doing it only to deal with an unexpected and exceptional
case, which is basically failure to read an extent buffer from disk
due to IO failures. On a healthy system we don't expect transaction
aborts to happen after all;
3) Instead of relying on iterating the log tree or tracking the ranges
of extent buffers in the dirty_log_pages io tree, using the radix
tree that tracks extent buffers (fs_info->buffer_radix) to find all
log tree extent buffers is not reliable either, because after writeback
of an extent buffer it can be evicted from memory by the release page
callback of the btree inode (btree_releasepage()).
Since there's no way to be able to properly cleanup a log tree without
being able to read its extent buffers from disk and without using more
memory to track the logical ranges of the allocated extent buffers do
the following:
1) When we fail to cleanup a log tree, setup a flag that indicates that
failure;
2) Trigger writeback of all log tree extent buffers that are still dirty,
and wait for the writeback to complete. This is just to cleanup their
state, page states, page leaks, etc;
3) When unmounting the fs, ignore if the number of bytes reserved in a
block group and in a space_info is not 0 if, and only if, we failed to
cleanup a log tree. Also ignore only for metadata block groups and the
metadata space_info object.
This is far from a perfect solution, but it serves to silence test
failures such as those from generic/475 and generic/648. However having
a non-zero value for the reserved bytes counters on unmount after a
transaction abort, is not such a terrible thing and it's completely
harmless, it does not affect the filesystem integrity in any way.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only throttle the btrfs_truncate_inode_items if the root is
SHAREABLE, which isn't set on the log root, which means this loop is
unnecessary.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we're going to want to use btrfs_truncate_inode_items
without looking up the associated inode. In order to accommodate this
add the inode to btrfs_truncate_control and handle the case where
control->inode is NULL appropriately. This is fairly straightforward,
we simply need to add a helper for the trace points, as the file extent
map update is controlled by a flag on btrfs_truncate_control.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we are going to want to truncate inode items without
needing to have an btrfs_inode to pass in, so add ino to the
btrfs_truncate_control and use that to look up the inode items to
truncate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've had weird bugs in the past where we forgot to adjust the truncate
path to deal with the fact that we can be called by the tree log path.
Instead of checking if our root is a LOG_ROOT use a flag on the
btrfs_truncate_control to indicate that we don't want to do extent
reference updates during this truncate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I'm going to be adding more arguments and counters to
btrfs_truncate_inode_items, so add a control struct to handle all of the
extra arguments to make it easier to follow.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few helpers in inode-item.c, and I'm going to make a few
changes to how we do truncate in the future, so break out these
definitions into their own header file to trim down ctree.h some and
make it easier to do the work on truncate in the future.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiple csum roots in the future, so convert all
users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
->_csum_root so we can easily find remaining users in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all call sites are using the slot number to modify item values,
rename the SETGET helpers to raw_item_*(), and then rework the _nr()
helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
rename all of the callers to the new helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we log only dir index keys when logging a directory, we no longer
need to deal with dir item keys in the log replay code for replaying
directory deletes. This is also true for the case when we replay a log
tree created by a kernel that still logs dir items.
So remove the remaining code of the replay of directory deletes algorithm
that deals with dir item keys.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, when logging a directory, we copy both dir items and dir index
items from the fs/subvolume tree to the log tree. Both items have exactly
the same data (same struct btrfs_dir_item), the difference lies in the key
values, where a dir index key contains the index number of a directory
entry while the dir item key does not, as it's used for doing fast lookups
of an entry by name, while the former is used for sorting entries when
listing a directory.
We can exploit that and log only the dir index items, since they contain
all the information needed to correctly add, replace and delete directory
entries when replaying a log tree. Logging only the dir index items is
also backward and forward compatible: an unpatched kernel (without this
change) can correctly replay a log tree generated by a patched kernel
(with this patch), and a patched kernel can correctly replay a log tree
generated by an unpatched kernel.
The backward compatibility is ensured because:
1) For inserting a new dentry: a dentry is only inserted when we find a
new dir index key - we can only insert if we know the dir index offset,
which is encoded in the dir index key's offset;
2) For deleting dentries: during log replay, before adding or replacing
dentries, we first replay dentry deletions. Whenever we find a dir item
key or a dir index key in the subvolume/fs tree that is not logged in
a range for which the log tree is authoritative, we do the unlink of
the dentry, which removes both the existing dir item key and the dir
index key. Therefore logging just dir index keys is enough to ensure
dentry deletions are correctly replayed;
3) For dentry replacements: they work when we log only dir index keys
and this is mostly due to a combination of 1) and 2). If we replace a
dentry with name "foobar" to point from inode A to inode B, then we
know the dir index key for the new dentry is different from the old
one, as it has an index number (key offset) larger than the old one.
This results in replaying a deletion, through replay_dir_deletes(),
that causes the old dentry to be removed, both the dir item key and
the dir index key, as mentioned at 2). Then when processing the new
dir index key, we add the new dentry, adding both a new dir item key
and a new index key pointing to inode B, as stated in 1).
The forward compatibility, the ability for a patched kernel to replay a
log created by an older, unpatched kernel, comes from the changes required
for making sure we are able to replay a log that only contains dir index
keys - we simply ignore every dir item key we find.
So modify directory logging to log only dir index items, and modify the
log replay process to ignore dir item keys, from log trees created by an
unpatched kernel, and process only with dir index keys. This reduces the
amount of logged metadata by about half, and therefore the time spent
logging or fsyncing large directories (less CPU time and less IO).
The following test script was used to measure this change:
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_NEW_FILES=1000000
NUM_FILE_DELETES=10000
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
# sync to force transaction commit and wipeout the log.
sync
del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
echo
umount $MNT
The tests were run on a physical machine, with a non-debug kernel (Debian's
default kernel config), for different values of $NUM_NEW_FILES and
$NUM_FILE_DELETES, and the results were the following:
** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
dir fsync took 8412 ms after adding 1000000 files
dir fsync took 500 ms after deleting 10000 files
** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
dir fsync took 4252 ms after adding 1000000 files (-49.5%)
dir fsync took 269 ms after deleting 10000 files (-46.2%)
** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 745 ms after adding 100000 files
dir fsync took 59 ms after deleting 1000 files
** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 404 ms after adding 100000 files (-45.8%)
dir fsync took 31 ms after deleting 1000 files (-47.5%)
** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 67 ms after adding 10000 files
dir fsync took 9 ms after deleting 1000 files
** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 36 ms after adding 10000 files (-46.3%)
dir fsync took 5 ms after deleting 1000 files (-44.4%)
** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
dir fsync took 9 ms after adding 1000 files
dir fsync took 4 ms after deleting 100 files
** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
dir fsync took 7 ms after adding 1000 files (-22.2%)
dir fsync took 3 ms after deleting 100 files (-25.0%)
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory, once we finish processing a leaf that is full
of dir items, if we find the next leaf was not modified in the current
transaction, we grab the first key of that next leaf and log it as to
mark the end of a key range boundary.
However we did not update the value of ctx->last_dir_item_offset, which
tracks the offset of the last logged key. This can result in subsequent
logging of the same directory in the current transaction to not realize
that key was already logged, and then add it to the middle of a batch
that starts with a lower key, resulting later in a leaf with one key
that is duplicated and at non-consecutive slots. When that happens we get
an error later when writing out the leaf, reporting that there is a pair
of keys in wrong order. The report is something like the following:
Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
root=18446744073709551610 block=118444032 slot=21, bad key order, prev
(704687 84 4146773349) current (704687 84 1063561078)
Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
91449 total ptrs 39 free space 546 owner 18446744073709551610
Dec 13 21:44:50 kernel: item 0 key (704687 1 0) itemoff 3835
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35532 size
1026 mode 40755
Dec 13 21:44:50 kernel: item 1 key (704687 12 704685) itemoff
3822 itemsize 13
Dec 13 21:44:50 kernel: item 2 key (704687 24 3817753667)
itemoff 3736 itemsize 86
Dec 13 21:44:50 kernel: item 3 key (704687 60 0) itemoff 3728 itemsize 8
Dec 13 21:44:50 kernel: item 4 key (704687 72 0) itemoff 3720 itemsize 8
Dec 13 21:44:50 kernel: item 5 key (704687 84 140445108)
itemoff 3666 itemsize 54
Dec 13 21:44:50 kernel: dir oid 704793 type 1
Dec 13 21:44:50 kernel: item 6 key (704687 84 298800632)
itemoff 3599 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707849 type 2
Dec 13 21:44:50 kernel: item 7 key (704687 84 476147658)
itemoff 3532 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707901 type 2
Dec 13 21:44:50 kernel: item 8 key (704687 84 633818382)
itemoff 3471 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704694 type 2
Dec 13 21:44:50 kernel: item 9 key (704687 84 654256665)
itemoff 3403 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707841 type 1
Dec 13 21:44:50 kernel: item 10 key (704687 84 995843418)
itemoff 3331 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167736 type 1
Dec 13 21:44:50 kernel: item 11 key (704687 84 1063561078)
itemoff 3278 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 12 key (704687 84 1101156010)
itemoff 3225 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704696 type 1
Dec 13 21:44:50 kernel: item 13 key (704687 84 2521936574)
itemoff 3173 itemsize 52
Dec 13 21:44:50 kernel: dir oid 704704 type 2
Dec 13 21:44:50 kernel: item 14 key (704687 84 2618368432)
itemoff 3112 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704738 type 1
Dec 13 21:44:50 kernel: item 15 key (704687 84 2676316190)
itemoff 3046 itemsize 66
Dec 13 21:44:50 kernel: dir oid 2167729 type 1
Dec 13 21:44:50 kernel: item 16 key (704687 84 3319104192)
itemoff 2986 itemsize 60
Dec 13 21:44:50 kernel: dir oid 704745 type 2
Dec 13 21:44:50 kernel: item 17 key (704687 84 3908046265)
itemoff 2929 itemsize 57
Dec 13 21:44:50 kernel: dir oid 2167734 type 1
Dec 13 21:44:50 kernel: item 18 key (704687 84 3945713089)
itemoff 2857 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167730 type 1
Dec 13 21:44:50 kernel: item 19 key (704687 84 4077169308)
itemoff 2795 itemsize 62
Dec 13 21:44:50 kernel: dir oid 704688 type 1
Dec 13 21:44:50 kernel: item 20 key (704687 84 4146773349)
itemoff 2727 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707892 type 1
Dec 13 21:44:50 kernel: item 21 key (704687 84 1063561078)
itemoff 2674 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 22 key (704687 96 2) itemoff 2612
itemsize 62
Dec 13 21:44:50 kernel: item 23 key (704687 96 6) itemoff 2551
itemsize 61
Dec 13 21:44:50 kernel: item 24 key (704687 96 7) itemoff 2498
itemsize 53
Dec 13 21:44:50 kernel: item 25 key (704687 96 12) itemoff
2446 itemsize 52
Dec 13 21:44:50 kernel: item 26 key (704687 96 14) itemoff
2385 itemsize 61
Dec 13 21:44:50 kernel: item 27 key (704687 96 18) itemoff
2325 itemsize 60
Dec 13 21:44:50 kernel: item 28 key (704687 96 24) itemoff
2271 itemsize 54
Dec 13 21:44:50 kernel: item 29 key (704687 96 28) itemoff
2218 itemsize 53
Dec 13 21:44:50 kernel: item 30 key (704687 96 62) itemoff
2150 itemsize 68
Dec 13 21:44:50 kernel: item 31 key (704687 96 66) itemoff
2083 itemsize 67
Dec 13 21:44:50 kernel: item 32 key (704687 96 75) itemoff
2015 itemsize 68
Dec 13 21:44:50 kernel: item 33 key (704687 96 79) itemoff
1948 itemsize 67
Dec 13 21:44:50 kernel: item 34 key (704687 96 82) itemoff
1882 itemsize 66
Dec 13 21:44:50 kernel: item 35 key (704687 96 83) itemoff
1810 itemsize 72
Dec 13 21:44:50 kernel: item 36 key (704687 96 85) itemoff
1753 itemsize 57
Dec 13 21:44:50 kernel: item 37 key (704687 96 87) itemoff
1681 itemsize 72
Dec 13 21:44:50 kernel: item 38 key (704694 1 0) itemoff 1521
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35534 size 30
mode 40755
Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
write time tree block corruption detected
So fix that by adding the missing update of ctx->last_dir_item_offset with
the offset of the boundary key.
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
Fixes: dc2872247e ("btrfs: keep track of the last logged keys when logging a directory")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
but when the function returns in line 1184 (#4) victim_name allocated
by line 1169 (#3) is not freed, which will lead to a memory leak.
There is a similar snippet of code in this function as allocating a memory
chunk for victim_name in line 1104 (#1) as well as releasing the memory
in line 1116 (#2).
We should kfree() victim_name when the return value of backref_in_log()
is less than zero and before the function returns in line 1184 (#4).
1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
1058 struct btrfs_root *root,
1059 struct btrfs_path *path,
1060 struct btrfs_root *log_root,
1061 struct btrfs_inode *dir,
1062 struct btrfs_inode *inode,
1063 u64 inode_objectid, u64 parent_objectid,
1064 u64 ref_index, char *name, int namelen,
1065 int *search_done)
1066 {
1104 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #1: kmalloc (victim_name-1)
1105 if (!victim_name)
1106 return -ENOMEM;
1112 ret = backref_in_log(log_root, &search_key,
1113 parent_objectid, victim_name,
1114 victim_name_len);
1115 if (ret < 0) {
1116 kfree(victim_name); // #2: kfree (victim_name-1)
1117 return ret;
1118 } else if (!ret) {
1169 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #3: kmalloc (victim_name-2)
1170 if (!victim_name)
1171 return -ENOMEM;
1180 ret = backref_in_log(log_root, &search_key,
1181 parent_objectid, victim_name,
1182 victim_name_len);
1183 if (ret < 0) {
1184 return ret; // #4: missing kfree (victim_name-2)
1185 } else if (!ret) {
1241 return 0;
1242 }
Fixes: d3316c8233 ("btrfs: Properly handle backref_in_log retval")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a report of a transaction abort of -EAGAIN with the following
script.
#!/bin/sh
for d in sda sdb; do
mkfs.btrfs -d single -m single -f /dev/\${d}
done
mount /dev/sda /mnt/test
mount /dev/sdb /mnt/scratch
for dir in test scratch; do
echo 3 >/proc/sys/vm/drop_caches
fio --directory=/mnt/\${dir} --name=fio.\${dir} --rw=read --size=50G --bs=64m \
--numjobs=$(nproc) --time_based --ramp_time=5 --runtime=480 \
--group_reporting |& tee /dev/shm/fio.\${dir}
echo 3 >/proc/sys/vm/drop_caches
done
for d in sda sdb; do
umount /dev/\${d}
done
The stack trace is shown in below.
[3310.967991] BTRFS: error (device sda) in btrfs_commit_transaction:2341: errno=-11 unknown (Error while writing out transaction)
[3310.968060] BTRFS info (device sda): forced readonly
[3310.968064] BTRFS warning (device sda): Skipping commit of aborted transaction.
[3310.968065] ------------[ cut here ]------------
[3310.968066] BTRFS: Transaction aborted (error -11)
[3310.968074] WARNING: CPU: 14 PID: 1684 at fs/btrfs/transaction.c:1946 btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968131] CPU: 14 PID: 1684 Comm: fio Not tainted 5.14.10-300.fc35.x86_64 #1
[3310.968135] Hardware name: DIAWAY Tartu/Tartu, BIOS V2.01.B10 04/08/2021
[3310.968137] RIP: 0010:btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968144] RSP: 0018:ffffb284ce393e10 EFLAGS: 00010282
[3310.968147] RAX: 0000000000000026 RBX: ffff973f147b0f60 RCX: 0000000000000027
[3310.968149] RDX: ffff974ecf098a08 RSI: 0000000000000001 RDI: ffff974ecf098a00
[3310.968150] RBP: ffff973f147b0f08 R08: 0000000000000000 R09: ffffb284ce393c48
[3310.968151] R10: ffffb284ce393c40 R11: ffffffff84f47468 R12: ffff973f101bfc00
[3310.968153] R13: ffff971f20cf2000 R14: 00000000fffffff5 R15: ffff973f147b0e58
[3310.968154] FS: 00007efe65468740(0000) GS:ffff974ecf080000(0000) knlGS:0000000000000000
[3310.968157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3310.968158] CR2: 000055691bcbe260 CR3: 000000105cfa4001 CR4: 0000000000770ee0
[3310.968160] PKRU: 55555554
[3310.968161] Call Trace:
[3310.968167] ? dput+0xd4/0x300
[3310.968174] btrfs_sync_file+0x3f1/0x490
[3310.968180] __x64_sys_fsync+0x33/0x60
[3310.968185] do_syscall_64+0x3b/0x90
[3310.968190] entry_SYSCALL_64_after_hwframe+0x44/0xae
[3310.968194] RIP: 0033:0x7efe6557329b
[3310.968200] RSP: 002b:00007ffe0236ebc0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[3310.968203] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efe6557329b
[3310.968204] RDX: 0000000000000000 RSI: 00007efe58d77010 RDI: 0000000000000006
[3310.968205] RBP: 0000000004000000 R08: 0000000000000000 R09: 00007efe58d77010
[3310.968207] R10: 0000000016cacc0c R11: 0000000000000293 R12: 00007efe5ce95980
[3310.968208] R13: 0000000000000000 R14: 00007efe6447c790 R15: 0000000c80000000
[3310.968212] ---[ end trace 1a346f4d3c0d96ba ]---
[3310.968214] BTRFS: error (device sda) in cleanup_transaction:1946: errno=-11 unknown
The abort occurs because of a write hole while writing out freeing tree
nodes of a tree-log tree. For zoned btrfs, we re-dirty a freed tree
node to ensure btrfs can write the region and does not leave a hole on
write on a zoned device. The current code fails to re-dirty a node
when the tree-log tree's depth is greater or equal to 2. That leads to
a transaction abort with -EAGAIN.
Fix the issue by properly re-dirtying a node on walking up the tree.
Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.12+
Link: https://github.com/kdave/btrfs-progs/issues/415
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument passed to check_item_in_log() always matches the root
of the given directory, so it can be eliminated.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument for tree-log.c:add_link() always matches the root of the
given directory and the given inode, so it can eliminated.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument passed to btrfs_unlink_inode() and its callee,
__btrfs_unlink_inode(), always matches the root of the given directory and
the given inode. So remove the argument and make __btrfs_unlink_inode()
use the root of the directory.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument for drop_one_dir_item() always matches the root of the
given directory inode, since each log tree is associated to one and only
one subvolume/root, so remove the argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At replay_dir_deletes(), if find_dir_range() returns an error we break out
of the main while loop and then assign a value of 0 (success) to the 'ret'
variable, resulting in completely ignoring that an error happened. Fix
that by jumping to the 'out' label when find_dir_range() returns an error
(negative value).
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to make 'real_root' used only in ref-verify it's required to
have the necessary context to perform the same checks that this member
is used for. So add 'mod_root' which will contain the root on behalf of
which a delayed ref was created and a 'skip_group' parameter which
will contain callsite-specific override of skip_qgroup.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few flags that are inconsistently used to describe the fs in
different states of failure. As of 5963ffcaf3 ("btrfs: always abort
the transaction if we abort a trans handle") we will always set
BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
and ERROR to see if things have gone wrong. Add a helper to check
BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
use the helper.
The TRANS_ABORTED bit check was added in af72273381 ("Btrfs: clean up
resources during umount after trans is aborted") but is not actually
specific.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we will abort the transaction if we get a random error (like
-EIO) while trying to remove the directory entries from the root log
during rename.
However since these are simply log tree related errors, we can mark the
trans as needing a full commit. Then if the error was truly
catastrophic we'll hit it during the normal commit and abort as
appropriate.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During inspection of the return path for replay I noticed that we don't
actually abort the transaction if we get a failure during replay. This
isn't a problem necessarily, as we properly return the error and will
fail to mount. However we still leave this dangling transaction that
could conceivably be committed without thinking there was an error.
We were using btrfs_handle_fs_error() here, but that pre-dates the
transaction abort code. Simply replace the btrfs_handle_fs_error()
calls with transaction aborts, so we still know where exactly things
went wrong, and add a few in some other un-handled error cases.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory and inserting a batch of directory items, we are
copying the data of each item from a leaf in the fs/subvolume tree to a
leaf in a log tree, separately. This is not really needed, since we are
copying from a contiguous memory area into another one, so we can use a
single copy operation to copy all items at once.
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 3/3.
The following test was used to compare performance of a branch without the
patchset versus one branch that has the whole patchset applied:
$ cat dir-fsync-test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_NEW_FILES=1000000
NUM_FILE_DELETES=1000
LEAF_SIZE=16K
mkfs.btrfs -f -n $LEAF_SIZE $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
# Fsync the directory, this will log the new dir items and the inodes
# they point to, because these are new inodes.
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
# sync to force transaction commit and wipeout the log.
sync
del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
# Fsync the directory, this will only log dir items, there are no
# dentries pointing to new inodes.
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
umount $MNT
The tests were run on a non-debug kernel (Debian's default kernel config)
and were the following:
*** with a leaf size of 16K, before patchset ***
dir fsync took 8482 ms after adding 1000000 files
dir fsync took 166 ms after deleting 1000 files
*** with a leaf size of 16K, after patchset ***
dir fsync took 8196 ms after adding 1000000 files (-3.4%)
dir fsync took 143 ms after deleting 1000 files (-14.9%)
*** with a leaf size of 64K, before patchset ***
dir fsync took 12851 ms after adding 1000000 files
dir fsync took 466 ms after deleting 1000 files
*** with a leaf size of 64K, after patchset ***
dir fsync took 12287 ms after adding 1000000 files (-4.5%)
dir fsync took 414 ms after deleting 1000 files (-11.8%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:
1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;
2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;
3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.
That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.
It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.
So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the first time we log a directory in the current transaction, for
each directory item in a changed leaf of the subvolume tree, we have to
check if we previously logged the item, in order to overwrite it in case
its data changed or skip it in case its data hasn't changed.
Checking if we have logged each item before not only wastes times, but it
also adds lock contention on the log tree. So in order to minimize the
number of times we do such checks, keep track of the offset of the last
key we logged for a directory and, on the next time we log the directory,
skip the checks for any new keys that have an offset greater than the
offset we have previously saved. This is specially effective for index
keys, because the offset for these keys comes from a monotonically
increasing counter.
This patch is part of a patchset comprised of the following 5 patches:
btrfs: remove root argument from btrfs_log_inode() and its callees
btrfs: remove redundant log root assignment from log_dir_items()
btrfs: factor out the copying loop of dir items from log_dir_items()
btrfs: insert items in batches when logging a directory when possible
btrfs: keep track of the last logged keys when logging a directory
This is patch 5/5.
The following test was used on a non-debug kernel to measure the impact
it has on a directory fsync:
$ cat test-dir-fsync.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_NEW_FILES=100000
NUM_FILE_DELETES=1000
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
# fsync the directory, this will log the new dir items and the inodes
# they point to, because these are new inodes.
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
# sync to force transaction commit and wipeout the log.
sync
del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
# fsync the directory, this will only log dir items, there are no
# dentries pointing to new inodes.
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
umount $MNT
Test results with NUM_NEW_FILES set to 100 000 and 1 000 000:
**** before patchset, 100 000 files, 1000 deletes ****
dir fsync took 848 ms after adding 100000 files
dir fsync took 175 ms after deleting 1000 files
**** after patchset, 100 000 files, 1000 deletes ****
dir fsync took 758 ms after adding 100000 files (-11.2%)
dir fsync took 63 ms after deleting 1000 files (-94.1%)
**** before patchset, 1 000 000 files, 1000 deletes ****
dir fsync took 9945 ms after adding 1000000 files
dir fsync took 473 ms after deleting 1000 files
**** after patchset, 1 000 000 files, 1000 deletes ****
dir fsync took 8677 ms after adding 1000000 files (-13.6%)
dir fsync took 146 ms after deleting 1000 files (-105.6%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory, we scan its directory items from the subvolume
tree and then copy one by one into the log tree. This is not efficient
since we generally are able to insert several items in a batch, using a
single btree operation for adding several items at once. The reason we
copy items one by one is that we must check if each item was previously
logged in the current transaction, and if it was we either overwrite it
or skip it in case its content did not change in the subvolume tree (this
can happen only for dir item keys, but not for dir index keys), and doing
such check makes it a bit cumbersome to attempt batch insertions.
However the chances for doing batch insertions are very frequent and
always happen when:
1) Logging the directory for the first time in the current transaction,
as none of the items exist in the log tree yet;
2) Logging new dir index keys, because the offset for new dir index keys
comes from a monotonically increasing counter. This means if we keep
adding dentries to a directory, through creation of new files and
sub-directories or by adding new links or renaming from some other
directory into the one we are logging, all the new dir index keys
have a new offset that is greater than the offset of any previously
logged index keys, so we can insert them in batches into the log tree.
For dir item keys, since their offset depends on the result of an hash
function against the dentry's name, unless the directory is being logged
for the first time in the current transaction, the chances being able to
insert the items in the log using batches is pretty much random and not
predictable, as it depends on the names of the dentries, but still happens
often enough.
So change directory logging to keep track of consecutive directory items
that don't exist yet in the log and batch insert them.
This patch is part of a patchset comprised of the following 5 patches:
btrfs: remove root argument from btrfs_log_inode() and its callees
btrfs: remove redundant log root assignment from log_dir_items()
btrfs: factor out the copying loop of dir items from log_dir_items()
btrfs: insert items in batches when logging a directory when possible
btrfs: keep track of the last logged keys when logging a directory
This is patch 4/5. The change log of the last patch (5/5) has performance
results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for the next change, move the loop that processes a leaf
and copies its directory items to the log, into a separate helper
function. This makes the next change simpler and it also helps making
log_dir_items() a bit shorter (specially after the next change).
This patch is part of a patchset comprised of the following 5 patches:
btrfs: remove root argument from btrfs_log_inode() and its callees
btrfs: remove redundant log root assignment from log_dir_items()
btrfs: factor out the copying loop of dir items from log_dir_items()
btrfs: insert items in batches when logging a directory when possible
btrfs: keep track of the last logged keys when logging a directory
This is patch 3/5. The change log of the last patch (5/5) has performance
results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At log_dir_items() we are assigning the exact same value to the local
variable 'log', once when it's declared and once again shortly after.
Remove the later assignment as it's pointless.
This patch is part of a patchset comprised of the following 5 patches:
btrfs: remove root argument from btrfs_log_inode() and its callees
btrfs: remove redundant log root assignment from log_dir_items()
btrfs: factor out the copying loop of dir items from log_dir_items()
btrfs: insert items in batches when logging a directory when possible
btrfs: keep track of the last logged keys when logging a directory
This is patch 2/5. The change log of the last patch (5/5) has performance
results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument passed to btrfs_log_inode() is unncessary, as it is
always the root of the inode we are going to log. This root also gets
unnecessarily propagated to several functions called by btrfs_log_inode(),
and all of them take the inode as an argument as well. So just remove
the root argument from these functions and have them get the root from
the inode where needed.
This patch is part of a patchset comprised of the following 5 patches:
btrfs: remove root argument from btrfs_log_inode() and its callees
btrfs: remove redundant log root assignment from log_dir_items()
btrfs: factor out the copying loop of dir items from log_dir_items()
btrfs: insert items in batches when logging a directory when possible
btrfs: keep track of the last logged keys when logging a directory
This is patch 1/5. The change log of the last patch (5/5) has performance
results.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a regular file in full sync mode, we are currently committing
its delayed inode item. This is to ensure that we never miss copying the
inode item, with its most up to date data, into the log tree.
However that is not necessary since commit e4545de5b0 ("Btrfs: fix fsync
data loss after append write"), because even if we don't find the leaf
with the inode item when looking for leaves that changed in the current
transaction, we end up logging the inode item later using the in-memory
content. In case we find the leaf containing the inode item, we already
end up using the in-memory inode for filling the inode item in the log
tree, and not the inode item that is in the fs/subvolume tree, as it
might be not up to date (copy_items() -> fill_inode_item()).
So don't commit the delayed inode item, which brings a couple of benefits:
1) Avoid writing the inode item to the fs/subvolume btree, saving time and
reducing lock contention on the btree;
2) In case no other item for the inode was changed, added or deleted in
the same leaf where the inode item is located, we ended up copying
all the items in that leaf to the log tree - it's harmless from a
functional point of view, but it wastes time and log tree space.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 10/10 and the following test results compare a branch with
the whole patch set applied versus a branch without any of the patches
applied.
The following script was used to test dbench with 8 and 16 jobs on a
machine with 12 cores, 64G of RAM, a NVME device and using a non-debug
kernel config (Debian's default):
$ cat test.sh
#!/bin/bash
if [ $# -ne 1 ]; then
echo "Use $0 NUM_JOBS"
exit 1
fi
NUM_JOBS=$1
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 120 $NUM_JOBS
umount $MNT
The results were the following:
8 jobs, before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4113896 0.009 238.665
Close 3021699 0.001 0.590
Rename 174215 0.082 238.733
Unlink 830977 0.049 238.642
Deltree 96 2.232 8.022
Mkdir 48 0.003 0.005
Qpathinfo 3729013 0.005 2.672
Qfileinfo 653206 0.001 0.152
Qfsinfo 683866 0.002 0.526
Sfileinfo 335055 0.004 1.571
Find 1441800 0.016 4.288
WriteX 2049644 0.010 3.982
ReadX 6449786 0.003 0.969
LockX 13400 0.002 0.043
UnlockX 13400 0.001 0.075
Flush 288349 2.521 245.516
Throughput 1075.73 MB/sec 8 clients 8 procs max_latency=245.520 ms
8 jobs, after patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4154282 0.009 156.675
Close 3051450 0.001 0.843
Rename 175912 0.072 4.444
Unlink 839067 0.048 66.050
Deltree 96 2.131 5.979
Mkdir 48 0.002 0.004
Qpathinfo 3765575 0.005 3.079
Qfileinfo 659582 0.001 0.099
Qfsinfo 690474 0.002 0.155
Sfileinfo 338366 0.004 1.419
Find 1455816 0.016 3.423
WriteX 2069538 0.010 4.328
ReadX 6512429 0.003 0.840
LockX 13530 0.002 0.078
UnlockX 13530 0.001 0.051
Flush 291158 2.500 163.468
Throughput 1105.45 MB/sec 8 clients 8 procs max_latency=163.474 ms
+2.7% throughput, -40.1% max latency
16 jobs, before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5457602 0.033 337.098
Close 4008979 0.002 2.018
Rename 231051 0.323 254.054
Unlink 1102209 0.202 337.243
Deltree 160 6.521 31.720
Mkdir 80 0.003 0.007
Qpathinfo 4946147 0.014 6.988
Qfileinfo 867440 0.001 1.642
Qfsinfo 907081 0.003 1.821
Sfileinfo 444433 0.005 2.053
Find 1912506 0.067 7.854
WriteX 2724852 0.018 7.428
ReadX 8553883 0.003 2.059
LockX 17770 0.003 0.350
UnlockX 17770 0.002 0.627
Flush 382533 2.810 353.691
Throughput 1413.09 MB/sec 16 clients 16 procs max_latency=353.696 ms
16 jobs, after patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5393156 0.034 303.181
Close 3961986 0.002 1.502
Rename 228359 0.320 253.379
Unlink 1088920 0.206 303.409
Deltree 160 6.419 30.088
Mkdir 80 0.003 0.004
Qpathinfo 4887967 0.015 7.722
Qfileinfo 857408 0.001 1.651
Qfsinfo 896343 0.002 2.147
Sfileinfo 439317 0.005 4.298
Find 1890018 0.073 8.347
WriteX 2693356 0.018 6.373
ReadX 8453485 0.003 3.836
LockX 17562 0.003 0.486
UnlockX 17562 0.002 0.635
Flush 378023 2.802 315.904
Throughput 1454.46 MB/sec 16 clients 16 procs max_latency=315.910 ms
+2.9% throughput, -11.3% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an extent, in the fast fsync path, we always attempt do drop
or trim any existing extents with a range that match or overlap the range
of the extent we are about to log. We do that through a call to
btrfs_drop_extents().
However this is not needed when we are logging the inode for the first
time in the current transaction, since we have no inode items of the
inode in the log tree. Calling btrfs_drop_extents() does a deletion search
on the log tree, which is expensive when we have concurrent tasks
accessing the log tree because a deletion search always acquires a write
lock on the extent buffers at levels 2, 1 and 0, adding significant lock
contention, specially taking into account the height of a log tree rarely
(if ever) goes beyond 2 or 3, due to its short life.
So skip the call to btrfs_drop_extents() when the inode was not previously
logged in the current transaction.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 9/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are logging that an inode exists and the inode was not logged
before, we can avoid searching in the log tree for the inode item since we
know it does not exists. That wastes time and adds more lock contention on
the extent buffers of the log tree when there are other tasks that are
logging other inodes.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 8/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we are logging a file inode in full sync mode we call
btrfs_truncate_inode_items() to delete items of the inode we may have
previously logged.
That results in doing a btree search for deletion, which is expensive
because it always acquires write locks for extent buffers at levels 2, 1
and 0, and it balances any node that is less than half full. Acquiring
the write locks can block the task if the extent buffers are already
locked by another task or block other tasks attempting to lock them,
which is specially bad in case of log trees since they are small due to
their short life, with a root node at a level typically not greater than
level 2.
If we know that we are logging the inode for the first time in the current
transaction, we can skip the call to btrfs_truncate_inode_items(), avoiding
the deletion search. This change does that.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 7/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the call to btrfs_truncate_inode_items(), and the surrounding retry
loop, into a local helper function. This avoids some repetition and avoids
making the next change a bit awkward due to a bit of too much indentation.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 6/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we are logging a directory inode, logging that an inode exists or
logging an inode that has changes in its references or xattrs, we attempt
to delete items of this inode we may have previously logged (through calls
to drop_objectid_items()).
That attempt does a btree search for deletion, which is expensive because
it always acquires write locks for extent buffers at levels 2, 1 and 0,
and it balances any node that is less than half full. Acquiring the write
locks can block the task if the extent buffers are already locked or block
other tasks attempting to lock them, which is specially bad in case of log
trees since they are small due to their short life, with a root node at a
level typically not greater than level 2.
If we know that we are logging the inode for the first time in the current
transaction, we can skip the search. This change does that.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 5/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are logging a new name for an inode, due to a link or rename
operation, if the inode has ancestor inodes that are new, created in the
current transaction, we need to log that these inodes exist. To ensure
that a subsequent explicit fsync on one of these ancestor inodes does
sync the log, we don't set the logged_trans field of these inodes.
This was done in commit 75b463d2b4 ("btrfs: do not commit logs and
transactions during link and rename operations"), to avoid syncing a
log after a rename or link operation.
In order to allow for future changes to do some optimizations, change
this behaviour to always update the logged_trans of any logged inode
and don't update the last_log_commit of the inode if we are logging
that it exists. This accomplishes that same objective with simpler
logic, allowing for some optimizations in the next patches.
So just do that simplification.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 4/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a new name for an inode, due to a link or rename operation,
we don't need to log all new dentries of the parent directories and their
subdirectories. We only want to log the names of the inode and that any
new parent directories exist. So in this case don't trigger logging of
the new dentries, that is only need when doing an explicit fsync on a
directory or on a file which requires logging its parent directories.
This avoids unnecessary work and reduces contention on the extent buffers
of a log tree.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 3/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 75b463d2b4 ("btrfs: do not commit logs and transactions
during link and rename operations"), we always pass a non-NULL log context
to btrfs_log_inode_parent() and therefore to all the functions that it
calls. So remove the checks we have all over the place that test for a
NULL log context, making the code shorter and easier to read, as well as
reducing the size of the generated code.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 2/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case an inode was never logged since it was loaded from disk and was
modified in the current transaction (its ->last_trans matches the ID of
the current transaction), inode_logged() returns true even if there's no
existing log tree. In this case we can simply check if a log tree exists
and return false if it does not. This avoids a caller of inode_logged()
doing some unnecessary, but harmless, work.
For btrfs_log_new_name() it avoids it logging an inode in case it was
never logged since it was loaded from disk and there is currently no log
tree for the inode's root. For the remaining callers of inode_logged(),
btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log(), it has
no effect since they already check if a log tree exists through their
calls to join_running_log_trans().
So just add a check to inode_logged() to verify if a log tree exists, and
return false if it does not.
This patch is part of a patch set comprised of the following patches:
btrfs: check if a log tree exists at inode_logged()
btrfs: remove no longer needed checks for NULL log context
btrfs: do not log new dentries when logging that a new name exists
btrfs: always update the logged transaction when logging new names
btrfs: avoid expensive search when dropping inode items from log
btrfs: add helper to truncate inode items when logging inode
btrfs: avoid expensive search when truncating inode items from the log
btrfs: avoid search for logged i_size when logging inode if possible
btrfs: avoid attempt to drop extents when logging inode for the first time
btrfs: do not commit delayed inode when logging a file in full sync mode
This is patch 1/10 and test results are listed in the change log of the
last patch in the set.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At replay_one_name(), we are treating any error from btrfs_lookup_inode()
as if the inode does not exists. Fix this by checking for an error and
returning it to the caller.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir
entries and both are used during log replay or when updating a log tree
during an unlink.
However when the dir item does not exists, btrfs_lookup_dir_item() returns
NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if
the dir item exists but there is no matching entry for a given name or
index, both return NULL. This makes the call sites during log replay to
be more verbose than necessary and it makes it easy to miss this slight
difference. Since we don't need to distinguish between those two cases,
make btrfs_lookup_dir_index_item() always return NULL when there is no
matching directory entry - either because there isn't any dir entry or
because there is one but it does not match the given name and index.
Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to
'index' since it is supposed to match an index number, and the name
'objectid' is not very good because it can easily be confused with an
inode number (like the inode number a dir entry points to).
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>