1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

12906 commits

Author SHA1 Message Date
Jan Kara
86ec15d00b
btrfs: Convert to bdev_open_by_path()
Convert btrfs to use bdev_open_by_path() and pass the handle around.  We
also drop the holder from struct btrfs_device as it is now not needed
anymore.

CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-20-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:20 +02:00
Linus Torvalds
e017769f4c for-6.6-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU2lLEACgkQxWXV+ddt
 WDvCThAApe+zMNdEhQ/cgrvfzP/X91Q53PXQsdVsrujPyUV8eEV4oJzEwVbJhRdw
 3ukIQtvyAMNiWhEBhOQRwxjuUoTCApGAeEEEl1cWWEqQ7G2/2LS4+bcWzgQ3Vu32
 dzYL37ddsfe4n7OgfnymtMrnv7kge0XbAlY3GbavaDccZDQDqcD5wSAOyOhfIsH7
 kcu4sA5Fi44wVSfAJX1Dms+wXfsmQu/sd3c9Gcyce9Hpy1cEW3vWbApLBE4K0aKX
 /JHTdmkAJ20a4APQsfGH+UymyuZgr8d2eGmL9rVYKhT/c+Dow0lNAWYkvGf/MawM
 CX3GdP6f6ZOR/anCPZ8nqZCE5AoFykGazvpCCSrvCOpU7o7GqxbAQkWWFcMp1FHW
 9TFrj81WK18DeCfCNw7lR3sdMy/2o2nnSUAw3DFY4n/3Lek7FUmrBTHvXlWDot7T
 TM9CzYGF840QhL5s5SMYS09YmeI0I34L7HJAi/+qli48SooGuL9RZ29TmzHIX69Y
 2bgpS64j06p/AGEnfHAcT1LbpiFCPmO5cpXKv/t40GL5QO5d4WV698ysDGoPYUPO
 8CPL85Y8cao56KGJLyOroGz0P1bo+RdNe5bN6xJJoTRn1Y9oUA+bQSnN8x9iuunF
 9QZrAIHzNyDcRGzoqgDW+3bivOvIus/Dto/u1P3ap68kP2HTVsY=
 =gOyi
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more fix for a problem with snapshot of a newly created subvolume
  that can lead to inconsistent data under some circumstances. Kernel
  6.5 added a performance optimization to skip transaction commit for
  subvolume creation but this could end up with newer data on disk but
  not linked to other structures.

  The fix itself is an added condition, the rest of the patch is a
  parameter added to several functions"

* tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix unwritten extent buffer after snapshotting a new subvolume
2023-10-23 07:59:13 -10:00
Filipe Manana
eb96e22193 btrfs: fix unwritten extent buffer after snapshotting a new subvolume
When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:

1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
   address 36007936, for the leaf/root of a new subvolume that has an ID
   of 291. We mark the extent buffer as dirty, and at this point the
   subvolume tree has a single node/leaf which is also its root (level 0);

2) We no longer commit the transaction used to create the subvolume at
   create_subvol(). We used to, but that was recently removed in
   commit 1b53e51a4a ("btrfs: don't commit transaction for every subvol
   create");

3) The transaction used to create the subvolume has an ID of 33, so the
   extent buffer 36007936 has a generation of 33;

4) Several updates happen to subvolume 291 during transaction 33, several
   files created and its tree height changes from 0 to 1, so we end up with
   a new root at level 1 and the extent buffer 36007936 is now a leaf of
   that new root node, which is extent buffer 36048896.

   The commit root remains as 36007936, since we are still at transaction
   33;

5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
   ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
   we end up at transaction.c:create_pending_snapshot(), in the critical
   section of a transaction commit.

   There we COW the root of subvolume 291, which is extent buffer 36048896.
   The COW operation returns extent buffer 36048896, since there's no need
   to COW because the extent buffer was created in this transaction and it
   was not written yet.

   The we call btrfs_copy_root() against the root node 36048896. During
   this operation we allocate a new extent buffer to turn into the root
   node of the snapshot, copy the contents of the root node 36048896 into
   this snapshot root extent buffer, set the owner to 292 (the ID of the
   snapshot), etc, and then we call btrfs_inc_ref(). This will create a
   delayed reference for each leaf pointed by the root node with a
   reference root of 292 - this includes a reference for the leaf
   36007936.

   After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.

   Then we call btrfs_insert_dir_item(), to create the directory entry in
   in the tree of subvolume 291 that points to the snapshot. This ends up
   needing to modify leaf 36007936 to insert the respective directory
   items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
   we need to COW the leaf. We end up at btrfs_force_cow_block() and then
   at update_ref_for_cow().

   At update_ref_for_cow() we call btrfs_block_can_be_shared() which
   returns false, despite the fact the leaf 36007936 is shared - the
   subvolume's root and the snapshot's root point to that leaf. The
   reason that it incorrectly returns false is because the commit root
   of the subvolume is extent buffer 36007936 - it was the initial root
   of the subvolume when we created it. So btrfs_block_can_be_shared()
   which has the following logic:

   int btrfs_block_can_be_shared(struct btrfs_root *root,
                                 struct extent_buffer *buf)
   {
       if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
           buf != root->node && buf != root->commit_root &&
           (btrfs_header_generation(buf) <=
            btrfs_root_last_snapshot(&root->root_item) ||
            btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
               return 1;

       return 0;
   }

   Returns false (0) since 'buf' (extent buffer 36007936) matches the
   root's commit root.

   As a result, at update_ref_for_cow(), we don't check for the number
   of references for extent buffer 36007936, we just assume it's not
   shared and therefore that it has only 1 reference, so we set the local
   variable 'refs' to 1.

   Later on, in the final if-else statement at update_ref_for_cow():

   static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
                                          struct btrfs_root *root,
                                          struct extent_buffer *buf,
                                          struct extent_buffer *cow,
                                          int *last_ref)
   {
      (...)
      if (refs > 1) {
          (...)
      } else {
          (...)
          btrfs_clear_buffer_dirty(trans, buf);
          *last_ref = 1;
      }
   }

   So we mark the extent buffer 36007936 as not dirty, and as a result
   we don't write it to disk later in the transaction commit, despite the
   fact that the snapshot's root points to it.

   Attempting to access the leaf or dumping the tree for example shows
   that the extent buffer was not written:

   $ btrfs inspect-internal dump-tree -t 292 /dev/sdb
   btrfs-progs v6.2.2
   file tree key (292 ROOT_ITEM 33)
   node 36110336 level 1 items 2 free space 119 generation 33 owner 292
   node 36110336 flags 0x1(WRITTEN) backref revision 1
   checksum stored a8103e3e
   checksum calced a8103e3e
   fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
           key (256 INODE_ITEM 0) block 36007936 gen 33
           key (257 EXTENT_DATA 0) block 36052992 gen 33
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   total bytes 107374182400
   bytes used 38572032
   uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79

   The respective on disk region is full of zeroes as the device was
   trimmed at mkfs time.

   Obviously 'btrfs check' also detects and complains about this:

   $ btrfs check /dev/sdb
   Opening filesystem to check...
   Checking filesystem on /dev/sdb
   UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   generation: 33 (33)
   [1/7] checking root items
   [2/7] checking extents
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   owner ref check failed [36007936 4096]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   The following tree block(s) is corrupted in tree 292:
        tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
   root 292 root dir 256 not found
   ERROR: errors found in fs roots
   found 38572032 bytes used, error(s) found
   total csum bytes: 16048
   total tree bytes: 1265664
   total fs tree bytes: 1118208
   total extent tree bytes: 65536
   btree space waste bytes: 562598
   file data blocks allocated: 65978368
    referenced 36569088

Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.

This can be reproduced with the following script:

   $ cat test.sh
   #!/bin/bash

   MNT=/mnt/sdi
   DEV=/dev/sdi

   # Use a filesystem with a 64K node size so that we have the same node
   # size on every machine regardless of its page size (on x86_64 default
   # node size is 16K due to the 4K page size, while on PPC it's 64K by
   # default). This way we can make sure we are able to create a btree for
   # the subvolume with a height of 2.
   mkfs.btrfs -f -n 64K $DEV
   mount $DEV $MNT

   btrfs subvolume create $MNT/subvol

   # Create a few empty files on the subvolume, this bumps its btree
   # height to 2 (root node at level 1 and 2 leaves).
   for ((i = 1; i <= 300; i++)); do
       echo -n > $MNT/subvol/file_$i
   done

   btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap

   umount $DEV

   btrfs check $DEV

Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):

   $ ./test.sh
   Create subvolume '/mnt/sdi/subvol'
   Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
   Opening filesystem to check...
   Checking filesystem on /dev/sdi
   UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
   [1/7] checking root items
   [2/7] checking extents
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   owner ref check failed [30539776 65536]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
   Wrong generation of child node/leaf, wanted: 5, have: 7
   root 257 root dir 256 not found
   ERROR: errors found in fs roots
   found 917504 bytes used, error(s) found
   total csum bytes: 0
   total tree bytes: 851968
   total fs tree bytes: 393216
   total extent tree bytes: 65536
   btree space waste bytes: 736550
   file data blocks allocated: 0
    referenced 0

A test case for fstests will follow soon.

Fixes: 1b53e51a4a ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-23 17:17:30 +02:00
Linus Torvalds
7cf4bea77a for-6.6-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmUuntEACgkQxWXV+ddt
 WDssdQ/9Fo6tN+MCH5ISAcvsW6WvBUWT62MrDnzawh96QhUf3NYf9sjME7QqwHHv
 w60SDiqRlAd5UzxdIPC4Qa/6GVZZh2yFLzew3l8Fh6anxhjO5argdsfx1Wv4ADk/
 FHI8zs6EZiTlk0JmEnHNclliZfaDutBRQiPL+HZx4+FCrJweS5U/4Jpg7vdfp/tp
 eWdJ51pDM8iyqGTsP7a7/VaL5wLoJhbdD9wYgupZUhvY6g2tCZ71/hNiWdbKtCK8
 EyQxXiAlc+k1UflOx6Xip1HLIh6HmKwxntXxRy+yj4IvJ3PhI+KS5Nqdl35TszN9
 6y9MRo3oCU+2y89Yay4HZZb6DLxcAi6VwpyswnntodFQ+ICXEw7ZaNi3rSO+FCO8
 KxfhLniMD5gflRP4gy+o9iZxgVQ75nmiPgBt53r+sAKZ7lv86x84DJ/ZUqL8EV0e
 OJhxdzhoT0Ks8OstIuE87fgzUCjqMcgAavxcn1psKBC6/JY9v6OneA8qauSswkKs
 P+diJIqZHHOBQVKFedqdIrDU6AstivSBq0ToPBslbBlcy97EO4IRoiMIw+QgHPYn
 CHsPHtooBmxPyw+4HTFuzY1NIrSeUFYxTDAs9p5kMPmltkVAlLPcrpGZVya9tjds
 l/YuwY2f0C9Q1pjcAc9FcN8Y5kLRCYNEWMl0M1VpC22KgjRN6r0=
 =GrNu
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "Fix a bug in chunk size decision that could lead to suboptimal
  placement and filling patterns"

* tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix stripe length calculation for non-zoned data chunk allocation
2023-10-19 08:56:01 -07:00
Jeff Layton
b1c38a1338
btrfs: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18 13:26:19 +02:00
Zygo Blaxell
8a540e990d btrfs: fix stripe length calculation for non-zoned data chunk allocation
Commit f6fca3917b "btrfs: store chunk size in space-info struct"
broke data chunk allocations on non-zoned multi-device filesystems when
using default chunk_size.  Commit 5da431b71d "btrfs: fix the max chunk
size and stripe length calculation" partially fixed that, and this patch
completes the fix for that case.

After commit f6fca3917b and 5da431b71d, the sequence of events for
a data chunk allocation on a non-zoned filesystem is:

        1.  btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
        space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
        unmodified.  Before f6fca3917b, ctl->max_stripe_len value was
        1 GiB for non-zoned data chunks and not configurable.

        2.  btrfs_create_chunk calls gather_device_info which consumes
        and produces more fields of chunk_ctl.

        3.  gather_device_info multiplies ctl->max_stripe_len by
        ctl->dev_stripes (which is 1 in all cases except dup)
        and calls find_free_dev_extent with that number as num_bytes.

        4.  find_free_dev_extent locates the first dev_extent hole on
        a device which is at least as large as num_bytes.  With default
        max_chunk_size from f6fca3917b, it finds the first hole which is
        longer than 10 GiB, or the largest hole if that hole is shorter
        than 10 GiB.  This is different from the pre-f6fca3917b4d
        behavior, where num_bytes is 1 GiB, and find_free_dev_extent
        may choose a different hole.

        5.  gather_device_info repeats step 4 with all devices to find
        the first or largest dev_extent hole that can be allocated on
        each device.

        6.  gather_device_info sorts the device list by the hole size
        on each device, using total unallocated space on each device to
        break ties, then returns to btrfs_create_chunk with the list.

        7.  btrfs_create_chunk calls decide_stripe_size_regular.

        8.  decide_stripe_size_regular finds the largest stripe_len that
        fits across the first nr_devs device dev_extent holes that were
        found by gather_device_info (and satisfies other constraints
        on stripe_len that are not relevant here).

        9.  decide_stripe_size_regular caps the length of the stripe it
        computed at 1 GiB.  This cap appeared in 5da431b71d to correct
        one of the other regressions introduced in f6fca3917b.

        10.  btrfs_create_chunk creates a new chunk with the above
        computed size and number of devices.

At step 4, gather_device_info() has found a location where stripe up to
10 GiB in length could be allocated on several devices, and selected
which devices should have a dev_extent allocated on them, but at step
9, only 1 GiB of the space that was found on each device can be used.
This mismatch causes new suboptimal chunk allocation cases that did not
occur in pre-f6fca3917b4d kernels.

Consider a filesystem using raid1 profile with 3 devices.  After some
balances, device 1 has 10x 1 GiB unallocated space, while devices 2
and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
space, but distributed across different numbers of dev_extent holes.
For visualization, let's ignore all the chunks that were allocated before
this point, and focus on the remaining holes:

        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
        Device 2:  [__________] (10 GiB contig unallocated)
        Device 3:  [__________] (10 GiB contig unallocated)

Before f6fca3917b, the allocator would fill these optimally by
allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
([13]), or 2 and 3 ([23]):

        [after 0 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [__________] (10 GiB)
        Device 3:  [__________] (10 GiB)

        [after 1 chunk allocation]
        Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
        Device 2:  [12] [_________] (9 GiB)
        Device 3:  [__________] (10 GiB)

        [after 2 chunk allocations]
        Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
        Device 2:  [12] [_________] (9 GiB)
        Device 3:  [13] [_________] (9 GiB)

        [after 3 chunk allocations]
        Device 1:  [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
        Device 2:  [12] [12] [________] (8 GiB)
        Device 3:  [13] [_________] (9 GiB)

        [...]

        [after 12 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)

        [after 13 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)

        [after 14 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)

        [after 15 chunk allocations]
        Device 1:  [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
        Device 2:  [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
        Device 3:  [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)

This allocates all of the space with no waste.  The sorting function used
by gather_device_info considers free space holes above 1 GiB in length
to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
long hole on each device, all the holes appear equal in the sort, and the
comparison falls back to sorting devices by total free space.  This keeps
usable space on each device equal so they can all be filled completely.

After f6fca3917b, the allocator prefers the devices with larger holes
over the devices with more free space, so it makes bad allocation choices:

        [after 1 chunk allocation]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [_________] (9 GiB)
        Device 3:  [23] [_________] (9 GiB)

        [after 2 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [________] (8 GiB)
        Device 3:  [23] [23] [________] (8 GiB)

        [after 3 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [23] [_______] (7 GiB)
        Device 3:  [23] [23] [23] [_______] (7 GiB)

        [...]

        [after 9 chunk allocations]
        Device 1:  [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

        [after 10 chunk allocations]
        Device 1:  [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)

        [after 11 chunk allocations]
        Device 1:  [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
        Device 2:  [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
        Device 3:  [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)

No further allocations are possible, with 8 GiB wasted (4 GiB of data
space).  The sort in gather_device_info now considers free space in
holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
At that point, with only 1 GiB unallocated on every device, the largest
hole length on each device is equal at 1 GiB, so the sort finally moves
to ordering the devices with the most free space, but by this time it
is too late to make use of the free space on device 1.

Note that it's possible to contrive a case where the pre-f6fca3917b4d
allocator fails the same way, but these cases generally have extensive
dev_extent fragmentation as a precondition (e.g. many holes of 768M
in length on one device, and few holes 1 GiB in length on the others).
With the regression in f6fca3917b, bad chunk allocation can occur even
under optimal conditions, when all dev_extent holes are exact multiples
of stripe_len in length, as in the example above.

Also note that post-f6fca3917b4d kernels do treat dev_extent holes
larger than 10 GiB as equal, so the bad behavior won't show up on a
freshly formatted filesystem; however, as the filesystem ages and fills
up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
can show up as a failure to balance after adding or removing devices,
or an unexpected shortfall in available space due to unequal allocation.

To fix the regression and make data chunk allocation work
again, set ctl->max_stripe_len back to the original SZ_1G, or
space_info->chunk_size if that's smaller (the latter can happen if the
user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).

While researching the background of the earlier commits, I found that an
identical fix was already proposed at:

  https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/

The previous review missed one detail:  ctl->max_stripe_len is used
before decide_stripe_size_regular() is called, when it is too late for
the changes in that function to have any effect.  ctl->max_stripe_len is
not used directly by decide_stripe_size_regular(), but the parameter
does heavily influence the per-device free space data presented to
the function.

Fixes: f6fca3917b ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-15 19:00:59 +02:00
David Sterba
c6e8f898f5 btrfs: open code timespec64 in struct btrfs_inode
The type of timespec64::tv_nsec is 'unsigned long', while we have only
u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode.
Add separate members for sec and nsec with the corresponding type width.
This creates a 4 byte hole in btrfs_inode which can be utilized in the
future.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:19 +02:00
Filipe Manana
cc687c2ef4 btrfs: remove redundant log root tree index assignment during log sync
During log syncing, when we start updating the log root tree we compute
an index value, stored in variable 'index2', once we lock the log root
tree's mutex. This value depends on the log root's log_transid. And
shortly after we compute again the same value for 'index2' - the value
is exactly the same since we haven't released the mutex and therefore
the log_transid of the log root is the same as before.

This second 'index2' computation became pointless after commit
a93e01682e ("btrfs: remove no longer needed use of log_writers for the
log root tree"). So remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:19 +02:00
Colin Ian King
a666ce9bab btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
The variable dirty is initialized with a value that is never read, it
is being re-assigned later on. Remove the redundant initialization.
Cleans up clang scan build warning:

  fs/btrfs/inode.c:5965:7: warning: Value stored to 'dirty' during its
  initialization is never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:19 +02:00
Anand Jain
f362374006 btrfs: sysfs: show temp_fsid feature
This adds sysfs objects to indicate temp_fsid feature support and
its status.

  /sys/fs/btrfs/features/temp_fsid
  /sys/fs/btrfs/<UUID>/temp_fsid

For example:

   Consider two cloned and mounted devices.

      $ blkid /dev/sdc[1-2]
      /dev/sdc1: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..
      /dev/sdc2: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" ..

   One gets actual fsid, and the other gets the temp_fsid when
   mounted.

      $ btrfs filesystem show -m
      Label: none  uuid: 509ad44b-ad2a-4a8a-bc8d-fe69db7220d5
	      Total devices 1 FS bytes used 54.14MiB
	      devid    1 size 300.00MiB used 144.00MiB path /dev/sdc1

      Label: none  uuid: 33bad74e-c91b-43a5-aef8-b3cab97ae63a
	      Total devices 1 FS bytes used 54.14MiB
	      devid    1 size 300.00MiB used 144.00MiB path /dev/sdc2

   Their sysfs as below.

      $ cat /sys/fs/btrfs/features/temp_fsid
      0

      $ cat /sys/fs/btrfs/509ad44b-ad2a-4a8a-bc8d-fe69db7220d5/temp_fsid
      0

      $ cat /sys/fs/btrfs/33bad74e-c91b-43a5-aef8-b3cab97ae63a/temp_fsid
      1

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:18 +02:00
Anand Jain
ac6ea6a914 btrfs: disable the device add feature for temp-fsid
The device addition operation will transform the cloned temp-fsid mounted
device into a multi-device filesystem. Therefore, it is marked as
unsupported.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:18 +02:00
Anand Jain
c47b02c1bd btrfs: disable the seed feature for temp-fsid
A seed device is an integral component of the sprout device, which
functions as a multi-device filesystem. Therefore, temp-fsid feature
is not supported.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:18 +02:00
Anand Jain
000331bb03 btrfs: update comment for temp-fsid, fsid, and metadata_uuid
Update the comment to explain the relationship between temp_fsid, fsid,
and metadata_uuid.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:18 +02:00
Filipe Manana
3cf63ddf29 btrfs: remove pointless empty log context list check when syncing log
When syncing the log, if we get an error when updating the log root, we
check first if the log root tree context is in a log context list, and if
so it deletes from the log root tree context from the list. This check
however is pointless because at this moment the context is always in a
list, he have just added it to a context list. The check became pointless
after commit a93e01682e ("btrfs: remove no longer needed use of
log_writers for the log root tree"). So remove this now pointless empty
list check.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:18 +02:00
Filipe Manana
68539bd0e7 btrfs: update comment for struct btrfs_inode::lock
Update the comment for the lock named "lock" in struct btrfs_inode because
it does not mention that the fields "delalloc_bytes", "defrag_bytes",
"csum_bytes", "outstanding_extents" and "disk_i_size" are also protected
by that lock.

Also add a comment on top of each field protected by this lock to mention
that the lock protects them.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:18 +02:00
Filipe Manana
5ca1949b79 btrfs: remove pointless barrier from btrfs_sync_file()
The memory barrier (smp_mb()) at btrfs_sync_file() is completely redundant
now that fs_info->last_trans_committed is read using READ_ONCE(), with the
helper btrfs_get_last_trans_committed(), and written using WRITE_ONCE()
with the helper btrfs_set_last_trans_committed().

This barrier was introduced in 2011, by commit a4abeea41a ("Btrfs: kill
trans_mutex"), but even back then it was not correct since the writer side
(in btrfs_commit_transaction()), did not issue a pairing memory barrier
after it updated fs_info->last_trans_committed.

So remove this barrier.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:18 +02:00
Filipe Manana
0124855ff1 btrfs: add and use helpers for reading and writing last_trans_committed
Currently the last_trans_committed field of struct btrfs_fs_info is
modified and read without any locking or other protection. For example
early in the fsync path, skip_inode_logging() is called which reads
fs_info->last_trans_committed, but at the same time we can have a
transaction commit completing and updating that field.

In the case of an fsync this is harmless and any data race should be
rare and at most cause an unnecessary logging of an inode.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_trans_committed field of struct btrfs_fs_info using
READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:17 +02:00
Filipe Manana
4a4f8fe2b0 btrfs: add and use helpers for reading and writing fs_info->generation
Currently the generation field of struct btrfs_fs_info is always modified
while holding fs_info->trans_lock locked. Most readers will access this
field without taking that lock but while holding a transaction handle,
which is safe to do due to the transaction life cycle.

However there are other readers that are neither holding the lock nor
holding a transaction handle open:

1) When reading an inode from disk, at btrfs_read_locked_inode();

2) When reading the generation to expose it to sysfs, at
   btrfs_generation_show();

3) Early in the fsync path, at skip_inode_logging();

4) When creating a hole at btrfs_cont_expand(), during write paths,
   truncate and reflinking;

5) In the fs_info ioctl (btrfs_ioctl_fs_info());

6) While mounting the filesystem, in the open_ctree() path. In these
   cases it's safe to directly read fs_info->generation as no one
   can concurrently start a transaction and update fs_info->generation.

In case of the fsync path, races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective. In the other cases it's not
so clear if functional problems may arise, though in case 1 rare things
like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
flag not being set on an inode and therefore result in incorrect logging
later on in case a fsync call is made.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the generation field of struct btrfs_fs_info using READ_ONCE() and
WRITE_ONCE(), and use these helpers where needed.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:17 +02:00
Filipe Manana
6008859b6c btrfs: add and use helpers for reading and writing log_transid
Currently the log_transid field of a root is always modified while holding
the root's log_mutex locked. Most readers of a root's log_transid are also
holding the root's log_mutex locked, however there is one exception which
is btrfs_set_inode_last_trans() where we don't take the lock to avoid
blocking several operations if log syncing is happening in parallel.

Any races here should be harmless, and in the worst case they may cause a
fsync to log an inode when it's not really needed, so nothing bad from a
functional perspective.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
and use these helpers where needed.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:17 +02:00
Filipe Manana
f985078796 btrfs: add and use helpers for reading and writing last_log_commit
Currently, the last_log_commit of a root can be accessed concurrently
without any lock protection. Readers can be calling btrfs_inode_in_log()
early in a fsync call, which reads a root's last_log_commit, while a
writer can change the last_log_commit while a log tree if being synced,
at btrfs_sync_log(). Any races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective.

To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_log_commit field of a root using READ_ONCE() and
WRITE_ONCE(), and use these helpers everywhere.

[1] https://lwn.net/Articles/793253/

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:17 +02:00
Anand Jain
a5b8a5f9f8 btrfs: support cloned-device mount capability
Guilherme's previous work [1] aimed at the mounting of cloned devices
using a superblock flag SINGLE_DEV during mkfs.
 [1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/

Building upon this work, here is in memory only approach. As it mounts
we determine if the same fsid is already mounted if then we generate a
random temp fsid which shall be used the mount, in memory only not
written to the disk. We distinguish devices by devt.

Example:
  $ fallocate -l 300m ./disk1.img
  $ mkfs.btrfs -f ./disk1.img
  $ cp ./disk1.img ./disk2.img
  $ cp ./disk1.img ./disk3.img
  $ mount -o loop ./disk1.img /btrfs
  $ mount -o ./disk2.img /btrfs1
  $ mount -o ./disk3.img /btrfs2

  $ btrfs fi show -m
  Label: none  uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02
	Total devices 1 FS bytes used 144.00KiB
	devid    1 size 300.00MiB used 88.00MiB path /dev/loop0

  Label: none  uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16
	Total devices 1 FS bytes used 144.00KiB
	devid    1 size 300.00MiB used 88.00MiB path /dev/loop1

  Label: none  uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb
	Total devices 1 FS bytes used 144.00KiB
	devid    1 size 300.00MiB used 88.00MiB path /dev/loop2

Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:17 +02:00
Anand Jain
69d427f34c btrfs: add helper function find_fsid_by_disk
In preparation for adding support to mount multiple single-disk
btrfs filesystems with the same FSID, wrap find_fsid() into
find_fsid_by_disk().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:17 +02:00
Filipe Manana
9ef17228e1 btrfs: stop reserving excessive space for block group item insertions
Space for block group item insertions, necessary after allocating a new
block group, is reserved in the delayed refs block reserve. Currently we
do this by incrementing the transaction handle's delayed_ref_updates
counter and then calling btrfs_update_delayed_refs_rsv(), which will
increase the size of the delayed refs block reserve by an amount that
corresponds to the same amount we use for delayed refs, given by
btrfs_calc_delayed_ref_bytes().

That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_insert_metadata_size(), since we only need to
insert a block group item in the extent tree (or block group tree if this
feature is enabled). By using btrfs_calc_insert_metadata_size() we will
need to reserve 2 times less space when using the free space tree, putting
less pressure on space reservation.

So use helpers to reserve and release space for block group item
insertions that use btrfs_calc_insert_metadata_size() for calculation of
the space.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
Filipe Manana
f66e0209bd btrfs: stop reserving excessive space for block group item updates
Space for block group item updates, necessary after allocating or
deallocating an extent from a block group, is reserved in the delayed
refs block reserve. Currently we do this by incrementing the transaction
handle's delayed_ref_updates counter and then calling
btrfs_update_delayed_refs_rsv(), which will increase the size of the
delayed refs block reserve by an amount that corresponds to the same
amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes().

That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_metadata_size(), since we only need to
update an existing block group item in the extent tree (or block group
tree if this feature is enabled). By using btrfs_calc_metadata_size() we
will need to reserve 4 times less space when using the free space tree
and 2 times less space when not using it, putting less pressure on space
reservation.

So use helpers to reserve and release space for block group item updates
that use btrfs_calc_metadata_size() for calculation of the space.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
David Sterba
398fb9131f btrfs: reorder btrfs_inode to fill gaps
Previous commit created a hole in struct btrfs_inode, we can move
outstanding_extents there. This reduces size by 8 bytes from 1120 to
1112 on a release config.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
David Sterba
54c6537146 btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
The structure btrfs_ordered_inode_tree is used only in one place, in
btrfs_inode. The structure itself has a 4 byte hole which is wasted
space.

Move the btrfs_ordered_inode_tree members to btrfs_inode with a common
prefix 'ordered_tree_' where the hole can be utilized and shrink inode
size.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
Josef Bacik
cb6cbab790 btrfs: adjust overcommit logic when very close to full
A user reported some unpleasant behavior with very small file systems.
The reproducer is this

  $ mkfs.btrfs -f -m single -b 8g /dev/vdb
  $ mount /dev/vdb /mnt/test
  $ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20

This will result in usage that looks like this

  Overall:
      Device size:                   8.00GiB
      Device allocated:              8.00GiB
      Device unallocated:            1.00MiB
      Device missing:                  0.00B
      Device slack:                  2.00GiB
      Used:                          5.47GiB
      Free (estimated):              2.52GiB      (min: 2.52GiB)
      Free (statfs, df):               0.00B
      Data ratio:                       1.00
      Metadata ratio:                   1.00
      Global reserve:                5.50MiB      (used: 0.00B)
      Multiple profiles:                  no

  Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
     /dev/vdb        7.99GiB

  Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
     /dev/vdb        8.00MiB

  System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
     /dev/vdb        4.00MiB

  Unallocated:
     /dev/vdb        1.00MiB

As you can see we've gotten ourselves quite full with metadata, with all
of the disk being allocated for data.

On smaller file systems there's not a lot of time before we get full, so
our overcommit behavior bites us here.  Generally speaking data
reservations result in chunk allocations as we assume reservation ==
actual use for data.  This means at any point we could end up with a
chunk allocation for data, and if we're very close to full we could do
this before we have a chance to figure out that we need another metadata
chunk.

Address this by adjusting the overcommit logic.  Simply put we need to
take away 1 chunk from the available chunk space in case of a data
reservation.  This will allow us to stop overcommitting before we
potentially lose this space to a data allocation.  With this fix in
place we properly allocate a metadata chunk before we're completely
full, allowing for enough slack space in metadata.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
Josef Bacik
6f2d3c0196 btrfs: increase ->free_chunk_space in btrfs_grow_device
My overcommit patch exposed a bug with btrfs/177 [1].  The problem here is
that when we grow the device we're not adding to ->free_chunk_space, so
subsequent allocations can cause ->free_chunk_space to wrap, which
causes problems in can_overcommit because we add this to ->total_bytes,
which causes the counter to wrap and gives us an unexpected ENOSPC.

Fix this by properly updating ->free_chunk_space with the new available
space in btrfs_grow_device.

[1] First version of the fix:
    https://lore.kernel.org/linux-btrfs/b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com/

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
Josef Bacik
e9fd2c0523 btrfs: fix ->free_chunk_space math in btrfs_shrink_device
There are two bugs in how we adjust ->free_chunk_space in
btrfs_shrink_device.  First we're removing the entire diff between
new_size and old_size from ->free_chunk_space.  This only works if we're
reducing the free area, which we could potentially not be.  So adjust
the math to only subtract the diff in the free space from
->free_chunk_space.

Additionally in the error case we're unconditionally adding the diff
back into ->free_chunk_space, which we need to only do if this device is
writeable.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:15 +02:00
Filipe Manana
efba145449 btrfs: make sure we cache next state in find_first_extent_bit()
Currently, at find_first_extent_bit(), when we are given a cached extent
state that happens to have its end offset match the desired range start,
we find the next extent state using that cached state, with next_state()
calls, and then return it.

We then try to cache that next state by calling cache_state_if_flags(),
but that will not cache the state because we haven't reset *cached_state
to NULL, so we end up with the cached_state unchanged, and if the caller
is iterating over extent states in the io tree, its next call to
find_first_extent_bit() will not use the current cached state as its end
offset does not match the minimum start range offset, therefore the cached
state is reset and we have to search the rbtree to find the next suitable
extent state record.

So fix this by resetting the cached state to NULL (and dropping our ref
on it) when we have a suitable cached state and we found a next state by
using next_state() starting from the cached state. This makes use cases
of calling find_first_extent_bit() to go over all ranges in the io tree
to do a single rbtree full search, only on the first call, and the next
calls will just do next_state() (rb_next() wrapper) calls, which is more
efficient.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:15 +02:00
Filipe Manana
0f8ac74d41 btrfs: use extent_io_tree_release() to empty dirty log pages
When freeing a log tree, during a transaction commit, we clear its dirty
log pages io tree by calling clear_extent_bits() using a range from 0 to
(u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
each node before freeing it, which will often trigger rebalance operations
on the rbtree. A better alternative it to use extent_io_tree_release(),
which will not do deletions and trigger rebalances.

So use extent_io_tree_release() instead of clear_extent_bits().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:15 +02:00
Filipe Manana
63ffc1f7c4 btrfs: make tree iteration in extent_io_tree_release() more efficient
Currently extent_io_tree_release() is a loop that keeps getting the first
node in the io tree, using rb_first() which is a loop that gets to the
leftmost node of the rbtree, and then for each node it calls rb_erase(),
which often requires rebalancing the rbtree.

We can make this more efficient by using
rbtree_postorder_for_each_entry_safe() to free each node without having
to delete it from the rbtree and without looping to get the first node.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:15 +02:00
Filipe Manana
df2a8e70c3 btrfs: collapse wait_on_state() to its caller wait_extent_bit()
The wait_on_state() function is very short and has a single caller, which
is wait_extent_bit(), so remove the function and put its code into the
caller.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:15 +02:00
Filipe Manana
28967c7622 btrfs: remove redundant memory barrier from extent_io_tree_release()
The memory barrier at extent_io_tree_release() is redundant. Holding
spin_lock here is not enough to drop the barrier completely.  We only
change the waitqueue of an extent state record while holding the tree
lock - see wait_on_state().

The update to waitqueue state will not become stale because there will
be an spin_unlock/spin_lock sequence between the change and waiting,
this implies a full memory barrier.

So remove the explicit smp_mb() barrier.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reword reasoning ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:15 +02:00
Filipe Manana
a1c20d15ee btrfs: make wait_extent_bit() static
The function wait_extent_bit() is not used outside extent-io-tree.c so
make it static. Furthermore the function doesn't have the 'btrfs_' prefix.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:15 +02:00
Filipe Manana
bea22a58c9 btrfs: update stale comment at extent_io_tree_release()
There's this comment at extent_io_tree_release() that mentions io btrees,
but this function is no longer used only for io btrees. Originally it was
added as a static function named clear_btree_io_tree() at transaction.c,
in commit 663dfbb077 ("Btrfs: deal with convert_extent_bit errors to
avoid fs corruption"), as it was used only for cleaning one of the io
trees that track dirty extent buffers, the dirty_log_pages io tree of a
a root and the dirty_pages io tree of a transaction. Later it was renamed
and exported and now it's used to cleanup other io trees such as the
allocation state io tree of a device or the csums range io tree of a log
root.

So remove that comment and replace it with one at the top of the function
that is more complete, mentioning what the function does and that it's
expected to be called only when a task is sure no one else will need to
use the tree anymore, as well as there should be no locked ranges in the
tree and therefore no waiters on its extent state records. Also add an
assertion to check that there are no locked extent state records in the
tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Filipe Manana
c91ea4bfa6 btrfs: make extent state merges more efficient during insertions
When inserting a new extent state record into an io tree that happens to
be mergeable, we currently do the following:

1) Insert the extent state record in the io tree's rbtree. This requires
   going down the tree to find where to insert it, and during the
   insertion we often need to balance the rbtree;

2) We then check if the previous node is mergeable, so we call rb_prev()
   to find it, which requires some looping to find the previous node;

3) If the previous node is mergeable, we adjust our node to include the
   range of the previous node and then delete the previous node from the
   rbtree, which again may need to balance the rbtree;

4) Then we check if the next node is mergeable with the node we inserted,
   so we call rb_next(), which requires some looping too. If the next node
   is indeed mergeable, we expand the range of our node to include the
   next node's range and then delete the next node from the rbtree, which
   again may need to balance the tree.

So these are quite of lot of iterations and looping over the rbtree, and
some of the operations may need to rebalance the rb tree. This can be made
a bit more efficient by:

1) When iterating the rbtree, once we find a node that is mergeable with
   the node we want to insert, we can just adjust that node's range with
   the range of the node to insert - this avoids continuing iterating
   over the tree and deleting a node from the rbtree;

2) If we expand the range of a mergeable node, then we find the next or
   the previous node, depending on other we merged a range to the right or
   to the left of the node we are currently at during the iteration. This
   merging is as before, we find the next or previous node with rb_next()
   or rb_prev() and if that other node is mergeable with the current one,
   we adjust the range of the current node and remove the other node from
   the rbtree;

3) Whenever we need to insert the new extent state record it's because
   we don't have any extent state record in the rbtree which can be
   merged, so we can remove the call to merge_state() after the insertion,
   saving rb_next() and rb_prev() calls, which require some looping.

So update the insertion function insert_state() to have this behaviour.

Running dbench for 120 seconds and capturing the execution times of
set_extent_bit() at pin_down_extent(), resulted in the following data
(time values are in nanoseconds):

Before this change:

  Count: 2278299
  Range:  0.000 - 4003728.000; Mean: 713.436; Median: 612.000; Stddev: 3606.952
  Percentiles:  90th: 1187.000; 95th: 1350.000; 99th: 1724.000
       0.000 -       7.534:       5 |
       7.534 -      35.418:      36 |
      35.418 -     154.403:     273 |
     154.403 -     662.138: 1244016 #####################################################
     662.138 -    2828.745: 1031335 ############################################
    2828.745 -   12074.102:    1395 |
   12074.102 -   51525.930:     806 |
   51525.930 -  219874.955:     162 |
  219874.955 -  938254.688:      22 |
  938254.688 - 4003728.000:       3 |

After this change:

  Count: 2275862
  Range:  0.000 - 1605175.000; Mean: 678.903; Median: 590.000; Stddev: 2149.785
  Percentiles:  90th: 1105.000; 95th: 1245.000; 99th: 1590.000
       0.000 -      10.219:      10 |
      10.219 -      40.957:      36 |
      40.957 -     155.907:     262 |
     155.907 -     585.789: 1127214 ####################################################
     585.789 -    2193.431: 1145134 #####################################################
    2193.431 -    8205.578:    1648 |
    8205.578 -   30689.378:    1039 |
   30689.378 -  114772.699:     362 |
  114772.699 -  429221.537:      52 |
  429221.537 - 1605175.000:      10 |

Maximum duration (range), average duration, percentiles and standard
deviation are all better.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
David Sterba
893fe24399 btrfs: change test_range_bit to scan the whole range
The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
David Sterba
99be1a66e1 btrfs: add specific helper for range bit test exists
The existing helper test_range_bit works in two ways, checks if the whole
range contains all the bits, or stop on the first occurrence.  By adding
a specific helper for the latter case, the inner loop can be simplified
and contains fewer conditionals, making it a bit faster.

There's no caller that uses the cached state pointer so this reduces the
argument count further.

Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Filipe Manana
6422b4cd95 btrfs: move btrfs_realloc_node() from ctree.c into defrag.c
btrfs_realloc_node() is only used by the defrag code. Nowadays we have a
defrag.c file, so move it, and its helper close_blocks(), into defrag.c.

During the move also do a few minor cosmetic changes:

1) Change the return value of close_blocks() from int to bool;

2) Use SZ_32K instead of 32768 at close_blocks();

3) Make some variables const in btrfs_realloc_node(), 'blocksize' and
   'end_slot';

4) Get rid of 'parent_nritems' variable, in both places where it was
   used it could be replaced by calling btrfs_header_nritems(parent);

5) Change the type of a couple variables from int to bool;

6) Rename variable 'err' to 'ret', as that's the most common name we
   use to track the return value of a function;

7) Move some variables from the top scope to the scope of the for loop
   where they are used.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Filipe Manana
79d25df0d7 btrfs: export comp_keys() from ctree.c as btrfs_comp_keys()
Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a
later patch we can move out defrag specific code from ctree.c into
defrag.c.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Filipe Manana
95f93bc4cb btrfs: rename and export __btrfs_cow_block()
Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This is
to allow to move defrag specific code out of ctree.c and into defrag.c in
one of the next patches.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Filipe Manana
b8bf4e4d6a btrfs: use round_down() to align block offset at btrfs_cow_block()
At btrfs_cow_block() we can use round_down() to align the extent buffer's
logical offset to the start offset of a metadata block group, instead of
the less easy to read set of bitwise operations (two plus one subtraction).
So replace the bitwise operations with a round_down() call.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Filipe Manana
7bff16e3ff btrfs: remove noinline attribute from btrfs_cow_block()
It's pointless to have the noiline attribute for btrfs_cow_block(), as the
function is exported and widely used. So remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:14 +02:00
Anand Jain
5966930dfd btrfs: remove incomplete metadata_uuid conversion fixup logic
Previous commit ("btrfs: reject devices with CHANGING_FSID_V2") has
stopped the assembly of devices with the CHANGING_FSID_V2 flag in the
kernel. Such devices can be scanned but will not be registered and can't
be mounted without a manual fix by btrfstune.  Remove the related logic
and now unused code.

The original motivation was to allow an interrupted partial conversion
fix itself on next mount, in case the system has to be rebooted. This is
a convenience but brings a lot of complexity the device scanning and
handling the partial states.  It's hard to estimate if this was ever
needed in practice, expecting the typical use case like a manual
conversion of an unmounted filesystem where the user can verify the
success and rerun it eventually.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add historical context ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:13 +02:00
Anand Jain
197a9ecee6 btrfs: reject devices with CHANGING_FSID_V2
The BTRFS_SUPER_FLAG_CHANGING_FSID_V2 flag indicates a transient state
where the device in the userspace btrfstune -m|-M operation failed to
complete changing the fsid.

This flag makes the kernel to automatically determine the other
partner devices to which a given device can be associated, based on the
fsid, metadata_uuid and generation values.

btrfstune -m|M feature is especially useful in virtual cloud setups, where
compute instances (disk images) are quickly copied, fsid changed, and
launched. Given numerous disk images with the same metadata_uuid but
different fsid, there's no clear way a device can be correctly assembled
with the proper partners when the CHANGING_FSID_V2 flag is set. So, the
disk could be assembled incorrectly, as in the example below:

Before this patch:

Consider the following two filesystems:
   /dev/loop[2-3] are raw copies of /dev/loop[0-1] and the btrsftune -m
operation fails.

In this scenario, as the /dev/loop0's fsid change is interrupted, and the
CHANGING_FSID_V2 flag is set as shown below.

  $ p="device|devid|^metadata_uuid|^fsid|^incom|^generation|^flags"

  $ btrfs inspect dump-super /dev/loop0 | egrep '$p'
  superblock: bytenr=65536, device=/dev/loop0
  flags			0x1000000001
  fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
  metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
  generation		9
  num_devices		2
  incompat_flags	0x741
  dev_item.devid	1

  $ btrfs inspect dump-super /dev/loop1 | egrep '$p'
  superblock: bytenr=65536, device=/dev/loop1
  flags			0x1
  fsid			11d2af4d-1b71-45a9-83f6-f2100766939d
  metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
  generation		10
  num_devices		2
  incompat_flags	0x741
  dev_item.devid	2

  $ btrfs inspect dump-super /dev/loop2 | egrep '$p'
  superblock: bytenr=65536, device=/dev/loop2
  flags			0x1
  fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
  metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
  generation		8
  num_devices		2
  incompat_flags	0x741
  dev_item.devid	1

  $ btrfs inspect dump-super /dev/loop3 | egrep '$p'
  superblock: bytenr=65536, device=/dev/loop3
  flags			0x1
  fsid			7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
  metadata_uuid		bb040a9f-233a-4de2-ad84-49aa5a28059b
  generation		8
  num_devices		2
  incompat_flags	0x741
  dev_item.devid	2

It is normal that some devices aren't instantly discovered during
system boot or iSCSI discovery. The controlled scan below demonstrates
this.

  $ btrfs device scan --forget
  $ btrfs device scan /dev/loop0
  Scanning for btrfs filesystems on '/dev/loop0'
  $ mount /dev/loop3 /btrfs
  $ btrfs filesystem show -m
  Label: none  uuid: 7d4b4b93-2b27-4432-b4e4-4be1fbccbd45
	Total devices 2 FS bytes used 144.00KiB
	devid    1 size 300.00MiB used 48.00MiB path /dev/loop0
	devid    2 size 300.00MiB used 40.00MiB path /dev/loop3

/dev/loop0 and /dev/loop3 are incorrectly partnered.

This kernel patch removes functions and code connected to the
CHANGING_FSID_V2 flag.

With this patch, now devices with the CHANGING_FSID_V2 flag are rejected.
And its partner will fail to mount with the extra -o degraded option.
The check is removed from open_ctree(), devices are rejected during
scanning which in turn fails the mount.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:13 +02:00
David Sterba
ab7c8bbf3a btrfs: relocation: constify parameters where possible
Lots of the functions in relocation.c don't change pointer parameters
but lack the annotations. Add them and reformat according to current
coding style if needed.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:13 +02:00
David Sterba
32f2abca38 btrfs: relocation: return bool from btrfs_should_ignore_reloc_root
btrfs_should_ignore_reloc_root() is a predicate so it should return
bool.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:13 +02:00
David Sterba
c71d3c698c btrfs: switch btrfs_backref_cache::is_reloc to bool
The btrfs_backref_cache::is_reloc is an indicator variable and should
use a bool type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:13 +02:00
David Sterba
733fa44de3 btrfs: relocation: open code mapping_tree_init
There's only one user of mapping_tree_init, we don't need a helper for
the simple initialization.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:13 +02:00