1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

171 commits

Author SHA1 Message Date
Filipe Manana
ecb54277cb btrfs: fix uninitialized return value from btrfs_reclaim_sweep()
The return variable 'ret' at btrfs_reclaim_sweep() is never assigned if
none of the space infos is reclaimable (for example if periodic reclaim
is disabled, which is the default), so we return an undefined value.

This can be fixed my making btrfs_reclaim_sweep() not return any value
as well as do_reclaim_sweep() because:

1) do_reclaim_sweep() always returns 0, so we can make it return void;

2) The only caller of btrfs_reclaim_sweep() (btrfs_reclaim_bgs()) doesn't
   care about its return value, and in its context there's nothing to do
   about any errors anyway.

Therefore remove the return value from btrfs_reclaim_sweep() and
do_reclaim_sweep().

Fixes: e4ca3932ae ("btrfs: periodic block_group reclaim")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-27 16:42:09 +02:00
Naohiro Aota
8cd44dd1d1 btrfs: zoned: fix zone_unusable accounting on making block group read-write again
When btrfs makes a block group read-only, it adds all free regions in the
block group to space_info->bytes_readonly. That free space excludes
reserved and pinned regions. OTOH, when btrfs makes the block group
read-write again, it moves all the unused regions into the block group's
zone_unusable. That unused region includes reserved and pinned regions.
As a result, it counts too much zone_unusable bytes.

Fortunately (or unfortunately), having erroneous zone_unusable does not
affect the calculation of space_info->bytes_readonly, because free
space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
the erroneous zone_unusable and it reduces the num_bytes just to cancel the
error.

This behavior can be easily discovered by adding a WARN_ON to check e.g,
"bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test
case like btrfs/282.

Fix it by properly considering pinned and reserved in
btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.

Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29 19:21:19 +02:00
Naohiro Aota
d89c285d28 btrfs: do not subtract delalloc from avail bytes
The block group's avail bytes printed when dumping a space info subtract
the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
btrfs_free_reserved_bytes(), it is added or subtracted along with
"reserved" for the delalloc case, which means the "delalloc_bytes" is a
part of the "reserved" bytes. So, excluding it to calculate the avail space
counts delalloc_bytes twice, which can lead to an invalid result.

Fixes: e50b122b83 ("btrfs: print available space for a block group when dumping a space info")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29 19:21:04 +02:00
Boris Burkov
0e962e755b btrfs: urgent periodic reclaim pass
Periodic reclaim attempts to avoid block_groups seeing active use with a
sweep mark that gets cleared on allocation and set on a sweep. In urgent
conditions where we have very little unallocated space (less than one
chunk used by the threshold calculation for the unallocated target), we
want to be able to override this mechanism.

Introduce a second pass that only happens if we fail to find a reclaim
candidate and reclaim is urgent. In that case, do a second pass where
all block groups are eligible.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
813d4c6422 btrfs: prevent pathological periodic reclaim loops
Periodic reclaim runs the risk of getting stuck in a state where it
keeps reclaiming the same block group over and over. This can happen if

1. reclaiming that block_group fails
2. reclaiming that block_group fails to move any extents into existing
   block_groups and just allocates a fresh chunk and moves everything.

Currently, 1. is a very tight loop inside the reclaim worker. That is
critical for edge triggered reclaim or else we risk forgetting about a
reclaimable group. On the other hand, with level triggered reclaim we
can break out of that loop and get it later.

With that fixed, 2. applies to both failures and "successes" with no
progress. If we have done a periodic reclaim on a space_info and nothing
has changed in that space_info, there is not much point to trying again,
so don't, until enough space gets free, which we capture with a
heuristic of needing to net free 1 chunk.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
e4ca3932ae btrfs: periodic block_group reclaim
We currently employ a edge-triggered block group reclaim strategy which
marks block groups for reclaim as they free down past a threshold.

With a dynamic threshold, this is worse than doing it in a
level-triggered fashion periodically. That is because the reclaim
itself happens periodically, so the threshold at that point in time is
what really matters, not the threshold at freeing time. If we mark the
reclaim in a big pass, then sort by usage and do reclaim, we also
benefit from a negative feedback loop preventing unnecessary reclaims as
we crunch through the "best" candidates.

Since this is quite a different model, it requires some additional
support. The edge triggered reclaim has a good heuristic for not
reclaiming fresh block groups, so we need to replace that with a typical
GC sweep mark which skips block groups that have seen an allocation
since the last sweep.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
f5ff64ccf7 btrfs: dynamic block_group reclaim threshold
We can currently recover allocated block_groups by:

- explicitly starting balance operations
- "auto reclaim" via bg_reclaim_threshold

The latter works by checking against a fixed threshold on frees. If we
pass from above the threshold to below, relocation triggers and the
block group will get reclaimed by the cleaner thread (assuming it is
still eligible)

Picking a threshold is challenging. Too high, and you end up trying to
reclaim very full block_groups which is quite costly, and you don't do
reclaim on block_groups that don't get quite THAT full, but could still
be quite fragmented and stranding a lot of space. Too low, and you
similarly miss out on reclaim even if you badly need it to avoid running
out of unallocated space, if you have heavily fragmented block groups
living above the threshold.

No matter the threshold, it suffers from a workload that happens to
bounce around that threshold, which can introduce arbitrary amounts of
reclaim waste.

To improve this situation, introduce a dynamic threshold. The basic idea
behind this threshold is that it should be very lax when there is plenty
of unallocated space, and increasingly aggressive as we approach zero
unallocated space. To that end, it sets a target for unallocated space
(10 chunks) and then linearly increases the threshold as the amount of
space short of the target we are increases. The formula is:
(target - unalloc) / target

I tested this by running it on three interesting workloads:

  1. bounce allocations around X% full.
  2. fill up all the way and introduce full fragmentation.
  3. write in a fragmented way until the filesystem is just about full.

1. and 2. attack the weaknesses of a fixed threshold; fixed either works
perfectly or fully falls apart, depending on the threshold. Dynamic
always handles these cases well.

3. attacks dynamic by checking whether it is too zealous to reclaim in
conditions with low unallocated and low unused. It tends to claw back
1GiB of unallocated fairly aggressively, but not much more. Early
versions of dynamic threshold struggled on this test.

Additional work could be done to intelligently ratchet up the urgency of
reclaim in very low unallocated conditions. Existing mechanisms are
already useless in that case anyway.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Boris Burkov
42f620aec1 btrfs: store fs_info in space_info
This is handy when computing space_info dynamic reclaim thresholds where
we do not have access to a block group. We could add it to the various
functions as a parameter, but it seems reasonable for space_info to have
an fs_info pointer.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:27 +02:00
Filipe Manana
ded980eb3f btrfs: add and use helper to commit the current transaction
We have several places that attach to the current transaction with
btrfs_attach_transaction_barrier() and then commit the transaction if
there is one. Add a helper and use it to deduplicate this pattern.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:20 +02:00
David Sterba
42317ab440 btrfs: simplify range parameters of btrfs_wait_ordered_roots()
The range is specified only in two ways, we can simplify the case for
the whole filesystem range as a NULL block group parameter.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:19 +02:00
David Sterba
5100c4eb52 btrfs: remove unused define EXTENT_SIZE_PER_ITEM
This was added  in c61a16a701 ("Btrfs: fix the confusion between
delalloc bytes and metadata bytes") and removed in 03fe78cc29
("btrfs: use delalloc_bytes to determine flush amount for
shrink_delalloc") where the calculation was reworked to use a
non-constant numbers. This was found by 'make W=2'.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 15:33:19 +02:00
Naohiro Aota
64d2c847ba btrfs: zoned: fix calc_available_free_space() for zoned mode
calc_available_free_space() returns the total size of metadata (or
system) block groups, which can be allocated from unallocated disk
space. The logic is wrong on zoned mode in two places.

First, the calculation of data_chunk_size is wrong. We always allocate
one zone as one chunk, and no partial allocation of a zone. So, we
should use zone_size (= data_sinfo->chunk_size) as it is.

Second, the result "avail" may not be zone aligned. Since we always
allocate one zone as one chunk on zoned mode, returning non-zone size
aligned bytes will result in less pressure on the async metadata reclaim
process.

This is serious for the nearly full state with a large zone size device.
Allowing over-commit too much will result in less async reclaim work and
end up in ENOSPC. We can align down to the zone size to avoid that.

Fixes: cb6cbab790 ("btrfs: adjust overcommit logic when very close to full")
CC: stable@vger.kernel.org # 6.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-02 19:13:11 +02:00
David Sterba
2b712e3bb2 btrfs: remove unused included headers
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 16:24:46 +01:00
Filipe Manana
e06cc89475 btrfs: fix data races when accessing the reserved amount of block reserves
At space_info.c we have several places where we access the ->reserved
field of a block reserve without taking the block reserve's spinlock
first, which makes KCSAN warn about a data race since that field is
always updated while holding the spinlock.

The reports from KCSAN are like the following:

  [117.193526] BUG: KCSAN: data-race in btrfs_block_rsv_release [btrfs] / need_preemptive_reclaim [btrfs]

  [117.195148] read to 0x000000017f587190 of 8 bytes by task 6303 on cpu 3:
  [117.195172]  need_preemptive_reclaim+0x222/0x2f0 [btrfs]
  [117.195992]  __reserve_bytes+0xbb0/0xdc8 [btrfs]
  [117.196807]  btrfs_reserve_metadata_bytes+0x4c/0x120 [btrfs]
  [117.197620]  btrfs_block_rsv_add+0x78/0xa8 [btrfs]
  [117.198434]  btrfs_delayed_update_inode+0x154/0x368 [btrfs]
  [117.199300]  btrfs_update_inode+0x108/0x1c8 [btrfs]
  [117.200122]  btrfs_dirty_inode+0xb4/0x140 [btrfs]
  [117.200937]  btrfs_update_time+0x8c/0xb0 [btrfs]
  [117.201754]  touch_atime+0x16c/0x1e0
  [117.201789]  filemap_read+0x674/0x728
  [117.201823]  btrfs_file_read_iter+0xf8/0x410 [btrfs]
  [117.202653]  vfs_read+0x2b6/0x498
  [117.203454]  ksys_read+0xa2/0x150
  [117.203473]  __s390x_sys_read+0x68/0x88
  [117.203495]  do_syscall+0x1c6/0x210
  [117.203517]  __do_syscall+0xc8/0xf0
  [117.203539]  system_call+0x70/0x98

  [117.203579] write to 0x000000017f587190 of 8 bytes by task 11 on cpu 0:
  [117.203604]  btrfs_block_rsv_release+0x2e8/0x578 [btrfs]
  [117.204432]  btrfs_delayed_inode_release_metadata+0x7c/0x1d0 [btrfs]
  [117.205259]  __btrfs_update_delayed_inode+0x37c/0x5e0 [btrfs]
  [117.206093]  btrfs_async_run_delayed_root+0x356/0x498 [btrfs]
  [117.206917]  btrfs_work_helper+0x160/0x7a0 [btrfs]
  [117.207738]  process_one_work+0x3b6/0x838
  [117.207768]  worker_thread+0x75e/0xb10
  [117.207797]  kthread+0x21a/0x230
  [117.207830]  __ret_from_fork+0x6c/0xb8
  [117.207861]  ret_from_fork+0xa/0x30

So add a helper to get the reserved amount of a block reserve while
holding the lock. The value may be not be up to date anymore when used by
need_preemptive_reclaim() and btrfs_preempt_reclaim_metadata_space(), but
that's ok since the worst it can do is cause more reclaim work do be done
sooner rather than later. Reading the field while holding the lock instead
of using the data_race() annotation is used in order to prevent load
tearing.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-22 12:15:06 +01:00
Josef Bacik
cb6cbab790 btrfs: adjust overcommit logic when very close to full
A user reported some unpleasant behavior with very small file systems.
The reproducer is this

  $ mkfs.btrfs -f -m single -b 8g /dev/vdb
  $ mount /dev/vdb /mnt/test
  $ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20

This will result in usage that looks like this

  Overall:
      Device size:                   8.00GiB
      Device allocated:              8.00GiB
      Device unallocated:            1.00MiB
      Device missing:                  0.00B
      Device slack:                  2.00GiB
      Used:                          5.47GiB
      Free (estimated):              2.52GiB      (min: 2.52GiB)
      Free (statfs, df):               0.00B
      Data ratio:                       1.00
      Metadata ratio:                   1.00
      Global reserve:                5.50MiB      (used: 0.00B)
      Multiple profiles:                  no

  Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
     /dev/vdb        7.99GiB

  Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
     /dev/vdb        8.00MiB

  System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
     /dev/vdb        4.00MiB

  Unallocated:
     /dev/vdb        1.00MiB

As you can see we've gotten ourselves quite full with metadata, with all
of the disk being allocated for data.

On smaller file systems there's not a lot of time before we get full, so
our overcommit behavior bites us here.  Generally speaking data
reservations result in chunk allocations as we assume reservation ==
actual use for data.  This means at any point we could end up with a
chunk allocation for data, and if we're very close to full we could do
this before we have a chance to figure out that we need another metadata
chunk.

Address this by adjusting the overcommit logic.  Simply put we need to
take away 1 chunk from the available chunk space in case of a data
reservation.  This will allow us to stop overcommitting before we
potentially lose this space to a data allocation.  With this fix in
place we properly allocate a metadata chunk before we're completely
full, allowing for enough slack space in metadata.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:16 +02:00
Filipe Manana
8a526c44da btrfs: allow to run delayed refs by bytes to be released instead of count
When running delayed references, through btrfs_run_delayed_refs(), we can
specify how many to run, run all existing delayed references and keep
running delayed references while we can find any. This is controlled with
the value of the 'count' argument, where a value of 0 means to run all
delayed references that exist by the time btrfs_run_delayed_refs() is
called, (unsigned long)-1 means to keep running delayed references while
we are able find any, and any other value to run that exact number of
delayed references.

Typically a specific value other than 0 or -1 is used when flushing space
to try to release a certain amount of bytes for a ticket. In this case
we just simply calculate how many delayed reference heads correspond to a
specific amount of bytes, with calc_delayed_refs_nr(). However that only
takes into account the space reserved for the reference heads themselves,
and does not account for the space reserved for deleting checksums from
the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
in case we are going to delete a data extent. This means we may end up
running more delayed references than necessary in case we process delayed
references for deleting a data extent.

So change the logic of btrfs_run_delayed_refs() to take a bytes argument
to specify how many bytes of delayed references to run/release, using the
special values of 0 to mean all existing delayed references and U64_MAX
(or (u64)-1) to keep running delayed references while we can find any.

This prevents running more delayed references than necessary, when we have
delayed references for deleting data extents, but also makes the upcoming
changes/patches simpler and it's preparatory work for them.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
03551d651e btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes()
We are passing a block reserve argument to btrfs_reserve_metadata_bytes()
which is not really used, all we need is to pass the space_info associated
to the block reserve, we don't change the block reserve at all.

Not only it's pointless to pass the block reserve, it's also confusing as
one might think that the reserved bytes will end up being added to the
passed block reserve, when that's not the case. The pattern for reserving
space and adding it to a block reserve is to first reserve space with
btrfs_reserve_metadata_bytes() and if that succeeds, then add the space to
a block reserve by calling btrfs_block_rsv_add_bytes().

Also the reverse of btrfs_reserve_metadata_bytes(), which is
btrfs_space_info_free_bytes_may_use(), takes a space_info argument and
not a block reserve, so one more reason to pass a space_info and not a
block reserve to btrfs_reserve_metadata_bytes().

So change btrfs_reserve_metadata_bytes() and its callers to pass a
space_info argument instead of a block reserve argument.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
David Sterba
9580503bcb btrfs: reformat remaining kdoc style comments
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
Naohiro Aota
5b135b382a btrfs: zoned: re-enable metadata over-commit for zoned mode
Now that, we can re-enable metadata over-commit. As we moved the activation
from the reservation time to the write time, we no longer need to ensure
all the reserved bytes is properly activated.

Without the metadata over-commit, it suffers from lower performance because
it needs to flush the delalloc items more often and allocate more block
groups. Re-enabling metadata over-commit will solve the issue.

Fixes: 79417d040f ("btrfs: zoned: disable metadata overcommit for zoned")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:19 +02:00
Naohiro Aota
5a7d107e5e btrfs: zoned: don't activate non-DATA BG on allocation
Now that a non-DATA block group is activated at write time, don't
activate it on allocation time.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:19 +02:00
Filipe Manana
2ee70ed19c btrfs: avoid starting and committing empty transaction when flushing space
When flushing space and we are in the COMMIT_TRANS state, we join a
transaction with btrfs_join_transaction() and then commit the returned
transaction. However btrfs_join_transaction() starts a new transaction if
there is none currently open, which is pointless since comitting a new,
empty transaction, doesn't achieve anything, it only wastes time, IO and
creates an unnecessary rotation of the backup roots.

So use btrfs_attach_transaction_barrier() to avoid starting a new
transaction. This also waits for any ongoing transaction that is
committing (state >= TRANS_STATE_COMMIT_DOING) to fully complete, and
therefore wait for all the extents that were pinned during the
transaction's lifetime to be unpinned.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:18 +02:00
Filipe Manana
2391245ac2 btrfs: avoid starting new transaction when flushing delayed items and refs
When flushing space we join a transaction to flush delayed items and
delayed references, in order to try to release space. However using
btrfs_join_transaction() not only joins an existing transaction as well
as it starts a new transaction if there is none open. If there is no
transaction open, we don't have neither delayed items nor delayed
references, so creating a new transaction is a waste of time, IO and
creates an unnecessary rotation of the backup roots without gaining any
benefits (including releasing space).

So use btrfs_join_transaction_nostart() when attempting to flush delayed
items and references.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:18 +02:00
Filipe Manana
7e3bfd146e btrfs: fail priority metadata ticket with real fs error
At priority_reclaim_metadata_space(), if we were not able to satisfy the
the ticket after going through the various flushing states and we notice
the fs went into an error state, likely due to a transaction abort during
the flushing, set the ticket's error to the error that caused the
transaction abort instead of an unconditional -EROFS.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:18 +02:00
Filipe Manana
1b6948acb8 btrfs: don't steal space from global rsv after a transaction abort
When doing a priority metadata space reclaim, while we are going through
the flush states and running their respective operations, it's possible
that a transaction abort happened, for example when running delayed refs
we hit -ENOSPC or in the critical section of transaction commit we failed
with -ENOSPC or some other error. In these cases a transaction was aborted
and the fs turned into error state. If that happened, then it makes no
sense to steal from the global block reserve and return success to the
caller if the stealing was successful - the caller will later get an
error when attempting to modify the fs. Instead make the ticket fail if
we have the fs in error state and don't attempt to steal from the global
rsv, as it's not only it's pointless, it also simplifies debugging some
-ENOSPC problems.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:17 +02:00
Filipe Manana
1ff9fee3bd btrfs: print available space across all block groups when dumping space info
When dumping a space info also sum the available space for all block
groups and then print it. This often useful for debugging -ENOSPC
related problems.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:17 +02:00
Filipe Manana
e50b122b83 btrfs: print available space for a block group when dumping a space info
When dumping a space info, we iterate over all its block groups and then
print their size and the amounts of bytes used, reserved, pinned, etc.
When debugging -ENOSPC problems it's also useful to know how much space
is available (free), so calculate that and print it as well.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:17 +02:00
Filipe Manana
b92e8f5472 btrfs: print block group super and delalloc bytes when dumping space info
When dumping a space info's block groups, also print the number of bytes
used for super blocks and delalloc. This is often useful for debugging
-ENOSPC problems.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:17 +02:00
Filipe Manana
0e55a54502 btrfs: add helper to calculate space for delayed references
Instead of duplicating the logic for calculating how much space is
required for a given number of delayed references, add an inline helper
to encapsulate that logic and use it everywhere we are calculating the
space required.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:19 +02:00
Filipe Manana
f4160ee878 btrfs: constify fs_info argument for the reclaim items calculation helpers
Now that btrfs_calc_insert_metadata_size() can take a const fs_info
argument, make the fs_info argument of calc_reclaim_items_nr() and of
calc_delayed_refs_nr() const as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:19 +02:00
Filipe Manana
007145ff64 btrfs: accurately calculate number of delayed refs when flushing
When flushing a limited number of delayed references (FLUSH_DELAYED_REFS_NR
state), we are assuming each delayed reference is holding a number of bytes
matching the needed space for inserting for a single metadata item (the
result of btrfs_calc_insert_metadata_size()). That is not correct when
using the free space tree, as in that case we have to multiply that value
by 2 since we need to touch the free space tree as well. This is the same
computation as we do at btrfs_update_delayed_refs_rsv() and at
btrfs_delayed_refs_rsv_release().

So correct the computation for the amount of delayed references we need to
flush in case we have the free space tree. This does not fix a functional
issue, instead it makes the flush code flush less delayed references, only
the minimum necessary to satisfy a ticket.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:19 +02:00
Filipe Manana
3a49a54894 btrfs: initialize ret to -ENOSPC at __reserve_bytes()
At space-info.c:__reserve_bytes(), instead of initializing 'ret' to 0 when
it's declared and then shortly after set it to -ENOSPC under the space
info's spinlock, initialize it to -ENOSPC when declaring it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:18 +02:00
Filipe Manana
9d0d47d5c3 btrfs: update flush method assertion when reserving space
When reserving space, at space-info.c:__reserve_bytes(), we assert that
either the current task is not holding a transacion handle, or, if it is,
that the flush method is not BTRFS_RESERVE_FLUSH_ALL. This is because that
flush method can trigger transaction commits, and therefore could lead to
a deadlock.

However there are other 2 flush methods that can trigger transaction
commits:

1) BTRFS_RESERVE_FLUSH_ALL_STEAL
2) BTRFS_RESERVE_FLUSH_EVICT

So update the assertion to check the flush method is also not one those
two methods if the current task is holding a transaction handle.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:18 +02:00
Naohiro Aota
e15acc2588 btrfs: zoned: drop space_info->active_total_bytes
The space_info->active_total_bytes is no longer necessary as we now
count the region of newly allocated block group as zone_unusable. Drop
its usage.

Fixes: 6a921de589 ("btrfs: zoned: introduce space_info->active_total_bytes")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15 20:51:07 +01:00
Josef Bacik
bf1f1fec27 btrfs: rename BTRFS_FS_NO_OVERCOMMIT to BTRFS_FS_ACTIVE_ZONE_TRACKING
This flag only gets set when we're doing active zone tracking, and we're
going to need to use this flag for things related to this behavior.
Rename the flag to represent what it actually means for the file system
so it can be used in other ways and still make sense.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15 20:51:06 +01:00
Naohiro Aota
85e79ec7b7 btrfs: zoned: enable metadata over-commit for non-ZNS setup
The commit 79417d040f ("btrfs: zoned: disable metadata overcommit for
zoned") disabled the metadata over-commit to track active zones properly.

However, it also introduced a heavy overhead by allocating new metadata
block groups and/or flushing dirty buffers to release the space
reservations. Specifically, a workload (write only without any sync
operations) worsen its performance from 343.77 MB/sec (v5.19) to 182.89
MB/sec (v6.0).

The performance is still bad on current misc-next which is 187.95 MB/sec.
And, with this patch applied, it improves back to 326.70 MB/sec (+73.82%).

This patch introduces a new fs_info->flag BTRFS_FS_NO_OVERCOMMIT to
indicate it needs to disable the metadata over-commit. The flag is enabled
when a device with max active zones limit is loaded into a file-system.

Fixes: 79417d040f ("btrfs: zoned: disable metadata overcommit for zoned")
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 20:04:25 +01:00
David Sterba
428c8e0310 btrfs: simplify percent calculation helpers, rename div_factor
The div_factor* helpers calculate fraction or percentage fraction. The
name is a bit confusing, we use it only for percentage calculations and
there are two helpers.

There's a helper mult_frac that's for general fractions, that tries to
be accurate but we multiply and divide by small numbers so we can use
the div_u64 helper.

Rename the div_factor* helpers and use 1..100 percentage range, also drop
the case checking for percentage == 100, it's never hit.

The conversions:

* div_factor calculates tenths and the numbers need to be adjusted
* div_factor_fine is direct replacement

Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:48 +01:00
David Sterba
43dd529abe btrfs: update function comments
Update, reformat or reword function comments. This also removes the kdoc
marker so we don't get reports when the function name is missing.

Changes made:

- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword to imperative voice
- align parameter list
- fix typos

Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:45 +01:00
Josef Bacik
a0231804af btrfs: move extent-tree helpers into their own header file
Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compiles.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:44 +01:00
Josef Bacik
e2f13b343c btrfs: move btrfs_account_ro_block_groups_free_space into space-info.c
This was prototyped in ctree.h and the code existed in extent-tree.c,
but it's space-info related so move it into space-info.c.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:44 +01:00
Josef Bacik
07e81dc944 btrfs: move accessor helpers into accessors.h
This is a large patch, but because they're all macros it's impossible to
split up.  Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:42 +01:00
Josef Bacik
c7f13d428e btrfs: move fs wide helpers out of ctree.h
We have several fs wide related helpers in ctree.h.  The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code.  Move these into a fs.h header, which will
serve as the location for file system wide related helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:41 +01:00
Josef Bacik
765c3fe99b btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY
Inside of FB, as well as some user reports, we've had a consistent
problem of occasional ENOSPC transaction aborts.  Inside FB we were
seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
low occurrence rate given the size of our fleet, but it's not nothing.

There are two causes of this particular problem.

First is delayed allocation.  The reservation system for delalloc
assumes that contiguous dirty ranges will result in 1 file extent item.
However if there is memory pressure that results in fragmented writeout,
or there is fragmentation in the block groups, this won't necessarily be
true.  Consider the case where we do a single 256MiB write to a file and
then close it.  We will have 1 reservation for the inode update, the
reservations for the checksum updates, and 1 reservation for the file
extent item.  At some point later we decide to write this entire range
out, but we're so fragmented that we break this into 100 different file
extents.  Since we've already closed the file and are no longer writing
to it there's nothing to trigger a refill of the delalloc block rsv to
satisfy the 99 new file extent reservations we need.  At this point we
exhaust our delalloc reservation, and we begin to steal from the global
reserve.  If you have enough of these cases going in parallel you can
easily exhaust the global reserve, get an ENOSPC at
btrfs_alloc_tree_block() time, and then abort the transaction.

The other case is the delayed refs reserve.  The delayed refs reserve
updates its size based on outstanding delayed refs and dirty block
groups.  However we only refill this block reserve when returning
excess reservations and when we call btrfs_start_transaction(root, X).
We will reserve 2*X credits at transaction start time, and fill in X
into the delayed refs reserve to make sure it stays topped off.
Generally this works well, but clearly has downsides.  If we do a
particularly delayed ref heavy operation we may never catch up in our
reservations.  Additionally running delayed refs generates more delayed
refs, and at that point we may be committing the transaction and have no
way to trigger a refill of our delayed refs rsv.  Then a similar thing
occurs with the delalloc reserve.

Generally speaking we well over-reserve in all of our block rsvs.  If we
reserve 1 credit we're usually reserving around 264k of space, but we'll
often not use any of that reservation, or use a few blocks of that
reservation.  We can be reasonably sure that as long as you were able to
reserve space up front for your operation you'll be able to find space
on disk for that reservation.

So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY.  This
gets used in the case that we've exhausted our reserve and the global
reserve.  It simply forces a reservation if we have enough actual space
on disk to make the reservation, which is almost always the case.  This
keeps us from hitting ENOSPC aborts in these odd occurrences where we've
not kept up with the delayed work.

Fixing this in a complete way is going to be relatively complicated and
time consuming.  This patch is what I discussed with Filipe earlier this
year, and what I put into our kernels inside FB.  With this patch we're
down to 1-2 ENOSPC aborts per week, which is a significant reduction.
This is a decent stop gap until we can work out a more wholistic
solution to these two corner cases.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Josef Bacik
1daedb1d6b btrfs: add the ability to use NO_FLUSH for data reservations
In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH
data reservations, so plumb this through the delalloc reservation
system.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29 17:08:28 +02:00
Filipe Manana
b0b47a3859 btrfs: remove useless used space increment during space reservation
At space-info.c:__reserve_bytes(), we increment the 'used' variable, but
then we don't use the variable anymore, making the increment pointless.
The increment became useless with commit 2e294c6049 ("btrfs: simplify
the logic in need_preemptive_flushing"), so just remove it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:02 +02:00
Qu Wenruo
8e327b9c0d btrfs: dump all space infos if we abort transaction due to ENOSPC
We have hit some transaction abort due to -ENOSPC internally.

Normally we should always reserve enough space for metadata for every
transaction, thus hitting -ENOSPC should really indicate some cases we
didn't expect.

But unfortunately current error reporting will only give a kernel
warning and stack trace, not really helpful to debug what's causing the
problem.

And mount option debug_enospc can only help when user can reproduce the
problem, but under most cases, such transaction abort by -ENOSPC is
really hard to reproduce.

So this patch will dump all space infos (data, metadata, system) when we
abort the first transaction with -ENOSPC.

This should at least provide some clue to us.

The example of a dump would look like this:

  BTRFS: Transaction aborted (error -28)
  WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs]
  <call trace skipped>
  ---[ end trace 0000000000000000 ]---
  BTRFS info (device dm-1: state A): dumping space info:
  BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full
  BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
  BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full
  BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0
  BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full
  BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
  BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016
  BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0
  BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0
  BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232
  BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728
  BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left
  BTRFS info (device dm-1: state EA): forced readonly

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Qu Wenruo
25a860c409 btrfs: output human readable space info flag
For btrfs_space_info, its flags has only 4 possible values:

- BTRFS_BLOCK_GROUP_SYSTEM
- BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA
- BTRFS_BLOCK_GROUP_METADATA
- BTRFS_BLOCK_GROUP_DATA

Make the output more human readable, now it looks like:

  BTRFS info (device dm-1: state A): space_info METADATA has 251494400 free, is not full

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Josef Bacik
3349b57fd4 btrfs: convert block group bit field to use bit helpers
We use a bit field in the btrfs_block_group for different flags, however
this is awkward because we have to hold the block_group->lock for any
modification of any of these fields, and makes the code clunky for a few
of these flags.  Convert these to a properly flags setup so we can
utilize the bit helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:54 +02:00
Josef Bacik
723de71d41 btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_info
We previously had the pattern of

	btrfs_update_space_info(all, the, bg, fields, &space_info);
	link_block_group(bg);
	bg->space_info = space_info;

Now that we're passing the bg into btrfs_add_bg_to_space_info we can do
the linking in that function, transforming this to simply

	btrfs_add_bg_to_space_info(fs_info, bg);

and put the link_block_group() and bg->space_info assignment directly in
btrfs_add_bg_to_space_info.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:54 +02:00
Josef Bacik
9d4b0a129a btrfs: simplify arguments of btrfs_update_space_info and rename
This function has grown a bunch of new arguments, and it just boils down
to passing in all the block group fields as arguments.  Simplify this by
passing in the block group itself and updating the space_info fields
based on the block group fields directly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:54 +02:00
Qu Wenruo
5da431b71d btrfs: fix the max chunk size and stripe length calculation
[BEHAVIOR CHANGE]
Since commit f6fca3917b ("btrfs: store chunk size in space-info
struct"), btrfs no longer can create larger data chunks than 1G:

  mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
  mount $dev1 $mnt

  btrfs balance start --full $mnt
  btrfs balance start --full $mnt
  umount $mnt

  btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2

Before that offending commit, what we got is a 4G data chunk:

	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 4 sub_stripes 1

Now what we got is only 1G data chunk:

	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
		length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 4 sub_stripes 1

This will increase the number of data chunks by the number of devices,
not only increase system chunk usage, but also greatly increase mount
time.

Without a proper reason, we should not change the max chunk size.

[CAUSE]
Previously, we set max data chunk size to 10G, while max data stripe
length to 1G.

Commit f6fca3917b ("btrfs: store chunk size in space-info struct")
completely ignored the 10G limit, but use 1G max stripe limit instead,
causing above shrink in max data chunk size.

[FIX]
Fix the max data chunk size to 10G, and in decide_stripe_size_regular()
we limit stripe_size to 1G manually.

This should only affect data chunks, as for metadata chunks we always
set the max stripe size the same as max chunk size (256M or 1G
depending on fs size).

Now the same script result the same old result:

	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 4 sub_stripes 1

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Fixes: f6fca3917b ("btrfs: store chunk size in space-info struct")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-06 17:49:58 +02:00