Commit graph

127 commits

Kent Overstreet
73d86dfd88 bcachefs: Fix erasure coding locking
This adds a new helper, bch2_trans_mutex_lock(), for locking a mutex -
dropping and retaking btree locks as needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:53 -04:00
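
A rough sketch of the locking pattern such a helper enables (illustrative only - struct trans, trans_unlock(), trans_relock() and the error value below are stand-ins, not the real bcachefs API): if the mutex can't be taken immediately, drop the btree locks before blocking, then relock and signal a transaction restart if that fails.

#include <pthread.h>
#include <stdbool.h>

/* Illustrative stand-ins - not the real btree_trans API. */
struct trans { int btree_locks_held; };

static void trans_unlock(struct trans *t) { t->btree_locks_held = 0; }
static bool trans_relock(struct trans *t) { t->btree_locks_held = 1; return true; }

#define ERR_TRANSACTION_RESTART (-2048)   /* made-up private error code */

/*
 * Take a mutex without ever blocking while btree locks are held: if the
 * trylock fails, drop the btree locks first, then block, then relock -
 * and tell the caller to restart the transaction if relocking fails.
 */
static int trans_mutex_lock_sketch(struct trans *trans, pthread_mutex_t *lock)
{
    if (pthread_mutex_trylock(lock) == 0)
        return 0;                        /* fast path: uncontended */

    trans_unlock(trans);                 /* don't sleep holding btree locks */
    pthread_mutex_lock(lock);

    if (!trans_relock(trans)) {
        pthread_mutex_unlock(lock);
        return ERR_TRANSACTION_RESTART;  /* caller restarts and retries */
    }
    return 0;
}
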
Kent Overstreet
2c7dd446d9 bcachefs: Erasure coding now uses bch2_bucket_alloc_trans
This code predates plumbing btree_trans through the bucket allocation
path: switching to it fixes a deadlock due to using multiple btree_trans
at the same time, which we never want to do.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:53 -04:00
Kent Overstreet
350175bf9b bcachefs: Improved nocow locking
This improves the nocow lock table so that hash table entries have
multiple locks, and locks specify which bucket they're for - i.e. we can
now resolve hash collisions.

This is important because the allocator has to skip buckets that are
locked in the nocow lock table, and previously hash collisions would
cause it to spuriously skip unlocked buckets.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:52 -04:00
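
A minimal sketch of the structure described above, with illustrative names and sizes (not the real bcachefs layout, and with the actual atomics/locking omitted): each hash entry holds several slots, each tagged with the bucket it is held for, so two buckets that hash to the same entry no longer look like the same lock.

#include <stdint.h>
#include <stddef.h>

#define NOCOW_TABLE_SIZE 128    /* illustrative sizes */
#define LOCKS_PER_ENTRY  4

struct bucket_lock {
    uint64_t bucket;            /* which bucket this slot is held for */
    int      held;              /* refcount; 0 means the slot is free */
};

struct nocow_lock_entry {
    struct bucket_lock l[LOCKS_PER_ENTRY];
};

static struct nocow_lock_entry table[NOCOW_TABLE_SIZE];

/* Find the slot already held for @bucket, or a free slot the caller can
 * claim; NULL means the entry is full and the caller must wait/retry. */
static struct bucket_lock *nocow_lock_find(uint64_t bucket)
{
    struct nocow_lock_entry *e = &table[bucket % NOCOW_TABLE_SIZE];
    struct bucket_lock *free_slot = NULL;

    for (size_t i = 0; i < LOCKS_PER_ENTRY; i++) {
        if (e->l[i].held && e->l[i].bucket == bucket)
            return &e->l[i];
        if (!e->l[i].held && !free_slot)
            free_slot = &e->l[i];
    }
    return free_slot;
}

/* The allocator can now ask "is *this* bucket locked?" without false
 * positives from other buckets that merely hash to the same entry. */
static int bucket_is_nocow_locked(uint64_t bucket)
{
    struct nocow_lock_entry *e = &table[bucket % NOCOW_TABLE_SIZE];

    for (size_t i = 0; i < LOCKS_PER_ENTRY; i++)
        if (e->l[i].held && e->l[i].bucket == bucket)
            return 1;
    return 0;
}
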
Kent Overstreet
a8b3a677e7 bcachefs: Nocow support
This adds support for nocow mode, where we do writes in-place when
possible. Patch components:

 - New boolean filesystem and inode option, nocow: note that when nocow
   is enabled, data checksumming and compression are implicitly disabled

 - To prevent in-place writes from racing with data moves
   (data_update.c) or bucket reuse (i.e. a bucket being reused and
   re-allocated while a nocow write is in flight), we have a new locking
   mechanism.

   Buckets can be locked for either data update or data move, using a
   fixed size hash table of two_state_shared locks. We don't have any
   chaining, meaning updates and moves to different buckets that hash to
   the same lock will wait unnecessarily - we'll want to watch for this
   becoming an issue.

 - The allocator path also needs to check for in-place writes in flight
   to a given bucket before giving it out: thus we add another counter
   to bucket_alloc_state so we can track this.

 - Fsync now may need to issue cache flushes to block devices instead of
   flushing the journal. We add a device bitmask to bch_inode_info,
   ei_devs_need_flush, which tracks devices that need to have flushes
   issued - note that this will lead to unnecessary flushes when other
   codepaths have already issued flushes; we may want to replace this with
   a sequence number.

 - New nocow write path: look up extents, and if they're writable write
   to them - otherwise fall back to the normal COW write path.

XXX: switch to sequence numbers instead of bitmask for devs needing
journal flush

XXX: ei_quota_lock being a mutex means bch2_nocow_write_done() needs to
run in process context - see if we can improve this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:51 -04:00
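
The two_state_shared locks mentioned above can be sketched as a shared lock whose holders must all agree on one of two states (e.g. nocow write vs. data update/move) - something like the following simplification using C11 atomics; the real implementation also has to handle waiting and wakeups.

#include <stdatomic.h>
#include <stdbool.h>

struct two_state_lock {
    atomic_long v;    /* > 0: held for state 0;  < 0: held for state 1 */
};

static bool two_state_trylock(struct two_state_lock *l, int state)
{
    long old   = atomic_load(&l->v);
    long delta = state ? -1 : 1;

    do {
        /* Fail if the lock is currently held for the *other* state. */
        if (state ? old > 0 : old < 0)
            return false;
    } while (!atomic_compare_exchange_weak(&l->v, &old, old + delta));
    return true;
}

static void two_state_unlock(struct two_state_lock *l, int state)
{
    atomic_fetch_add(&l->v, state ? 1 : -1);
}
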
Kent Overstreet
53b1c6f44b bcachefs: Don't use key cache during fsck
The btree key cache mainly helps with lock contention, at the cost of
additional memory overhead. During some fsck passes the memory overhead
really matters, and since fsck is single threaded, lock contention
isn't an issue - so skipping the key cache during fsck will help with
performance.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:51 -04:00
Kent Overstreet
a8c752bb1d bcachefs: New on disk format: Backpointers
This patch adds backpointers: we now have a reverse index from device
and offset on that device (specifically, offset within a bucket) back to
btree nodes and (non cached) data extents.

The first 40 backpointers within a bucket are stored in the alloc key;
after that backpointers spill over to the new backpointers btree. This
is to help avoid performance regressions from additional btree updates
on large streaming workloads.

This patch adds all the code for creating, checking and repairing
backpointers. The next patch in the series is going to use backpointers
for copygc - finally getting rid of the need to scan all extents to do
copygc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:50 -04:00
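
Schematically, the "first 40 inline, then spill" logic looks something like the sketch below - the struct layout, names and the btree-insert stub are illustrative, not the actual on-disk format.

#include <stdint.h>

#define BACKPOINTERS_INLINE_MAX 40    /* from the commit message */

struct backpointer {
    uint64_t bucket_offset;           /* offset within the bucket          */
    uint64_t btree_id_pos;            /* where the pointing extent/node is */
};

/* Illustrative stand-in for the alloc key's value. */
struct alloc_value {
    uint32_t           nr_backpointers;
    struct backpointer bp[BACKPOINTERS_INLINE_MAX];
};

/* Stand-in for an update to the separate backpointers btree. */
static void backpointers_btree_insert(uint64_t bucket, struct backpointer bp)
{
    (void)bucket; (void)bp;           /* real code would do a btree update */
}

/*
 * Add a backpointer for @bucket: keep it inline in the alloc key while
 * there's room (it rides along with the alloc update we're already doing),
 * otherwise spill to the dedicated backpointers btree.
 */
static void backpointer_add(uint64_t bucket, struct alloc_value *a,
                            struct backpointer bp)
{
    if (a->nr_backpointers < BACKPOINTERS_INLINE_MAX)
        a->bp[a->nr_backpointers++] = bp;
    else
        backpointers_btree_insert(bucket, bp);
}
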
Kent Overstreet
f2b542ba42 bcachefs: Go RW before check_alloc_info()
It's possible to do btree updates before going RW by adding them to the
list of updates for journal replay to do, but this is limited by what
fits in RAM. This patch switches the second alloc info phase to run
after going RW - btree_gc has already ensured the alloc btree itself is
correct - and tweaks the allocation path to deal with the potential
small inconsistencies.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:50 -04:00
Kent Overstreet
393a1f6863 bcachefs: Better inlining in core write path
Provide inline versions of some allocation functions
 - bch2_alloc_sectors_done_inlined()
 - bch2_alloc_sectors_append_ptrs_inlined()

and use them in the core IO path.

Also, inline bch2_extent_update_i_size_sectors() and
bch2_bkey_append_ptr().

In the core write path, function call overhead matters - every function
call is a jump to a new location and a potential cache miss.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
19a614d2e4 bcachefs: Better inlining for bch2_alloc_to_v4_mut
This moves the slowpath into a separate function, and inlines
bch2_alloc_to_v4_mut() into bch2_trans_start_alloc_update(), the main
place it's called.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
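
The general shape of this kind of change - keep the common case in a small static inline, push the rare case out of line - might look like the sketch below; the names, the format check and the stubbed slowpath are illustrative.

#include <stddef.h>

struct alloc_v4 { unsigned gen, data_type; /* ... */ };

/*
 * Out-of-line slow path, kept in a .c file: only taken when the key is in
 * an older format and needs converting/reallocating. Stubbed here.
 */
static struct alloc_v4 *alloc_to_v4_slowpath(void *k, size_t len)
{
    (void)k; (void)len;
    return NULL;            /* real code would convert the old format */
}

/*
 * Inline fast path for the common case: the key is already in the current
 * format, so the hot callers avoid a function call entirely.
 */
static inline struct alloc_v4 *alloc_to_v4_mut(void *k, size_t len)
{
    if (len == sizeof(struct alloc_v4))
        return k;
    return alloc_to_v4_slowpath(k, len);
}
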
Kent Overstreet
db36c1477d bcachefs: Fix bch2_bucket_alloc_early()
We were incorrectly retrying after a transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
87ced107f3 bcachefs: Convert EAGAIN errors to private error codes
More error code cleanup, for better error messages and debugability.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
858536c7ce bcachefs: Convert EROFS errors to private error codes
More error code improvements - this gets us more useful error messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
e88a75ebe8 bcachefs: New bpos_cmp(), bkey_cmp() replacements
This patch introduces
 - bpos_eq()
 - bpos_lt()
 - bpos_le()
 - bpos_gt()
 - bpos_ge()

and equivalent replacements for bkey_cmp().

Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
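
A sketch of what such dedicated predicates look like, assuming a bpos-like key of (inode, offset, snapshot) - the point being that an equality or ordering test can be answered directly instead of being derived from a generic three-way compare.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative key position - the real struct bpos is the same idea. */
struct bpos { uint64_t inode, offset, snapshot; };

static inline bool bpos_eq(struct bpos l, struct bpos r)
{
    return l.inode == r.inode &&
           l.offset == r.offset &&
           l.snapshot == r.snapshot;
}

static inline bool bpos_lt(struct bpos l, struct bpos r)
{
    return l.inode < r.inode ||
          (l.inode == r.inode &&
           (l.offset < r.offset ||
            (l.offset == r.offset && l.snapshot < r.snapshot)));
}

/* The remaining predicates fall out of bpos_lt(). */
static inline bool bpos_le(struct bpos l, struct bpos r) { return !bpos_lt(r, l); }
static inline bool bpos_gt(struct bpos l, struct bpos r) { return  bpos_lt(r, l); }
static inline bool bpos_ge(struct bpos l, struct bpos r) { return !bpos_lt(l, r); }
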
Kent Overstreet
07de1803b8 bcachefs: Kill bch2_alloc_sectors_start()
Only used in one place, just inline it there.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00
Kent Overstreet
3e3e02e6bc bcachefs: Assorted checkpatch fixes
checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:

 ASSIGN_IN_IF
 UNSPECIFIED_INT	- bcachefs coding style prefers single token type names
 NEW_TYPEDEFS		- typedefs are occasionally good
 FUNCTION_ARGUMENTS	- we prefer to look at functions in .c files
			  (hopefully with docbook documentation), not .h
			  file prototypes
 MULTISTATEMENT_MACRO_USE_DO_WHILE
			- we have _many_ x-macros and other macros where
			  we can't do this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
Kent Overstreet
ed80c5699a bcachefs: Optimize bch2_dev_usage_read()
- add bch2_dev_usage_read_fast(), which doesn't return by value -
   bch_dev_usage is big enough that we don't want the silent memcpy
 - tweak the allocation path to only call bch2_dev_usage_read() once per
   bucket allocated

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
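
The underlying pattern - fill a caller-supplied buffer instead of returning a large struct by value - can be sketched like this; struct fields and names are illustrative, and the real function reads percpu counters rather than a single global.

#include <string.h>

/* A usage struct large enough that returning it by value means a
 * hidden memcpy at every call site. */
struct dev_usage {
    unsigned long buckets[8];
    unsigned long sectors[8];
    unsigned long fragmented[8];
};

static struct dev_usage usage_state;      /* stand-in for per-device state */

/* Fast variant: write directly into the caller's buffer. */
static void dev_usage_read_fast(struct dev_usage *out)
{
    memcpy(out, &usage_state, sizeof(*out));
}

/* Convenience wrapper returning by value, for callers that don't care. */
static struct dev_usage dev_usage_read(void)
{
    struct dev_usage ret;

    dev_usage_read_fast(&ret);
    return ret;
}
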
Kent Overstreet
adf16c6dfa bcachefs: bucket_alloc_fail tracepoint should only fire when we have to block
We don't want to fire the bucket_alloc_fail tracepoint on transaction
restart, when we can retry immediately - only when the allocation
actually has to block.

Also, switch from sched_clock() to local_clock(), as we've been doing
elsewhere.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
Kent Overstreet
943f9946a6 bcachefs: Don't quash error in bch2_bucket_alloc_set_trans()
We were incorrectly returning -BCH_ERR_insufficient_devices when we'd
received a different error from bch2_bucket_alloc_trans(), which
(erroneously) turns into -EROFS further up the call chain.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:42 -04:00
Kent Overstreet
ae10fe017b bcachefs: bucket_alloc_state
This refactoring puts our various allocation path counters into a
dedicated struct - the upcoming nocow patch is going to add another
counter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:42 -04:00
Kent Overstreet
68b6cd194a bcachefs: Improve bucket_alloc tracepoint
It now includes more info - whether the bucket was for metadata or data
- and we now call it in the same place as the bucket_alloc_fail
tracepoint.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:42 -04:00
Kent Overstreet
098ef98d5b bcachefs: Add private error codes for ENOSPC
Continuing the saga of introducing private dedicated error codes for
each error path, this patch converts ENOSPC to error codes that are
subtypes of ENOSPC. We've recently had a test failure where we got
-ENOSPC where we shouldn't have, and didn't have enough information to
tell where it came from, so this patch will solve that problem.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:40 -04:00
Kent Overstreet
674cfc2624 bcachefs: Add persistent counters for all tracepoints
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:39 -04:00
Kent Overstreet
db346e7120 bcachefs: bch2_bucket_alloc_trans_early -> for_each_btree_key_norestart
Nested btree transactions require special care, and an upcoming patch is
going to add assertions to that effect. We don't want to be using them
unnecessarily, so this patch switches bch2_bucket_alloc_trans_early() to not
handle transaction restarts.

This patch also adds a cursor so that on transaction restart we can
continue scanning from where the previous search for an empty bucket
left off.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:37 -04:00
Kent Overstreet
549d173c1b bcachefs: EINTR -> BCH_ERR_transaction_restart
Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:37 -04:00
Kent Overstreet
90cecb921c bcachefs: Prevent a btree iter overflow in alloc path
In bch2_bucket_alloc_trans(), we're iterating over buckets - but not
directly with an iterator, since we're iterating over the freespace
btree.

This means that we need to clear iter->path->preserve, otherwise we'll
end up retaining a btree_path for every alloc key we touched - which is
not what we want here.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:36 -04:00
Kent Overstreet
615f867c14 bcachefs: Improved errcodes
Instead of overloading standard error codes (EINTR/EAGAIN), and defining
short lists of error codes in multiple places that potentially end up
overlapping & conflicting, we're now going to have one master list of
error codes.

Error codes are defined with an x-macro: thus we also have
bch2_err_str() now.

Also, error codes have a class field. Now, instead of checking for
errors with ==, code should use bch2_err_matches(), which returns true
if the error is equal to or a sub-error of the error class.

This means we can define unique errors for every source location where
an error is generated, which will help improve our error messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:36 -04:00
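
A simplified sketch of the scheme: one x-macro table defines the codes, their parent class, and their names, from which the string table and a matches() helper fall out. Names, numbers, and the use of positive (rather than negated) values here are illustrative.

#include <stdbool.h>

#define ERRCODES()                                                 \
    x(0,                       err_transaction_restart)            \
    x(err_transaction_restart, err_transaction_restart_relock)     \
    x(err_transaction_restart, err_transaction_restart_lock_fail)  \
    x(0,                       err_open_buckets_empty)

enum {
    ERR_START = 2048,          /* above the standard errno range */
#define x(class, name) name,
    ERRCODES()
#undef x
    ERR_MAX
};

/* One string per code, generated from the same table. */
static const char * const err_strs[] = {
#define x(class, name) [name - ERR_START] = #name,
    ERRCODES()
#undef x
};

/* Each code's parent class (0 for top-level codes). */
static const int err_class[] = {
#define x(class, name) [name - ERR_START] = class,
    ERRCODES()
#undef x
};

static const char *err_str(int err)
{
    return err > ERR_START && err < ERR_MAX ? err_strs[err - ERR_START]
                                            : "(standard errno)";
}

/* True if @err is @class itself or any sub-error of it. */
static bool err_matches(int err, int class)
{
    while (err > ERR_START && err < ERR_MAX && err != class)
        err = err_class[err - ERR_START];
    return err == class;
}
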
Kent Overstreet
8ef9831399 bcachefs: Improve bucket_alloc_fail tracepoint
We should be printing the number of free buckets, not just the number of
available buckets.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:36 -04:00
Kent Overstreet
30f0349d62 bcachefs: Split out dev_buckets_free()
Previously, dev_buckets_available() only counted buckets that are
eligible to be allocated right now - i.e. buckets that don't have cached
data, don't need a discard, don't need their gc gens reset, etc.

But most users of this function - copygc, allocator striping - want to
know how many buckets are eligible to be allocated from without moving
data around, which means we should also be counting buckets with cached
data, etc.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:34 -04:00
Kent Overstreet
401ec4db63 bcachefs: Printbuf rework
This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:33 -04:00
Kent Overstreet
3518e6faef bcachefs: Improve bch2_open_buckets_to_text()
This patch updates bch2_open_buckets_to_text() to include the device and
bucket the open_bucket owns.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:31 -04:00
Kent Overstreet
822835ffea bcachefs: Fold bucket_state in to BCH_DATA_TYPES()
Previously, we were missing accounting for buckets in need_gc_gens and
need_discard states. This matters because buckets in those states need
other btree operations done before they can be used, so they can't be
counted when checking the current number of free buckets against the
allocation watermark.

Also, we weren't directly counting free buckets at all. Now, data type 0
== BCH_DATA_free, and free buckets are counted; this means we can get
rid of the separate (poorly defined) count of unavailable buckets.

This is a new on disk format version, with upgrade and fsck required for
the accounting changes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:30 -04:00
Kent Overstreet
f25d8215f4 bcachefs: Kill allocator threads & freelists
Now that we have new persistent data structures for the allocator, this
patch converts the allocator to use them.

Now, foreground bucket allocation uses the freespace btree to find
buckets to allocate, instead of popping buckets off the freelist.

The background allocator threads are no longer needed and are deleted,
as well as the allocator freelists. Now we only need background tasks
for invalidating buckets containing cached data (when we are low on
empty buckets), and for issuing discards.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
Kent Overstreet
b17d3cec14 bcachefs: Run btree updates after write out of write_point
In the write path, after the write to the block device(s) completes we
have to punt to process context to do the btree update.

Instead of using the work item embedded in op->cl, this patch switches
to a per write-point work item. This helps with two different issues:

 - lock contention: btree updates to the same writepoint will (usually)
   be updating the same alloc keys
 - context switch overhead: when we're bottlenecked on btree updates,
   having a thread (running out of a work item) checking the write point
   for completed ops is cheaper than queueing up a new work item and
   waking up a kworker.

In an arbitrary benchmark, 4k random writes with fio running inside a
VM, this patch resulted in a 10% improvement in total iops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
Kent Overstreet
3e1547116f bcachefs: x-macroize alloc_reserve enum
This makes an array of strings available, like our other enums.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
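
The pattern, in miniature - one table generates both the enum and a matching array of names (reserve names here are illustrative):

#define ALLOC_RESERVES()    \
    x(btree_movinggc)       \
    x(btree)                \
    x(movinggc)             \
    x(none)

enum alloc_reserve {
#define x(name) RESERVE_##name,
    ALLOC_RESERVES()
#undef x
    RESERVE_NR
};

static const char * const alloc_reserve_strs[] = {
#define x(name) [RESERVE_##name] = #name,
    ALLOC_RESERVES()
#undef x
};

/* e.g. alloc_reserve_strs[RESERVE_btree] == "btree" */
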
Kent Overstreet
fcf01959ea bcachefs: Kill verify_not_stale()
This is ancient code that's more effectively checked in other places
now.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:24 -04:00
Kent Overstreet
a786087744 bcachefs: New in-memory array for bucket gens
The main in-memory bucket array is going away, but we'll still need to
keep bucket generations in memory, at least for now - ptr_stale() needs
to be an efficient operation.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
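
Conceptually, the replacement can be as small as a per-device byte array plus a cheap staleness check - a sketch with illustrative names, ignoring generation wraparound and locking.

#include <stdint.h>
#include <stdbool.h>

/* Keep only the 8-bit generation per bucket in memory, not a full
 * struct bucket. */
struct bucket_gens {
    uint64_t first_bucket;
    uint64_t nbuckets;
    uint8_t  gens[];              /* one byte per bucket */
};

struct ptr {                      /* stand-in for bch_extent_ptr */
    uint64_t bucket;
    uint8_t  gen;                 /* generation when the pointer was created */
};

/*
 * A pointer is stale if the bucket has been reused since the pointer was
 * written, i.e. its generation has moved on. This has to stay cheap: it
 * runs for every pointer on the read path. (Real code compares gens with
 * wraparound in mind.)
 */
static bool ptr_stale(const struct bucket_gens *g, const struct ptr *ptr)
{
    return g->gens[ptr->bucket - g->first_bucket] != ptr->gen;
}
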
Kent Overstreet
9ddffaf83b bcachefs: Put open_buckets in a hashtable
This is so that the copygc code doesn't have to refer to
bucket_mark.owned_by_allocator - assisting in getting rid of the in
memory bucket array.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
abe19d458e bcachefs: Refactor open_bucket code
Prep work for adding a hash table of open buckets - instead of embedding
a bch_extent_ptr, we need to refer to the bucket directly so that we're
not calling sector_to_bucket() in the hash table lookup code, which has
an expensive divide.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
57af63b286 bcachefs: bch2_alloc_sectors_append_ptrs() now takes cached flag
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:20 -04:00
Kent Overstreet
09943313d7 bcachefs: Rewrite bch2_bucket_alloc_new_fs()
This changes bch2_bucket_alloc_new_fs() to a simple bump allocator that
doesn't need to use the in memory bucket array; it's part of a larger
patch series to entirely get rid of the in memory bucket array, except
for gc/fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
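
A bump allocator in this context can be sketched in a few lines - field and function names are illustrative, and the real version also has to skip superblock/journal buckets and (per the following commit) buckets marked nouse.

#include <stdint.h>

struct new_fs_dev {
    uint64_t nbuckets;
    uint64_t alloc_cursor;    /* next never-used bucket; starts at first_bucket */
};

/* On a brand new filesystem nothing has been freed yet, so "the next
 * unused bucket" is just a counter - no bucket array needed.
 * Returns the bucket to use, or -1 if the device is out of buckets. */
static int64_t bucket_alloc_new_fs(struct new_fs_dev *ca)
{
    if (ca->alloc_cursor >= ca->nbuckets)
        return -1;
    return (int64_t)ca->alloc_cursor++;
}
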
Kent Overstreet
6be1b6d9df bcachefs: Make sure bch2_bucket_alloc_new_fs() obeys buckets_nouse
This fixes the filesystem migrate tool.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
fc6c01e2ea bcachefs: Convert bucket_alloc_ret to negative error codes
Start a new header, errcode.h, for bcachefs-private error codes - more
error codes will be converted later.

This patch just converts the bucket_alloc_ret codes so that they can be
mixed with standard error codes and passed as ERR_PTR errors - the ec.c
code was doing this already, but incorrectly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:17 -04:00
Kent Overstreet
89baec780f bcachefs: Allocator refactoring
This uses the kthread_wait_freezable() macro to simplify a lot of the
allocator thread code, along with cleaning up bch2_invalidate_bucket2().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
dac1525d9c bcachefs: gc shouldn't care about owned_by_allocator
The owned_by_allocator field is a purely in-memory thing; even if/when
we bring back GC at runtime, there's no need for it to recalculate this
field. This is prep work for pulling it out of struct bucket, and
eventually getting rid of the bucket array.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
3e07a7300f bcachefs: Fix an RCU splat
Writepoints are never deallocated so the rcu_read_lock() isn't really
needed, but we are doing lockless list traversal.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:00 -04:00
Kent Overstreet
cb66fc5fe4 bcachefs: Fix copygc threshold
A while back, the meaning of is_available_bucket() and thus also
bch_dev_usage->buckets_unavailable changed to include buckets that are
owned by the allocator - this was so that the stat could be persisted
like other allocation information, and wouldn't have to be regenerated
by walking each bucket at mount time.

This broke copygc, which needs to consider buckets that are reclaimable
and haven't yet been grabbed by the allocator thread and moved onto a
freelist. This patch fixes that by adding dev_buckets_reclaimable() for
copygc and the allocator thread, and cleans up some of the callers a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:54 -04:00
Kent Overstreet
72eab8da47 bcachefs: Refactor dev usage
This is to make it more amenable to serialization.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:52 -04:00
Kent Overstreet
6c7585b098 bcachefs: Rework allocating buckets for stripes
Allocating buckets for existing stripes was busted, in part because the
data structures were too contorted. This reworks new stripes so that we
have an array of open buckets that matches blocks in the stripe, and
it's sparse if we're reusing an existing stripe.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
890e3f5bf7 bcachefs: Reserve some open buckets for btree allocations
This reverts part of the change from "bcachefs: Don't use
BTREE_INSERT_USE_RESERVE so much" - it turns out we still should be
reserving open buckets for btree node allocations, because otherwise
data bucket allocations (especially with erasure coding enabled) can use
up all our open buckets and we won't be able to do the metadata update
that lets us release those open bucket references. Oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
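
The reservation logic amounts to a watermark check on the pool of open_bucket slots; a sketch with made-up numbers and names:

#include <stdbool.h>

#define OPEN_BUCKETS_BTREE_RESERVE 9   /* illustrative number */

enum alloc_reserve_kind { RESERVE_btree, RESERVE_none };

static unsigned open_buckets_free;     /* currently unused open_bucket slots */

/* Only btree node allocations may dip into the reserved slots; data
 * writes must leave them alone, so the metadata updates that eventually
 * release data writes' open bucket references can always proceed. */
static bool may_alloc_open_bucket(enum alloc_reserve_kind reserve)
{
    if (reserve == RESERVE_btree)
        return open_buckets_free > 0;

    return open_buckets_free > OPEN_BUCKETS_BTREE_RESERVE;
}
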
Kent Overstreet
8deed5f4e5 bcachefs: Use separate new stripes for copygc and non-copygc
Allocations for copygc have to be kept separate from everything else,
so that copygc doesn't get starved.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00