Hook the regular rmap code when an rmapbt repair operation is running so
that we can unlock the AGF buffer to scan the filesystem and keep the
in-memory btree up to date during the scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Rebuild the reverse mapping btree from all primary metadata. This first
patch establishes the bare mechanics of finding records and putting
together a new ondisk tree; more complex pieces are needed to make it
work properly.
Link: Documentation/filesystems/xfs-online-fsck-design.rst
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The last checks for bc_btnum can be replaced with helpers that check
the btree ops. This allows adding new btrees to XFS without having
to update a global enum.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: complete the ops predicates]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The btnum in struct xfs_btree_ops is often used for printing a symbolic
name for the btree. Add a name field to the ops structure and use that
directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Use the same summary counter calculation infrastructure to generate new
values for the in-core summary counters. The difference between the
scrubber and the repairer is that the repairer will freeze the fs during
setup, which means that the values should match exactly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If scrub finds that everything is ok with the filesystem, we need a way
to tell the health tracking that it can let go of indirect health flags,
since indirect flags only mean that at some point in the past we lost
some context.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix the file link counts since we just computed the correct ones.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create the necessary hooks in the directory operations
(create/link/unlink/rename) code so that our live nlink scrub code can
stay up to date with link count updates in the rest of the filesystem.
This will be the means to keep our shadow link count information up to
date while the scan runs in real time.
In online fsck part 2, we'll use these same hooks to handle repairs
to directories and parent pointer information.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create the necessary scrub code to walk the filesystem's directory tree
so that we can compute file link counts. Similar to quotacheck, we
create an incore shadow array of link count information and then we walk
the filesystem a second time to compare the link counts. We need live
updates to keep the information up to date during the lengthy scan, so
this scrubber remains disabled until the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the shadow quota counters that live quotacheck creates to reset the
incore dquot counters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a shadow dqtrx system in the quotacheck code that hooks the
regular dquot counter update code. This will be the means to keep our
copy of the dquot counters up to date while the scan runs in real time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new trio of scrub functions to check quota counters. While the
dquots themselves are filesystem metadata and should be checked early,
the dquot counter values are computed from other metadata and are
therefore summary counters. We don't plug these into the scrub dispatch
just yet, because we still need to be able to watch quota updates while
doing our scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Repair might encounter an inode with a totally garbage i_mode. To fix
this problem, we have to figure out if the file was a regular file, a
directory, or a special file. One way to figure this out is to check if
there are any directories with entries pointing down to the busted file.
This patch recovers the file mode by scanning every directory entry on
the filesystem to see if there are any that point to the busted file.
If the ftype of all such dirents are consistent, the mode is recovered
from the ftype. If no dirents are found, the file becomes a regular
file. In all cases, ACLs are canceled and the file is made accessible
only by root.
A previous patch attempted to guess the mode by reading the beginning of
the file data. This was rejected by Christoph on the grounds that we
cannot trust user-controlled data blocks. Users do not have direct
control over the ondisk contents of directory entries, so this method
should be much safer.
If all the dirents have the same ftype, then we can translate that back
into an S_IFMT flag and fix the file. If not, reset the mode to
S_IFREG.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The inode scanner tries to reduce contention on the AGI header buffer
lock by grabbing references to consecutive allocated inodes. Batching
stops as soon as we encounter an unallocated inode. This is unfortunate
because in the worst case performance collapses to the old "one at a
time" behavior if every other inode is free.
This is correct behavior, but we could do better. Unallocated inodes by
definition have nothing to scan, which means the iscan can ignore them
as long as someone ensures that the scan data will reflect another
thread allocating the inode and adding interesting metadata to that
inode. That mechanism is, of course, the live update hooks.
Therefore, extend the batching mechanism to track unallocated inodes
adjacent to the scan cursor. The _want_live_update predicate can tell
the caller's live update hook to incorporate all live updates to what
the scanner thinks is an unallocated inode if (after dropping the AGI)
some other thread allocates one of those inodes and begins using it.
Note that we cannot just copy the ir_free bitmap into the scan cursor
because the batching stops if iget says the inode is in an intermediate
state (e.g. on the inactivation list) and cannot be igrabbed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
After observing xfs_scrub taking forever to rebuild parent pointers on a
pptrs enabled filesystem, I decided to profile what the system was
doing. It turns out that when there are a lot of threads trying to scan
the filesystem, most of our time is spent contending on AGI buffer
locks. Given that we're walking the inobt records anyway, we can often
tell ahead of time when there's a bunch of (up to 64) consecutive inodes
that we could grab all at once.
Do this to amortize the cost of taking the AGI lock across as many
inodes as we possibly can. On the author's system this seems to improve
parallel throughput from barely one and a half cores to slightly
sublinear scaling. The obvious antipattern here of course is where the
freemask has every other bit set (e.g. all 0xA's)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Online directory and parent repairs on parent-pointer equipped
filesystems have shown that starting a large number of parallel iscans
causes a lot of AGI buffer contention. Try to reduce this by making it
so that iscans scan wrap around the end of the filesystem, and using a
rotor to stagger where each scanner begins. Surprisingly, this boosts
CPU utilization (on the author's test machines) from effectively
single-threaded to 160%. Not great, but see the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This patch implements a live file scanner for online fsck functions that
require the ability to walk a filesystem to gather metadata records and
stay informed about metadata changes to files that have already been
visited.
The iscan structure consists of two inode number cursors: one to track
which inode we want to visit next, and a second one to track which
inodes have already been visited. This second cursor is key to
capturing live updates to files previously scanned while the main thread
continues scanning -- any inode greater than this value hasn't been
scanned and can go on its way; any other update must be incorporated
into the collected data. It is critical for the scanning thraad to hold
exclusive access on the inode until after marking the inode visited.
This new code is a separate patch from the patchsets adding callers for
the sake of enabling the author to move patches around his tree with
ease. The intended usage model for this code is roughly:
xchk_iscan_start(iscan, 0, 0);
while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
xfs_ilock(ip, ...);
/* capture inode metadata */
xchk_iscan_mark_visited(iscan, ip);
xfs_iunlock(ip, ...);
xfs_irele(ip);
}
xchk_iscan_stop(iscan);
if (error)
return error;
Hook functions for live updates can then do:
if (xchk_iscan_want_live_update(...))
/* update the captured inode metadata */
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
These functions aren't used anymore, so get rid of them.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Convert xfarray_pagesort to handle large folios by introducing a new
xfile_get_folio routine that can return a folio of arbitrary size, and
using heapsort on the full folio. This also corrects an off-by-one bug
in the calculation of len in xfarray_pagesort that was papered over by
xfarray_want_pagesort.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Add helper similar to file_{get,set}_page, but which deal with folios
and don't allocate new folio unless explicitly asked to, which map
to shmem_get_folio instead of calling into the aops.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
All current and pending xfile users use the xfile_obj_load
and xfile_obj_store API, so make those the actual implementation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
vfs_getattr is needed to query inode attributes for unknown underlying
file systems. But shmemfs is well known for users of shmem_file_setup
and shmem_read_mapping_page_gfp that rely on it not needing specific
inode revalidation and having a normal mapping. Remove the detour
through the getattr method and an extra wrapper, and just read the
inode size and i_bytes directly in the scrub tracing code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Upon a closer inspection of the quota record scrubber, I noticed that
dqiterate wasn't actually walking all possible dquots for the mapped
blocks in the quota file. This is due to xfs_qm_dqget_next skipping all
XFS_IS_DQUOT_UNINITIALIZED dquots.
For a fsck program, we really want to look at all the dquots, even if
all counters and limits in the dquot record are zero. Rewrite the
implementation to do this, as well as switching to an iterator paradigm
to reduce the number of indirect calls.
This enables removal of the old broken dqiterate code from xfs_dquot.c.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Try to repair errors that we see in file CoW forks so that we don't do
stupid things like remap garbage into a file. There's not a lot we can
do with the COW fork -- the ondisk metadata record only that the COW
staging extents are owned by the refcount btree, which effectively means
that we can't reconstruct this incore structure from scratch.
Actually, this is even worse -- we can't touch written extents, because
those map space that are actively under writeback, and there's not much
to do with delalloc reservations. Hence we can only detect crosslinked
unwritten extents and fix them by punching out the problematic parts and
replacing them with delalloc extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true. Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If an inode is so badly damaged that it cannot be loaded into the cache,
fix the ondisk metadata and try again. If there /is/ a cached inode,
fix any problems and apply any optimizations that can be solved incore.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use the rmapbt to find inode chunks, query the chunks to compute hole
and free masks, and with that information rebuild the inobt and finobt.
Refer to the case study in
Documentation/filesystems/xfs-online-fsck-design.rst for more details.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Rebuild the free space btrees from the gaps in the rmap btree. Refer to
the case study in Documentation/filesystems/xfs-online-fsck-design.rst
for more details.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.
As for the particular choice of lowspace thresholds and btree block
slack factors -- at this point one could say that the thresholds in
online repair come from bulkload_estimate_ag_slack in xfs_repair[1].
But that's not the entire story, since the offline btree rebuilding
code in xfs_repair was merged as a retroport of the online btree code
in this patchset!
Before xfs_btree_staging.[ch] came along, xfs_repair determined the
slack factor (aka the number of slots to leave unfilled in each new
btree block) via open-coded logic in repair/phase5.c[2]. At that point
the slack factors were arbitrary quantities per btree. The rmapbt
automatically left 10 slots free; everything else left zero.
That had a noticeable effect on performance straight after mounting
because adding records to /any/ btree would result in splits. A few
years ago when this patch was first written, Dave and I decided that
repair should generate btree blocks that were 75% full unless space was
tight, in which case it should try to fill the blocks to nearly full.
We defined tight as ~10% free to avoid repair failures but settled on
3/32 (~9%) to avoid div64.
IOWs, we mostly pulled the thresholds out of thin air. We've been
QAing with those geometry numbers ever since. ;)
Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/bulkload.c?h=v6.5.0#n114
Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/phase5.c?h=v4.19.0#n1349
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create get and set functions for rtsummary words so that we can redefine
the ondisk format with a specific endianness. Note that this requires
the definition of a distinct type for ondisk summary info words so that
the compiler can perform proper typechecking.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Further disambiguate the xfs_rtblock_t uses by creating a new type,
xfs_rtxnum_t, to store the position of an extent within the realtime
section, in units of rtextents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
XFS uses xfs_rtblock_t for many different uses, which makes it much more
difficult to perform a unit analysis on the codebase. One of these
(ab)uses is when we need to store the length of a free space extent as
stored in the realtime bitmap. Because there can be up to 2^64 realtime
extents in a filesystem, we need a new type that is larger than
xfs_rtxlen_t for callers that are querying the bitmap directly. This
means scrub and growfs.
Create this type as "xfs_rtbxlen_t" and use it to store 64-bit rtx
lengths. 'b' stands for 'bitmap' or 'big'; reader's choice.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Chandan Babu will be taking over as the XFS release manager. He has
reviewed all the patches that are in this branch, though I'm signing
the branch one last time since I'm still technically maintainer. :P
* Create a maintainer entry profile for XFS in which we lay out the
various roles that I have played for many years. Aside from release
manager, the remaining roles are as yet unfilled.
* Start merging online repair -- we now have in-memory pageable memory
for staging btrees, a bunch of pending fixes, and we've started the
process of refactoring the scrub support code to support more of
repair. In particular, reaping of old blocks from damaged structures.
* Scrub the realtime summary file.
* Fix a bug where scrub's quota iteration only ever returned the root
dquot. Oooops.
* Fix some typos.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZOQE2AAKCRBKO3ySh0YR
pvmZAQDe+KceaVx6Dv2f9ihckeS2dILSpDTo1bh9BeXnt005VwD/ceHTaJxEl8lp
u/dixFDkRgp9RYtoTAK2WNiUxYetsAc=
=oZN6
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Chandan Babu:
- Chandan Babu will be taking over as the XFS release manager. He has
reviewed all the patches that are in this branch, though I'm signing
the branch one last time since I'm still technically maintainer. :P
- Create a maintainer entry profile for XFS in which we lay out the
various roles that I have played for many years. Aside from release
manager, the remaining roles are as yet unfilled.
- Start merging online repair -- we now have in-memory pageable memory
for staging btrees, a bunch of pending fixes, and we've started the
process of refactoring the scrub support code to support more of
repair. In particular, reaping of old blocks from damaged structures.
- Scrub the realtime summary file.
- Fix a bug where scrub's quota iteration only ever returned the root
dquot. Oooops.
- Fix some typos.
[ Pull request from Chandan Babu, but signed tag and description from
Darrick Wong, thus the first person singular above is Darrick, not
Chandan ]
* tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (37 commits)
fs/xfs: Fix typos in comments
xfs: fix dqiterate thinko
xfs: don't check reflink iflag state when checking cow fork
xfs: simplify returns in xchk_bmap
xfs: rewrite xchk_inode_is_allocated to work properly
xfs: hide xfs_inode_is_allocated in scrub common code
xfs: fix agf_fllast when repairing an empty AGFL
xfs: allow userspace to rebuild metadata structures
xfs: clear pagf_agflreset when repairing the AGFL
xfs: allow the user to cancel repairs before we start writing
xfs: don't complain about unfixed metadata when repairs were injected
xfs: implement online scrubbing of rtsummary info
xfs: always rescan allegedly healthy per-ag metadata after repair
xfs: move the realtime summary file scrubber to a separate source file
xfs: wrap ilock/iunlock operations on sc->ip
xfs: get our own reference to inodes that we want to scrub
xfs: track usage statistics of online fsck
xfs: improve xfarray quicksort pivot
xfs: create scaffolding for creating debugfs entries
xfs: cache pages used for xfarray quicksort convergence
...
Back in the mists of time[1], I proposed this function to assist the
inode btree scrubbers in checking the inode btree contents against the
allocation state of the inode records. The original version performed a
direct lookup in the inode cache and returned the allocation status if
the cached inode hadn't been reused and wasn't in an intermediate state.
Brian thought it would be better to use the usual iget/irele mechanisms,
so that was changed for the final version.
Unfortunately, this hasn't aged well -- the IGET_INCORE flag only has
one user and clutters up the regular iget path, which makes it hard to
reason about how it actually works. Worse yet, the inode inactivation
series silently broke it because iget won't return inodes that are
anywhere in the inactivation machinery, even though the caller is
already required to prevent inode allocation and freeing. Inodes in the
inactivation machinery are still allocated, but the current code's
interactions with the iget code prevent us from being able to say that.
Now that I understand the inode lifecycle better than I did in early
2017, I now realize that as long as the cached inode hasn't been reused
and isn't actively being reclaimed, it's safe to access the i_mode field
(with the AGI, rcu, and i_flags locks held), and we don't need to worry
about the inode being freed out from under us.
Therefore, port the original version to modern code structure, which
fixes the brokennes w.r.t. inactivation. In the next patch we'll remove
IGET_INCORE since it's no longer necessary.
[1] https://lore.kernel.org/linux-xfs/149643868294.23065.8094890990886436794.stgit@birch.djwong.org/
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Add a new (superuser-only) flag to the online metadata repair ioctl to
force it to rebuild structures, even if they're not broken. We will use
this to move metadata structures out of the way during a free space
defragmentation operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Finish the realtime summary scrubber by adding the functions we need to
compute a fresh copy of the rtsummary info and comparing it to the copy
on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If all the records in an xfarray subset live within the same memory
page, we can short-circuit even more quicksort recursion by mapping that
page into the local CPU and using the kernel's heapsort function to sort
the subset. On the author's computer, this reduces the runtime by
another 15% on a 500,000 element array.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Certain xfile array operations (such as sorting) can be sped up quite a
bit by allowing xfile users to grab a page to bulk-read the records
contained within it. Create helper methods to facilitate this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
In the previous patch, we created a very basic quicksort implementation
for xfile arrays. While the use of an alternate sorting algorithm to
avoid quicksort recursion on very small subsets reduces the runtime
modestly, we could do better than a load and store-heavy insertion sort,
particularly since each load and store requires a page mapping lookup in
the xfile.
For a small increase in kernel memory requirements, we could instead
bulk load the xfarray records into memory, use the kernel's existing
heapsort implementation to sort the records, and bulk store the memory
buffer back into the xfile. On the author's computer, this reduces the
runtime by about 5% on a 500,000 element array.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The btree bulk loading code requires that records be provided in the
correct record sort order for the given btree type. In general, repair
code cannot be required to collect records in order, and it is not
feasible to insert new records in the middle of an array to maintain
sort order.
Implement a sorting algorithm so that we can sort the records just prior
to bulk loading. In principle, an xfarray could consume many gigabytes
of memory and its backing pages can be sent out to disk at any time.
This means that we cannot map the entire array into memory at once, so
we must find a way to divide the work into smaller portions (e.g. a
page) that /can/ be mapped into memory.
Quicksort seems like a reasonable fit for this purpose, since it uses a
divide and conquer strategy to keep its average runtime logarithmic.
The solution presented here is a port of the glibc implementation, which
itself is derived from the median-of-three and tail call recursion
strategies outlined by Sedgwick.
Subsequent patches will optimize the implementation further by utilizing
the kernel's heapsort on directly-mapped memory whenever possible, and
improving the quicksort pivot selection algorithm to try to avoid O(n^2)
collapses.
Note: The sorting functionality gets its own patch because the basic big
array mechanisms were plenty for a single code patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort.
Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently lead to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.
Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages. Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When we're freeing extents that have been set in a bitmap, break the
bitmap extent into multiple sub-extents organized by fate, and reap the
extents. This enables us to dispose of old resources more efficiently
than doing them block by block.
While we're at it, rename the reaping functions to make it clear that
they're reaping per-AG extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Rearrange the logic inside xrep_reap_block to make it more obvious that
crosslinked metadata blocks are handled differently. Add a couple of
tracepoints so that we can tell what's going on at the end of a btree
rebuild operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
These two functions date from the era when I thought that we could
rebuild btrees by creating an alternate root and adding records one by
one. In other words, they predate the btree bulk loader. They're not
necessary now, so remove them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
If the fscounters scrubber notices incorrect summary counters, it's
entirely possible that scrub is simply racing with other threads that
are updating the incore counters. There isn't a good way to stabilize
percpu counters or set ourselves up to observe live updates with hooks
like we do for the quotacheck or nlinks scanners, so we instead choose
to freeze the filesystem long enough to walk the incore per-AG
structures.
Past me thought that it was going to be commonplace to have to freeze
the filesystem to perform some kind of repair and set up a whole
separate infrastructure to freeze the filesystem in such a way that
userspace could not unfreeze while we were running. This involved
adding a mutex and freeze_super/thaw_super functions and dealing with
the fact that the VFS freeze/thaw functions can free the VFS superblock
references on return.
This was all very overwrought, since fscounters turned out to be the
only user of scrub freezes, and it doesn't require the log to quiesce,
only the incore superblock counters. We prevent other threads from
changing the freeze level by calling freeze_super_excl with a custom
freeze cookie to keep everyone else out of the filesystem.
The end result is that fscounters should be much more efficient. When
we're checking a busy system and we can't stabilize the counters, the
custom freeze will do less work, which should result in less downtime.
Repair should be similarly speedy, but that's in a later patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
The fscounters scrub code doesn't work properly because it cannot
quiesce updates to the percpu counters in the filesystem, hence it
returns false corruption reports. This has been fixed properly in
one of the online repair patchsets that are under review by replacing
the xchk_disable_reaping calls with an exclusive filesystem freeze.
Disabling background gc isn't sufficient to fix the problem.
In other words, scrub doesn't need to call xfs_inodegc_stop, which is
just as well since it wasn't correct to allow scrub to call
xfs_inodegc_start when something else could be calling xfs_inodegc_stop
(e.g. trying to freeze the filesystem).
Neuter the scrubber for now, and remove the xchk_*_reaping functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
In the previous patch, we added jump labels to the intent drain code so
that regular filesystem operations need not pay the price of checking
for someone (scrub) waiting on intents to drain from some part of the
filesystem when that someone isn't running.
However, I observed that xfs/285 now spends a lot more time pushing the
AIL from the inode btree scrubber than it used to. This is because the
inobt scrubber will try push the AIL to try to get logged inode cores
written to the filesystem when it sees a weird discrepancy between the
ondisk inode and the inobt records. This AIL push is triggered when the
setup function sees TRY_HARDER is set; and the requisite EDEADLOCK
return is initiated when the discrepancy is seen.
The solution to this performance slow down is to use a different result
code (ECHRNG) for scrub code to signal that it needs to wait for
deferred intent work items to drain out of some part of the filesystem.
When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that
setup functions will activate the jump label.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>