linux/fs/xfs/libxfs
Darrick J. Wong 7dd73802f9 xfs, iomap: fix data corruption due to stale cached iomaps

Merge tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB

xfs, iomap: fix data corruption due to stale cached iomaps

This patch series fixes a data corruption that occurs in a specific
multi-threaded write workload. The workload combines racing
unaligned adjacent buffered writes with low memory conditions that
cause both writeback and memory reclaim to race with the writes.

The result of this was random partial blocks containing zeroes
instead of the correct data.  The underlying problem is that iomap
caches the write iomap for the duration of the write() operation,
but it fails to take into account that the extent underlying the
iomap can change whilst the write is in progress.

The short story is that an iomap can span multiple folios, and so
under low memory, writeback can be cleaning folios that the write()
overlaps. Whilst the overlapping data is cached in memory, this
isn't a problem, but because the folios are now clean they can be
reclaimed. Once reclaimed, the write() does the wrong thing when
re-instantiating partial folios because the iomap no longer reflects
the underlying state of the extent. e.g. it thinks the extent is
unwritten, so it zeroes the partial range, when in fact the
underlying extent is now written and so it should have read the data
from disk.  This is how we get random zero ranges in the file
instead of the correct data.
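
To make the failure concrete, the zero-vs-read decision when a
partial folio is re-instantiated looks roughly like the sketch below
(condensed from __iomap_write_begin(); treat names and details as
illustrative, not verbatim):

    /*
     * Sketch: the write path trusts the cached mapping type.  If the
     * stale iomap still says the extent is a hole or unwritten, the
     * uncached part of the folio is zeroed; only a MAPPED extent
     * causes the data to be read back from disk.
     */
    if (srcmap->type != IOMAP_MAPPED || (srcmap->flags & IOMAP_F_NEW)) {
            folio_zero_segments(folio, poff, from, to, poff + plen);
    } else {
            status = iomap_read_folio_sync(block_start, folio,
                            poff, plen, srcmap);
            if (status)
                    return status;
    }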

The gory details of the race condition can be found here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

Fixing the problem has two aspects. The first is ensuring that
iomap can detect a stale cached iomap during a write in a race-free
manner. We already do this stale iomap
detection in the writeback path, so we have a mechanism for
detecting that the iomap backing the data range may have changed
and needs to be remapped.
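
For reference, the detection mechanism is a sequence-number
comparison: the filesystem samples an extent-tree change counter
when it builds the iomap and re-checks it later. A sketch of the
XFS side (validity_cookie and the helper name follow the series,
but take the details as illustrative):

    /*
     * Sketch: the cookie sampled at mapping time must still match
     * the inode fork sequence number when the iomap is about to be
     * used.  Any extent tree modification bumps the sequence and so
     * invalidates cached iomaps.
     */
    static bool
    xfs_iomap_valid(struct inode *inode, const struct iomap *iomap)
    {
            return iomap->validity_cookie ==
                    xfs_iomap_inode_sequence(XFS_I(inode), iomap->flags);
    }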

In the case of the write() path, we have to ensure that the iomap is
validated at a point in time when the page cache is stable and
cannot be reclaimed from under us. We also need to validate the
extent before we start performing any modifications to the folio
state or contents. Combine these two requirements together, and the
only "safe" place to validate the iomap is after we have looked up
and locked the folio we are going to copy the data into, but before
we've performed any initialisation operations on that folio.

If the iomap fails validation, we then mark it stale, unlock the
folio and end the write. This effectively means a stale iomap
results in a short write. Filesystems should already be able to
handle this, as write operations can end short for many reasons and
need to iterate through another mapping cycle to be completed. Hence
the iomap changes needed to detect and handle stale iomaps during
write() operations are relatively simple...
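
Put together, the check sits in iomap_write_begin() between taking
the folio lock and touching the folio, roughly like this (a
condensed sketch, not the exact diff):

    /*
     * Sketch: validate the cached iomap only once the folio is
     * looked up and locked, before any folio state or contents are
     * modified.  A failed check marks the iomap stale and ends the
     * write short; the caller drops out, remaps, and retries.
     */
    if (page_ops && page_ops->iomap_valid &&
        !page_ops->iomap_valid(iter->inode, &iter->iomap)) {
            iter->iomap.flags |= IOMAP_F_STALE;
            folio_unlock(folio);
            folio_put(folio);
            return 0;       /* short write, not an error */
    }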

However, the assumption is that filesystems should already be able
to handle write failures safely, and that's where the second
(first?) part of the problem exists. That is, handling a partial
write is harder than just "punching out the unused delayed
allocation extent". This is because mmap() based faults can race
with writes, and if they land in the delalloc region that the write
allocated, then punching out the delalloc region can cause data
corruption.

This data corruption problem is exposed by generic/346 when iomap is
converted to detect stale iomaps during write() operations. Hence
write failure handling in the filesystem needs to account for the
fact that the write() in progress doesn't necessarily own the data
in the page cache over the range of the delalloc extent it just
allocated.

As a result, we can't just truncate the page cache over the range
the write() didn't reach and punch out all of the delalloc extent.
We have
to walk the page cache over the untouched range and skip over any
dirty data region in the cache in that range. Which is ....
non-trivial.

That is, iterating the page cache has to handle partially populated
folios (i.e. block size < page size) that contain data. The data
might be discontiguous within a folio. Indeed, there might be
*multiple* discontiguous data regions within a single folio. And to
make matters more complex, multi-page folios mean we just don't know
how many sub-folio regions we might have to iterate to find all
these regions. All the corner cases between the conversions and
rounding between filesystem block size, folio size and multi-page
folio size combined with unaligned write offsets kept breaking my
brain.

However, if we convert the code to track the processed write
regions by byte ranges instead of filesystem blocks or page cache
indexes, we can simply use mapping_seek_hole_data() to find
the start and end of each discrete data region within the range we
needed to scan. SEEK_DATA finds the start of the cached data region,
SEEK_HOLE finds the end of the region. These are byte based
interfaces that understand partially uptodate folio regions, and so
can iterate discrete sub-folio data regions directly. This largely
solved the problem of discovering the dirty regions we need to keep
the delalloc extent over.
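
Concretely, the outer scan loop ends up looking something like the
sketch below. scan_data_region() is a hypothetical stand-in for the
per-folio walk in the series, which punches delalloc under clean
folios and preserves it under dirty or writeback folios:

    while (start_byte < scan_end_byte) {
            loff_t data_end;

            /* find the start of the next cached data region */
            start_byte = mapping_seek_hole_data(inode->i_mapping,
                            start_byte, scan_end_byte, SEEK_DATA);
            if (start_byte == -ENXIO || start_byte == scan_end_byte)
                    break;
            /* find where that data region ends */
            data_end = mapping_seek_hole_data(inode->i_mapping,
                            start_byte, scan_end_byte, SEEK_HOLE);

            error = scan_data_region(inode, &punch_start_byte,
                            start_byte, data_end, punch);
            if (error)
                    break;
            start_byte = data_end;
    }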

However, to use mapping_seek_hole_data() without needing to export
it, we have to move all the delalloc extent cleanup to the iomap
core and so now the iomap core can clean up delayed allocation
extents in a safe, sane and filesystem neutral manner.
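
After the move, the filesystem's involvement shrinks to supplying a
punch callback to the iomap core, roughly as follows (helper names
follow the series; the details are illustrative):

    /*
     * Sketch: XFS only provides the mechanism to punch a delalloc
     * byte range; the iomap core decides which ranges to punch.
     */
    static int
    xfs_buffered_write_delalloc_punch(struct inode *inode,
                    loff_t offset, loff_t length)
    {
            return xfs_bmap_punch_delalloc_range(XFS_I(inode), offset,
                            offset + length);
    }

    /* ...and the write failure path in ->iomap_end becomes: */
    error = iomap_file_buffered_write_punch_delalloc(inode, iomap,
                    offset, length, written,
                    &xfs_buffered_write_delalloc_punch);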

With all this done, the original data corruption no longer occurs,
and we now have a generic mechanism for ensuring that page
cache writes do not do the wrong thing when writeback and reclaim
change the state of the physical extent and/or page cache contents
whilst the write is in progress.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: drop write error injection is unfixable, remove it
  xfs: use iomap_valid method to detect stale cached iomaps
  iomap: write iomap validity checks
  xfs: xfs_bmap_punch_delalloc_range() should take a byte range
  iomap: buffered write failure should not truncate the page cache
  xfs,iomap: move delalloc punching to iomap
  xfs: use byte ranges for write cleanup ranges
  xfs: punching delalloc extents on write failure is racy
  xfs: write page faults in iomap are not buffered writes
2022-11-28 17:23:58 -08:00
xfs_ag.c xfs: double link the unlinked inode list 2022-07-14 11:46:43 +10:00
xfs_ag.h xfs: create a predicate to verify per-AG extents 2022-10-31 08:58:20 -07:00
xfs_ag_resv.c xfs: pass perag to xfs_alloc_read_agf() 2022-07-07 19:07:40 +10:00
xfs_ag_resv.h xfs: move perag structure and setup to libxfs/xfs_ag.[ch] 2021-06-02 10:48:24 +10:00
xfs_alloc.c xfs: create a predicate to verify per-AG extents 2022-10-31 08:58:20 -07:00
xfs_alloc.h xfs: pass perag to xfs_alloc_read_agfl 2022-07-07 19:08:15 +10:00
xfs_alloc_btree.c xfs: pass perag to xfs_alloc_put_freelist 2022-07-07 19:08:08 +10:00
xfs_alloc_btree.h xfs: use separate btree cursor cache for each btree type 2021-10-19 11:45:16 -07:00
xfs_attr.c xfs: replace XFS_IFORK_Q with a proper predicate function 2022-07-12 11:17:27 -07:00
xfs_attr.h xfs: replace XFS_IFORK_Q with a proper predicate function 2022-07-12 11:17:27 -07:00
xfs_attr_leaf.c xfs: don't leak memory when attr fork loading fails 2022-07-20 16:40:39 -07:00
xfs_attr_leaf.h xfs: don't hold xattr leaf buffers across transaction rolls 2022-06-29 08:47:56 -07:00
xfs_attr_remote.c xfs: rework xfs_buf_incore() API 2022-07-07 22:05:18 +10:00
xfs_attr_remote.h xfs: rename struct xfs_attr_item to xfs_attr_intent 2022-05-22 16:00:26 +10:00
xfs_attr_sf.h xfs: Convert xfs_attr_sf macros to inline functions 2020-09-15 20:52:42 -07:00
xfs_bit.c xfs: fix missing header includes 2019-11-07 13:00:53 -08:00
xfs_bit.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_bmap.c xfs: use iomap_valid method to detect stale cached iomaps 2022-11-29 09:09:17 +11:00
xfs_bmap.h xfs: convert bmapi flags to unsigned. 2022-04-21 10:46:09 +10:00
xfs_bmap_btree.c xfs: replace inode fork size macros with functions 2022-07-12 11:17:27 -07:00
xfs_bmap_btree.h xfs: use separate btree cursor cache for each btree type 2021-10-19 11:45:16 -07:00
xfs_btree.c xfs: convert XFS_IFORK_PTR to a static inline helper 2022-07-09 15:17:21 -07:00
xfs_btree.h xfs: convert btree buffer log flags to unsigned. 2022-04-21 10:46:33 +10:00
xfs_btree_staging.c xfs: encode the max btree height in the cursor 2021-10-19 11:45:15 -07:00
xfs_btree_staging.h xfs: xfs_btree_staging.h: delete duplicated words 2020-07-28 20:24:14 -07:00
xfs_cksum.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
xfs_da_btree.c xfs: trim the mapp array accordingly in xfs_da_grow_inode_int 2022-10-04 16:39:42 +11:00
xfs_da_btree.h xfs: fix TOCTOU race involving the new logged xattrs control knob 2022-06-15 23:13:32 -07:00
xfs_da_format.h Merge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux into xfs-5.19-for-next 2022-04-21 16:46:17 +10:00
xfs_defer.c xfs: share xattr name and value buffers when logging xattr updates 2022-05-23 08:43:46 +10:00
xfs_defer.h xfs: Implement attr logging and replay 2022-05-09 19:09:07 +10:00
xfs_dir2.c xfs: rearrange the logic and remove the broken comment for xfs_dir2_isxx 2022-10-04 16:39:58 +11:00
xfs_dir2.h xfs: rearrange the logic and remove the broken comment for xfs_dir2_isxx 2022-10-04 16:39:58 +11:00
xfs_dir2_block.c xfs: replace inode fork size macros with functions 2022-07-12 11:17:27 -07:00
xfs_dir2_data.c xfs: convert bp->b_bn references to xfs_buf_daddr() 2021-08-19 10:07:15 -07:00
xfs_dir2_leaf.c xfs: fix exception caused by unexpected illegal bestcount in leaf dir 2022-10-20 09:42:56 -07:00
xfs_dir2_node.c xfs: convert bp->b_bn references to xfs_buf_daddr() 2021-08-19 10:07:15 -07:00
xfs_dir2_priv.h xfs: constify the name argument to various directory functions 2022-03-14 10:23:17 -07:00
xfs_dir2_sf.c xfs: Remove the unneeded result variable 2022-09-19 06:52:14 +10:00
xfs_dquot_buf.c xfs: remove the xfs_dqblk_t typedef 2021-10-14 09:19:33 -07:00
xfs_errortag.h xfs: drop write error injection is unfixable, remove it 2022-11-29 09:09:17 +11:00
xfs_format.h xfs: rename XFS_REFC_COW_START to _COWFLAG 2022-10-31 08:58:22 -07:00
xfs_fs.h Merge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux into xfs-5.19-for-next 2022-04-21 16:46:17 +10:00
xfs_health.h xfs: Use the correct style for SPDX License Identifier 2020-05-13 15:32:45 -07:00
xfs_ialloc.c treewide: use get_random_u32() when possible 2022-10-11 17:42:58 -06:00
xfs_ialloc.h xfs: pass perag to xfs_read_agi 2022-07-07 19:07:47 +10:00
xfs_ialloc_btree.c xfs: make is_log_ag() a first class helper 2022-07-07 19:13:21 +10:00
xfs_ialloc_btree.h xfs: use separate btree cursor cache for each btree type 2021-10-19 11:45:16 -07:00
xfs_iext_tree.c xfs: prevent metadata files from being inactivated 2021-03-25 16:47:50 -07:00
xfs_inode_buf.c xfs: make attr forks permanent 2022-07-14 09:46:37 -07:00
xfs_inode_buf.h xfs: kill xfs_sb_version_has_v3inode() 2021-08-19 10:07:14 -07:00
xfs_inode_fork.c xfs: clean up "%Ld/%Lu" which doesn't meet C standard 2022-09-19 06:47:14 +10:00
xfs_inode_fork.h xfs: replace inode fork size macros with functions 2022-07-12 11:17:27 -07:00
xfs_log_format.h xfs: refactor all the EFI/EFD log item sizeof logic 2022-10-31 08:58:20 -07:00
xfs_log_recover.h xfs: convert buf_cancel_table allocation to kmalloc_array 2022-05-27 10:27:19 +10:00
xfs_log_rlimit.c xfs: reduce transaction reservations with reflink 2022-04-28 10:25:42 -07:00
xfs_quota_defs.h xfs: remove warning counters from struct xfs_dquot_res 2022-05-11 17:12:09 +10:00
xfs_refcount.c xfs: rename XFS_REFC_COW_START to _COWFLAG 2022-10-31 08:58:22 -07:00
xfs_refcount.h xfs: rename XFS_REFC_COW_START to _COWFLAG 2022-10-31 08:58:22 -07:00
xfs_refcount_btree.c xfs: track cow/shared record domains explicitly in xfs_refcount_irec 2022-10-31 08:58:21 -07:00
xfs_refcount_btree.h xfs: use separate btree cursor cache for each btree type 2021-10-19 11:45:16 -07:00
xfs_rmap.c xfs: create a predicate to verify per-AG extents 2022-10-31 08:58:20 -07:00
xfs_rmap.h xfs: speed up write operations by using non-overlapped lookups when possible 2022-04-28 10:24:38 -07:00
xfs_rmap_btree.c xfs: make is_log_ag() a first class helper 2022-07-07 19:13:21 +10:00
xfs_rmap_btree.h xfs: use separate btree cursor cache for each btree type 2021-10-19 11:45:16 -07:00
xfs_rtbitmap.c xfs: pass explicit mount pointer to rtalloc query functions 2022-04-12 06:49:41 +10:00
xfs_sb.c xfs: fix sb write verify for lazysbcount 2022-11-16 19:20:20 -08:00
xfs_sb.h xfs: open code sb verifier feature checks 2021-08-19 10:07:13 -07:00
xfs_shared.h xfs: tag transactions that contain intent done items 2022-05-04 11:46:21 +10:00
xfs_symlink_remote.c xfs: convert XFS_IFORK_PTR to a static inline helper 2022-07-09 15:17:21 -07:00
xfs_trans_inode.c xfs: convert xfs_sb_version_has checks to use mount features 2021-08-19 10:07:14 -07:00
xfs_trans_resv.c xfs: increase rename inode reservation 2022-10-26 13:02:24 -07:00
xfs_trans_resv.h xfs: rename xfs_*alloc*_log_count to _block_count 2022-04-28 10:25:59 -07:00
xfs_trans_space.h xfs: compute the maximum height of the rmap btree when reflink enabled 2021-10-19 11:45:16 -07:00
xfs_types.c xfs: Pre-calculate per-AG agino geometry 2022-07-07 19:13:10 +10:00
xfs_types.h xfs: report refcount domain in tracepoints 2022-10-31 08:58:21 -07:00