1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

2528 commits

Author SHA1 Message Date
Christoph Hellwig
53ffb30a78 btrfs: don't create inline extents in fallback_to_cow
For NOCOW files, run_delalloc_nocow can still fall back to COW
allocations when required and calls to fallback_to_cow helper for
that.  For such an allocation we can have multiple ordered_extents
for existing extents that NOCOW overwrites and new allocations that
fallback_to_cow creates.  If one of the new extents is an inline
extent, the writepages could would have to avoid normal page writeback
for them as indicated by the page_started return argument, which
run_delalloc_nocow can't return.   Fix this by never creating inline
extents from fallback_to_cow.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Christoph Hellwig
ba9145add5 btrfs: pass a flags argument to cow_file_range
The int used as bool unlock is not a very good way to describe the
behavior, and the next patch will have to add another behavior modifier.
We'll do that by two bool parameters instead of adding bit flags.  Now
specifies that the pages should always be kept locked.  This is the
inverse of the old unlock argument.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch flags to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:14 +02:00
Boris Burkov
e28b02118b btrfs: free qgroup rsv on io failure
If we do a write whose bio suffers an error, we will never reclaim the
qgroup reserved space for it. We allocate the space in the write_iter
codepath, then release the reservation as we allocate the ordered
extent, but we only create a delayed ref if the ordered extent finishes.
If it has an error, we simply leak the rsv. This is apparent in running
any error injecting (dmerror) fstests like btrfs/146 or btrfs/160. Such
tests fail due to dmesg on umount complaining about the leaked qgroup
data space.

When we clean up other aspects of space on failed ordered_extents, also
free the qgroup rsv.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:13 +02:00
Goldwyn Rodrigues
75d305c55b btrfs: remove duplicate free_async_extent_pages() on reservation error
While performing compressed writes, if the extent reservation fails, the
async extent pages are first freed in the error check for return value
ret, and then again at out_free label.

Remove the first call to free_async_extent_pages().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:13 +02:00
Filipe Manana
9b378f6ad4 btrfs: fix infinite directory reads
The readdir implementation currently processes always up to the last index
it finds. This however can result in an infinite loop if the directory has
a large number of entries such that they won't all fit in the given buffer
passed to the readdir callback, that is, dir_emit() returns a non-zero
value. Because in that case readdir() will be called again and if in the
meanwhile new directory entries were added and we still can't put all the
remaining entries in the buffer, we keep repeating this over and over.

The following C program and test script reproduce the problem:

  $ cat /mnt/readdir_prog.c
  #include <sys/types.h>
  #include <dirent.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
    DIR *dir = opendir(".");
    struct dirent *dd;

    while ((dd = readdir(dir))) {
      printf("%s\n", dd->d_name);
      rename(dd->d_name, "TEMPFILE");
      rename("TEMPFILE", dd->d_name);
    }
    closedir(dir);
  }

  $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV &> /dev/null
  #mkfs.xfs -f $DEV &> /dev/null
  #mkfs.ext4 -F $DEV &> /dev/null

  mount $DEV $MNT

  mkdir $MNT/testdir
  for ((i = 1; i <= 2000; i++)); do
      echo -n > $MNT/testdir/file_$i
  done

  cd $MNT/testdir
  /mnt/readdir_prog

  cd /mnt

  umount $MNT

This behaviour is surprising to applications and it's unlike ext4, xfs,
tmpfs, vfat and other filesystems, which always finish. In this case where
new entries were added due to renames, some file names may be reported
more than once, but this varies according to each filesystem - for example
ext4 never reported the same file more than once while xfs reports the
first 13 file names twice.

So change our readdir implementation to track the last index number when
opendir() is called and then make readdir() never process beyond that
index number. This gives the same behaviour as ext4.

Reported-by: Rob Landley <rob@landley.net>
Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-14 16:17:37 +02:00
Jeff Layton
913e99287b
fs: drop the timespec64 argument from update_time
Now that all of the update_time operations are prepared for it, we can
drop the timespec64 argument from the update_time operation. Do that and
remove it from some associated functions like inode_update_time and
inode_needs_update_time.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11 09:04:57 +02:00
Christoph Hellwig
12b2d64e59 btrfs: properly clear end of the unreserved range in cow_file_range
When the call to btrfs_reloc_clone_csums in cow_file_range returns an
error, we jump to the out_unlock label with the extent_reserved variable
set to false.   The cleanup at the label will then call
extent_clear_unlock_delalloc on the range from start to end.  But we've
already added cur_alloc_size to start before the jump, so there might no
range be left from the newly incremented start to end.  Move the check for
'start < end' so that it is reached by also for the !extent_reserved case.

CC: stable@vger.kernel.org # 6.1+
Fixes: a315e68f6e ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10 17:07:10 +02:00
Jeff Layton
bb7cc0a62e btrfs: have it use inode_update_timestamps
In later patches, we're going to drop the "now" argument from the
update_time operation. Have btrfs_update_time use the new
inode_update_timestamps helper to fetch a new timestamp and update it
properly.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-4-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09 08:56:37 +02:00
Jeff Layton
0d72b92883 fs: pass the request_mask to generic_fillattr
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09 08:56:36 +02:00
Christoph Hellwig
7cad645ebf btrfs: fix ordered extent split error handling in btrfs_dio_submit_io
When the call to btrfs_extract_ordered_extent in btrfs_dio_submit_io
fails to allocate memory for a new ordered_extent, it calls into the
btrfs_dio_end_io for error handling.  btrfs_dio_end_io then assumes that
bbio->ordered is set because it is supposed to be at this point, except
for this error handling corner case.  Try to not overload the
btrfs_dio_end_io with error handling of a bio in a non-canonical state,
and instead call btrfs_finish_ordered_extent and iomap_dio_bio_end_io
directly for this error case.

Reported-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
Fixes: b41b6f6937 ("btrfs: use btrfs_finish_ordered_extent to complete direct writes")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:14:01 +02:00
Josef Bacik
17b17fcd6d btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand
While trying to get the subpage blocksize tests running, I hit the
following panic on generic/476

  assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
  kernel BUG at fs/btrfs/subpage.c:229!
  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12
  Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023
  pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : btrfs_subpage_assert+0xbc/0xf0
  lr : btrfs_subpage_assert+0xbc/0xf0
  Call trace:
   btrfs_subpage_assert+0xbc/0xf0
   btrfs_subpage_clear_checked+0x38/0xc0
   btrfs_page_clear_checked+0x48/0x98
   btrfs_truncate_block+0x5d0/0x6a8
   btrfs_cont_expand+0x5c/0x528
   btrfs_write_check.isra.0+0xf8/0x150
   btrfs_buffered_write+0xb4/0x760
   btrfs_do_write_iter+0x2f8/0x4b0
   btrfs_file_write_iter+0x1c/0x30
   do_iter_readv_writev+0xc8/0x158
   do_iter_write+0x9c/0x210
   vfs_iter_write+0x24/0x40
   iter_file_splice_write+0x224/0x390
   direct_splice_actor+0x38/0x68
   splice_direct_to_actor+0x12c/0x260
   do_splice_direct+0x90/0xe8
   generic_copy_file_range+0x50/0x90
   vfs_copy_file_range+0x29c/0x470
   __arm64_sys_copy_file_range+0xcc/0x498
   invoke_syscall.constprop.0+0x80/0xd8
   do_el0_svc+0x6c/0x168
   el0_svc+0x50/0x1b0
   el0t_64_sync_handler+0x114/0x120
   el0t_64_sync+0x194/0x198

This happens because during btrfs_cont_expand we'll get a page, set it
as mapped, and if it's not Uptodate we'll read it.  However between the
read and re-locking the page we could have called release_folio() on the
page, but left the page in the file mapping.  release_folio() can clear
the page private, and thus further down we blow up when we go to modify
the subpage bits.

Fix this by putting the set_page_extent_mapped() after the read.  This
is safe because read_folio() will call set_page_extent_mapped() before
it does the read, and then if we clear page private but leave it on the
mapping we're completely safe re-setting set_page_extent_mapped().  With
this patch I can now run generic/476 without panicing.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:13:58 +02:00
Filipe Manana
866e98a4d9 btrfs: use irq safe locking when running and adding delayed iputs
Running delayed iputs, which never happens in an irq context, needs to
lock the spinlock fs_info->delayed_iput_lock. When finishing bios for
data writes (irq context, bio.c) we call btrfs_put_ordered_extent() which
needs to add a delayed iput and for that it needs to acquire the spinlock
fs_info->delayed_iput_lock. Without disabling irqs when running delayed
iputs we can therefore deadlock on that spinlock. The same deadlock can
also happen when adding an inode to the delayed iputs list, since this
can be done outside an irq context as well.

Syzbot recently reported this, which results in the following trace:

  ================================
  WARNING: inconsistent lock state
  6.4.0-syzkaller-09904-ga507db1d8fdc #0 Not tainted
  --------------------------------
  inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
  btrfs-cleaner/16079 [HC0[0]:SC0[0]:HE1:SE1] takes:
  ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline]
  ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
  {IN-SOFTIRQ-W} state was registered at:
    lock_acquire kernel/locking/lockdep.c:5761 [inline]
    lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
    __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
    _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
    spin_lock include/linux/spinlock.h:350 [inline]
    btrfs_add_delayed_iput+0x128/0x390 fs/btrfs/inode.c:3490
    btrfs_put_ordered_extent fs/btrfs/ordered-data.c:559 [inline]
    btrfs_put_ordered_extent+0x2f6/0x610 fs/btrfs/ordered-data.c:547
    __btrfs_bio_end_io fs/btrfs/bio.c:118 [inline]
    __btrfs_bio_end_io+0x136/0x180 fs/btrfs/bio.c:112
    btrfs_orig_bbio_end_io+0x86/0x2b0 fs/btrfs/bio.c:163
    btrfs_simple_end_io+0x105/0x380 fs/btrfs/bio.c:378
    bio_endio+0x589/0x690 block/bio.c:1617
    req_bio_endio block/blk-mq.c:766 [inline]
    blk_update_request+0x5c5/0x1620 block/blk-mq.c:911
    blk_mq_end_request+0x59/0x680 block/blk-mq.c:1032
    lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
    blk_complete_reqs+0xb3/0xf0 block/blk-mq.c:1110
    __do_softirq+0x1d4/0x905 kernel/softirq.c:553
    run_ksoftirqd kernel/softirq.c:921 [inline]
    run_ksoftirqd+0x31/0x60 kernel/softirq.c:913
    smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
    kthread+0x344/0x440 kernel/kthread.c:389
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
  irq event stamp: 39
  hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] __do_kmem_cache_free mm/slab.c:3558 [inline]
  hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free mm/slab.c:3582 [inline]
  hardirqs last  enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free+0x244/0x370 mm/slab.c:3575
  hardirqs last disabled at (38): [<ffffffff81d5eb5e>] __do_kmem_cache_free mm/slab.c:3553 [inline]
  hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free mm/slab.c:3582 [inline]
  hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free+0x1de/0x370 mm/slab.c:3575
  softirqs last  enabled at (0): [<ffffffff814ac99f>] copy_process+0x227f/0x75c0 kernel/fork.c:2448
  softirqs last disabled at (0): [<0000000000000000>] 0x0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&fs_info->delayed_iput_lock);
    <Interrupt>
      lock(&fs_info->delayed_iput_lock);

   *** DEADLOCK ***

  1 lock held by btrfs-cleaner/16079:
   #0: ffff888107804860 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x103/0x4b0 fs/btrfs/disk-io.c:1463

  stack backtrace:
  CPU: 3 PID: 16079 Comm: btrfs-cleaner Not tainted 6.4.0-syzkaller-09904-ga507db1d8fdc #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
   print_usage_bug kernel/locking/lockdep.c:3978 [inline]
   valid_state kernel/locking/lockdep.c:4020 [inline]
   mark_lock_irq kernel/locking/lockdep.c:4223 [inline]
   mark_lock.part.0+0x1102/0x1960 kernel/locking/lockdep.c:4685
   mark_lock kernel/locking/lockdep.c:4649 [inline]
   mark_usage kernel/locking/lockdep.c:4598 [inline]
   __lock_acquire+0x8e4/0x5e20 kernel/locking/lockdep.c:5098
   lock_acquire kernel/locking/lockdep.c:5761 [inline]
   lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:350 [inline]
   btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
   cleaner_kthread+0x2e5/0x4b0 fs/btrfs/disk-io.c:1478
   kthread+0x344/0x440 kernel/kthread.c:389
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
   </TASK>

So fix this by using spin_lock_irq() and spin_unlock_irq() when running
delayed iputs, and using spin_lock_irqsave() and spin_unlock_irqrestore()
when adding a delayed iput().

Reported-by: syzbot+da501a04be5ff533b102@syzkaller.appspotmail.com
Fixes: ec63b84d46 ("btrfs: add an ordered_extent pointer to struct btrfs_bio")
Link: https://lore.kernel.org/linux-btrfs/000000000000d5c89a05ffbd39dd@google.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:13:10 +02:00
Filipe Manana
cbaee87f2e btrfs: fix iput() on error pointer after error during orphan cleanup
At btrfs_orphan_cleanup(), if we can't find an inode (btrfs_iget() returns
an -ENOENT error pointer), we proceed with 'ret' set to -ENOENT and the
inode pointer set to ERR_PTR(-ENOENT). Later when we proceed to the body
of the following if statement:

    if (ret == -ENOENT || inode->i_nlink) {
        (...)
        trans = btrfs_start_transaction(root, 1);
        if (IS_ERR(trans)) {
            ret = PTR_ERR(trans);
            iput(inode);
            goto out;
        }
        (...)
        ret = btrfs_del_orphan_item(trans, root,
                                    found_key.objectid);
        btrfs_end_transaction(trans);
        if (ret) {
            iput(inode);
            goto out;
        }
        continue;
    }

If we get an error from btrfs_start_transaction() or from the call to
btrfs_del_orphan_item() we end calling iput() against an inode pointer
that has a value of ERR_PTR(-ENOENT), resulting in a crash with the
following trace:

  [876.667] BUG: kernel NULL pointer dereference, address: 0000000000000096
  [876.667] #PF: supervisor read access in kernel mode
  [876.667] #PF: error_code(0x0000) - not-present page
  [876.667] PGD 0 P4D 0
  [876.668] Oops: 0000 [#1] PREEMPT SMP PTI
  [876.668] CPU: 0 PID: 2356187 Comm: mount Tainted: G        W          6.4.0-rc6-btrfs-next-134+ #1
  [876.668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [876.668] RIP: 0010:iput+0xa/0x20
  [876.668] Code: ff ff ff 66 (...)
  [876.669] RSP: 0018:ffffafa9c0c9f9d0 EFLAGS: 00010282
  [876.669] RAX: ffffffffffffffe4 RBX: 000000000009453b RCX: 0000000000000000
  [876.669] RDX: 0000000000000001 RSI: ffffafa9c0c9f930 RDI: fffffffffffffffe
  [876.669] RBP: ffff95c612f3b800 R08: 0000000000000001 R09: ffffffffffffffe4
  [876.670] R10: 00018f2a71010000 R11: 000000000ead96e3 R12: ffff95cb7d6909a0
  [876.670] R13: fffffffffffffffe R14: ffff95c60f477000 R15: 00000000ffffffe4
  [876.670] FS:  00007f5fbe30a840(0000) GS:ffff95ccdfa00000(0000) knlGS:0000000000000000
  [876.670] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [876.671] CR2: 0000000000000096 CR3: 000000055e9f6004 CR4: 0000000000370ef0
  [876.671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [876.671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [876.672] Call Trace:
  [876.744]  <TASK>
  [876.744]  ? __die_body+0x1b/0x60
  [876.744]  ? page_fault_oops+0x15d/0x450
  [876.745]  ? __kmem_cache_alloc_node+0x47/0x410
  [876.745]  ? do_user_addr_fault+0x65/0x8a0
  [876.745]  ? exc_page_fault+0x74/0x170
  [876.746]  ? asm_exc_page_fault+0x22/0x30
  [876.746]  ? iput+0xa/0x20
  [876.746]  btrfs_orphan_cleanup+0x221/0x330 [btrfs]
  [876.746]  btrfs_lookup_dentry+0x58f/0x5f0 [btrfs]
  [876.747]  btrfs_lookup+0xe/0x30 [btrfs]
  [876.747]  __lookup_slow+0x82/0x130
  [876.785]  walk_component+0xe5/0x160
  [876.786]  path_lookupat.isra.0+0x6e/0x150
  [876.786]  filename_lookup+0xcf/0x1a0
  [876.786]  ? mod_objcg_state+0xd2/0x360
  [876.786]  ? obj_cgroup_charge+0xf5/0x110
  [876.787]  ? should_failslab+0xa/0x20
  [876.787]  ? kmem_cache_alloc+0x47/0x450
  [876.787]  vfs_path_lookup+0x51/0x90
  [876.788]  mount_subtree+0x8d/0x130
  [876.788]  btrfs_mount+0x149/0x410 [btrfs]
  [876.788]  ? __kmem_cache_alloc_node+0x47/0x410
  [876.788]  ? vfs_parse_fs_param+0xc0/0x110
  [876.789]  legacy_get_tree+0x24/0x50
  [876.834]  vfs_get_tree+0x22/0xd0
  [876.852]  path_mount+0x2d8/0x9c0
  [876.852]  do_mount+0x79/0x90
  [876.852]  __x64_sys_mount+0x8e/0xd0
  [876.853]  do_syscall_64+0x38/0x90
  [876.899]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [876.958] RIP: 0033:0x7f5fbe50b76a
  [876.959] Code: 48 8b 0d a9 (...)
  [876.959] RSP: 002b:00007fff01925798 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [876.959] RAX: ffffffffffffffda RBX: 00007f5fbe694264 RCX: 00007f5fbe50b76a
  [876.960] RDX: 0000561bde6c8720 RSI: 0000561bde6bdec0 RDI: 0000561bde6c31a0
  [876.960] RBP: 0000561bde6bdc70 R08: 0000000000000000 R09: 0000000000000001
  [876.960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [876.960] R13: 0000561bde6c31a0 R14: 0000561bde6c8720 R15: 0000561bde6bdc70
  [876.960]  </TASK>

So fix this by setting 'inode' to NULL whenever we get an error from
btrfs_iget(), and to make the code simpler, stop testing for 'ret' being
-ENOENT to check if we have an inode - instead test for 'inode' being NULL
or not. Having a NULL 'inode' prevents any iput() call from crashing, as
iput() ignores NULL inode pointers. Also, stop testing for a NULL return
value from btrfs_iget() with PTR_ERR_OR_ZERO(), because btrfs_iget() never
returns NULL - in case an inode is not found, it returns ERR_PTR(-ENOENT),
and in case of memory allocation failure, it returns ERR_PTR(-ENOMEM).
We also don't need the extra iput() calls on the error branches for the
btrfs_start_transaction() and btrfs_del_orphan_item() calls, as we have
already called iput() before, so remove them.

Fixes: a13bb2c038 ("btrfs: add missing iputs on orphan cleanup failure")
CC: stable@vger.kernel.org # 6.4
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:13:01 +02:00
Filipe Manana
b777d279ff btrfs: fix double iput() on inode after an error during orphan cleanup
At btrfs_orphan_cleanup(), if we were able to find the inode, we do an
iput() on the inode, then if btrfs_drop_verity_items() succeeds and then
either btrfs_start_transaction() or btrfs_del_orphan_item() fail, we do
another iput() in the respective error paths, resulting in an extra iput()
on the inode.

Fix this by setting inode to NULL after the first iput(), as iput()
ignores a NULL inode pointer argument.

Fixes: a13bb2c038 ("btrfs: add missing iputs on orphan cleanup failure")
CC: stable@vger.kernel.org # 6.4
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18 03:12:59 +02:00
Jeff Layton
2a9462de43 btrfs: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-27-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13 10:28:04 +02:00
Jeff Layton
130f1eca98 btrfs: convert to simple_rename_timestamp
A rename potentially involves updating 4 different inode timestamps.
Convert to the new simple_rename_timestamp helper function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230705190309.579783-7-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-10 10:08:16 +02:00
Filipe Manana
6822b3f708 btrfs: do not BUG_ON after failure to migrate space during truncation
During truncation we reserve 2 metadata units when starting a transaction
(reserved space goes to fs_info->trans_block_rsv) and then attempt to
migrate 1 unit (min_size bytes) from fs_info->trans_block_rsv into the
local block reserve. If we ever fail we trigger a BUG_ON(), which should
never happen, because we reserved 2 units. However if we happen to fail
for some reason, we don't need to be so dire and hit a BUG_ON(), we can
just error out the truncation and, since this is highly unexpected,
surround the error check with WARN_ON(), to get an informative stack
trace and tag the branh as 'unlikely'. Also make the 'min_size' variable
const, since it's not supposed to ever change and any accidental change
could possibly make the space migration not so unlikely to fail.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:39 +02:00
Christoph Hellwig
f02c75e630 btrfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for btrfs so that noop_direct_IO can eventually be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:38 +02:00
Christoph Hellwig
b41b6f6937 btrfs: use btrfs_finish_ordered_extent to complete direct writes
Use the btrfs_finish_ordered_extent helper to complete compressed writes
using the bbio->ordered pointer instead of requiring an rbtree lookup
in the otherwise equivalent btrfs_mark_ordered_io_finished called from
btrfs_writepage_endio_finish_ordered.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:37 +02:00
Christoph Hellwig
ec63b84d46 btrfs: add an ordered_extent pointer to struct btrfs_bio
Add a pointer to the ordered_extent to the existing union in struct
btrfs_bio, so all code dealing with data write bios can just use a
pointer dereference to retrieve the ordered_extent instead of doing
multiple rbtree lookups per I/O.

The reference to this ordered_extent is dropped at end I/O time,
which implies that an extra one must be acquired when the bio is split.
This also requires moving the btrfs_extract_ordered_extent call into
btrfs_split_bio so that the invariant of always having a valid
ordered_extent reference for the btrfs_bio is kept.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
112397acc3 btrfs: open code btrfs_bio_end_io in btrfs_dio_submit_io
btrfs_dio_submit_io is the only place that uses btrfs_bio_end_io to end a
bio that hasn't been submitted using btrfs_submit_bio yet, and this
invariant will become a problem with upcoming changes to the btrfs bio
layer.  Just open code the assignment of bi_status and the call to
btrfs_dio_end_io.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
d611935b5d btrfs: pass an ordered_extent to btrfs_submit_compressed_write
btrfs_submit_compressed_write always operates on a single ordered_extent.
Make that explicit by using btrfs_alloc_ordered_extent in the callers
and passing the ordered_extent to btrfs_submit_compressed_write.  This
will help with storing and ordered_extent pointer in the btrfs_bio in
subsequent patches.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
34bfaf1530 btrfs: pass an ordered_extent to btrfs_reloc_clone_csums
Both callers of btrfs_reloc_clone_csums allocate the ordered_extent that
btrfs_reloc_clone_csums operates on.  Switch them to use
btrfs_alloc_ordered_extent instead of btrfs_add_ordered_extent and
pass the ordered_extent to btrfs_reloc_clone_csums instead of doing an
extra lookup.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
3daea5fda1 btrfs: merge the two calls to btrfs_add_ordered_extent in run_delalloc_nocow
Refactor run_delalloc_nocow a little bit so that there is only a single
call to btrfs_add_ordered_extent instead of two.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:36 +02:00
Christoph Hellwig
7027f87108 btrfs: don't treat zoned writeback as being from an async helper thread
When extent_write_locked_range was originally added, it was only used
writing back compressed pages from an async helper thread.  But it is
now also used for writing back pages on zoned devices, where it is
called directly from the ->writepage context.  In this case we want to
be able to pass on the writeback_control instead of creating a new one,
and more importantly want to use all the normal cgroup interaction
instead of potentially deferring writeback to another helper.

Fixes: 898793d992 ("btrfs: zoned: write out partially allocated region")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
a994310aa2 btrfs: remove PAGE_SET_ERROR
Now that the btrfs writeback code has stopped using PageError, using
PAGE_SET_ERROR to just set the per-address_space error flag is confusing.
Open code the mapping_set_error calls in the callers and remove
the PAGE_SET_ERROR flag.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
2b2553f123 btrfs: stop setting PageError in the data I/O path
PageError is not used by the VFS/MM and deprecated because it uses up a
page bit and has no coherent rules.  Instead read errors are usually
propagated by not setting or clearing the uptodate bit, and write errors
are propagated through the address_space.  Btrfs now only sets the flag
and never clears it for data pages, so just remove all places setting it,
and the subpage error bit.

Note that the error propagation for superblock writes that work on the
block device mapping still uses PageError for now, but that will be
addressed in a separate series.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
bb7b05fe1b btrfs: rename cow_file_range_async to run_delalloc_compressed
cow_file_range_async is only used for compressed writeback.  Rename it
to run_delalloc_compressed, which also fits in with run_delalloc_nocow
and run_delalloc_zoned.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
973fb26e81 btrfs: don't fail writeback when allocating the compression context fails
If cow_file_range_async fails to allocate the asynchronous writeback
context, it currently returns an error and entirely fails the writeback.
This is not a good idea as a writeback failure is a non-temporary error
condition that will make the file system unusable.  Just fall back to
synchronous uncompressed writeback instead.  This requires us to delay
setting the BTRFS_INODE_HAS_ASYNC_EXTENT flag until we've committed to
the async writeback.

The compression checks INODE_NOCOMPRESS and FORCE_COMPRESS are moved
from cow_file_range_async to the preceding checks btrfs_run_delalloc_range().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:35 +02:00
Christoph Hellwig
f000bc6fe4 btrfs: pass the new logical address to split_extent_map
split_extent_map splits off the first chunk of an extent map into a new
one.  One of the two users is the zoned I/O completion code that wants to
rewrite the logical block start address right after this split.  Pass in
the logical address to be set in the split off first extent_map as an
argument to avoid an extra extent tree lookup for this case.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
71df088c1c btrfs: defer splitting of ordered extents until I/O completion
The btrfs zoned completion code currently needs an ordered_extent and
extent_map per bio so that it can account for the non-predictable
write location from Zone Append.  To archive that it currently splits
the ordered_extent and extent_map at I/O submission time, and then
records the actual physical address in the ->physical field of the
ordered_extent.

This patch instead switches to record the "original" physical address
that the btrfs allocator assigned in spare space in the btrfs_bio,
and then rewrites the logical address in the btrfs_ordered_sum
structure at I/O completion time.  This allows the ordered extent
completion handler to simply walk the list of ordered csums and
split the ordered extent as needed.  This removes an extra ordered
extent and extent_map lookup and manipulation during the I/O
submission path, and instead batches it in the I/O completion path
where we need to touch these anyway.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
b0307e2864 btrfs: return the new ordered_extent from btrfs_split_ordered_extent
Return the ordered_extent split from the passed in one.  This will be
needed to be able to store an ordered_extent in the btrfs_bio.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
ebdb44a00e btrfs: reorder conditions in btrfs_extract_ordered_extent
There is no good reason for doing one before the other in terms of
failure implications, but doing the extent_map split first will
simplify some upcoming refactoring.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
a6f3e205e4 btrfs: move split_extent_map to extent_map.c
split_extent_map doesn't have anything to do with the other code in
inode.c, so move it to extent_map.c.

This also allows marking replace_extent_mapping static.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Christoph Hellwig
cbfce4c7fb btrfs: optimize the logical to physical mapping for zoned writes
The current code to store the final logical to physical mapping for a
zone append write in the extent tree is rather inefficient.  It first has
to split the ordered extent so that there is one ordered extent per bio,
so that it can look up the ordered extent on I/O completion in
btrfs_record_physical_zoned and store the physical LBA returned by the
block driver in the ordered extent.

btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
see what physical address the logical address for this bio / ordered
extent is mapped to, and then rewrite it in the extent tree.

To optimize this process, we can store the physical address assigned in
the chunk tree to the original logical address and a pointer to
btrfs_ordered_sum structure the in the btrfs_bio structure, and then use
this information to rewrite the logical address in the btrfs_ordered_sum
structure directly at I/O completion time in btrfs_record_physical_zoned.
btrfs_rewrite_logical_zoned then simply updates the logical address in
the extent tree and the ordered_extent itself.

The code in btrfs_rewrite_logical_zoned now runs for all data I/O
completions in zoned file systems, which is fine as there is no remapping
to do for non-append writes to conventional zones or for relocation, and
the overhead for quickly breaking out of the loop is very low.

Because zoned file systems now need the ordered_sums structure to
record the actual write location returned by zone append, allocate dummy
structures without the csum array for them when the I/O doesn't use
checksums, and free them when completing the ordered_extent.

Note that the btrfs_bio doesn't grow as the new field are places into
a union that is so far not used for data writes and has plenty of space
left in it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
Christoph Hellwig
5cfe76f846 btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical
btrfs_ordered_sum::bytendr stores a logical address.  Make that clear by
renaming it to ->logical.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:32 +02:00
David Sterba
1d12680044 btrfs: drop gfp from parameter extent state helpers
Now that all extent state bit helpers effectively take the GFP_NOFS mask
(and GFP_NOWAIT is encoded in the bits) we can remove the parameter.
This reduces stack consumption in many functions and simplifies a lot of
code.

Net effect on module on a release build:

   text    data     bss     dec     hex filename
1250432   20985   16088 1287505  13a551 pre/btrfs.ko
1247074   20985   16088 1284147  139833 post/btrfs.ko

DELTA: -3358

Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
David Sterba
66240ab115 btrfs: open code set_extent_delalloc
The helper is used once in fs code and a few times in the self test
code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:30 +02:00
Filipe Manana
59fcf38817 btrfs: change for_rename argument of btrfs_record_unlink_dir() to bool
The for_rename argument of btrfs_record_unlink_dir() is defined as an
integer, but the argument is in fact used as a boolean. So change it to
a boolean to make its use more clear.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Qu Wenruo
b7f9945a14 btrfs: handle tree backref walk error properly
[BUG]
Smatch reports the following errors related to commit ("btrfs: output
affected files when relocation fails"):

	fs/btrfs/inode.c:283 print_data_reloc_error()
	error: uninitialized symbol 'ref_level'.

[CAUSE]
That part of code is mostly copied from scrub, but unfortunately scrub
code from the beginning is not doing the error handling properly.

The offending code looks like this:

	do {
		ret = tree_backref_for_extent();
		btrfs_warn_rl();
	} while (ret != 1);

There are several problems involved:

- No error handling
  If that tree_backref_for_extent() failed, we would output the same
  error again and again, never really exit as it requires ret == 1 to
  exit.

- Always do one extra output
  As tree_backref_for_extent() only return > 0 if there is no more
  backref item.
  This means after the last item we hit, we would output an invalid
  error message for ret > 0 case.

[FIX]
Fix the old code by:

- Move @ref_root and @ref_level into the if branch
  And do not initialize them, so we can catch such uninitialized values
  just like what we do in the inode.c

- Explicitly check the return value of tree_backref_for_extent()
  And handle ret < 0 and ret > 0 cases properly.

- No more do {} while () loop
  Instead go while (true) {} loop since we will handle @ret manually.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Qu Wenruo
b9a9a85059 btrfs: output affected files when relocation fails
[PROBLEM]
When relocation fails (mostly due to checksum mismatch), we only got
very cryptic error messages like:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

The end user has to decipher the above messages and use various tools to
locate the affected files and find a way to fix the problem (mostly
deleting the file).  This is not an easy work even for experienced
developer, not to mention the end users.

[SCRUB IS DOING BETTER]
By contrast, scrub is providing much better error messages:

  BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0

Which provides the affected files directly to the end user.

[IMPROVEMENT]
Instead of the generic data checksum error messages, which is not doing
a good job for data reloc inodes, this patch introduce a scrub like
backref walking based solution.

When a sector fails its checksum for data reloc inode, we go the
following workflow:

- Get the real logical bytenr
  For data reloc inode, the file offset is the offset inside the block
  group.
  Thus the real logical bytenr is @file_off + @block_group->start.

- Do an extent type check
  If it's tree blocks it's much easier to handle, just go through
  all the tree block backref.

- Do a backref walk and inode path resolution for data extents
  This is mostly the same as scrub.
  But unfortunately we can not reuse the same function as the output
  format is different.

Now the new output would be more user friendly:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
e917ff56c8 btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.

Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
  mode
- btrfs_write_marked_extents - write marked extents, writeback with
  WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
29e70be261 btrfs: use SECTOR_SHIFT to convert physical offset to LBA
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Chris Mason
deccae40e4 btrfs: can_nocow_file_extent should pass down args->strict from callers
Commit 619104ba45 ("btrfs: move common NOCOW checks against a file
extent into a helper") changed our call to btrfs_cross_ref_exist() to
always pass false for the 'strict' parameter.  We're passing this down
through the stack so that we can do a full check for cross references
during swapfile activation.

With strict always false, this test fails:

  btrfs subvol create swappy
  chattr +C swappy
  fallocate -l1G swappy/swapfile
  chmod 600 swappy/swapfile
  mkswap swappy/swapfile

  btrfs subvol snap swappy swapsnap
  btrfs subvol del -C swapsnap

  btrfs fi sync /
  sync;sync;sync

  swapon swappy/swapfile

The fix is to just use args->strict, and everyone except swapfile
activation is passing false.

Fixes: 619104ba45 ("btrfs: move common NOCOW checks against a file extent into a helper")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:08 +02:00
Christoph Hellwig
7833b86595 btrfs: fix iomap_begin length for nocow writes
can_nocow_extent can reduce the len passed in, which needs to be
propagated to btrfs_dio_iomap_begin so that iomap does not submit
more data then is mapped.

This problems exists since the btrfs_get_blocks_direct helper was added
in commit c5794e5178 ("btrfs: Factor out write portion of
btrfs_get_blocks_direct"), but the ordered_extent splitting added in
commit b73a6fd1b1 ("btrfs: split partial dio bios before submit")
added a WARN_ON that made a syzkaller test fail.

Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Fixes: c5794e5178 ("btrfs: Factor out write portion of btrfs_get_blocks_direct")
CC: stable@vger.kernel.org # 6.1+
Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:00 +02:00
Naohiro Aota
f84353c7c2 btrfs: zoned: zone finish data relocation BG with last IO
For data block groups, we zone finish a zone (or, just deactivate it) when
seeing the last IO in btrfs_finish_ordered_io(). That is only called for
IOs using ZONE_APPEND, but we use a regular WRITE command for data
relocation IOs. Detect it and call btrfs_zone_finish_endio() properly.

Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:50:12 +02:00
Qu Wenruo
4317ff0056 btrfs: introduce btrfs_bio::fs_info member
Currently we're doing a lot of work for btrfs_bio:

- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for data read bios

However for the incoming scrub patches, we don't want this extra
functionality at all, just plain logical + mirror -> physical mapping
ability.

Thus here we do the following changes:

- Introduce btrfs_bio::fs_info
  This is for the new scrub specific btrfs_bio, which would not populate
  btrfs_bio::inode.
  Thus we need such new member to grab a fs_info

  This new member will always be populated.

- Replace @inode argument with @fs_info for btrfs_bio_init() and its
  caller
  Since @inode is no longer a mandatory member, replace it with
  @fs_info, and let involved users populate @inode.

- Skip checksum verification and generation if @bbio->inode is NULL

- Add extra ASSERT()s
  To make sure:

  * bbio->inode is properly set for involved read repair path
  * if @file_offset is set, bbio->inode is also populated

- Grab @fs_info from @bbio directly
  We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
  NULL. This involves:

  * btrfs_simple_end_io()
  * should_async_write()
  * btrfs_wq_submit_bio()
  * btrfs_use_zone_append()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Christoph Hellwig
3480373ebd btrfs, block: move REQ_CGROUP_PUNT to btrfs
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path.  To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly.  Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
896d7c1a90 btrfs: also use kthread_associate_blkcg for uncompressible ranges
submit_one_async_extent needs to use submit_one_async_extent no matter
if the range it handles ends up beeing compressed or not as the deadlock
risk due to cgroup thottling is the same.  Call kthread_associate_blkcg
earlier to cover submit_uncompressed_range case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00
Christoph Hellwig
e43a6210b7 btrfs: don't free the async_extent in submit_uncompressed_range
Let submit_one_async_extent, which is the only caller of
submit_uncompressed_range handle freeing of the async_extent in one
central place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:22 +02:00