1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

12906 commits

Author SHA1 Message Date
Filipe Manana
3128b548c7 btrfs: split assert into two different asserts when removing block group
When starting a transaction to remove a block group we have one ASSERT
that checks we found an extent map and that the extent map's start offset
matches the desired chunk offset. In case one of the conditions fails, we
get a stack trace that point to the respective line of code, however we
can't tell which condition failed: either there's no extent map or we got
one with an unexpected start offset. To make such an issue easier to debug
and analyse, split the assertion into two, one for each condition. This
was actually triggered during development of another upcoming change.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
Filipe Manana
5031660a1b btrfs: mark sanity checks when getting chunk map as unlikely
When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity
checks to verify that we found an extent map and that it includes the
requested logical address. These are never expected to fail, so mark
them as unlikely to make it more clear as well as to allow a compiler
to generate more efficient code.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
46524fab69 btrfs: remove unused btrfs_root::type
Looks like the struct member was added in 2007 in 2.6.29 in commit
87ee04eb0f ("Btrfs: Add simple stripe size parameter") but hasn't been
used at all since. So let's remove it. This was found by tool
https://github.com/jirislaby/clang-struct, then build tested after
removing the struct member.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
49542050b1 btrfs: remove unused definition of tree_entry in extent-io-tree.c
The declaration was temporarily moved in a4055213bf ("btrfs: unexport
all the temporary exports for extent-io-tree.c") and then should have
been removed in 6.0 in 071d19f513 ("btrfs: remove struct tree_entry in
extent-io-tree.c") but was not.  This was found by tool
https://github.com/jirislaby/clang-struct .

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
a0df0a2680 btrfs: raid56: remove unused btrfs_plug_cb::work
The raid56 changes in 6.2 reworked the IO path to RMW, commit
93723095b5 ("btrfs: raid56: switch write path to rmw_rbio()") in
particular removed the last use of the work member so it can be removed
as well. This was found by tool https://github.com/jirislaby/clang-struct .

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
3d72941664 btrfs: remove unused btrfs_ordered_extent::outstanding_isize
The whole isize code was deleted in 5.6 3f1c64ce04 ("btrfs: delete the
ordered isize update code"), except the struct member.  This was found
by tool https://github.com/jirislaby/clang-struct .

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
a5e182d85f btrfs: scrub: remove unused scrub_ctx::sectors_per_bio
The recent scrub rewrite forgot to remove the sectors_per_bio in
6.3 in 13a62fd997 ("btrfs: scrub: remove scrub_bio structure").
This was found by tool https://github.com/jirislaby/clang-struct .

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
Qu Wenruo
cfbf07e278 btrfs: migrate to use folio private instead of page private
As a cleanup and preparation for future folio migration, this patch
would replace all page->private to folio version.  This includes:

- PagePrivate()
  -> folio_test_private()

- page->private
  -> folio_get_private()

- attach_page_private()
  -> folio_attach_private()

- detach_page_private()
  -> folio_detach_private()

Since we're here, also remove the forced cast on page->private, since
it's (void *) already, we don't really need to do the cast.

For now even if we missed some call sites, it won't cause any problem
yet, as we're only using order 0 folio (single page), thus all those
folio/page flags should be synced.

But for the future conversion to utilize higher order folio, the page
<-> folio flag sync is no longer guaranteed, thus we have to migrate to
utilize folio flags.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
4cea422a77 btrfs: use shrinker for compression page pool
The pages are now allocated and freed centrally, so we can extend the
logic to manage the lifetime. The main idea is to keep a few recently
used pages and hand them to all writers. Ideally we won't have to go to
allocator at all (a slight performance gain) and also raise chance that
we'll have the pages available (slightly increased reliability).

In order to avoid gathering too many pages, the shrinker is attached to
the cache so we can free them on when MM demands that. The first
implementation will drain the whole cache. Further this can be refined
to keep some minimal number of pages for emergency purposes.  The
ultimate goal to avoid memory allocation failures on the write out path
from the compression.

The pool threshold is set to cover full BTRFS_MAX_COMPRESSED / PAGE_SIZE
for minimal thread pool, which is 8 (btrfs_init_fs_info()). This is 128K
/ 4K * 8 = 256 pages at maximum, which is 1MiB.

This is for all filesystems currently mounted, with heavy use of
compression IO the allocator is still needed. The cache helps for short
burst IO.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
David Sterba
9ba965dca3 btrfs: use page alloc/free wrappers for compression pages
This is a preparation for managing compression pages in a cache-like
manner, instead of asking the allocator each time. The common allocation
and free wrappers are introduced and are functionally equivalent to the
current code.

The freeing helpers need to be carefully placed where the last reference
is dropped.  This is either after directly allocating (error handling)
or when there are no other users of the pages (after copying the contents).

It's safe to not use the helper and use put_page() that will handle the
reference count. Not using the helper means there's lower number of
pages that could be reused without passing them back to allocator.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:01 +01:00
Qu Wenruo
9ba7c686fe btrfs: do not utilize goto to implement delayed inode ref deletion
[PROBLEM]
The function __btrfs_update_delayed_inode() is doing something not
meeting the code standard of today:

	path->slots[0]++
	if (path->slots[0] >= btrfs_header_nritems(leaf))
		goto search;
again:
	if (!is_the_target_inode_ref())
		goto out;
	ret = btrfs_delete_item();
	/* Some cleanup. */
	return ret;

search:
	ret = search_for_the_last_inode_ref();
	goto again;

With the tag named "again", it's pretty common to think it's a loop, but
the truth is, we only need to do the search once, to locate the last
(also the first, since there should only be one INODE_REF or
INODE_EXTREF now) ref of the inode.

[FIX]
Instead of the weird jumps, just do them in a stream-lined fashion.
This removes those weird labels, and add extra comments on why we can do
the different searches.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:00 +01:00
Filipe Manana
80d197fe04 btrfs: make the logic from btrfs_block_can_be_shared() easier to read
The logic in btrfs_block_can_be_shared() is hard to follow as we have a
lot of conditions in a single if statement including a subexpression with
a logical or and two nested if statements inside the main if statement.

Make this easier to read by using separate if statements that return
immediately when we find a condition that determines if a block can be
or can not be shared.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:00 +01:00
Filipe Manana
6e5de50fc5 btrfs: use bool for return type of btrfs_block_can_be_shared()
Currently btrfs_block_can_be_shared() returns an int that is used as a
boolean. Since it all it needs is to return true or false, and it can't
return errors for example, change the return type from int to bool to
make it a bit more readable and obvious.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:00 +01:00
Filipe Manana
6000d9313f btrfs: remove log_extents_lock and logged_list from struct btrfs_root
The logged_list[2] and log_extents_lock[2] members of struct btrfs_root
are no longer used, their last use was removed in commit 5636cf7d6d
("btrfs: remove the logged extents infrastructure"). So remove these
fields. This reduces the size of struct btrfs_root, on a release kernel,
from 1392 bytes down to 1352 bytes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:00 +01:00
Filipe Manana
b1dd019de6 btrfs: remove duplicate btrfs_clear_buffer_dirty() prototype from disk-io.h
The prototype for btrfs_clear_buffer_dirty() is declared in both disk-io.h
and extent_io.h, but the function is defined at extent_io.c. So remove the
prototype declaration from disk-io.h.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15 20:27:00 +01:00
Linus Torvalds
bdb2701f0b for-6.7-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmV5rTIACgkQxWXV+ddt
 WDuLUg/+Ix/CeA+JY6VZMA2kBHMzmRexSjYONWfQwIL7LPBy4sOuSEaTZt+QQMs+
 AEKau1YfTgo7e9S2DlbZhIWp6P87VFui7Q1E99uJEmKelakvf94DbMrufPTTKjaD
 JG2KB6LsD59yWwfbGHEAVVNGSMRk2LDXzcUWMK6/uzu/7Bcr4ataOymWd86/blUV
 cw5g87uAHpBn+R1ARTf1CkqyYiI9UldNUJmW1q7dwxOyYG+weUtJImosw2Uda76y
 wQXAFQAH3vsFzTC+qjC9Vz7cnyAX9qAw48ODRH7rIT1BQ3yAFQbfXE20jJ/fSE+C
 lz3p05tA9373KAOtLUHmANBwe3NafCnlut6ZYRfpTcEzUslAO5PnajPaHh5Al7uC
 Iwdpy49byoyVFeNf0yECBsuDP8s86HlUALF8mdJabPI1Kl66MUea6KgS1oyO3pCB
 hfqLbpofV4JTywtIRLGQTQvzSwkjPHTbSwtZ9nftTw520a5f7memDu5vi4XzFd+B
 NrJxmz2DrMRlwrLgWg9OXXgx1riWPvHnIoqzjG5W6A9N74Ud1/oz7t3VzjGSQ5S2
 UikRB6iofPE0deD8IF6H6DvFfvQxU9d9BJ6IS9V2zRt5vdgJ2w08FlqbLZewSY4x
 iaQ+L7UYKDjC9hdosXVNu/6fAspyBVdSp2NbKk14fraZtNAoPNs=
 =uF/Q
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
  "Some fixes to quota accounting code, mostly around error handling and
   correctness:

   - free reserves on various error paths, after IO errors or
     transaction abort

   - don't clear reserved range at the folio release time, it'll be
     properly cleared after final write

   - fix integer overflow due to int used when passing around size of
     freed reservations

   - fix a regression in squota accounting that missed some cases with
     delayed refs"

* tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: ensure releasing squota reserve on head refs
  btrfs: don't clear qgroup reserved bit in release_folio
  btrfs: free qgroup pertrans reserve on transaction abort
  btrfs: fix qgroup_free_reserved_data int overflow
  btrfs: free qgroup reserve when ORDERED_IOERR is set
2023-12-14 11:53:00 -08:00
Matthew Wilcox (Oracle)
af7628d6ec fs: convert error_remove_page to error_remove_folio
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.

Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:42 -08:00
Boris Burkov
e85a0adacf btrfs: ensure releasing squota reserve on head refs
A reservation goes through a 3 step lifetime:

- generated during delalloc
- released/counted by ordered_extent allocation
- freed by running delayed ref

That third step depends on must_insert_reserved on the head ref, so the
head ref with that field set owns the reservation. Once you prepare to
run the head ref, must_insert_reserved is unset, which means that
running the ref must free the reservation, whether or not it succeeds,
or else the reservation is leaked. That results in either a risk of
spurious ENOSPC if the fs stays writeable or a warning on unmount if it
is readonly.

The existing squota code was aware of these invariants, but missed a few
cases. Improve it by adding a helper function to use in the cleanup
paths and call it from the existing early returns in running delayed
refs. This also simplifies btrfs_record_squota_delta and struct
btrfs_quota_delta.

This fixes (or at least improves the reliability of) generic/475 with
"mkfs -O squota". On my machine, that test failed ~4/10 times without
this patch and passed 100/100 times with it.

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06 22:32:57 +01:00
Boris Burkov
a86805504b btrfs: don't clear qgroup reserved bit in release_folio
The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for
duplicate reservations. That is two writes to that range in one
transaction shouldn't create two reservations, as the reservation will
only be freed once when the write finally goes down. Therefore, it is
never OK to clear that bit without freeing the associated qgroup
reserve. At this point, we don't want to be freeing the reserve, so mask
off the bit.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06 22:32:52 +01:00
Boris Burkov
b321a52cce btrfs: free qgroup pertrans reserve on transaction abort
If we abort a transaction, we never run the code that frees the pertrans
qgroup reservation. This results in warnings on unmount as that
reservation has been leaked. The leak isn't a huge issue since the fs is
read-only, but it's better to clean it up when we know we can/should. Do
it during the cleanup_transaction step of aborting.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06 22:32:49 +01:00
Boris Burkov
9e65bfca24 btrfs: fix qgroup_free_reserved_data int overflow
The reserved data counter and input parameter is a u64, but we
inadvertently accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking reserve accounting.

Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes.

Therefore, the full fix is to return the "released" or "freed" amount by
a u64 argument and to return 0 or negative error code via the return
value.

Most of the call sites simply ignore the return value, though some
of them handle the error and count the returned bytes. Change all of
them accordingly.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06 22:32:46 +01:00
Boris Burkov
f63e1164b9 btrfs: free qgroup reserve when ORDERED_IOERR is set
An ordered extent completing is a critical moment in qgroup reserve
handling, as the ownership of the reservation is handed off from the
ordered extent to the delayed ref. In the happy path we release (unlock)
but do not free (decrement counter) the reservation, and the delayed ref
drives the free. However, on an error, we don't create a delayed ref,
since there is no ref to add. Therefore, free on the error path.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06 22:32:40 +01:00
Linus Torvalds
18d46e76d7 for-6.7-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmVmHvEACgkQxWXV+ddt
 WDtczA//UdxSPxQwJY1oOQj3k5Zgb/zOThfyD4x5wrDxFAYwVh1XhuyxV2XT8qyq
 ipp+mi9doVykZKLSY+oRAeqjGlF0nIYm1z2PGSU7JXz4VsEj970rsDs3ePwH5TW6
 V8/6VraOIIvmOnTft7aiuM8CXUjyndalNl7RvHu+v6grgAAgQAaly/3CmRIsm/Ui
 2Wb6/J8ciAOBZ8TkFMr0PiTJd+CjUL+1Y9IaYEywujf0nVNJSgHp+R2CpwLDvM/5
 1x6LdRrUKmnY7mhOvC7QwWfQmgsPnj3OuR+3+L+8jULTvcpwka2KEcpCH8/s6mUK
 +4XhKQ4xXOJr8M+KmAUpy1yZZ30G6cDSwnfCWbWRCfR03396tTb08kb6G21fR+NL
 o2qEUOe4DoMbYX/5zd9xEVqbwyGhAIXB0fJ7KJ0RqbaNBh/roRALBVCseP2CFwJE
 P0DE9phjeIGQf3ybdfP7XqnMfk520bqoeV49Akbn2us2SrV1+O9Yjqmj2pbTnljE
 M30Jh/btaiTFtsGB3MBDRRnGhf7F2l1dsmdmMVhdOK8HMY6obcJUdv6YXVLAjBDn
 ATWtUVVizOpHvSZL0G/+1fXqHhLqOnHLY4A97uMjcElK5WJfuYZv8vZK7GVKC/jW
 y5F4w/FPxU8dmhorMGksya2CLMvUsv5dikyAzGHirjEAdyrK1jg=
 =85Pb
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few fixes and message updates:

   - for simple quotas, handle the case when a snapshot is created and
     the target qgroup already exists

   - fix a warning when file descriptor given to send ioctl is not
     writable

   - fix off-by-one condition when checking chunk maps

   - free pages when page array allocation fails during compression
     read, other cases were handled

   - fix memory leak on error handling path in ref-verify debugging
     feature

   - copy missing struct member 'version' in 64/32bit compat send ioctl

   - tree-checker verifies inline backref ordering

   - print messages to syslog on first mount and last unmount

   - update error messages when reading chunk maps"

* tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: send: ensure send_fd is writable
  btrfs: free the allocated memory if btrfs_alloc_page_array() fails
  btrfs: fix 64bit compat send ioctl arguments not initializing version member
  btrfs: make error messages more clear when getting a chunk map
  btrfs: fix off-by-one when checking chunk map includes logical address
  btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()
  btrfs: add dmesg output for first mount and last unmount of a filesystem
  btrfs: do not abort transaction if there is already an existing qgroup
  btrfs: tree-checker: add type and sequence check for inline backrefs
2023-11-28 11:16:04 -08:00
Jann Horn
0ac1d13a55 btrfs: send: ensure send_fd is writable
kernel_write() requires the caller to ensure that the file is writable.
Let's do that directly after looking up the ->send_fd.

We don't need a separate bailout path because the "out" path already
does fput() if ->send_filp is non-NULL.

This has no security impact for two reasons:

 - the ioctl requires CAP_SYS_ADMIN
 - __kernel_write() bails out on read-only files - but only since 5.8,
   see commit a01ac27be4 ("fs: check FMODE_WRITE in __kernel_write")

Reported-and-tested-by: syzbot+12e098239d20385264d3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=12e098239d20385264d3
Fixes: 31db9f7c23 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24 18:50:53 +01:00
Qu Wenruo
94dbf7c087 btrfs: free the allocated memory if btrfs_alloc_page_array() fails
[BUG]
If btrfs_alloc_page_array() fail to allocate all pages but part of the
slots, then the partially allocated pages would be leaked in function
btrfs_submit_compressed_read().

[CAUSE]
As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM,
caller is responsible to free the partially allocated pages.

For the existing call sites, most of them are fine:

- btrfs_raid_bio::stripe_pages
  Handled by free_raid_bio().

- extent_buffer::pages[]
  Handled btrfs_release_extent_buffer_pages().

- scrub_stripe::pages[]
  Handled by release_scrub_stripe().

But there is one exception in btrfs_submit_compressed_read(), if
btrfs_alloc_page_array() failed, we didn't cleanup the array and freed
the array pointer directly.

Initially there is still the error handling in commit dd137dd1f2
("btrfs: factor out allocating an array of pages"), but later in commit
544fe4a903 ("btrfs: embed a btrfs_bio into struct compressed_bio"),
the error handling is removed, leading to the possible memory leak.

[FIX]
This patch would add back the error handling first, then to prevent such
situation from happening again, also
Make btrfs_alloc_page_array() to free the allocated pages as a extra
safety net, then we don't need to add the error handling to
btrfs_submit_compressed_read().

Fixes: 544fe4a903 ("btrfs: embed a btrfs_bio into struct compressed_bio")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24 18:50:49 +01:00
David Sterba
5de0434bc0 btrfs: fix 64bit compat send ioctl arguments not initializing version member
When the send protocol versioning was added in 5.16 e77fbf9903
("btrfs: send: prepare for v2 protocol"), the 32/64bit compat code was
not updated (added by 2351f431f7 ("btrfs: fix send ioctl on 32bit with
64bit kernel")), missing the version struct member. The compat code is
probably rarely used, nobody reported any bugs.

Found by tool https://github.com/jirislaby/clang-struct .

Fixes: e77fbf9903 ("btrfs: send: prepare for v2 protocol")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24 18:50:42 +01:00
Amir Goldstein
2f4d8ad825 btrfs: move file_start_write() to after permission hook
In vfs code, file_start_write() is usually called after the permission
hook in rw_verify_area().  btrfs_ioctl_encoded_write() in an exception
to this rule.

Move file_start_write() to after the rw_verify_area() check in encoded
write to make the permission hook "start-write-safe".

This is needed for fanotify "pre content" events.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231122122715.2561213-9-amir73il@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-24 09:22:28 +01:00
Filipe Manana
7d410d5efe btrfs: make error messages more clear when getting a chunk map
When getting a chunk map, at btrfs_get_chunk_map(), we do some sanity
checks to verify we found a chunk map and that map found covers the
logical address the caller passed in. However the messages aren't very
clear in the sense that don't mention the issue is with a chunk map and
one of them prints the 'length' argument as if it were the end offset of
the requested range (while the in the string format we use %llu-%llu
which suggests a range, and the second %llu-%llu is actually a range for
the chunk map). So improve these two details in the error messages.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-23 22:27:46 +01:00
Filipe Manana
5fba5a5718 btrfs: fix off-by-one when checking chunk map includes logical address
At btrfs_get_chunk_map() we get the extent map for the chunk that contains
the given logical address stored in the 'logical' argument. Then we do
sanity checks to verify the extent map contains the logical address. One
of these checks verifies if the extent map covers a range with an end
offset behind the target logical address - however this check has an
off-by-one error since it will consider an extent map whose start offset
plus its length matches the target logical address as inclusive, while
the fact is that the last byte it covers is behind the target logical
address (by 1).

So fix this condition by using '<=' rather than '<' when comparing the
extent map's "start + length" against the target logical address.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-23 22:27:42 +01:00
Bragatheswaran Manickavel
f91192cd68 btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()
In btrfs_ref_tree_mod(), when !parent 're' was allocated through
kmalloc(). In the following code, if an error occurs, the execution will
be redirected to 'out' or 'out_unlock' and the function will be exited.
However, on some of the paths, 're' are not deallocated and may lead to
memory leaks.

For example: lookup_block_entry() for 'be' returns NULL, the out label
will be invoked. During that flow ref and 'ra' are freed but not 're',
which can potentially lead to a memory leak.

CC: stable@vger.kernel.org # 5.10+
Reported-and-tested-by: syzbot+d66de4cbf532749df35f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d66de4cbf532749df35f
Signed-off-by: Bragatheswaran Manickavel <bragathemanick0908@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-23 22:27:34 +01:00
Qu Wenruo
2db313205f btrfs: add dmesg output for first mount and last unmount of a filesystem
There is a feature request to add dmesg output when unmounting a btrfs.
There are several alternative methods to do the same thing, but with
their own problems:

- Use eBPF to watch btrfs_put_super()/open_ctree()
  Not end user friendly, they have to dip their head into the source
  code.

- Watch for directory /sys/fs/<uuid>/
  This is way more simple, but still requires some simple device -> uuid
  lookups.  And a script needs to use inotify to watch /sys/fs/.

Compared to all these, directly outputting the information into dmesg
would be the most simple one, with both device and UUID included.

And since we're here, also add the output when mounting a filesystem for
the first time for parity. A more fine grained monitoring of subvolume
mounts should be done by another layer, like audit.

Now mounting a btrfs with all default mkfs options would look like this:

  [81.906566] BTRFS info (device dm-8): first mount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2
  [81.907494] BTRFS info (device dm-8): using crc32c (crc32c-intel) checksum algorithm
  [81.908258] BTRFS info (device dm-8): using free space tree
  [81.912644] BTRFS info (device dm-8): auto enabling async discard
  [81.913277] BTRFS info (device dm-8): checking UUID tree
  [91.668256] BTRFS info (device dm-8): last unmount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2

CC: stable@vger.kernel.org # 5.4+
Link: https://github.com/kdave/btrfs-progs/issues/689
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-23 22:27:26 +01:00
Matthew Wilcox (Oracle)
600f111ef5 fs: Rename mapping private members
It is hard to find where mapping->private_lock, mapping->private_list and
mapping->private_data are used, due to private_XXX being a relatively
common name for variables and structure members in the kernel.  To fit
with other members of struct address_space, rename them all to have an
i_ prefix.  Tested with an allmodconfig build.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-21 11:57:10 +01:00
Jan Kara
ead622674d
btrfs: Do not restrict writes to btrfs devices
Btrfs device probing code needs adaptation so that it works when writes
are restricted to its mounted devices. Since btrfs maintainer wants to
merge these changes through btrfs tree and there are review bandwidth
issues with that, let's not block all other filesystems and just not
restrict writes to btrfs devices for now.

CC: <linux-btrfs@vger.kernel.org>
CC: David Sterba <dsterba@suse.com>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Chris Mason <clm@fb.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231101174325.10596-4-jack@suse.cz
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:59:25 +01:00
Qu Wenruo
8049ba5d0a btrfs: do not abort transaction if there is already an existing qgroup
[BUG]
Syzbot reported a regression that after commit 6ed05643dd ("btrfs:
create qgroup earlier in snapshot creation") we can trigger transaction
abort during snapshot creation:

  BTRFS: Transaction aborted (error -17)
  WARNING: CPU: 0 PID: 5057 at fs/btrfs/transaction.c:1778 create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
  Modules linked in:
  CPU: 0 PID: 5057 Comm: syz-executor225 Not tainted 6.6.0-syzkaller-15365-g305230142ae0 #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
  RIP: 0010:create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
  Call Trace:
   <TASK>
   create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1967
   btrfs_commit_transaction+0xf1c/0x3730 fs/btrfs/transaction.c:2440
   create_snapshot+0x4a5/0x7e0 fs/btrfs/ioctl.c:845
   btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:995
   btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1041
   __btrfs_ioctl_snap_create+0x344/0x460 fs/btrfs/ioctl.c:1294
   btrfs_ioctl_snap_create+0x13c/0x190 fs/btrfs/ioctl.c:1321
   btrfs_ioctl+0xbbf/0xd40
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:871 [inline]
   __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
   do_syscall_x64 arch/x86/entry/common.c:51 [inline]
   do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
   entry_SYSCALL_64_after_hwframe+0x63/0x6b
  RIP: 0033:0x7f2f791127b9
   </TASK>

[CAUSE]
The error number is -EEXIST, which can happen for qgroup if there is
already an existing qgroup and then we're trying to create a snapshot
for it.

[FIX]
In that case, we can continue creating the snapshot, although it may
lead to qgroup inconsistency, it's not so critical to abort the current
transaction.

So in this case, we can just ignore the non-critical errors, mostly -EEXIST
(there is already a qgroup).

Reported-by: syzbot+4d81015bc10889fd12ea@syzkaller.appspotmail.com
Fixes: 6ed05643dd ("btrfs: create qgroup earlier in snapshot creation")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-15 17:08:14 +01:00
Qu Wenruo
1645c283a8 btrfs: tree-checker: add type and sequence check for inline backrefs
[BUG]
There is a bug report that ntfs2btrfs had a bug that it can lead to
transaction abort and the filesystem flips to read-only.

[CAUSE]
For inline backref items, kernel has a strict requirement for their
ordered, they must follow the following rules:

- All btrfs_extent_inline_ref::type should be in an ascending order

- Within the same type, the items should follow a descending order by
  their sequence number

  For EXTENT_DATA_REF type, the sequence number is result from
  hash_extent_data_ref().
  For other types, their sequence numbers are
  btrfs_extent_inline_ref::offset.

Thus if there is any code not following above rules, the resulted
inline backrefs can prevent the kernel to locate the needed inline
backref and lead to transaction abort.

[FIX]
Ntrfs2btrfs has already fixed the problem, and btrfs-progs has added the
ability to detect such problems.

For kernel, let's be more noisy and be more specific about the order, so
that the next time kernel hits such problem we would reject it in the
first place, without leading to transaction abort.

Link: https://github.com/kdave/btrfs-progs/pull/622
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-15 17:08:09 +01:00
Linus Torvalds
9bacdd8996 for-6.7-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmVSO50ACgkQxWXV+ddt
 WDuiyg/7BZviFAyiQMAzpA319qRJZ+EemfTdF/k69q4axGYuvqVdXKnpOV44AR4I
 dKcHLOPpDZIxsh8lFytkm1UEAHptw1v7A64c+gcdjGK0tAA7aKbw/1nmNysowT23
 L0v2+34hkBUfG8A3uVgOwL1rjItEX5Fl54slpVsazSqlEbKrqC4MGNjmqdp3IOeC
 qfXTgkvkXmm8s8NyoJybKewM9Aw0tmK0jkAFHA+2sgcZPYKXjqWGv9KUOsXnCx5o
 3kPWIRT1sj4q2qzrgP14Q12O6qPLZ2/0oTBhi6nhj8+N1yiH+USS5zBITegF+w2n
 leQeVHtyBYHlPYQSQlCIZy7+10gkePvs+JmoAuL8YFISnGYnvOZqCeArlV7cnNI3
 CQt7ZBER5Dqw78Y756usUhpYrLWa9kOpcPVRmjJ/R62+TY1FkkyY7irETbn5EGjI
 NlhEa4PMYeYpAOccoxWEm9tIiiVD1abURhVBdn3Znfcb1Sv/lrGBlo9DYGFCxbBh
 xU1JP7sly8w0aPLqCbn1X3VY8dXp+CeYz4FQabHjQA/zr9lF08/pRYj3haAbYAyH
 0KphXurwz/YqY+LmRg7SbQ/KMgBAiBV8Qk9JyNvdvaQbnYnq7CWdpoHcpZu3mvpb
 HLGoXew58kZaSfxLHlcT5wwYlbq0rooXRstuFg2+BBcOFOMCQfw=
 =GM+1
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix potential overflow in returned value from SEARCH_TREE_V2
   ioctl on 32bit architecture

 - zoned mode fixes:

     - drop unnecessary write pointer check for RAID0/RAID1/RAID10
       profiles, now it works because of raid-stripe-tree

     - wait for finishing the zone when direct IO needs a new
       allocation

 - simple quota fixes:

     - pass correct owning root pointer when cleaning up an
       aborted transaction

     - fix leaking some structures when processing delayed refs

     - change key type number of BTRFS_EXTENT_OWNER_REF_KEY,
       reorder it before inline refs that are supposed to be
       sorted, keeping the original number would complicate a lot
       of things; this change needs an updated version of
       btrfs-progs to work and filesystems need to be recreated

 - fix error pointer dereference after failure to allocate fs
   devices

 - fix race between accounting qgroup extents and removing a
   qgroup

* tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: make OWNER_REF_KEY type value smallest among inline refs
  btrfs: fix qgroup record leaks when using simple quotas
  btrfs: fix race between accounting qgroup extents and removing a qgroup
  btrfs: fix error pointer dereference after failure to allocate fs devices
  btrfs: make found_logical_ret parameter mandatory for function queue_scrub_stripe()
  btrfs: get correct owning_root when dropping snapshot
  btrfs: zoned: wait for data BG to be finished on direct IO allocation
  btrfs: zoned: drop no longer valid write pointer check
  btrfs: directly return 0 on no error code in btrfs_insert_raid_extent()
  btrfs: use u64 for buffer sizes in the tree search ioctls
2023-11-13 09:09:12 -08:00
Filipe Manana
609d993797 btrfs: fix qgroup record leaks when using simple quotas
When using simple quotas we are not supposed to allocate qgroup records
when adding delayed references. However we allocate them if either mode
of quotas is enabled (the new simple one or the old one), but then we
never free them because running the accounting, which frees the records,
is only run when using the old quotas (at btrfs_qgroup_account_extents()),
resulting in a memory leak of the records allocated when adding delayed
references.

Fix this by allocating the records only if the old quotas mode is enabled.
Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas
mode is not enabled - meaning the caller has to free the record.

Fixes: 182940f4f4 ("btrfs: qgroup: add new quota mode for simple quotas")
Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-09 14:01:59 +01:00
Filipe Manana
6c8e69e4a7 btrfs: fix race between accounting qgroup extents and removing a qgroup
When doing qgroup accounting for an extent, we take the spinlock
fs_info->qgroup_lock and then add qgroups to the local list (iterator)
named "qgroups". These qgroups are found in the fs_info->qgroup_tree
rbtree. After we're done, we unlock fs_info->qgroup_lock and then call
qgroup_iterator_nested_clean(), which will iterate over all the qgroups
added to the local list "qgroups" and then delete them from the list.
Deleting a qgroup from the list can however result in a use-after-free
if a qgroup remove operation happens after we unlock fs_info->qgroup_lock
and before or while we are at qgroup_iterator_nested_clean().

Fix this by calling qgroup_iterator_nested_clean() while still holding
the lock fs_info->qgroup_lock - we don't need it under the 'out' label
since before taking the lock the "qgroups" list is always empty. This
guarantees safety because btrfs_remove_qgroup() takes that lock before
removing a qgroup from the rbtree fs_info->qgroup_tree.

This was reported by syzbot with the following stack traces:

   BUG: KASAN: slab-use-after-free in __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
   Read of size 8 at addr ffff888027e420b0 by task kworker/u4:3/48

   CPU: 1 PID: 48 Comm: kworker/u4:3 Not tainted 6.6.0-syzkaller-10396-g4652b8e4f3ff #0
   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
   Workqueue: btrfs-qgroup-rescan btrfs_work_helper
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
    print_address_description mm/kasan/report.c:364 [inline]
    print_report+0x163/0x540 mm/kasan/report.c:475
    kasan_report+0x175/0x1b0 mm/kasan/report.c:588
    __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
    __list_del_entry_valid include/linux/list.h:124 [inline]
    __list_del_entry include/linux/list.h:215 [inline]
    list_del_init include/linux/list.h:287 [inline]
    qgroup_iterator_nested_clean fs/btrfs/qgroup.c:2623 [inline]
    btrfs_qgroup_account_extent+0x18b/0x1150 fs/btrfs/qgroup.c:2883
    qgroup_rescan_leaf fs/btrfs/qgroup.c:3543 [inline]
    btrfs_qgroup_rescan_worker+0x1078/0x1c60 fs/btrfs/qgroup.c:3604
    btrfs_work_helper+0x37c/0xbd0 fs/btrfs/async-thread.c:315
    process_one_work kernel/workqueue.c:2630 [inline]
    process_scheduled_works+0x90f/0x1400 kernel/workqueue.c:2703
    worker_thread+0xa5f/0xff0 kernel/workqueue.c:2784
    kthread+0x2d3/0x370 kernel/kthread.c:388
    ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
    </TASK>

   Allocated by task 6355:
    kasan_save_stack mm/kasan/common.c:45 [inline]
    kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
    ____kasan_kmalloc mm/kasan/common.c:374 [inline]
    __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:383
    kmalloc include/linux/slab.h:600 [inline]
    kzalloc include/linux/slab.h:721 [inline]
    btrfs_quota_enable+0xee9/0x2060 fs/btrfs/qgroup.c:1209
    btrfs_ioctl_quota_ctl+0x143/0x190 fs/btrfs/ioctl.c:3705
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:871 [inline]
    __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
    do_syscall_x64 arch/x86/entry/common.c:51 [inline]
    do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
    entry_SYSCALL_64_after_hwframe+0x63/0x6b

   Freed by task 6355:
    kasan_save_stack mm/kasan/common.c:45 [inline]
    kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
    kasan_save_free_info+0x28/0x40 mm/kasan/generic.c:522
    ____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236
    kasan_slab_free include/linux/kasan.h:164 [inline]
    slab_free_hook mm/slub.c:1800 [inline]
    slab_free_freelist_hook mm/slub.c:1826 [inline]
    slab_free mm/slub.c:3809 [inline]
    __kmem_cache_free+0x263/0x3a0 mm/slub.c:3822
    btrfs_remove_qgroup+0x764/0x8c0 fs/btrfs/qgroup.c:1787
    btrfs_ioctl_qgroup_create+0x185/0x1e0 fs/btrfs/ioctl.c:3811
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:871 [inline]
    __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
    do_syscall_x64 arch/x86/entry/common.c:51 [inline]
    do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
    entry_SYSCALL_64_after_hwframe+0x63/0x6b

   Last potentially related work creation:
    kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
    __kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
    __call_rcu_common kernel/rcu/tree.c:2667 [inline]
    call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
    kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
    kthread+0x2d3/0x370 kernel/kthread.c:388
    ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242

   Second to last potentially related work creation:
    kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
    __kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
    __call_rcu_common kernel/rcu/tree.c:2667 [inline]
    call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
    kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
    kthread+0x2d3/0x370 kernel/kthread.c:388
    ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242

   The buggy address belongs to the object at ffff888027e42000
    which belongs to the cache kmalloc-512 of size 512
   The buggy address is located 176 bytes inside of
    freed 512-byte region [ffff888027e42000, ffff888027e42200)

   The buggy address belongs to the physical page:
   page:ffffea00009f9000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x27e40
   head:ffffea00009f9000 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
   flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
   page_type: 0xffffffff()
   raw: 00fff00000000840 ffff888012c41c80 ffffea0000a5ba00 dead000000000002
   raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
   page dumped because: kasan: bad access detected
   page_owner tracks the page as allocated
   page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 4514, tgid 4514 (udevadm), ts 24598439480, free_ts 23755696267
    set_page_owner include/linux/page_owner.h:31 [inline]
    post_alloc_hook+0x1e6/0x210 mm/page_alloc.c:1536
    prep_new_page mm/page_alloc.c:1543 [inline]
    get_page_from_freelist+0x31db/0x3360 mm/page_alloc.c:3170
    __alloc_pages+0x255/0x670 mm/page_alloc.c:4426
    alloc_slab_page+0x6a/0x160 mm/slub.c:1870
    allocate_slab mm/slub.c:2017 [inline]
    new_slab+0x84/0x2f0 mm/slub.c:2070
    ___slab_alloc+0xc85/0x1310 mm/slub.c:3223
    __slab_alloc mm/slub.c:3322 [inline]
    __slab_alloc_node mm/slub.c:3375 [inline]
    slab_alloc_node mm/slub.c:3468 [inline]
    __kmem_cache_alloc_node+0x19d/0x270 mm/slub.c:3517
    kmalloc_trace+0x2a/0xe0 mm/slab_common.c:1098
    kmalloc include/linux/slab.h:600 [inline]
    kzalloc include/linux/slab.h:721 [inline]
    kernfs_fop_open+0x3e7/0xcc0 fs/kernfs/file.c:670
    do_dentry_open+0x8fd/0x1590 fs/open.c:948
    do_open fs/namei.c:3622 [inline]
    path_openat+0x2845/0x3280 fs/namei.c:3779
    do_filp_open+0x234/0x490 fs/namei.c:3809
    do_sys_openat2+0x13e/0x1d0 fs/open.c:1440
    do_sys_open fs/open.c:1455 [inline]
    __do_sys_openat fs/open.c:1471 [inline]
    __se_sys_openat fs/open.c:1466 [inline]
    __x64_sys_openat+0x247/0x290 fs/open.c:1466
    do_syscall_x64 arch/x86/entry/common.c:51 [inline]
    do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
    entry_SYSCALL_64_after_hwframe+0x63/0x6b
   page last free stack trace:
    reset_page_owner include/linux/page_owner.h:24 [inline]
    free_pages_prepare mm/page_alloc.c:1136 [inline]
    free_unref_page_prepare+0x8c3/0x9f0 mm/page_alloc.c:2312
    free_unref_page+0x37/0x3f0 mm/page_alloc.c:2405
    discard_slab mm/slub.c:2116 [inline]
    __unfreeze_partials+0x1dc/0x220 mm/slub.c:2655
    put_cpu_partial+0x17b/0x250 mm/slub.c:2731
    __slab_free+0x2b6/0x390 mm/slub.c:3679
    qlink_free mm/kasan/quarantine.c:166 [inline]
    qlist_free_all+0x75/0xe0 mm/kasan/quarantine.c:185
    kasan_quarantine_reduce+0x14b/0x160 mm/kasan/quarantine.c:292
    __kasan_slab_alloc+0x23/0x70 mm/kasan/common.c:305
    kasan_slab_alloc include/linux/kasan.h:188 [inline]
    slab_post_alloc_hook+0x67/0x3d0 mm/slab.h:762
    slab_alloc_node mm/slub.c:3478 [inline]
    slab_alloc mm/slub.c:3486 [inline]
    __kmem_cache_alloc_lru mm/slub.c:3493 [inline]
    kmem_cache_alloc+0x104/0x2c0 mm/slub.c:3502
    getname_flags+0xbc/0x4f0 fs/namei.c:140
    do_sys_openat2+0xd2/0x1d0 fs/open.c:1434
    do_sys_open fs/open.c:1455 [inline]
    __do_sys_openat fs/open.c:1471 [inline]
    __se_sys_openat fs/open.c:1466 [inline]
    __x64_sys_openat+0x247/0x290 fs/open.c:1466
    do_syscall_x64 arch/x86/entry/common.c:51 [inline]
    do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
    entry_SYSCALL_64_after_hwframe+0x63/0x6b

   Memory state around the buggy address:
    ffff888027e41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff888027e42000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   >ffff888027e42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                        ^
    ffff888027e42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff888027e42180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Reported-by: syzbot+e0b615318f8fcfc01ceb@syzkaller.appspotmail.com
Fixes: dce28769a3 ("btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()")
CC: stable@vger.kernel.org # 6.6
Link: https://lore.kernel.org/linux-btrfs/00000000000091a5b2060936bf6d@google.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-09 14:01:54 +01:00
Filipe Manana
cd63ffbd23 btrfs: fix error pointer dereference after failure to allocate fs devices
At device_list_add() we allocate a btrfs_fs_devices structure and then
before checking if the allocation failed (pointer is ERR_PTR(-ENOMEM)),
we dereference the error pointer in a memcpy() argument if the feature
BTRFS_FEATURE_INCOMPAT_METADATA_UUID is enabled.
Fix this by checking for an allocation error before trying the memcpy().

Fixes: f7361d8c3f ("btrfs: sipmlify uuid parameters of alloc_fs_devices()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:40:18 +01:00
Qu Wenruo
47e2b06b7b btrfs: make found_logical_ret parameter mandatory for function queue_scrub_stripe()
[BUG]
There is a compilation warning reported on commit ae76d8e3e1 ("btrfs:
scrub: fix grouping of read IO"), where gcc (14.0.0 20231022 experimental)
is reporting the following uninitialized variable:

  fs/btrfs/scrub.c: In function ‘scrub_simple_mirror.isra’:
  fs/btrfs/scrub.c:2075:29: error: ‘found_logical’ may be used uninitialized [-Werror=maybe-uninitialized[https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmaybe-uninitialized]]
   2075 |                 cur_logical = found_logical + BTRFS_STRIPE_LEN;
  fs/btrfs/scrub.c:2040:21: note: ‘found_logical’ was declared here
   2040 |                 u64 found_logical;
        |                     ^~~~~~~~~~~~~

[CAUSE]
This is a false alert, as @found_logical is passed as parameter
@found_logical_ret of function queue_scrub_stripe().

As long as queue_scrub_stripe() returned 0, we would update
@found_logical_ret.  And if queue_scrub_stripe() returned >0 or <0, the
caller would not utilized @found_logical, thus there should be nothing
wrong.

Although the triggering gcc is still experimental, it looks like the
extra check on "if (found_logical_ret)" can sometimes confuse the
compiler.

Meanwhile the only caller of queue_scrub_stripe() is always passing a
valid pointer, there is no need for such check at all.

[FIX]
Although the report itself is a false alert, we can still make it more
explicit by:

- Replace the check for @found_logical_ret with ASSERT()

- Initialize @found_logical to U64_MAX

- Add one extra ASSERT() to make sure @found_logical got updated

Link: https://lore.kernel.org/linux-btrfs/87fs1x1p93.fsf@gentoo.org/
Fixes: ae76d8e3e1 ("btrfs: scrub: fix grouping of read IO")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:40:15 +01:00
Josef Bacik
d8ba2a91fc btrfs: get correct owning_root when dropping snapshot
Dave reported a bug where we were aborting the transaction while trying
to cleanup the squota reservation for an extent.

This turned out to be because we're doing btrfs_header_owner(next) in
do_walk_down when we decide to free the block.  However in this code
block we haven't explicitly read next, so it could be stale.  We would
then get whatever garbage happened to be in the pages at this point.
The commit that introduced that is "btrfs: track owning root in
btrfs_ref".

Fix this by saving the owner_root when we do the
btrfs_lookup_extent_info().  We always do this in do_walk_down, it is
how we make the decision of whether or not to delete the block.  This is
cheap because we've already done the extent item lookup at this point,
so it's straightforward to just grab the owner root as well.

Then we can use this when deleting the metadata block without needing to
force a read of the extent buffer to find the owner.

This fixes the problem that Dave reported.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:39:06 +01:00
Naohiro Aota
776a838f1f btrfs: zoned: wait for data BG to be finished on direct IO allocation
Running the fio command below on a ZNS device results in "Resource
temporarily unavailable" error.

  $ sudo fio --name=w --directory=/mnt --filesize=1GB --bs=16MB --numjobs=16 \
        --rw=write --ioengine=libaio --iodepth=128 --direct=1

  fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=117440512, buflen=16777216
  fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=134217728, buflen=16777216
  ...

This happens because -EAGAIN error returned from btrfs_reserve_extent()
called from btrfs_new_extent_direct() is spilling over to the userland.

btrfs_reserve_extent() returns -EAGAIN when there is no active zone
available. Then, the caller should wait for some other on-going IO to
finish a zone and retry the allocation.

This logic is already implemented for buffered write in cow_file_range(),
but it is missing for the direct IO counterpart. Implement the same logic
for it.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 2ce543f478 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:39:01 +01:00
Naohiro Aota
dfcb03ae8a btrfs: zoned: drop no longer valid write pointer check
There is a check of the write pointer vs the zone size to reject an invalid
write pointer. However, as of now, we have RAID0/RAID10 on the zoned
mode, we can have a block group whose size is larger than the zone size.

As an equivalent check against the block group's zone_capacity is already
there, we can just drop this invalid check.

Fixes: 568220fa96 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:38:56 +01:00
Dan Carpenter
b8212814d1 btrfs: directly return 0 on no error code in btrfs_insert_raid_extent()
It's more obvious to return a literal zero instead of "return ret;".
Plus Smatch complains that ret could be uninitialized if the
ordered_extent->bioc_list list is empty and this silences that warning.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:38:51 +01:00
Filipe Manana
dec96fc2dc btrfs: use u64 for buffer sizes in the tree search ioctls
In the tree search v2 ioctl we use the type size_t, which is an unsigned
long, to track the buffer size in the local variable 'buf_size'. An
unsigned long is 32 bits wide on a 32 bits architecture. The buffer size
defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later
try to copy the local variable 'buf_size' to the argument struct, when
the search returns -EOVERFLOW, we copy only 32 bits which will be a
problem on big endian systems.

Fix this by using a u64 type for the buffer sizes, not only at
btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain
so that we can use the u64 at btrfs_ioctl_tree_search_v2().

Fixes: cc68a8a5a4 ("btrfs: new ioctl TREE_SEARCH_V2")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 16:38:45 +01:00
Linus Torvalds
ecae0bd517 Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
 
 - Kemeng Shi has contributed some compation maintenance work in the
   series "Fixes and cleanups to compaction".
 
 - Joel Fernandes has a patchset ("Optimize mremap during mutual
   alignment within PMD") which fixes an obscure issue with mremap()'s
   pagetable handling during a subsequent exec(), based upon an
   implementation which Linus suggested.
 
 - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
   following patch series:
 
 	mm/damon: misc fixups for documents, comments and its tracepoint
 	mm/damon: add a tracepoint for damos apply target regions
 	mm/damon: provide pseudo-moving sum based access rate
 	mm/damon: implement DAMOS apply intervals
 	mm/damon/core-test: Fix memory leaks in core-test
 	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
 
 - In the series "Do not try to access unaccepted memory" Adrian Hunter
   provides some fixups for the recently-added "unaccepted memory' feature.
   To increase the feature's checking coverage.  "Plug a few gaps where
   RAM is exposed without checking if it is unaccepted memory".
 
 - In the series "cleanups for lockless slab shrink" Qi Zheng has done
   some maintenance work which is preparation for the lockless slab
   shrinking code.
 
 - Qi Zheng has redone the earlier (and reverted) attempt to make slab
   shrinking lockless in the series "use refcount+RCU method to implement
   lockless slab shrink".
 
 - David Hildenbrand contributes some maintenance work for the rmap code
   in the series "Anon rmap cleanups".
 
 - Kefeng Wang does more folio conversions and some maintenance work in
   the migration code.  Series "mm: migrate: more folio conversion and
   unification".
 
 - Matthew Wilcox has fixed an issue in the buffer_head code which was
   causing long stalls under some heavy memory/IO loads.  Some cleanups
   were added on the way.  Series "Add and use bdev_getblk()".
 
 - In the series "Use nth_page() in place of direct struct page
   manipulation" Zi Yan has fixed a potential issue with the direct
   manipulation of hugetlb page frames.
 
 - In the series "mm: hugetlb: Skip initialization of gigantic tail
   struct pages if freed by HVO" has improved our handling of gigantic
   pages in the hugetlb vmmemmep optimizaton code.  This provides
   significant boot time improvements when significant amounts of gigantic
   pages are in use.
 
 - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
   rationalization and folio conversions in the hugetlb code.
 
 - Yin Fengwei has improved mlock()'s handling of large folios in the
   series "support large folio for mlock"
 
 - In the series "Expose swapcache stat for memcg v1" Liu Shixin has
   added statistics for memcg v1 users which are available (and useful)
   under memcg v2.
 
 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
   prctl so that userspace may direct the kernel to not automatically
   propagate the denial to child processes.  The series is named "MDWE
   without inheritance".
 
 - Kefeng Wang has provided the series "mm: convert numa balancing
   functions to use a folio" which does what it says.
 
 - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
   makes is possible for a process to propagate KSM treatment across
   exec().
 
 - Huang Ying has enhanced memory tiering's calculation of memory
   distances.  This is used to permit the dax/kmem driver to use "high
   bandwidth memory" in addition to Optane Data Center Persistent Memory
   Modules (DCPMM).  The series is named "memory tiering: calculate
   abstract distance based on ACPI HMAT"
 
 - In the series "Smart scanning mode for KSM" Stefan Roesch has
   optimized KSM by teaching it to retain and use some historical
   information from previous scans.
 
 - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
   series "mm: memcg: fix tracking of pending stats updates values".
 
 - In the series "Implement IOCTL to get and optionally clear info about
   PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
   us to atomically read-then-clear page softdirty state.  This is mainly
   used by CRIU.
 
 - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
   - a bunch of relatively minor maintenance tweaks to this code.
 
 - Matthew Wilcox has increased the use of the VMA lock over file-backed
   page faults in the series "Handle more faults under the VMA lock".  Some
   rationalizations of the fault path became possible as a result.
 
 - In the series "mm/rmap: convert page_move_anon_rmap() to
   folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
   and folio conversions.
 
 - In the series "various improvements to the GUP interface" Lorenzo
   Stoakes has simplified and improved the GUP interface with an eye to
   providing groundwork for future improvements.
 
 - Andrey Konovalov has sent along the series "kasan: assorted fixes and
   improvements" which does those things.
 
 - Some page allocator maintenance work from Kemeng Shi in the series
   "Two minor cleanups to break_down_buddy_pages".
 
 - In thes series "New selftest for mm" Breno Leitao has developed
   another MM self test which tickles a race we had between madvise() and
   page faults.
 
 - In the series "Add folio_end_read" Matthew Wilcox provides cleanups
   and an optimization to the core pagecache code.
 
 - Nhat Pham has added memcg accounting for hugetlb memory in the series
   "hugetlb memcg accounting".
 
 - Cleanups and rationalizations to the pagemap code from Lorenzo
   Stoakes, in the series "Abstract vma_merge() and split_vma()".
 
 - Audra Mitchell has fixed issues in the procfs page_owner code's new
   timestamping feature which was causing some misbehaviours.  In the
   series "Fix page_owner's use of free timestamps".
 
 - Lorenzo Stoakes has fixed the handling of new mappings of sealed files
   in the series "permit write-sealed memfd read-only shared mappings".
 
 - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
   series "Batch hugetlb vmemmap modification operations".
 
 - Some buffer_head folio conversions and cleanups from Matthew Wilcox in
   the series "Finish the create_empty_buffers() transition".
 
 - As a page allocator performance optimization Huang Ying has added
   automatic tuning to the allocator's per-cpu-pages feature, in the series
   "mm: PCP high auto-tuning".
 
 - Roman Gushchin has contributed the patchset "mm: improve performance
   of accounted kernel memory allocations" which improves their performance
   by ~30% as measured by a micro-benchmark.
 
 - folio conversions from Kefeng Wang in the series "mm: convert page
   cpupid functions to folios".
 
 - Some kmemleak fixups in Liu Shixin's series "Some bugfix about
   kmemleak".
 
 - Qi Zheng has improved our handling of memoryless nodes by keeping them
   off the allocation fallback list.  This is done in the series "handle
   memoryless nodes more appropriately".
 
 - khugepaged conversions from Vishal Moola in the series "Some
   khugepaged folio conversions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
 jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
 FgeUPAD1oasg6CP+INZvCj34waNxwAc=
 =E+Y4
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compation maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park i
     the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature. To increase the feature's checking coverage. 'Plug
     a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - In the series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' has improved our handling of gigantic
     pages in the hugetlb vmmemmep optimizaton code. This provides
     significant boot time improvements when significant amounts of
     gigantic pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance'

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes is possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In thes series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2023-11-02 19:38:47 -10:00
Linus Torvalds
d5acbc60fa for-6.7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU/xAEACgkQxWXV+ddt
 WDvYKg//SjTimA5Nins9mb4jdz8n+dDeZnQhKzy3FqInU41EzDRc4WwnEODmDlTa
 AyU9rGB3k0JNSUc075jZFCyLqq/ARiOqRi4x33Gk0ckIlc4X5OgBoqP2XkPh0VlP
 txskLCrmhc3pwyR4ErlFDX2jebIUXfkv39bJuE40grGvUatRe+WNq0ERIrgO8RAr
 Rc3hBotMH8AIqfD1L6j1ZiZIAyrOkT1BJMuqeoq27/gJZn/MRhM9TCrMTzfWGaoW
 SxPrQiCDEN3KECsOY/caroMn3AekDijg/ley1Nf7Z0N6oEV+n4VWWPBFE9HhRz83
 9fIdvSbGjSJF6ekzTjcVXPAbcuKZFzeqOdBRMIW3TIUo7mZQyJTVkMsc1y/NL2Z3
 9DhlRLIzvWJJjt1CEK0u18n5IU+dGngdktbhWWIuIlo8r+G/iKR/7zqU92VfWLHL
 Z7/eh6HgH5zr2bm+yKORbrUjkv4IVhGVarW8D4aM+MCG0lFN2GaPcJCCUrp4n7rZ
 PzpQbxXa38ANBk6hsp4ndS8TJSBL9moY8tumzLcKg97nzNMV6KpBdV/G6/QfRLCN
 3kM6UbwTAkMwGcQS86Mqx6s04ORLnQeD6f7N6X4Ppx0Mi/zkjI2HkRuvQGp12B0v
 iZjCCZAYY2Iu+/TU0GrCXSss/grzIAUPzM9msyV3XGO/VBpwdec=
 =9TVx
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "New features:

   - raid-stripe-tree

     New tree for logical file extent mapping where the physical mapping
     may not match on multiple devices. This is now used in zoned mode
     to implement RAID0/RAID1* profiles, but can be used in non-zoned
     mode as well. The support for RAID56 is in development and will
     eventually fix the problems with the current implementation. This
     is a backward incompatible feature and has to be enabled at mkfs
     time.

   - simple quota accounting (squota)

     A simplified mode of qgroup that accounts all space on the initial
     extent owners (a subvolume), the snapshots are then cheap to create
     and delete. The deletion of snapshots in fully accounting qgroups
     is a known CPU/IO performance bottleneck.

     The squota is not suitable for the general use case but works well
     for containers where the original subvolume exists for the whole
     time. This is a backward incompatible feature as it needs extending
     some structures, but can be enabled on an existing filesystem.

   - temporary filesystem fsid (temp_fsid)

     The fsid identifies a filesystem and is hard coded in the
     structures, which disallows mounting the same fsid found on
     different devices.

     For a single device filesystem this is not strictly necessary, a
     new temporary fsid can be generated on mount e.g. after a device is
     cloned. This will be used by Steam Deck for root partition A/B
     testing, or can be used for VM root images.

  Other user visible changes:

   - filesystems with partially finished metadata_uuid conversion cannot
     be mounted anymore and the uuid fixup has to be done by btrfs-progs
     (btrfstune).

  Performance improvements:

   - reduce reservations for checksum deletions (with enabled free space
     tree by factor of 4), on a sample workload on file with many
     extents the deletion time decreased by 12%

   - make extent state merges more efficient during insertions, reduce
     rb-tree iterations (run time of critical functions reduced by 5%)

  Core changes:

   - the integrity check functionality has been removed, this was a
     debugging feature and removal does not affect other integrity
     checks like checksums or tree-checker

   - space reservation changes:

      - more efficient delayed ref reservations, this avoids building up
        too much work or overusing or exhausting the global block
        reserve in some situations

      - move delayed refs reservation to the transaction start time,
        this prevents some ENOSPC corner cases related to exhaustion of
        global reserve

      - improvements in reducing excessive reservations for block group
        items

      - adjust overcommit logic in near full situations, account for one
        more chunk to eventually allocate metadata chunk, this is mostly
        relevant for small filesystems (<10GiB)

   - single device filesystems are scanned but not registered (except
     seed devices), this allows temp_fsid to work

   - qgroup iterations do not need GFP_ATOMIC allocations anymore

   - cleanups, refactoring, reduced data structure size, function
     parameter simplifications, error handling fixes"

* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
  btrfs: open code timespec64 in struct btrfs_inode
  btrfs: remove redundant log root tree index assignment during log sync
  btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
  btrfs: sysfs: show temp_fsid feature
  btrfs: disable the device add feature for temp-fsid
  btrfs: disable the seed feature for temp-fsid
  btrfs: update comment for temp-fsid, fsid, and metadata_uuid
  btrfs: remove pointless empty log context list check when syncing log
  btrfs: update comment for struct btrfs_inode::lock
  btrfs: remove pointless barrier from btrfs_sync_file()
  btrfs: add and use helpers for reading and writing last_trans_committed
  btrfs: add and use helpers for reading and writing fs_info->generation
  btrfs: add and use helpers for reading and writing log_transid
  btrfs: add and use helpers for reading and writing last_log_commit
  btrfs: support cloned-device mount capability
  btrfs: add helper function find_fsid_by_disk
  btrfs: stop reserving excessive space for block group item insertions
  btrfs: stop reserving excessive space for block group item updates
  btrfs: reorder btrfs_inode to fill gaps
  btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
  ...
2023-10-30 10:42:06 -10:00
Linus Torvalds
14ab6d425e vfs-6.7.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
 okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
 YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
 =m4pB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs inode time accessor updates from Christian Brauner:
 "This finishes the conversion of all inode time fields to accessor
  functions as discussed on list. Changing timestamps manually as we
  used to do before is error prone. Using accessors function makes this
  robust.

  It does not contain the switch of the time fields to discrete 64 bit
  integers to replace struct timespec and free up space in struct inode.
  But after this, the switch can be trivially made and the patch should
  only affect the vfs if we decide to do it"

* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
  fs: rename inode i_atime and i_mtime fields
  security: convert to new timestamp accessors
  selinux: convert to new timestamp accessors
  apparmor: convert to new timestamp accessors
  sunrpc: convert to new timestamp accessors
  mm: convert to new timestamp accessors
  bpf: convert to new timestamp accessors
  ipc: convert to new timestamp accessors
  linux: convert to new timestamp accessors
  zonefs: convert to new timestamp accessors
  xfs: convert to new timestamp accessors
  vboxsf: convert to new timestamp accessors
  ufs: convert to new timestamp accessors
  udf: convert to new timestamp accessors
  ubifs: convert to new timestamp accessors
  tracefs: convert to new timestamp accessors
  sysv: convert to new timestamp accessors
  squashfs: convert to new timestamp accessors
  server: convert to new timestamp accessors
  client: convert to new timestamp accessors
  ...
2023-10-30 09:47:13 -10:00
Linus Torvalds
7352a6765c vfs-6.7.xattr
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc
 okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y
 qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0=
 =2PXg
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs xattr updates from Christian Brauner:
 "The 's_xattr' field of 'struct super_block' currently requires a
  mutable table of 'struct xattr_handler' entries (although each handler
  itself is const). However, no code in vfs actually modifies the
  tables.

  This changes the type of 's_xattr' to allow const tables, and modifies
  existing file systems to move their tables to .rodata. This is
  desirable because these tables contain entries with function pointers
  in them; moving them to .rodata makes it considerably less likely to
  be modified accidentally or maliciously at runtime"

* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
  const_structs.checkpatch: add xattr_handler
  net: move sockfs_xattr_handlers to .rodata
  shmem: move shmem_xattr_handlers to .rodata
  overlayfs: move xattr tables to .rodata
  xfs: move xfs_xattr_handlers to .rodata
  ubifs: move ubifs_xattr_handlers to .rodata
  squashfs: move squashfs_xattr_handlers to .rodata
  smb: move cifs_xattr_handlers to .rodata
  reiserfs: move reiserfs_xattr_handlers to .rodata
  orangefs: move orangefs_xattr_handlers to .rodata
  ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
  ntfs3: move ntfs_xattr_handlers to .rodata
  nfs: move nfs4_xattr_handlers to .rodata
  kernfs: move kernfs_xattr_handlers to .rodata
  jfs: move jfs_xattr_handlers to .rodata
  jffs2: move jffs2_xattr_handlers to .rodata
  hfsplus: move hfsplus_xattr_handlers to .rodata
  hfs: move hfs_xattr_handlers to .rodata
  gfs2: move gfs2_xattr_handlers_max to .rodata
  fuse: move fuse_xattr_handlers to .rodata
  ...
2023-10-30 09:29:44 -10:00
Linus Torvalds
d4e175f2c4 vfs-6.7.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZT0C2gAKCRCRxhvAZXjc
 otV8AQCK5F9ONoQ7ISpdrKyUJiswySGXx0CYPfXbSg5gHH87zgEAua3vwVKeGXXF
 5iVsdiNzIIQDwGDx7FyxufL4ggcN6gQ=
 =E1kV
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs superblock updates from Christian Brauner:
 "This contains the work to make block device opening functions return a
  struct bdev_handle instead of just a struct block_device. The same
  struct bdev_handle is then also passed to block device closing
  functions.

  This allows us to propagate context from opening to closing a block
  device without having to modify all users everytime.

  Sidenote, in the future we might even want to try and have block
  device opening functions return a struct file directly but that's a
  series on top of this.

  These are further preparatory changes to be able to count writable
  opens and blocking writes to mounted block devices. That's a separate
  piece of work for next cycle and for that we absolutely need the
  changes to btrfs that have been quietly dropped somehow.

  Originally the series contained a patch that removed the old
  blkdev_*() helpers. But since this would've caused needles churn in
  -next for bcachefs we ended up delaying it.

  The second piece of work addresses one of the major annoyances about
  the work last cycle, namely that we required dropping s_umount
  whenever we used the superblock and fs_holder_ops for a block device.

  The reason for that requirement had been that in some codepaths
  s_umount could've been taken under disk->open_mutex (that's always
  been the case, at least theoretically). For example, on surprise block
  device removal or media change. And opening and closing block devices
  required grabbing disk->open_mutex as well.

  So we did the work and went through the block layer and fixed all
  those places so that s_umount is never taken under disk->open_mutex.
  This means no more brittle games where we yield and reacquire s_umount
  during block device opening and closing and no more requirements where
  block devices need to be closed. Filesystems don't need to care about
  this.

  There's a bunch of other follow-up work such as moving block device
  freezing and thawing to holder operations which makes it work for all
  block devices and not just the main block device just as we did for
  surprise removal. But that is for next cycle.

  Tested with fstests for all major fses, blktests, LTP"

* tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
  porting: update locking requirements
  fs: assert that open_mutex isn't held over holder ops
  block: assert that we're not holding open_mutex over blk_report_disk_dead
  block: move bdev_mark_dead out of disk_check_media_change
  block: WARN_ON_ONCE() when we remove active partitions
  block: simplify bdev_del_partition()
  fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
  jfs: fix log->bdev_handle null ptr deref in lbmStartIO
  bcache: Fixup error handling in register_cache()
  xfs: Convert to bdev_open_by_path()
  reiserfs: Convert to bdev_open_by_dev/path()
  ocfs2: Convert to use bdev_open_by_dev()
  nfs/blocklayout: Convert to use bdev_open_by_dev/path()
  jfs: Convert to bdev_open_by_dev()
  f2fs: Convert to bdev_open_by_dev/path()
  ext4: Convert to bdev_open_by_dev()
  erofs: Convert to use bdev_open_by_path()
  btrfs: Convert to bdev_open_by_path()
  fs: Convert to bdev_open_by_dev()
  mm/swap: Convert to use bdev_open_by_dev()
  ...
2023-10-30 08:59:05 -10:00