These new trace points record xarray indices and the time of
endpoint registration and unregistration, to co-ordinate with
device removal events.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc
oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns
M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg=
=HZBL
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"VFS:
- Fix the name of file lease slab cache. When file leases were split
out of file locks the name of the file lock slab cache was used for
the file leases slab cache as well.
- Fix a type in take_fd() helper.
- Fix infinite directory iteration for stable offsets in tmpfs.
- When the icache is pruned all reclaimable inodes are marked with
I_FREEING and other processes that try to lookup such inodes will
block.
But some filesystems like ext4 can trigger lookups in their inode
evict callback causing deadlocks. Ext4 does such lookups if the
ea_inode feature is used whereby a separate inode may be used to
store xattrs.
Introduce I_LRU_ISOLATING which pins the inode while its pages are
reclaimed. This avoids inode deletion during inode_lru_isolate()
avoiding the deadlock and evict is made to wait until
I_LRU_ISOLATING is done.
netfs:
- Fault in smaller chunks for non-large folio mappings for
filesystems that haven't been converted to large folios yet.
- Fix the CONFIG_NETFS_DEBUG config option. The config option was
renamed a short while ago and that introduced two minor issues.
First, it depended on CONFIG_NETFS whereas it wants to depend on
CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
does. Second, the documentation for the config option wasn't fixed
up.
- Revert the removal of the PG_private_2 writeback flag as ceph is
using it and fix how that flag is handled in netfs.
- Fix DIO reads on 9p. A program watching a file on a 9p mount
wouldn't see any changes in the size of the file being exported by
the server if the file was changed directly in the source
filesystem. Fix this by attempting to read the full size specified
when a DIO read is requested.
- Fix a NULL pointer dereference bug due to a data race where a
cachefiles cookies was retired even though it was still in use.
Check the cookie's n_accesses counter before discarding it.
nsfs:
- Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
the kernel is writing to userspace.
pidfs:
- Prevent the creation of pidfds for kthreads until we have a
use-case for it and we know the semantics we want. It also confuses
userspace why they can get pidfds for kthreads.
squashfs:
- Fix an unitialized value bug reported by KMSAN caused by a
corrupted symbolic link size read from disk. Check that the
symbolic link size is not larger than expected"
* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Squashfs: sanity check symbolic link size
9p: Fix DIO read through netfs
vfs: Don't evict inode under the inode lru traversing context
netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
file: fix typo in take_fd() comment
pidfd: prevent creation of pidfds for kthreads
netfs: clean up after renaming FSCACHE_DEBUG config
libfs: fix infinite directory reads for offset dir
nsfs: fix ioctl declaration
fs/netfs/fscache_cookie: add missing "n_accesses" check
filelock: fix name of file_lease slab cache
netfs: Fault in smaller chunks for non-large folio mappings
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
correctly. The problem is that we try to set them up in the request
initialisation, but we the cache may be in the process of setting up still,
and so the state may not be correct. Further, we secondarily sample the
cache state and make contradictory decisions later.
The issue arises because we set up the cache resources, which allows the
cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
triggers cache writing even if we didn't set the flags when allocating.
Fix this in the following way:
(1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
->init_request() rather than trying to juggle that in
netfs_alloc_request().
(2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
to be done, then PG_private_2 is to be used rather than only setting
it if we decide to cache and then having netfs_rreq_unlock_folios()
set the non-PG_private_2 writeback-to-cache if it wasn't set.
(3) Split netfs_rreq_unlock_folios() into two functions, one of which
contains the deprecated code for using PG_private_2 to avoid
accidentally doing the writeback path - and always use it if
USE_PGPRIV2 is set.
(4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
wait for PG_private_2. This function is deprecated and only used by
ceph anyway, and so label it so.
(5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
fscache_operation_valid() on the cache_resources instead. This has
the advantage of picking up the result of netfs_begin_cache_read() and
fscache_begin_write_operation() - which are called after the object is
initialised and will wait for the cache to come to a usable state.
Just reverting ae678317b95e[1] isn't a sufficient fix, so this need to be
applied on top of that. Without this as well, things like:
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {
and:
WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386
may happen, along with some UAFs due to PG_private_2 not getting used to
wait on writeback completion.
Fixes: 2ff1e97587 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Hristo Venev <hristo@venev.name>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
This reverts commit ae678317b9.
Revert the patch that removes the deprecated use of PG_private_2 in
netfslib for the moment as Ceph is actually still using this to track
data copied to the cache.
Fixes: ae678317b9 ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
https: //lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
Current release - regressions:
- core: drop bad gso csum_start and offset in virtio_net_hdr
- wifi: mt76: fix null pointer access in mt792x_mac_link_bss_remove
- eth: tun: add missing bpf_net_ctx_clear() in do_xdp_generic()
- phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and aqr115c
Current release - new code bugs:
- smc: prevent UAF in inet_create()
- bluetooth: btmtk: fix kernel crash when entering btmtk_usb_suspend
- eth: bnxt: reject unsupported hash functions
Previous releases - regressions:
- sched: act_ct: take care of padding in struct zones_ht_key
- netfilter: fix null-ptr-deref in iptable_nat_table_init().
- tcp: adjust clamping window for applications specifying SO_RCVBUF
Previous releases - always broken:
- ethtool: rss: small fixes to spec and GET
- mptcp:
- fix signal endpoint re-add
- pm: fix backup support in signal endpoints
- wifi: ath12k: fix soft lockup on suspend
- eth: bnxt_en: fix RSS logic in __bnxt_reserve_rings()
- eth: ice: fix AF_XDP ZC timeout and concurrency issues
- eth: mlx5:
- fix missing lock on sync reset reload
- fix error handling in irq_pool_request_irq
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmarelYSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkdPwP/2lxh5Cc/SK/mJjBvyBdO2+cuNR0M4Kf
UV2PA4oOLREYXEPgmOtJQ/VcsmOLa1pEPAdJarZwB5ztalgRKkIogHzzjfY43Fmx
rAgZqGnIJrWRtepDM8jAaEJC0bEKywH5Wo6eh+oi0GCS07B48lpYATI/1gQdwBjV
CgcZTQd/04PVx69Bi8LiQyfbwppAsIQQa9YaGmqGuQa74Hp9gz+4VyeRFg54h3CP
6fWwRHNVO8GsGNA1UgWbeXXajhUU+AG/gDThqIcgxs3KmrREzU9EvcQ70XCzphOA
JoUy9yykWRGen7aFGrggfY4NzjQmL6g+/rCvbIMfidRsJKBaQYBeMUkbQRnAh34V
Pe3aSBEnv1aBKaQA7yntdqYGRJ2Sz56a1kjCvI86eDjExt4UshbZi+TfuQSj6zAY
/ejOawhEYPFZw2FlvkBetyck7iroG1404DoBPghoRu9dG2e3p0eJOZfXiEzfS2qB
PsJtMPiexSdEcY3sxVKOMh4hx0Zjkqest7laitb1Lrbg5pLhEiHvDkyhoUPGI2oa
a3N4rsBc6sgSTQfJsx4nXFfKfNQsNu2Nr308BDk16XOHZ4J7Hgt6xR6STDo9ACz1
Gy5munCN2AhGSdhR5niFI3ocNpDM5oWkztBfjz7YmIQv18NcU5nO8ByYytXMbglq
sSsnR+VbYeCu
=z53B
-----END PGP SIGNATURE-----
Merge tag 'net-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from wireless, bleutooth, BPF and netfilter.
Current release - regressions:
- core: drop bad gso csum_start and offset in virtio_net_hdr
- wifi: mt76: fix null pointer access in mt792x_mac_link_bss_remove
- eth: tun: add missing bpf_net_ctx_clear() in do_xdp_generic()
- phy: aquantia: only poll GLOBAL_CFG regs on aqr113, aqr113c and
aqr115c
Current release - new code bugs:
- smc: prevent UAF in inet_create()
- bluetooth: btmtk: fix kernel crash when entering btmtk_usb_suspend
- eth: bnxt: reject unsupported hash functions
Previous releases - regressions:
- sched: act_ct: take care of padding in struct zones_ht_key
- netfilter: fix null-ptr-deref in iptable_nat_table_init().
- tcp: adjust clamping window for applications specifying SO_RCVBUF
Previous releases - always broken:
- ethtool: rss: small fixes to spec and GET
- mptcp:
- fix signal endpoint re-add
- pm: fix backup support in signal endpoints
- wifi: ath12k: fix soft lockup on suspend
- eth: bnxt_en: fix RSS logic in __bnxt_reserve_rings()
- eth: ice: fix AF_XDP ZC timeout and concurrency issues
- eth: mlx5:
- fix missing lock on sync reset reload
- fix error handling in irq_pool_request_irq"
* tag 'net-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
mptcp: fix duplicate data handling
mptcp: fix bad RCVPRUNED mib accounting
ipv6: fix ndisc_is_useropt() handling for PIO
igc: Fix double reset adapter triggered from a single taprio cmd
net: MAINTAINERS: Demote Qualcomm IPA to "maintained"
net: wan: fsl_qmc_hdlc: Discard received CRC
net: wan: fsl_qmc_hdlc: Convert carrier_lock spinlock to a mutex
net/mlx5e: Add a check for the return value from mlx5_port_set_eth_ptys
net/mlx5e: Fix CT entry update leaks of modify header context
net/mlx5e: Require mlx5 tc classifier action support for IPsec prio capability
net/mlx5: Fix missing lock on sync reset reload
net/mlx5: Lag, don't use the hardcoded value of the first port
net/mlx5: DR, Fix 'stack guard page was hit' error in dr_rule
net/mlx5: Fix error handling in irq_pool_request_irq
net/mlx5: Always drain health in shutdown callback
net: Add skbuff.h to MAINTAINERS
r8169: don't increment tx_dropped in case of NETDEV_TX_BUSY
netfilter: iptables: Fix potential null-ptr-deref in ip6table_nat_table_init().
netfilter: iptables: Fix null-ptr-deref in iptable_nat_table_init().
net: drop bad gso csum_start and offset in virtio_net_hdr
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmapmOQACgkQxWXV+ddt
WDsXVhAAi4X+xt3o4jcN3IAu08JCQAAyXnFWC3lvn7sqYjSrcccI6ZT4/gAbHss+
qrifakRGoYQ7fAjYBmhw48HqPmHtI2OQjIUDaIqHQOS68aXShBo9HiE460HRY4GT
QV/KT0w37E2/R0EDR9gyjLq3ZA3/raxN1n+LNCFhRWmtsAEZrk4XzsADWb05YkIq
1QBa92DzEhVpd04X8YHIYBgRidWbcYST6xhoWdyL9VZ1pzZsISq5LH67D4f/J1KU
gXNf+ZnF9DXsQnptJrMsjhx61seJ2F0/vozFZ+l6SjRr0jeysmrJI0dxqQc/hUga
gbLmdha6ztKdn03JOIL+lfdZYzICFl/2fekSWI2SNcag+TYszACjlFOyHusOgKsa
3qQwzVB699FheWO5nrOOvOtgq0ZqGsrIvhIXLhA7/bVpNavPnUB7IQCcs8n89ImQ
hUIebfX1FZnYXTrB6Hhm92LUb0lyLSlW1we3SSmaAMiy1TiXHG7hO2G/sIbOPAJC
5VzdHf0DEjzEdjmTrGOV7JBfy5JmMK56oN8viZS95p70DYxNGvEOhLs/8n5twpri
MWV8GElcOjjC+KnGnUH72spsnEKONpdzyccG9kiZEgkEi4csgHSxrkSmAehYD6i6
MFYk+i7jvZ1VsbOulmdGOLbHS7whxi9pWb/CT3KKF1Ei5/v07bU=
=JdOX
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix regression in extent map rework when handling insertion of
overlapping compressed extent
- fix unexpected file length when appending to a file using direct io
and buffer not faulted in
- in zoned mode, fix accounting of unusable space when flipping
read-only block group back to read-write
- fix page locking when COWing an inline range, assertion failure found
by syzbot
- fix calculation of space info in debugging print
- tree-checker, add validation of data reference item
- fix a few -Wmaybe-uninitialized build warnings
* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
btrfs: fix corruption after buffer fault in during direct IO append write
btrfs: zoned: fix zone_unusable accounting on making block group read-write again
btrfs: do not subtract delalloc from avail bytes
btrfs: make cow_file_range_inline() honor locked_page on error
btrfs: fix corrupt read due to bad offset of a compressed extent map
btrfs: tree-checker: validate dref root and objectid
The 'mptcp_subflow_context' structure has two items related to the
backup flags:
- 'backup': the subflow has been marked as backup by the other peer
- 'request_bkup': the backup flag has been set by the host
Before this patch, the scheduler was only looking at the 'backup' flag.
That can make sense in some cases, but it looks like that's not what we
wanted for the general use, because either the path-manager was setting
both of them when sending an MP_PRIO, or the receiver was duplicating
the 'backup' flag in the subflow request.
Note that the use of these two flags in the path-manager are going to be
fixed in the next commits, but this change here is needed not to modify
the behaviour.
Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When btrfs makes a block group read-only, it adds all free regions in the
block group to space_info->bytes_readonly. That free space excludes
reserved and pinned regions. OTOH, when btrfs makes the block group
read-write again, it moves all the unused regions into the block group's
zone_unusable. That unused region includes reserved and pinned regions.
As a result, it counts too much zone_unusable bytes.
Fortunately (or unfortunately), having erroneous zone_unusable does not
affect the calculation of space_info->bytes_readonly, because free
space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
the erroneous zone_unusable and it reduces the num_bytes just to cancel the
error.
This behavior can be easily discovered by adding a WARN_ON to check e.g,
"bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test
case like btrfs/282.
Fix it by properly considering pinned and reserved in
btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Stop testing the group->parent pointer as it is not guaranteed to be
stable over a chain of operations by design. This includes a warning
which would be nice to have but it produces false positives due to
the racy nature of the check.
- Plug a race between CPUs going in and out of idle and a CPU hotplug
operation. The latter can create and connect a new hierarchy level
which is missed in the concurrent updates of CPUs which go into idle.
As a result the events of such a CPU might not be processed and
timers go stale.
Cure it by splitting the hotplug operation into a prepare and online
callback. The prepare callback is guaranteed to run on an online and
therefore active CPU. This CPU updates the hierarchy and being online
ensures that there is always at least one migrator active which
handles the modified hierarchy correctly when going idle. The online
callback which runs on the incoming CPU then just marks the CPU
active and brings it into operation.
- Improve tracing and polish the code further so it is more obvious
what's going on.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmajmXYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWaGD/4iAuj0S3AQ+odB/Fkg1cvCF8YbOJGy
PEMSDnYtW7ErShEwVQMsWnXCFhIMuqMN0KIzOChpya2ZkmdT47mNwKuwdOMEQZzg
dKhmqGWR+sk4LMszCFbf5u6JjJeQ+nnxthgJ1IieJwC4VEfRceXYk7ng6Wvu1+lU
JEIukUh9nRJWma7FYW8MeNZ4lJGdvawZ5UjUAkPtzeKWn0+0/oqV5t1c8E/1jBbi
sKZW2soL716Xd/3QJUKKmtAcH7yDFwq5AY5bJURr5ztJw/yr2loVdvAPEdn/RQ6f
fzN/J+nu2ig14g/QvhI8Ke+HbHJEZHpo6simSZRbdaqnCX3R/lYZwLDe7EGjqKIb
0slsx2V1UxQ+qIYRplrtr/HGChjG/mXDLPIWRWjsiUAqyygy6QtUIko9AuH99Kd6
7cBjOzajKIAA/J9SUD03VgjXcQ53bW64NMe2pOX9ED1mbfmmu/ROd0neOgksKw5o
G5XQ+T6tNOoHMzJgv4R8PiViVdrf53A/g1wYTY1RR3XI8IWpounkyDExDvbtigGo
N+reKoawDGpXeMAByO2E6UDFNA05NYPjlvSrzTS5ywwyF1qCowKI1Qyup9wA/God
WJvfesmOJHtfcuUcVZf6Pm6+otJiKrT3reauFd6laEbyRuGTKvtNN/pQa6yxiZzT
FTxZKpcYkPE76g==
=G1Lg
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer migration updates from Thomas Gleixner:
"Fixes and minor updates for the timer migration code:
- Stop testing the group->parent pointer as it is not guaranteed to
be stable over a chain of operations by design.
This includes a warning which would be nice to have but it produces
false positives due to the racy nature of the check.
- Plug a race between CPUs going in and out of idle and a CPU hotplug
operation. The latter can create and connect a new hierarchy level
which is missed in the concurrent updates of CPUs which go into
idle. As a result the events of such a CPU might not be processed
and timers go stale.
Cure it by splitting the hotplug operation into a prepare and
online callback. The prepare callback is guaranteed to run on an
online and therefore active CPU. This CPU updates the hierarchy and
being online ensures that there is always at least one migrator
active which handles the modified hierarchy correctly when going
idle. The online callback which runs on the incoming CPU then just
marks the CPU active and brings it into operation.
- Improve tracing and polish the code further so it is more obvious
what's going on"
* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix grammar in comment
timers/migration: Spare write when nothing changed
timers/migration: Rename childmask by groupmask to make naming more obvious
timers/migration: Read childmask and parent pointer in a single place
timers/migration: Use a single struct for hierarchy walk data
timers/migration: Improve tracing
timers/migration: Move hierarchy setup into cpuhotplug prepare callback
timers/migration: Do not rely always on group->parent
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmaarzgACgkQSfxwEqXe
A66ZWBAAlhXx8bve0uKlDRK8fffWHgruho/fOY4lZJ137AKwA9JCtmOyqdfL4Dmk
VxFe7pEQJlQhcA/6kH54uO7SBXwfKlKZJth6SYnaCRMUIbFifHjjIQ0QqldjEKi0
rP90Hu4FVsbwQC7u9i9lQj9n2P36zb6pn83BzpZQ/2PtoVCSCrdSJUe0Rxa3H3GN
0+nNkDSXQt5otCByLaeE3x7KJgXLWL9+G2eFSFLTZ8rSVfMx1CdOIAG37WlLGdWm
BaFYPDKMyBTVvVJBNgAe9YSqtrsZ5nlmLz+Z9wAe/hTL7RlL03kWUu34/Udcpull
zzMDH0WMntiGK3eFQ2gOYSWqypvAjwHgn3BzqNmjUb69+89mZsdU1slcvnxWsUwU
D3vphrscaqarF629tfsXti3jc5PoXwUTjROZVcCyeFPBhyAZgzK8xUvPpJO+RT+K
EuUABob9cpA6FCpW/QeolDmMDhXlNT8QgsZu1juokZac2xP3Ly3REyEvT7HLbU2W
ZJjbEqm1ppp3RmGELUOJbyhwsLrnbt+OMDO7iEWoG8aSFK4diBK/ZM6WvLMkr8Oi
7ioXGIsYkCy3c47wpZKTrAapOPJp5keqNAiHSEbXw8mozp6429QAEZxNOcczgHKC
Ea2JzRkctqutcIT+Slw/uUe//i1iSsIHXbE81fp5udcQTJcUByo=
=P8aI
-----END PGP SIGNATURE-----
Merge tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"This adds getrandom() support to the vDSO.
First, it adds a new kind of mapping to mmap(2), MAP_DROPPABLE, which
lets the kernel zero out pages anytime under memory pressure, which
enables allocating memory that never gets swapped to disk but also
doesn't count as being mlocked.
Then, the vDSO implementation of getrandom() is introduced in a
generic manner and hooked into random.c.
Next, this is implemented on x86. (Also, though it's not ready for
this pull, somebody has begun an arm64 implementation already)
Finally, two vDSO selftests are added.
There are also two housekeeping cleanup commits"
* tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
MAINTAINERS: add random.h headers to RNG subsection
random: note that RNDGETPOOL was removed in 2.6.9-rc2
selftests/vDSO: add tests for vgetrandom
x86: vdso: Wire up getrandom() vDSO implementation
random: introduce generic vDSO getrandom() implementation
mm: add MAP_DROPPABLE for designating always lazily freeable mappings
childmask in the group reflects the mask that is required to 'reference'
this group in the parent. When reading childmask, this might be confusing,
as this suggests, that this is the mask of the child of the group.
Clarify this by renaming childmask in the tmigr_group and tmc_group by
groupmask.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-6-757baa7803fe@linutronix.de
walkers") is known to cause a performance regression
(https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
Yu has a fix which I'll send along later via the hotfixes branch.
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability of
cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of the
zeropage in MAP_SHARED mappings. I don't see any runtime effects here -
more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Is anyone reading this stuff? If so, email me!
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large folio
userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's self
testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are
"mm: memcg: separate legacy cgroup v1 code and put under config
option" and
"mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of excessive
correctable memory errors. In order to permit userspace to monitor and
handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from migrate
folio" teaches the kernel to appropriately handle migration from
poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
refcount increments. So these paes can first be moved aside if they
reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
for much faster reading of vma information. The series is "query VMAs
from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance Yang
improves the kernel's presentation of developer information related to
multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and not
very useful feature from slab fault injection.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
=z0Lf
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My
bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability
of cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache
index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of
the zeropage in MAP_SHARED mappings. I don't see any runtime effects
here - more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
of higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
the series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
has simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large
folio userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's
self testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
under config option" and "mm: memcg: put cgroup v1-specific memcg
data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of
excessive correctable memory errors. In order to permit userspace to
monitor and handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from
migrate folio" teaches the kernel to appropriately handle migration
from poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory
utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than
bare refcount increments. So these paes can first be moved aside if
they reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to
/proc/pid/maps for much faster reading of vma information. The series
is "query VMAs from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance
Yang improves the kernel's presentation of developer information
related to multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and
not very useful feature from slab fault injection.
* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
mm/mglru: fix ineffective protection calculation
mm/zswap: fix a white space issue
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix possible recursive locking detected warning
mm/gup: clear the LRU flag of a page before adding to LRU batch
mm/numa_balancing: teach mpol_to_str about the balancing mode
mm: memcg1: convert charge move flags to unsigned long long
alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
lib: reuse page_ext_data() to obtain codetag_ref
lib: add missing newline character in the warning message
mm/mglru: fix overshooting shrinker memory
mm/mglru: fix div-by-zero in vmpressure_calc_level()
mm/kmemleak: replace strncpy() with strscpy()
mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
mm: ignore data-race in __swap_writepage
hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
mm: shmem: rename mTHP shmem counters
mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
mm/migrate: putback split folios when numa hint migration fails
...
There are many line changes for FireWire subsystem, while there is no
functional change for practical use. The most of changes are for code
refactoring, some KUnit tests to added helper functions, and new
tracepoints events for both the core functions and 1394 OHCI driver.
The tracepoints events now cover the verbose logging enabled by debug
parameter of firewire-ohci kernel module. The parameter would be removed
in any future timing, thus it is now deprecated.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQE66IEYNDXNBPeGKSsLtaWM8LwEwUCZptYCwAKCRCsLtaWM8Lw
E169AQCAR9DZxgflPH03bW8ftfS0crwqdYTUpRYIkwBIXzYl4AD+O2bOB28Xn+U+
W4Inye792FjBCRtStp7PsJyhdVYBUws=
=uFe6
-----END PGP SIGNATURE-----
Merge tag 'firewire-updates-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire updates from Takashi Sakamoto:
"There are many lines of changes for FireWire subsystem, but there is
practically no functional change.
Most of the changes are for code refactoring, some KUnit tests to
added helper functions, and new tracepoints events for both the core
functions and 1394 OHCI driver.
The tracepoints events now cover the verbose logging enabled by debug
parameter of firewire-ohci kernel module. The parameter would be
removed in any future timing, thus it is now deprecated"
* tag 'firewire-updates-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394: (32 commits)
firewire: core: move copy_port_status() helper function to TP_fast_assign() block
Revert "firewire: ohci: use common macro to interpret be32 data in le32 buffer"
firewire: ohci: add tracepoints event for data of Self-ID DMA
firewire: ohci: use inline functions to operate data of self-ID DMA
firewire: ohci: add static inline functions to deserialize for Self-ID DMA operation
firewire: ohci: use static function to handle endian issue on PowerPC platform
firewire: ohci: use common macro to interpret be32 data in le32 buffer
firewire: core: Fix spelling mistakes in tracepoint messages
firewire: ohci: add tracepoints event for hardIRQ event
firewire: ohci: add support for Linux kernel tracepoints
firewire: core: add tracepoints events for completions of packets in isochronous context
firewire: core: add tracepoints events for queueing packets of isochronous context
firewire: core: add tracepoints events for flushing completions of isochronous context
firewire: core: add tracepoints events for flushing of isochronous context
firewire: core: add tracepoints events for starting/stopping of isochronous context
firewire: core: add tracepoints events for setting channels of multichannel context
firewire: core: add tracepoints events for allocation/deallocation of isochronous context
firewire: core: undefine macros after use in tracepoints events
firewire: core: record card index in tracepoints event for self ID sequence
firewire: core: use inline helper functions to serialize phy config packet
...
The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:
- It shouldn't be written to core dumps.
* Easy: VM_DONTDUMP.
- It should be zeroed on fork.
* Easy: VM_WIPEONFORK.
- It shouldn't be written to swap.
* Uh-oh: mlock is rlimited.
* Uh-oh: mlock isn't inherited by forks.
- It shouldn't reserve actual memory, but it also shouldn't crash when
page faulting in memory if none is available
* Uh-oh: VM_NORESERVE means segfaults.
It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:
1) Due to being wiped during fork(), the vDSO code is already robust to
having the contents of the pages it reads zeroed out midway through
the function's execution.
2) In the absolute worst case of whatever contingency we're coding for,
we have the option to fallback to the getrandom() syscall, and
everything is fine.
3) The buffers the function uses are only ever useful for a maximum of
60 seconds -- a sort of cache, rather than a long term allocation.
These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:
a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
zero when read back again).
c) It is inherited by fork.
d) It doesn't count against the mlock budget, since nothing is locked.
e) If there's not enough memory to service a page fault, it's not fatal,
and no signal is sent.
This way, allocations used by vDSO getrandom() can use:
VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
And there will be no problem with OOMing, crashing on overcommitment,
using memory when not in use, not wiping on fork(), coredumps, or
writing out to swap.
In order to let vDSO getrandom() use this, expose these via mmap(2) as
MAP_DROPPABLE.
Note that this involves removing the MADV_FREE special case from
sort_folio(), which according to Yu Zhao is unnecessary and will simply
result in an extra call to shrink_folio_list() in the worst case. The
chunk removed reenables the swapbacked flag, which we don't want for
VM_DROPPABLE, and we can't conditionalize it here because there isn't a
vma reference available.
Finally, the provided self test ensures that this is working as desired.
Cc: linux-mm@kvack.org
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
New Features:
* Add support for large folios
* Implement rpcrdma generic device removal notification
* Add client support for attribute delegations
* Use a LAYOUTRETURN during reboot recovery to report layoutstats and errors
* Improve throughput for random buffered writes
* Add NVMe support to pnfs/blocklayout
Bugfixes:
* Fix rpcrdma_reqs_reset()
* Avoid soft lockups when using UDP
* Fix an nfs/blocklayout premature PR key unregestration
* Another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server
* Do not extend writes to the entire folio
* Pass explicit offset and count values to tracepoints
* Fix a race to wake up sleeping SUNRPC sync tasks
* Fix gss_status tracepoint output
Cleanups:
* Add missing MODULE_DESCRIPTION() macros
* Add blocklayout / SCSI layout tracepoints
* Remove asm-generic headers from xprtrdma verbs.c
* Remove unused 'struct mnt_fhstatus'
* Other delegation related cleanups
* Other folio related cleanups
* Other pNFS related cleanups
* Other xprtrdma cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmaZgr0ACgkQ18tUv7Cl
QOv8FxAAnUyYG7Kdbv+5Ko/SFv0imxCb5DQh2XC/hSHNrlKBlDnqe2PANXR9XocL
mS0Wry5tZf/T+o+QoKv0HQUdWFlnqKzwclggrekf/lkioU1feWsLe2RzDl1iUh0V
6fwcCyWXW1mYX2CtCaDe+/ZFcoZOMD+bItNHt/RdDScSnS9Jd8GSyocsVKsqaBx6
3wub0FJ4UBgYNoX2T3YyK2JwvO9GLaKIQRJV74rjgPJKjcjhptbcb5MKBmOZrF95
UCcpl4CwvD9RTsSEp0B98UbAFFpk8Nw1tmHF3GmyG/nsrJomDuLKFvbsiq23eHUf
XeULZIbjMEzU56vjoTglZA4s7JYx17D0vzdPGUqU4mLN3LPm5LtGLBg2uQoPw/xW
50euLU+ol36mfnQlBsuM/tAXgtoAcT63aNeNRNp8aOL47xA+PC6kWTBK9OaR5+x6
w+d22Dpy+riMk1TRaAVt0ANcENKELsWRFvxkuWCpQhVoQ1h8LigQJzeggEEK7Sa6
5u9H6wCTee2wz746uwA43koj1utuyrLq/5S+qEtCY1pbP3U0A+Gh0Xh00OXiYuzL
TgRdksmiAL8cA51WjSrq6HhGLOUJAYLfbdKaVhW+fULxUVwzWhFFaFbbdiq/e4OR
0pfqls8UZWICE51GeTfalEidpKZgV/LxU3QOuVoalWBULyj/TeI=
=avTW
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Add support for large folios
- Implement rpcrdma generic device removal notification
- Add client support for attribute delegations
- Use a LAYOUTRETURN during reboot recovery to report layoutstats
and errors
- Improve throughput for random buffered writes
- Add NVMe support to pnfs/blocklayout
Bugfixes:
- Fix rpcrdma_reqs_reset()
- Avoid soft lockups when using UDP
- Fix an nfs/blocklayout premature PR key unregestration
- Another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server
- Do not extend writes to the entire folio
- Pass explicit offset and count values to tracepoints
- Fix a race to wake up sleeping SUNRPC sync tasks
- Fix gss_status tracepoint output
Cleanups:
- Add missing MODULE_DESCRIPTION() macros
- Add blocklayout / SCSI layout tracepoints
- Remove asm-generic headers from xprtrdma verbs.c
- Remove unused 'struct mnt_fhstatus'
- Other delegation related cleanups
- Other folio related cleanups
- Other pNFS related cleanups
- Other xprtrdma cleanups"
* tag 'nfs-for-6.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits)
SUNRPC: Fixup gss_status tracepoint error output
SUNRPC: Fix a race to wake a sync task
nfs: split nfs_read_folio
nfs: pass explicit offset/count to trace events
nfs: do not extend writes to the entire folio
nfs/blocklayout: add support for NVMe
nfs: remove nfs_page_length
nfs: remove the unused max_deviceinfo_size field from struct pnfs_layoutdriver_type
nfs: don't reuse partially completed requests in nfs_lock_and_join_requests
nfs: move nfs_wait_on_request to write.c
nfs: fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requests
nfs: fold nfs_folio_find_and_lock_request into nfs_lock_and_join_requests
nfs: simplify nfs_folio_find_and_lock_request
nfs: remove nfs_folio_private_request
nfs: remove dead code for the old swap over NFS implementation
NFSv4.1 another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server
nfs: Block on write congestion
nfs: Properly initialize server->writeback
nfs: Drop pointless check from nfs_commit_release_pages()
nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg
...
feature. Also some performance improvements; in particular, improving
IOPS and throughput on fast devices running Async Direct I/O by up to
20% by optimizing jbd2_transaction_committed().
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmaYiqsACgkQ8vlZVpUN
gaOWpQf/d6Y9WGyjeC1jOc+vIBxLgL+X0kbzYkkjGTSIZ7mZJS9X4NMMEtqayJ4f
1zGobcGENc05l4LVxf3uMbDj1aGlHeI9X4GLGaP5s5NcaAl4HKjQ3aFs3MuiJHPj
Ol2CebXJx+NKt1lkD8PSPGgaTb5zg+SeZifI+OZ1RpkcKmGnkSNa5NkUNAaBh6dl
5LLXTc2p9NcCwAwDAQSiAJCV35bAZpcp6fwLLaPQ6Eok9HxGcJuYXW2Fict4rbtV
mXeogXVIo2bkMcfh6tDchDBrFvORYIA7uBVmaG1LgAMrtEnYxnxnEntD0h6j/bzF
Fl4jjQfd8o2uYto/4eo+iY6Z0haxyQ==
=rcOo
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Many cleanups and bug fixes in ext4, especially for the fast commit
feature.
Also some performance improvements; in particular, improving IOPS and
throughput on fast devices running Async Direct I/O by up to 20% by
optimizing jbd2_transaction_committed()"
* tag 'ext4_for_linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: make sure the first directory block is not a hole
ext4: check dot and dotdot of dx_root before making dir indexed
ext4: sanity check for NULL pointer after ext4_force_shutdown
jbd2: increase maximum transaction size
jbd2: drop pointless shrinker batch initialization
jbd2: avoid infinite transaction commit loop
jbd2: precompute number of transaction descriptor blocks
jbd2: make jbd2_journal_get_max_txn_bufs() internal
jbd2: avoid mount failed when commit block is partial submitted
ext4: avoid writing unitialized memory to disk in EA inodes
ext4: don't track ranges in fast_commit if inode has inlined data
ext4: fix possible tid_t sequence overflows
ext4: use ext4_update_inode_fsync_trans() helper in inode creation
ext4: add missing MODULE_DESCRIPTION()
jbd2: add missing MODULE_DESCRIPTION()
ext4: use memtostr_pad() for s_volume_name
jbd2: speed up jbd2_transaction_committed()
ext4: make ext4_da_map_blocks() buffer_head unaware
ext4: make ext4_insert_delayed_block() insert multi-blocks
ext4: factor out a helper to check the cluster allocation state
...
- Set rtla/osnoise default threshold to 1us from 5us
The 5us default was missing noise that people cared about.
Changing it to 1us makes it work as expected.
- Restructure how sched_switch prev_comm and next_comm was being saved.
The prev_comm was being saved along with the other next fields, and the
next_comm was being saved along with the other prev fields. This is just
a cosmetic change.
- Have the allocation of pid_list use GFP_NOWAIT instead of GFP_KERNEL.
The allocation can happen in irq_work context, but luckily, the size
was by default so large, it was never triggered. But in case it ever is,
use the NOWAIT allocation in the interrupt context.
- Fix some kernel doc errors.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZpbRchQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qt9qAQCM/FNa1mRRlnCxMD+SrcGJmW4j/VvE
5c+QY5ezCYxGjwEA5tYwXttoZ1WHGRWCvuRSnkWidE1K0HpRyITo3BVzaQU=
=Sw4T
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"Trivial updates for 6.11:
- Set rtla/osnoise default threshold to 1us from 5us
The 5us default was missing noise that people cared about. Changing
it to 1us makes it work as expected.
- Restructure how sched_switch prev_comm and next_comm was being saved
The prev_comm was being saved along with the other next fields, and
the next_comm was being saved along with the other prev fields.
This is just a cosmetic change.
- Have the allocation of pid_list use GFP_NOWAIT instead of GFP_KERNEL
The allocation can happen in irq_work context, but luckily, the
size was by default so large, it was never triggered. But in case
it ever is, use the NOWAIT allocation in the interrupt context.
- Fix some kernel doc errors"
* tag 'trace-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
trace/pid_list: Change gfp flags in pid_list_fill_irq()
tracing/sched: sched_switch: place prev_comm and next_comm in right order
rtla/osnoise: set the default threshold to 1us
tracing: Fix trace_pid_list_free() kernel-doc
The GSS routine errors are values, not flags.
Fixes: 0c77668ddb ("SUNRPC: Introduce trace points in rpc_auth_gss.ko")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaVN3MACgkQxWXV+ddt
WDtpIRAAl+1NjsEj8e5V/UYn8Jr06ujTOnrkR3PCTICxDHbUaMLkQEw21H0K/ogQ
3fOiEVpSlZOfKdYXtXaMQbC0jd/Af2eA10Uht96nAEjAtxu1uJ4cFZGu2meNdXZP
xUioivJ/CElMPH2aluG6FaQvUTqmhrEr8tSoYbxzQmUd434q9kqqyjtw1tfzYDG1
VDn2f7ykhpB/8P0aoqgWSshWTmaCzG0GkuI28o1o0iZUIF/P9TKdzxlLRW6BVHE7
T2oGLEQjN1GQbCH75L4IeNJDkCBVfcDcbZkUDJ/ae4Pt/jJQTFY53YIP9wXFZQnd
mdfHmK7Atpsk75ATftYSq+ENkbQ5fsuut5CD63u54gAqA4M1FncDXTAWS1Y30F76
P8juSCmsSy0o3gTflDIo/IMdntoh/JmncwwStF6oKzmyUZZzzarsqM8mc1P03ZNt
3ttlnbY7lC1TDAlD5J2wXE0INCT2pN+4C9IToWdRypeuLu6qrI7cQ0oylyp9OVQM
t9umTXm0B6s1cyqEDjJf0xJZS/JTHYwu7S4EmAJwicgiLpOjABVTmO8021rVmDJy
TAUu6yEhSsrTT6Dxm7/2Et1EEOKFF5hhsG1SiGD9oUIZK6B5+0waT+rbkEWl7osR
4/TAv2zX6tuCc7HIW0fQloM/6/Gyd5wcDVaQNDUzFA075uKstwY=
=k5d3
-----END PGP SIGNATURE-----
Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The highlights are new logic behind background block group reclaim,
automatic removal of qgroup after removing a subvolume and new
'rescue=' mount options.
The rest is optimizations, cleanups and refactoring.
User visible features:
- dynamic block group reclaim:
- tunable framework to avoid situations where eager data
allocations prevent creating new metadata chunks due to lack of
unallocated space
- reuse sysfs knob bg_reclaim_threshold (otherwise used only in
zoned mode) for a fixed value threshold
- new on/off sysfs knob "dynamic_reclaim" calculating the value
based on heuristics, aiming to keep spare working space for
relocating chunks but not to needlessly relocate partially
utilized block groups or reclaim newly allocated ones
- stats are exported in sysfs per block group type, files
"reclaim_*"
- this may increase IO load at unexpected times but the corner
case of no allocatable block groups is known to be worse
- automatically remove qgroup of deleted subvolumes:
- adjust qgroup removal conditions, make sure all related
subvolume data are already removed, or return EBUSY, also take
into account setting of sysfs drop_subtree_threshold
- also works in squota mode
- mount option updates: new modes of 'rescue=' that allow to mount
images (read-only) that could have been partially converted by user
space tools
- ignoremetacsums - invalid metadata checksums are ignored
- ignoresuperflags - super block flags that track conversion in
progress (like UUID or checksums)
Core:
- size of struct btrfs_inode is now below 1024 (on a release config),
improved memory packing and other secondary effects
- switch tracking of open inodes from rb-tree to xarray, minor
performance improvement
- reduce number of empty transaction commits when there are no dirty
data/metadata
- memory allocation optimizations (reduced numbers, reordering out of
critical sections)
- extent map structure optimizations and refactoring, more sanity
checks
- more subpage in zoned mode preparations or fixes
- general snapshot code cleanups, improvements and documentation
- tree-checker updates: more file extent ram_bytes fixes, continued
- raid-stripe-tree update (not backward compatible):
- remove extent encoding field from the structure, can be inferred
from other information
- requires btrfs-progs 6.9.1 or newer
- cleanups and refactoring
- error message updates
- error handling improvements
- return type and parameter cleanups and improvements"
* tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (152 commits)
btrfs: fix extent map use-after-free when adding pages to compressed bio
btrfs: fix bitmap leak when loading free space cache on duplicate entry
btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
btrfs: move extent_range_clear_dirty_for_io() into inode.c
btrfs: enhance compression error messages
btrfs: fix data race when accessing the last_trans field of a root
btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array()
btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array()
btrfs: introduce new "rescue=ignoresuperflags" mount option
btrfs: introduce new "rescue=ignoremetacsums" mount option
btrfs: output the unrecognized super block flags as hex
btrfs: remove unused Opt enums
btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check
btrfs: fix the ram_bytes assignment for truncated ordered extents
btrfs: make validate_extent_map() catch ram_bytes mismatch
btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes
btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map()
btrfs: fix typo in error message in btrfs_validate_super()
btrfs: move the direct IO code into its own file
btrfs: pass a btrfs_inode to btrfs_set_prop()
...
- More folio conversions for compressed inodes;
- Stream decompressor (LZMA/DEFLATE/ZSTD) cleanups;
- Minor tracepoint cleanup.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmaUtrsRHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qpu6A//V8ZIE+yCuiqLI+VdTfb0nL1qKVrLUIAO
HffARXPiONyJqZtTFBo2FISkOouuSEYnrdApkDQ9V8suL+FlJ10FcRrP+Qnrvn/v
H6bKDQDvAWd27Z/jBpcma2b6tQG7GcmSaWC/eBdWEvQk+bgP1C/GD8bEDZkAAUIp
Ta5i349BtRlGaHoDRw+xsNUmzPxdC2GsTyBJXrYmhgyyD1QrdEMACkTMh/PjMV8h
LPVn5vo3Gf4tLpkOluqhFIjOkeEMw1LyICtvY23tweN2uzYJzEcmpmP8FjRXLmgQ
X9rTFJQ5OSWfNvstgGLncPvp2C3eE2IarYTNG20JWNJDIRmK3ae6RIRkMO4B4hXM
/cAkBF5eSJWRcTCPdY4T4XuWI2WdfpPMKBD4p/QZf7uyZoHnSRjY5D4fnkb/nLTh
sAgItTKmdPZ1GySVJf9EUzm4SWmxbTiCdgqEKCFn/RFbkbiK7xWbUaNcITY1FrOA
fgeeQe95//H90FiKfPvQK70vPcxxPfkLyxQNSQOXDZqiLzPNOvIDzro2pclk0qbk
/mcIG1JosMlVA/WeYoa0lulKxLzBykLBMM0fAFH1MNoTKFJhi9eMfO5jeUUj08wj
g51OmA2YPB4jAnvxW7Z77KRFo9V0bLqetFOJaxQpdFc9YKlfOph/XjVMq8xomTeZ
b2YHe51/BB0=
=LxxN
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"Updates for folio conversions for compressed inodes: While large folio
support for compressed data could work now, it remains disabled since
the stress test could hang due to page migration in a few hours after
enabling it. I need more time to investigate further before enabling
this feature.
Additionally, clean up stream decompressors and tracepoints for
simplicity.
Summary:
- More folio conversions for compressed inodes
- Stream decompressor (LZMA/DEFLATE/ZSTD) cleanups
- Minor tracepoint cleanup"
* tag 'erofs-for-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: silence uninitialized variable warning in z_erofs_scan_folio()
erofs: avoid refcounting short-lived pages
erofs: get rid of z_erofs_map_blocks_iter_* tracepoints
erofs: tidy up stream decompressors
erofs: refine z_erofs_{init,exit}_subsystem()
erofs: move each decompressor to its own source file
erofs: tidy up `struct z_erofs_bvec`
erofs: teach z_erofs_scan_folios() to handle multi-page folios
erofs: convert z_erofs_read_fragment() to folios
erofs: convert z_erofs_pcluster_readmore() to folios
patchsets (devmem among them) did not make it in time.
Core & protocols
----------------
- Use local_lock in addition to local_bh_disable() to protect per-CPU
resources in networking, a step closer for local_bh_disable() not
to act as a big lock on PREEMPT_RT.
- Use flex array for netdevice priv area, ensure its cache alignment.
- Add a sysctl knob to allow user to specify a default rto_min at socket
init time. Bit of a big hammer but multiple companies were
independently carrying such patch downstream so clearly it's useful.
- Support scheduling transmission of packets based on CLOCK_TAI.
- Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned off
using cpusets.
- Support multiple L2TPv3 UDP tunnels using the same 5-tuple address.
- Allow configuration of multipath hash seed, to both allow synchronizing
hashing of two routers, and preventing partial accidental sync.
- Improve TCP compliance with RFC 9293 for simultaneous connect().
- Support sending NAT keepalives in IPsec ESP in UDP states. Userspace
IKE daemon had to do this before, but the kernel can better keep
track of it.
- Support sending supervision HSR frames with MAC addresses stored in
ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled.
- Introduce IPPROTO_SMC for selecting SMC when socket is created.
- Allow UDP GSO transmit from devices with no checksum offload.
- openvswitch: add packet sampling via psample, separating the sampled
traffic from "upcall" packets sent to user space for forwarding.
- nf_tables: shrink memory consumption for transaction objects.
Things we sprinkled into general kernel code
--------------------------------------------
- Power Sequencing subsystem (used by Qualcomm Bluetooth driver
for QCA6390).
- Add IRQ information in sysfs for auxiliary bus.
- Introduce guard definition for local_lock.
- Add aligned flavor of __cacheline_group_{begin, end}() markings for
grouping fields in structures.
BPF
---
- Notify user space (via epoll) when a struct_ops object is getting
detached/unregistered.
- Add new kfuncs for a generic, open-coded bits iterator.
- Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
bpf_list_head.
- Support resilient split BTF which cuts down on duplication and makes
BTF as compact as possible WRT BTF from modules.
- Add support for dumping kfunc prototypes from BTF which enables both
detecting as well as dumping compilable prototypes for kfuncs.
- riscv64 BPF JIT improvements in particular to add 12-argument support
for BPF trampolines and to utilize bpf_prog_pack for the latter.
- Add the capability to offload the netfilter flowtable in XDP layer
through kfuncs.
Driver API
----------
- Allow users to configure IRQ tresholds between which automatic IRQ
moderation can choose.
- Expand Power Sourcing (PoE) status with power, class and failure
reason. Support setting power limits.
- Track additional RSS contexts in the core, make sure configuration
changes don't break them.
- Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated ESP
data paths.
- Support updating firmware on SFP modules.
Tests and tooling
-----------------
- mptcp: use net/lib.sh to manage netns.
- TCP-AO and TCP-MD5: replace debug prints used by tests with
tracepoints.
- openvswitch: make test self-contained (don't depend on OvS CLI tools).
Drivers
-------
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- increase the max total outstanding PTP TX packets to 4
- add timestamping statistics support
- implement netdev_queue_mgmt_ops
- support new RSS context API
- Intel (100G, ice, idpf):
- implement FEC statistics and dumping signal quality indicators
- support E825C products (with 56Gbps PHYs)
- nVidia/Mellanox:
- support HW-GRO
- mlx4/mlx5: support per-queue statistics via netlink
- obey the max number of EQs setting in sub-functions
- AMD/Solarflare:
- support new RSS context API
- AMD/Pensando:
- ionic: rework fix for doorbell miss to lower overhead
and skip it on new HW
- Wangxun:
- txgbe: support Flow Director perfect filters
- Ethernet NICs consumer, embedded and virtual:
- Add driver for Tehuti Networks TN40xx chips
- Add driver for Meta's internal NIC chips
- Add driver for Ethernet MAC on Airoha EN7581 SoCs
- Add driver for Renesas Ethernet-TSN devices
- Google cloud vNIC:
- flow steering support
- Microsoft vNIC:
- support page sizes other than 4KB on ARM64
- vmware vNIC:
- support latency measurement (update to version 9)
- VirtIO net:
- support for Byte Queue Limits
- support configuring thresholds for automatic IRQ moderation
- support for AF_XDP Rx zero-copy
- Synopsys (stmmac):
- support for STM32MP13 SoC
- let platforms select the right PCS implementation
- TI:
- icssg-prueth: add multicast filtering support
- icssg-prueth: enable PTP timestamping and PPS
- Renesas:
- ravb: improve Rx performance 30-400% by using page pool,
theaded NAPI and timer-based IRQ coalescing
- ravb: add MII support for R-Car V4M
- Cadence (macb):
- macb: add ARP support to Wake-On-LAN
- Cortina:
- use phylib for RX and TX pause configuration
- Ethernet switches:
- nVidia/Mellanox:
- support configuration of multipath hash seed
- report more accurate max MTU
- use page_pool to improve Rx performance
- MediaTek:
- mt7530: add support for bridge port isolation
- Qualcomm:
- qca8k: add support for bridge port isolation
- Microchip:
- lan9371/2: add 100BaseTX PHY support
- NXP:
- vsc73xx: implement VLAN operations
- Ethernet PHYs:
- aquantia: enable support for aqr115c
- aquantia: add support for PHY LEDs
- realtek: add support for rtl8224 2.5Gbps PHY
- xpcs: add memory-mapped device support
- add BroadR-Reach link mode and support in Broadcom's PHY driver
- CAN:
- add document for ISO 15765-2 protocol support
- mcp251xfd: workaround for erratum DS80000789E, use timestamps
to catch when device returns incorrect FIFO status
- WiFi:
- mac80211/cfg80211:
- parse Transmit Power Envelope (TPE) data in mac80211 instead of
in drivers
- improvements for 6 GHz regulatory flexibility
- multi-link improvements
- support multiple radios per wiphy
- remove DEAUTH_NEED_MGD_TX_PREP flag
- Intel (iwlwifi):
- bump FW API to 91 for BZ/SC devices
- report 64-bit radiotap timestamp
- enable P2P low latency by default
- handle Transmit Power Envelope (TPE) advertised by AP
- remove support for older FW for new devices
- fast resume (keeping the device configured)
- mvm: re-enable Multi-Link Operation (MLO)
- aggregation (A-MSDU) optimizations
- MediaTek (mt76):
- mt7925 Multi-Link Operation (MLO) support
- Qualcomm (ath10k):
- LED support for various chipsets
- Qualcomm (ath12k):
- remove unsupported Tx monitor handling
- support channel 2 in 6 GHz band
- support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
- supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
Advertisements (EMA)
- support dynamic VLAN
- add panic handler for resetting the firmware state
- DebugFS support for datapath statistics
- WCN7850: support for Wake on WLAN
- Microchip (wilc1000):
- read MAC address during probe to make it visible to user space
- suspend/resume improvements
- TI (wl18xx):
- support newer firmware versions
- RealTek (rtw89):
- preparation for RTL8852BE-VT support
- Wake on WLAN support for WiFi 6 chips
- 36-bit PCI DMA support
- RealTek (rtlwifi):
- RTL8192DU support
- Broadcom (brcmfmac):
- Management Frame Protection support (to enable WPA3)
- Bluetooth:
- qualcomm: use the power sequencer for QCA6390
- btusb: mediatek: add ISO data transmission functions
- hci_bcm4377: add BCM4388 support
- btintel: add support for BlazarU core
- btintel: add support for Whale Peak2
- btnxpuart: add support for AW693 A1 chipset
- btnxpuart: add support for IW615 chipset
- btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmaWjBwACgkQMUZtbf5S
IrvuSRAAkJuEzTRqgURBCe4eNEQde6mJJig7l2CKHwCbFiHZpRkFHf8qKbcGWbL6
uLW33SWnKtJVDhxVKWHLq635XW7BAa80YhqGw21GDi+mIEhWXZglHj3xbXNxsMfE
4eg/kG4BkfYWFmHaXOwVWV/mr7nXf6j7WmXNeXEi32ufE1j0OL+YlQenKnMj8yP2
j9JmYa2Chwppng1SblHmcjmGkdNVwFhStKeCG+2K7v06wdDH/QYBlbgUv9gw/cxp
NlW//wgiaeX40U4O3kDwt9C+LDoh+0VrDDeVdQ+IsScLtY3PhAzEoKolFYTq2HSr
I1JpoaHNnyNsJq3DZrACQ5WlH4yDn6C2EUB6dxNnFaI9F1ZPsi+7MTl6Sei1AklD
TuQTj/lxOACBwW2Q77NU72uoxiIUauesGPHcnrAFuoCIEhZF0mso7k59BvrXhsOP
QwcLbQdc1YHNkqv/Vc7NBY+ruMsYB+5Ubbhhj2p27dp/CWFIwxI29fze4dn2uhO6
ejHN3mbqwPdSzg12YJtM6Iq61Cnwo2eVSvhTxl+ZVSZtI4nu2arzR+y7QTYmNrXP
6tkgVN9UsWeLl2xJ8wyyqL5mcvNHP2rPXWZ2X56iTaa26m+UlleeQ7YRaYtQAAr0
Ec/vlDMX64SwHhd+qwE99DXGQf2g+KklHKSLsnajJUVrWFTlRI0=
=opz8
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Not much excitement - a handful of large patchsets (devmem among them)
did not make it in time.
Core & protocols:
- Use local_lock in addition to local_bh_disable() to protect per-CPU
resources in networking, a step closer for local_bh_disable() not
to act as a big lock on PREEMPT_RT
- Use flex array for netdevice priv area, ensure its cache alignment
- Add a sysctl knob to allow user to specify a default rto_min at
socket init time. Bit of a big hammer but multiple companies were
independently carrying such patch downstream so clearly it's useful
- Support scheduling transmission of packets based on CLOCK_TAI
- Un-pin TCP TIMEWAIT timer to avoid it firing on CPUs later cordoned
off using cpusets
- Support multiple L2TPv3 UDP tunnels using the same 5-tuple address
- Allow configuration of multipath hash seed, to both allow
synchronizing hashing of two routers, and preventing partial
accidental sync
- Improve TCP compliance with RFC 9293 for simultaneous connect()
- Support sending NAT keepalives in IPsec ESP in UDP states.
Userspace IKE daemon had to do this before, but the kernel can
better keep track of it
- Support sending supervision HSR frames with MAC addresses stored in
ProxyNodeTable when RedBox (i.e. HSR-SAN) is enabled
- Introduce IPPROTO_SMC for selecting SMC when socket is created
- Allow UDP GSO transmit from devices with no checksum offload
- openvswitch: add packet sampling via psample, separating the
sampled traffic from "upcall" packets sent to user space for
forwarding
- nf_tables: shrink memory consumption for transaction objects
Things we sprinkled into general kernel code:
- Power Sequencing subsystem (used by Qualcomm Bluetooth driver for
QCA6390) [ Already merged separately - Linus ]
- Add IRQ information in sysfs for auxiliary bus
- Introduce guard definition for local_lock
- Add aligned flavor of __cacheline_group_{begin, end}() markings for
grouping fields in structures
BPF:
- Notify user space (via epoll) when a struct_ops object is getting
detached/unregistered
- Add new kfuncs for a generic, open-coded bits iterator
- Enable BPF programs to declare arrays of kptr, bpf_rb_root, and
bpf_list_head
- Support resilient split BTF which cuts down on duplication and
makes BTF as compact as possible WRT BTF from modules
- Add support for dumping kfunc prototypes from BTF which enables
both detecting as well as dumping compilable prototypes for kfuncs
- riscv64 BPF JIT improvements in particular to add 12-argument
support for BPF trampolines and to utilize bpf_prog_pack for the
latter
- Add the capability to offload the netfilter flowtable in XDP layer
through kfuncs
Driver API:
- Allow users to configure IRQ tresholds between which automatic IRQ
moderation can choose
- Expand Power Sourcing (PoE) status with power, class and failure
reason. Support setting power limits
- Track additional RSS contexts in the core, make sure configuration
changes don't break them
- Support IPsec crypto offload for IPv6 ESP and IPv4 UDP-encapsulated
ESP data paths
- Support updating firmware on SFP modules
Tests and tooling:
- mptcp: use net/lib.sh to manage netns
- TCP-AO and TCP-MD5: replace debug prints used by tests with
tracepoints
- openvswitch: make test self-contained (don't depend on OvS CLI
tools)
Drivers:
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- increase the max total outstanding PTP TX packets to 4
- add timestamping statistics support
- implement netdev_queue_mgmt_ops
- support new RSS context API
- Intel (100G, ice, idpf):
- implement FEC statistics and dumping signal quality indicators
- support E825C products (with 56Gbps PHYs)
- nVidia/Mellanox:
- support HW-GRO
- mlx4/mlx5: support per-queue statistics via netlink
- obey the max number of EQs setting in sub-functions
- AMD/Solarflare:
- support new RSS context API
- AMD/Pensando:
- ionic: rework fix for doorbell miss to lower overhead and
skip it on new HW
- Wangxun:
- txgbe: support Flow Director perfect filters
- Ethernet NICs consumer, embedded and virtual:
- Add driver for Tehuti Networks TN40xx chips
- Add driver for Meta's internal NIC chips
- Add driver for Ethernet MAC on Airoha EN7581 SoCs
- Add driver for Renesas Ethernet-TSN devices
- Google cloud vNIC:
- flow steering support
- Microsoft vNIC:
- support page sizes other than 4KB on ARM64
- vmware vNIC:
- support latency measurement (update to version 9)
- VirtIO net:
- support for Byte Queue Limits
- support configuring thresholds for automatic IRQ moderation
- support for AF_XDP Rx zero-copy
- Synopsys (stmmac):
- support for STM32MP13 SoC
- let platforms select the right PCS implementation
- TI:
- icssg-prueth: add multicast filtering support
- icssg-prueth: enable PTP timestamping and PPS
- Renesas:
- ravb: improve Rx performance 30-400% by using page pool,
theaded NAPI and timer-based IRQ coalescing
- ravb: add MII support for R-Car V4M
- Cadence (macb):
- macb: add ARP support to Wake-On-LAN
- Cortina:
- use phylib for RX and TX pause configuration
- Ethernet switches:
- nVidia/Mellanox:
- support configuration of multipath hash seed
- report more accurate max MTU
- use page_pool to improve Rx performance
- MediaTek:
- mt7530: add support for bridge port isolation
- Qualcomm:
- qca8k: add support for bridge port isolation
- Microchip:
- lan9371/2: add 100BaseTX PHY support
- NXP:
- vsc73xx: implement VLAN operations
- Ethernet PHYs:
- aquantia: enable support for aqr115c
- aquantia: add support for PHY LEDs
- realtek: add support for rtl8224 2.5Gbps PHY
- xpcs: add memory-mapped device support
- add BroadR-Reach link mode and support in Broadcom's PHY driver
- CAN:
- add document for ISO 15765-2 protocol support
- mcp251xfd: workaround for erratum DS80000789E, use timestamps to
catch when device returns incorrect FIFO status
- WiFi:
- mac80211/cfg80211:
- parse Transmit Power Envelope (TPE) data in mac80211 instead
of in drivers
- improvements for 6 GHz regulatory flexibility
- multi-link improvements
- support multiple radios per wiphy
- remove DEAUTH_NEED_MGD_TX_PREP flag
- Intel (iwlwifi):
- bump FW API to 91 for BZ/SC devices
- report 64-bit radiotap timestamp
- enable P2P low latency by default
- handle Transmit Power Envelope (TPE) advertised by AP
- remove support for older FW for new devices
- fast resume (keeping the device configured)
- mvm: re-enable Multi-Link Operation (MLO)
- aggregation (A-MSDU) optimizations
- MediaTek (mt76):
- mt7925 Multi-Link Operation (MLO) support
- Qualcomm (ath10k):
- LED support for various chipsets
- Qualcomm (ath12k):
- remove unsupported Tx monitor handling
- support channel 2 in 6 GHz band
- support Spatial Multiplexing Power Save (SMPS) in 6 GHz band
- supprt multiple BSSID (MBSSID) and Enhanced Multi-BSSID
Advertisements (EMA)
- support dynamic VLAN
- add panic handler for resetting the firmware state
- DebugFS support for datapath statistics
- WCN7850: support for Wake on WLAN
- Microchip (wilc1000):
- read MAC address during probe to make it visible to user space
- suspend/resume improvements
- TI (wl18xx):
- support newer firmware versions
- RealTek (rtw89):
- preparation for RTL8852BE-VT support
- Wake on WLAN support for WiFi 6 chips
- 36-bit PCI DMA support
- RealTek (rtlwifi):
- RTL8192DU support
- Broadcom (brcmfmac):
- Management Frame Protection support (to enable WPA3)
- Bluetooth:
- qualcomm: use the power sequencer for QCA6390
- btusb: mediatek: add ISO data transmission functions
- hci_bcm4377: add BCM4388 support
- btintel: add support for BlazarU core
- btintel: add support for Whale Peak2
- btnxpuart: add support for AW693 A1 chipset
- btnxpuart: add support for IW615 chipset
- btusb: add Realtek RTL8852BE support ID 0x13d3:0x3591"
* tag 'net-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1589 commits)
eth: fbnic: Fix spelling mistake "tiggerring" -> "triggering"
tcp: Replace strncpy() with strscpy()
wifi: ath12k: fix build vs old compiler
tcp: Don't access uninit tcp_rsk(req)->ao_keyid in tcp_create_openreq_child().
eth: fbnic: Write the TCAM tables used for RSS control and Rx to host
eth: fbnic: Add L2 address programming
eth: fbnic: Add basic Rx handling
eth: fbnic: Add basic Tx handling
eth: fbnic: Add link detection
eth: fbnic: Add initial messaging to notify FW of our presence
eth: fbnic: Implement Rx queue alloc/start/stop/free
eth: fbnic: Implement Tx queue alloc/start/stop/free
eth: fbnic: Allocate a netdevice and napi vectors with queues
eth: fbnic: Add FW communication mechanism
eth: fbnic: Add message parsing for FW messages
eth: fbnic: Add register init to set PCIe/Ethernet device config
eth: fbnic: Allocate core device specific structures and devlink interface
eth: fbnic: Add scaffolding for Meta's NIC driver
PCI: Add Meta Platforms vendor ID
net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECK
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s
vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9
1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF
LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU
qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E
rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv
f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8
kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY
vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2
AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr
rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv
5iv+EdRSNA==
=wVi8
-----END PGP SIGNATURE-----
Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Device initialization memory leak fixes (Keith)
- More constants defined (Weiwen)
- Target debugfs support (Hannes)
- PCIe subsystem reset enhancements (Keith)
- Queue-depth multipath policy (Redhat and PureStorage)
- Implement get_unique_id (Christoph)
- Authentication error fixes (Gaosheng)
- MD updates via Song
- sync_action fix and refactoring (Yu Kuai)
- Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)
- Fix loop detach/open race (Gulam)
- Fix lower control limit for blk-throttle (Yu)
- Add module descriptions to various drivers (Jeff)
- Add support for atomic writes for block devices, and statx reporting
for same. Includes SCSI and NVMe (John, Prasad, Alan)
- Add IO priority information to block trace points (Dongliang)
- Various zone improvements and tweaks (Damien)
- mq-deadline tag reservation improvements (Bart)
- Ignore direct reclaim swap writes in writeback throttling (Baokun)
- Block integrity improvements and fixes (Anuj)
- Add basic support for rust based block drivers. Has a dummy null_blk
variant for now (Andreas)
- Series converting driver settings to queue limits, and cleanups and
fixes related to that (Christoph)
- Cleanup for poking too deeply into the bvec internals, in preparation
for DMA mapping API changes (Christoph)
- Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
Ming, Zhu, Damien, Christophe, Chaitanya)
* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
floppy: add missing MODULE_DESCRIPTION() macro
loop: add missing MODULE_DESCRIPTION() macro
ublk_drv: add missing MODULE_DESCRIPTION() macro
xen/blkback: add missing MODULE_DESCRIPTION() macro
block/rnbd: Constify struct kobj_type
block: take offset into account in blk_bvec_map_sg again
block: fix get_max_segment_size() warning
loop: Don't bother validating blocksize
virtio_blk: Don't bother validating blocksize
null_blk: Don't bother validating blocksize
block: Validate logical block size in blk_validate_limits()
virtio_blk: Fix default logical block size fallback
nvmet-auth: fix nvmet_auth hash error handling
nvme: implement ->get_unique_id
block: pass a phys_addr_t to get_max_segment_size
block: add a bvec_phys helper
blk-lib: check for kill signal in ioctl BLKZEROOUT
block: limit the Write Zeroes to manually writing zeroes fallback
block: refacto blkdev_issue_zeroout
block: move read-only and supported checks into (__)blkdev_issue_zeroout
...
Switch the order of prev_comm and next_comm in sched_switch's code to
align with its printing order.
Cc: <mhiramat@kernel.org>
Signed-off-by: Tio Zhang <tiozhang@didiglobal.com>
Link: https://lore.kernel.org/20240703033353.GA2833@didi-ThinkCentre-M930t-N000
Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
f7ce5eb2cb ("bnxt_en: Fix crash in bnxt_get_max_rss_ctx_ring()")
20c8ad72eb ("eth: bnxt: use the RSS context XArray instead of the local list")
Adjacent changes:
net/ethtool/ioctl.c
503757c809 ("net: ethtool: Fix RSS setting")
eac9122f0c ("net: ethtool: record custom RSS contexts in the XArray")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaRcQgACgkQxWXV+ddt
WDvAGxAAknJAiREp/AmzhSwkhr+nSnqex0t+VVgsOaMTu0BEHO0xhoXc3l0QuSwS
u2AIqmOYyzr/UQVXCuatBqAE+5T4njtYAYIWwE825yquAtHNyuok9+Sjhfvxrwgs
HmNAN4Vvl2Fwds7xbWE8ug18QlssuRTIX8hk7ZtS6xo49g0tsbRX9KlzIPpsULD3
BOZa+2NJwC1PGVeNPf3p06rfiUkKfmFYgdDybe2zJ17uwsRz1CFSsaEEB35ys1f0
xYOS4epfcie03EGyZmYctuNxatUkk/J/1lTH4Z9JHwvPBvLK1U97SyJ11Wz2VQC/
8ar8gUDRYtjWdf6vn6AWBM4MseaYm9LDMlPhbSfvpDcWiclGTE64IOP4gKKr3mCh
WzlNSIR9I+tYgrhvcsCEzd7lvrSVHa7clwfooYgkEx0wl5lgbN0llAdtJWG3eeLn
3stxje2FqqXsFNj5N9SrPy7f7t6xF2i8vwk4qh6EpRuT4yuatb+nWzDm9EuTT/Bc
P+zM1KFp7Blk7Zw/Tpw0O9qjt1whStY2xrqcMzg539WVo45MmuFEFzmGBRwZsH55
QPGLIjXPpt728AgMdhBFEG0DtWaiA3AOI/C5nYOtLu92aZVBmbaX7/d/GpJv3Vvd
Ihvr9s1c49YvTZsIS0T0tkq/7LXZi/SToRJDjhP5HCrRGf7A30Y=
=gtsF
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fix a regression in extent map shrinker behaviour.
In the past weeks we got reports from users that there are huge
latency spikes or freezes. This was bisected to newly added shrinker
of extent maps (it was added to fix a build up of the structures in
memory).
I'm assuming that the freezes would happen to many users after release
so I'd like to get it merged now so it's in 6.10. Although the diff
size is not small the changes are relatively straightforward, the
reporters verified the fixes and we did testing on our side.
The fixes:
- adjust behaviour under memory pressure and check lock or scheduling
conditions, bail out if needed
- synchronize tracking of the scanning progress so inode ranges are
not skipped or work duplicated
- do a delayed iput when scanning a root so evicting an inode does
not slow things down in case of lots of dirty data, also fix
lockdep warning, a deadlock could happen when writing the dirty
data would need to start a transaction"
* tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid races when tracking progress for extent map shrinking
btrfs: stop extent map shrinker if reschedule is needed
btrfs: use delayed iput during extent map shrinking
It would be possible to put any statement in TP_fast_assign().
This commit obsoletes the helper function and put its statements to
TP_fast_assign() for the code simplicity.
Link: https://lore.kernel.org/r/20240712003010.87341-1-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
Cross-merge networking fixes after downstream PR.
Conflicts:
net/sched/act_ct.c
26488172b0 ("net/sched: Fix UAF when resolving a clash")
3abbd7ed8b ("act_ct: prepare for stolen verdict coming from conntrack and nat engine")
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We store the progress (root and inode numbers) of the extent map shrinker
in fs_info without any synchronization but we can have multiple tasks
calling into the shrinker during memory allocations when there's enough
memory pressure for example.
This can result in a task A reading fs_info->extent_map_shrinker_last_ino
after another task B updates it, and task A reading
fs_info->extent_map_shrinker_last_root before task B updates it, making
task A see an odd state that isn't necessarily harmful but may make it
skip certain inode ranges or do more work than necessary by going over
the same inodes again. These unprotected accesses would also trigger
warnings from tools like KCSAN.
So add a lock to protect access to these progress fields.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.
And this is already validated by the validate_extent_map(). Now we can
remove the member.
However there is a special case in btrfs_create_dio_extent() where we
for NOCOW/PREALLOC ordered extents cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent_map::block_len is either extent_map::len (non-compressed
extent) or extent_map::disk_num_bytes (compressed extent).
Since we already have sanity checks to do the cross-checks between the
new and old members, we can drop the old extent_map::block_len now.
For most call sites, they can manually select extent_map::len or
extent_map::disk_num_bytes, since most if not all of them have checked
if the extent is compressed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.
And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
CONFIG_MEMCG_KMEM used to be a user-visible option for whether slab
tracking is enabled. It has been default-enabled and equivalent to
CONFIG_MEMCG for almost a decade. We've only grown more kernel memory
accounting sites since, and there is no imaginable cgroup usecase going
forward that wants to track user pages but not the multitude of
user-drivable kernel allocations.
Link: https://lkml.kernel.org/r/20240701153148.452230-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Consolidate them under erofs_map_blocks_* for simplicity since we
have many other ways to know if a given inode is compressed or not.
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240710083459.208362-1-hongzhen@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Wait for all disconnects to complete to ensure the transport has
divested all of its hardware resources before the underlying RDMA
device can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit e87a911fed ("nvme-rdma: use ib_client API to detect device
removal") explains the benefits of handling device removal outside
of the CM event handler.
Sketch in an IB device removal notification mechanism that can be
used by both the client and server side RPC-over-RDMA transport
implementations.
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
libaokun@huaweicloud.com <libaokun@huaweicloud.com> says:
This is the third version of this patch series, in which another patch set
is subsumed into this one to avoid confusing the two patch sets.
(https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)
We've been testing ondemand mode for cachefiles since January, and we're
almost done. We hit a lot of issues during the testing period, and this
patch series fixes some of the issues. The patches have passed internal
testing without regression.
The following is a brief overview of the patches, see the patches for
more details.
Patch 1-2: Add fscache_try_get_volume() helper function to avoid
fscache_volume use-after-free on cache withdrawal.
Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
concurrency causing cachefiles_volume use-after-free.
Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
endless loops.
Patch 5-7: A read request waiting for reopen could be closed maliciously
before the reopen worker is executing or waiting to be scheduled. So
ondemand_object_worker() may be called after the info and object and even
the cache have been freed and trigger use-after-free. So use
cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
reopen worker or wait for it to finish. Since it makes no sense to wait
for the daemon to complete the reopen request, to avoid this pointless
operation blocking cancel_work_sync(), Patch 1 avoids request generation
by the DROPPING state when the request has not been sent, and Patch 2
flushes the requests of the current object before cancel_work_sync().
Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
the daemon to cause hung.
Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing
use-after-free. This issue was triggered frequently in our tests, and we
found that anolis 5.10 had fixed it. So to avoid failing the test, this
patch is pushed upstream as well.
Baokun Li (7):
netfs, fscache: export fscache_put_volume() and add
fscache_try_get_volume()
cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
cachefiles: propagate errors from vfs_getxattr() to avoid infinite
loop
cachefiles: stop sending new request when dropping object
cachefiles: cancel all requests for the object that is being dropped
cachefiles: cyclic allocation of msg_id to avoid reuse
Hou Tao (1):
cachefiles: wait for ondemand_object_worker to finish when dropping
object
Jingbo Xu (1):
cachefiles: add missing lock protection when polling
fs/cachefiles/cache.c | 45 ++++++++++++++++++++++++++++-
fs/cachefiles/daemon.c | 4 +--
fs/cachefiles/internal.h | 3 ++
fs/cachefiles/ondemand.c | 52 ++++++++++++++++++++++++++++++----
fs/cachefiles/volume.c | 1 -
fs/cachefiles/xattr.c | 5 +++-
fs/netfs/fscache_volume.c | 14 +++++++++
fs/netfs/internal.h | 2 --
include/linux/fscache-cache.h | 6 ++++
include/trace/events/fscache.h | 4 +++
10 files changed, 123 insertions(+), 13 deletions(-)
Link: https://lore.kernel.org/r/20240628062930.2467993-1-libaokun@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
The current patch series introduces DAMON based migration across NUMA
nodes so it'd be better to have a new migrate_reason in trace events.
Link: https://lkml.kernel.org/r/20240614030010.751-5-honggyu.kim@sk.com
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In 1394 OHCI, the SelfIDComplete event occurs when the hardware has
finished transmitting all of the self ID packets received during the bus
initialization process to the host memory by DMA.
This commit adds a tracepoints event for this event to trace the timing
and packet data of Self-ID DMA. It is the part of following tracepoints
events helpful to debug some events at bus reset; e.g. the issue addressed
at a commit d0b06dc48f ("firewire: core: use long bus reset on gap count
error")[1]:
* firewire_ohci:irqs
* firewire_ohci:self_id_complete
* firewire:bus_reset_handle
* firewire:self_id_sequence
They would be also helpful in the problem about invocation timing of
hardIRQ and process (workqueue) contexts. We can often see this kind of
problem with -rt kernel[2].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0b06dc48fb1
[2] https://lore.kernel.org/linux-rt-users/YAwPoaUZ1gTD5y+k@hmbx/
Link: https://lore.kernel.org/r/20240702222034.1378764-6-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
We got the following issue in our fault injection stress test:
==================================================================
BUG: KASAN: slab-use-after-free in fscache_withdraw_volume+0x2e1/0x370
Read of size 4 at addr ffff88810680be08 by task ondemand-04-dae/5798
CPU: 0 PID: 5798 Comm: ondemand-04-dae Not tainted 6.8.0-dirty #565
Call Trace:
kasan_check_range+0xf6/0x1b0
fscache_withdraw_volume+0x2e1/0x370
cachefiles_withdraw_volume+0x31/0x50
cachefiles_withdraw_cache+0x3ad/0x900
cachefiles_put_unbind_pincount+0x1f6/0x250
cachefiles_daemon_release+0x13b/0x290
__fput+0x204/0xa00
task_work_run+0x139/0x230
Allocated by task 5820:
__kmalloc+0x1df/0x4b0
fscache_alloc_volume+0x70/0x600
__fscache_acquire_volume+0x1c/0x610
erofs_fscache_register_volume+0x96/0x1a0
erofs_fscache_register_fs+0x49a/0x690
erofs_fc_fill_super+0x6c0/0xcc0
vfs_get_super+0xa9/0x140
vfs_get_tree+0x8e/0x300
do_new_mount+0x28c/0x580
[...]
Freed by task 5820:
kfree+0xf1/0x2c0
fscache_put_volume.part.0+0x5cb/0x9e0
erofs_fscache_unregister_fs+0x157/0x1b0
erofs_kill_sb+0xd9/0x1c0
deactivate_locked_super+0xa3/0x100
vfs_get_super+0x105/0x140
vfs_get_tree+0x8e/0x300
do_new_mount+0x28c/0x580
[...]
==================================================================
Following is the process that triggers the issue:
mount failed | daemon exit
------------------------------------------------------------
deactivate_locked_super cachefiles_daemon_release
erofs_kill_sb
erofs_fscache_unregister_fs
fscache_relinquish_volume
__fscache_relinquish_volume
fscache_put_volume(fscache_volume, fscache_volume_put_relinquish)
zero = __refcount_dec_and_test(&fscache_volume->ref, &ref);
cachefiles_put_unbind_pincount
cachefiles_daemon_unbind
cachefiles_withdraw_cache
cachefiles_withdraw_volumes
list_del_init(&volume->cache_link)
fscache_free_volume(fscache_volume)
cache->ops->free_volume
cachefiles_free_volume
list_del_init(&cachefiles_volume->cache_link);
kfree(fscache_volume)
cachefiles_withdraw_volume
fscache_withdraw_volume
fscache_volume->n_accesses
// fscache_volume UAF !!!
The fscache_volume in cache->volumes must not have been freed yet, but its
reference count may be 0. So use the new fscache_try_get_volume() helper
function try to get its reference count.
If the reference count of fscache_volume is 0, fscache_put_volume() is
freeing it, so wait for it to be removed from cache->volumes.
If its reference count is not 0, call cachefiles_withdraw_volume() with
reference count protection to avoid the above issue.
Fixes: fe2140e2f5 ("cachefiles: Implement volume support")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240628062930.2467993-3-libaokun@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Abstract the memory type from the page_pool so we can later add support
for new memory types. Convert the page_pool to use the new netmem type
abstraction, rather than use struct page directly.
As of this patch the netmem type is a no-op abstraction: it's always a
struct page underneath. All the page pool internals are converted to
use struct netmem instead of struct page, and the page pool now exports
2 APIs:
1. The existing struct page API.
2. The new struct netmem API.
Keeping the existing API is transitional; we do not want to refactor all
the current drivers using the page pool at once.
The netmem abstraction is currently a no-op. The page_pool uses
page_to_netmem() to convert allocated pages to netmem, and uses
netmem_to_page() to convert the netmem back to pages to pass to mm APIs,
Follow up patches to this series add non-paged netmem support to the
page_pool. This change is factored out on its own to limit the code
churn to this 1 patch, for ease of code review.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20240628003253.1694510-6-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sometimes we need to track the processing order of requests with
ioprio set. So the ioprio of request can be useful information.
Example:
block_rq_insert: 8,0 RA 16384 () 6500840 + 32 be,0,6 [binder:815_3]
block_rq_issue: 8,0 RA 16384 () 6500840 + 32 be,0,6 [binder:815_3]
block_rq_complete: 8,0 RA () 6500840 + 32 be,0,6 [0]
Signed-off-by: Dongliang Cui <dongliang.cui@unisoc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240614074936.113659-1-dongliang.cui@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add 'nr_resv' parameter to ext4_da_reserve_space(), which indicates the
number of clusters wants to reserve, make it reserve multiple clusters
at a time.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240517124005.347221-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Rename ext4_es_insert_delayed_block() to ext4_es_insert_delayed_extent()
and pass length parameter to make it insert multiple delalloc blocks at
a time. For the case of bigalloc, split the allocated parameter to
lclu_allocated and end_allocated. lclu_allocated indicates the
allocation state of the cluster which is containing the lblk,
end_allocated indicates the allocation state of the extent end, clusters
in the middle of delay allocated extent must be unallocated.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240517124005.347221-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
e3f02f32a0 ("ionic: fix kernel panic due to multi-buffer handling")
d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1394 OHCI hardware triggers PCI interrupts to notify any events to
software. Current driver for the hardware is programmed by the typical
way to utilize top- and bottom- halves, thus it has a timing gap to handle
the notification in softIRQ (tasklet).
This commit adds a tracepoint event for the hardIRQ event. The comparison
of the tracepoint event to tracepoints events in firewire subsystem is
helpful to diagnose the timing gap.
Link: https://lore.kernel.org/r/20240625031806.956650-3-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
The Linux Kernel Tracepoints framework is enough useful to trace the
interaction between 1394 OHCI hardware and its driver.
This commit adds firewire_ohci subsystem to use the framework. It is
defined as the different subsystem from the existing firewire subsystem.
The definition file for the existing subsystem is slightly changed so that
both subsystems are available in 1394 OHCI driver.
Link: https://lore.kernel.org/r/20240625031806.956650-2-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
It is helpful to trace completion of packets in isochronous context when
the core function is requested them by both in-kernel units driver and
userspace applications.
This commit adds some tracepoints events for the aim.
Link: https://lore.kernel.org/r/20240623220859.851685-8-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
It is helpful to trace the queueing packets of isochronous context when
the core function is requested them by both in-kernel unit drivers and
userspace applications.
This commit adds some tracepoints events for the aim.
Link: https://lore.kernel.org/r/20240623220859.851685-7-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>