A uring_cmd always has async data allocated now, so there's no reason to
check and clear a cached copy of the SQE.
Fixes: d10f19dff5 ("io_uring/uring_cmd: switch to always allocating async data")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5eff57fa9f ("io_uring/uring_cmd: defer SQE copying until it's needed")
moved the unconditional memcpy() of the uring_cmd SQE to async_data
to 2 cases when the request goes async:
- If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
- If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
Unlike the REQ_F_FORCE_ASYNC case, in the EAGAIN case, io_uring_cmd()
copies the SQE to async_data but neglects to update the io_uring_cmd's
sqe field to point to async_data. As a result, sqe still points to the
slot in the userspace-mapped SQ. At the end of io_submit_sqes(), the
kernel advances the SQ head index, allowing userspace to reuse the slot
for a new SQE. If userspace reuses the slot before the io_uring worker
reissues the original SQE, the io_uring_cmd's SQE will be corrupted.
Introduce a helper io_uring_cmd_cache_sqes() to copy the original SQE to
the io_uring_cmd's async_data and point sqe there. Use it for both the
REQ_F_FORCE_ASYNC and EAGAIN cases. This ensures the uring_cmd doesn't
read from the SQ slot after it has been returned to userspace.
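
The copy-and-repoint idea described above can be sketched in plain user-space C. This is only an illustration under stated assumptions: struct toy_sqe, struct toy_cmd and toy_cache_sqe() are hypothetical stand-ins, not the kernel's structures or the actual io_uring_cmd_cache_sqes() helper, and the 64-byte size is arbitrary.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for an SQE; the size is illustrative only. */
struct toy_sqe {
    unsigned char bytes[64];
};

/* Hypothetical stand-in for a command plus its private async storage. */
struct toy_cmd {
    const struct toy_sqe *sqe;  /* where the command reads its SQE from */
    struct toy_sqe cached;      /* private copy, playing the role of async_data */
};

/*
 * Copy the SQE out of the shared slot and point the command at the copy,
 * so later reads no longer depend on the slot's contents.
 */
static void toy_cache_sqe(struct toy_cmd *cmd)
{
    if (cmd->sqe == &cmd->cached)
        return;                 /* already cached; avoid a self-copy */
    memcpy(&cmd->cached, cmd->sqe, sizeof(cmd->cached));
    cmd->sqe = &cmd->cached;
}
```

Once toy_cache_sqe() has run, overwriting the original slot no longer affects what the command sees through cmd->sqe, which is the property the fix is after.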
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 5eff57fa9f ("io_uring/uring_cmd: defer SQE copying until it's needed")
Link: https://lore.kernel.org/r/20250212204546.3751645-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
eaf72f7b41 ("io_uring/uring_cmd: cleanup struct io_uring_cmd_data
layout") removed most of the places assuming struct io_uring_cmd_data
has sqes as its first field. However, the EAGAIN case in io_uring_cmd()
still compares ioucmd->sqe to the struct io_uring_cmd_data pointer using
a void * cast. Since fa3595523d ("io_uring: get rid of alloc cache
init_once handling"), sqes is no longer io_uring_cmd_data's first field.
As a result, the pointers will always compare unequal and memcpy() may
be called with the same source and destination.
Replace the incorrect void * cast with the address of the sqes field.
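
A small user-space sketch of why the cast-based comparison stops working once the array is no longer the first member. The layout below (struct toy_cmd_data with an op_data pointer ahead of sqes) is a hypothetical illustration, not the real struct io_uring_cmd_data.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sqe {
    unsigned char bytes[64];
};

/* Hypothetical layout: sqes is deliberately NOT the first member. */
struct toy_cmd_data {
    void *op_data;
    struct toy_sqe sqes[2];
};

/* Buggy check: compares against the start of the containing struct. */
static int points_at_cache_buggy(const struct toy_sqe *sqe,
                                 const struct toy_cmd_data *d)
{
    return (const void *)sqe == (const void *)d;
}

/* Fixed check: compares against the address of the sqes member itself. */
static int points_at_cache_fixed(const struct toy_sqe *sqe,
                                 const struct toy_cmd_data *d)
{
    return sqe == d->sqes;
}
```

With a nonzero offsetof(struct toy_cmd_data, sqes), the buggy check reports "not cached" even when sqe already points into the struct, which is how a copy with identical source and destination can arise.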
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: eaf72f7b41 ("io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout")
Link: https://lore.kernel.org/r/20250212204546.3751645-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_REGISTER_PBUF_RING can reuse an old struct io_buffer_list if it
was created for legacy selected buffers and has been emptied. This
violates the requirement that most of the fields should stay stable
after publish. Always reallocate it instead.
Cc: stable@vger.kernel.org
Reported-by: Pumpkin Chang <pumpkin@devco.re>
Fixes: 2fcabce2d7 ("io_uring: disallow mixed provided buffer group registrations")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct io_tw_state is managed by core io_uring, and opcode handling code
must never try to cheat and create its own instances; that's plain
incorrect.
io_waitid_complete() attempts exactly that outside of the task work
context, and even though the ring is locked, there would be no one to
reap the requests from the defer completion list. It only works now
because luckily it's called before io_uring_try_cancel_uring_cmd(),
which flushes completions.
Fixes: f31ecf671d ("io_uring: add IORING_OP_WAITID support")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a socket is shut down before the connection completes, POLLERR is set
in the poll mask. However, connect ignores this as it doesn't know, and
attempts the connection again. This may lead to a bogus -ETIMEDOUT
result, where it should have noticed the POLLERR and just returned
-ECONNRESET instead.
Have the poll logic check for whether or not POLLERR is set in the mask,
and if so, mark the request as failed. Then connect can appropriately
fail the request rather than retry it.
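
A minimal sketch of the control flow described above, assuming a hypothetical request with a single failed flag. toy_poll_wake() and toy_connect_retry() are illustrative names only, and the real connect path has more to it than this.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/* Hypothetical request state: only the "failed" marker matters here. */
struct toy_req {
    int failed;
};

/* Poll side: if the wakeup mask carries POLLERR, mark the request failed. */
static void toy_poll_wake(struct toy_req *req, short mask)
{
    if (mask & POLLERR)
        req->failed = 1;
}

/*
 * Connect side: return 0 to retry the connect, or a negative errno to
 * fail the request instead of retrying.
 */
static int toy_connect_retry(const struct toy_req *req)
{
    return req->failed ? -ECONNRESET : 0;
}
```

A wakeup with POLLOUT | POLLERR makes toy_connect_retry() return -ECONNRESET, while a plain POLLOUT wakeup still allows the retry.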
Reported-by: Sergey Galas <ssgalas@cloud.ru>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/discussions/1335
Fixes: 3fb1bd6881 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of freeing iovecs in case of IO_URING_F_UNLOCKED in
io_rw_recycle(), leave it be and rely on the core io_uring code to
call io_readv_writev_cleanup() later. This way the iovec will get
recycled and we can clean up io_rw_recycle() and kill
io_rw_iovec_free().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/14f83b112eb40078bea18e15d77a4f99fc981a44.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Test setups that enable KASAN skip the !KASAN sections, so they don't
exercise paths that would otherwise run. That's bad: to be sure your
code works, you now have to test both KASAN and !KASAN configs
specifically.
Remove !CONFIG_KASAN guards from io_netmsg_cache_free() and
io_rw_cache_free(). The free functions should always be getting valid
entries, and even though for KASAN iovecs should already be cleared,
that's better than skipping the chunks completely.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/d6078a51c7137a243f9d00849bc3daa660873209.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
alloc_cache.h uses types it doesn't declare and thus depends on the
order in which it's included. Make it self contained and pull all needed
definitions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/39569f3d5b250b4fe78bb609d57f67d3736ebcc4.1738087204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We do io_kbuf_recycle() when arming a poll but every iteration of a
multishot can grab more buffers, which is why we need to flush the kbuf
ring state before continuing with waiting.
Cc: stable@vger.kernel.org
Fixes: b3fdea6ecb ("io_uring: multishot recv")
Reported-by: Muhammad Ramdhan <ramdhan@starlabs.sg>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Reported-by: Jacob Soo <jacob.soo@starlabs.sg>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1bfc9990fe435f1fc6152ca9efeba5eb3e68339c.1738025570.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit changed all of the migration from the old to the new
ring for resizing to use READ/WRITE_ONCE. However, ->sq_flags is an
atomic_t, and while most archs won't complain about this, some will
indeed flag it:
io_uring/register.c:554:9: sparse: sparse: cast to non-scalar
io_uring/register.c:554:9: sparse: sparse: cast from non-scalar
Just use atomic_set/atomic_read for handling this case.
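
For illustration, the same idea in user-space C11, where atomic_load()/atomic_store() stand in for the kernel's atomic_read()/atomic_set(). struct toy_rings and toy_migrate() are hypothetical; only the split between a plain field and an atomic field is the point.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ring header with one plain field and one atomic field. */
struct toy_rings {
    atomic_int sq_flags;
    unsigned int sq_entries;
};

/* Migrate state from the old ring to the new one. */
static void toy_migrate(struct toy_rings *dst, struct toy_rings *src)
{
    /* Plain scalar: an ordinary (or READ_ONCE-style) copy is fine. */
    dst->sq_entries = src->sq_entries;

    /* Atomic type: go through the atomic accessors, not a raw cast. */
    atomic_store(&dst->sq_flags, atomic_load(&src->sq_flags));
}
```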
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501242000.A2sKqaCL-lkp@intel.com/
Fixes: 2c5aae129f ("io_uring/register: document io_register_resize_rings() shared mem usage")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
init_once is called when an object doesn't come from the cache, and
hence needs initial clearing of certain members. While the whole
struct could get cleared by memset() in that case, a few of the cache
members are large enough that this may cause unnecessary overhead if
the caches used aren't large enough to satisfy the workload. For those
cases, some churn of kmalloc+kfree is to be expected.
Ensure that the 3 users that need clearing put the members they need
cleared at the start of the struct, and wrap the rest of the struct in
a struct group so the offset is known.
While at it, improve the interaction with KASAN such that when/if
KASAN writes to members inside the struct that should be retained over
caching, it won't trip over itself. For rw and net, the retaining of
the iovec over caching is disabled if KASAN is enabled. A helper will
free and clear those members in that case.
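
A rough user-space sketch of the layout trick, assuming a hypothetical struct toy_async. The kernel change uses a struct group to mark the boundary; offsetof() on the first retained member is used here as a simpler stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cached object: members needing clearing come first. */
struct toy_async {
    /* Must start out zeroed when the object is freshly allocated. */
    size_t bytes_done;
    void *free_iovec;

    /* Large state that is fully set up later; clearing it is wasted work. */
    unsigned char big_state[4096];
};

/* Clear only the leading members, up to the start of big_state. */
static void toy_init_once(struct toy_async *obj)
{
    memset(obj, 0, offsetof(struct toy_async, big_state));
}
```

Only the first few bytes are written, so a fresh allocation avoids a memset() over the whole 4 KiB tail.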
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A few spots in uring_cmd assume that the SQEs copied are always at the
start of the structure, and hence mix req->async_data and the struct
itself.
Clean that up and use the proper indices.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cmd_sock() does a normal read of cmd->sqe->cmd_op, where it
really should be using a READ_ONCE() as ->sqe may still be pointing to
the original SQE. Since the prep side already does this READ_ONCE() and
stores it locally, use that value rather than re-read it.
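
The read-once-then-reuse pattern can be sketched as follows. toy_read_once_u32() is a rough user-space stand-in for READ_ONCE() (a single volatile load), and struct toy_cmd with toy_prep()/toy_issue() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical SQE living in memory that another party may rewrite. */
struct toy_sqe {
    uint32_t cmd_op;
};

struct toy_cmd {
    const struct toy_sqe *sqe;
    uint32_t cmd_op;            /* value captured at prep time */
};

/* Stand-in for READ_ONCE(): force exactly one load of the field. */
static uint32_t toy_read_once_u32(const uint32_t *p)
{
    return *(const volatile uint32_t *)p;
}

/* Prep: read the shared field once and keep the result locally. */
static void toy_prep(struct toy_cmd *cmd)
{
    cmd->cmd_op = toy_read_once_u32(&cmd->sqe->cmd_op);
}

/* Issue: use the stored value instead of re-reading shared memory. */
static uint32_t toy_issue(const struct toy_cmd *cmd)
{
    return cmd->cmd_op;
}
```

If the slot changes between toy_prep() and toy_issue(), the issue side still acts on the value that prep validated.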
Fixes: 8e9fad0e70 ("io_uring: Add io_uring command support for sockets")
Link: https://lore.kernel.org/r/20250121-uring-sockcmd-fix-v1-1-add742802a29@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For remote posting of messages, req->tctx is assigned even though it
is never used. Rather than leave a dangling pointer, just clear it to
NULL and use the previous check for a valid submitter_task to gate on
whether or not the request should be terminated.
Reported-by: Jann Horn <jannh@google.com>
Fixes: b6f58a3f4a ("io_uring: move struct io_kiocb from task_struct to io_uring_task")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Checking for lockdep_assert_held(&ctx->uring_lock) in io_free_rsrc_node()
means that the assertion is only checked when the resource drops to zero
references.
Move the lockdep assertion up into the caller io_put_rsrc_node() so that it
instead happens on every reference count decrement.
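
A toy model of the change, with a counter making the difference visible. All names here (toy_ctx, toy_node, toy_assert_held(), toy_put_node()) are hypothetical, and assert() stands in for lockdep_assert_held().

```c
#include <assert.h>

/* Hypothetical context: tracks whether the lock is held and how often
 * the assertion ran. */
struct toy_ctx {
    int lock_held;
    int assert_checks;
};

struct toy_node {
    int refs;
    int freed;
};

/* Stand-in for lockdep_assert_held(). */
static void toy_assert_held(struct toy_ctx *ctx)
{
    ctx->assert_checks++;
    assert(ctx->lock_held);
}

static void toy_free_node(struct toy_node *node)
{
    node->freed = 1;
}

/* The assertion sits in the put path, so it runs on every decrement,
 * not only on the final one that frees the node. */
static void toy_put_node(struct toy_ctx *ctx, struct toy_node *node)
{
    toy_assert_held(ctx);
    if (--node->refs == 0)
        toy_free_node(node);
}
```

With three references, the check runs three times; had it lived in toy_free_node(), it would have run once.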
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250120-uring-lockdep-assert-earlier-v1-1-68d8e071a4bb@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The io_uring_ctx parameter of io_rsrc_node_alloc() is currently unused.
Remove the parameter and fix up the callers accordingly.
Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
Link: https://lore.kernel.org/r/20250115142033.658599-1-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The locking in the buffer cloning code is somewhat complex because it goes
back and forth between locking the source ring and the destination ring.
Make it easier to reason about by locking both rings at the same time.
To avoid ABBA deadlocks, lock the rings in ascending kernel address order,
just like in lock_two_nondirectories().
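
The address-ordering rule can be sketched without real locks by logging the acquisition order. struct toy_ring, toy_lock() and toy_lock_two() are hypothetical, and handling the same-ring case by taking the lock once is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ring with a toy "lock" flag instead of a real mutex. */
struct toy_ring {
    int locked;
};

/* Log of the first two acquisitions, so the order can be inspected. */
static const struct toy_ring *toy_lock_log[2];
static int toy_lock_count;

static void toy_lock(struct toy_ring *r)
{
    r->locked = 1;
    if (toy_lock_count < 2)
        toy_lock_log[toy_lock_count] = r;
    toy_lock_count++;
}

/*
 * Lock both rings in ascending address order. Two callers passing the
 * same pair in opposite argument order then agree on which lock comes
 * first, which rules out the ABBA pattern.
 */
static void toy_lock_two(struct toy_ring *a, struct toy_ring *b)
{
    if (a == b) {               /* assumption: same ring, lock once */
        toy_lock(a);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        struct toy_ring *tmp = a;
        a = b;
        b = tmp;
    }
    toy_lock(a);
    toy_lock(b);
}
```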
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250115-uring-clone-refactor-v2-1-7289ba50776d@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'powerpc-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Madhavan Srinivasan:
- Add preempt lazy support
- Deprecate cxl and cxl flash driver
- Fix a possible IOMMU related OOPS at boot on pSeries
- Optimize sched_clock() in ppc32 by replacing mulhdu() by
mul_u64_u64_shr()
Thanks to Andrew Donnellan, Andy Shevchenko, Ankur Arora, Christophe
Leroy, Frederic Barrat, Gaurav Batra, Luis Felipe Hernandez, Michael
Ellerman, Nilay Shroff, Ricardo B. Marliere, Ritesh Harjani (IBM),
Sebastian Andrzej Siewior, Shrikanth Hegde, Sourabh Jain, Thorsten Blum,
and Zhu Jun.
* tag 'powerpc-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Fix argument order to timer_sub()
powerpc/prom_init: Use IS_ENABLED()
powerpc/pseries/iommu: IOMMU incorrectly marks MMIO range in DDW
powerpc: Use str_on_off() helper in check_cache_coherency()
powerpc: Large user copy aware of full:rt:lazy preemption
powerpc: Add preempt lazy support
powerpc/book3s64/hugetlb: Fix disabling hugetlb when fadump is active
powerpc/vdso: Mark the vDSO code read-only after init
powerpc/64: Use get_user() in start_thread()
macintosh: declare ctl_table as const
selftest/powerpc/ptrace: Cleanup duplicate macro definitions
selftest/powerpc/ptrace/ptrace-pkey: Remove duplicate macros
selftest/powerpc/ptrace/core-pkey: Remove duplicate macros
powerpc/8xx: Drop legacy-of-mm-gpiochip.h header
scsi/cxlflash: Deprecate driver
cxl: Deprecate driver
selftests/powerpc: Fix typo in test-vphn.c
powerpc/xmon: Use str_yes_no() helper in dump_one_paca()
powerpc/32: Replace mulhdu() by mul_u64_u64_shr()
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"We've got a little less than normal thanks to the holidays in
December, but there's the usual summary below. The highlight is
probably the 52-bit physical addressing (LPA2) clean-up from Ard.
Confidential Computing:
- Register a platform device when running in CCA realm mode to enable
automatic loading of dependent modules
CPU Features:
- Update a bunch of system register definitions to pick up new field
encodings from the architectural documentation
- Add hwcaps and selftests for the new (2024) dpISA extensions
Documentation:
- Update EL3 (firmware) requirements for booting Linux on modern
arm64 designs
- Remove stale information about the kernel virtual memory map
Miscellaneous:
- Minor cleanups and typo fixes
Memory management:
- Fix vmemmap_check_pmd() to look at the PMD type bits
- LPA2 (52-bit physical addressing) cleanups and minor fixes
- Adjust physical address space depending upon whether or not LPA2 is
enabled
Perf and PMUs:
- Add port filtering support for NVIDIA's NVLINK-C2C Coresight PMU
- Extend AXI filtering support for the DDR PMU on NXP IMX SoCs
- Fix Designware PCIe PMU event numbering
- Add generic branch events for the Apple M1 CPU PMU
- Add support for Marvell Odyssey DDR and LLC-TAD PMUs
- Cleanups to the Hisilicon DDRC and Uncore PMU code
- Advertise discard mode for the SPE PMU
- Add the perf users mailing list to our MAINTAINERS entry"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits)
Documentation: arm64: Remove stale and redundant virtual memory diagrams
perf docs: arm_spe: Document new discard mode
perf: arm_spe: Add format option for discard mode
MAINTAINERS: Add perf list for drivers/perf/
arm64: Remove duplicate included header
drivers/perf: apple_m1: Map generic branch events
arm64: rsi: Add automatic arm-cca-guest module loading
kselftest/arm64: Add 2024 dpISA extensions to hwcap test
KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
arm64/hwcap: Describe 2024 dpISA extensions to userspace
arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-12
arm64: Filter out SVE hwcaps when FEAT_SVE isn't implemented
drivers/perf: hisi: Set correct IRQ affinity for PMUs with no association
arm64/sme: Move storage of reg_smidr to __cpuinfo_store_cpu()
arm64: mm: Test for pmd_sect() in vmemmap_check_pmd()
arm64/mm: Replace open encodings with PXD_TABLE_BIT
arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
...
Merge tag 'm68k-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
- Use the generic muldi3 libgcc function
- Miscellaneous fixes and improvements
* tag 'm68k-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: libgcc: Fix lvalue abuse in umul_ppmm()
m68k: vga: Fix I/O defines
zorro: Constify 'struct bin_attribute'
m68k: atari: Use str_on_off() helper in atari_nvram_proc_read()
m68k: Use kernel's generic muldi3 libgcc function
Merge tag 's390-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Select config option KASAN_VMALLOC if KASAN is enabled
- Select config option VMAP_STACK unconditionally
- Implement arch_atomic_inc() / arch_atomic_dec() functions which
result in a single instruction if compiled for z196 or newer
architectures
- Make layering between atomic.h and atomic_ops.h consistent
- Comment s390 preempt_count implementation
- Remove pre MARCH_HAS_Z196_FEATURES preempt count implementation
- GCC uses the number of lines of an inline assembly to calculate
number of instructions and decide on inlining. Therefore remove
superfluous new lines from a couple of inline assemblies.
- Provide arch_atomic_*_and_test() implementations that allow the
compiler to generate slightly better code.
- Optimize __preempt_count_dec_and_test()
- Remove __bootdata annotations from declarations in header files
- Add missing include of <linux/smp.h> in abs_lowcore.h to provide
declarations for get_cpu() and put_cpu() used in the code
- Fix suboptimal kernel image base when running make kasan.config
- Remove huge_pte_none() and huge_pte_none_mostly() as they are
identical to the generic variants
- Remove unused PAGE_KERNEL_EXEC, SEGMENT_KERNEL_EXEC, and
REGION3_KERNEL_EXEC defines
- Simplify noexec page protection handling and change the page, segment
and region3 protection definitions automatically if the instruction
execution-protection facility is not available
- Save one instruction and prefer EXRL instruction over EX in string,
xor_*(), amode31 and other functions
- Create /dev/diag misc device to fetch diagnose specific information
from the kernel and provide it to userspace
- Retrieve electrical power readings using DIAGNOSE 0x324 ioctl
- Make ccw_device_get_ciw() consistent and use array indices instead of
pointer arithmetic
- s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into
one place
- The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that in s390 code
- Add missing TLB range adjustment in pud_free_tlb()
- Improve topology setup by adding early polarization detection
- Fix length checks in codepage_convert() function
- The generic bitops implementation is nearly identical to the s390
one. Switch to the generic variant and slightly decrease the kernel
image size
- Provide an optimized arch_test_bit() implementation which makes use
of flag output constraint. This generates slightly better code
- Provide memory topology information obtained with DIAGNOSE 0x310
using ioctl.
- Various other small improvements, fixes, and cleanups
Also, some changes came in through a merge of 'pci-device-recovery'
branch:
- Add PCI error recovery status mechanism
- Simplify and document debug_next_entry() logic
- Split private data allocation and freeing out of debug file open()
and close() operations
- Add debug_dump() function that gets a textual representation of a
debug info (e.g. PCI recovery hardware error logs)
- Add formatted content of pci_debug_msg_id to the PCI report
* tag 's390-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits)
s390/futex: Fix FUTEX_OP_ANDN implementation
s390/diag: Add memory topology information via diag310
s390/bitops: Provide optimized arch_test_bit()
s390/bitops: Switch to generic bitops
s390/ebcdic: Fix length decrement in codepage_convert()
s390/ebcdic: Fix length check in codepage_convert()
s390/ebcdic: Use exrl instead of ex
s390/amode31: Use exrl instead of ex
s390/stackleak: Use exrl instead of ex in __stackleak_poison()
s390/lib: Use exrl instead of ex in xor functions
s390/topology: Improve topology detection
s390/tlb: Add missing TLB range adjustment
s390/pkey: Constify 'struct bin_attribute'
s390/sclp: Constify 'struct bin_attribute'
s390/pci: Constify 'struct bin_attribute'
s390/ipl: Constify 'struct bin_attribute'
s390/crypto/cpacf: Constify 'struct bin_attribute'
s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into one place
s390/cio: Use array indices instead of pointer arithmetic
s390/qdio: Rename feature flag aif_osa to aif_qdio
...
Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Not a lot in terms of features this time around, mostly just cleanups
and code consolidation:
- Support for PI meta data read/write via io_uring, with NVMe and
SCSI covered
- Cleanup the per-op structure caching, making it consistent across
various command types
- Consolidate the various user mapped features into a concept called
regions, making the various users of that consistent
- Various cleanups and fixes"
* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
io_uring: reuse io_should_terminate_tw() for cmds
io_uring: Factor out a function to parse restrictions
io_uring/rsrc: require cloned buffers to share accounting contexts
io_uring: simplify the SQPOLL thread check when cancelling requests
io_uring: expose read/write attribute capability
io_uring/rw: don't gate retry on completion context
io_uring/rw: handle -EAGAIN retry at IO completion time
io_uring/rw: use io_rw_recycle() from cleanup path
io_uring/rsrc: simplify the bvec iter count calculation
io_uring: ensure io_queue_deferred() is out-of-line
io_uring/rw: always clear ->bytes_done on io_async_rw setup
io_uring/rw: use NULL for rw->free_iovec assigment
io_uring/rw: don't mask in f_iocb_flags
io_uring/msg_ring: Drop custom destructor
io_uring: Move old async data allocation helper to header
io_uring/rw: Allocate async data through helper
io_uring/net: Allocate msghdr async data through helper
io_uring/uring_cmd: Allocate async data through generic helper
io_uring/poll: Allocate apoll with generic alloc_cache helper
...
Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
- Fix a case where the new scanning code missed removing an unused rsb.
- Fix the error when removing a configfs entry for an invalid node id.
Merge tag 'dlm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
- Fix a case where the new scanning code missed removing an unused rsb
- Fix the error when removing a configfs entry for an invalid node id
* tag 'dlm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: return -ENOENT if no comm was found
dlm: fix srcu_read_lock() return type to int
dlm: fix removal of rsb struct that is master and dir record
Lots of scalability work, another big on disk format change. On disk
format version goes from 1.13 to 1.20.
Like 6.11, this is another big and expensive automatic/required on disk
format upgrade. This is planned to be the last big on disk format
upgrade before the experimental label comes off. There will be one more
minor on disk format update for a few things that couldn't make this
release.
Headline improvements:
- Fix mount time regression that some users encountered post the 6.11
disk accounting rewrite.
Accounting keys were encoded little endian (typetag in the low bits) -
which didn't anticipate adding accounting keys for every inode, which
aren't stored in memory and we don't want to scan at mount time.
- fsck time on large filesystems is improved by multiple orders of
magnitude. Previously, 100TB was about the practical max filesystem
size, where users were reporting fsck times of a day+. With the new
changes (which nearly eliminate backpointers fsck overhead), we fsck'd
a filesystem with 10PB of data in 1.5 hours.
The problematic fsck passes were walking every extent and checking for
missing backpointers, and walking every backpointer to check for
dangling backpointers. As we've been adding more and more runtime self
healing there was no reason to keep around the backpointers -> extents
pass; dangling backpointers are just deleted, and we can do that when
using them - thus, backpointers -> extents is now only run in debug
mode.
extents -> backpointers does need to exist, since missing backpointers
would mean we can't find data to move it (for e.g. copygc, device
evacuate, scrub). But the new on disk format version makes possible a
new strategy where we sum up backpointers within a bucket and check it
against the bucket sector counts, and then only scan for missing
backpointers if the counts are off (and then, only for specific
buckets).
Full list of on disk format changes:
- 1.14: backpointer_bucket_gen
Backpointers now have a field for the bucket generation number,
replacing the obsolete bucket_offset field. This is needed for the
new "sum up backpointers within a bucket" code, since backpointers use
the btree write buffer - meaning we will see stale reads, and this
runs online, with the filesystem in full rw mode.
- 1.15: disk_accounting_big_endian
As previously described, fix the endianness of accounting keys so that
accounting keys with the same typetag sort together, and accounting
read can skip types it's not interested in.
- 1.16: reflink_p_may_update_opts:
This version indicates that a new reflink pointer field is understood
and may be used; the field indicates whether the reflink pointer has
permission to update IO path options (e.g. compression, replicas) on
the indirect extent it points to.
This completes the rebalance/reflink data path option handling from
the 6.13 pull request.
- 1.17: inode_depth
Add a new inode field, bi_depth, to accelerate the
check_directory_structure fsck path, which checks for loops in the
filesystem hierarchy.
check_inodes and check_dirents check connectivity, so
check_directory_structure only has to check for loops - by walking
back up to the root from every directory.
But a path can't be a loop if it has a counter that increases
monotonically from root to leaf - adding a depth counter means that we
can check for loops with only local (parent -> child) checks. We might
need to occasionally renumber the depth field in fsck if directories
have been moved around, but then future fsck runs will be much faster.
- 1.18: persistent_inode_cursors
Previously, the cursor used for inode allocation was only kept in
memory, which meant that users with large filesystems and lots of
files were reporting that the first create after mounting would take
a while - since it had to scan from the start.
Inode allocation cursors are now persistent, and also include a
generation field (incremented on wraparound, which will only happen if
inode allocation is restricted to 32 bit inodes), so that we don't
have to leave inode_generation keys around after a delete.
The option for 32 bit inode numbers may now also be set on individual
directories, and non-32 bit inode allocations are disallowed from
allocating from the 32 bit part of the inode number space.
- 1.19: autofix_errors
Runtime self healing is now the default.
- 1.20: directory size (from Hongbo)
directory i_size is now meaningful, and not 0.
Release notes from the previous 6.13 pull request:
- Self healing work:
Allocator and reflink now run the exact same check/repair code that
fsck does at runtime, where applicable.
The long term goal here is to remove inconsistent() errors (that cause
us to go emergency read only) by lifting fsck code up to normal
runtime paths; we should only go emergency read-only if we detect an
inconsistency that was due to a runtime bug - or truly catastrophic
damage (corrupted btree roots/interior nodes).
- Reflink repair no longer deletes reflink pointers: instead we flip an
error bit and log the error, and they can still be deleted by file
deletion. This means a temporary failure to find an indirect extent
(perhaps repaired later by btree node scan) won't result in
unnecessary data loss
- Improvements to rebalance data path option handling: we can now
correctly apply changed filesystem-level io path options to pending
rebalance work, and soon we'll be able to apply file-level io path
option changes to indirect extents.
Merge tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
"Lots of scalability work, another big on-disk format change. On-disk
format version goes from 1.13 to 1.20.
Like 6.11, this is another big and expensive automatic/required on
disk format upgrade. This is planned to be the last big on disk format
upgrade before the experimental label comes off. There will be one
more minor on disk format update for a few things that couldn't make
this release.
Headline improvements:
- Self healing work:
Allocator and reflink now run the exact same check/repair code that
fsck does at runtime, where applicable.
The long term goal here is to remove inconsistent() errors (that
cause us to go emergency read only) by lifting fsck code up to
normal runtime paths; we should only go emergency read-only if we
detect an inconsistency that was due to a runtime bug - or truly
catastrophic damage (corrupted btree roots/interior nodes).
- Reflink repair no longer deletes reflink pointers:
Instead we flip an error bit and log the error, and they can still
be deleted by file deletion. This means a temporary failure to find
an indirect extent (perhaps repaired later by btree node scan)
won't result in unnecessary data loss
- Improvements to rebalance data path option handling:
We can now correctly apply changed filesystem-level io path options
to pending rebalance work, and soon we'll be able to apply
file-level io path option changes to indirect extents
- Fix mount time regression that some users encountered post the 6.11
disk accounting rewrite.
Accounting keys were encoded little endian (typetag in the low
bits) - which didn't anticipate adding accounting keys for every
inode, which aren't stored in memory and we don't want to scan at
mount time.
- fsck time on large filesystems is improved by multiple orders of
magnitude. Previously, 100TB was about the practical max filesystem
size, where users were reporting fsck times of a day+. With the new
changes (which nearly eliminate backpointers fsck overhead), we
fsck'd a filesystem with 10PB of data in 1.5 hours.
The problematic fsck passes were walking every extent and checking
for missing backpointers, and walking every backpointer to check
for dangling backpointers. As we've been adding more and more
runtime self healing there was no reason to keep around the
backpointers -> extents pass; dangling backpointers are just
deleted, and we can do that when using them - thus, backpointers ->
extents is now only run in debug mode.
extents -> backpointers does need to exist, since missing
backpointers would mean we can't find data to move it (for e.g.
copygc, device evacuate, scrub). But the new on disk format version
makes possible a new strategy where we sum up backpointers within a
bucket and check it against the bucket sector counts, and then only
scan for missing backpointers if the counts are off (and then, only
for specific buckets).
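The bucket-level strategy described above can be illustrated with a small user-space sketch. This is not bcachefs code: the `toy_backpointer` struct, its fields, and `buckets_needing_scan()` are hypothetical names chosen for illustration, and the real implementation also has to deal with the btree write buffer and bucket generations.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a backpointer record. */
struct toy_backpointer {
	size_t   bucket;   /* index of the bucket the extent lives in */
	uint32_t sectors;  /* sectors this backpointer accounts for */
};

/*
 * Sum backpointer sectors per bucket and compare each sum against the
 * bucket's recorded sector count. Buckets whose sums differ are flagged
 * in needs_scan[]; only those would need the expensive extent walk.
 * Returns the number of flagged buckets.
 */
size_t buckets_needing_scan(const struct toy_backpointer *bps, size_t nr_bps,
			    const uint32_t *bucket_sectors, size_t nr_buckets,
			    uint64_t *sums, unsigned char *needs_scan)
{
	size_t i, flagged = 0;

	for (i = 0; i < nr_buckets; i++)
		sums[i] = 0;

	/* Accumulate sectors per bucket, ignoring out-of-range indices. */
	for (i = 0; i < nr_bps; i++)
		if (bps[i].bucket < nr_buckets)
			sums[bps[i].bucket] += bps[i].sectors;

	for (i = 0; i < nr_buckets; i++) {
		needs_scan[i] = sums[i] != bucket_sectors[i];
		flagged += needs_scan[i];
	}
	return flagged;
}
```

When every sum matches, nothing is flagged and the per-extent scan can be skipped entirely, which is where the fsck time saving would come from.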
Full list of on disk format changes:
- 1.14: backpointer_bucket_gen
Backpointers now have a field for the bucket generation number,
replacing the obsolete bucket_offset field. This is needed for the
new "sum up backpointers within a bucket" code, since backpointers
use the btree write buffer - meaning we will see stale reads, and
this runs online, with the filesystem in full rw mode.
- 1.15: disk_accounting_big_endian
As previously described, fix the endianness of accounting keys so
that accounting keys with the same typetag sort together, and
accounting read can skip types it's not interested in.
- 1.16: reflink_p_may_update_opts:
This version indicates that a new reflink pointer field is
understood and may be used; the field indicates whether the reflink
pointer has permission to update IO path options (e.g.
compression, replicas) on the indirect extent it points to.
This completes the rebalance/reflink data path option handling from
the 6.13 pull request.
- 1.17: inode_depth
Add a new inode field, bi_depth, to accelerate the
check_directory_structure fsck path, which checks for loops in the
filesystem hierarchy.
check_inodes and check_dirents check connectivity, so
check_directory_structure only has to check for loops - by walking
back up to the root from every directory.
But a path can't be a loop if it has a counter that increases
monotonically from root to leaf - adding a depth counter means that
we can check for loops with only local (parent -> child) checks. We
might need to occasionally renumber the depth field in fsck if
directories have been moved around, but then future fsck runs will
be much faster.
- 1.18: persistent_inode_cursors
Previously, the cursor used for inode allocation was only kept in
memory, which meant that users with large filesystems and lots of
files were reporting that the first create after mounting would
take a while - since it had to scan from the start.
Inode allocation cursors are now persistent, and also include a
generation field (incremented on wraparound, which will only happen
if inode allocation is restricted to 32 bit inodes), so that we
don't have to leave inode_generation keys around after a delete.
The option for 32 bit inode numbers may now also be set on
individual directories, and non-32 bit inode allocations are
disallowed from allocating from the 32 bit part of the inode number
space.
- 1.19: autofix_errors
Runtime self healing is now the default.
- 1.20: directory size (from Hongbo)
directory i_size is now meaningful, and not 0"
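The inode_depth reasoning above (a counter that strictly increases from root to leaf rules out loops) can be sketched as a purely local parent -> child check. The `toy_dir` struct and `depths_rule_out_loops()` are hypothetical names for illustration; only the idea of a per-inode depth field comes from the notes above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory directory record; not the bcachefs inode layout. */
struct toy_dir {
	size_t   parent;  /* index of the parent directory */
	uint32_t depth;   /* plays the role of a depth field; root is 0 */
};

/*
 * Local check: every non-root directory must be strictly deeper than
 * its parent. If that holds everywhere, following parent links strictly
 * decreases depth, so no chain of parents can return to its start.
 * A false result does not prove a loop exists; it only means the depth
 * fields cannot rule one out (e.g. they may need renumbering).
 */
bool depths_rule_out_loops(const struct toy_dir *dirs, size_t nr, size_t root)
{
	for (size_t i = 0; i < nr; i++) {
		if (i == root)
			continue;
		if (dirs[i].parent >= nr)
			return false;
		if (dirs[i].depth <= dirs[dirs[i].parent].depth)
			return false;
	}
	return true;
}
```

The check touches each directory once and never walks back up to the root, which is the property that makes it cheaper than a full per-directory traversal.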
* tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs: (268 commits)
bcachefs: Fix check_inode_hash_info_matches_root()
bcachefs: Document issue with bch_stripe layout
bcachefs: Fix self healing on read error
bcachefs: Pop all the transactions from the abort one
bcachefs: Only abort the transactions in the cycle
bcachefs: Introduce lock_graph_pop_from
bcachefs: Convert open-coded lock_graph_pop_all to helper
bcachefs: Do not allow no fail lock request to fail
bcachefs: Merge the condition to avoid additional invocation
Revert "bcachefs: Fix bch2_btree_node_upgrade()"
bcachefs: bcachefs_metadata_version_directory_size
bcachefs: make directory i_size meaningful
bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet
bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck
bcachefs: Check for dirents to overwritten inodes
bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth
bcachefs: Don't set btree_path to updtodate if we don't fill
bcachefs: __bch2_btree_pos_to_text()
bcachefs: printbuf_reset() handles tabstops
bcachefs: Silence read-only errors when deleting snapshots
...
- exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
(Tycho Andersen, Kees Cook)
- binfmt_misc: Fix comment typos (Christophe JAILLET)
- exec: move empty argv[0] warning closer to actual logic (Nir Lichtman)
- exec: remove legacy custom binfmt modules autoloading (Nir Lichtman)
- binfmt_flat: Fix integer overflow bug on 32 bit systems (Dan Carpenter)
- exec: Make sure set_task_comm() always NUL-terminates
- coredump: Do not lock when copying "comm"
- MAINTAINERS: add auxvec.h and set myself as maintainer
Merge tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case (Tycho
Andersen, Kees Cook)
- binfmt_misc: Fix comment typos (Christophe JAILLET)
- move empty argv[0] warning closer to actual logic (Nir Lichtman)
- remove legacy custom binfmt modules autoloading (Nir Lichtman)
- Make sure set_task_comm() always NUL-terminates
- binfmt_flat: Fix integer overflow bug on 32 bit systems (Dan
Carpenter)
- coredump: Do not lock when copying "comm"
- MAINTAINERS: add auxvec.h and set myself as maintainer
* tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_flat: Fix integer overflow bug on 32 bit systems
selftests/exec: add a test for execveat()'s comm
exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
exec: Make sure task->comm is always NUL-terminated
exec: remove legacy custom binfmt modules autoloading
exec: move warning of null argv to be next to the relevant code
fs: binfmt: Fix a typo
MAINTAINERS: exec: Mark Kees as maintainer
MAINTAINERS: exec: Add auxvec.h UAPI
coredump: Do not lock during 'comm' reporting
Merge tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible changes, features:
- rebuilding of the free space tree at mount time is done in more
transactions, fix potential hangs when the transaction thread is
blocked due to large amount of block groups
- more read IO balancing strategies (experimental config), add two
new ways to select a device for reads if the profiles allow it
(all RAID1*), the current default selects the device by pid which
is good on average but less performant for single reader workloads
- select preferred device for all reads (namely for testing)
- round-robin, balance reads across devices relevant for the
requested IO range
- add encoded write ioctl support to io_uring (read was added in
6.12), basis for writing send stream using that instead of
syscalls, non-blocking mode is not yet implemented
- support FS_IOC_READ_VERITY_METADATA, applications can use the
metadata to do their own verification
- pass inode's i_write_hint to bios, for parity with other
filesystems, ioctls F_GET_RW_HINT/F_SET_RW_HINT
Core:
- in zoned mode: allow to directly reclaim a block group by simply
resetting it, then it can be reused and another block group does
not need to be allocated
- super block validation now also does more comprehensive sys array
validation, adding it to the points where superblock is validated
(post-read, pre-write)
- subpage mode fixes:
- fix double accounting of blocks due to some races
- improved or fixed error handling in a few cases (compression,
delalloc)
- raid stripe tree:
- fix various cases with extent range splitting or deleting
- implement hole punching to extent range
- reduce number of stripe tree lookups during bio submission
- more self-tests
- updated self-tests (delayed refs)
- error handling improvements
- cleanups, refactoring
- remove rest of backref caching infrastructure from relocation,
not needed anymore
- error message updates
- remove unnecessary calls when extent buffer was marked dirty
- unused parameter removal
- code moved to new files
Other code changes: add rb_find_add_cached() to the rb-tree API"
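To illustrate why a pid-based read policy is weak for a single reader while round-robin is not, here is a deliberately simplified sketch. The function and struct names are hypothetical, and the actual btrfs round-robin policy is more involved than a plain rotating counter; this only shows the difference in selection behaviour.

```c
#include <stddef.h>

/* pid-style policy: the mirror is a pure function of the reader's pid. */
size_t toy_pick_by_pid(unsigned int pid, size_t nr_mirrors)
{
	return pid % nr_mirrors;	/* nr_mirrors must be non-zero */
}

/* Hypothetical shared state for a naive round-robin policy. */
struct toy_rr_state {
	size_t next;
};

/* Round-robin: rotate through the mirrors on every read. */
size_t toy_pick_round_robin(struct toy_rr_state *s, size_t nr_mirrors)
{
	size_t pick = s->next % nr_mirrors;

	s->next = (pick + 1) % nr_mirrors;
	return pick;
}
```

With two mirrors, a single process with pid 1234 always lands on mirror 0 under the pid policy, while the round-robin variant alternates between both devices.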
* tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (127 commits)
btrfs: selftests: add a selftest for deleting two out of three extents
btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents
btrfs: selftests: add selftest for punching holes into the RAID stripe extents
btrfs: selftests: test RAID stripe-tree deletion spanning two items
btrfs: selftests: don't split RAID extents in half
btrfs: selftests: check for correct return value of failed lookup
btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents
btrfs: implement hole punching for RAID stripe extents
btrfs: fix deletion of a range spanning parts two RAID stripe extents
btrfs: fix tail delete of RAID stripe-extents
btrfs: fix front delete range calculation for RAID stripe extents
btrfs: assert RAID stripe-extent length is always greater than 0
btrfs: don't try to delete RAID stripe-extents if we don't need to
btrfs: selftests: correct RAID stripe-tree feature flag setting
btrfs: add io_uring interface for encoded writes
btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
btrfs: add extra error messages for delalloc range related errors
btrfs: subpage: dump the involved bitmap when ASSERT() failed
btrfs: subpage: fix the bitmap dump of the locked flags
btrfs: do proper folio cleanup when run_delalloc_nocow() failed
...
- In the quota code, to avoid spurious audit messages, don't call
capable() when quotas are off.
- When changing the 'j' flag of an inode, truncate the inode address
space to avoid mixing "buffer head" and "iomap" pages.
Merge tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- In the quota code, to avoid spurious audit messages, don't call
capable() when quotas are off
- When changing the 'j' flag of an inode, truncate the inode address
space to avoid mixing "buffer head" and "iomap" pages
* tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Truncate address space when flipping GFS2_DIF_JDATA flag
gfs2: reorder capability check last
Merge tag 'vfs-6.14-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull afs updates from Christian Brauner:
"Dynamic root improvements:
- Create an /afs/.<cell> mountpoint to match the /afs/<cell>
mountpoint when a cell is created
- Add some more checks on cell names proposed by the user to prevent
dodgy symlink bodies from being created. Also prevent rootcell from
being altered once set to simplify the locking
- Change the handling of /afs/@cell from being a dentry name
substitution at lookup time to making it a symlink to the current
cell name and also provide a /afs/.@cell symlink to point to the
dotted cell mountpoint
Fixes:
- Fix the abort code check in the fallback handling for the
YFS.RemoveFile2 RPC call
- Use call->op->server() for ordinary filesystem RPC calls that have
an operation descriptor instead of call->server()"
* tag 'vfs-6.14-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Fix the fallback handling for the YFS.RemoveFile2 RPC call
afs: Make /afs/@cell and /afs/.@cell symlinks
afs: Add rootcell checks
afs: Make /afs/.<cell> as well as /afs/<cell> mountpoints
Merge tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs direct-io updates from Christian Brauner:
"File systems that write out of place usually require different
alignment for direct I/O writes than what they can do for reads.
Add a separate dio read align field to statx, as many out of place
write file systems can easily do reads aligned to the device sector
size, but require bigger alignment for writes.
This is usually papered over by falling back to buffered I/O for
smaller writes and doing read-modify-write cycles, but performance for
this sucks, so applications benefit from knowing the actual write
alignment"
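A sketch of how an application might use separate read and write alignment values once it has fetched them via statx(). The `toy_dio_align` struct and its field names are hypothetical and do not mirror the statx layout; treating a zero read alignment as "same as writes" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical container for alignment values an application obtained
 * from statx(); the field names here are illustrative only.
 */
struct toy_dio_align {
	uint32_t mem_align;          /* required buffer address alignment */
	uint32_t write_offset_align; /* offset/length alignment for writes */
	uint32_t read_offset_align;  /* for reads; 0: assume same as writes */
};

static bool is_aligned(uint64_t v, uint32_t align)
{
	return align != 0 && v % align == 0;
}

/* Decide whether a direct I/O request can be issued without a fallback. */
bool dio_request_ok(const struct toy_dio_align *a, bool is_write,
		    uintptr_t buf, uint64_t offset, uint64_t len)
{
	uint32_t off_align = a->write_offset_align;

	/* Reads may use the (usually smaller) read alignment if reported. */
	if (!is_write && a->read_offset_align)
		off_align = a->read_offset_align;

	return is_aligned(buf, a->mem_align) &&
	       is_aligned(offset, off_align) &&
	       is_aligned(len, off_align);
}
```

With a 512-byte read alignment and a 4096-byte write alignment, a 512-byte request at offset 512 passes for reads but not for writes, so the application knows up front which writes would have hit the slow path.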
* tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: report larger dio alignment for COW inodes
xfs: report the correct read/write dio alignment for reflinked inodes
xfs: cleanup xfs_vn_getattr
fs: add STATX_DIO_READ_ALIGN
fs: reformat the statx definition
Merge tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs libfs updates from Christian Brauner:
"This improves the stable directory offset behavior in various ways.
Stable offsets are needed so that NFS can reliably read directories on
filesystems such as tmpfs:
- Improve the end-of-directory detection
According to getdents(3), the d_off field in each returned
directory entry points to the next entry in the directory. The
d_off field in the last returned entry in the readdir buffer must
contain a valid offset value, but if it points to an actual
directory entry, then readdir/getdents can loop.
Introduce a specific fixed offset value that is placed in the d_off
field of the last entry in a directory. Some user space
applications assume that the EOD offset value is larger than the
offsets of real directory entries, so the largest valid offset
value is reserved for this purpose. This new value is never
allocated by simple_offset_add().
When ->iterate_dir() returns, getdents{64} inserts the ctx->pos
value into the d_off field of the last valid entry in the readdir
buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD
offset value so the last entry is updated to point to the EOD
marker.
When trying to read the entry at the EOD offset, offset_readdir()
terminates immediately.
- Rely on d_children to iterate stable offset directories
Instead of using the mtree to emit entries in the order of their
offset values, use it only to map incoming ctx->pos to a starting
entry. Then use the directory's d_children list, which is already
maintained properly by the dcache, to find the next child to emit.
- Narrow the range of directory offset values returned by
simple_offset_add() to 3 .. (S32_MAX - 1) on all platforms. This
means the allocation behavior is identical on 32-bit systems,
64-bit systems, and 32-bit user space on 64-bit kernels. The new
range still permits over 2 billion concurrent entries per
directory.
- Return ENOSPC when the directory offset range is exhausted. Hitting
this error is almost impossible though.
- Remove the simple_offset_empty() helper"
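The offset range, the reserved end-of-directory (EOD) value and the ENOSPC case described above can be modelled in user space roughly as follows. All `toy_*` names are hypothetical; the EOD value of INT32_MAX is inferred from the stated range 3 .. (S32_MAX - 1), and this toy allocator never reuses freed offsets, which the real code presumably handles.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_OFFSET_MIN 3
#define TOY_OFFSET_EOD INT32_MAX	/* assumed: largest value, never allocated */

struct toy_offset_ctx {
	int32_t next;	/* next candidate offset */
	int32_t max;	/* largest offset given to a real entry */
};

void toy_offset_init(struct toy_offset_ctx *c, int32_t max)
{
	c->next = TOY_OFFSET_MIN;
	c->max = max;	/* would be TOY_OFFSET_EOD - 1 for the full range */
}

/* Hand out the next offset, or -ENOSPC once the range is exhausted. */
int toy_offset_add(struct toy_offset_ctx *c, int32_t *out)
{
	if (c->next > c->max)
		return -ENOSPC;
	*out = c->next++;
	return 0;
}

/*
 * Emit up to cap entries from a directory whose entry offsets are sorted
 * ascending, starting at *pos. On return *pos is the offset of the next
 * unread entry, or the EOD marker when nothing is left. A call that
 * starts at the EOD marker terminates immediately.
 */
size_t toy_readdir(const int32_t *offsets, size_t nr, int32_t *pos,
		   int32_t *out, size_t cap)
{
	size_t i = 0, n = 0;

	if (*pos == TOY_OFFSET_EOD)
		return 0;
	while (i < nr && offsets[i] < *pos)
		i++;
	while (i < nr && n < cap)
		out[n++] = offsets[i++];
	*pos = (i < nr) ? offsets[i] : TOY_OFFSET_EOD;
	return n;
}
```

Because the EOD marker is larger than any offset a real entry can receive, a reader that resumes at the marker can never be handed a real entry again, which avoids the looping case mentioned above.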
* tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
libfs: Use d_children list to iterate simple_offset directories
libfs: Replace simple_offset end-of-directory detection
Revert "libfs: fix infinite directory reads for offset dir"
Revert "libfs: Add simple_offset_empty()"
libfs: Return ENOSPC when the directory offset range is exhausted
Merge tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
- Add a mountinfo program to demonstrate statmount()/listmount()
Add a new "mountinfo" sample userland program that demonstrates how
to use statmount() and listmount() to get at the same info that
/proc/pid/mountinfo provides
- Remove pointless nospec.h include
- Prepend statmount.mnt_opts string with security_sb_mnt_opts()
Currently these mount options aren't accessible via statmount()
- Add new mount namespaces to mount namespace rbtree outside of the
namespace semaphore
- Lockless mount namespace lookup
Currently we take the read lock when looking for a mount namespace to
list mounts in. We can make this lockless. The simple search case can
just use a sequence counter to detect concurrent changes to the
rbtree
For walking the list of mount namespaces sequentially via nsfs we
keep a separate rcu list as rb_prev() and rb_next() aren't usable
safely with rcu. Currently there is no primitive for retrieving the
previous list member. To do this we need a new deletion primitive
that doesn't poison the prev pointer and a corresponding retrieval
helper
Since creating mount namespaces is a relatively rare event compared
with querying mounts in a foreign mount namespace this is worth it.
Once libmount and systemd pick up this mechanism to list mounts in
foreign mount namespaces this will be used very frequently
- Add extended selftests for lockless mount namespace iteration
- Add a sample program to list all mounts on the system, i.e., in
all mount namespaces
- Improve mount namespace iteration performance
Make finding the last or first mount to start iterating the mount
namespace from an O(1) operation and add selftests for iterating the
mount table starting from the first and last mount
- Use an xarray for the old mount id
While the ida does use the xarray internally we can use it explicitly
which allows us to increment the unique mount id under the xa lock.
This allows us to remove the atomic as we're now allocating both ids
in one go
- Use a shared header for vfs sample programs
- Fix build warnings for new sample program to list all mounts
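The sequence-counter idea behind the lockless mount namespace lookup can be sketched in user space with C11 atomics. This is only an illustration under stated assumptions: it uses a small flat array instead of an rbtree, assumes writers are serialized by some external lock, and every `toy_*` name is hypothetical rather than a kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_NS 8

struct toy_ns_table {
	atomic_uint seq;			/* odd while a writer is active */
	_Atomic size_t nr;
	_Atomic uint64_t ids[TOY_MAX_NS];
};

void toy_write_begin(struct toy_ns_table *t)
{
	atomic_fetch_add_explicit(&t->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

void toy_write_end(struct toy_ns_table *t)
{
	atomic_fetch_add_explicit(&t->seq, 1, memory_order_release);
}

/* Writer side; callers are assumed to hold an external lock. */
bool toy_add(struct toy_ns_table *t, uint64_t id)
{
	size_t nr = atomic_load_explicit(&t->nr, memory_order_relaxed);

	if (nr >= TOY_MAX_NS)
		return false;
	toy_write_begin(t);
	atomic_store_explicit(&t->ids[nr], id, memory_order_relaxed);
	atomic_store_explicit(&t->nr, nr + 1, memory_order_relaxed);
	toy_write_end(t);
	return true;
}

/*
 * One lockless lookup attempt. Returns false if a writer was active or
 * the counter changed during the search, in which case the caller
 * should retry; *found is only written on success.
 */
bool toy_lookup_try(struct toy_ns_table *t, uint64_t id, bool *found)
{
	unsigned int start = atomic_load_explicit(&t->seq, memory_order_acquire);
	bool hit = false;
	size_t nr, i;

	if (start & 1)
		return false;
	nr = atomic_load_explicit(&t->nr, memory_order_relaxed);
	for (i = 0; i < nr && i < TOY_MAX_NS; i++)
		if (atomic_load_explicit(&t->ids[i], memory_order_relaxed) == id)
			hit = true;
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(&t->seq, memory_order_relaxed) != start)
		return false;
	*found = hit;
	return true;
}
```

A reader would typically call `toy_lookup_try()` in a loop until it returns true; readers never block writers, and a concurrent change only costs the reader a retry.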
* tag 'vfs-6.14-rc1.mount.v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
samples/vfs: fix build warnings
samples/vfs: use shared header
samples/vfs/mountinfo: Use __u64 instead of uint64_t
fs: remove useless lockdep assertion
fs: use xarray for old mount id
selftests: add listmount() iteration tests
fs: cache first and last mount
samples: add test-list-all-mounts
selftests: remove unneeded include
selftests: add tests for mntns iteration
seltests: move nsfs into filesystems subfolder
fs: simplify rwlock to spinlock
fs: lockless mntns lookup for nsfs
rculist: add list_bidir_{del,prev}_rcu()
fs: lockless mntns rbtree lookup
fs: add mount namespace to rbtree late
fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()
mount: remove inlude/nospec.h include
samples: add a mountinfo program to demonstrate statmount()/listmount()
Merge tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pid_max namespacing update from Christian Brauner:
"The pid_max sysctl is a global value. For a long time the default
value has been 65535 and during the pidfd discussions Linus proposed to
bump pid_max by default. Based on this discussion systemd started
bumping pid_max to 2^22. So all new systems now run with a very high
pid_max limit with some distros having also backported that change.
The decision to bump pid_max is obviously correct. It just doesn't
make a lot of sense nowadays to enforce such a low pid number. There's
sufficient tooling available to select specific processes without
typing really large pid numbers.
In any case, there are workloads that have expectations about how large
pid numbers they accept, either for historical or architectural
reasons. One concrete example is the 32-bit version of
Android's bionic libc which requires pid numbers less than 65536.
There are workloads where it is run in a 32-bit container on a 64-bit
kernel. If the host has a pid_max value greater than 65535 the libc
will abort thread creation because of size assumptions of
pthread_mutex_t.
That's a fairly specific use-case. However, in general, specific
workloads that are moved into containers running on a host with a new
kernel and a new systemd can run into issues with large pid_max
values. Obviously making assumptions about the size of the allocated
pid is suboptimal but we have userspace that does it.
Of course, giving containers the ability to restrict the number of
processes in their respective pid namespace independent of the global
limit through pid_max is something desirable in itself and comes in
handy in general.
Independent of motivating use-cases the existence of pid namespaces
makes this also a good semantical extension and there have been prior
proposals pushing in a similar direction. The trick here is to
minimize the risk of regressions which I think is doable. The fact
that pid namespaces are hierarchical will help us here.
What we mostly care about is that when the host sets a low pid_max
limit, say (crazy number) 100 that no descendant pid namespace can
allocate a higher pid number in its namespace. Since pid allocation is
hierarchical this can be ensured by checking each pid allocation
against the pid namespace's pid_max limit. This means if the
allocation in the descendant pid namespace succeeds, the ancestor pid
namespace can reject it. If the ancestor pid namespace has a higher
limit than the descendant pid namespace the descendant pid namespace
will reject the pid allocation. The ancestor pid namespace will
obviously not care about this.
All in all this means pid_max continues to enforce a system wide limit
on the number of processes but allows pid namespaces sufficient leeway
in handling workloads with assumptions about pid values and allows
containers to restrict the number of processes in a pid namespace
through the pid_max interface"
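The hierarchical check described above can be sketched in a few lines of
C. This is only an illustration, not the kernel's code: struct ns_sketch
and pid_alloc_allowed() are hypothetical names, and the sketch assumes
that a number is valid at a given level only if it is at least 1 and
below that level's pid_max.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a pid namespace. */
struct ns_sketch {
	struct ns_sketch *parent; /* NULL for the initial pid namespace */
	int pid_max;              /* numbers at this level must stay below this */
};

/*
 * nr[0] is the candidate number in @ns itself, nr[1] the candidate in its
 * parent, and so on up to the initial pid namespace. The allocation is
 * allowed only if every level in the chain accepts its own candidate.
 */
static bool pid_alloc_allowed(const struct ns_sketch *ns, const int *nr,
			      size_t levels)
{
	size_t i = 0;

	for (; ns != NULL && i < levels; ns = ns->parent, i++) {
		if (nr[i] < 1 || nr[i] >= ns->pid_max)
			return false; /* this level rejects the allocation */
	}
	/* The caller must supply exactly one candidate per level. */
	return ns == NULL && i == levels;
}
```

With a host limit of 100 and a descendant limit of 2^22, a task whose
host-level number would be 100 is rejected by the ancestor even though
the descendant namespace would accept its own number.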
* tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
tests/pid_namespace: add pid_max tests
pid: allow pid_max to be set per pid namespace
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pRdwAKCRCRxhvAZXjc
otQjAP9ooUH2d/jHZ49Rw4q/3BkhX8R2fFEZgj2PMvtYlr0jQwD/d8Ji0k4jINTL
AIFRfPdRwrD+X35IUK3WPO42YFZ4rAg=
=5wgo
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
- Rework inode number allocation
Recently we received a patchset that aims to enable file handle
encoding and decoding via name_to_handle_at(2) and
open_by_handle_at(2).
A crucial step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts.
The issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2).
This can be used by containers using a separate pid namespace to
learn what the pid number of a given process in the initial pid
namespace is. While this is a weak information leak it could be used
in various exploits and in general is an ugly wart in the design.
To solve this problem a new way is needed to lookup a struct pid
based on the inode number allocated for that struct pid. The other
part is to remove the custom inode number allocation on 32bit systems
that is also an ugly wart that should go away.
Allocate unique identifiers for struct pid by simply incrementing a
64 bit counter and insert each struct pid into the rbtree so it can
be looked up to decode file handles without leaking actual pids
across pid namespaces in file handles.
On both 64 bit and 32 bit the same 64 bit identifier is used to
lookup struct pid in the rbtree. On 64 bit the unique identifier for
struct pid simply becomes the inode number. Comparing two pidfds
continues to be as simple as comparing inode numbers.
On 32 bit the 64 bit number assigned to struct pid is split into two
32 bit numbers. The lower 32 bits are used as the inode number and
the upper 32 bits are used as the inode generation number. Whenever a
wraparound happens on 32 bit the 64 bit number will be incremented by
2 so inode numbering starts at 2 again.
When a wraparound happens on 32 bit multiple pidfds with the same
inode number are likely to exist. This isn't a problem since before
pidfs pidfds used the anonymous inode meaning all pidfds had the same
inode number. On 32 bit userspace can thus reconstruct the 64 bit
identifier by retrieving both the inode number and the inode
generation number to compare, or use file handles. This gives the
same guarantees on both 32 bit and 64 bit.
- Implement file handle support
This is based on custom export operation methods which allow pidfs
to implement permission checking and opening of pidfs file handles
cleanly without hacking around in the core file handle code too much.
- Support bind-mounts
Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts
for pidfds. This allows pidfds to be safely recovered and checked for
process recycling.
Instead of checking d_ops for both nsfs and pidfs we could in a
follow-up patch add a flag argument to struct dentry_operations that
functions similar to file_operations->fop_flags.
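The 32 bit numbering scheme from the inode number rework above can be
sketched as follows. This is a minimal illustration under stated
assumptions, not the pidfs code: struct pidfs_ids32, split_id(),
join_id() and next_id32() are hypothetical names, and the wraparound
handling simply follows the description (skip ahead by 2 when the lower
32 bits wrap to 0).

```c
#include <stdint.h>

/* Hypothetical pair as seen on 32 bit: inode number plus generation. */
struct pidfs_ids32 {
	uint32_t ino; /* lower 32 bits of the 64 bit identifier */
	uint32_t gen; /* upper 32 bits of the 64 bit identifier */
};

/* Split the 64 bit identifier into inode number and generation. */
static struct pidfs_ids32 split_id(uint64_t id)
{
	struct pidfs_ids32 out = {
		.ino = (uint32_t)(id & 0xffffffffu),
		.gen = (uint32_t)(id >> 32),
	};
	return out;
}

/* Reconstruct the 64 bit identifier, as userspace on 32 bit could. */
static uint64_t join_id(uint32_t ino, uint32_t gen)
{
	return ((uint64_t)gen << 32) | ino;
}

/*
 * Advance the counter. When the lower 32 bits wrap to 0, skip ahead by 2
 * so that the inode number starts at 2 again.
 */
static uint64_t next_id32(uint64_t id)
{
	id++;
	if ((uint32_t)id == 0)
		id += 2;
	return id;
}
```

Two pidfds refer to the same struct pid on 32 bit only if both halves
match, which is the same comparison that a single inode number gives on
64 bit.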
* tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: add pidfd bind-mount tests
pidfs: allow bind-mounts
pidfs: lookup pid through rbtree
selftests/pidfd: add pidfs file handle selftests
pidfs: check for valid ioctl commands
pidfs: implement file handle support
exportfs: add permission method
fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh()
exportfs: add open method
fhandle: simplify error handling
pseudofs: add support for export_ops
pidfs: support FS_IOC_GETVERSION
pidfs: remove 32bit inode number handling
pidfs: rework inode number allocation
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pRjQAKCRCRxhvAZXjc
omUyAP9k31Qr7RY1zNtmpPfejqc+3Xx+xXD7NwHr+tONWtUQiQEA/F94qU2U3ivS
AzyDABWrEQ5ZNsm+Rq2Y3zyoH7of3ww=
=s3Bu
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Support caching symlink lengths in inodes
The size is stored in a new union utilizing the same space as
i_devices, thus avoiding growing the struct or taking up any more
space
When utilized it dodges strlen() in vfs_readlink(), giving about
1.5% speed up when issuing readlink on /initrd.img on ext4
- Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
If a file system supports uncached buffered IO, it may set
FOP_DONTCACHE and enable support for RWF_DONTCACHE.
If RWF_DONTCACHE is attempted without the file system supporting
it, it'll get errored with -EOPNOTSUPP
- Enable VBOXGUEST and VBOXSF_FS on ARM64
Now that VirtualBox is able to run as a host on arm64 (e.g. the
Apple M3 processors) we can enable VBOXSF_FS (and in turn
VBOXGUEST) for this architecture.
Tested with various runs of bonnie++ and dbench on an Apple MacBook
Pro with the latest VirtualBox 7.1.4 r165100 installed
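For the RWF_DONTCACHE feature listed above, a userspace caller might
try an uncached buffered write first and fall back to a normal buffered
write when the file system does not support it. This is a sketch under
assumptions: write_maybe_uncached() is a hypothetical helper, the
fallback policy is a choice made for the example, and the fallback
RWF_DONTCACHE value below is what the 6.14 uapi header is believed to
define, so check it against your installed headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes pwritev2() under _GNU_SOURCE */
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Assumption: older headers may lack the flag; value per 6.14 uapi. */
#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080
#endif

/*
 * Hypothetical helper: attempt an uncached buffered write and retry as a
 * regular buffered write if the request fails with EOPNOTSUPP.
 * Returns the number of bytes written, or -1 with errno set.
 */
static ssize_t write_maybe_uncached(int fd, const void *buf, size_t len,
				    off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);

	if (ret < 0 && errno == EOPNOTSUPP)
		ret = pwritev2(fd, &iov, 1, off, 0); /* plain buffered write */
	return ret;
}
```

Whether a silent fallback is appropriate depends on the caller; a
program that relies on keeping the page cache clean may prefer to report
the error instead.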
Cleanups:
- Delay sysctl_nr_open check in expand_files()
- Use kernel-doc includes in fiemap docbook
- Use page->private instead of page->index in watch_queue
- Use a consume fence in mnt_idmap() as it's heavily used in
link_path_walk()
- Replace magic number 7 with ARRAY_SIZE() in fc_log
- Sort out a stale comment about races between fd alloc and dup2()
- Fix return type of do_mount() from long to int
- Various cosmetic cleanups for the lockref code
Fixes:
- Annotate spinning as unlikely() in __read_seqcount_begin
The annotation already used to be there, but got lost in commit
52ac39e5db ("seqlock: seqcount_t: Implement all read APIs as
statement expressions")
- Fix proc_handler for sysctl_nr_open
- Flush delayed work in delayed fput()
- Fix grammar and spelling in propagate_umount()
- Fix ESP not readable during coredump
In /proc/PID/stat, there is the kstkesp field which is the stack
pointer of a thread. While the thread is active, this field reads
zero. But during a coredump, it should have a valid value
However, at the moment, kstkesp is zero even during coredump
- Don't wake up the writer if the pipe is still full
- Fix unbalanced user_access_end() in select code"
* tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
gfs2: use lockref_init for qd_lockref
erofs: use lockref_init for pcl->lockref
dcache: use lockref_init for d_lockref
lockref: add a lockref_init helper
lockref: drop superfluous externs
lockref: use bool for false/true returns
lockref: improve the lockref_get_not_zero description
lockref: remove lockref_put_not_zero
fs: Fix return type of do_mount() from long to int
select: Fix unbalanced user_access_end()
vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64
pipe_read: don't wake up the writer if the pipe is still full
selftests: coredump: Add stackdump test
fs/proc: do_task_stat: Fix ESP not readable during coredump
fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag
fs: sort out a stale comment about races between fd alloc and dup2
fs: Fix grammar and spelling in propagate_umount()
fs: fc_log replace magic number 7 with ARRAY_SIZE()
fs: use a consume fence in mnt_idmap()
file: flush delayed work in delayed fput()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pRDgAKCRCRxhvAZXjc
ojt3AQCY/X9EHTeiJ/1eBZd/mopcu6ftyjcVpiCIZzqXnr6DKAD+Odb/8C7/Axlg
A/ne6RjV4+DXOz8qJpaRAu4aV2zyMAs=
=xDe5
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.14-rc1.kcore' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull /proc/kcore updates from Christian Brauner:
"The performance of /proc/kcore reads has been showing up as a
bottleneck for the drgn debugger. drgn scripts often spend ~25% of
their time in the kernel reading from /proc/kcore.
A lot of this overhead comes from silly inefficiencies. This pull
request contains fixes for the low-hanging fruit. The fixes are all
fairly small and straightforward.
The result is a 25% improvement in read latency in micro-benchmarks
(from ~235 nanoseconds to ~175) and a 15% improvement in execution
time for real-world drgn scripts:
- Make /proc/kcore entry permanent
- Avoid walking the list on every read
- Use percpu_rw_semaphore for kclist_lock
- Make Omar Sandoval the official maintainer for /proc/kcore"
* tag 'vfs-6.14-rc1.kcore' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
MAINTAINERS: add me as /proc/kcore maintainer
proc/kcore: use percpu_rw_semaphore for kclist_lock
proc/kcore: don't walk list on every read
proc/kcore: mark proc entry as permanent
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pRKQAKCRCRxhvAZXjc
ov2dAQCULWjTBWdF8Ro2bfNeXzWvUUnSPjoLJ9B4xlrOB9c2MAEAiwkKHkzAxUco
hCvaRJc3H2ze2wrgbIABPKB2noQVVwk=
=4ojv
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs netfs updates from Christian Brauner:
"This contains read performance improvements and support for monolithic
single-blob objects that have to be read/written as such (e.g. AFS
directory contents). The implementation of the two parts is interwoven
as each makes the other possible.
- Read performance improvements
The read performance improvements are intended to speed up some
loss of performance detected in cifs and to a lesser extent in afs.
The problem is that we queue too many work items during the
collection of read results: each individual subrequest is collected
by its own work item, and then they have to interact with each
other when a series of subrequests don't exactly align with the
pattern of folios that are being read by the overall request.
Whilst the processing of the pages covered by individual
subrequests as they complete potentially allows folios to be woken
in parallel and with minimum delay, it can shuffle wakeups for
sequential reads out of order - and that is the most common I/O
pattern.
The final assessment and cleanup of an operation is then held up
until the last I/O completes - and for a synchronous sequential
operation, this means the bouncing around of work items just adds
latency.
Two changes have been made to make this work:
(1) All collection is now done in a single "work item" that works
progressively through the subrequests as they complete (and
also dispatches retries as necessary).
(2) For readahead and AIO, this work item can be done on a workqueue
and can run in parallel with the ultimate consumer of the data;
for synchronous direct or unbuffered reads, the collection is
run in the application thread and not offloaded.
Functions such as smb2_readv_callback() then just tell netfslib
that the subrequest has terminated; netfslib does a minimal bit of
processing on the spot - stat counting and tracing mostly - and
then queues/wakes up the worker. This simplifies the logic as the
collector just walks sequentially through the subrequests as they
complete and walks through the folios, if buffered, unlocking them
as it goes. It also keeps to a minimum the amount of latency
injected into the filesystem's low-level I/O handling
The way netfs supports filesystems using the deprecated
PG_private_2 flag is changed: folios are flagged and added to a
write request as they complete and that takes care of scheduling
the writes to the cache. The originating read request can then just
unlock the pages whatever happens.
- Single-blob object support
Single-blob objects are files for which the content of the file
must be read from or written to the server in a single operation
because reading them in parts may yield inconsistent results. AFS
directories are an example of this as there exists the possibility
that the contents are generated on the fly and would differ between
reads or might change due to third party interference.
Such objects will be written to and retrieved from the cache if one
is present, though we allow/may need to propose multiple
subrequests to do so. The important part is that read from/write to
the *server* is monolithic.
Single blob reading is, for the moment, fully synchronous and does
result collection in the application thread and, also for the
moment, the API is supplied the buffer in the form of a folio_queue
chain rather than using the pagecache.
- Related afs changes
This series makes a number of changes to the kafs filesystem,
primarily in the area of directory handling:
- AFS's FetchData RPC reply processing is made partially
asynchronous which allows the netfs_io_request's outstanding
operation counter to be removed as part of reducing the
collection to a single work item.
- Directory and symlink reading are plumbed through netfslib using
the single-blob object API and are now cacheable with fscache.
This also allows the afs_read struct to be eliminated and
netfs_io_subrequest to be used directly instead.
- Directory and symlink content are now stored in a folio_queue
buffer rather than in the pagecache. This means we don't require
the RCU read lock and xarray iteration to access it, and folios
won't randomly disappear under us because the VM wants them
back.
- The vnode operation lock is changed from a mutex struct to a
private lock implementation. The problem is that the lock now
needs to be dropped in a separate thread and mutexes don't
permit that.
- When a new directory or symlink is created, we now initialise it
locally and mark it valid rather than downloading it (we know
what it's likely to look like).
- We now use the in-directory hashtable to reduce the number of
entries we need to scan when doing a lookup. The edit routines
have to maintain the hash chains.
- Cancellation (e.g. by signal) of an async call after the
rxrpc_call has been set up is now offloaded to the worker thread
as there will be a notification from rxrpc upon completion. This
avoids a double cleanup.
- A "rolling buffer" implementation is created to abstract out the
two separate folio_queue chaining implementations I had (one for
read and one for write).
- Functions are provided to create/extend a buffer in a folio_queue
chain and tear it down again.
This is used to handle AFS directories, but could also be used to
create bounce buffers for content crypto and transport crypto.
- The was_async argument is dropped from netfs_read_subreq_terminated()
Instead we wake the read collection work item by either queuing it
or waking up the app thread.
- We don't need to use BH-excluding locks when communicating between
the issuing thread and the collection thread as neither of them now
run in BH context.
- Also included are a number of new tracepoints; a split of the
netfslib write collection code to put retrying into its own file
(it gets more complicated with content encryption).
- There are also some minor AFS fixes included, including fixing the
AFS directory format struct layout, reducing some directory
over-invalidation and making afs_rmdir() translate EEXIST to
ENOTEMPTY (which is not available on all systems the servers
support).
- Finally, there's a patch to try and detect entry into the folio
unlock function with no folio_queue structs in the buffer (which
isn't allowed in the cases that can get there).
This is a debugging patch, but should be minimal overhead"
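The single-collector idea from the read performance improvements above
can be illustrated with a small model. It is not netfslib code:
struct subreq_sketch, struct collector_sketch and collect_pass() are
hypothetical names, and the sketch only shows the in-order walk, where
one collector consumes completed subrequests from the front and stops at
the first one still in flight.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subrequest: a length and a completion flag. */
struct subreq_sketch {
	size_t len;
	bool done;
};

/* Hypothetical collector state shared by all subrequests of a request. */
struct collector_sketch {
	struct subreq_sketch *subreqs;
	size_t nr;        /* number of subrequests */
	size_t next;      /* first subrequest not yet collected */
	size_t collected; /* bytes collected so far, strictly in order */
};

/*
 * One pass of the collector: consume every completed subrequest at the
 * front of the queue and stop at the first incomplete one. A completion
 * handler would only set ->done and wake the collector.
 * Returns how many subrequests this pass collected.
 */
static size_t collect_pass(struct collector_sketch *c)
{
	size_t n = 0;

	while (c->next < c->nr && c->subreqs[c->next].done) {
		c->collected += c->subreqs[c->next].len;
		c->next++;
		n++;
	}
	return n;
}
```

Because results are only ever consumed from the front, an out-of-order
completion waits until its predecessors finish, which keeps wakeups for
sequential reads in order.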
* tag 'vfs-6.14-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits)
netfs: Report on NULL folioq in netfs_writeback_unlock_folios()
afs: Add a tracepoint for afs_read_receive()
afs: Locally initialise the contents of a new symlink on creation
afs: Use the contained hashtable to search a directory
afs: Make afs_mkdir() locally initialise a new directory's content
netfs: Change the read result collector to only use one work item
afs: Make {Y,}FS.FetchData an asynchronous operation
afs: Fix cleanup of immediately failed async calls
afs: Eliminate afs_read
afs: Use netfslib for symlinks, allowing them to be cached
afs: Use netfslib for directories
afs: Make afs_init_request() get a key if not given a file
netfs: Add support for caching single monolithic objects such as AFS dirs
netfs: Add functions to build/clean a buffer in a folio_queue
afs: Add more tracepoints to do with tracking validity
cachefiles: Add auxiliary data trace
cachefiles: Add some subrequest tracepoints
netfs: Remove some extraneous directory invalidations
afs: Fix directory format encoding struct
afs: Fix EEXIST error returned from afs_rmdir() to be ENOTEMPTY
...
This was a suggestion by David Laight, and while I was slightly worried
that some micro-architecture would predict cmov like a conditional
branch, there is little reason to actually believe any core would be
that broken.
Intel documents that their existing cores treat CMOVcc as a data
dependency that will constrain speculation in their "Speculative
Execution Side Channel Mitigations" whitepaper:
"Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also
be used to prevent bounds check bypass by constraining speculative
execution on current family 6 processors (Intel® Core™, Intel® Atom™,
Intel® Xeon® and Intel® Xeon Phi™ processors)"
and while that leaves the future uarch issues open, that's certainly
true of our traditional SBB usage too.
Any core that predicts CMOV will be unusable for various crypto
algorithms that need data-independent timing stability, so let's just
treat CMOV as the safe choice that simplifies the address masking by
avoiding an extra instruction and doesn't need a temporary register.
Suggested-by: David Laight <David.Laight@aculab.com>
Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
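The difference between the two masking styles discussed above can be
sketched in portable C. This is an illustration only: the kernel uses
inline assembly, a compiler is free to pick other instructions for this
C code, and USER_PTR_MAX_SKETCH is a hypothetical limit chosen for the
example. The SBB-style variant assumes the traditional form builds an
all-ones mask for out-of-range pointers and ORs it in.

```c
#include <stdint.h>

/* Hypothetical upper bound for valid user addresses. */
#define USER_PTR_MAX_SKETCH UINT64_C(0x00007fffffffefff)

/*
 * SBB-style: derive a mask that is all ones when the pointer is out of
 * range and OR it into the pointer. Needs a temporary for the mask.
 */
static uint64_t mask_addr_sbb_style(uint64_t ptr)
{
	uint64_t mask = (uint64_t)0 - (uint64_t)(ptr > USER_PTR_MAX_SKETCH);

	return ptr | mask;
}

/*
 * CMOV-style: clamp the pointer to the limit with a conditional select.
 * No mask temporary and one fewer operation.
 */
static uint64_t mask_addr_cmov_style(uint64_t ptr)
{
	return ptr > USER_PTR_MAX_SKETCH ? USER_PTR_MAX_SKETCH : ptr;
}
```

Both variants leave in-range pointers untouched. They differ only in
what an out-of-range pointer turns into: all ones in the first case, the
limit itself in the second.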
Back when we added SMAP support, not all versions of binutils
understood the 'clac' and 'stac' instructions. So we
implemented those instructions manually as ".byte" sequences.
But we've since upgraded the minimum version of binutils to version
2.25, and that included proper support for the SMAP instructions, and
there's no reason for us to use some line noise to express them any
more.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>