-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeNDEUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpl5hD/4t7kWWNQDeQG9CiA3QStMJ5Yow2AgYtK8f
sJBr5/6PGEsbTreX//Kh8DtPZPRGcjG9elCo58QxWaPZ2mg3fTOR3/QYLMlaGXU2
hSht58lj32utpuzMjMo9bG3aesi03bLf+buaq7V1FaMlcTV8rXqK1s/HGtphDBRo
8tNLEk3JDJDs3vlWbNp/5Hqh9+Ro6DU8df1zWWH4Vbu8RXaGIPyJyjKvvcbfuuCf
k7Ay45XNAmTZg+rSNGv1H3Yn1LNzPMVFLWBfzRahPCzlKy2+mJMWz1PWu9naaUK+
WTM+kgiBLF24k59G/9xuxC5bYtsTjTbr4GsEE5ZvFBnhKPzLzzaJj7iQHRj83vtv
tqxNmAbA3wJoNk48Zr8+cYbfDX9Q9Pl32wIaS/LxRgF9MT4lem6pyKY7Skd12oK3
rnQ8moGtnOBxp3QUU6BZ7IX3ipb+Bgw7FhZbtVYJdlqKeKyi1QO0MuITwGXpMwk/
EWDDTsspIf+QaTu+fmO8byJavugKljW8t7hM1JpvlfOLl+rsh6/+AYz42fCvcaA0
Tu4bpUk8SuwALvZfU2R6bLkorGG6MFuGI8g3eixOcGir3YAcHBMfdg6ItpZi5qVt
ToM87BMaezOZZvSwX1JBaQ0AR5HBQYmHaiLWgPsORf3PjJ0kz+u21SK9D+yJkUtU
rT6+HvoVXA==
=ufpE
-----END PGP SIGNATURE-----
Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Not a lot in terms of features this time around, mostly just cleanups
and code consolidation:
- Support for PI meta data read/write via io_uring, with NVMe and
SCSI covered
- Cleanup the per-op structure caching, making it consistent across
various command types
- Consolidate the various user mapped features into a concept called
regions, making the various users of that consistent
- Various cleanups and fixes"
* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
io_uring: reuse io_should_terminate_tw() for cmds
io_uring: Factor out a function to parse restrictions
io_uring/rsrc: require cloned buffers to share accounting contexts
io_uring: simplify the SQPOLL thread check when cancelling requests
io_uring: expose read/write attribute capability
io_uring/rw: don't gate retry on completion context
io_uring/rw: handle -EAGAIN retry at IO completion time
io_uring/rw: use io_rw_recycle() from cleanup path
io_uring/rsrc: simplify the bvec iter count calculation
io_uring: ensure io_queue_deferred() is out-of-line
io_uring/rw: always clear ->bytes_done on io_async_rw setup
io_uring/rw: use NULL for rw->free_iovec assigment
io_uring/rw: don't mask in f_iocb_flags
io_uring/msg_ring: Drop custom destructor
io_uring: Move old async data allocation helper to header
io_uring/rw: Allocate async data through helper
io_uring/net: Allocate msghdr async data through helper
io_uring/uring_cmd: Allocate async data through generic helper
io_uring/poll: Allocate apoll with generic alloc_cache helper
...
- exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
(Tycho Andersen, Kees Cook)
- binfmt_misc: Fix comment typos (Christophe JAILLET)
- exec: move empty argv[0] warning closer to actual logic (Nir Lichtman)
- exec: remove legacy custom binfmt modules autoloading (Nir Lichtman)
- binfmt_flat: Fix integer overflow bug on 32 bit systems (Dan Carpenter)
- exec: Make sure set_task_comm() always NUL-terminates
- coredump: Do not lock when copying "comm"
- MAINTAINERS: add auxvec.h and set myself as maintainer
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCZ4hNmQAKCRA2KwveOeQk
u0/nAQCTGU0zqhdO6t7ABsL3p9kJ2jVRA5njAoX7A/9jGPSWEQD/boRMqZuUpthV
nMevcQ2F4u0A7kJJBMK05YdXWHkYqgk=
=49Di
-----END PGP SIGNATURE-----
Merge tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case (Tycho
Andersen, Kees Cook)
- binfmt_misc: Fix comment typos (Christophe JAILLET)
- move empty argv[0] warning closer to actual logic (Nir Lichtman)
- remove legacy custom binfmt modules autoloading (Nir Lichtman)
- Make sure set_task_comm() always NUL-terminates
- binfmt_flat: Fix integer overflow bug on 32 bit systems (Dan
Carpenter)
- coredump: Do not lock when copying "comm"
- MAINTAINERS: add auxvec.h and set myself as maintainer
* tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_flat: Fix integer overflow bug on 32 bit systems
selftests/exec: add a test for execveat()'s comm
exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
exec: Make sure task->comm is always NUL-terminated
exec: remove legacy custom binfmt modules autoloading
exec: move warning of null argv to be next to the relevant code
fs: binfmt: Fix a typo
MAINTAINERS: exec: Mark Kees as maintainer
MAINTAINERS: exec: Add auxvec.h UAPI
coredump: Do not lock during 'comm' reporting
Output of io_uring_show_fdinfo() has several problems:
* racy use of ->d_iname
* junk if the name is long - in that case it's not stored in ->d_iname
at all
* lack of quoting (names can contain newlines, etc. - or be equal to "<none>",
for that matter).
* lines for empty slots are pointless noise - we already have the total
amount, so having just the non-empty ones would carry the same information.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeJnF4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptMlD/0QfIv0xMET+tYYbS88RSsPyXLC8/OLJHfZ
QZ5d0Q7F6qEKaCgtj0ttqDiUKsKJSyDRs93sDR7IzAdf8i79kIlQh8kqpD6PgPHu
pKxBvU+a1x7EIafZw3jYo6yE1r+W7QgxzJY8Y/DxN81P4ahqwE2f019HuJ3uFj9j
AzUXz/upVTMhq2i5DODS6FhyeF66ROsEvJxuCtdkpXS/9tptCn1wiGYQ5ES8s6CJ
UnwpNdg3rbpo8/moglqJeKbugd/0BH5u3kjntXnSmBEYXojxz28Fj1wg5DfpNCF6
4o8sxlzlH5EKgTGjy5JtRZdYH4VZ8q09rymot6vMPwJu+i7Xgz+Hn+YQyRWkFQB+
y6oqad3DP0E1+k7chmWx8CMBiK4pABevSwzxrJGlM4RxDuLA7B8YTOew6G7NDtYL
AbPabqDcne+UgegXZ+rMUB7u7B0TGNdlm4P2kDjxl8dKKPNWmvyvy0LNMVjLUfln
VNHNkaAkuURs6QY2CYfWSFkbHGyjWJVi1wrnePSArWmGSQjYMGg2QPP4YIHH4sqP
szosm8Orl68Gw73OjHnndGOMgYlZB+lTysZHMzIUpWpxwaWH5OpwR3QEbJE29mzZ
8At74cCVxEpH1rno+E7uWuwYyoHJnOorz/SEl4E9n65MsS5IgjPDHYyvQ6i48Nqr
klswSIPHPA==
=c+iG
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.13-20250116' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"One fix for the error handling in buffer cloning, and one fix for the
ring resizing.
Two minor followups for the latter as well.
Both of these issues only affect 6.13, so not marked for stable"
* tag 'io_uring-6.13-20250116' of git://git.kernel.dk/linux:
io_uring/register: cache old SQ/CQ head reading for copies
io_uring/register: document io_register_resize_rings() shared mem usage
io_uring/register: use stable SQ/CQ ring data during resize
io_uring/rsrc: fixup io_clone_buffers() error handling
The SQ and CQ ring heads are read twice - once for verifying that it's
within bounds, and once inside the loops copying SQE and CQE entries.
This is technically incorrect, in case the values could get modified
in between verifying them and using them in the copy loop. While this
won't lead to anything truly nefarious, it may cause longer loop times
for the copies than expected.
Read the ring head values once, and use the verified value in the copy
loops.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It can be a bit hard to tell which parts of io_register_resize_rings()
are operating on shared memory, and which ones are not. And anything
reading or writing to those regions should really use the read/write
once primitives.
Hence add those, ensuring sanity in how this memory is accessed, and
helping document the shared nature of it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normally the kernel would not expect an application to modify any of
the data shared with the kernel during a resize operation, but of
course the kernel cannot always assume good intent on behalf of the
application.
As part of resizing the rings, existing SQEs and CQEs are copied over
to the new storage. Resizing uses the masks in the newly allocated
shared storage to index the arrays, however it's possible that malicious
userspace could modify these after they have been sanity checked.
Use the validated and locally stored CQ and SQ ring sizing for masking
to ensure the values are both stable and valid.
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jann reports he can trigger a UAF if the target ring unregisters
buffers before the clone operation is fully done. And additionally
also an issue related to node allocation failures. Both of those
stemp from the fact that the cleanup logic puts the buffers manually,
rather than just relying on io_rsrc_data_free() doing it. Hence kill
the manual cleanup code and just let io_rsrc_data_free() handle it,
it'll put the nodes appropriately.
Reported-by: Jann Horn <jannh@google.com>
Fixes: 3597f2786b ("io_uring/rsrc: unify file and buffer resource tables")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_uring_try_cancel_requests, we check whether sq_data->thread ==
current to determine if the function is called by the SQPOLL thread to do
iopoll when IORING_SETUP_SQPOLL is set. This check can race with the SQPOLL
thread termination.
io_uring_cancel_generic is used in 2 places: io_uring_cancel_generic and
io_ring_exit_work. In io_uring_cancel_generic, we have the information
whether the current is SQPOLL thread already. And the SQPOLL thread never
reaches io_ring_exit_work.
So to avoid the racy check, this commit adds a boolean flag to
io_uring_try_cancel_requests to determine if the caller is SQPOLL thread.
Reported-by: syzbot+3c750be01dab672c513d@syzkaller.appspotmail.com
Reported-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20250113160331.44057-1-minhquangbui99@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeCmJkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppx5EACv65GSg4kQZTFtuSQ3Z1fq53Itg2vVS6Bo
d9IcO99T23IezMpRzk/HrEXWE3Kdjkp/z8spKo0ZP//dYMTm4js/PWLxfH81zluc
lfvnJhGjsZvhQBHKZggVE70W0lmWE6OBuC0jmuujVqtHmu3d7OzGkPK7CmSyKaxR
2ekFKaa7QvLgmx0gEPpmEsfAWzlM5hhNAPbWcdAUTQvUtnMxpTowYY8bI/drPUC0
bOvoYq7O/ZCdgobNGPiiOUrEQfDAuc7S3aQ+i5zn7gIu0BHe31XlwR8hbt6mRz/0
SHk2eSecrv0H6rA4YPKFno7eQZOIWd43T+t9IjJUpykcXkMfyNOKO2HLR/FaQkxN
kFNcCjFNJ6qLacTtbIZCzRs2Skhe5AF56jJ9FiVZbE3MKNuQBjcM2DpRlkuJLGvw
71T5cldS0394+lIA+B2DjYVJ6IqMBHQ23brnL0HfMBuRuLaPweHj//wh5S6oCLg0
X9Nq0tvgoYVo0M+jNS8NW4zWaoOdAw8eIlTVl8VNr1mSklpA0ZCgFXFsnCBZZb3N
C7SgG1lrmI+IYTC30LKxDcwmCi3JhDQg5Yvz9trQzMDMJaePMms+achcHyY9WfL5
0feUMe4RZAOEros0W7QshaAiz5TWFCoGi18muhzXDECQEQ9cV+Mh2BJ+JFiOP/ZT
LxNpFaFwDg==
=XUlm
-----END PGP SIGNATURE-----
Merge tag 'io_uring-6.13-20250111' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix for multishot timeout updates only using the updated value for
the first invocation, not subsequent ones
- Silence a false positive lockdep warning
- Fix the eventfd signaling and putting RCU logic
- Fix fault injected SQPOLL setup not clearing the task pointer in the
error path
- Fix local task_work looking at the SQPOLL thread rather than just
signaling the safe variant. Again one of those theoretical issues,
which should be closed up none the less.
* tag 'io_uring-6.13-20250111' of git://git.kernel.dk/linux:
io_uring: don't touch sqd->thread off tw add
io_uring/sqpoll: zero sqd->thread on tctx errors
io_uring/eventfd: ensure io_eventfd_signal() defers another RCU period
io_uring: silence false positive warnings
io_uring/timeout: fix multishot updates
After commit 9a213d3b80c0, we can pass additional attributes along with
read/write. However, userspace doesn't know that. Add a new feature flag
IORING_FEAT_RW_ATTR, to notify the userspace that the kernel has this
ability.
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20241205062109.1788-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With IORING_SETUP_SQPOLL all requests are created by the SQPOLL task,
which means that req->task should always match sqd->thread. Since
accesses to sqd->thread should be separately protected, use req->task
in io_req_normal_work_add() instead.
Note, in the eyes of io_req_normal_work_add(), the SQPOLL task struct
is always pinned and alive, and sqd->thread can either be the task or
NULL. It's only problematic if the compiler decides to reload the value
after the null check, which is not so likely.
Cc: stable@vger.kernel.org
Cc: Bui Quang Minh <minhquangbui99@gmail.com>
Reported-by: lizetao <lizetao1@huawei.com>
Fixes: 78f9b61bd8 ("io_uring: wake SQPOLL task when task_work is added to an empty queue")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1cbbe72cf32c45a8fee96026463024cd8564a7d7.1736541357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Syzkeller reports:
BUG: KASAN: slab-use-after-free in thread_group_cputime+0x409/0x700 kernel/sched/cputime.c:341
Read of size 8 at addr ffff88803578c510 by task syz.2.3223/27552
Call Trace:
<TASK>
...
kasan_report+0x143/0x180 mm/kasan/report.c:602
thread_group_cputime+0x409/0x700 kernel/sched/cputime.c:341
thread_group_cputime_adjusted+0xa6/0x340 kernel/sched/cputime.c:639
getrusage+0x1000/0x1340 kernel/sys.c:1863
io_uring_show_fdinfo+0xdfe/0x1770 io_uring/fdinfo.c:197
seq_show+0x608/0x770 fs/proc/fd.c:68
...
That's due to sqd->task not being cleared properly in cases where
SQPOLL task tctx setup fails, which can essentially only happen with
fault injection to insert allocation errors.
Cc: stable@vger.kernel.org
Fixes: 1251d2025c ("io_uring/sqpoll: early exit thread if task_context wasn't allocated")
Reported-by: syzbot+3d92cfcfa84070b0a470@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/efc7ec7010784463b2e7466d7b5c02c2cb381635.1736519461.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4EhtAAKCRCRxhvAZXjc
orToAQCIKKS7fk9j8CUSAdRG5mMy7Q++8OEVA+gyyMWuXnBPYwD/ehy+1xBVjCcI
FBzLadaJSuygjZVCzhVXsE0oRf4A2wg=
=waDA
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.13-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"afs:
- Fix the maximum cell name length
- Fix merge preference rule failure condition
fuse:
- Fix fuse_get_user_pages() so it doesn't risk misleading the caller
to think pages have been allocated when they actually haven't
- Fix direct-io folio offset and length calculation
netfs:
- Fix async direct-io handling
- Fix read-retry for filesystems that don't provide a
->prepare_read() method
vfs:
- Prevent truncating 64-bit offsets to 32-bits in iomap
- Fix memory barrier interactions when polling
- Remove MNT_ONRB to fix concurrent modification of @mnt->mnt_flags
leading to MNT_ONRB to not be raised and invalid access to a list
member"
* tag 'vfs-6.13-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
poll: kill poll_does_not_wait()
sock_poll_wait: kill the no longer necessary barrier after poll_wait()
io_uring_poll: kill the no longer necessary barrier after poll_wait()
poll_wait: kill the obsolete wait_address check
poll_wait: add mb() to fix theoretical race between waitqueue_active() and .poll()
afs: Fix merge preference rule failure condition
netfs: Fix read-retry for fs with no ->prepare_read()
netfs: Fix kernel async DIO
fs: kill MNT_ONRB
iomap: avoid avoid truncating 64-bit offset to 32 bits
afs: Fix the maximum cell name length
fuse: Set *nbytesp=0 in fuse_get_user_pages on allocation failure
fuse: fix direct io folio offset and length calculation
nvme multipath reports that they see spurious -EAGAIN bubbling back to
userspace, which is caused by how they handle retries internally through
a kworker. However, any data that needs preserving or importing for
a read/write request has always been done so at prep time, and we can
sanely skip this check.
Reported-by: "Haeuptle, Michael" <michael.haeuptle@hpe.com>
Link: https://lore.kernel.org/io-uring/DS7PR84MB31105C2C63CFA47BE8CBD6EE95102@DS7PR84MB3110.NAMPRD84.PROD.OUTLOOK.COM/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Rather than try and have io_read/io_write turn REQ_F_REISSUE into
-EAGAIN, catch the REQ_F_REISSUE when the request is otherwise
considered as done. This is saner as we know this isn't happening
during an actual submission, and it removes the need to randomly
check REQ_F_REISSUE after read/write submission.
If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over
this request in terms of posting a CQE, and the regular request
cleaning will ensure that it gets reissued via io-wq.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that poll_wait() provides a full barrier we can remove smp_rmb() from
io_uring_poll().
In fact I don't think smp_rmb() was correct, it can't serialize LOADs and
STOREs.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250107162730.GA18940@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmd/7dYACgkQxWXV+ddt
WDuX7Q//UkrNtVh7UEiyNyujLjjvczfMXhpD1fAdVU0zMon6ux3RQ3JSs3xvAGrb
jFFa9c9+Db8/kWzdWp5n1u9Q/+sy4XBaeKGuzPRLPPGT1yXfKEa4mrm1sCrWRJoS
c8b07Kfuepldcim80x8WSa2qhr5gmDmSZBgvjKt63ppp5/jaNKCZg+d3BhwqhHbI
XA9JjIk9j0ZsAYauYflQTwgUpkyvXV1a9YyeKv4U6mYA1r+rXl2aolcndNkS1U/D
dDGuiDpOjKtIUecRi4YbOkt2zvwREDdQCbRV/QLsZajHxqeHV5QH0TBI/URikx2z
1shwYMzLfLtQIW0+PhHCGKiftMIb4NliyMUxxviPdN78nCFmocrR/ZkPx+a5M9Io
d7oqwS/8U3pFGeB4bAey8WvMzQI5BtCCYJY+3HreNTDkiubqcRtTCtJ9dNDTAMFH
FMZ6DA8wTsqSA2e9Q8OwKNjvMCLAKevXn/4wiJi5b75Fiu5ZB/imTfJ+geEMUZCR
3uq9oybFCKti7lestM0z06K19AKtmPWLoq5YJ1Hg69DsafS2aR3CBeYOi7uQ+56D
7uwAFjVrGPrxOgGkCohYpPLCUikJ0y3Nl/k5fnybsnLPWr0cenLroUeP7Rao4fFU
8hLzMSv3ImL+Io0RjH0XBAM8YLN+xO3CLYCv6D8d42AlQTgAIVw=
=QYC1
-----END PGP SIGNATURE-----
Merge tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes.
Besides the one-liners in Btrfs there's fix to the io_uring and
encoded read integration (added in this development cycle). The update
to io_uring provides more space for the ongoing command that is then
used in Btrfs to handle some cases.
- io_uring and encoded read:
- provide stable storage for io_uring command data
- make a copy of encoded read ioctl call, reuse that in case the
call would block and will be called again
- properly initialize zlib context for hardware compression on s390
- fix max extent size calculation on filesystems with non-zoned
devices
- fix crash in scrub on crafted image due to invalid extent tree"
* tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path
btrfs: zoned: calculate max_extent_size properly on non-zoned setup
btrfs: avoid NULL pointer dereference if no valid extent tree
btrfs: don't read from userspace twice in btrfs_uring_encoded_read()
io_uring: add io_uring_cmd_get_async_data helper
io_uring/cmd: add per-op data to struct io_uring_cmd_data
io_uring/cmd: rename struct uring_cache to io_uring_cmd_data
io_eventfd_do_signal() is invoked from an RCU callback, but when
dropping the reference to the io_ev_fd, it calls io_eventfd_free()
directly if the refcount drops to zero. This isn't correct, as any
potential freeing of the io_ev_fd should be deferred another RCU grace
period.
Just call io_eventfd_put() rather than open-code the dec-and-test and
free, which will correctly defer it another RCU grace period.
Fixes: 21a091b970 ("io_uring: signal registered eventfd to process deferred task work")
Reported-by: Jann Horn <jannh@google.com>
Cc: stable@vger.kernel.org
Tested-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Li Zetao<lizetao1@huawei.com>
Reviewed-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case an op handler for ->uring_cmd() needs stable storage for user
data, it can allocate io_uring_cmd_data->op_data and use it for the
duration of the request. When the request gets cleaned up, uring_cmd
will free it automatically.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for making this more generically available for
->uring_cmd() usage that needs stable command data, rename it and move
it to io_uring/cmd.h instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Sterba <dsterba@suse.com>
After update only the first shot of a multishot timeout request adheres
to the new timeout value while all subsequent retries continue to use
the old value. Don't forget to update the timeout stored in struct
io_timeout_data.
Cc: stable@vger.kernel.org
Fixes: ea97f6c855 ("io_uring: add support for multishot timeouts")
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6516c3304eb654ec234cfa65c88a9579861e597.1736015288.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As we don't use iov_iter_advance() but our own logic in io_import_fixed(),
we can remove the logic that over-sets the iter's count to len + offset
then adjusts it later to len. This helps to make the code cleaner.
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://lore.kernel.org/r/20250103150412.12549-1-minhquangbui99@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For non-pollable files, buffer ring consumption will commit upfront.
This is fine, but io_ring_buffer_select() will return the address of the
buffer after having committed it. For incrementally consumed buffers,
this is incorrect as it will modify the buffer address.
Store the pre-committed value and return that. If that isn't done, then
the initial part of the buffer is not used and the application will
correctly assume the content arrived at the start of the userspace
buffer, but the kernel will have put it later in the buffer. Or it can
cause a spurious -EFAULT returned in the CQE, depending on the buffer
size. As bounds are suitably checked for doing the actual IO, no adverse
side effects are possible - it's just a data misplacement within the
existing buffer.
Reported-by: Gwendal Fernet <gwendalfernet@gmail.com>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports that ->msg_inq may get used uinitialized from the
following path:
BUG: KMSAN: uninit-value in io_recv_buf_select io_uring/net.c:1094 [inline]
BUG: KMSAN: uninit-value in io_recv+0x930/0x1f90 io_uring/net.c:1158
io_recv_buf_select io_uring/net.c:1094 [inline]
io_recv+0x930/0x1f90 io_uring/net.c:1158
io_issue_sqe+0x420/0x2130 io_uring/io_uring.c:1740
io_queue_sqe io_uring/io_uring.c:1950 [inline]
io_req_task_submit+0xfa/0x1d0 io_uring/io_uring.c:1374
io_handle_tw_list+0x55f/0x5c0 io_uring/io_uring.c:1057
tctx_task_work_run+0x109/0x3e0 io_uring/io_uring.c:1121
tctx_task_work+0x6d/0xc0 io_uring/io_uring.c:1139
task_work_run+0x268/0x310 kernel/task_work.c:239
io_run_task_work+0x43a/0x4a0 io_uring/io_uring.h:343
io_cqring_wait io_uring/io_uring.c:2527 [inline]
__do_sys_io_uring_enter io_uring/io_uring.c:3439 [inline]
__se_sys_io_uring_enter+0x204f/0x4ce0 io_uring/io_uring.c:3330
__x64_sys_io_uring_enter+0x11f/0x1a0 io_uring/io_uring.c:3330
x64_sys_call+0xce5/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:427
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
and it is correct, as it's never initialized upfront. Hence the first
submission can end up using it uninitialized, if the recv wasn't
successful and the networking stack didn't honor ->msg_get_inq being set
and filling in the output value of ->msg_inq as requested.
Set it to 0 upfront when it's allocated, just to silence this KMSAN
warning. There's no side effect of using it uninitialized, it'll just
potentially cause the next receive to use a recv value hint that's not
accurate.
Fixes: c6f32c7d9e ("io_uring/net: get rid of ->prep_async() for receive side")
Reported-by: syzbot+068ff190354d2f74892f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is not the hot path, it's a slow path. Yet the locking for it is
in the hot path, and __cold does not prevent it from being inlined.
Move the locking to the function itself, and mark it noinline as well
to avoid it polluting the icache of the hot path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports that a recent fix causes nesting issues between the (now)
raw timeoutlock and the eventfd locking:
=============================
[ BUG: Invalid wait context ]
6.13.0-rc4-00080-g9828a4c0901f #29 Not tainted
-----------------------------
kworker/u32:0/68094 is trying to lock:
ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180
other info that might help us debug this:
context-{5:5}
6 locks held by kworker/u32:0/68094:
#0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0
#1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0
#2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180
#3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180
#4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
#5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
stack backtrace:
CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f #29
Hardware name: linux,dummy-virt (DT)
Workqueue: iou_exit io_ring_exit_work
Call trace:
show_stack+0x1c/0x30 (C)
__dump_stack+0x24/0x30
dump_stack_lvl+0x60/0x80
dump_stack+0x14/0x20
__lock_acquire+0x19f8/0x60c8
lock_acquire+0x1a4/0x540
_raw_spin_lock_irqsave+0x90/0xd0
eventfd_signal_mask+0x64/0x180
io_eventfd_signal+0x64/0x108
io_req_local_work_add+0x294/0x430
__io_req_task_work_add+0x1c0/0x270
io_kill_timeout+0x1f0/0x288
io_kill_timeouts+0xd4/0x180
io_uring_try_cancel_requests+0x2e8/0x388
io_ring_exit_work+0x150/0x550
process_one_work+0x5e8/0xfc0
worker_thread+0x7ec/0xc80
kthread+0x24c/0x300
ret_from_fork+0x10/0x20
because after the preempt-rt fix for the timeout lock nesting inside
the io-wq lock, we now have the eventfd spinlock nesting inside the
raw timeout spinlock.
Rather than play whack-a-mole with other nesting on the timeout lock,
split the deletion and killing of timeouts so queueing the task_work
for the timeout cancelations can get done outside of the timeout lock.
Reported-by: syzbot+b1fc199a40b65d601b65@syzkaller.appspotmail.com
Fixes: 020b40f356 ("io_uring: make ctx->timeout_lock a raw spinlock")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The io-wq path can downgrade a multishot request to oneshot mode,
however io_read_mshot() doesn't handle that and would still post
multiple CQEs. That's not allowed, because io_req_post_cqe() requires
stricter context requirements.
The described can only happen with pollable files that don't support
FMODE_NOWAIT, which is an odd combination, so if even allowed it should
be fairly rare.
Cc: stable@vger.kernel.org
Reported-by: chase xd <sl1589472800@gmail.com>
Fixes: bee1d5becd ("io_uring: disable io-wq execution of multishot NOWAIT requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c5c8c4a50a882fd581257b81bf52eee260ac29fd.1735407848.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit mistakenly moved the clearing of the in-progress byte
count into the section that's dependent on having a cached iovec or not,
but it should be cleared for any IO. If not, then extra bytes may be
added at IO completion time, causing potentially weird behavior like
over-reporting the amount of IO done.
Fixes: d7f11616ed ("io_uring/rw: Allocate async data through helper")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202412271132.a09c3500-lkp@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's a pointer, don't use 0 for that. sparse throws a warning for that,
as the kernel test robot noticed.
Fixes: d7f11616ed ("io_uring/rw: Allocate async data through helper")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412180253.YML3qN4d-lkp@intel.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit changed overwriting kiocb->ki_flags with
->f_iocb_flags with masking it in. This breaks for retry situations,
where we don't necessarily want to retain previously set flags, like
IOCB_NOWAIT.
The use case needs IOCB_HAS_METADATA to be persistent, but the change
makes all flags persistent, which is an issue. Add a request flag to
track whether the request has metadata or not, as that is persistent
across issues.
Fixes: 59a7d12a7f ("io_uring: introduce attributes for read/write and PI support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two remaining uses of the old async data allocator that do not
rely on the alloc cache. I don't want to make them use the new
allocator helper because that would require a if(cache) check, which
will result in dead code for the cached case (for callers passing a
cache, gcc can't prove the cache isn't NULL, and will therefore preserve
the check. Since this is an inline function and just a few lines long,
keep a second helper to deal with cases where we don't have an async
data cache.
No functional change intended here. This is just moving the helper
around and making it inline.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-9-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This helper replaces io_alloc_async_data by using the folded allocation.
Do it in a header to allow the compiler to decide whether to inline.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BUG: KASAN: slab-use-after-free in __lock_acquire+0x370b/0x4a10 kernel/locking/lockdep.c:5089
Call Trace:
<TASK>
...
_raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
try_to_wake_up+0xb5/0x23c0 kernel/sched/core.c:4205
io_sq_thread_park+0xac/0xe0 io_uring/sqpoll.c:55
io_sq_thread_finish+0x6b/0x310 io_uring/sqpoll.c:96
io_sq_offload_create+0x162/0x11d0 io_uring/sqpoll.c:497
io_uring_create io_uring/io_uring.c:3724 [inline]
io_uring_setup+0x1728/0x3230 io_uring/io_uring.c:3806
...
Kun Hu reports that the SQPOLL creating error path has UAF, which
happens if io_uring_alloc_task_context() fails and then io_sq_thread()
manages to run and complete before the rest of error handling code,
which means io_sq_thread_finish() is looking at already killed task.
Note that this is mostly theoretical, requiring fault injection on
the allocation side to trigger in practice.
Cc: stable@vger.kernel.org
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0f2f1aa5729332612bd01fe0f2f385fd1f06ce7c.1735231717.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The allocation paths that use alloc_cache duplicate the same code
pattern, sometimes in a quite convoluted way. Fold the allocation into
the cache code itself, making it just an allocator function, and keeping
the cache policy invisible to callers. Another justification for doing
this, beyond code simplicity, is that it makes it trivial to test the
impact of disabling the cache and using slab directly, which I've used
for slab improvement experiments.
One relevant detail is that we provide a callback to optionally
initialize memory only when we actually reach slab. This allows us to
avoid blindly executing the allocation with GFP_ZERO and only clean
fields when they matter.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20241216204615.759089-2-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With *ENTER_EXT_ARG_REG instead of passing a user pointer with arguments
for the waiting loop the user can specify an offset into a pre-mapped
region of memory, in which case the
[offset, offset + sizeof(io_uring_reg_wait)) will be intepreted as the
argument.
As we address a kernel array using a user given index, it'd be a subject
to speculation type of exploits. Use array_index_nospec() to prevent
that. Make sure to pass not the full region size but truncate by the
maximum offset allowed considering the structure size.
Fixes: d617b3147d ("io_uring: restore back registered wait arguments")
Fixes: aa00f67adc ("io_uring: add support for fixed wait regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e3d9da7c43d619de7bcf41d1cd277ab2688c443.1733694126.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When io_check_coalesce_buffer() meets a single page buffer it bails out
and tells that it can be coalesced. That's fine for registered buffers
as io_coalesce_buffer() wouldn't change anything, but the region code
now uses the function to decided on whether to vmap the buffer or not.
Report that a single page buffer is trivially coalescable and let
io_sqe_buffer_register() to filter them.
Fixes: c4d0ac1c15 ("io_uring/memmap: optimise single folio regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cb83e053f318857068447d40c95becebcd8aeced.1733689833.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove unnecessary call to iov_iter_save_state() in io_prep_rw_setup()
as io_import_iovec() already does this. Then the result from
io_import_iovec() can be returned directly.
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Tested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20241207004144.783631-1-dw@davidwei.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>