If a request completed inline the result should only be zero, it's a
grave error otherwise. So, when we see REQ_F_COMPLETE_INLINE it's not
even necessary to check the return code, and the flag check can be moved
earlier.
It's one "if" less for inline completions, and same two checks for it
normally completing (ret == 0). Those are two cases we care about the
most.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ebd4e397a9c26d96c99b24447acc309741041a83.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Extract slow paths from __io_queue_sqe() into a function and inline the
hot path. With that we have everything completely inlined on the
submission path up until io_issue_sqe().
-> io_submit_sqes()
-> io_submit_sqe() (inlined)
-> io_queue_sqe() (inlined)
-> __io_queue_sqe() (inlined)
-> io_issue_sqe()
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f1606864d95d7f26dc28c7eec3dc6ed6ec32618a.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't want the slow path of io_queue_sqe to be inlined, so extract a
function from it.
text data bss dec hex filename
91950 13986 8 105944 19dd8 ./fs/io_uring.o
91758 13986 8 105752 19d18 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fb01253911f8fb374268f65b1ba939b54ca6583f.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req->ctx->active_drain is a bit too expensive, partially because of two
dereferences. Do a trick, if we see it set in io_init_req(), set
REQ_F_FORCE_ASYNC and it automatically goes through a slower path where
we can catch it. It's nearly free to do in io_init_req() because there
is already ->restricted check and it's in the same byte of a bitmask.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7e7ddc63c15e8a300833132abb3eb8fd3918aef.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two call sites of io_queue_sqe() in io_submit_sqe(), combine
them into one, because io_queue_sqe() is inline and we don't want to
bloat binary, and will become even bigger
text data bss dec hex filename
92126 13986 8 106120 19e88 ./fs/io_uring.o
91966 13986 8 105960 19de8 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/506124b8e767f0a4576f7a459f6aea3d13fb4dda.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First, convert rest of iopoll bits to single linked lists, and also
replace per-request list_add_tail() with splicing a part of slist.
With that, use io_free_batch_list() to put/free requests. The main
advantage of it is that it's now the only user of struct req_batch and
friends, and so they can be inlined. The main overhead there was
per-request call to not-inlined io_req_free_batch(), which is expensive
enough.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b37fc6d5954b241e025eead7ab92c6f44a42f229.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert explicit barrier around iopoll_completed to smp_load_acquire()
and smp_store_release(). Similar on the callback side, but replaces a
single smp_rmb() with per-request smp_load_acquire(), neither imply any
extra CPU ordering for x86. Use READ_ONCE as usual where it doesn't
matter.
Use it to move filling CQEs by iopoll earlier, that will be necessary
to avoid traversing the list one extra time in the future.
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bd663cb15efdc72d6247c38ee810964e744a450.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The main loop of io_do_iopoll() iterates and does ->iopoll() until it
meets a first completed request, then it continues from that position
and splices requests to pass them through io_iopoll_complete().
Split the loop in two for clearness, iopolling and reaping completed
requests from the list.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a7f6fd27a94845e5dc925a47a4a9765a92e514fb.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently we collect requests for completion batching in an array.
Replace them with a singly linked list. It's as fast as arrays but
doesn't take some much space in ctx, and will be used in future patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a666826f2854d17e9fb9417fb302edfeb750f425.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't really need to pass the number of requests to complete into
io_do_iopoll(), a flag whether to enforce non-spin mode is enough.
Should be straightforward, maybe except io_iopoll_check(). We pass !min
there, because we do never enter with the number of already reaped
requests is larger than the specified @min, apart from the first
iteration, where nr_events is 0 and so the final check should be
identical.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/782b39d1d8ec584eae15bca0a1feb6f0571fe5b8.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hint the compiler that it's not as likely to have creds different from
current attached to a request. The current code generation is far from
ideal, hopefully it can help to some compilers to remove duplicated jump
tables and so.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e7815251ac4bf5a4a23d298c752f029ae19f3837.1632516769.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IOSQE_IO_DRAIN is quite marginal and we don't care too much about
IOSQE_BUFFER_SELECT. Save to ifs and hide both of them under
SQE_VALID_FLAGS check. Now we first check whether it uses a "safe"
subset, i.e. without DRAIN and BUFFER_SELECT, and only if it's not
true we test the rest of the flags.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dccfb9ab2ab0969a2d8dc59af88fa0ce44eeb1d5.1631703764.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now completions are done from task context, that means that it's either
the task itself, task_work or io-wq worker. In all those cases the ctx
will be staying alive by mutexing, explicit referencing or req references
by iowq. Remove extra ctx pinning from io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/60a0e96434c16ab4fe587651448290d61ec9a113.1631703756.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Developers may need some uring info to help themselves debug and address
issues in production. This includes sqring/cqring head/tail and the
detailed sqe/cqe info, which is very useful when an application is hung
on a ring.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210913130854.38542-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't do io_submit_flush_completions() when there is no requests
enqueued, and every single caller checks for it. Hide that check into
the function not forgetting about inlining. That will make it much
easier for changing the empty check condition in the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d7ff8cef5da1b38e8ea648f5aad9a315ddfc7b57.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
->ios_left is only used to decide whether to plug or not, kill it to
avoid this extra accounting, just use the initial submission number.
There is no much difference in regards of enabling plugging, where this
one does it in a few more cases, but all major ones should be covered
well.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f13993bcf5b477f9a7d52881fc49f9457ea9870a.1631115443.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I recently had to look at a production problem where a request ended
up getting the dreaded -EINVAL error on submit. The most used and
hence useless of error codes, as it just tells you that something
was wrong with your request, but not more than that.
Let's dump the full sqe contents if we run into an issue failure,
that'll allow easier diagnosing of a wide variety of issues.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack
supports it, we can handle high rates of polled IO more efficiently.
This raises the single core efficiency on my system from ~6.1M IOPS to
~6.6M IOPS running a random read workload at depth 128 on two gen2
Optane drives.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.
For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch the boolean spin argument to blk_poll to passing a set of flags
instead. This will allow to control polling behavior in a more fine
grained way.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
syscall-level code can't just poke into the details of the poll cookie,
which is private information of the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Grab uring lock when we are in io-worker rather than in the original
or system-wq context since we already hold it in these two situation.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Fixes: b66ceaf324 ("io_uring: move iopoll reissue into regular IO path")
Link: https://lore.kernel.org/r/20211014140400.50235-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have never supported fasync properly, it would only fire when there
is something polling io_uring making it useless. The original support came
in through the initial io_uring merge for 5.1. Since it's broken and
nobody has reported it, get rid of the fasync bits.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f7ca3d344d406d34fa6713824198915c41cea86.1633080236.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There aren't any actual callers of lock_page_async(), so remove it.
Convert filemap_update_page() to call __folio_lock_async().
__folio_lock_async() is 21 bytes smaller than __lock_page_async(),
but the real savings come from using a folio in filemap_update_page(),
shrinking it from 515 bytes to 404 bytes, saving 110 bytes. The text
shrinks by 132 bytes in total.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
From recently open/accept are now able to manipulate fixed file table,
but it's inconsistent that close can't. Close the gap, keep API same as
with open/accept, i.e. via sqe->file_slot.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't retry short writes and so we would never get to async setup in
io_write() in that case. Thus ret2 > 0 is always false and
iov_iter_advance() is never used. Apparently, the same is found by
Coverity, which complains on the code.
Fixes: cd65869512 ("io_uring: use iov_iter state save/restore helpers")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b33e61034748ef1022766efc0fb8854cfcf749c.1632500058.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no reason to punt it unconditionally, we just need to ensure that
the submit lock grabbing is conditional.
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For each provided buffer, we allocate a struct io_buffer to hold the
data associated with it. As a large number of buffers can be provided,
account that data with memcg.
Fixes: ddf0322db7 ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For multishot mode, there may be cases like:
iowq original context
io_poll_add
_arm_poll()
mask = vfs_poll() is not 0
if mask
(2) io_poll_complete()
compl_unlock
(interruption happens
tw queued to original
context)
io_poll_task_func()
compl_lock
(3) done = io_poll_complete() is true
compl_unlock
put req ref
(1) if (poll->flags & EPOLLONESHOT)
put req ref
EPOLLONESHOT flag in (1) may be from (2) or (3), so there are multiple
combinations that can cause ref underfow.
Let's address it by:
- check the return value in (2) as done
- change (1) to if (done)
in this way, we only do ref put in (1) if 'oneshot flag' is from
(2)
- do poll.done check in io_poll_task_func(), so that we won't put ref
for the second time.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-4-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should set EPOLLONESHOT if cqring_fill_event() returns false since
io_poll_add() decides to put req or not by it.
Fixes: 5082620fb2 ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If poll arming and poll completion runs in parallel, there maybe races.
For instance, run io_poll_add in iowq and io_poll_task_func in original
context, then:
iowq original context
io_poll_add
vfs_poll
(interruption happens
tw queued to original
context) io_poll_task_func
generate cqe
del from cancel_hash[]
if !poll.done
insert to cancel_hash[]
The entry left in cancel_hash[], similar case for fast poll.
Fix it by set poll.done = true when del from cancel_hash[].
Fixes: 5082620fb2 ("io_uring: terminate multishot poll for CQ ring overflow")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210922101238.7177-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A full expalantion of io_uring is beyond the scope of this commit
description, but in summary it is an asynchronous I/O mechanism
which allows for I/O requests and the resulting data to be queued
in memory mapped "rings" which are shared between the kernel and
userspace. Optionally, io_uring offers the ability for applications
to spawn kernel threads to dequeue I/O requests from the ring and
submit the requests in the kernel, helping to minimize the syscall
overhead. Rings are accessed in userspace by memory mapping a file
descriptor provided by the io_uring_setup(2), and can be shared
between applications as one might do with any open file descriptor.
Finally, process credentials can be registered with a given ring
and any process with access to that ring can submit I/O requests
using any of the registered credentials.
While the io_uring functionality is widely recognized as offering a
vastly improved, and high performing asynchronous I/O mechanism, its
ability to allow processes to submit I/O requests with credentials
other than its own presents a challenge to LSMs. When a process
creates a new io_uring ring the ring's credentials are inhertied
from the calling process; if this ring is shared with another
process operating with different credentials there is the potential
to bypass the LSMs security policy. Similarly, registering
credentials with a given ring allows any process with access to that
ring to submit I/O requests with those credentials.
In an effort to allow LSMs to apply security policy to io_uring I/O
operations, this patch adds two new LSM hooks. These hooks, in
conjunction with the LSM anonymous inode support previously
submitted, allow an LSM to apply access control policy to the
sharing of io_uring rings as well as any io_uring credential changes
requested by a process.
The new LSM hooks are described below:
* int security_uring_override_creds(cred)
Controls if the current task, executing an io_uring operation,
is allowed to override it's credentials with @cred. In cases
where the current task is a user application, the current
credentials will be those of the user application. In cases
where the current task is a kernel thread servicing io_uring
requests the current credentials will be those of the io_uring
ring (inherited from the process that created the ring).
* int security_uring_sqpoll(void)
Controls if the current task is allowed to create an io_uring
polling thread (IORING_SETUP_SQPOLL). Without a SQPOLL thread
in the kernel processes must submit I/O requests via
io_uring_enter(2) which allows us to compare any requested
credential changes against the application making the request.
With a SQPOLL thread, we can no longer compare requested
credential changes against the application making the request,
the comparison is made against the ring's credentials.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Converting io_uring's anonymous inode to the secure anon inode API
enables LSMs to enforce policy on the io_uring anonymous inodes if
they chose to do so. This is an important first step towards
providing the necessary mechanisms so that LSMs can apply security
policy to io_uring operations.
Signed-off-by: Paul Moore <paul@paul-moore.com>
This patch adds basic auditing to io_uring operations, regardless of
their context. This is accomplished by allocating audit_context
structures for the io-wq worker and io_uring SQPOLL kernel threads
as well as explicitly auditing the io_uring operations in
io_issue_sqe(). Individual io_uring operations can bypass auditing
through the "audit_skip" field in the struct io_op_def definition for
the operation; although great care must be taken so that security
relevant io_uring operations do not bypass auditing; please contact
the audit mailing list (see the MAINTAINERS file) with any questions.
The io_uring operations are audited using a new AUDIT_URINGOP record,
an example is shown below:
type=UNKNOWN[1336] msg=audit(1631800225.981:37289):
uring_op=19 success=yes exit=0 items=0 ppid=15454 pid=15681
uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
key=(null)
Thanks to Richard Guy Briggs for review and feedback.
Signed-off-by: Paul Moore <paul@paul-moore.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFEikcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmG4D/93W/CdNgw88WFkYPfjwICKHOcSDZhGqMzh
Ug1cp4BP8lPkiCvyC8VfM3XMBUWf9j8Ijb4X7b+wjuBWaNQdJHlcb1XSEQj4sh8/
w6MUGUz76/z1z6DE0HzzPHRZyrdog+oW9jZ+qpKCjguVBcs4eu3NdY3LbDcrVvzV
xzi3o52NbvpHdgWl6LuQqJiIq0twG/6RiguKfqZDfxZxPq6m3cSgjWRLquAV9nUJ
+S6/wyGkaRK3qPMTtphWyL9TM1pr+od8K5tfKYlgdjsAoCkqIzpIJUR62rTKz3Be
jjPLxkP0TkE3YPRCjyvZR1Eb7ZwgfuyCszWnGtmBmOt5/JXDUPXEqiQPCg7rVj47
6x2JGe/bglCnSTWwYSvOQNJDqRVBiXBr59jOvSWNTFO2Tj5v9Q0dk2etgMYwA9oS
k5vdDhFLNW5T4aibNbpJFJctZaHu9N1rFkzvW4DTdur7lj64ePRMtugaU2F9PhBt
VwQlkjcuvz5GBjpwS6QdZ78ro0oUSgGOhYiRHJ8JUHJOqDv4SChyC3Tf9sD7ELzZ
/JJNviD8/iv8ZpHNKGlbwFdive4CxqXIrOYaTycrDJ32/oQkYnEWIaLMmGHaF/F+
hasiUdS5D277DVz2/R2e0e2s8YXhkmRipoHjEdq57zk7PqRolheVQdaqYuCSmtwH
MjcJi1hi6g==
=TnwU
-----END PGP SIGNATURE-----
Merge tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block
Pull io_uring iov_iter retry fixes from Jens Axboe:
"This adds a helper to save/restore iov_iter state, and modifies
io_uring to use it.
After that is done, we can now kill the iter->truncated addition that
we added for this release. The io_uring change is being overly
cautious with the save/restore/advance, but better safe than sorry and
we can always improve that and reduce the overhead if it proves to be
of concern. The only case to be worried about in this regard is huge
IO, where iteration can take a while to iterate segments.
I spent some time writing test cases, and expanded the coverage quite
a bit from the last posting of this. liburing carries this regression
test case now:
https://git.kernel.dk/cgit/liburing/tree/test/file-verify.c
which exercises all of this. It now also supports provided buffers,
and explicitly tests for end-of-file/device truncation as well.
On top of that, Pavel sanitized the IOPOLL retry path to follow the
exact same pattern as normal IO"
* tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block:
io_uring: move iopoll reissue into regular IO path
Revert "iov_iter: track truncated size"
io_uring: use iov_iter state save/restore helpers
iov_iter: add helper to save iov_iter state