linux

mirror of synced 2025-03-06 20:59:54 +01:00

Author	SHA1	Message	Date
Pavel Begunkov	2bbb146d96	io_uring: refactor poll update Clean up io_poll_update() and unify cancellation paths for remove and update. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5937138b6265a1285220e2fab1b28132c1d73ce3.1639605189.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-28 09:51:14 -08:00
Pavel Begunkov	e840b4baf3	io_uring: remove double poll on poll update Before updating a poll request we should remove it from poll queues, including the double poll entry. Fixes: `b69de288e9` ("io_uring: allow events and user_data update of running poll requests") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ac39e7f80152613603b8a6cc29a2b6063ac2434f.1639605189.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-28 09:51:14 -08:00
Jens Axboe	7b9762a5e8	io_uring: zero iocb->ki_pos for stream file types io_uring supports using offset == -1 for using the current file position, and we read that in as part of read/write command setup. For the non-iter read/write types we pass in NULL for the position pointer, but for the iter types we should not be passing any anything but 0 for the position for a stream. Clear kiocb->ki_pos if the file is a stream, don't leave it as -1. If we do, then the request will error with -ESPIPE. Fixes: `ba04291eb6` ("io_uring: allow use of offset == -1 to mean file position") Link: https://github.com/axboe/liburing/discussions/501 Reported-by: Samuel Williams <samuel.williams@oriontransfer.co.nz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-22 20:34:32 -07:00
Hao Xu	33ce2aff7d	io_uring: code clean for some ctx usage There are some functions doing ctx = req->ctx while still using req->ctx, update those places. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211214055904.61772-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-14 06:49:06 -07:00
Jens Axboe	78a7806020	io_uring: ensure task_work gets run as part of cancelations If we successfully cancel a work item but that work item needs to be processed through task_work, then we can be sleeping uninterruptibly in io_uring_cancel_generic() and never process it. Hence we don't make forward progress and we end up with an uninterruptible sleep warning. While in there, correct a comment that should be IFF, not IIF. Reported-and-tested-by: syzbot+21e6887c0be14181206d@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-10 13:56:28 -07:00
Hao Xu	f28c240e71	io_uring: batch completion in prior_task_list In previous patches, we have already gathered some tw with io_req_task_complete() as callback in prior_task_list, let's complete them in batch while we cannot grab uring lock. In this way, we batch the req_complete_post path. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211208052125.351587-1-haoxu@linux.alibaba.com Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-08 11:34:48 -07:00
Hao Xu	a37fae8aaa	io_uring: split io_req_complete_post() and add a helper Split io_req_complete_post(), this is a prep for the next patch. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211207093951.247840-5-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-07 15:02:42 -07:00
Hao Xu	9f8d032a36	io_uring: add helper for task work execution code Add a helper for task work execution code. We will use it later. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211207093951.247840-4-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-07 15:01:57 -07:00
Hao Xu	4813c37792	io_uring: add a priority tw list for irq completion work Now we have a lot of task_work users, some are just to complete a req and generate a cqe. Let's put the work to a new tw list which has a higher priority, so that it can be handled quickly and thus to reduce avg req latency and users can issue next round of sqes earlier. An explanatory case: origin timeline: submit_sqe-->irq-->add completion task_work -->run heavy work0~n-->run completion task_work now timeline: submit_sqe-->irq-->add completion task_work -->run completion task_work-->run heavy work0~n Limitation: this optimization is only for those that submission and reaping process are in different threads. Otherwise anyhow we have to submit new sqes after returning to userspace, then the order of TWs doesn't matter. Tested this patch(and the following ones) by manually replace __io_queue_sqe() in io_queue_sqe() by io_req_task_queue() to construct 'heavy' task works. Then test with fio: ioengine=io_uring sqpoll=1 thread=1 bs=4k direct=1 rw=randread time_based=1 runtime=600 randrepeat=0 group_reporting=1 filename=/dev/nvme0n1 Tried various iodepth. The peak IOPS for this patch is 710K, while the old one is 665K. For avg latency, difference shows when iodepth grow: depth and avg latency(usec): depth new old 1 7.05 7.10 2 8.47 8.60 4 10.42 10.42 8 13.78 13.22 16 27.41 24.33 32 49.40 53.08 64 102.53 103.36 128 196.98 205.61 256 372.99 414.88 512 747.23 791.30 1024 1472.59 1538.72 2048 3153.49 3329.01 4096 6387.86 6682.54 8192 12150.25 12774.14 16384 23085.58 26044.71 Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20211207093951.247840-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-07 15:01:57 -07:00
Pavel Begunkov	a90c8bf659	io_uring: reuse io_req_task_complete for timeouts With kbuf unification io_req_task_complete() is now a generic function, use it for timeout's tw completions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7142fa3cbaf3a4140d59bcba45cbe168cf40fac2.1638714983.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-05 08:56:24 -07:00
Pavel Begunkov	83a13a4181	io_uring: tweak iopoll CQE_SKIP event counting When iopolling the userspace specifies the minimum number of "events" it expects. Previously, we had one CQE per request, so the definition of an "event" was unequivocal, but that's not more the case anymore with REQ_F_CQE_SKIP. Currently it counts the number of completed requests, replace it with the number of posted CQEs. This allows users of the "one CQE per link" scheme to wait for all N links in a single syscall, which is not possible without the patch and requires extra context switches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d5a965c4d2249827392037bbd0186f87fea49c55.1638714983.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-05 08:56:24 -07:00
Pavel Begunkov	d1fd1c201d	io_uring: simplify selected buf handling As selected buffers are now stored in a separate field in a request, get rid of rw/recv specific helpers and simplify the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bd4a866d8d91b044f748c40efff9e4eacd07536e.1638714983.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-05 08:56:24 -07:00
Hao Xu	3648e5265c	io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Move them up to avoid explicit declaration. We will use them in later patches. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3631243d6fc4a79bbba0cd62597fc8cd5be95924.1638714983.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-12-05 08:56:24 -07:00
Ye Bin	2087009c74	io_uring: validate timespec for timeout removals Like commit `f6223ff799`, timeout removal should also validate the timespec that is being passed in. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20211129041537.1936270-1-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-29 06:46:39 -07:00
Ye Bin	f6223ff799	io_uring: Fix undefined-behaviour in io_issue_sqe We got issue as follows: ================================================================================ UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14 signed integer overflow: -4966321760114568020 * 1000000000 cannot be represented in type 'long long int' CPU: 1 PID: 2186 Comm: syz-executor.2 Not tainted 4.19.90+ #12 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x170/0x1dc lib/dump_stack.c:118 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161 handle_overflow+0x188/0x1dc lib/ubsan.c:192 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213 ktime_set include/linux/ktime.h:42 [inline] timespec64_to_ktime include/linux/ktime.h:78 [inline] io_timeout fs/io_uring.c:5153 [inline] io_issue_sqe+0x42c8/0x4550 fs/io_uring.c:5599 __io_queue_sqe+0x1b0/0xbc0 fs/io_uring.c:5988 io_queue_sqe+0x1ac/0x248 fs/io_uring.c:6067 io_submit_sqe fs/io_uring.c:6137 [inline] io_submit_sqes+0xed8/0x1c88 fs/io_uring.c:6331 __do_sys_io_uring_enter fs/io_uring.c:8170 [inline] __se_sys_io_uring_enter fs/io_uring.c:8129 [inline] __arm64_sys_io_uring_enter+0x490/0x980 fs/io_uring.c:8129 invoke_syscall arch/arm64/kernel/syscall.c:53 [inline] el0_svc_common+0x374/0x570 arch/arm64/kernel/syscall.c:121 el0_svc_handler+0x190/0x260 arch/arm64/kernel/syscall.c:190 el0_svc+0x10/0x218 arch/arm64/kernel/entry.S:1017 ================================================================================ As ktime_set only judge 'secs' if big than KTIME_SEC_MAX, but if we pass negative value maybe lead to overflow. To address this issue, we must check if 'sec' is negative. Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20211118015907.844807-1-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-27 06:41:38 -07:00
Ye Bin	1d0254e6b4	io_uring: fix soft lockup when call __io_remove_buffers I got issue as follows: [ 567.094140] __io_remove_buffers: [1]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680 [ 594.360799] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108] [ 594.364987] Modules linked in: [ 594.365405] irq event stamp: 604180238 [ 594.365906] hardirqs last enabled at (604180237): [<ffffffff93fec9bd>] _raw_spin_unlock_irqrestore+0x2d/0x50 [ 594.367181] hardirqs last disabled at (604180238): [<ffffffff93fbbadb>] sysvec_apic_timer_interrupt+0xb/0xc0 [ 594.368420] softirqs last enabled at (569080666): [<ffffffff94200654>] __do_softirq+0x654/0xa9e [ 594.369551] softirqs last disabled at (569080575): [<ffffffff913e1d6a>] irq_exit_rcu+0x1ca/0x250 [ 594.370692] CPU: 2 PID: 108 Comm: kworker/u32:5 Tainted: G L 5.15.0-next-20211112+ #88 [ 594.371891] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 [ 594.373604] Workqueue: events_unbound io_ring_exit_work [ 594.374303] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50 [ 594.375037] Code: 48 83 c7 18 53 48 89 f3 48 8b 74 24 10 e8 55 f5 55 fd 48 89 ef e8 ed a7 56 fd 80 e7 02 74 06 e8 43 13 7b fd fb bf 01 00 00 00 <e8> f8 78 474 [ 594.377433] RSP: 0018:ffff888101587a70 EFLAGS: 00000202 [ 594.378120] RAX: 0000000024030f0d RBX: 0000000000000246 RCX: 1ffffffff2f09106 [ 594.379053] RDX: 0000000000000000 RSI: ffffffff9449f0e0 RDI: 0000000000000001 [ 594.379991] RBP: ffffffff9586cdc0 R08: 0000000000000001 R09: fffffbfff2effcab [ 594.380923] R10: ffffffff977fe557 R11: fffffbfff2effcaa R12: ffff8881b8f3def0 [ 594.381858] R13: 0000000000000246 R14: ffff888153a8b070 R15: 0000000000000000 [ 594.382787] FS: 0000000000000000(0000) GS:ffff888399c00000(0000) knlGS:0000000000000000 [ 594.383851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 594.384602] CR2: 00007fcbe71d2000 CR3: 00000000b4216000 CR4: 00000000000006e0 [ 594.385540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 594.386474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 594.387403] Call Trace: [ 594.387738] <TASK> [ 594.388042] find_and_remove_object+0x118/0x160 [ 594.389321] delete_object_full+0xc/0x20 [ 594.389852] kfree+0x193/0x470 [ 594.390275] __io_remove_buffers.part.0+0xed/0x147 [ 594.390931] io_ring_ctx_free+0x342/0x6a2 [ 594.392159] io_ring_exit_work+0x41e/0x486 [ 594.396419] process_one_work+0x906/0x15a0 [ 594.399185] worker_thread+0x8b/0xd80 [ 594.400259] kthread+0x3bf/0x4a0 [ 594.401847] ret_from_fork+0x22/0x30 [ 594.402343] </TASK> Message from syslogd@localhost at Nov 13 09:09:54 ... kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108] [ 596.793660] __io_remove_buffers: [2099199]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680 We can reproduce this issue by follow syzkaller log: r0 = syz_io_uring_setup(0x401, &(0x7f0000000300), &(0x7f0000003000/0x2000)=nil, &(0x7f0000ff8000/0x4000)=nil, &(0x7f0000000280)=<r1=>0x0, &(0x7f0000000380)=<r2=>0x0) sendmsg$ETHTOOL_MSG_FEATURES_SET(0xffffffffffffffff, &(0x7f0000003080)={0x0, 0x0, &(0x7f0000003040)={&(0x7f0000000040)=ANY=[], 0x18}}, 0x0) syz_io_uring_submit(r1, r2, &(0x7f0000000240)=@IORING_OP_PROVIDE_BUFFERS={0x1f, 0x5, 0x0, 0x401, 0x1, 0x0, 0x100, 0x0, 0x1, {0xfffd}}, 0x0) io_uring_enter(r0, 0x3a2d, 0x0, 0x0, 0x0, 0x0) The reason above issue is 'buf->list' has 2,100,000 nodes, occupied cpu lead to soft lockup. To solve this issue, we need add schedule point when do while loop in '__io_remove_buffers'. After add schedule point we do regression, get follow data. [ 240.141864] __io_remove_buffers: [1]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00 [ 268.408260] __io_remove_buffers: [1]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180 [ 275.899234] __io_remove_buffers: [2099199]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00 [ 296.741404] __io_remove_buffers: [1]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380 [ 305.090059] __io_remove_buffers: [2099199]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180 [ 325.415746] __io_remove_buffers: [1]start ctx=0xffff8881b92d1000 bgid=65533 buf=0xffff8881a17d8f00 [ 333.160318] __io_remove_buffers: [2099199]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380 ... Fixes:8bab4c09f24e("io_uring: allow conditional reschedule for intensive iterators") Signed-off-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/r/20211122024737.2198530-1-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-27 06:41:32 -07:00
Pavel Begunkov	6af3f48bf6	io_uring: fix link traversal locking WARNING: inconsistent lock state 5.16.0-rc2-syzkaller #0 Not tainted inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. ffff888078e11418 (&ctx->timeout_lock ){?.+.}-{2:2} , at: io_timeout_fn+0x6f/0x360 fs/io_uring.c:5943 {HARDIRQ-ON-W} state was registered at: [...] spin_unlock_irq include/linux/spinlock.h:399 [inline] __io_poll_remove_one fs/io_uring.c:5669 [inline] __io_poll_remove_one fs/io_uring.c:5654 [inline] io_poll_remove_one+0x236/0x870 fs/io_uring.c:5680 io_poll_remove_all+0x1af/0x235 fs/io_uring.c:5709 io_ring_ctx_wait_and_kill+0x1cc/0x322 fs/io_uring.c:9534 io_uring_release+0x42/0x46 fs/io_uring.c:9554 __fput+0x286/0x9f0 fs/file_table.c:280 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0xc14/0x2b40 kernel/exit.c:832 `674ee8e1b4` ("io_uring: correct link-list traversal locking") fixed a data race but introduced a possible deadlock and inconsistentcy in irq states. E.g. io_poll_remove_all() spin_lock_irq(timeout_lock) io_poll_remove_one() spin_lock/unlock_irq(poll_lock); spin_unlock_irq(timeout_lock) Another type of problem is freeing a request while holding ->timeout_lock, which may leads to a deadlock in io_commit_cqring() -> io_flush_timeouts() and other places. Having 3 nested locks is also too ugly. Add io_match_task_safe(), which would briefly take and release timeout_lock for race prevention inside, so the actuall request cancellation / free / etc. code doesn't have it taken. Reported-by: syzbot+ff49a3059d49b0ca0eec@syzkaller.appspotmail.com Reported-by: syzbot+847f02ec20a6609a328b@syzkaller.appspotmail.com Reported-by: syzbot+3368aadcd30425ceb53b@syzkaller.appspotmail.com Reported-by: syzbot+51ce8887cdef77c9ac83@syzkaller.appspotmail.com Reported-by: syzbot+3cb756a49d2f394a9ee3@syzkaller.appspotmail.com Fixes: `674ee8e1b4` ("io_uring: correct link-list traversal locking") Cc: stable@kernel.org # 5.15+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/397f7ebf3f4171f1abe41f708ac1ecb5766f0b68.1637937097.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-26 08:35:57 -07:00
Pavel Begunkov	617a89484d	io_uring: fail cancellation for EXITING tasks WARNING: CPU: 1 PID: 20 at fs/io_uring.c:6269 io_try_cancel_userdata+0x3c5/0x640 fs/io_uring.c:6269 CPU: 1 PID: 20 Comm: kworker/1:0 Not tainted 5.16.0-rc1-syzkaller #0 Workqueue: events io_fallback_req_func RIP: 0010:io_try_cancel_userdata+0x3c5/0x640 fs/io_uring.c:6269 Call Trace: <TASK> io_req_task_link_timeout+0x6b/0x1e0 fs/io_uring.c:6886 io_fallback_req_func+0xf9/0x1ae fs/io_uring.c:1334 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> We need original task's context to do cancellations, so if it's dying and the callback is executed in a fallback mode, fail the cancellation attempt. Fixes: `89b263f6d5` ("io_uring: run linked timeouts from task_work") Cc: stable@kernel.org # 5.15+ Reported-by: syzbot+ab0cfe96c2b3cd1c1153@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4c41c5f379c6941ad5a07cd48cb66ed62199cf7e.1637937097.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-26 08:35:43 -07:00
Hao Xu	b6c7db3218	io_uring: better to use REQ_F_IO_DRAIN for req->flags It's better to use REQ_F_IO_DRAIN for req->flags rather than IOSQE_IO_DRAIN though they have same value. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211125092103.224502-3-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-25 09:00:42 -07:00
Hao Xu	e302f1046f	io_uring: fix no lock protection for ctx->cq_extra ctx->cq_extra should be protected by completion lock so that the req_need_defer() does the right check. Cc: stable@vger.kernel.org Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211125092103.224502-2-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-25 09:00:42 -07:00
Pavel Begunkov	5562a8d71a	io_uring: disable drain with cqe skip Current IOSQE_IO_DRAIN implementation doesn't work well with CQE skipping and it's not allowed, otherwise some requests might be not executed until the ring is destroyed and the userspace would hang. Let's fail all drain requests after seeing IOSQE_CQE_SKIP_SUCCESS at least once. All drained requests prior to that will get run normally, so there should be no stalls. However, even though such mixing wouldn't lead to issues at the moment, it's still not allowed as the behaviour may change. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bcf7164f8bf3eb54b7bb7b4fd119907fa4d4d43b.1636559119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-24 11:17:53 -07:00
Pavel Begunkov	3d4aeb9f98	io_uring: don't spinlock when not posting CQEs When no of queued for the batch completion requests need to post an CQE, see IOSQE_CQE_SKIP_SUCCESS, avoid grabbing ->completion_lock and other commit/post. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8d4b4a08bca022cbe19af00266407116775b3e4d.1636559119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-24 11:17:53 -07:00
Pavel Begunkov	04c76b41ca	io_uring: add option to skip CQE posting Emitting a CQE is expensive from the kernel perspective. Often, it's also not convenient for the userspace, spends some cycles on processing and just complicates the logic. A similar problems goes for linked requests, where we post an CQE for each request in the link. Introduce a new flags, IOSQE_CQE_SKIP_SUCCESS, trying to help with it. When set and a request completed successfully, it won't generate a CQE. When fails, it produces an CQE, but all following linked requests will be CQE-less, regardless whether they have IOSQE_CQE_SKIP_SUCCESS or not. The notion of "fail" is the same as for link failing-cancellation, where it's opcode dependent, and _usually_ result >= 0 is a success, but not always. Linked timeouts are a bit special. When the requests it's linked to was not attempted to be executed, e.g. failing linked requests, it follows the description above. Otherwise, whether a linked timeout will post a completion or not solely depends on IOSQE_CQE_SKIP_SUCCESS of that linked timeout request. Linked timeout never "fail" during execution, so for them it's unconditional. It's expected for users to not really care about the result of it but rely solely on the result of the master request. Another reason for such a treatment is that it's racy, and the timeout callback may be running awhile the master request posts its completion. use case 1: If one doesn't care about results of some requests, e.g. normal timeouts, just set IOSQE_CQE_SKIP_SUCCESS. Error result will still be posted and need to be handled. use case 2: Set IOSQE_CQE_SKIP_SUCCESS for all requests of a link but the last, and it'll post a completion only for the last one if everything goes right, otherwise there will be one only one CQE for the first failed request. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0220fbe06f7cf99e6fc71b4297bb1cb6c0e89c2c.1636559119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-24 11:17:53 -07:00
Pavel Begunkov	913a571aff	io_uring: clean cqe filling functions Split io_cqring_fill_event() into a couple of more targeted functions. The first on is io_fill_cqe_aux() for completions that are not associated with request completions and doing the ->cq_extra accounting. Examples are additional CQEs from multishot poll and rsrc notifications. The second is io_fill_cqe_req(), should be called when it's a normal request completion. Nothing more to it at the moment, will be used in later patches. The last one is inlined __io_fill_cqe() for a finer grained control, should be used with caution and in hottest places. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/59a9117a4a44fc9efcf04b3afa51e0d080f5943c.1636559119.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-24 11:17:53 -07:00
Pavel Begunkov	2ea537ca02	io_uring: improve argument types of kiocb_done() kiocb_done() accepts a pointer to struct kiocb, pass struct io_kiocb (i.e. io_uring's request) instead so we can get rid of useless container_of(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/252016eed77806f58b48251a85cd8c645f900433.1637524285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-23 12:24:20 -07:00
Pavel Begunkov	f3251183b2	io_uring: clean __io_import_iovec() Apparently, implicit 0 to NULL conversion with ERR_PTR is not recommended and makes some tooling like Smatch to complain. Handle it explicitly, compilers are perfectly capable to optimise it out. Link: https://lore.kernel.org/all/20211108134937.GA2863@kili/ Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5c6ed369ad95075dab345df679f8677b8fe66656.1637524285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-23 12:24:20 -07:00
Pavel Begunkov	7297ce3d59	io_uring: improve send/recv error handling Hide all error handling under common if block, removes two extra ifs on the success path and keeps the handling more condensed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5761545158a12968f3caf30f747eea65ed75dfc1.1637524285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-23 12:24:20 -07:00
Pavel Begunkov	06bdea20c1	io_uring: simplify reissue in kiocb_done Simplify failed resubmission prep in kiocb_done(), it's a bit ugly with conditional logic and hand handling cflags / select buffers. Instead, punt to tw and use io_req_task_complete() already handling all the cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/667c33484b05b612e9420e1b1d5f4dc46d0ee9ce.1637524285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-23 12:24:20 -07:00
Pavel Begunkov	674ee8e1b4	io_uring: correct link-list traversal locking As io_remove_next_linked() is now under ->timeout_lock (see io_link_timeout_fn), we should update locking around io_for_each_link() and io_match_task() to use the new lock. Cc: stable@kernel.org # 5.15+ Fixes: `89850fce16` ("io_uring: run timeouts from task_work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b54541cedf7de59cb5ae36109e58529ca16e66aa.1637631883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-22 19:31:54 -07:00
Kamal Mostafa	f6f9b278f2	io_uring: fix missed comment from task_file rename Fix comment referring to function "io_uring_del_task_file()", now called "io_uring_del_tctx_node()". Fixes: `eef51daa72` ("io_uring: rename function task_file") Signed-off-by: Kamal Mostafa <kamal@canonical.com> Link: https://lore.kernel.org/r/20211116175530.31608-1-kamal@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-16 17:24:34 -07:00
Pavel Begunkov	bad119b9a0	io_uring: honour zeroes as io-wq worker limits When we pass in zero as an io-wq worker number limit it shouldn't actually change the limits but return the old value, follow that behaviour with deferred limits setup as well. Cc: stable@kernel.org # 5.15 Reported-by: Beld Zhang <beldzhang@gmail.com> Fixes: `e139a1ec92` ("io_uring: apply max_workers limit to all future users") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b222a92f7a78a24b042763805e891a4cdd4b544.1636384034.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-08 08:39:48 -07:00
Jens Axboe	a19577808f	io_uring: remove dead 'sqe' store The kernel test robot correctly identifies that we store sqe twice, remove the earlier one that is done before validating the index. Fixes: `f75d118349` ("io_uring: harder fdinfo sq/cq ring iterating") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-05 09:31:05 -06:00
Nghia Le	83956c86ff	io_uring: remove redundant assignment to ret in io_register_iowq_max_workers() After the assignment, only exit path with label 'err' uses ret as return value. However,before exiting through this path with label 'err', ret is assigned with the return value of io_wq_max_workers(). Hence, the initial assignment is redundant and can be removed. Signed-off-by: Nghia Le <nghialm78@gmail.com> Link: https://lore.kernel.org/r/20211102190521.28291-1-nghialm78@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-02 14:29:12 -06:00
Pavel Begunkov	9881024aab	io_uring: clean up io_queue_sqe_arm_apoll The fix for linked timeout unprep got a bit distored with two rebases, handle linked timeouts for IO_APOLL_READY as with all other cases, i.e. queue it at the end of the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/130b1ea5605bbd81d7b874a95332295799d33b81.1635863773.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-02 09:26:14 -06:00
Linus Torvalds	cdab10bf32	selinux/stable-5.16 PR 20211101 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmGANbAUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNaMBAAg+9gZr0F7xiafu8JFZqZfx/AQdJ2 G2cn3le+/tXGZmF8m/+82lOaR6LeQLatgSDJNSkXWkKr0nRwseQJDbtRfvYJdn0t Ax05/Fmz6OGxQ2wgRYgaFiSrKpE5p3NhDtiLFVdkCJaQNe/8DZOc7NhBl6EjZf3x ubhl2hUiJ4AmiXGwcYhr4uKgP4nhW8OM1/OkskVi+bBMmLA8KTY9kslmIDP5E3BW 29W4qhqeLNQupY5dGMEMVcyxY9ZUWpO39q4uOaQVZrUGE7xABkj/jhnxT5gFTSlI pu8VhsYXm9KuRVveIsv0L5SZfadwoM9YAl7ki1wD3W5rHqOAte3rBTm6VmNlQwfU MqxP65Jiyxudxet5Be3/dCRH/+MDQuwBxivgmZXbeVxor2SeznVb0GDaEUC5FSHu CJIgWtQzsPJMxgAEGXN4F3QGP0htTTJni56GUPOsrf4TIBW02TT+oLTLFRIokQQL INNOfwVSRXElnCsvxsHR4oB+JZ9pJyBaAmeupcQ6jmcKiWlbLj4s+W0U0pM5h91v hmMpz7KMxrX6gVL4gB2Jj4aN3r5YRbq26NBu6D+wdwwBTeTTocaHSpAqkv4buClf uNk3cG8Hkp8TTg9cM8jYgpxMyzKH/AI/Uw3VhEa1xCiq2Ck3DgfnZvnvcRRaZevU FPgmwgqePJXGi60= =sb8J -----END PGP SIGNATURE----- Merge tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: - Add LSM/SELinux/Smack controls and auditing for io-uring. As usual, the individual commit descriptions have more detail, but we were basically missing two things which we're adding here: + establishment of a proper audit context so that auditing of io-uring ops works similarly to how it does for syscalls (with some io-uring additions because io-uring ops are not syscalls) + additional LSM hooks to enable access control points for some of the more unusual io-uring features, e.g. credential overrides. The additional audit callouts and LSM hooks were done in conjunction with the io-uring folks, based on conversations and RFC patches earlier in the year. - Fixup the binder credential handling so that the proper credentials are used in the LSM hooks; the commit description and the code comment which is removed in these patches are helpful to understand the background and why this is the proper fix. - Enable SELinux genfscon policy support for securityfs, allowing improved SELinux filesystem labeling for other subsystems which make use of securityfs, e.g. IMA. * tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: security: Return xattr name from security_dentry_init_security() selinux: fix a sock regression in selinux_ip_postroute_compat() binder: use cred instead of task for getsecid binder: use cred instead of task for selinux checks binder: use euid from cred instead of using task LSM: Avoid warnings about potentially unused hook variables selinux: fix all of the W=1 build warnings selinux: make better use of the nf_hook_state passed to the NF hooks selinux: fix race condition when computing ocontext SIDs selinux: remove unneeded ipv6 hook wrappers selinux: remove the SELinux lockdown implementation selinux: enable genfscon labeling for securityfs Smack: Brutalist io_uring support selinux: add support for the io_uring access controls lsm,io_uring: add LSM hooks to io_uring io_uring: convert io_uring to the secure anon inode interface fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() audit: add filtering for io_uring records audit,io_uring,io-wq: add some basic audit support to io_uring audit: prepare audit_context for use in calling contexts beyond syscalls	2021-11-01 21:06:18 -07:00
Linus Torvalds	b6773cdb0e	for-5.16/ki_complete-2021-10-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8MOUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmeqEACrayLMDMdlb1FduTYw29QAL7XxS375r92T bwLippmKQIFNi8p5ScHraelV5ixgxse2j68MexlQHpl9aHIn/oL7qHACIMgDP05m KaSy8Hr2abqr+zz+rLMhkm21zAva6aWjQu7NoEjBE4dC5L4l9p885LaA+jmqQUno 1wvpaEcype8cITJ+sSCb3kD6nZx7y1Lt5zEefUfk6ruMm9x9FwvU6uc4rIHi+Zve Hwo8yGbTvlU8rGSi9naC/U8pIZ4bqEuTAcV5VHNrWG+b4aA/aFPpSjpIiSBZSXo0 HXa+jmcr6gkejfPeOZkBbRub6Fm9Wq2pDAZskPWFX6zyX0pIV05GjJ2J/ba8rovn QrcfxaBv8XitKgrjFZeR0ZBqD2iJjPA/Yq5/r1ZmZ0wSHI3W4UuTGhQYEPyDLceH ZWq/wcfVFek4kAoCxCqy9kWiOujY90WWKQW3yD7b8FPZ0d+/R1Mn+drlYaSKN1Pk /9/+z1DaLtBWbJ2G+BQ9oUkYmNSapAiYc2YXVss86hmhLX+prFtSj3zECZUvhyAz b42A2DVsjU+65yT2zdPBXlMrbI91qNnvIXcz5szNdTfHTn9FiLQb4BffMV0FHT3g vap8N3Rb8UkZ3v4NCVAtlfcGr0kvYHQH+Qgh6oAlXB4NQoKJCVadzpTFPMWjx788 oHBUjA0UTQ== =4vl/ -----END PGP SIGNATURE----- Merge tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block Pull kiocb->ki_complete() cleanup from Jens Axboe: "This removes the res2 argument from kiocb->ki_complete(). Only the USB gadget code used it, everybody else passes 0. The USB guys checked the user gadget code they could find, and everybody just uses res as expected for the async interface" * tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block: fs: get rid of the res2 iocb->ki_complete argument usb: remove res2 argument from gadget code completions	2021-11-01 10:17:11 -07:00
Linus Torvalds	8d1f01775f	for-5.16/io_uring-2021-10-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KHcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphvVEADHMsZP3fOGyJNqnIibIrDL5ZdUGtr5iH3c 0UIi9It0jo9xOyPX/aY2n1pInXK4vvND9ULC+XGYttSJZXWuYEbMGYQ34du2EP0r dypN4JPwO6X+mFkJND6x8IeDCzj/fy6LCFbWbRlDNsndTZ/gavVTOybMpOLdCJx9 IyXE1iHismaIaD7I3Q77zvN0ei87cEwBfg9R0vRAXKBKUh5raSiLWsOYOiXQkZH4 8iUeDmOLlaWghgXwweODxARXuWq+gWZgiBMd0tp0QCECXMv+NIpfJYauvLHJDa/u QScr9uRMrJS3KgRgt61o+Z2fcpzJF/bL0e0s5Ul9CgflRWucARbgodUMl4rZCi9D WOwxPxv8Oab8IT7Qc/ZHdY3ULJsULRgbtmc/9OqPL5Y/Ww9/9E63Is8O4q/QFc7T xJ1p5yZKw3G+G7oG0YBYE0U+x3RUzi4b/Ob+ECeLcAAAcp+XFg6epK6Aj8HDWd8K kGYlEBKEq1hILM44K59YTwAT/Cp+fkwe+x7pNQ3JjqtPpVpqGT7RoMUuCduofT1J ROtB+S8/AwhdABL6KKUYSVF8zlfoXbQpQs3SUKjaBtPVjwXLZwXERy7ttD/4STtT QjC+5/qAWnMR8CYADE0E3rlicUkHJm1+AHukYLz0REphDcNO8GuB9PCDzX4SX/ol SGJ6hoprYQ== =5U4u -----END PGP SIGNATURE----- Merge tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "Light on new features - basically just the hybrid mode support. Outside of that it's just fixes, cleanups, and performance improvements. In detail: - Add ring related information to the fdinfo output (Hao) - Hybrid async mode (Hao) - Support for batched issue on block (me) - sqe error trace improvement (me) - IOPOLL efficiency improvements (Pavel) - submit state cleanups and improvements (Pavel) - Completion side improvements (Pavel) - Drain improvements (Pavel) - Buffer selection cleanups (Pavel) - Fixed file node improvements (Pavel) - io-wq setup cancelation fix (Pavel) - Various other performance improvements and cleanups (Pavel) - Misc fixes (Arnd, Bixuan, Changcheng, Hao, me, Noah)" * tag 'for-5.16/io_uring-2021-10-29' of git://git.kernel.dk/linux-block: (97 commits) io-wq: remove worker to owner tw dependency io_uring: harder fdinfo sq/cq ring iterating io_uring: don't assign write hint in the read path io_uring: clusterise ki_flags access in rw_prep io_uring: kill unused param from io_file_supports_nowait io_uring: clean up timeout async_data allocation io_uring: don't try io-wq polling if not supported io_uring: check if opcode needs poll first on arming io_uring: clean iowq submit work cancellation io_uring: clean io_wq_submit_work()'s main loop io-wq: use helper for worker refcounting io_uring: implement async hybrid mode for pollable requests io_uring: Use ERR_CAST() instead of ERR_PTR(PTR_ERR()) io_uring: split logic of force_nonblock io_uring: warning about unused-but-set parameter io_uring: inform block layer of how many requests we are submitting io_uring: simplify io_file_supports_nowait() io_uring: combine REQ_F_NOWAIT_{READ,WRITE} flags io_uring: arm poll for non-nowait files fs/io_uring: Prioritise checking faster conditions first in io_write ...	2021-11-01 09:41:33 -07:00
Linus Torvalds	33c8846c81	for-5.16/block-2021-10-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KDgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmQ2D/wO0nH3U+3+OZChi3XUwYck9Dev3o6BANCF ClATiK/kivZY0xY1r8J4ixirZo2gcjIMpWSC3JGYZ5LdspfmYGLUbMjfZsaeU23i lAKaX1IqfArmHN76k3IU1bKCg7B0/LFwC0q9QTFWTSwNSs8RK/EZLJ61U1hEXUb3 OfIpaMmvPiMaU7yuPqhcZK14m1cg1srrLM4rFB/PqsWWStF07pHq32WeArGDAU0e Fe0YSnYD7qqA5Qc37KwqjCTmmxKX5YZf7etIcA6p3DNmwcuQrVNzKoCH/ZEDijaD E2bS/BWbN1x96+rtoEZfBYEaNIrkmJzmW6+fJ53OITbJF3KqP6V66erhqNcFYCzC mhFlRe7voXb/8AP7zQqSIhK529BUBM36sQ6nF7EiQcDrfLc1z39mq6eblUxbknIA DDPISD5Tseik9N9x0bc7vINseKyHI1E90VAU/XKADcuGbzLvehPx+2p+Iq5ch5Ah oa1G3RdlWWQOZxphJHWJhu1qMfo5+FP9dFZj1aoo7b8Kbc/CedyoQe71cpIE5wNh Jj/EpWJnuyKXwuTic2VYGC+6ezM9O5DSdqCfP3YuZky95VESyvRCKJYMMgBYRVdC /LuxhnBXIY2G8An7ZTnX0kLCCvLbapIwa0NyA98/xeOngO843coJ6wn8ZmE9LJNH kMmpCygUrA== =QWC+ -----END PGP SIGNATURE----- Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - mq-deadline accounting improvements (Bart) - blk-wbt timer fix (Andrea) - Untangle the block layer includes (Christoph) - Rework the poll support to be bio based, which will enable adding support for polling for bio based drivers (Christoph) - Block layer core support for multi-actuator drives (Damien) - blk-crypto improvements (Eric) - Batched tag allocation support (me) - Request completion batching support (me) - Plugging improvements (me) - Shared tag set improvements (John) - Concurrent queue quiesce support (Ming) - Cache bdev in ->private_data for block devices (Pavel) - bdev dio improvements (Pavel) - Block device invalidation and block size improvements (Xie) - Various cleanups, fixes, and improvements (Christoph, Jackie, Masahira, Tejun, Yu, Pavel, Zheng, me) * tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits) blk-mq-debugfs: Show active requests per queue for shared tags block: improve readability of blk_mq_end_request_batch() virtio-blk: Use blk_validate_block_size() to validate block size loop: Use blk_validate_block_size() to validate block size nbd: Use blk_validate_block_size() to validate block size block: Add a helper to validate the block size block: re-flow blk_mq_rq_ctx_init() block: prefetch request to be initialized block: pass in blk_mq_tags to blk_mq_rq_ctx_init() block: add rq_flags to struct blk_mq_alloc_data block: add async version of bio_set_polled block: kill DIO_MULTI_BIO block: kill unused polling bits in __blkdev_direct_IO() block: avoid extra iter advance with async iocb block: Add independent access ranges support blk-mq: don't issue request directly in case that current is to be blocked sbitmap: silence data race warning blk-cgroup: synchronize blkg creation against policy deactivation block: refactor bio_iov_bvec_set() block: add single bio async direct IO helper ...	2021-11-01 09:19:50 -07:00
Linus Torvalds	49f8275c7d	Memory folios Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54 Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3 Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd LYCtbeawtkikPVXZMfWybsx5vn0c3Q== =7SBe -----END PGP SIGNATURE----- Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache Pull memory folios from Matthew Wilcox: "Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems with some functions expecting only a head page while others expect the precise page containing a particular byte. The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head(). This converts just parts of the core MM and the page cache. For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready. The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE. I'd like to thank all my reviewers who've offered review/ack tags: Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil Babka, William Kucharski, Yu Zhao and Zi Yan. I'd also like to thank those who gave feedback I incorporated but haven't offered up review tags for this part of the series: Nick Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard, Hugh Dickins, and probably a few others who I forget" * tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits) mm/writeback: Add folio_write_one mm/filemap: Add FGP_STABLE mm/filemap: Add filemap_get_folio mm/filemap: Convert mapping_get_entry to return a folio mm/filemap: Add filemap_add_folio() mm/filemap: Add filemap_alloc_folio mm/page_alloc: Add folio allocation functions mm/lru: Add folio_add_lru() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm: Add folio_evictable() mm/workingset: Convert workingset_refault() to take a folio mm/filemap: Add readahead_folio() mm/filemap: Add folio_mkwrite_check_truncate() mm/filemap: Add i_blocks_per_folio() mm/writeback: Add folio_redirty_for_writepage() mm/writeback: Add folio_account_redirty() mm/writeback: Add folio_clear_dirty_for_io() mm/writeback: Add folio_cancel_dirty() mm/writeback: Add folio_account_cleaned() mm/writeback: Add filemap_dirty_folio() ...	2021-11-01 08:47:59 -07:00
Jens Axboe	f75d118349	io_uring: harder fdinfo sq/cq ring iterating The ring iteration is racy, which isn't necessarily a problem except it can cause us to iterate the whole thing. That isn't desired or ideal, and it can lead to excessive runtimes of reading fdinfo. Cap the iteration at tail - head OR the ring size. While in there, clean up the ring masking and just dump the raw values along with the masks. That provides more useful debug info. Fixes: `83f84356bc` ("io_uring: add more uring info to fdinfo for debug") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-29 09:49:33 -06:00
Jens Axboe	3884b83dff	io_uring: don't assign write hint in the read path Move this out of the generic read/write prep path, and place it in the write specific kiocb setup instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-26 15:54:40 -06:00
Jens Axboe	6b19b766e8	fs: get rid of the res2 iocb->ki_complete argument The second argument was only used by the USB gadget code, yet everyone pays the overhead of passing a zero to be passed into aio, where it ends up being part of the aio res2 value. Now that everybody is passing in zero, kill off the extra argument. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-25 10:36:24 -06:00
Pavel Begunkov	fb27274a90	io_uring: clusterise ki_flags access in rw_prep ioprio setup doesn't depend on other fields that are modified in io_prep_rw() and we can move it down in the function without worrying about performance. It's useful as it makes iocb->ki_flags accesses/modifications closer together, so it's more likely the compiler will cache it in a register and avoid extra reloads. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8ee98779c06f1b59f6039b1e292db4332efd664b.1634987320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-25 07:42:35 -06:00
Pavel Begunkov	b9a6b8f92f	io_uring: kill unused param from io_file_supports_nowait io_file_supports_nowait() doesn't use rw argument anymore, remove it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4bd6709fc573d70c866ea656cb7a7dbe94be8026.1634987320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-25 07:42:35 -06:00
Pavel Begunkov	d6a644a795	io_uring: clean up timeout async_data allocation opcode prep functions are one of the first things that are called, we can't have ->async_data allocated at this point and it's certainly a bug. Reflect this assumption in io_timeout_prep() and add a WARN_ONCE just in case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/75a28ca7dbcc5af8b6cd9092819e8384c24dedd4.1634987320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-25 07:42:35 -06:00
Pavel Begunkov	afb7f56fc6	io_uring: don't try io-wq polling if not supported If an opcode doesn't support polling, just let it be executed synchronously in iowq, otherwise it will do a nonblock attempt just to fail in io_arm_poll_handler() and return back to blocking execution. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6401256db01b88f448f15fcd241439cb76f5b940.1634987320.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-25 07:42:33 -06:00
Pavel Begunkov	658d0a4016	io_uring: check if opcode needs poll first on arming ->pollout or ->pollin are set only for opcodes that need a file, so if io_arm_poll_handler() tests them first we can be sure that the request has file set and the ->file check can be removed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9adfe4f543d984875e516fce6da35348aab48668.1634987320.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-25 07:42:31 -06:00
Pavel Begunkov	d01905db14	io_uring: clean iowq submit work cancellation If we've got IO_WQ_WORK_CANCEL in io_wq_submit_work(), handle the error on the same lines as the check instead of having a weird code flow. The main loop doesn't change but goes one indention left. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ff4a09cf41f7a22bbb294b6f1faea721e21fe615.1634987320.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-25 07:42:29 -06:00
Pavel Begunkov	255657d237	io_uring: clean io_wq_submit_work()'s main loop Do a bit of cleaning for the main loop of io_wq_submit_work(). Get rid of switch, just replace it with a single if as we're retrying in both other cases. Kill issue_sqe label, Get rid of needs_poll nesting and disambiguate a bit the comment. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ed12ce0c64e051f9a6b8a37a24f8ea554d299c29.1634987320.git.asml.silence@gmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-25 07:42:20 -06:00
Hao Xu	90fa02883f	io_uring: implement async hybrid mode for pollable requests The current logic of requests with IOSQE_ASYNC is first queueing it to io-worker, then execute it in a synchronous way. For unbound works like pollable requests(e.g. read/write a socketfd), the io-worker may stuck there waiting for events for a long time. And thus other works wait in the list for a long time too. Let's introduce a new way for unbound works (currently pollable requests), with this a request will first be queued to io-worker, then executed in a nonblock try rather than a synchronous way. Failure of that leads it to arm poll stuff and then the worker can begin to handle other works. The detail process of this kind of requests is: step1: original context: queue it to io-worker step2: io-worker context: nonblock try(the old logic is a synchronous try here) \| \|--fail--> arm poll \| \|--(fail/ready)-->synchronous issue \| \|--(succeed)-->worker finish it's job, tw take over the req This works much better than the old IOSQE_ASYNC logic in cases where unbound max_worker is relatively small. In this case, number of io-worker eazily increments to max_worker, new worker cannot be created and running workers stuck there handling old works in IOSQE_ASYNC mode. In my 64-core machine, set unbound max_worker to 20, run echo-server, turns out: (arguments: register_file, connetion number is 1000, message size is 12 Byte) original IOSQE_ASYNC: 76664.151 tps after this patch: 166934.985 tps Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Link: https://lore.kernel.org/r/20211018133445.103438-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-10-22 19:20:57 -06:00

1 2 3 4 5 ...

1846 commits