summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)AuthorFilesLines
4 daysio_uring/kbuf: propagate BUF_MORE through early buffer commit pathJens Axboe2-3/+7
Commit 418eab7a6f3c002d8e64d6e95ec27118017019af upstream. When io_should_commit() returns true (eg for non-pollable files), buffer commit happens at buffer selection time and sel->buf_list is set to NULL. When __io_put_kbufs() generates CQE flags at completion time, it calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence cannot determine whether the buffer was consumed or not. This means that IORING_CQE_F_BUF_MORE is never set for non-pollable input with incrementally consumed buffers. Likewise for io_buffers_select(), which always commits upfront and discards the return value of io_kbuf_commit(). Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early commit. Then __io_put_kbuf_ring() can check this flag and set IORING_F_BUF_MORE accordingy. Reported-by: Martin Michaelis <code@mgjm.de> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 daysio_uring/kbuf: check if target buffer list is still legacy on recycleJens Axboe1-2/+10
Commit c2c185be5c85d37215397c8e8781abf0a69bec1f upstream. There's a gap between when the buffer was grabbed and when it potentially gets recycled, where if the list is empty, someone could've upgraded it to a ring provided type. This can happen if the request is forced via io-wq. The legacy recycling is missing checking if the buffer_list still exists, and if it's of the correct type. Add those checks. Cc: stable@vger.kernel.org Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Reported-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 daysio_uring/uring_cmd: fix too strict requirement on ioctlAsbjørn Sloth Tønnesen1-3/+6
[ Upstream commit 600b665b903733bd60334e86031b157cc823ee55 ] Attempting SOCKET_URING_OP_SETSOCKOPT on an AF_NETLINK socket resulted in an -EOPNOTSUPP, as AF_NETLINK doesn't have an ioctl in its struct proto, but only in struct proto_ops. Prior to the blamed commit, io_uring_cmd_sock() only had two cmd_op operations, both requiring ioctl, thus the check was warranted. Since then, 4 new cmd_op operations have been added, none of which depend on ioctl. This patch moves the ioctl check, so it only applies to the original operations. AFAICT, the ioctl requirement was unintentional, and it wasn't visible in the blamed patch within 3 lines of context. Cc: stable@vger.kernel.org Fixes: a5d2f99aff6b ("io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT") Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> [Asbjørn: function moved in commit 91db6edc573b; updated subject prefix] Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-03-04io_uring/filetable: clamp alloc_hint to the configured alloc rangeJens Axboe1-0/+4
[ Upstream commit a6bded921ed35f21b3f6bd8e629bf488499ca442 ] Explicit fixed file install/remove operations on slots outside the configured alloc range can corrupt alloc_hint via io_file_bitmap_set() and io_file_bitmap_clear(), which unconditionally update alloc_hint to the bit position. This causes subsequent auto-allocations to fall outside the configured range. For example, if the alloc range is [10, 20) and a file is removed at slot 2, alloc_hint gets set to 2. The next auto-alloc then starts searching from slot 2, potentially returning a slot below the range. Fix this by clamping alloc_hint to [file_alloc_start, file_alloc_end) at the top of io_file_bitmap_get() before starting the search. Cc: stable@vger.kernel.org Fixes: 6e73dffbb93c ("io_uring: let to set a range for file slot allocation") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04io_uring/net: don't continue send bundle if poll was required for retryJens Axboe1-1/+5
[ Upstream commit 806ae939c41e5da1d94a1e2b31f5702e96b6c3e3 ] If a send bundle has picked a bunch of buffers, then it needs to send all of those to be complete. This may require poll arming, if the send buffer ends up being full. Once a send bundle has been poll armed, no further bundles should be attempted. This allows a current bundle to complete even though it needs to go through polling to do so, but it will not allow another bundle to be started once that has happened. Ideally we would abort a bundle if it was only partially sent, but as some parts of it already went out on the wire, this obviously isn't feasible. Not continuing more bundle attempts post encountering a full socket buffer is the second best thing. Cc: stable@vger.kernel.org Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04io_uring/cancel: de-unionize file and user_data in struct io_cancel_dataJens Axboe1-4/+2
[ Upstream commit 22dbb0987bd1e0ec3b1e4ad20756a98f99aa4a08 ] By having them share the same space in struct io_cancel_data, it ends up disallowing IORING_ASYNC_CANCEL_FD|IORING_ASYNC_CANCEL_USERDATA from working. Eg you cannot match on both a file and user_data for cancelation purposes. This obviously isn't a common use case as nobody has reported this, but it does result in -ENOENT potentially being returned when trying to match on both, rather than actually doing what the API says it would. Fixes: 4bf94615b888 ("io_uring: allow IORING_OP_ASYNC_CANCEL with 'fd' key") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04io_uring/sync: validate passed in offsetJens Axboe1-0/+2
[ Upstream commit 649dd18f559891bdafc5532d737c7dfb56060a6d ] Check if the passed in offset is negative once cast to sync->off. This ensures that -EINVAL is returned for that case, like it would be for sync_file_range(2). Fixes: c992fe2925d7 ("io_uring: add fsync support") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04io_uring: use release-acquire ordering for IORING_SETUP_R_DISABLEDCaleb Sander Mateos3-4/+17
[ Upstream commit 7a8737e1132ff07ca225aa7a4008f87319b5b1ca ] io_uring_enter(), __io_msg_ring_data(), and io_msg_send_fd() read ctx->flags and ctx->submitter_task without holding the ctx's uring_lock. This means they may race with the assignment to ctx->submitter_task and the clearing of IORING_SETUP_R_DISABLED from ctx->flags in io_register_enable_rings(). Ensure the correct ordering of the ctx->flags and ctx->submitter_task memory accesses by storing to ctx->flags using release ordering and loading it using acquire ordering. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 4add705e4eeb ("io_uring: remove io_register_submitter") Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-12io_uring/rw: recycle buffers manually for non-mshot readsJens Axboe1-0/+2
Commit d8e1dec2f860ee40623609aa6c4f22e1ee45605d upstream. The mshot side of reads already does this, but the regular read path does not. This leads to needing recycling checks sprinkled in various spots in the "go async" path, like arming poll. In preparation for getting rid of those, ensure that read recycles appropriately. Link: https://lore.kernel.org/r/20250821020750.598432-8-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-30io_uring/io-wq: check IO_WQ_BIT_EXIT inside work run loopJens Axboe1-1/+1
commit 10dc959398175736e495f71c771f8641e1ca1907 upstream. Currently this is checked before running the pending work. Normally this is quite fine, as work items either end up blocking (which will create a new worker for other items), or they complete fairly quickly. But syzbot reports an issue where io-wq takes seemingly forever to exit, and with a bit of debugging, this turns out to be because it queues a bunch of big (2GB - 4096b) reads with a /dev/msr* file. Since this file type doesn't support ->read_iter(), loop_rw_iter() ends up handling them. Each read returns 16MB of data read, which takes 20 (!!) seconds. With a bunch of these pending, processing the whole chain can take a long time. Easily longer than the syzbot uninterruptible sleep timeout of 140 seconds. This then triggers a complaint off the io-wq exit path: INFO: task syz.4.135:6326 blocked for more than 143 seconds. Not tainted syzkaller #0 Blocked by coredump. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz.4.135 state:D stack:26824 pid:6326 tgid:6324 ppid:5957 task_flags:0x400548 flags:0x00080000 Call Trace: <TASK> context_switch kernel/sched/core.c:5256 [inline] __schedule+0x1139/0x6150 kernel/sched/core.c:6863 __schedule_loop kernel/sched/core.c:6945 [inline] schedule+0xe7/0x3a0 kernel/sched/core.c:6960 schedule_timeout+0x257/0x290 kernel/time/sleep_timeout.c:75 do_wait_for_common kernel/sched/completion.c:100 [inline] __wait_for_common+0x2fc/0x4e0 kernel/sched/completion.c:121 io_wq_exit_workers io_uring/io-wq.c:1328 [inline] io_wq_put_and_exit+0x271/0x8a0 io_uring/io-wq.c:1356 io_uring_clean_tctx+0x10d/0x190 io_uring/tctx.c:203 io_uring_cancel_generic+0x69c/0x9a0 io_uring/cancel.c:651 io_uring_files_cancel include/linux/io_uring.h:19 [inline] do_exit+0x2ce/0x2bd0 kernel/exit.c:911 do_group_exit+0xd3/0x2a0 kernel/exit.c:1112 get_signal+0x2671/0x26d0 kernel/signal.c:3034 arch_do_signal_or_restart+0x8f/0x7e0 arch/x86/kernel/signal.c:337 __exit_to_user_mode_loop kernel/entry/common.c:41 [inline] exit_to_user_mode_loop+0x8c/0x540 kernel/entry/common.c:75 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline] syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline] syscall_exit_to_user_mode_work include/linux/entry-common.h:159 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:194 [inline] do_syscall_64+0x4ee/0xf80 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa02738f749 RSP: 002b:00007fa0281ae0e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: fffffffffffffe00 RBX: 00007fa0275e6098 RCX: 00007fa02738f749 RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa0275e6098 RBP: 00007fa0275e6090 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa0275e6128 R14: 00007fff14e4fcb0 R15: 00007fff14e4fd98 There's really nothing wrong here, outside of processing these reads will take a LONG time. However, we can speed up the exit by checking the IO_WQ_BIT_EXIT inside the io_worker_handle_work() loop, as syzbot will exit the ring after queueing up all of these reads. Then once the first item is processed, io-wq will simply cancel the rest. That should avoid syzbot running into this complaint again. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/68a2decc.050a0220.e29e5.0099.GAE@google.com/ Reported-by: syzbot+4eb282331cab6d5b6588@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-23io_uring: move local task_work in exit cancel loopMing Lei1-4/+4
commit da579f05ef0faada3559e7faddf761c75cdf85e1 upstream. With IORING_SETUP_DEFER_TASKRUN, task work is queued to ctx->work_llist (local work) rather than the fallback list. During io_ring_exit_work(), io_move_task_work_from_local() was called once before the cancel loop, moving work from work_llist to fallback_llist. However, task work can be added to work_llist during the cancel loop itself. There are two cases: 1) io_kill_timeouts() is called from io_uring_try_cancel_requests() to cancel pending timeouts, and it adds task work via io_req_queue_tw_complete() for each cancelled timeout: 2) URING_CMD requests like ublk can be completed via io_uring_cmd_complete_in_task() from ublk_queue_rq() during canceling, given ublk request queue is only quiesced when canceling the 1st uring_cmd. Since io_allowed_defer_tw_run() returns false in io_ring_exit_work() (kworker != submitter_task), io_run_local_work() is never invoked, and the work_llist entries are never processed. This causes io_uring_try_cancel_requests() to loop indefinitely, resulting in 100% CPU usage in kworker threads. Fix this by moving io_move_task_work_from_local() inside the cancel loop, ensuring any work on work_llist is moved to fallback before each cancel attempt. Cc: stable@vger.kernel.org Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08io_uring: fix min_wait wakeups for SQPOLLJens Axboe1-0/+3
Commit e15cb2200b934e507273510ba6bc747d5cde24a3 upstream. Using min_wait, two timeouts are given: 1) The min_wait timeout, within which up to 'wait_nr' events are waited for. 2) The overall long timeout, which is entered if no events are generated in the min_wait window. If the min_wait has expired, any event being posted must wake the task. For SQPOLL, that isn't the case, as it won't trigger the io_has_work() condition, as it will have already processed the task_work that happened when an event was posted. This causes any event to trigger post the min_wait to not always cause the waiting application to wakeup, and instead it will wait until the overall timeout has expired. This can be shown in a test case that has a 1 second min_wait, with a 5 second overall wait, even if an event triggers after 1.5 seconds: axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring info: MIN_TIMEOUT supported: true, features: 0x3ffff info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4 info: 1 cqes in 5000.2ms where the expected result should be: axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring info: MIN_TIMEOUT supported: true, features: 0x3ffff info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4 info: 1 cqes in 1500.3ms When the min_wait timeout triggers, reset the number of completions needed to wake the task. This should ensure that any future events will wake the task, regardless of how many events it originally wanted to wait for. Reported-by: Tip ten Brink <tip@tenbrinkmeijs.com> Cc: stable@vger.kernel.org Fixes: 1100c4a2656d ("io_uring: add support for batch wait timeout") Link: https://github.com/axboe/liburing/issues/1477 Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit e15cb2200b934e507273510ba6bc747d5cde24a3) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08io_uring/poll: correctly handle io_poll_add() return value on updateJens Axboe1-2/+7
Commit 84230ad2d2afbf0c44c32967e525c0ad92e26b4e upstream. When the core of io_uring was updated to handle completions consistently and with fixed return codes, the POLL_REMOVE opcode with updates got slightly broken. If a POLL_ADD is pending and then POLL_REMOVE is used to update the events of that request, if that update causes the POLL_ADD to now trigger, then that completion is lost and a CQE is never posted. Additionally, ensure that if an update does cause an existing POLL_ADD to complete, that the completion value isn't always overwritten with -ECANCELED. For that case, whatever io_poll_add() set the value to should just be retained. Cc: stable@vger.kernel.org Fixes: 97b388d70b53 ("io_uring: handle completions in the core") Reported-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com Tested-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08io_uring: fix filename leak in __io_openat_prep()Prithvi Tambewagh1-1/+1
commit b14fad555302a2104948feaff70503b64c80ac01 upstream. __io_openat_prep() allocates a struct filename using getname(). However, for the condition of the file being installed in the fixed file table as well as having O_CLOEXEC flag set, the function returns early. At that point, the request doesn't have REQ_F_NEED_CLEANUP flag set. Due to this, the memory for the newly allocated struct filename is not cleaned up, causing a memory leak. Fix this by setting the REQ_F_NEED_CLEANUP for the request just after the successful getname() call, so that when the request is torn down, the filename will be cleaned up, along with other resources needing cleanup. Reported-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=00e61c43eb5e4740438f Tested-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com> Fixes: b9445598d8c6 ("io_uring: openat directly into fixed fd table") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24io_uring/napi: fix io_napi_entry RCU accessesOlivier Langlois1-7/+12
[Upstream commit 45b3941d09d13b3503309be1f023b83deaf69b4d ] correct 3 RCU structures modifications that were not using the RCU functions to make their update. Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: lvc-project@linuxtesting.org Signed-off-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/9f53b5169afa8c7bf3665a0b19dc2f7061173530.1728828877.git.olivier@trillion01.com Signed-off-by: Jens Axboe <axboe@kernel.dk> [Stepan Artuhov: cherry-picked a commit] Signed-off-by: Stepan Artuhov <s.artuhov@tssltd.ru> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13io_uring/zctx: check chained notif contextsPavel Begunkov1-0/+5
[ Upstream commit ab3ea6eac5f45669b091309f592c4ea324003053 ] Send zc only links ubuf_info for requests coming from the same context. There are some ambiguous syz reports, so let's check the assumption on notification completion. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fd527d8638203fe0f1c5ff06ff2e1d8fd68f831b.1755179962.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-29io_uring/sqpoll: be smarter on when to update the stime usageJens Axboe1-11/+32
Commit a94e0657269c5b8e1a90b17aa2c048b3d276e16d upstream. The current approach is a bit naive, and hence calls the time querying way too often. Only start the "doing work" timer when there's actual work to do, and then use that information to terminate (and account) the work time once done. This greatly reduces the frequency of these calls, when they cannot have changed anyway. Running a basic random reader that is setup to use SQPOLL, a profile before this change shows these as the top cycle consumers: + 32.60% iou-sqp-1074 [kernel.kallsyms] [k] thread_group_cputime_adjusted + 19.97% iou-sqp-1074 [kernel.kallsyms] [k] thread_group_cputime + 12.20% io_uring io_uring [.] submitter_uring_fn + 4.13% iou-sqp-1074 [kernel.kallsyms] [k] getrusage + 2.45% iou-sqp-1074 [kernel.kallsyms] [k] io_submit_sqes + 2.18% iou-sqp-1074 [kernel.kallsyms] [k] __pi_memset_generic + 2.09% iou-sqp-1074 [kernel.kallsyms] [k] cputime_adjust and after this change, top of profile looks as follows: + 36.23% io_uring io_uring [.] submitter_uring_fn + 23.26% iou-sqp-819 [kernel.kallsyms] [k] io_sq_thread + 10.14% iou-sqp-819 [kernel.kallsyms] [k] io_sq_tw + 6.52% iou-sqp-819 [kernel.kallsyms] [k] tctx_task_work_run + 4.82% iou-sqp-819 [kernel.kallsyms] [k] nvme_submit_cmds.part.0 + 2.91% iou-sqp-819 [kernel.kallsyms] [k] io_submit_sqes [...] 0.02% iou-sqp-819 [kernel.kallsyms] [k] cputime_adjust where it's spending the cycles on things that actually matter. Reported-by: Fengnan Chang <changfengnan@bytedance.com> Cc: stable@vger.kernel.org Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29io_uring/sqpoll: switch away from getrusage() for CPU accountingJens Axboe3-18/+23
Commit 8ac9b0d33e5c0a995338ee5f25fe1b6ff7d97f65 upstream. getrusage() does a lot more than what the SQPOLL accounting needs, the latter only cares about (and uses) the stime. Rather than do a full RUSAGE_SELF summation, just query the used stime instead. Cc: stable@vger.kernel.org Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29io_uring: correct __must_hold annotation in io_install_fixed_fileAlok Tiwari1-1/+1
[ Upstream commit c5efc6a0b3940381d67887302ddb87a5cf623685 ] The __must_hold annotation references &req->ctx->uring_lock, but req is not in scope in io_install_fixed_file. This change updates the annotation to reference the correct ctx->uring_lock. improving code clarity. Fixes: f110ed8498af ("io_uring: split out fixed file installation and removal") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23Revert "io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()"Jens Axboe1-1/+1
Commit 927069c4ac2cd1a37efa468596fb5b8f86db9df0 upstream. This reverts commit 90bfb28d5fa8127a113a140c9791ea0b40ab156a. Kevin reports that this commit causes an issue for him with LVM snapshots, most likely because of turning off NOWAIT support while a snapshot is being created. This makes -EOPNOTSUPP bubble back through the completion handler, where io_uring read/write handling should just retry it. Reinstate the previous check removed by the referenced commit. Cc: stable@vger.kernel.org Fixes: 90bfb28d5fa8 ("io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()") Reported-by: Salvatore Bonaccorso <carnil@debian.org> Reported-by: Kevin Lumik <kevin@xf.ee> Link: https://lore.kernel.org/io-uring/cceb723c-051b-4de2-9a4c-4aa82e1619ee@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15io_uring/waitid: always prune wait queue entry in io_waitid_wait()Jens Axboe1-1/+2
commit 2f8229d53d984c6a05b71ac9e9583d4354e3b91f upstream. For a successful return, always remove our entry from the wait queue entry list. Previously this was skipped if a cancelation was in progress, but this can race with another invocation of the wait queue entry callback. Cc: stable@vger.kernel.org Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support") Reported-by: syzbot+b9e83021d9c642a33d8c@syzkaller.appspotmail.com Tested-by: syzbot+b9e83021d9c642a33d8c@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/68e5195e.050a0220.256323.001f.GAE@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25io_uring: fix incorrect io_kiocb reference in io_link_skbYang Xiuwei1-1/+1
[ Upstream commit 2c139a47eff8de24e3350dadb4c9d5e3426db826 ] In io_link_skb function, there is a bug where prev_notif is incorrectly assigned using 'nd' instead of 'prev_nd'. This causes the context validation check to compare the current notification with itself instead of comparing it with the previous notification. Fix by using the correct prev_nd parameter when obtaining prev_notif. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 6fe4220912d19 ("io_uring/notif: implement notification stacking") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-25io_uring/kbuf: drop WARN_ON_ONCE() from incremental length checkJens Axboe1-1/+1
Partially based on commit 98b6fa62c84f2e129161e976a5b9b3cb4ccd117b upstream. This can be triggered by userspace, so just drop it. The condition is appropriately handled. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25io_uring/msg_ring: kill alloc_cache for io_kiocb allocationsJens Axboe2-27/+2
Commit df8922afc37aa2111ca79a216653a629146763ad upstream. A recent commit: fc582cd26e88 ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU") fixed an issue with not deferring freeing of io_kiocb structs that msg_ring allocates to after the current RCU grace period. But this only covers requests that don't end up in the allocation cache. If a request goes into the alloc cache, it can get reused before it is sane to do so. A recent syzbot report would seem to indicate that there's something there, however it may very well just be because of the KASAN poisoning that the alloc_cache handles manually. Rather than attempt to make the alloc_cache sane for that use case, just drop the usage of the alloc_cache for msg_ring request payload data. Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Link: https://lore.kernel.org/io-uring/68cc2687.050a0220.139b6.0005.GAE@google.com/ Reported-by: syzbot+baa2e0f4e02df602583e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25io_uring: include dying ring in task_work "should cancel" stateJens Axboe5-7/+9
Commit 3539b1467e94336d5854ebf976d9627bfb65d6c3 upstream. When running task_work for an exiting task, rather than perform the issue retry attempt, the task_work is canceled. However, this isn't done for a ring that has been closed. This can lead to requests being successfully completed post the ring being closed, which is somewhat confusing and surprising to an application. Rather than just check the task exit state, also include the ring ref state in deciding whether or not to terminate a given request when run from task_work. Cc: stable@vger.kernel.org # 6.1+ Link: https://github.com/axboe/liburing/discussions/1459 Reported-by: Benedek Thaler <thaler@thaler.hu> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25io_uring: backport io_should_terminate_tw()Jens Axboe5-6/+17
Parts of commit b6f58a3f4aa8dba424356c7a69388a81f4459300 upstream. Backport io_should_terminate_tw() helper to judge whether task_work should be run or terminated. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25io_uring/cmd: let cmds to know about dying taskPavel Begunkov1-1/+5
Commit df3b8ca604f224eb4cd51669416ad4d607682273 upstream. When the taks that submitted a request is dying, a task work for that request might get run by a kernel thread or even worse by a half dismantled task. We can't just cancel the task work without running the callback as the cmd might need to do some clean up, so pass a flag instead. If set, it's not safe to access any task resources and the callback is expected to cancel the cmd ASAP. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCUJens Axboe1-2/+2
Commit fc582cd26e888b0652bc1494f252329453fd3b23 upstream. syzbot reports that defer/local task_work adding via msg_ring can hit a request that has been freed: CPU: 1 UID: 0 PID: 19356 Comm: iou-wrk-19354 Not tainted 6.16.0-rc4-syzkaller-00108-g17bbde2e1716 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xd2/0x2b0 mm/kasan/report.c:521 kasan_report+0x118/0x150 mm/kasan/report.c:634 io_req_local_work_add io_uring/io_uring.c:1184 [inline] __io_req_task_work_add+0x589/0x950 io_uring/io_uring.c:1252 io_msg_remote_post io_uring/msg_ring.c:103 [inline] io_msg_data_remote io_uring/msg_ring.c:133 [inline] __io_msg_ring_data+0x820/0xaa0 io_uring/msg_ring.c:151 io_msg_ring_data io_uring/msg_ring.c:173 [inline] io_msg_ring+0x134/0xa00 io_uring/msg_ring.c:314 __io_issue_sqe+0x17e/0x4b0 io_uring/io_uring.c:1739 io_issue_sqe+0x165/0xfd0 io_uring/io_uring.c:1762 io_wq_submit_work+0x6e9/0xb90 io_uring/io_uring.c:1874 io_worker_handle_work+0x7cd/0x1180 io_uring/io-wq.c:642 io_wq_worker+0x42f/0xeb0 io_uring/io-wq.c:696 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> which is supposed to be safe with how requests are allocated. But msg ring requests alloc and free on their own, and hence must defer freeing to a sane time. Add an rcu_head and use kfree_rcu() in both spots where requests are freed. Only the one in io_msg_tw_complete() is strictly required as it has been visible on the other ring, but use it consistently in the other spot as well. This should not cause any other issues outside of KASAN rightfully complaining about it. Link: https://lore.kernel.org/io-uring/686cd2ea.a00a0220.338033.0007.GAE@google.com/ Reported-by: syzbot+54cbbfb4db9145d26fc2@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Fixes: 0617bb500bfa ("io_uring/msg_ring: improve handling of target CQE posting") Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit fc582cd26e888b0652bc1494f252329453fd3b23) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28io_uring/futex: ensure io_futex_wait() cleans up properly on failureJens Axboe1-0/+3
commit 508c1314b342b78591f51c4b5dadee31a88335df upstream. The io_futex_data is allocated upfront and assigned to the io_kiocb async_data field, but the request isn't marked with REQ_F_ASYNC_DATA at that point. Those two should always go together, as the flag tells io_uring whether the field is valid or not. Additionally, on failure cleanup, the futex handler frees the data but does not clear ->async_data. Clear the data and the flag in the error path as well. Thanks to Trend Micro Zero Day Initiative and particularly ReDress for reporting this. Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28io_uring/net: commit partial buffers on retryJens Axboe1-12/+15
commit 41b70df5b38bc80967d2e0ed55cc3c3896bba781 upstream. Ring provided buffers are potentially only valid within the single execution context in which they were acquired. io_uring deals with this and invalidates them on retry. But on the networking side, if MSG_WAITALL is set, or if the socket is of the streaming type and too little was processed, then it will hang on to the buffer rather than recycle or commit it. This is problematic for two reasons: 1) If someone unregisters the provided buffer ring before a later retry, then the req->buf_list will no longer be valid. 2) If multiple sockers are using the same buffer group, then multiple receives can consume the same memory. This can cause data corruption in the application, as either receive could land in the same userspace buffer. Fix this by disallowing partial retries from pinning a provided buffer across multiple executions, if ring provided buffers are used. Cc: stable@vger.kernel.org Reported-by: pt x <superman.xpt@gmail.com> Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20io_uring/rw: cast rw->flags assignment to rwf_tJens Axboe1-1/+1
commit 825aea662b492571877b32aeeae13689fd9fbee4 upstream. kernel test robot reports that a recent change of the sqe->rw_flags field throws a sparse warning on 32-bit archs: >> io_uring/rw.c:291:19: sparse: sparse: incorrect type in assignment (different base types) @@ expected restricted __kernel_rwf_t [usertype] flags @@ got unsigned int @@ io_uring/rw.c:291:19: sparse: expected restricted __kernel_rwf_t [usertype] flags io_uring/rw.c:291:19: sparse: got unsigned int Force cast it to rwf_t to silence that new sparse warning. Fixes: cf73d9970ea4 ("io_uring: don't use int for ABI") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507032211.PwSNPNSP-lkp@intel.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-24io_uring/poll: fix POLLERR handlingPavel Begunkov2-6/+8
commit c7cafd5b81cc07fb402e3068d134c21e60ea688c upstream. 8c8492ca64e7 ("io_uring/net: don't retry connect operation on EPOLLERR") is a little dirty hack that 1) wrongfully assumes that POLLERR equals to a failed request, which breaks all POLLERR users, e.g. all error queue recv interfaces. 2) deviates the connection request behaviour from connect(2), and 3) racy and solved at a wrong level. Nothing can be done with 2) now, and 3) is beyond the scope of the patch. At least solve 1) by moving the hack out of generic poll handling into io_connect(). Cc: stable@vger.kernel.org Fixes: 8c8492ca64e79 ("io_uring/net: don't retry connect operation on EPOLLERR") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dc89036388d602ebd84c28e5042e457bdfc952b.1752682444.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17io_uring: make fallocate be hashed workFengnan Chang1-0/+1
[ Upstream commit 88a80066af1617fab444776135d840467414beb6 ] Like ftruncate and write, fallocate operations on the same file cannot be executed in parallel, so it is better to make fallocate be hashed work. Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> Link: https://lore.kernel.org/r/20250623110218.61490-1-changfengnan@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-06io_uring/kbuf: flag partial buffer mappingsJens Axboe3-8/+17
A previous commit aborted mapping more for a non-incremental ring for bundle peeking, but depending on where in the process this peeking happened, it would not necessarily prevent a retry by the user. That can create gaps in the received/read data. Add struct buf_sel_arg->partial_map, which can pass this information back. The networking side can then map that to internal state and use it to gate retry as well. Since this necessitates a new flag, change io_sr_msg->retry to a retry_flags member, and store both the retry and partial map condition in there. Cc: stable@vger.kernel.org Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks") Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 178b8ff66ff827c41b4fa105e9aabb99a0b5c537) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06io_uring/net: mark iov as dynamically allocated even for single segmentsJens Axboe1-5/+6
Commit 9a709b7e98e6fa51600b5f2d24c5068efa6d39de upstream. A bigger array of vecs could've been allocated, but io_ring_buffers_peek() still decided to cap the mapped range depending on how much data was available. Hence don't rely on the segment count to know if the request should be marked as needing cleanup, always check upfront if the iov array is different than the fast_iov array. Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06io_uring/net: always use current transfer count for buffer putJens Axboe1-1/+1
A previous fix corrected the retry condition for when to continue a current bundle, but it missed that the current (not the total) transfer count also applies to the buffer put. If not, then for incrementally consumed buffer rings repeated completions on the same request may end up over consuming. Reported-by: Roy Tang (ErgoniaTrading) <royonia@ergonia.io> Cc: stable@vger.kernel.org Fixes: 3a08988123c8 ("io_uring/net: only retry recv bundle for a full transfer") Link: https://github.com/axboe/liburing/issues/1423 Signed-off-by: Jens Axboe <axboe@kernel.dk> (cherry picked from commit 51a4598ad5d9eb6be4ec9ba65bbfdf0ac302eb2e) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06io_uring/net: only consider msg_inq if larger than 1Jens Axboe1-2/+2
Commit 2c7f023219966777be0687e15b57689894304cd3 upstream. Currently retry and general validity of msg_inq is gated on it being larger than zero, but it's entirely possible for this to be slightly inaccurate. In particular, if FIN is received, it'll return 1. Just use larger than 1 as the check. This covers both the FIN case, and at the same time, it doesn't make much sense to retry a recv immediately if there's even just a single 1 byte of valid data in the socket. Leave the SOCK_NONEMPTY flagging when larger than 0 still, as an app may use that for the final receive. Cc: stable@vger.kernel.org Reported-by: Christian Mazakas <christian.mazakas@gmail.com> Fixes: 7c71a0af81ba ("io_uring/net: improve recv bundles") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06io_uring/net: only retry recv bundle for a full transferJens Axboe1-4/+10
Commit 3a08988123c868dbfdd054541b1090fb891fa49e upstream. If a shorter than assumed transfer was seen, a partial buffer will have been filled. For that case it isn't sane to attempt to fill more into the bundle before posting a completion, as that will cause a gap in the received data. Check if the iterator has hit zero and only allow to continue a bundle operation if that is the case. Also ensure that for putting finished buffers, only the current transfer is accounted. Otherwise too many buffers may be put for a short transfer. Link: https://github.com/axboe/liburing/issues/1409 Cc: stable@vger.kernel.org Fixes: 7c71a0af81ba ("io_uring/net: improve recv bundles") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06io_uring/net: improve recv bundlesJens Axboe1-0/+18
Commit 7c71a0af81ba72de9b2c501065e4e718aba9a271 upstream. Current recv bundles are only supported for multishot receives, and additionally they also always post at least 2 CQEs if more data is available than what a buffer will hold. This happens because the initial bundle recv will do a single buffer, and then do the rest of what is in the socket as a followup receive. As shown in a test program, if 1k buffers are available and 32k is available to receive in the socket, you'd get the following completions: bundle=1, mshot=0 cqe res 1024 cqe res 1024 [...] cqe res 1024 bundle=1, mshot=1 cqe res 1024 cqe res 31744 where bundle=1 && mshot=0 will post 32 1k completions, and bundle=1 && mshot=1 will post a 1k completion and then a 31k completion. To support bundle recv without multishot, it's possible to simply retry the recv immediately and post a single completion, rather than split it into two completions. With the below patch, the same test looks as follows: bundle=1, mshot=0 cqe res 32768 bundle=1, mshot=1 cqe res 32768 where mshot=0 works fine for bundles, and both of them post just a single 32k completion rather than split it into separate completions. Posting fewer completions is always a nice win, and not needing multishot for proper bundle efficiency is nice for cases that can't necessarily use multishot. Reported-by: Norman Maurer <norman_maurer@apple.com> Link: https://lore.kernel.org/r/184f9f92-a682-4205-a15d-89e18f664502@kernel.dk Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06io_uring/rsrc: don't rely on user vaddr alignmentPavel Begunkov2-1/+5
Commit 3a3c6d61577dbb23c09df3e21f6f9eda1ecd634b upstream. There is no guaranteed alignment for user pointers, however the calculation of an offset of the first page into a folio after coalescing uses some weird bit mask logic, get rid of it. Cc: stable@vger.kernel.org Reported-by: David Hildenbrand <david@redhat.com> Fixes: a8edbb424b139 ("io_uring/rsrc: enable multi-hugepage buffer coalescing") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/e387b4c78b33f231105a601d84eefd8301f57954.1750771718.git.asml.silence@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06io_uring/rsrc: fix folio unpinningPavel Begunkov1-4/+9
Commit 5afb4bf9fc62d828647647ec31745083637132e4 upstream. syzbot complains about an unmapping failure: [ 108.070381][ T14] kernel BUG at mm/gup.c:71! [ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025 [ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work [ 108.174205][ T14] Call trace: [ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P) [ 108.178138][ T14] unpin_user_page+0x80/0x10c [ 108.180189][ T14] io_release_ubuf+0x84/0xf8 [ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c [ 108.184345][ T14] io_rsrc_data_free+0x148/0x298 [ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0 [ 108.188991][ T14] io_ring_ctx_free+0x48/0x480 [ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8 [ 108.193207][ T14] process_one_work+0x7e8/0x155c [ 108.195431][ T14] worker_thread+0x958/0xed8 [ 108.197561][ T14] kthread+0x5fc/0x75c [ 108.199362][ T14] ret_from_fork+0x10/0x20 We can pin a tail page of a folio, but then io_uring will try to unpin the head page of the folio. While it should be fine in terms of keeping the page actually alive, mm folks say it's wrong and triggers a debug warning. Use unpin_user_folio() instead of unpin_user_page*. Cc: stable@vger.kernel.org Debugged-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+1d335893772467199ab6@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/683f1551.050a0220.55ceb.0017.GAE@google.com Fixes: a8edbb424b139 ("io_uring/rsrc: enable multi-hugepage buffer coalescing") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/a28b0f87339ac2acf14a645dad1e95bbcbf18acd.1750771718.git.asml.silence@gmail.com/ [axboe: adapt to current tree, massage commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06io_uring: fix potential page leak in io_sqe_buffer_register()Penglei Jiang1-4/+5
Commit e1c75831f682eef0f68b35723437146ed86070b1 upstream. If allocation of the 'imu' fails, then the existing pages aren't unpinned in the error path. This is mostly a theoretical issue, requiring fault injection to hit. Move unpin_user_pages() to unified error handling to fix the page leak issue. Fixes: d8c2237d0aa9 ("io_uring: add io_pin_pages() helper") Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/r/20250617165644.79165-1-superman.xpt@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27io_uring/sqpoll: don't put task_struct on tctx setup failureJens Axboe1-4/+1
[ Upstream commit f2320f1dd6f6f82cb2c7aff23a12bab537bdea89 ] A recent commit moved the error handling of sqpoll thread and tctx failures into the thread itself, as part of fixing an issue. However, it missed that tctx allocation may also fail, and that io_sq_offload_create() does its own error handling for the task_struct in that case. Remove the manual task putting in io_sq_offload_create(), as io_sq_thread() will notice that the tctx did not get setup and hence it should put itself and exit. Reported-by: syzbot+763e12bbf004fb1062e4@syzkaller.appspotmail.com Fixes: ac0b8b327a56 ("io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27io_uring: fix task leak issue in io_wq_create()Penglei Jiang1-1/+3
commit 89465d923bda180299e69ee2800aab84ad0ba689 upstream. Add missing put_task_struct() in the error path Cc: stable@vger.kernel.org Fixes: 0f8baa3c9802 ("io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()") Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/r/20250615163906.2367-1-superman.xpt@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27io_uring/kbuf: don't truncate end buffer for multiple buffer peeksJens Axboe1-1/+4
commit 26ec15e4b0c1d7b25214d9c0be1d50492e2f006c upstream. If peeking a bunch of buffers, normally io_ring_buffers_peek() will truncate the end buffer. This isn't optimal as presumably more data will be arriving later, and hence it's better to stop with the last full buffer rather than truncate the end buffer. Cc: stable@vger.kernel.org Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Reported-by: Christian Mazakas <christian.mazakas@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27io_uring/kbuf: account ring io_buffer_list memoryPavel Begunkov1-1/+1
commit 475a8d30371604a6363da8e304a608a5959afc40 upstream. Follow the non-ringed pbuf struct io_buffer_list allocations and account it against the memcg. There is low chance of that being an actual problem as ring provided buffer should either pin user memory or allocate it, which is already accounted. Cc: stable@vger.kernel.org # 6.1 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3985218b50d341273cafff7234e1a7e6d0db9808.1747150490.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27io_uring: account drain memory to cgroupPavel Begunkov1-1/+1
commit f979c20547e72568e3c793bc92c7522bc3166246 upstream. Account drain allocations against memcg. It's not a big problem as each such allocation is paired with a request, which is accounted, but it's nicer to follow the limits more closely. Cc: stable@vger.kernel.org # 6.1 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-19io_uring: consistently use rcu semantics with sqpoll threadKeith Busch4-15/+38
[ Upstream commit c538f400fae22725580842deb2bef546701b64bd ] The sqpoll thread is dereferenced with rcu read protection in one place, so it needs to be annotated as an __rcu type, and should consistently use rcu helpers for access and assignment to make sparse happy. Since most of the accesses occur under the sqd->lock, we can use rcu_dereference_protected() without declaring an rcu read section. Provide a simple helper to get the thread from a locked context. Fixes: ac0b8b327a5677d ("io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250611205343.1821117-1-kbusch@meta.com [axboe: fold in fix for register.c] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()Penglei Jiang2-7/+14
[ Upstream commit ac0b8b327a5677dc6fecdf353d808161525b1ff0 ] syzbot reports: BUG: KASAN: slab-use-after-free in getrusage+0x1109/0x1a60 Read of size 8 at addr ffff88810de2d2c8 by task a.out/304 CPU: 0 UID: 0 PID: 304 Comm: a.out Not tainted 6.16.0-rc1 #1 PREEMPT(voluntary) Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x53/0x70 print_report+0xd0/0x670 ? __pfx__raw_spin_lock_irqsave+0x10/0x10 ? getrusage+0x1109/0x1a60 kasan_report+0xce/0x100 ? getrusage+0x1109/0x1a60 getrusage+0x1109/0x1a60 ? __pfx_getrusage+0x10/0x10 __io_uring_show_fdinfo+0x9fe/0x1790 ? ksys_read+0xf7/0x1c0 ? do_syscall_64+0xa4/0x260 ? vsnprintf+0x591/0x1100 ? __pfx___io_uring_show_fdinfo+0x10/0x10 ? __pfx_vsnprintf+0x10/0x10 ? mutex_trylock+0xcf/0x130 ? __pfx_mutex_trylock+0x10/0x10 ? __pfx_show_fd_locks+0x10/0x10 ? io_uring_show_fdinfo+0x57/0x80 io_uring_show_fdinfo+0x57/0x80 seq_show+0x38c/0x690 seq_read_iter+0x3f7/0x1180 ? inode_set_ctime_current+0x160/0x4b0 seq_read+0x271/0x3e0 ? __pfx_seq_read+0x10/0x10 ? __pfx__raw_spin_lock+0x10/0x10 ? __mark_inode_dirty+0x402/0x810 ? selinux_file_permission+0x368/0x500 ? file_update_time+0x10f/0x160 vfs_read+0x177/0xa40 ? __pfx___handle_mm_fault+0x10/0x10 ? __pfx_vfs_read+0x10/0x10 ? mutex_lock+0x81/0xe0 ? __pfx_mutex_lock+0x10/0x10 ? fdget_pos+0x24d/0x4b0 ksys_read+0xf7/0x1c0 ? __pfx_ksys_read+0x10/0x10 ? do_user_addr_fault+0x43b/0x9c0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f0f74170fc9 Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 8 RSP: 002b:00007fffece049e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0f74170fc9 RDX: 0000000000001000 RSI: 00007fffece049f0 RDI: 0000000000000004 RBP: 00007fffece05ad0 R08: 0000000000000000 R09: 00007fffece04d90 R10: 0000000000000000 R11: 0000000000000206 R12: 00005651720a1100 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Allocated by task 298: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 __kasan_slab_alloc+0x6e/0x70 kmem_cache_alloc_node_noprof+0xe8/0x330 copy_process+0x376/0x5e00 create_io_thread+0xab/0xf0 io_sq_offload_create+0x9ed/0xf20 io_uring_setup+0x12b0/0x1cc0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 22: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x37/0x50 kmem_cache_free+0xc4/0x360 rcu_core+0x5ff/0x19f0 handle_softirqs+0x18c/0x530 run_ksoftirqd+0x20/0x30 smpboot_thread_fn+0x287/0x6c0 kthread+0x30d/0x630 ret_from_fork+0xef/0x1a0 ret_from_fork_asm+0x1a/0x30 Last potentially related work creation: kasan_save_stack+0x33/0x60 kasan_record_aux_stack+0x8c/0xa0 __call_rcu_common.constprop.0+0x68/0x940 __schedule+0xff2/0x2930 __cond_resched+0x4c/0x80 mutex_lock+0x5c/0xe0 io_uring_del_tctx_node+0xe1/0x2b0 io_uring_clean_tctx+0xb7/0x160 io_uring_cancel_generic+0x34e/0x760 do_exit+0x240/0x2350 do_group_exit+0xab/0x220 __x64_sys_exit_group+0x39/0x40 x64_sys_call+0x1243/0x1840 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff88810de2cb00 which belongs to the cache task_struct of size 3712 The buggy address is located 1992 bytes inside of freed 3712-byte region [ffff88810de2cb00, ffff88810de2d980) which is caused by the task_struct pointed to by sq->thread being released while it is being used in the function __io_uring_show_fdinfo(). Holding ctx->uring_lock does not prevent ehre relase or exit of sq->thread. Fix this by assigning and looking up ->thread under RCU, and grabbing a reference to the task_struct. This ensures that it cannot get released while fdinfo is using it. Reported-by: syzbot+531502bbbe51d2f769f4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/682b06a5.a70a0220.3849cf.00b3.GAE@google.com Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/r/20250610171801.70960-1-superman.xpt@gmail.com [axboe: massage commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29io_uring: fix overflow resched cqe reorderingPavel Begunkov1-0/+1
[ Upstream commit a7d755ed9ce9738af3db602eb29d32774a180bc7 ] Leaving the CQ critical section in the middle of a overflow flushing can cause cqe reordering since the cache cq pointers are reset and any new cqe emitters that might get called in between are not going to be forced into io_cqe_cache_refill(). Fixes: eac2ca2d682f9 ("io_uring: check if we need to reschedule during overflow flush") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/90ba817f1a458f091f355f407de1c911d2b93bbf.1747483784.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>