path: root/drivers/block
Age  Commit message  Author  Files  Lines
2025-05-18  loop: Add sanity check for read/write_iter  (Lizhi Xu, 1 file, -0/+23)
[ Upstream commit f5c84eff634ba003326aa034c414e2a9dcb7c6a7 ] Some file systems do not support read_iter/write_iter, such as selinuxfs in this issue. So before calling them, first confirm that the interface is supported and only then call it. This is relevant because vfs_iter_read/write have this check, and removal of their use allowed syzbot to hit this issue. Fixes: f2fed441c69b ("loop: stop using vfs_iter_{read,write} for buffered I/O") Reported-by: syzbot+6af973a3b8dfd2faefdc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6af973a3b8dfd2faefdc Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250428143626.3318717-1-lizhi.xu@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
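The check described above boils down to refusing a backing file whose file operations lack read_iter or write_iter. A minimal userspace analogue of that guard pattern (hypothetical ops table and names, not the actual loop driver code):

#include <stdio.h>
#include <stddef.h>

/* Hypothetical analogue of struct file_operations: only some "filesystems"
 * provide the iter-based callbacks. */
struct file_ops {
	long (*read_iter)(const char *buf, size_t len);
	long (*write_iter)(const char *buf, size_t len);
};

static long dummy_read_iter(const char *buf, size_t len)
{
	(void)buf;
	return (long)len;
}

/* Backing file whose filesystem (like selinuxfs in the report) lacks write_iter. */
static const struct file_ops selinuxfs_like_ops = {
	.read_iter  = dummy_read_iter,
	.write_iter = NULL,
};

/* The sanity check: refuse the file up front instead of calling a NULL method. */
static int loop_check_backing_ops(const struct file_ops *ops)
{
	if (!ops->read_iter || !ops->write_iter)
		return -1; /* would be -EINVAL in the kernel */
	return 0;
}

int main(void)
{
	printf("backing file accepted: %s\n",
	       loop_check_backing_ops(&selinuxfs_like_ops) == 0 ? "yes" : "no");
	return 0;
}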
2025-05-18  loop: factor out a loop_assign_backing_file helper  (Christoph Hellwig, 1 file, -10/+10)
[ Upstream commit d278164832618bf2775c6a89e6434e2633de1eed ] Split the code for setting up a backing file into a helper in preparation of adding more code to this path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: f5c84eff634b ("loop: Add sanity check for read/write_iter") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-09  ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd  (Ming Lei, 1 file, -6/+21)
[ Upstream commit f40139fde5278d81af3227444fd6e76a76b9506d ] ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but we may have scheduled task work via io_uring_cmd_complete_in_task() for dispatching the request, and then a kernel crash can be triggered. Fix it by not trying to cancel the command if the ublk block request is started. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Reported-by: Jared Holzman <jholzman@nvidia.com> Tested-by: Jared Holzman <jholzman@nvidia.com> Closes: https://lore.kernel.org/linux-block/d2179120-171b-47ba-b664-23242981ef19@nvidia.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09  ublk: simplify aborting ublk request  (Ming Lei, 1 file, -62/+20)
[ Upstream commit e63d2228ef831af36f963b3ab8604160cfff84c1 ] Now that ublk_abort_queue() is moved to the ublk char device release handler, and meanwhile our request queue is "quiesced" because either ->canceling was set from the uring_cmd cancel function or all IOs are inflight and can't be completed by the ublk server, things become much easier: - all uring_cmd are done, so we needn't mark io as UBLK_IO_FLAG_ABORTED for handling completion from uring_cmd - the ublk char device is closed, no one can hold an IO request reference any more, so we can simply complete this request or requeue it for ublk_nosrv_should_reissue_outstanding. Reviewed-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-8-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09  ublk: remove __ublk_quiesce_dev()  (Ming Lei, 1 file, -17/+2)
[ Upstream commit 736b005b413a172670711ee17cab3c8ccab83223 ] Remove __ublk_quiesce_dev() and open-code updating the device state to QUIESCED. We needn't drain inflight requests in __ublk_quiesce_dev() any more, because all inflight requests are aborted in the ublk char device release handler. Also we needn't set ->canceling in __ublk_quiesce_dev() any more because it is done unconditionally now in ublk_ch_release(). Reviewed-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09  ublk: improve detection and handling of ublk server exit  (Uday Shankar, 1 file, -99/+124)
[ Upstream commit 82a8a30c581bbbe653d33c6ce2ef67e3072c7f12 ] There are currently two ways in which ublk server exit is detected by ublk_drv: 1. uring_cmd cancellation. If there are any outstanding uring_cmds which have not been completed to the ublk server when it exits, io_uring calls the uring_cmd callback with a special cancellation flag as the issuing task is exiting. 2. I/O timeout. This is needed in addition to the above to handle the "saturated queue" case, when all I/Os for a given queue are in the ublk server, and therefore there are no outstanding uring_cmds to cancel when the ublk server exits. There are a couple of issues with this approach: - It is complex and inelegant to have two methods to detect the same condition - The second method detects ublk server exit only after a long delay (~30s, the default timeout assigned by the block layer). This delays the nosrv behavior from kicking in and potential subsequent recovery of the device. The second issue is brought to light with the new test_generic_06 which will be added in following patch. It fails before this fix: selftests: ublk: test_generic_06.sh dev id is 0 dd: error writing '/dev/ublkb0': Input/output error 1+0 records in 0+0 records out 0 bytes copied, 30.0611 s, 0.0 kB/s DEAD dd took 31 seconds to exit (>= 5s tolerance)! generic_06 : [FAIL] Fix this by instead detecting and handling ublk server exit in the character file release callback. This has several advantages: - This one place can handle both saturated and unsaturated queues. Thus, it replaces both preexisting methods of detecting ublk server exit. - It runs quickly on ublk server exit - there is no 30s delay. - It starts the process of removing task references in ublk_drv. This is needed if we want to relax restrictions in the driver like letting only one thread serve each queue There is also the disadvantage that the character file release callback can also be triggered by intentional close of the file, which is a significant behavior change. Preexisting ublk servers (libublksrv) are dependent on the ability to open/close the file multiple times. To address this, only transition to a nosrv state if the file is released while the ublk device is live. This allows for programs to open/close the file multiple times during setup. It is still a behavior change if a ublk server decides to close/reopen the file while the device is LIVE (i.e. while it is responsible for serving I/O), but that would be highly unusual. This behavior is in line with what is done by FUSE, which is very similar to ublk in that a userspace daemon is providing services traditionally provided by the kernel. With this change in, the new test (and all other selftests, and all ublksrv tests) pass: selftests: ublk: test_generic_06.sh dev id is 0 dd: error writing '/dev/ublkb0': Input/output error 1+0 records in 0+0 records out 0 bytes copied, 0.0376731 s, 0.0 kB/s DEAD generic_04 : [PASS] Signed-off-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09  ublk: move device reset into ublk_ch_release()  (Ming Lei, 1 file, -49/+72)
[ Upstream commit 728cbac5fe219d3b8a21a0688a08f2b7f8aeda2b ] ublk_ch_release() is called after the ublk char device is closed, when all uring_cmd are done, so it is perfectly fine to move the ublk device reset to ublk_ch_release() from ublk_ctrl_start_recovery(). This avoids grabbing the exiting daemon's task_struct for too long. However, reset of the following ublk IO flags has to be delayed until the ublk io_uring queues are ready: - ubq->canceling For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before the device is recovered - ubq->fail_io For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before the device is recovered - ublk_io->flags For preventing use of io->cmd With this change, recovery is simplified a lot. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09  ublk: properly serialize all FETCH_REQs  (Uday Shankar, 1 file, -28/+49)
[ Upstream commit b69b8edfb27dfa563cd53f590ec42b481f9eb174 ] Most uring_cmds issued against ublk character devices are serialized because each command affects only one queue, and there is an early check which only allows a single task (the queue's ubq_daemon) to issue uring_cmds against that queue. However, this mechanism does not work for FETCH_REQs, since they are expected before ubq_daemon is set. Since FETCH_REQs are only used at initialization and not in the fast path, serialize them using the per-ublk-device mutex. This fixes a number of data races that were previously possible if a badly behaved ublk server decided to issue multiple FETCH_REQs against the same qid/tag concurrently. Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09  ublk: add helper of ublk_need_map_io()  (Ming Lei, 1 file, -4/+9)
[ Upstream commit 1d781c0de08c0b35948ad4aaf609a4cc9995d9f6 ] ublk_need_map_io() is more readable. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02  ublk: don't fail request for recovery & reissue in case of ubq->canceling  (Ming Lei, 1 file, -4/+5)
commit 18461f2a02be04f8bbbe3b37fecfc702e3fa5bc2 upstream. ubq->canceling is set with the request queue quiesced when the io_uring context is exiting. USER_RECOVERY or !RECOVERY_FAIL_IO requires requests to be re-queued and re-dispatched after the device is recovered. However commit d796cea7b9f3 ("ublk: implement ->queue_rqs()") may still fail any request in case of ubq->canceling, which breaks USER_RECOVERY or !RECOVERY_FAIL_IO. Fix it by calling __ublk_abort_rq() in case of ubq->canceling. Reviewed-by: Uday Shankar <ushankar@purestorage.com> Reported-by: Uday Shankar <ushankar@purestorage.com> Closes: https://lore.kernel.org/linux-block/Z%2FQkkTRHfRxtN%2FmB@dev-ushankar.dev.purestorage.com/ Fixes: d796cea7b9f3 ("ublk: implement ->queue_rqs()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250409011444.2142010-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02  ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io  (Ming Lei, 1 file, -14/+17)
[ Upstream commit 7e26cb69c5e62152a6f05a2ae23605a983a8ef31 ] Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping request queue as quiesced. This way is fragile because queue quiesce crosses syscalls or process contexts. Switch to rely on ubq->canceling for dealing with ublk_nosrv_dev_should_queue_io(), because it has been used for this purpose during io_uring context exiting, and it can be reused before recovering too. In ublk_queue_rq(), the request will be added to requeue list without kicking off requeue in case of ubq->canceling, and finally requests added in requeue list will be dispatched from either ublk_stop_dev() or ublk_ctrl_end_recovery(). Meantime we have to move reset of ubq->canceling from ublk_ctrl_start_recovery() to ublk_ctrl_end_recovery(), when IO handling can be recovered completely. Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used in same context. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Uday Shankar <ushankar@purestorage.com> Link: https://lore.kernel.org/r/20250416035444.99569-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02  ublk: add ublk_force_abort_dev()  (Ming Lei, 1 file, -13/+8)
[ Upstream commit 00b3b0d7cb454d614117c93f33351cdcd20b5b93 ] Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io() in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in single function. Meantime not change device state to QUIESCED any more, since the disk is going to be removed soon. Reviewed-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250416035444.99569-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02  ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA  (Ming Lei, 1 file, -11/+3)
[ Upstream commit d6aa0c178bf81f30ae4a780b2bca653daa2eb633 ] We call io_uring_cmd_complete_in_task() to schedule task_work for handling UBLK_U_IO_NEED_GET_DATA. This way is really not necessary because the current context is exactly the ublk queue context, so call ublk_dispatch_req() directly for handling UBLK_U_IO_NEED_GET_DATA. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Tested-by: Jared Holzman <jholzman@nvidia.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02  ublk: remove unused cmd argument to ublk_dispatch_req()  (Caleb Sander Mateos, 1 file, -5/+2)
[ Upstream commit dfbce8b798fb848a42706e2e544b78b3db22aaae ] ublk_dispatch_req() never uses its struct io_uring_cmd *cmd argument. Drop it so callers don't have to pass a value. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250328180411.2696494-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: d6aa0c178bf8 ("ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02  ublk: implement ->queue_rqs()  (Ming Lei, 1 file, -20/+111)
[ Upstream commit d796cea7b9f33b6315362f504b15fcc26d678493 ] Implement ->queue_rqs() for improving perf in case of MQ. In this way, we just need to call io_uring_cmd_complete_in_task() once for whole IO batch, then both io_uring and ublk server can get exact batch from ublk frontend. Follows IOPS improvement: - tests tools/testing/selftests/ublk/kublk add -t null -q 2 [-z] fio/t/io_uring -p0 /dev/ublkb0 - results: more than 10% IOPS boost observed Pass all ublk selftests, especially the io dispatch order test. Cc: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-9-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: d6aa0c178bf8 ("ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02  ublk: comment on ubq->canceling handling in ublk_queue_rq()  (Ming Lei, 1 file, -0/+5)
[ Upstream commit 7e2fe01a69f6be3e284b38cfd2e4e0598a3b0a8f ] In ublk_queue_rq(), ubq->canceling has to be handled after ->fail_io and ->force_abort are dealt with, otherwise the request may not be failed when deleting disk. Add comment on this usage. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: d6aa0c178bf8 ("ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02  ublk: remove io_cmds list in ublk_queue  (Uday Shankar, 1 file, -35/+11)
[ Upstream commit 989bcd623a8b0c32b76d9258767d8b37e53419e6 ] The current I/O dispatch mechanism - queueing I/O by adding it to the io_cmds list (and poking task_work as needed), then dispatching it in ublk server task context by reversing io_cmds and completing the io_uring command associated to each one - was introduced by commit 7d4a93176e014 ("ublk_drv: don't forward io commands in reserve order") to ensure that the ublk server received I/O in the same order that the block layer submitted it to ublk_drv. This mechanism was only needed for the "raw" task_work submission mechanism, since the io_uring task work wrapper maintains FIFO ordering (using quite a similar mechanism in fact). The "raw" task_work submission mechanism is no longer supported in ublk_drv as of commit 29dc5d06613f2 ("ublk: kill queuing request by task_work_add"), so the explicit llist/reversal is no longer needed - it just duplicates logic already present in the underlying io_uring APIs. Remove it. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250318-ublk_io_cmds-v1-1-c1bb74798fef@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: d6aa0c178bf8 ("ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25  loop: LOOP_SET_FD: send uevents for partitions  (Thomas Weißschuh, 1 file, -1/+2)
commit 0dba7a05b9e47d8b546399117b0ddf2426dc6042 upstream. Remove the suppression of the uevents before scanning for partitions. The partitions inherit their suppression settings from their parent device, which lead to the uevents being dropped. This is similar to the same changes for LOOP_CONFIGURE done in commit bb430b694226 ("loop: LOOP_CONFIGURE: send uevents for partitions"). Fixes: 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v3-1-60ff69ac6088@linutronix.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25  loop: properly send KOBJ_CHANGED uevent for disk device  (Thomas Weißschuh, 1 file, -2/+2)
commit e7bc0010ceb403d025100698586c8e760921d471 upstream. The original commit message and the wording "uncork" in the code comment indicate that it is expected that the suppressed event instances are automatically sent after unsuppressing. This is not the case, instead they are discarded. In effect this means that no "changed" events are emitted on the device itself by default. While each discovered partition does trigger a changed event on the device, devices without partitions don't have any event emitted. This makes udev miss the device creation and prompted workarounds in userspace. See the linked util-linux/losetup bug. Explicitly emit the events and drop the confusingly worded comments. Link: https://github.com/util-linux/util-linux/issues/2434 Fixes: 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v2-1-0c4e6a923b2a@linutronix.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25  loop: stop using vfs_iter_{read,write} for buffered I/O  (Christoph Hellwig, 1 file, -95/+17)
[ Upstream commit f2fed441c69b9237760840a45a004730ff324faf ] vfs_iter_{read,write} always perform direct I/O when the file has the O_DIRECT flag set, which breaks disabling direct I/O using the LOOP_SET_STATUS / LOOP_SET_STATUS64 ioctls. This was recently reported as a regression, but as far as I can tell was only uncovered by better checking for block sizes and has been around since the direct I/O support was added. Fix this by using the existing aio code that calls the raw read/write iter methods instead. Note that despite the comments there is no need for block drivers to ever call flush_dcache_page themselves, and the call is a left-over from prehistoric times. Fixes: ab1cb278bc70 ("block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO") Reported-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20250409130940.3685677-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25  loop: aio inherit the ioprio of original request  (Yunlong Xing, 1 file, -1/+1)
[ Upstream commit 1fdb8188c3d505452b40cdb365b1bb32be533a8e ] Set cmd->iocb.ki_ioprio to the ioprio of loop device's request. The purpose is to inherit the original request ioprio in the aio flow. Signed-off-by: Yunlong Xing <yunlong.xing@unisoc.com> Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250414030159.501180-1-yunlong.xing@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: f2fed441c69b ("loop: stop using vfs_iter_{read,write} for buffered I/O") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20  ublk: fix handling recovery & reissue in ublk_abort_queue()  (Ming Lei, 1 file, -4/+26)
[ Upstream commit 6ee6bd5d4fce502a5b5a2ea805e9ff16e6aa890f ] Commit 8284066946e6 ("ublk: grab request reference when the request is handled by userspace") doesn't grab request reference in case of recovery reissue. Then the request can be requeued & re-dispatch & failed when canceling uring command. If it is one zc request, the request can be freed before io_uring returns the zc buffer back, then cause kernel panic: [ 126.773061] BUG: kernel NULL pointer dereference, address: 00000000000000c8 [ 126.773657] #PF: supervisor read access in kernel mode [ 126.774052] #PF: error_code(0x0000) - not-present page [ 126.774455] PGD 0 P4D 0 [ 126.774698] Oops: Oops: 0000 [#1] SMP NOPTI [ 126.775034] CPU: 13 UID: 0 PID: 1612 Comm: kworker/u64:55 Not tainted 6.14.0_blk+ #182 PREEMPT(full) [ 126.775676] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 [ 126.776275] Workqueue: iou_exit io_ring_exit_work [ 126.776651] RIP: 0010:ublk_io_release+0x14/0x130 [ublk_drv] Fixes it by always grabbing request reference for aborting the request. Reported-by: Caleb Sander Mateos <csander@purestorage.com> Closes: https://lore.kernel.org/linux-block/CADUfDZodKfOGUeWrnAxcZiLT+puaZX8jDHoj_sfHZCOZwhzz6A@mail.gmail.com/ Fixes: 8284066946e6 ("ublk: grab request reference when the request is handled by userspace") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250409011444.2142010-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10  ublk: make sure ubq->canceling is set when queue is frozen  (Ming Lei, 1 file, -10/+29)
[ Upstream commit 8741d0737921ec1c03cf59aebf4d01400c2b461a ] Now ublk driver depends on `ubq->canceling` for deciding if the request can be dispatched via uring_cmd & io_uring_cmd_complete_in_task(). Once ubq->canceling is set, the uring_cmd can be done via ublk_cancel_cmd() and io_uring_cmd_done(). So set ubq->canceling when queue is frozen, this way makes sure that the flag can be observed from ublk_queue_rq() reliably, and avoids use-after-free on uring_cmd. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10  badblocks: use sector_t instead of int to avoid truncation of badblocks length  (Zheng Qixing, 1 file, -2/+1)
[ Upstream commit d301f164c3fbff611bd71f57dfa553b9219f0f5e ] There is a truncation of badblocks length issue when setting badblocks as follows: echo "2055 4294967299" > bad_blocks cat bad_blocks 2055 3 Change the 'sectors' argument type from 'int' to 'sector_t'. This change avoids truncation of the badblocks length for large sectors by replacing 'int' with 'sector_t' (u64), enabling proper handling of larger disk sizes and ensuring compatibility with 64-bit sector addressing. Fixes: 9e0e252a048b ("badblocks: Add core badblock management code") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
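The truncated value 3 in the example above is simply the requested length modulo 2^32 (4294967299 = 2^32 + 3). A standalone sketch, assuming a 32-bit int as on the kernel's usual targets, shows why widening the parameter to sector_t (u64) avoids the truncation:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t requested = 4294967299ULL;  /* length passed via "2055 4294967299" */
	int as_int = (int)requested;         /* the old 'int sectors' parameter */
	uint64_t as_sector_t = requested;    /* sector_t is u64 in the kernel */

	/* 4294967299 = 2^32 + 3, so a 32-bit int keeps only the low bits: 3 */
	printf("int:      %d\n", as_int);
	printf("sector_t: %llu\n", (unsigned long long)as_sector_t);
	return 0;
}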
2025-04-10  badblocks: return boolean from badblocks_set() and badblocks_clear()  (Zheng Qixing, 1 file, -7/+7)
[ Upstream commit c8775aefba959cdfbaa25408a84d3dd15bbeb991 ] Change the return type of badblocks_set() and badblocks_clear() from int to bool, indicating success or failure. Specifically: - _badblocks_set() and _badblocks_clear() functions now return true for success and false for failure. - All calls to these functions are updated to handle the new boolean return type. - This change improves code clarity and ensures a more consistent handling of success and failure states. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: d301f164c3fb ("badblocks: use sector_t instead of int to avoid truncation of badblocks length") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-15  Merge tag 'block-6.14-20250313' of git://git.kernel.dk/linux  (Linus Torvalds, 2 files, -4/+5)
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Concurrent pci error and hotplug handling fix (Keith) - Endpoint function fixes (Damien) - Fix for a regression introduced in this cycle with error checking for batched request completions (Shin'ichiro) * tag 'block-6.14-20250313' of git://git.kernel.dk/linux: block: change blk_mq_add_to_batch() third argument type to bool nvme: move error logging from nvme_end_req() to __nvme_end_req() nvmet: pci-epf: Do not add an IRQ vector if not needed nvmet: pci-epf: Set NVMET_PCI_EPF_Q_LIVE when a queue is fully created nvme-pci: fix stuck reset on concurrent DPC and HP
2025-03-12  block: change blk_mq_add_to_batch() third argument type to bool  (Shin'ichiro Kawasaki, 2 files, -4/+5)
Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") modified the evaluation criteria for the third argument, 'ioerror', in the blk_mq_add_to_batch() function. Initially, the function had checked if 'ioerror' equals zero. Following the commit, it started checking for negative error values, with the presumption that such values, for instance -EIO, would be passed in. However, blk_mq_add_to_batch() callers do not pass negative error values. Instead, they pass status codes defined in various ways: - NVMe PCI and Apple drivers pass NVMe status code - virtio_blk driver passes the virtblk request header status byte - null_blk driver passes blk_status_t These codes are either zero or positive, therefore the revised check fails to function as intended. Specifically, with the NVMe PCI driver, this modification led to the failure of the blktests test case nvme/039. In this test scenario, errors are artificially injected into the NVMe driver, resulting in positive NVMe status codes passed to blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a batch. Hence the failure. To correct the ioerror check within blk_mq_add_to_batch(), make all callers uniformly pass the argument as a boolean. Modify the callers to check their specific status codes and pass the boolean value 'is_error'. Also describe the arguments of blk_mq_add_to_batch as kerneldoc. Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com [axboe: fold in documentation update] Signed-off-by: Jens Axboe <axboe@kernel.dk>
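A small standalone illustration of the sign mismatch described above; the status value is made up, only the zero/positive/negative logic matters:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical positive NVMe-style status code for a failed command. */
	unsigned short nvme_status = 0x0281;
	int ioerror = nvme_status;

	bool old_check_sees_error = (ioerror < 0);      /* negative-errno assumption */
	bool new_check_sees_error = (nvme_status != 0); /* caller-computed is_error   */

	printf("old check: %s\n", old_check_sees_error ? "error" : "treated as success");
	printf("new check: %s\n", new_check_sees_error ? "error" : "treated as success");
	return 0;
}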
2025-03-08  Merge tag 'block-6.14-20250306' of git://git.kernel.dk/linux  (Linus Torvalds, 1 file, -2/+5)
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - TCP use after free fix on polling (Sagi) - Controller memory buffer cleanup fixes (Icenowy) - Free leaking requests on bad user passthrough commands (Keith) - TCP error message fix (Maurizio) - TCP corruption fix on partial PDU (Maurizio) - TCP memory ordering fix for weakly ordered archs (Meir) - Type coercion fix on message error for TCP (Dan) - Name the RQF flags enum, fixing issues with anon enums and BPF import of it - ublk parameter setting fix - GPT partition 7-bit conversion fix * tag 'block-6.14-20250306' of git://git.kernel.dk/linux: block: Name the RQF flags enum nvme-tcp: fix signedness bug in nvme_tcp_init_connection() block: fix conversion of GPT partition name to 7-bit ublk: set_params: properly check if parameters can be applied nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu() nvme-tcp: Fix a C2HTermReq error message nvmet: remove old function prototype nvme-ioctl: fix leaked requests on mapping error nvme-pci: skip CMB blocks incompatible with PCI P2P DMA nvme-pci: clean up CMBMSC when registering CMB fails nvme-tcp: fix possible UAF in nvme_tcp_poll
2025-03-05  ublk: set_params: properly check if parameters can be applied  (Uday Shankar, 1 file, -2/+5)
The parameters set by the set_params call are only applied to the block device in the start_dev call. So if a device has already been started, a subsequently issued set_params on that device will not have the desired effect, and should return an error. There is an existing check for this - set_params fails on devices in the LIVE state. But this check is not sufficient to cover the recovery case. In this case, the device will be in the QUIESCED or FAIL_IO states, so set_params will succeed. But this success is misleading, because the parameters will not be applied, since the device has already been started (by a previous ublk server). The bit UB_STATE_USED is set on completion of the start_dev; use it to detect and fail set_params commands which arrive too late to be applied (after start_dev). Signed-off-by: Uday Shankar <ushankar@purestorage.com> Fixes: 0aa73170eba5 ("ublk_drv: add SET_PARAMS/GET_PARAMS control command") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250304-set_params-v1-1-17b5e0887606@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-07  Merge tag 'block-6.14-20250207' of git://git.kernel.dk/linux  (Linus Torvalds, 1 file, -2/+2)
Pull block fixes from Jens Axboe: - MD pull request via Song: - fix an error handling path for md-linear - NVMe pull request via Keith: - Connection fixes for fibre channel transport (Daniel) - Endian fixes (Keith, Christoph) - Cleanup fix for host memory buffer (Francis) - Platform specific power quirks (Georg) - Target memory leak (Sagi) - Use appropriate controller state accessor (Daniel) - Fixup for a regression introduced last week, where sunvdc wasn't updated for an API change, causing compilation failures on sparc64. * tag 'block-6.14-20250207' of git://git.kernel.dk/linux: drivers/block/sunvdc.c: update the correct AIP call md: Fix linear_set_limits() nvme-fc: use ctrl state getter nvme: make nvme_tls_attrs_group static nvmet: add a missing endianess conversion in nvmet_execute_admin_connect nvmet: the result field in nvmet_alloc_ctrl_args is little endian nvmet: fix a memory leak in controller identify nvme-fc: do not ignore connectivity loss during connecting nvme: handle connectivity loss in nvme_set_queue_count nvme-fc: go straight to connecting state when initializing nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk nvme-pci: remove redundant dma frees in hmb nvmet: fix rw control endian access
2025-02-03  drivers/block/sunvdc.c: update the correct AIP call  (Stephen Rothwell, 1 file, -2/+2)
My sparc64 defconfig build failed like this: drivers/block/sunvdc.c: In function 'vdc_queue_drain': drivers/block/sunvdc.c:1130:9: error: too many arguments to function 'blk_mq_unquiesce_queue' 1130 | blk_mq_unquiesce_queue(q, memflags); | ^~~~~~~~~~~~~~~~~~~~~~ In file included from drivers/block/sunvdc.c:10: include/linux/blk-mq.h:895:6: note: declared here 895 | void blk_mq_unquiesce_queue(struct request_queue *q); | ^~~~~~~~~~~~~~~~~~~~~~ drivers/block/sunvdc.c:1131:9: error: too few arguments to function 'blk_mq_unfreeze_queue' 1131 | blk_mq_unfreeze_queue(q); | ^~~~~~~~~~~~~~~~~~~~~ In file included from drivers/block/sunvdc.c:10: include/linux/blk-mq.h:914:1: note: declared here 914 | blk_mq_unfreeze_queue(struct request_queue *q, unsigned int memflags) | ^~~~~~~~~~~~~~~~~~~~~ Fixes: 1e1a9cecfab3 ("block: force noio scope in blk_mq_freeze_queue") Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
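Going by the blk-mq.h declarations quoted in the build error, the memflags argument simply ended up on the wrong call; a sketch of the corrected sequence (assumed shape and function name, not the exact sunvdc diff):

/* Sketch only: argument placement per the blk-mq.h declarations quoted above. */
static void vdc_drain_fixed(struct request_queue *q, unsigned int memflags)
{
	blk_mq_unquiesce_queue(q);           /* takes only the queue */
	blk_mq_unfreeze_queue(q, memflags);  /* memflags returned by the earlier blk_mq_freeze_queue() goes here */
}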
2025-01-31  Merge tag 'block-6.14-20250131' of git://git.kernel.dk/linux  (Linus Torvalds, 8 files, -25/+35)
Pull more block updates from Jens Axboe: - MD pull request via Song: - Fix a md-cluster regression introduced - More sysfs race fixes - Mark anything inside queue freezing as not being able to do IO for memory allocations - Fix for a regression introduced in loop in this merge window - Fix for a regression in queue mapping setups introduced in this merge window - Fix for the block dio fops attempting an iov_iter revert upton getting -EIOCBQUEUED on the read side. This one is going to stable as well * tag 'block-6.14-20250131' of git://git.kernel.dk/linux: block: force noio scope in blk_mq_freeze_queue block: fix nr_hw_queue update racing with disk addition/removal block: get rid of request queue ->sysfs_dir_lock loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64} md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime blk-mq: create correct map for fallback case block: don't revert iter for -EIOCBQUEUED
2025-01-31  block: force noio scope in blk_mq_freeze_queue  (Christoph Hellwig, 8 files, -23/+34)
When block drivers or the core block code perform allocations with a frozen queue, this could try to recurse into the block device to reclaim memory and deadlock. Thus all allocations done by a process that froze a queue need to be done without __GFP_IO and __GFP_FS. Instead of trying to track all of them down, force a noio scope as part of freezing the queue. Note that nvme is a bit of a mess here due to the non-owner freezes, and they will be addressed separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28  Merge tag 'driver-core-6.14-rc1' of …  (Linus Torvalds, 1 file, -3/+3)
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and debugfs updates from Greg KH: "Here is the big set of driver core and debugfs updates for 6.14-rc1. Included in here is a bunch of driver core, PCI, OF, and platform rust bindings (all acked by the different subsystem maintainers), hence the merge conflict with the rust tree, and some driver core api updates to mark things as const, which will also require some fixups due to new stuff coming in through other trees in this merge window. There are also a bunch of debugfs updates from Al, and there is at least one user that does have a regression with these, but Al is working on tracking down the fix for it. In my use (and everyone else's linux-next use), it does not seem like a big issue at the moment. Here's a short list of the things in here: - driver core rust bindings for PCI, platform, OF, and some i/o functions. We are almost at the "write a real driver in rust" stage now, depending on what you want to do. - misc device rust bindings and a sample driver to show how to use them - debugfs cleanups in the fs as well as the users of the fs api for places where drivers got it wrong or were unnecessarily doing things in complex ways. - driver core const work, making more of the api take const * for different parameters to make the rust bindings easier overall. - other small fixes and updates All of these have been in linux-next with all of the aforementioned merge conflicts, and the one debugfs issue, which looks to be resolved "soon"" * tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits) rust: device: Use as_char_ptr() to avoid explicit cast rust: device: Replace CString with CStr in property_present() devcoredump: Constify 'struct bin_attribute' devcoredump: Define 'struct bin_attribute' through macro rust: device: Add property_present() saner replacement for debugfs_rename() orangefs-debugfs: don't mess with ->d_name octeontx2: don't mess with ->d_parent or ->d_parent->d_name arm_scmi: don't mess with ->d_parent->d_name slub: don't mess with ->d_name sof-client-ipc-flood-test: don't mess with ->d_name qat: don't mess with ->d_name xhci: don't mess with ->d_iname mtu3: don't mess wiht ->d_iname greybus/camera - stop messing with ->d_iname mediatek: stop messing with ->d_iname netdevsim: don't embed file_operations into your structs b43legacy: make use of debugfs_get_aux() b43: stop embedding struct file_operations into their objects carl9170: stop embedding file_operations into their objects ...
2025-01-28  Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost  (Linus Torvalds, 1 file, -3/+25)
Pull virtio updates from Michael Tsirkin: "A small number of improvements all over the place: - vdpa/octeon support for multiple interrupts - virtio-pci support for error recovery - vp_vdpa support for notification with data - vhost/net fix to set num_buffers for spec compliance - virtio-mem now works with kdump on s390 And small cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (23 commits) virtio_blk: Add support for transport error recovery virtio_pci: Add support for PCIe Function Level Reset vhost/net: Set num_buffers for virtio 1.0 vdpa/octeon_ep: read vendor-specific PCI capability virtio-pci: define type and header for PCI vendor data vdpa/octeon_ep: handle device config change events vdpa/octeon_ep: enable support for multiple interrupts per device vdpa: solidrun: Replace deprecated PCI functions s390/kdump: virtio-mem kdump support (CONFIG_PROC_VMCORE_DEVICE_RAM) virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM virtio-mem: remember usable region size virtio-mem: mark device ready before registering callbacks in kdump mode fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel fs/proc/vmcore: factor out freeing a list of vmcore ranges fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list fs/proc/vmcore: move vmcore definitions out of kcore.h fs/proc/vmcore: prefix all pr_* with "vmcore:" fs/proc/vmcore: disallow vmcore modifications while the vmcore is open fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex ...
2025-01-27  loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64}  (Christoph Hellwig, 1 file, -2/+1)
LOOP_SET_STATUS{,64} can set a lot more flags than it is supposed to clear (the LOOP_SET_STATUS_CLEARABLE_FLAGS vs LOOP_SET_STATUS_SETTABLE_FLAGS defines should have been a hint..). Fix this by only clearing the bits in LOOP_SET_STATUS_CLEARABLE_FLAGS. Fixes: ae074d07a0e5 ("loop: move updating lo_flags out of loop_set_status_from_info") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250127143045.538279-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-27  virtio_blk: Add support for transport error recovery  (Israel Rukshin, 1 file, -3/+25)
Add support for proper cleanup and re-initialization of virtio-blk devices during transport reset error recovery flow. This enhancement includes: - Pre-reset handler (reset_prepare) to perform device-specific cleanup - Post-reset handler (reset_done) to re-initialize the device These changes allow the device to recover from various reset scenarios, ensuring proper functionality after a reset event occurs. Without this implementation, the device cannot properly recover from resets, potentially leading to undefined behavior or device malfunction. This feature has been tested using PCI transport with Function Level Reset (FLR) as an example reset mechanism. The reset can be triggered manually via sysfs (echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset). Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <1732690652-3065-3-git-send-email-israelr@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27  Merge tag 'mm-stable-2025-01-26-14-59' of …  (Linus Torvalds, 2 files, -135/+170)
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. 
This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. 
To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...
2025-01-27  Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of …  (Linus Torvalds, 1 file, -1/+1)
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly individually changelogged singleton patches. The patch series in this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API - "Convert ocfs2 to use folios" from Matthew Wilcox does that - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code" * tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits) ocfs2: use str_yes_no() and str_no_yes() helper functions include/linux/lz4.h: add some missing macros Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent Xarray: remove repeat check in xas_squash_marks() Xarray: distinguish large entries correctly in xas_split_alloc() Xarray: move forward index correctly in xas_pause() Xarray: do not return sibling entries from xas_find_marked() ipc/util.c: complete the kernel-doc function descriptions gcov: clang: use correct function param names latencytop: use correct kernel-doc format for func params minmax.h: remove some #defines that are only expanded once minmax.h: simplify the variants of clamp() minmax.h: move all the clamp() definitions after the min/max() ones minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() minmax.h: reduce the #define expansion of min(), max() and clamp() minmax.h: update some comments minmax.h: add whitespace around operators and after commas nilfs2: do not update mtime of renamed directory that is not moved nilfs2: handle errors that nilfs_prepare_chunk() may return CREDITS: fix spelling mistake ...
2025-01-26  zram: remove zcomp_stream_put() from write_incompressible_page()  (Sergey Senozhatsky, 1 file, -1/+0)
We cannot and should not put per-CPU compression stream in write_incompressible_page() because that function never gets any per-CPU streams in the first place. It's zram_write_page() that puts the stream before it calls write_incompressible_page(). Link: https://lkml.kernel.org/r/20250115072003.380567-1-senozhatsky@chromium.org Fixes: 485d11509d6d ("zram: factor out ZRAM_HUGE write") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-26  zram: cond_resched() in writeback loop  (Sergey Senozhatsky, 1 file, -0/+2)
zram writeback is a costly operation, because every target slot (unless ZRAM_HUGE) is decompressed before it gets written to a backing device. The writeback to a backing device uses submit_bio_wait(), which may look like a rescheduling point. However, if the backing device has the BD_HAS_SUBMIT_BIO bit set, __submit_bio() directly calls disk->fops->submit_bio(bio) on the backing device, and so when submit_bio_wait() calls blk_wait_io() the I/O is already done. On such systems we effectively end up in a loop for_each (target slot) { decompress(slot) __submit_bio() disk->fops->submit_bio(bio) } which on PREEMPT_NONE systems triggers watchdogs (since there are no explicit rescheduling points). Add cond_resched() to the zram writeback loop. Link: https://lkml.kernel.org/r/20241218063513.297475-8-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-26  zram: use zram_read_from_zspool() in writeback  (Sergey Senozhatsky, 1 file, -7/+4)
We only can read pages from zspool in writeback, zram_read_page() is not really right in that context not only because it's a more generic function that handles ZRAM_WB pages, but also because it requires us to unlock slot between slot flag check and actual page read. Use zram_read_from_zspool() instead and do slot flags check and page read under the same slot lock. Link: https://lkml.kernel.org/r/20241218063513.297475-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-26  zram: factor out different page types read  (Sergey Senozhatsky, 1 file, -33/+52)
Similarly to write, split the page read code into ZRAM_HUGE read, ZRAM_SAME read and compressed page read to simplify the code. Link: https://lkml.kernel.org/r/20241218063513.297475-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-26  zram: factor out ZRAM_HUGE write  (Sergey Senozhatsky, 1 file, -53/+83)
zram_write_page() handles: ZRAM_SAME pages (which was already factored out) stores, regular page stores and ZRAM_HUGE pages stores. ZRAM_HUGE handling adds a significant amount of complexity. Instead, we can handle ZRAM_HUGE in a separate function. This allows us to simplify zs_handle allocations slow-path, as it now does not handle ZRAM_HUGE case. ZRAM_HUGE zs_handle allocation, on the other hand, can now drop __GFP_KSWAPD_RECLAIM because we handle ZRAM_HUGE in preemptible context (outside of local-lock scope). Link: https://lkml.kernel.org/r/20241218063513.297475-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-26  zram: factor out ZRAM_SAME write  (Sergey Senozhatsky, 1 file, -16/+21)
Handling of ZRAM_SAME now uses a goto to the final stages of zram_write_page() plus it introduces a branch and flags variable, which is not making the code any simpler. In reality, we can handle ZRAM_SAME immediately when we detect such pages and remove a goto and a branch. Factor out ZRAM_SAME handling into a separate routine to simplify zram_write_page(). Link: https://lkml.kernel.org/r/20241218063513.297475-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-26  zram: remove entry element member  (Sergey Senozhatsky, 2 files, -22/+6)
Element is in the same anon union as handle and hence holds the same value, which makes the code below sort of confusing: handle = zram_get_handle() if (!handle) element = zram_get_element() Element doesn't really simplify the code, let's just remove it. We already re-purpose handle to store the block id of a written back page. Link: https://lkml.kernel.org/r/20241218063513.297475-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-26  zram: free slot memory early during write  (Sergey Senozhatsky, 1 file, -6/+5)
Patch series "zram: split page type read/write handling", v2. This is a subset of [1] series which contains only fixes and improvements (no new features, as ZRAM_HUGE split is still under consideration). The motivation for factoring out is that zram_write_page() gets more and more complex all the time, because it tries to handle too many scenarios: ZRAM_SAME store, ZRAM_HUGE store, compress page store with zs_malloc allocation slowpath and conditional recompression, etc. Factor those out and make things easier to handle. Addition of cond_resched() is simply a fix, I can trigger watchdog from zram writeback(). And early slot free is just a reasonable thing to do. [1] https://lore.kernel.org/linux-kernel/20241119072057.3440039-1-senozhatsky@chromium.org This patch (of 7): In the current implementation entry's previously allocated memory is released in the very last moment, when we already have allocated a new memory for new data. This, basically, temporarily increases memory usage for no good reason. For example, consider the case when both old (stale) and new entry data are incompressible so such entry will temporarily use two physical pages - one for stale (old) data and one for new data. We can release old memory as soon as we get a write request for entry. Link: https://lkml.kernel.org/r/20241218063513.297475-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20241218063513.297475-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-21  Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux  (Linus Torvalds, 21 files, -157/+236)
Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ...
2025-01-13  nbd: fix partial sending  (Ming Lei, 1 file, -10/+85)
nbd driver sends request header and payload with multiple call of sock_sendmsg, and partial sending can't be avoided. However, nbd driver returns BLK_STS_RESOURCE to block core in this situation. This way causes one issue: request->tag may change in the next run of nbd_queue_rq(), but the original old tag has been sent as part of header cookie, this way confuses nbd driver reply handling, since the real request can't be retrieved any more with the obsolete old tag. Fix it by retrying sending directly in per-socket work function, meantime return BLK_STS_OK to block layer core. Cc: vincent.chen@sifive.com Cc: Leon Schuermann <leon@is.currently.online> Cc: Bart Van Assche <bvanassche@acm.org> Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Link: https://lore.kernel.org/r/20241029011941.153037-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13  Merge 6.13-rc7 into driver-core-next  (Greg Kroah-Hartman, 2 files, -14/+27)
We need the debugfs / driver-core fixes in here as well for testing and to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>