path: root/drivers/block
Age | Commit message | Author | Files | Lines
4 days | ublk: fix use-after-free in ublk_partition_scan_work | Ming Lei | 1 | -15/+22
[ Upstream commit f0d385f6689f37a2828c686fb279121df006b4cb ]

A race condition exists between the async partition scan work and device teardown that can lead to a use-after-free of ub->ub_disk:

1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk()
2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does:
   - del_gendisk(ub->ub_disk)
   - ublk_detach_disk() sets ub->ub_disk = NULL
   - put_disk() which may free the disk
3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk leading to UAF

Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold a reference to the disk during the partition scan. The spinlock in ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker either gets a valid reference or sees NULL and exits early.

Also change flush_work() to cancel_work_sync() to avoid running the partition scan work unnecessarily when the disk is already detached.

Fixes: 7fc4da6a304b ("ublk: scan partition in async way")
Reported-by: Ruikai Peng <ruikai@pwno.io>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
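The get/put pattern this fix relies on can be sketched in userspace C. This is only a model under stated assumptions: `struct dev`, `dev_get_disk()`, `dev_detach_disk()` and `disk_put()` are hypothetical names, an `atomic_flag` stands in for the kernel spinlock, and none of it is the ublk driver's actual code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted disk object. */
struct disk {
	atomic_int refs;
};

/* Hypothetical stand-in for the device that publishes the disk pointer. */
struct dev {
	atomic_flag lock;	/* models the spinlock protecting ->disk */
	struct disk *disk;
};

static void dev_lock(struct dev *d)
{
	while (atomic_flag_test_and_set(&d->lock))
		;		/* spin until the flag is released */
}

static void dev_unlock(struct dev *d)
{
	atomic_flag_clear(&d->lock);
}

/* Allocate the disk; the device itself owns the initial reference. */
static bool dev_init(struct dev *d)
{
	atomic_flag_clear(&d->lock);
	d->disk = malloc(sizeof(*d->disk));
	if (!d->disk)
		return false;
	atomic_init(&d->disk->refs, 1);
	return true;
}

/* Worker side: returns a referenced disk, or NULL if it was already detached. */
static struct disk *dev_get_disk(struct dev *d)
{
	struct disk *disk;

	dev_lock(d);
	disk = d->disk;
	if (disk)
		atomic_fetch_add(&disk->refs, 1);
	dev_unlock(d);
	return disk;
}

/* Teardown side: unpublish the pointer; the caller then drops the device's reference. */
static struct disk *dev_detach_disk(struct dev *d)
{
	struct disk *disk;

	dev_lock(d);
	disk = d->disk;
	d->disk = NULL;
	dev_unlock(d);
	return disk;
}

/* Drop one reference; returns true if this call freed the disk. */
static bool disk_put(struct disk *disk)
{
	if (atomic_fetch_sub(&disk->refs, 1) == 1) {
		free(disk);
		return true;
	}
	return false;
}
```

Because the pointer is read and the count is raised under the same lock that teardown takes to clear the pointer, a worker in this model can never observe a non-NULL pointer to a disk whose last reference is already gone.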
4 days | ublk: reorder tag_set initialization before queue allocation | Ming Lei | 1 | -6/+6
commit 011af85ccd871526df36988c7ff20ca375fb804d upstream. Move ublk_add_tag_set() before ublk_init_queues() in the device initialization path. This allows us to use the blk-mq CPU-to-queue mapping established by the tag_set to determine the appropriate NUMA node for each queue allocation. The error handling paths are also reordered accordingly. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> [ Upstream commit 529d4d632788 ("ublk: implement NUMA-aware memory allocation") is ported to linux-6.18.y, but it depends on commit 011af85ccd87 ("ublk: reorder tag_set initialization before queue allocation"). kernel panic is reported on 6.18.y: https://github.com/ublk-org/ublksrv/issues/174 ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
13 days | ublk: scan partition in async way | Ming Lei | 1 | -3/+32
[ Upstream commit 7fc4da6a304bdcd3de14fc946dc2c19437a9cc5a ]

Implement async partition scan to avoid IO hang when reading partition tables. Similar to nvme_partition_scan_work(), partition scanning is deferred to a work queue to prevent deadlocks.

When partition scan happens synchronously during add_disk(), IO errors can cause the partition scan to wait while holding ub->mutex, which can deadlock with other operations that need the mutex.

Changes:
- Add partition_scan_work to ublk_device structure
- Implement ublk_partition_scan_work() to perform async scan
- Always suppress sync partition scan during add_disk()
- Schedule async work after add_disk() for trusted daemons
- Add flush_work() in ublk_stop_dev() before grabbing ub->mutex

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reported-by: Yoav Cohen <yoav@nvidia.com>
Closes: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
13 days | ublk: implement NUMA-aware memory allocation | Ming Lei | 1 | -31/+53
[ Upstream commit 529d4d6327880e5c60f4e0def39b3faaa7954e54 ]

Implement NUMA-friendly memory allocation for ublk driver to improve performance on multi-socket systems. This commit includes the following changes:

1. Rename __queues to queues, dropping the __ prefix since the field is now accessed directly throughout the codebase rather than only through the ublk_get_queue() helper.
2. Remove the queue_size field from struct ublk_device as it is no longer needed.
3. Move queue allocation and deallocation into ublk_init_queue() and ublk_deinit_queue() respectively, improving encapsulation. This simplifies ublk_init_queues() and ublk_deinit_queues() to just iterate and call the per-queue functions.
4. Add ublk_get_queue_numa_node() helper function to determine the appropriate NUMA node for a queue by finding the first CPU mapped to that queue via tag_set.map[HCTX_TYPE_DEFAULT].mq_map[] and converting it to a NUMA node using cpu_to_node(). This function is called internally by ublk_init_queue() to determine the allocation node.
5. Allocate each queue structure on its local NUMA node using kvzalloc_node() in ublk_init_queue().
6. Allocate the I/O command buffer on the same NUMA node using alloc_pages_node().

This reduces memory access latency on multi-socket NUMA systems by ensuring each queue's data structures are local to the CPUs that access them.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 7fc4da6a304b ("ublk: scan partition in async way")
Signed-off-by: Sasha Levin <sashal@kernel.org>
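The node lookup described for ublk_get_queue_numa_node() (first CPU mapped to the queue, then that CPU's node) can be illustrated with a small userspace model. The function name `queue_numa_node()`, the plain arrays standing in for mq_map[] and cpu_to_node(), and the `NUMA_NO_NODE` fallback for an unmapped queue are assumptions of this sketch, not the driver's code.

```c
#include <stddef.h>

#define NUMA_NO_NODE (-1)	/* assumed fallback when no CPU maps to the queue */

/*
 * Hypothetical model:
 *   mq_map[cpu]   - index of the hardware queue that serves this CPU
 *   cpu_node[cpu] - NUMA node of this CPU (what cpu_to_node() would report)
 */
static int queue_numa_node(const unsigned int *mq_map, const int *cpu_node,
			   size_t nr_cpus, unsigned int q_id)
{
	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		/* The first CPU mapped to the queue decides the node. */
		if (mq_map[cpu] == q_id)
			return cpu_node[cpu];
	}
	return NUMA_NO_NODE;
}
```

With CPUs 0-1 on node 0 serving queue 0 and CPUs 2-3 on node 1 serving queue 1, queue 1 resolves to node 1, which is where a node-aware allocator would then be asked to place that queue's memory.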
2026-01-02 | zloop: make the write pointer of full zones invalid | Damien Le Moal | 1 | -3/+5
commit 866d65745b635927c3d1343ab67e6fd4a99d116d upstream. The write pointer of zones that are in the full condition is always invalid. Reflect that fact by setting the write pointer of full zones to ULLONG_MAX. Fixes: eb0570c7df23 ("block: new zoned loop block device driver") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-02 | zloop: fail zone append operations that are targeting full zones | Damien Le Moal | 1 | -0/+4
commit cf28f6f923cb1dd2765b5c3d7697bb4dcf2096a0 upstream. zloop_rw() will fail any regular write operation that targets a full sequential zone. The check for this is indirect and achieved by checking the write pointer alignment of the write operation. But this check is ineffective for zone append operations since these are always automatically directed at a zone write pointer. Prevent zone append operations from being executed in a full zone with an explicit check of the zone condition. Fixes: eb0570c7df23 ("block: new zoned loop block device driver") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
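A minimal sketch of why the explicit check matters, taken together with the previous zloop entry that sets the write pointer of full zones to ULLONG_MAX. The types, the `zone_check_write()` helper and the -EIO return value are hypothetical and chosen for illustration; they are not taken from zloop.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical zone conditions for this sketch. */
enum zone_cond { ZONE_EMPTY, ZONE_OPEN, ZONE_FULL };

struct zone {
	enum zone_cond cond;
	unsigned long long wp;	/* write pointer, in sectors */
};

/* A full zone has no valid write pointer, so mark it with ULLONG_MAX. */
static void zone_set_full(struct zone *z)
{
	z->cond = ZONE_FULL;
	z->wp = ULLONG_MAX;
}

/* Returns 0 if the write may proceed, a negative errno otherwise. */
static int zone_check_write(const struct zone *z, unsigned long long sector,
			    bool is_append)
{
	if (is_append) {
		/*
		 * An append is placed at the write pointer by definition, so
		 * a position check cannot reject it; test the condition.
		 */
		return z->cond == ZONE_FULL ? -EIO : 0;
	}
	/* A regular write must land exactly on the write pointer. */
	return sector == z->wp ? 0 : -EIO;
}
```

In this model a regular write to a full zone fails on the position comparison alone, while an append is only rejected because of the explicit condition test.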
2026-01-02 | floppy: fix for PAGE_SIZE != 4KB | Rene Rebe | 1 | -1/+1
commit 82d20481024cbae2ea87fe8b86d12961bfda7169 upstream. For years I wondered why the floppy driver does not just work on sparc64, e.g: root@SUNW_375_0066:# disktype /dev/fd0 disktype: Can't open /dev/fd0: No such device or address [ 525.341906] disktype: attempt to access beyond end of device fd0: rw=0, sector=0, nr_sectors = 16 limit=8 [ 525.341991] floppy: error 10 while reading block 0 Turns out floppy.c __floppy_read_block_0 tries to read one page for the first test read to determine the disk size and thus fails if that is greater than 4k. Adjust minimum MAX_DISK_SIZE to PAGE_SIZE to fix floppy on sparc64 and likely all other PAGE_SIZE != 4KB configs. Cc: stable@vger.kernel.org Signed-off-by: René Rebe <rene@exactco.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
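The numbers in the log above follow from simple sector arithmetic: a one-page read on an 8 KiB-page system is 16 sectors of 512 bytes, while a 4 KiB limit is only 8 sectors. The helpers below are a hypothetical illustration of that arithmetic (the names and the KiB unit for the capacity are assumptions of this sketch, not floppy.c code).

```c
#define SECTOR_SHIFT 9	/* 512-byte sectors */

static unsigned long bytes_to_sectors(unsigned long bytes)
{
	return bytes >> SECTOR_SHIFT;
}

/* Returns 1 if a one-page read at sector 0 fits within capacity_kb KiB. */
static int first_read_fits(unsigned long page_size, unsigned long capacity_kb)
{
	return bytes_to_sectors(page_size) <= bytes_to_sectors(capacity_kb * 1024);
}

/* Smallest capacity (in KiB) that still admits a one-page test read. */
static unsigned long min_capacity_kb(unsigned long page_size)
{
	unsigned long page_kb = page_size / 1024;

	return page_kb > 4 ? page_kb : 4;
}
```

Raising the minimum to the page size makes the first test read fit for any page size, which matches the direction of the fix described in the commit message.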
2026-01-02 | block: rnbd-clt: Fix signedness bug in init_dev() | Dan Carpenter | 1 | -1/+1
[ Upstream commit 1ddb815fdfd45613c32e9bd1f7137428f298e541 ] The "dev->clt_device_id" variable is set using ida_alloc_max() which returns an int and in particular it returns negative error codes. Change the type from u32 to int to fix the error checking. Fixes: c9b5645fd8ca ("block: rnbd-clt: Fix leaked ID in init_dev()") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
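The class of bug fixed here is easy to reproduce in isolation. In the sketch below, `fake_id_alloc()` is a hypothetical stand-in for an allocator that, like ida_alloc_max(), returns an int that is negative on failure; the two init functions are illustrative and are not the rnbd code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical allocator: a valid ID on success, a negative errno on failure. */
static int fake_id_alloc(int fail)
{
	return fail ? -ENOMEM : 7;
}

/* Buggy variant: the ID is stored in an unsigned variable. */
static int init_dev_buggy(int fail)
{
	uint32_t id = fake_id_alloc(fail);

	if (id < 0)	/* never true: an unsigned value cannot be negative */
		return -1;
	return 0;	/* the failure is silently treated as success */
}

/* Fixed variant: a signed type keeps the error code visible. */
static int init_dev_fixed(int fail)
{
	int id = fake_id_alloc(fail);

	if (id < 0)
		return id;
	return 0;
}
```

Some compilers can flag the always-false comparison when extra warnings are enabled, but only the signed type actually restores the error path.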
2026-01-02 | ublk: clean up user copy references on ublk server exit | Caleb Sander Mateos | 1 | -2/+1
[ Upstream commit daa24603d9f0808929514ee62ced30052ca7221c ] If a ublk server process releases a ublk char device file, any requests dispatched to the ublk server but not yet completed will retain a ref value of UBLK_REFCOUNT_INIT. Before commit e63d2228ef83 ("ublk: simplify aborting ublk request"), __ublk_fail_req() would decrement the reference count before completing the failed request. However, that commit optimized __ublk_fail_req() to call __ublk_complete_rq() directly without decrementing the request reference count. The leaked reference count incorrectly allows user copy and zero copy operations on the completed ublk request. It also triggers the WARN_ON_ONCE(refcount_read(&io->ref)) warnings in ublk_queue_reinit() and ublk_deinit_queue(). Commit c5c5eb24ed61 ("ublk: avoid ublk_io_release() called after ublk char dev is closed") already fixed the issue for ublk devices using UBLK_F_SUPPORT_ZERO_COPY or UBLK_F_AUTO_BUF_REG. However, the reference count leak also affects UBLK_F_USER_COPY, the other reference-counted data copy mode. Fix the condition in ublk_check_and_reset_active_ref() to include all reference-counted data copy modes. This ensures that any ublk requests still owned by the ublk server when it exits have their reference counts reset to 0. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: e63d2228ef83 ("ublk: simplify aborting ublk request") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-02 | block: rnbd-clt: Fix leaked ID in init_dev() | Thomas Fourier | 1 | -5/+8
[ Upstream commit c9b5645fd8ca10f310e41b07540f98e6a9720f40 ] If kstrdup() fails in init_dev(), then the newly allocated ID is lost. Fixes: 64e8a6ece1a5 ("block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-02 | ublk: fix deadlock when reading partition table | Ming Lei | 1 | -4/+28
[ Upstream commit c258f5c4502c9667bccf5d76fa731ab9c96687c1 ]

When one process (such as udev) opens ublk block device (e.g., to read the partition table via bdev_open()), a deadlock[1] can occur:

1. bdev_open() grabs disk->open_mutex
2. The process issues read I/O to ublk backend to read partition table
3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request() runs bio->bi_end_io() callbacks
4. If this triggers fput() on file descriptor of ublk block device, the work may be deferred to current task's task work (see fput() implementation)
5. This eventually calls blkdev_release() from the same context
6. blkdev_release() tries to grab disk->open_mutex again
7. Deadlock: same task waiting for a mutex it already holds

The fix is to run blk_update_request() and blk_mq_end_request() with bottom halves disabled. This forces blkdev_release() to run in kernel work-queue context instead of current task work context, allows ublk server to make forward progress, and avoids the deadlock.

Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Link: https://github.com/ublk-org/ublksrv/issues/170 [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
[axboe: rewrite comment in ublk]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-02 | ublk: refactor auto buffer register in ublk_dispatch_req() | Ming Lei | 1 | -21/+43
[ Upstream commit 0a9beafa7c633e6ff66b05b81eea78231b7e6520 ] Refactor auto buffer register code and prepare for supporting batch IO feature, and the main motivation is to put 'ublk_io' operation code together, so that per-io lock can be applied for the code block. The key changes are: - Rename ublk_auto_buf_reg() as ublk_do_auto_buf_reg() - Introduce an enum `auto_buf_reg_res` to represent the result of the buffer registration attempt (FAIL, FALLBACK, OK). - Split the existing `ublk_do_auto_buf_reg` function into two: - `__ublk_do_auto_buf_reg`: Performs the actual buffer registration and returns the `auto_buf_reg_res` status. - `ublk_do_auto_buf_reg`: A wrapper that calls the internal function and handles the I/O preparation based on the result. - Introduce `ublk_prep_auto_buf_reg_io` to encapsulate the logic for preparing the I/O for completion after buffer registration. - Pass the `tag` directly to `ublk_auto_buf_reg_fallback` to avoid recalculating it. This refactoring makes the control flow clearer and isolates the different stages of the auto buffer registration process. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: c258f5c4502c ("ublk: fix deadlock when reading partition table") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-02 | ublk: add `union ublk_io_buf` with improved naming | Ming Lei | 1 | -18/+22
[ Upstream commit 8d61ece156bd4f2b9e7d3b2a374a26d42c7a4a06 ] Add `union ublk_io_buf` for naming the anonymous union of struct ublk_io's addr and buf fields, meantime apply it to `struct ublk_io` for storing either ublk auto buffer register data or ublk server io buffer address. The union uses clear field names: - `addr`: for regular ublk server io buffer addresses - `auto_reg`: for ublk auto buffer registration data This eliminates confusing access patterns and improves code readability. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: c258f5c4502c ("ublk: fix deadlock when reading partition table") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-02 | ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() | Ming Lei | 1 | -5/+7
[ Upstream commit 3035b9b46b0611898babc0b96ede65790d3566f7 ] Add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() and prepare for reusing this helper for the coming UBLK_BATCH_IO feature, which can fetch & commit one batch of io commands via single uring_cmd. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: c258f5c4502c ("ublk: fix deadlock when reading partition table") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 | ublk: prevent invalid access with DEBUG | Kevin Brodsky | 1 | -2/+2
[ Upstream commit c6a45ee7607de3a350008630f4369b1b5ac80884 ] ublk_ch_uring_cmd_local() may jump to the out label before initialising the io pointer. This will cause trouble if DEBUG is defined, because the pr_devel() call dereferences io. Clang reports: drivers/block/ublk_drv.c:2403:6: error: variable 'io' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized] 2403 | if (tag >= ub->dev_info.queue_depth) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/block/ublk_drv.c:2492:32: note: uninitialized use occurs here 2492 | __func__, cmd_op, tag, ret, io->flags); | Fix this by initialising io to NULL and checking it before dereferencing it. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 | ps3disk: use memcpy_{from,to}_bvec index | Rene Rebe | 1 | -0/+4
[ Upstream commit 79bd8c9814a273fa7ba43399e1c07adec3fc95db ] With 6e0a48552b8c (ps3disk: use memcpy_{from,to}_bvec) converting ps3disk to new bvec helpers, incrementing the offset was accidentally lost, corrupting consecutive buffers. Restore index for non-corrupted data transfers. Fixes: 6e0a48552b8c (ps3disk: use memcpy_{from,to}_bvec) Signed-off-by: René Rebe <rene@exactco.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 | nbd: defer config unlock in nbd_genl_connect | Zheng Qixing | 1 | -1/+2
[ Upstream commit 1649714b930f9ea6233ce0810ba885999da3b5d4 ] There is one use-after-free warning when running NBD_CMD_CONNECT and NBD_CLEAR_SOCK: nbd_genl_connect nbd_alloc_and_init_config // config_refs=1 nbd_start_device // config_refs=2 set NBD_RT_HAS_CONFIG_REF open nbd // config_refs=3 recv_work done // config_refs=2 NBD_CLEAR_SOCK // config_refs=1 close nbd // config_refs=0 refcount_inc -> uaf ------------[ cut here ]------------ refcount_t: addition on 0; use-after-free. WARNING: CPU: 24 PID: 1014 at lib/refcount.c:25 refcount_warn_saturate+0x12e/0x290 nbd_genl_connect+0x16d0/0x1ab0 genl_family_rcv_msg_doit+0x1f3/0x310 genl_rcv_msg+0x44a/0x790 The issue can be easily reproduced by adding a small delay before refcount_inc(&nbd->config_refs) in nbd_genl_connect(): mutex_unlock(&nbd->config_lock); if (!ret) { set_bit(NBD_RT_HAS_CONFIG_REF, &config->runtime_flags); + printk("before sleep\n"); + mdelay(5 * 1000); + printk("after sleep\n"); refcount_inc(&nbd->config_refs); nbd_connect_reply(info, nbd->index); } Fixes: e46c7287b1c2 ("nbd: add a basic netlink interface") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18 | nbd: defer config put in recv_work | Zheng Qixing | 1 | -1/+1
[ Upstream commit 9517b82d8d422d426a988b213fdd45c6b417b86d ] There is one uaf issue in recv_work when running NBD_CLEAR_SOCK and NBD_CMD_RECONFIGURE: nbd_genl_connect // conf_ref=2 (connect and recv_work A) nbd_open // conf_ref=3 recv_work A done // conf_ref=2 NBD_CLEAR_SOCK // conf_ref=1 nbd_genl_reconfigure // conf_ref=2 (trigger recv_work B) close nbd // conf_ref=1 recv_work B config_put // conf_ref=0 atomic_dec(&config->recv_threads); -> UAF Or only running NBD_CLEAR_SOCK: nbd_genl_connect // conf_ref=2 nbd_open // conf_ref=3 NBD_CLEAR_SOCK // conf_ref=2 close nbd nbd_release config_put // conf_ref=1 recv_work config_put // conf_ref=0 atomic_dec(&config->recv_threads); -> UAF Commit 87aac3a80af5 ("nbd: call nbd_config_put() before notifying the waiter") moved nbd_config_put() to run before waking up the waiter in recv_work, in order to ensure that nbd_start_device_ioctl() would not be woken up while nbd->task_recv was still uncleared. However, in nbd_start_device_ioctl(), after being woken up it explicitly calls flush_workqueue() to make sure all current works are finished. Therefore, there is no need to move the config put ahead of the wakeup. Move nbd_config_put() to the end of recv_work, so that the reference is held for the whole lifetime of the worker thread. This makes sure the config cannot be freed while recv_work is still running, even if clear + reconfigure interleave. 
In addition, we don't need to worry about recv_work dropping the last nbd_put (which causes deadlock): path A (netlink with NBD_CFLAG_DESTROY_ON_DISCONNECT): connect // nbd_refs=1 (trigger recv_work) open nbd // nbd_refs=2 NBD_CLEAR_SOCK close nbd nbd_release nbd_disconnect_and_put flush_workqueue // recv_work done nbd_config_put nbd_put // nbd_refs=1 nbd_put // nbd_refs=0 queue_work path B (netlink without NBD_CFLAG_DESTROY_ON_DISCONNECT): connect // nbd_refs=2 (trigger recv_work) open nbd // nbd_refs=3 NBD_CLEAR_SOCK // conf_refs=2 close nbd nbd_release nbd_config_put // conf_refs=1 nbd_put // nbd_refs=2 recv_work done // conf_refs=0, nbd_refs=1 rmmod // nbd_refs=0 Reported-by: syzbot+56fbf4c7ddf65e95c7cc@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6907edce.a70a0220.37351b.0014.GAE@google.com/T/ Fixes: 87aac3a80af5 ("nbd: make the config put is called before the notifying the waiter") Depends-on: e2daec488c57 ("nbd: Fix hungtask when nbd_config_put") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-31 | Merge tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds | 1 | -0/+1
Pull block fixes from Jens Axboe:
- Fix blk-crypto reporting EIO when EINVAL is the correct error code
- Two bug fixes for the block zone support
- NVME pull request via Keith:
  - Target side authentication fixup
  - Peer-to-peer metadata fixup
- null_blk DMA alignment fix

* tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  null_blk: set dma alignment to logical block size
  blk-crypto: use BLK_STS_INVAL for alignment errors
  block: make REQ_OP_ZONE_OPEN a write operation
  block: fix op_is_zone_mgmt() to handle REQ_OP_ZONE_RESET_ALL
  nvme-pci: use blk_map_iter for p2p metadata
  nvmet-auth: update sc_c in host response
2025-10-31 | null_blk: set dma alignment to logical block size | Hans Holmberg | 1 | -0/+1
This driver assumes that bio vectors are memory aligned to the logical block size, so set the queue limit to reflect that. Unless we set up the limit based on the logical block size, we will go out of page bounds in copy_to_nullb / copy_from_nullb. Apparently this wasn't noticed so far because none of the tests generate such buffers, but since commit 851c4c96db00 ("xfs: implement XFS_IOC_DIOINFO in terms of vfs_getattr") xfstests generates unaligned I/O, which now leads to memory corruption when using null_blk devices with 4k block size. Fixes: bf8d08532bc1 ("iomap: add support for dma aligned direct-io") Fixes: b1a000d3b8ec ("block: relax direct io memory alignment") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-24 | Merge tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds | 1 | -0/+15
Pull block fixes from Jens Axboe:
- Fix dma alignment for PI
- Fix selinux bogosity with nbd, where sendmsg would get rejected

* tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  block: require LBA dma_alignment when using PI
  nbd: override creds to kernel when calling sock_{send,recv}msg()
2025-10-20 | nbd: override creds to kernel when calling sock_{send,recv}msg() | Ondrej Mosnacek | 1 | -0/+15
sock_{send,recv}msg() internally calls security_socket_{send,recv}msg(), which does security checks (e.g. SELinux) for socket access against the current task. However, _sock_xmit() in drivers/block/nbd.c may be called indirectly from a userspace syscall, where the NBD socket access would be incorrectly checked against the calling userspace task (which simply tries to read/write a file that happens to reside on an NBD device). To fix this, temporarily override creds to kernel ones before calling the sock_*() functions. This allows the security modules to recognize this as internal access by the kernel, which will normally be allowed. A way to trigger the issue is to do the following (on a system with SELinux set to enforcing): ### Create nbd device: truncate -s 256M /tmp/testfile nbd-server localhost:10809 /tmp/testfile ### Connect to the nbd server: nbd-client localhost ### Create mdraid array mdadm --create -l 1 -n 2 /dev/md/testarray /dev/nbd0 missing After these steps, assuming the SELinux policy doesn't allow the unexpected access pattern, errors will be visible on the kernel console: [ 142.204243] nbd0: detected capacity change from 0 to 524288 [ 165.189967] md: async del_gendisk mode will be removed in future, please upgrade to mdadm-4.5+ [ 165.252299] md/raid1:md127: active with 1 out of 2 mirrors [ 165.252725] md127: detected capacity change from 0 to 522240 [ 165.255434] block nbd0: Send control failed (result -13) [ 165.255718] block nbd0: Request send failed, requeueing [ 165.256006] block nbd0: Dead connection, failed to find a fallback [ 165.256041] block nbd0: Receive control failed (result -32) [ 165.256423] block nbd0: shutting down sockets [ 165.257196] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.257736] Buffer I/O error on dev md127, logical block 0, async page read [ 165.258263] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.259376] Buffer I/O error on dev md127, 
logical block 0, async page read [ 165.259920] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.260628] Buffer I/O error on dev md127, logical block 0, async page read [ 165.261661] ldm_validate_partition_table(): Disk read failed. [ 165.262108] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.262769] Buffer I/O error on dev md127, logical block 0, async page read [ 165.263697] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.264412] Buffer I/O error on dev md127, logical block 0, async page read [ 165.265412] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.265872] Buffer I/O error on dev md127, logical block 0, async page read [ 165.266378] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.267168] Buffer I/O error on dev md127, logical block 0, async page read [ 165.267564] md127: unable to read partition table [ 165.269581] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.269960] Buffer I/O error on dev nbd0, logical block 0, async page read [ 165.270316] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.270913] Buffer I/O error on dev nbd0, logical block 0, async page read [ 165.271253] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.271809] Buffer I/O error on dev nbd0, logical block 0, async page read [ 165.272074] ldm_validate_partition_table(): Disk read failed. [ 165.272360] nbd0: unable to read partition table [ 165.289004] ldm_validate_partition_table(): Disk read failed. 
[ 165.289614] nbd0: unable to read partition table The corresponding SELinux denial on Fedora/RHEL will look like this (assuming it's not silenced): type=AVC msg=audit(1758104872.510:116): avc: denied { write } for pid=1908 comm="mdadm" laddr=::1 lport=32772 faddr=::1 fport=10809 scontext=system_u:system_r:mdadm_t:s0-s0:c0.c1023 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=tcp_socket permissive=0 The respective backtrace looks like this: @security[mdadm, -13, handshake_exit+221615650 handshake_exit+221615650 handshake_exit+221616465 security_socket_sendmsg+5 sock_sendmsg+106 handshake_exit+221616150 sock_sendmsg+5 __sock_xmit+162 nbd_send_cmd+597 nbd_handle_cmd+377 nbd_queue_rq+63 blk_mq_dispatch_rq_list+653 __blk_mq_do_dispatch_sched+184 __blk_mq_sched_dispatch_requests+333 blk_mq_sched_dispatch_requests+38 blk_mq_run_hw_queue+239 blk_mq_dispatch_plug_list+382 blk_mq_flush_plug_list.part.0+55 __blk_flush_plug+241 __submit_bio+353 submit_bio_noacct_nocheck+364 submit_bio_wait+84 __blkdev_direct_IO_simple+232 blkdev_read_iter+162 vfs_read+591 ksys_read+95 do_syscall_64+92 entry_SYSCALL_64_after_hwframe+120 ]: 1 The issue has started to appear since commit 060406c61c7c ("block: add plug while submitting IO"). Cc: Ming Lei <ming.lei@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2348878 Fixes: 060406c61c7c ("block: add plug while submitting IO") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Stephen Smalley <stephen.smalley.work@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-10 | Merge tag 'block-6.18-20251009' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds | 1 | -3/+7
Pull block fixes from Jens Axboe:
- Don't include __GFP_NOWARN for loop worker allocation, as it already uses GFP_NOWAIT which has __GFP_NOWARN set already
- Small series cleaning up the recent bio_iov_iter_get_pages() changes
- loop fix for leaking the backing reference file, if validation fails
- Update of a comment pertaining to disk/partition stat locking

* tag 'block-6.18-20251009' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  loop: remove redundant __GFP_NOWARN flag
  block: move bio_iov_iter_get_bdev_pages to block/fops.c
  iomap: open code bio_iov_iter_get_bdev_pages
  block: rename bio_iov_iter_get_pages_aligned to bio_iov_iter_get_pages
  block: remove bio_iov_iter_get_pages
  block: Update a comment of disk statistics
  loop: fix backing file reference leak on validation error
2025-10-08 | loop: remove redundant __GFP_NOWARN flag | Pedro Demarchi Gomes | 1 | -1/+1
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-03 | Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 2 | -16/+9
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when performing slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extends DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionally making the huge_zero_page persistent, instead of releasing it when its refcount falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code
- "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system".
  It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zpool - an indirection layer which now only redirects to a single thing (zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code
- "mm: remove nth_page()" from David
Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improve their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements.
This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in the relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implements reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khugepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap().
This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-10-03loop: fix backing file reference leak on validation errorLi Chen1-2/+6
loop_change_fd() and loop_configure() call loop_check_backing_file() to validate the new backing file. If validation fails, the reference acquired by fget() was not dropped, leaking a file reference. Fix this by calling fput(file) before returning the error. Cc: stable@vger.kernel.org Cc: Markus Elfring <Markus.Elfring@web.de> CC: Yang Erkun <yangerkun@huawei.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Yu Kuai <yukuai1@huaweicloud.com> Fixes: f5c84eff634b ("loop: Add sanity check for read/write_iter") Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
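The leak and its fix can be sketched in plain userspace C. This is a minimal model under stated assumptions, not the loop driver's code: `struct file_ref`, `get_file_ref()`, `put_file_ref()`, `check_backing_file()` and `change_backing_file()` are hypothetical stand-ins for the kernel's struct file, fget(), fput(), loop_check_backing_file() and its callers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted struct file. */
struct file_ref {
	int refcount;
	bool supports_rw_iter;	/* the property the validation step checks */
};

/* Models fget(): take a reference on the file. */
static struct file_ref *get_file_ref(struct file_ref *f)
{
	f->refcount++;
	return f;
}

/* Models fput(): drop a reference that was taken earlier. */
static void put_file_ref(struct file_ref *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/* Models the validation step: reject files without read/write_iter. */
static int check_backing_file(const struct file_ref *f)
{
	return f->supports_rw_iter ? 0 : -EINVAL;
}

/* Models the fixed caller: the error exit drops the reference it took. */
static int change_backing_file(struct file_ref *candidate)
{
	struct file_ref *file = get_file_ref(candidate);
	int err = check_backing_file(file);

	if (err) {
		put_file_ref(file);	/* the call that was missing before the fix */
		return err;
	}

	/* On success the reference is kept; the device now owns it. */
	return 0;
}
```

In this model, a rejected file leaves `change_backing_file()` with the same reference count it had on entry, while an accepted file keeps the extra reference.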
2025-10-02Merge tag 'for-6.18/block-20250929' of ↵Linus Torvalds23-296/+617
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - FC target fixes (Daniel) - Authentication fixes and updates (Martin, Chris) - Admin controller handling (Kamaljit) - Target lockdep assertions (Max) - Keep-alive updates for discovery (Alastair) - Suspend quirk (Georg) - MD pull request via Yu: - Add support for a lockless bitmap. A key feature of the new bitmap is that the IO fastpath is lockless. If a user issues lots of write IO to the same bitmap bit in a short time, only the first write has the additional overhead of updating the bitmap bit; the following writes have none. Because only written data is resynced or recovered, there is no need to do a full disk resync/recovery when creating a new array or replacing a disk. - Switch ->getgeo() and ->bios_param() to using struct gendisk rather than struct block_device. - Rust block changes via Andreas. This series adds configuration via configfs and remote completion to the rnull driver. The series also includes a set of changes to the rust block device driver API: a few cleanup patches, and a few features supporting the rnull changes. The series removes the raw buffer formatting logic from `kernel::block` and improves the logic available in `kernel::string` to support the same use as the removed logic. - floppy arch cleanups - Reduce the number of dereferences needed for ublk commands - Restrict supported sockets for nbd. Mostly done to eliminate a class of issues perpetually reported by syzbot, triggered by nonsensical socket setups. - A few s390 dasd block fixes - Fix a few issues around atomic writes - Improve DMA iteration for integrity requests - Improve how iovecs are treated with regard to O_DIRECT alignment constraints. We used to require each segment to adhere to the constraints, now only the request as a whole needs to.
- Clean up and improve p2p support, enabling use of p2p for metadata payloads - Improve locking of request lookup, using SRCU where appropriate - Use page references properly for brd, avoiding very long RCU sections - Fix ordering of recursively submitted IOs - Clean up and improve updating nr_requests for a live device - Various fixes and cleanups * tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits) s390/dasd: enforce dma_alignment to ensure proper buffer validation s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request ublk: remove redundant zone op check in ublk_setup_iod() nvme: Use non zero KATO for persistent discovery connections nvmet: add safety check for subsys lock nvme-core: use nvme_is_io_ctrl() for I/O controller check nvme-core: do ioccsz/iorcsz validation only for I/O controllers nvme-core: add method to check for an I/O controller blk-cgroup: fix possible deadlock while configuring policy blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path blk-mq: Fix more tag iteration function documentation selftests: ublk: fix behavior when fio is not installed ublk: don't access ublk_queue in ublk_unmap_io() ublk: pass ublk_io to __ublk_complete_rq() ublk: don't access ublk_queue in ublk_need_complete_req() ublk: don't access ublk_queue in ublk_check_commit_and_fetch() ublk: don't pass ublk_queue to ublk_fetch() ublk: don't access ublk_queue in ublk_config_io_buf() ublk: don't access ublk_queue in ublk_check_fetch_buf() ublk: pass q_id and tag to __ublk_check_and_get_req() ...
2025-10-02Merge tag 'for-6.18/io_uring-20250929' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Store ring provided buffers locally for the users, rather than stuff them into struct io_kiocb. These types of buffers must always be fully consumed or recycled in the current context, and leaving them in struct io_kiocb is hence not a good idea as that struct has a vastly different life time. Basically just an architecture cleanup that can help prevent issues with ring provided buffers in the future. - Support for mixed CQE sizes in the same ring. Before this change, a CQ ring either used the default 16b CQEs, or it was set up with 32b CQE using IORING_SETUP_CQE32. For use cases where a few 32b CQEs were needed, this caused everything else to use big CQEs. This is wasteful both in terms of memory usage and memory bandwidth for the posted CQEs. With IORING_SETUP_CQE_MIXED, applications may use request types that post both normal 16b and big 32b CQEs on the same ring. - Add helpers for async data management, to make it harder for opcode handlers to mess it up. - Add support for multishot for uring_cmd, which ublk can use. This helps improve efficiency, by providing a persistent request type that can trigger multiple CQEs. - Add initial support for ring feature querying. We had basic support for probe operations, but the API isn't great. Rather than expand that, add support for QUERY which is easily expandable and can cover a lot more cases than the existing probe support. This will help applications get a better idea of what operations are supported on a given host. - zcrx improvements from Pavel: - Improve refill entry alignment for better caching - Various cleanups, especially around deduplicating normal memory vs dmabuf setup. - Generalisation of the niov size (Patch 12). It's still hard coded to PAGE_SIZE on init, but will let the user specify the rx buffer length on setup. - Syscall / synchronous buffer return.
It'll be used as a slow fallback path for returning buffers when the refill queue is full. Useful for tolerating slight queue size misconfiguration or with inconsistent load. - Accounting more memory to cgroups. - Additional independent cleanups that will also be useful for multi-area support. - Various fixes and cleanups * tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring/cmd: drop unused res2 param from io_uring_cmd_done() io_uring: fix nvme's 32b cqes on mixed cq io_uring/query: cap number of queries io_uring/query: prevent infinite loops io_uring/zcrx: account niov arrays to cgroup io_uring/zcrx: allow synchronous buffer return io_uring/zcrx: introduce io_parse_rqe() io_uring/zcrx: don't adjust free cache space io_uring/zcrx: use guards for the refill lock io_uring/zcrx: reduce netmem scope in refill io_uring/zcrx: protect netdev with pp_lock io_uring/zcrx: rename dma lock io_uring/zcrx: make niov size variable io_uring/zcrx: set sgt for umem area io_uring/zcrx: remove dmabuf_offset io_uring/zcrx: deduplicate area mapping io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback() io_uring/zcrx: check all niovs filled with dma addresses io_uring/zcrx: move area reg checks into io_import_area io_uring/zcrx: don't pass slot to io_zcrx_create_area ...
2025-10-01Merge tag 'rust-6.18' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux Pull rust updates from Miguel Ojeda: "Toolchain and infrastructure: - Derive 'Zeroable' for all structs and unions generated by 'bindgen' where possible and corresponding cleanups. To do so, add the 'pin-init' crate as a dependency to 'bindings' and 'uapi'. It also includes its first use in the 'cpufreq' module, with more to come in the next cycle. - Add warning to the 'rustdoc' target to detect broken 'srctree/' links and fix existing cases. - Remove support for unused (since v6.16) host '#[test]'s, simplifying the 'rusttest' target. Tests should generally run within KUnit. 'kernel' crate: - Add 'ptr' module with a new 'Alignment' type, which is always a power of two and is used to validate that a given value is a valid alignment and to perform masking and alignment operations: // Checked at build time. assert_eq!(Alignment::new::<16>().as_usize(), 16); // Checked at runtime. assert_eq!(Alignment::new_checked(15), None); assert_eq!(Alignment::of::<u8>().log2(), 0); assert_eq!(0x25u8.align_down(Alignment::new::<0x10>()), 0x20); assert_eq!(0x5u8.align_up(Alignment::new::<0x10>()), Some(0x10)); assert_eq!(u8::MAX.align_up(Alignment::new::<0x10>()), None); It also includes its first use in Nova. - Add 'core::mem::{align,size}_of{,_val}' to the prelude, matching Rust 1.80.0. - Keep going with the steps on our migration to the standard library 'core::ffi::CStr' type (use 'kernel::{fmt, prelude::fmt!}' and use upstream method names). - 'error' module: improve 'Error::from_errno' and 'to_result' documentation, including examples/tests. - 'sync' module: extend 'aref' submodule documentation now that it exists, and more updates to complete the ongoing move of 'ARef' and 'AlwaysRefCounted' to 'sync::aref'. - 'list' module: add an example/test for 'ListLinksSelfPtr' usage. - 'alloc' module: - Implement 'Box::pin_slice()', which constructs a pinned slice of elements. 
- Provide information about the minimum alignment guarantees of 'Kmalloc', 'Vmalloc' and 'KVmalloc'. - Take minimum alignment guarantees of allocators for 'ForeignOwnable' into account. - Remove the 'allocator_test' (including 'Cmalloc'). - Add doctest for 'Vec::as_slice()'. - Constify various methods. - 'time' module: - Add methods on 'HrTimer' that can only be called with exclusive access to an unarmed timer, or from timer callback context. - Add arithmetic operations to 'Instant' and 'Delta'. - Add a few convenience and access methods to 'HrTimer' and 'Instant'. 'macros' crate: - Reduce collections in 'quote!' macro. And a few other cleanups and improvements" * tag 'rust-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (58 commits) gpu: nova-core: use Alignment for alignment-related operations rust: add `Alignment` type rust: macros: reduce collections in `quote!` macro rust: acpi: use `core::ffi::CStr` method names rust: of: use `core::ffi::CStr` method names rust: net: use `core::ffi::CStr` method names rust: miscdevice: use `core::ffi::CStr` method names rust: kunit: use `core::ffi::CStr` method names rust: firmware: use `core::ffi::CStr` method names rust: drm: use `core::ffi::CStr` method names rust: cpufreq: use `core::ffi::CStr` method names rust: configfs: use `core::ffi::CStr` method names rust: auxiliary: use `core::ffi::CStr` method names drm/panic: use `core::ffi::CStr` method names rust: device: use `kernel::{fmt,prelude::fmt!}` rust: sync: use `kernel::{fmt,prelude::fmt!}` rust: seq_file: use `kernel::{fmt,prelude::fmt!}` rust: kunit: use `kernel::{fmt,prelude::fmt!}` rust: file: use `kernel::{fmt,prelude::fmt!}` rust: device: use `kernel::{fmt,prelude::fmt!}` ...
2025-09-24ublk: remove redundant zone op check in ublk_setup_iod()Caleb Sander Mateos1-5/+0
ublk_setup_iod() checks first whether the request is a zoned operation issued to a device without zoned support and returns BLK_STS_IOERR if so. However, such a request would already hit the default case in the subsequent switch statement and fail the ublk_queue_is_zoned() check, which also results in a return of BLK_STS_IOERR. So remove the redundant early check for unsupported zone ops. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
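Why the early check is redundant can be seen in a small userspace model. This is only a sketch: the opcodes, status codes and function names below are hypothetical, and the switch is reduced to the two cases that matter for the argument.

```c
#include <stdbool.h>

/* Hypothetical request opcodes and status codes, for illustration only. */
enum req_op { OP_READ, OP_WRITE, OP_ZONE_OPEN, OP_ZONE_RESET };
enum status { STS_OK, STS_IOERR };

static bool op_is_zone_op(enum req_op op)
{
	return op == OP_ZONE_OPEN || op == OP_ZONE_RESET;
}

/* Shape before the cleanup: an early zone check ahead of the switch. */
static enum status setup_with_early_check(enum req_op op, bool zoned)
{
	if (!zoned && op_is_zone_op(op))
		return STS_IOERR;

	switch (op) {
	case OP_READ:
	case OP_WRITE:
		return STS_OK;
	default:
		/* Anything else is only valid on a zoned queue. */
		return zoned ? STS_OK : STS_IOERR;
	}
}

/* Shape after the cleanup: the default case alone rejects the request. */
static enum status setup_without_early_check(enum req_op op, bool zoned)
{
	switch (op) {
	case OP_READ:
	case OP_WRITE:
		return STS_OK;
	default:
		return zoned ? STS_OK : STS_IOERR;
	}
}
```

Both functions return the same status for every combination of opcode and zoned support, which is the property the commit relies on.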
2025-09-23io_uring/cmd: drop unused res2 param from io_uring_cmd_done()Caleb Sander Mateos1-3/+3
Commit 79525b51acc1 ("io_uring: fix nvme's 32b cqes on mixed cq") split out a separate io_uring_cmd_done32() helper for ->uring_cmd() implementations that return 32-byte CQEs. The res2 value passed to io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores it when is_cqe32 is passed as false. So drop the parameter from io_uring_cmd_done() to simplify the callers and clarify that it's not possible to return an extra value beyond the 32-bit CQE result. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-22aoe: stop calling page_address() in free_page()Vishal Moola (Oracle)1-1/+1
free_page() should be used when we only have a virtual address. We should call __free_page() directly on our page instead. Link: https://lkml.kernel.org/r/20250903185921.1785167-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-22Merge branch 'mm-hotfixes-stable' into mm-stable in order to pick upAndrew Morton1-5/+3
changes required by mm-stable material: hugetlb and damon.
2025-09-20ublk: don't access ublk_queue in ublk_unmap_io()Caleb Sander Mateos1-10/+14
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_unmap_io() is a frequent cache miss. Pass to __ublk_complete_rq() whether the ublk server's data buffer needs to be copied to the request. In the callers __ublk_fail_req() and ublk_ch_uring_cmd_local(), get the flags from the ublk_device instead, as its flags have just been read. In ublk_put_req_ref(), pass false since all the features that require reference counting disable copying of the data buffer upon completion. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: pass ublk_io to __ublk_complete_rq()Caleb Sander Mateos1-6/+5
All callers of __ublk_complete_rq() already know the ublk_io. Pass it in to avoid looking it up again. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't access ublk_queue in ublk_need_complete_req()Caleb Sander Mateos1-3/+3
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_need_complete_req() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't access ublk_queue in ublk_check_commit_and_fetch()Caleb Sander Mateos1-4/+4
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_check_commit_and_fetch() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't pass ublk_queue to ublk_fetch()Caleb Sander Mateos1-3/+2
ublk_fetch() only uses the ublk_queue to get the ublk_device, which its caller already has. So just pass the ublk_device directly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't access ublk_queue in ublk_config_io_buf()Caleb Sander Mateos1-5/+5
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_config_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't access ublk_queue in ublk_check_fetch_buf()Caleb Sander Mateos1-4/+4
Obtain the ublk device flags from ublk_device to avoid needing to access the ublk_queue, which may be a cache miss. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: pass q_id and tag to __ublk_check_and_get_req()Caleb Sander Mateos1-13/+11
__ublk_check_and_get_req() only uses its ublk_queue argument to get the q_id and tag. Pass those arguments explicitly to save an access to the ublk_queue. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't access ublk_queue in ublk_daemon_register_io_buf()Caleb Sander Mateos1-1/+1
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_daemon_register_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't access ublk_queue in ublk_register_io_buf()Caleb Sander Mateos1-1/+1
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_register_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: pass ublk_device to ublk_register_io_buf()Caleb Sander Mateos1-4/+6
Avoid repeating the 2 dereferences to get the ublk_device from the io_uring_cmd by passing it from ublk_ch_uring_cmd_local() to ublk_register_io_buf(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't dereference ublk_queue in ublk_check_and_get_req()Caleb Sander Mateos1-2/+2
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_ch_{read,write}_iter() is a frequent cache miss. Get the flags and queue depth from the ublk_device instead, which is accessed just before. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't dereference ublk_queue in ublk_ch_uring_cmd_local()Caleb Sander Mateos1-1/+1
For ublk servers with many ublk queues, accessing the ublk_queue to handle a ublk command is a frequent cache miss. Get the queue depth from the ublk_device instead, which is accessed just before. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: add helpers to check ublk_device flagsCaleb Sander Mateos1-0/+34
Introduce ublk_device analogues of the ublk_queue flag helpers: - ublk_support_zero_copy() -> ublk_dev_support_zero_copy() - ublk_support_auto_buf_reg() -> ublk_dev_support_auto_buf_reg() - ublk_support_user_copy() -> ublk_dev_support_user_copy() - ublk_need_map_io() -> ublk_dev_need_map_io() - ublk_need_req_ref() -> ublk_dev_need_req_ref() - ublk_need_get_data() -> ublk_dev_need_get_data() These will be used in subsequent changes to avoid accessing the ublk_queue just for the flags, and instead use the ublk_device. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
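The pairing of queue-level and device-level helpers might look roughly like the following userspace sketch. Everything here is an assumption made for illustration: the `DEMO_F_*` bit values, the `struct flag_dev` and `struct flag_queue` types, and the rule used in `demo_dev_need_map_io()` are hypothetical and are not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real UBLK_F_* values are defined elsewhere. */
#define DEMO_F_ZERO_COPY	(1ULL << 0)
#define DEMO_F_USER_COPY	(1ULL << 1)
#define DEMO_F_AUTO_BUF_REG	(1ULL << 2)

/* One device, shared by all of its queues. */
struct flag_dev {
	uint64_t flags;
};

/* Each queue carries a copy of the device flags. */
struct flag_queue {
	uint64_t flags;
	struct flag_dev *dev;
};

/* Queue-level helper: reads the per-queue copy of the flags. */
static inline bool demo_support_zero_copy(const struct flag_queue *q)
{
	return (q->flags & DEMO_F_ZERO_COPY) != 0;
}

/* Device-level analogues: read the flags from the device itself. */
static inline bool demo_dev_support_zero_copy(const struct flag_dev *d)
{
	return (d->flags & DEMO_F_ZERO_COPY) != 0;
}

static inline bool demo_dev_support_user_copy(const struct flag_dev *d)
{
	return (d->flags & DEMO_F_USER_COPY) != 0;
}

static inline bool demo_dev_support_auto_buf_reg(const struct flag_dev *d)
{
	return (d->flags & DEMO_F_AUTO_BUF_REG) != 0;
}

/* Assumed rule: data is mapped/copied only when no other mode is enabled. */
static inline bool demo_dev_need_map_io(const struct flag_dev *d)
{
	return !demo_dev_support_zero_copy(d) &&
	       !demo_dev_support_user_copy(d) &&
	       !demo_dev_support_auto_buf_reg(d);
}
```

Since the queue only holds a copy of the device's flags, a caller that already has the device in cache gets the same answer from the `demo_dev_*()` variant without touching the queue structure.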
2025-09-20ublk: don't pass ublk_queue to __ublk_fail_req()Caleb Sander Mateos1-3/+3
__ublk_fail_req() only uses the ublk_queue to get the ublk_device, which its caller already has. So just pass the ublk_device directly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't pass q_id to ublk_queue_cmd_buf_size()Caleb Sander Mateos1-7/+5
ublk_queue_cmd_buf_size() only needs the queue depth, which is the same for all queues. Get the queue depth from the ublk_device instead so the q_id parameter can be dropped. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
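A size computation that depends only on the queue depth could be sketched as below. The descriptor layout, the 4096-byte page size, the page rounding and all `demo_*` names are assumptions chosen for illustration; they are not taken from the driver source.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical per-I/O descriptor; one slot per tag in the queue. */
struct demo_io_desc {
	uint32_t op_flags;
	uint32_t nr_sectors;
	uint64_t start_sector;
	uint64_t addr;
};

/* The depth lives on the device because every queue shares it. */
struct depth_dev {
	unsigned int queue_depth;
};

/* Round n up to the next multiple of align (align must be non-zero). */
static size_t demo_round_up(size_t n, size_t align)
{
	return (n + align - 1) / align * align;
}

/* No queue id is needed: the result is identical for every queue. */
static size_t demo_queue_cmd_buf_size(const struct depth_dev *dev)
{
	size_t raw = (size_t)dev->queue_depth * sizeof(struct demo_io_desc);

	return demo_round_up(raw, DEMO_PAGE_SIZE);
}
```

With a signature like this, callers pass only the device, and the per-queue lookup that a `q_id` parameter implied disappears.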
2025-09-20ublk: remove ubq check in ublk_check_and_get_req()Caleb Sander Mateos1-3/+0
ublk_get_queue() never returns a NULL pointer, so there's no need to check its return value in ublk_check_and_get_req(). Drop the check. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>