path: root/drivers/block
Age | Commit message | Author | Files | Lines
7 days | loop: use vfs_getattr_nosec for accurate file size | Rajeev Mishra | 1 | -2/+13
Use vfs_getattr_nosec() in lo_calculate_size() to get the file size, rather than just reading the cached inode size via i_size_read(). This provides better results than cached inode data, particularly for network filesystems where metadata may be stale. Signed-off-by: Rajeev Mishra <rajeevm@hpe.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250818184821.115033-3-rajeevm@hpe.com [axboe: massage commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
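The size arithmetic described here can be sketched in userspace. This is only an analogue under stated assumptions: `calc_size_sectors()` and `size_sectors_from_fd()` are hypothetical names, `fstat()` stands in for asking the filesystem for fresh attributes, and returning 0 on error is an assumption of the sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>

/* Userspace analogue, not kernel code: turn a backing-file size into a
 * device size in 512-byte sectors. All names here are hypothetical. */
int64_t calc_size_sectors(int64_t file_bytes, int64_t offset, int64_t sizelimit)
{
	int64_t loopsize = file_bytes;

	assert(file_bytes >= 0);
	if (offset > 0)
		loopsize -= offset;	/* skip the leading offset */
	if (loopsize < 0)
		return 0;		/* offset lies beyond the end of the file */
	if (sizelimit > 0 && sizelimit < loopsize)
		loopsize = sizelimit;	/* honour an explicit size limit */
	return loopsize >> 9;		/* bytes -> 512-byte sectors */
}

/* Query the file's attributes instead of trusting a previously cached size. */
int64_t size_sectors_from_fd(int fd, int64_t offset, int64_t sizelimit)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return 0;	/* assumption: report an empty device on error */
	return calc_size_sectors((int64_t)st.st_size, offset, sizelimit);
}
```

Keeping the arithmetic in a pure function makes it easy to check the offset and size-limit cases separately from the attribute lookup.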
7 days | loop: Consolidate size calculation logic into lo_calculate_size() | Rajeev Mishra | 1 | -17/+9
Rename get_size() to lo_calculate_size() and merge the logic from get_size() and get_loop_size() into a single function. Update all callers to use lo_calculate_size(). This is done in preparation for improving the size detection logic. Signed-off-by: Rajeev Mishra <rajeevm@hpe.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250818184821.115033-2-rajeevm@hpe.com [axboe: massage commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11 | ublk: check for unprivileged daemon on each I/O fetch | Caleb Sander Mateos | 1 | -9/+7
Commit ab03a61c6614 ("ublk: have a per-io daemon instead of a per-queue daemon") allowed each ublk I/O to have an independent daemon task. However, nr_privileged_daemon is only computed based on whether the last I/O fetched in each ublk queue has an unprivileged daemon task. Fix this by checking whether every fetched I/O's daemon is privileged. Change nr_privileged_daemon from a count of queues to a boolean indicating whether any I/Os have an unprivileged daemon. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: ab03a61c6614 ("ublk: have a per-io daemon instead of a per-queue daemon") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250808155216.296170-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
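A minimal sketch of the bookkeeping change, assuming a simplified model: `struct model_dev` and `model_fetch()` are hypothetical and do not mirror the driver's real structures. The point is that the flag is updated on every fetch and is sticky, so a later privileged fetch cannot hide an earlier unprivileged one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the bookkeeping; not the driver's structs. */
struct model_dev {
	bool unprivileged_daemons;	/* true once any fetched I/O has an unprivileged daemon */
};

/* Called for every fetched I/O, so no daemon escapes the check. */
void model_fetch(struct model_dev *dev, bool daemon_is_privileged)
{
	if (!daemon_is_privileged)
		dev->unprivileged_daemons = true;	/* sticky: later fetches never clear it */
}
```

A boolean is enough here because the only question asked of the device is whether any unprivileged daemon exists at all.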
2025-08-11 | ublk: don't quiesce in ublk_ch_release | Uday Shankar | 1 | -7/+5
ublk_ch_release currently quiesces the device's request_queue while setting force_abort/fail_io. This avoids data races by preventing concurrent reads from the I/O path, but is not strictly needed - at this point, canceling is already set and guaranteed to be observed by any concurrently executing I/Os, so they will be handled properly even if the changes to force_abort/fail_io propagate to the I/O path later. Remove the quiesce/unquiesce calls from ublk_ch_release. This makes the writes to force_abort/fail_io concurrent with the reads in the I/O path, so make the accesses atomic. Before this change, the call to blk_mq_quiesce_queue was responsible for most (90%) of the runtime of ublk_ch_release. With that call eliminated, ublk_ch_release runs much faster. Here is a comparison of the total time spent in calls to ublk_ch_release when a server handling 128 devices exits, before and after this change: before: 1.11s after: 0.09s Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250808-ublk_quiesce2-v1-1-f87ade33fa3d@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
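The idea of replacing a quiesce with atomic flag accesses can be sketched with C11 atomics. This is a userspace model under assumptions: the struct, the function names, and the use of relaxed `stdatomic.h` operations are illustrative stand-ins, not the primitives the driver itself uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: flags written on release, read concurrently by the I/O path. */
struct model_queue {
	atomic_bool canceling;	/* assumed to be set before the release path runs */
	atomic_bool force_abort;
	atomic_bool fail_io;
};

/* Release path: no quiesce, just atomic stores. */
void model_release(struct model_queue *q, bool fail_io)
{
	atomic_store_explicit(&q->force_abort, true, memory_order_relaxed);
	atomic_store_explicit(&q->fail_io, fail_io, memory_order_relaxed);
}

/* I/O path: a request is rejected as soon as canceling is seen, even if
 * the other two flags have not become visible yet. */
bool model_should_reject(struct model_queue *q)
{
	return atomic_load_explicit(&q->canceling, memory_order_relaxed) ||
	       atomic_load_explicit(&q->force_abort, memory_order_relaxed) ||
	       atomic_load_explicit(&q->fail_io, memory_order_relaxed);
}
```

Because the reader already acts on `canceling`, the later stores only need to be free of data races, not ordered against in-flight requests.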
2025-08-11 | drbd: Remove the open-coded page pool | Philipp Reisner | 4 | -347/+71
If the network stack keeps a reference for too long, DRBD keeps references on a higher number of pages as a consequence. Fix all that by no longer relying on page reference counts dropping to an expected value. Instead, DRBD gives up its reference and lets the system handle everything else. While at it, remove the open-coded custom page pool mechanism and use the page_pool included in the kernel. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Tested-by: Eric Hagberg <ehagberg@janestreet.com> Link: https://lore.kernel.org/r/20250605103852.23029-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-09 | Merge tag 'block-6.17-20250808' of git://git.kernel.dk/linux | Linus Torvalds | 1 | -1/+2
Pull more block updates from Jens Axboe:
 - MD pull request via Yu:
    - mddev null-ptr-dereference fix, by Erkun
    - md-cluster fail to remove the faulty disk regression fix, by Heming
    - minor cleanup, by Li Nan and Jinchao
    - mdadm lifetime regression fix reported by syzkaller, by Yu Kuai
 - NVMe pull request via Christoph:
    - add support for getting the FDP feature in fabrics passthru path (Nitesh Shetty)
    - add capability to connect to an administrative controller (Kamaljit Singh)
    - fix a leak on sgl setup error (Keith Busch)
    - initialize discovery subsys after debugfs is initialized (Mohamed Khalfella)
    - fix various comment typos (Bjorn Helgaas)
    - remove unneeded semicolons (Jiapeng Chong)
 - nvmet debugfs ordering issue fix
 - Fix UAF in the tag_set in zloop
 - Ensure sbitmap shallow depth covers entire set
 - Reduce lock roundtrips in io context lookup
 - Move scheduler tags alloc/free out of elevator and freeze lock, to fix some lockdep found issues
 - Improve robustness of queue limits checking
 - Fix a regression with IO priorities, if no io context exists
* tag 'block-6.17-20250808' of git://git.kernel.dk/linux: (26 commits)
 lib/sbitmap: make sbitmap_get_shallow() internal
 lib/sbitmap: convert shallow_depth from one word to the whole sbitmap
 nvmet: exit debugfs after discovery subsystem exits
 block, bfq: Reorder struct bfq_iocq_bfqq_data
 md: make rdev_addable usable for rcu mode
 md/raid1: remove struct pool_info and related code
 md/raid1: change r1conf->r1bio_pool to a pointer type
 block: ensure discard_granularity is zero when discard is not supported
 zloop: fix KASAN use-after-free of tag set
 block: Fix default IO priority if there is no IO context
 nvme: fix various comment typos
 nvme-auth: remove unneeded semicolon
 nvme-pci: fix leak on sgl setup error
 nvmet: initialize discovery subsys after debugfs is initialized
 nvme: add capability to connect to an administrative controller
 nvmet: add support for FDP in fabrics passthru path
 md: rename recovery_cp to resync_offset
 md/md-cluster: handle REMOVE message earlier
 md: fix create on open mddev lifetime regression
 block: fix potential deadlock while running nr_hw_queue update
 ...
2025-08-01 | Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 1 | -1/+1
Pull MM updates from Andrew Morton: "As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc. I never knew the MM code was so dirty. "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs. "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments. "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control. "drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code. "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code. "Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios. "selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code. "Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark. "Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page(). "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time. 
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD}" (SeongJae Park) provides some cleanup and consolidation work in DAMON. "use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types. "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code. "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_* flags. "mm/damon: decouple sysfs from core" (SeongJae Park) implements some cleanup and maintainability work in the DAMON sysfs layer. "madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code. "madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top of Lorenzo's effort. "Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier. "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice. "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite. "Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups. "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator. "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code. 
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code. "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that. "mm/damon: misc cleanups" (SeongJae Park) also does what it claims. "mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code. "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy. "Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions. "mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation. "mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface. "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably. "drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range(). "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval. "Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what it claims. 
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly. "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface. "__folio_split() clean up" (Zi Yan) cleans up __folio_split()! "Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios. "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code. "tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature. "selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters. 
Rather than the present minimal subset" * tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits) MAINTAINERS: add missing headers to mempory policy & migration section MAINTAINERS: add missing file to cgroup section MAINTAINERS: add MM MISC section, add missing files to MISC and CORE MAINTAINERS: add missing zsmalloc file MAINTAINERS: add missing files to page alloc section MAINTAINERS: add missing shrinker files MAINTAINERS: move memremap.[ch] to hotplug section MAINTAINERS: add missing mm_slot.h file THP section MAINTAINERS: add missing interval_tree.c to memory mapping section MAINTAINERS: add missing percpu-internal.h file to per-cpu section mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info() selftests/damon: introduce _common.sh to host shared function selftests/damon/sysfs.py: test runtime reduction of DAMON parameters selftests/damon/sysfs.py: test non-default parameters runtime commit selftests/damon/sysfs.py: generalize DAMON context commit assertion selftests/damon/sysfs.py: generalize monitoring attributes commit assertion selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion selftests/damon/sysfs.py: test DAMOS filters commitment selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion selftests/damon/sysfs.py: test DAMOS destinations commitment ...
2025-08-01 | zloop: fix KASAN use-after-free of tag set | Shin'ichiro Kawasaki | 1 | -1/+2
When a zoned loop device, or zloop device, is removed, a KASAN-enabled kernel reports "BUG KASAN use-after-free" in blk_mq_free_tag_set(). The BUG happens because zloop_ctl_remove() calls put_disk(), which invokes zloop_free_disk(). zloop_free_disk() frees the memory allocated for the zlo pointer. However, after the memory is freed, zloop_ctl_remove() calls blk_mq_free_tag_set(&zlo->tag_set), which accesses the freed zlo. Hence the KASAN use-after-free.

 zloop_ctl_remove()
  put_disk(zlo->disk)
   put_device()
    kobject_put()
     ...
      zloop_free_disk()
       kvfree(zlo)
  blk_mq_free_tag_set(&zlo->tag_set)

To avoid the BUG, move the call to blk_mq_free_tag_set(&zlo->tag_set) from zloop_ctl_remove() into zloop_free_disk(). This ensures that the tag_set is freed before the call to kvfree(zlo). Fixes: eb0570c7df23 ("block: new zoned loop block device driver") CC: stable@vger.kernel.org Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250731110745.165751-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
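The ordering rule behind this fix can be shown with a small userspace model. Everything below is hypothetical (`struct model_zlo`, `model_free_disk()`, and so on); it only illustrates tearing down an embedded member inside the final-release callback, before the containing object is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's objects. */
struct model_tag_set { int *tags; };
struct model_zlo { struct model_tag_set tag_set; };

int model_tag_sets_freed;	/* counts model_free_tag_set() calls */

void model_free_tag_set(struct model_tag_set *set)
{
	free(set->tags);
	set->tags = NULL;
	model_tag_sets_freed++;
}

/* Final-release callback: tear down embedded members first, then the container. */
void model_free_disk(struct model_zlo *zlo)
{
	model_free_tag_set(&zlo->tag_set);	/* zlo is still valid here */
	free(zlo);				/* nothing touches zlo after this */
}

struct model_zlo *model_alloc(void)
{
	struct model_zlo *zlo = calloc(1, sizeof(*zlo));

	if (zlo)
		zlo->tag_set.tags = calloc(4, sizeof(int));
	return zlo;
}
```

With the member teardown inside the release callback, the caller that drops the last reference has no reason to touch the object afterwards.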
2025-07-29 | Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux | Linus Torvalds | 14 | -3231/+449
Pull block updates from Jens Axboe:
 - MD pull request via Yu:
    - call del_gendisk synchronously (Xiao)
    - cleanup unused variable (John)
    - cleanup workqueue flags (Ryo)
    - fix faulty rdev can't be removed during resync (Qixing)
 - NVMe pull request via Christoph:
    - try PCIe function level reset on init failure (Keith Busch)
    - log TLS handshake failures at error level (Maurizio Lombardi)
    - pci-epf: do not complete commands twice if nvmet_req_init() fails (Rick Wertenbroek)
    - misc cleanups (Alok Tiwari)
 - Removal of the pktcdvd driver
   This has been more than a decade coming at this point, and some recently revealed breakages that had it causing issues even for cases where it isn't required made me re-pull the trigger on this one. It's known broken and nobody has stepped up to maintain the code
 - Series for ublk supporting batch commands, enabling the use of multishot where appropriate
 - Speed up ublk exit handling
 - Fix for the two-stage elevator fixing which could leak data
 - Convert NVMe to use the new IOVA based API
 - Increase default max transfer size to something more reasonable
 - Series fixing write operations on zoned DM devices
 - Add tracepoints for zoned block device operations
 - Prep series working towards improving blk-mq queue management in the presence of isolated CPUs
 - Don't allow updating of the block size of a loop device that is currently under exclusive ownership/open
 - Set chunk sectors from stacked device stripe size and use it for the atomic write size limit
 - Switch to folios in bcache read_super()
 - Fix for CD-ROM MRW exit flush handling
 - Various tweaks, fixes, and cleanups
* tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits)
 block: restore two stage elevator switch while running nr_hw_queue update
 cdrom: Call cdrom_mrw_exit from cdrom_release function
 sunvdc: Balance device refcount in vdc_port_mpgroup_check
 nvme-pci: try function level reset on init failure
 dm: split write BIOs on zone boundaries when zone append is not emulated
 block: use chunk_sectors when evaluating stacked atomic write limits
 dm-stripe: limit chunk_sectors to the stripe size
 md/raid10: set chunk_sectors limit
 md/raid0: set chunk_sectors limit
 block: sanitize chunk_sectors for atomic write limits
 ilog2: add max_pow_of_two_factor()
 nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails
 nvme-tcp: log TLS handshake failures at error level
 docs: nvme: fix grammar in nvme-pci-endpoint-target.rst
 nvme: fix typo in status code constant for self-test in progress
 nvmet: remove redundant assignment of error code in nvmet_ns_enable()
 nvme: fix incorrect variable in io cqes error message
 nvme: fix multiple spelling and grammar issues in host drivers
 block: fix blk_zone_append_update_request_bio() kernel-doc
 md/raid10: fix set but not used variable in sync_request_write()
 ...
2025-07-22 | sunvdc: Balance device refcount in vdc_port_mpgroup_check | Ma Ke | 1 | -1/+3
Using device_find_child() to locate a probed virtual-device-port node causes a device refcount imbalance, as device_find_child() internally calls get_device() to increment the device’s reference count before returning its pointer. vdc_port_mpgroup_check() directly returns true upon finding a matching device without releasing the reference via put_device(). Call put_device() to decrement the refcount. As the comment on device_find_child() says, 'NOTE: you will need to drop the reference with put_device() after use'. Found by code review. Cc: stable@vger.kernel.org Fixes: 3ee70591d6c4 ("sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain") Signed-off-by: Ma Ke <make24@iscas.ac.cn> Link: https://lore.kernel.org/r/20250719075856.3447953-1-make24@iscas.ac.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
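The get/put balance can be illustrated with a tiny userspace refcount model. The names (`model_find()`, `model_exists()`, `struct model_dev`) are hypothetical; the sketch only mirrors the contract quoted above, where a find helper returns its match with an extra reference that the caller must drop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a refcounted device list. */
struct model_dev { const char *name; int refcount; };

void model_get(struct model_dev *d) { d->refcount++; }
void model_put(struct model_dev *d) { d->refcount--; }

/* Find helper that returns its match with an extra reference held. */
struct model_dev *model_find(struct model_dev *devs, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(devs[i].name, name) == 0) {
			model_get(&devs[i]);
			return &devs[i];
		}
	}
	return NULL;
}

/* Existence check: drop the reference before reporting the result. */
bool model_exists(struct model_dev *devs, size_t n, const char *name)
{
	struct model_dev *d = model_find(devs, n, name);

	if (!d)
		return false;
	model_put(d);	/* balance the reference taken by model_find() */
	return true;
}
```

A pure existence check never needs to keep the object, so the reference is released on the same path that found it.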
2025-07-18 | Merge tag 'block-6.16-20250718' of git://git.kernel.dk/linux | Linus Torvalds | 1 | -3/+2
Pull block fixes from Jens Axboe:
 - NVMe changes via Christoph:
    - revert the cross-controller atomic write size validation that caused regressions (Christoph Hellwig)
    - fix endianness of command word printout in nvme_log_err_passthru() (John Garry)
    - fix callback lock for TLS handshake (Maurizio Lombardi)
    - fix misaccounting of nvme-mpath inflight I/O (Yu Kuai)
    - fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() (Zheng Qixing)
 - Fix for a kobject leak in queue unregistration
 - Fix for loop async file write start/end handling
* tag 'block-6.16-20250718' of git://git.kernel.dk/linux:
 loop: use kiocb helpers to fix lockdep warning
 nvmet-tcp: fix callback lock for TLS handshake
 nvme: fix misaccounting of nvme-mpath inflight I/O
 nvme: revert the cross-controller atomic write size validation
 nvme: fix endianness of command word prints in nvme_log_err_passthru()
 nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list()
 block: fix kobject leak in blk_unregister_queue
2025-07-16 | loop: use kiocb helpers to fix lockdep warning | Ming Lei | 1 | -3/+2
The lockdep tool can report a circular lock dependency warning in the loop driver's AIO read/write path:
```
[ 6540.587728] kworker/u96:5/72779 is trying to acquire lock:
[ 6540.593856] ff110001b5968440 (sb_writers#9){.+.+}-{0:0}, at: loop_process_work+0x11a/0xf70 [loop]
[ 6540.603786]
[ 6540.603786] but task is already holding lock:
[ 6540.610291] ff110001b5968440 (sb_writers#9){.+.+}-{0:0}, at: loop_process_work+0x11a/0xf70 [loop]
[ 6540.620210]
[ 6540.620210] other info that might help us debug this:
[ 6540.627499] Possible unsafe locking scenario:
[ 6540.627499]
[ 6540.634110] CPU0
[ 6540.636841] ----
[ 6540.639574] lock(sb_writers#9);
[ 6540.643281] lock(sb_writers#9);
[ 6540.646988]
[ 6540.646988] *** DEADLOCK ***
```
This patch fixes the issue by using the AIO-specific helpers `kiocb_start_write()` and `kiocb_end_write()`. These functions are designed to be used with a `kiocb` and manage write sequencing correctly for asynchronous I/O without introducing the problematic lock dependency. The `kiocb` is already part of the `loop_cmd` struct, so this change also simplifies the completion function `lo_rw_aio_do_completion()` by using the `iocb` from the `cmd` struct directly, instead of retrieving the loop device from the request queue. Fixes: 39d86db34e41 ("loop: add file_start_write() and file_end_write()") Cc: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250716114808.3159657-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: remove unused req argument from ublk_sub_req_ref() | Caleb Sander Mateos | 1 | -5/+4
Since commit b749965edda8 ("ublk: remove ublk_commit_and_fetch()"), ublk_sub_req_ref() no longer uses its struct request *req argument. So drop the argument from ublk_sub_req_ref(), and from ublk_need_complete_req(), which only passes it to ublk_sub_req_ref(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250715154244.1626810-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: pass 'const struct ublk_io *' to ublk_[un]map_io() | Ming Lei | 1 | -2/+2
Pass 'const struct ublk_io *' to ublk_[un]map_io() since just io->addr and io->res are read in the two helpers. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-11-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: remove ublk_commit_and_fetch() | Ming Lei | 1 | -18/+18
Remove ublk_commit_and_fetch() and open code request completion. Consolidate accesses to struct ublk_io in UBLK_IO_COMMIT_AND_FETCH_REQ. When the ublk_io daemon task restriction is relaxed in the future, ublk_io will need to be protected by a lock. Unregister the auto-registered buffer and complete the request last, as these don't need to happen under the lock. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-10-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: add helper ublk_check_fetch_buf() | Ming Lei | 1 | -13/+19
Add a helper ublk_check_fetch_buf() to validate UBLK_IO_FETCH_REQ's addr. This doesn't require access to the ublk_io, so it can be done before taking the ublk_device mutex. This also fixes a missing -EINVAL return value when ublk_fetch() fails early. Fixes: b69b8edfb27d ("ublk: properly serialize all FETCH_REQs") Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-9-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: store auto buffer register data into `struct ublk_io` | Ming Lei | 1 | -18/+12
The space of `io->addr` can be shared between the auto buffer register data and the user space buffer address, so store the auto buffer register data in `struct ublk_io`. This prepares for supporting batch IO, in which many ublk IOs share a single uring_cmd, so the auto buffer register data can no longer be stored in the uring_cmd pdu. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-8-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
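Sharing one slot between two mutually exclusive pieces of per-I/O data is typically done with a union. The sketch below is illustrative only: `struct model_io`, `struct model_auto_buf_reg`, and their field names and widths are assumptions, not the driver's actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout; field names and widths are illustrative only. */
struct model_auto_buf_reg {
	uint16_t index;
	uint8_t  flags;
	uint8_t  reserved0;
	uint32_t reserved1;
};

struct model_io {
	union {
		uint64_t addr;			/* user-space buffer address */
		struct model_auto_buf_reg buf;	/* auto buffer register data */
	};
	unsigned int flags;
};

/* The two members overlap, so the union costs no more than the address alone. */
static_assert(sizeof(struct model_auto_buf_reg) == sizeof(uint64_t),
	      "register data must fit in the address slot");
```

The compile-time size check documents the assumption that only one of the two interpretations is live for a given I/O and that neither grows the structure.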
2025-07-15 | ublk: move auto buffer register handling into one dedicated helper | Ming Lei | 1 | -56/+71
Move the check and clearing of UBLK_IO_FLAG_AUTO_BUF_REG into ublk_handle_auto_buf_reg(), and return the buffer index from this helper. Also move ublk_set_auto_buf_reg() into this single helper. Add ublk_config_io_buf() for setting up the ublk io buffer; it covers both ublk buffer copy and auto buffer register. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: avoid to pass `struct ublksrv_io_cmd *` to ublk_commit_and_fetch() | Ming Lei | 1 | -15/+29
Refactor ublk_commit_and_fetch() as follows to remove its `struct ublksrv_io_cmd *` parameter:
 - return `struct request *` from ublk_fill_io_cmd(), so that the request reference can be used reliably, because the request and io_uring_cmd references share the same storage
 - move ublk_fill_io_cmd() before the call into ublk_commit_and_fetch(), so that ublk_fill_io_cmd() can run with the per-io lock held to support command batches
 - pass ->zone_append_lba to ublk_commit_and_fetch() directly
The main motivation is to reproduce ublk_commit_and_fetch() for fetching io command batch with multishot uring_cmd. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: let ublk_fill_io_cmd() cover more things | Ming Lei | 1 | -4/+2
Let ublk_fill_io_cmd() clear UBLK_IO_FLAG_OWNED_BY_SRV too. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: move fake timeout logic into __ublk_complete_rq() | Ming Lei | 1 | -4/+1
Almost every block driver handles fake timeouts around its real request completion code. The existing approach may also leak a request reference count, so move the logic into __ublk_complete_rq(); the completion can then be skipped in the last step, as in other drivers. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: look up ublk task via its pid in timeout handler | Ming Lei | 1 | -8/+17
Look up the ublk process via its pid in the timeout handler to avoid touching io->task, because touching the task structure is fragile. Killing the ublk server process is fine, and this approach is simpler. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-15 | ublk: validate ublk server pid | Ming Lei | 1 | -0/+9
The ublk server pid (the `tgid` of the process opening the ublk device) is stored in `ublk_device->ublksrv_tgid`. This `tgid` is then checked against the `ublksrv_pid` in `ublk_ctrl_start_dev` and `ublk_ctrl_end_recovery`. This ensures that the correct ublk server pid is stored in device info. Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250713143415.2857561-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-13 | block: floppy: Fix uninitialized use of outparam | Purva Yeshi | 1 | -1/+1
Fix Smatch-detected error: drivers/block/floppy.c:3569 fd_locked_ioctl() error: uninitialized symbol 'outparam'. Smatch may incorrectly warn about uninitialized use of 'outparam' in fd_locked_ioctl(), even though all _IOC_READ commands guarantee its initialization. Initialize outparam to NULL to make this explicit and suppress the false positive. Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com> Reviewed-by: Denis Efremov <efremov@linux.com> Link: https://lore.kernel.org/r/20250713070020.14530-1-purvayeshi550@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-12 | loop: Avoid updating block size under exclusive owner | Jan Kara | 1 | -8/+30
Syzbot came up with a reproducer where a loop device block size is changed underneath a mounted filesystem. This causes a mismatch between the block device block size and the block size stored in the superblock, causing confusion in various places such as fs/buffer.c. The particular issue triggered by syzbot was a warning in __getblk_slow() due to the requested buffer size not matching the block device block size. Fix the problem by getting an exclusive hold of the loop device to change its block size. This fails if somebody (such as a filesystem) already has exclusive ownership of the block device, and thus prevents modifying the loop device under an exclusive owner which doesn't expect it. Reported-by: syzbot+01ef7a8da81a975e1ccd@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+01ef7a8da81a975e1ccd@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20250711163202.19623-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
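The "claim exclusively, then modify" pattern can be sketched in a few lines. This is a userspace model with hypothetical names (`struct model_bdev`, `model_claim()`, `model_set_block_size()`); the `-EBUSY` return value is an assumption of the sketch and is not taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a block device with at most one exclusive holder. */
struct model_bdev {
	const void *holder;	/* NULL when nobody holds the device exclusively */
	unsigned int block_size;
};

/* Try to become the exclusive holder; fail if someone else already is. */
int model_claim(struct model_bdev *b, const void *who)
{
	if (b->holder && b->holder != who)
		return -EBUSY;
	b->holder = who;
	return 0;
}

void model_release(struct model_bdev *b) { b->holder = NULL; }

/* Change the block size only while holding the device exclusively. */
int model_set_block_size(struct model_bdev *b, unsigned int size)
{
	static const char self;	/* unique address used as our holder token */
	int err = model_claim(b, &self);

	if (err)
		return err;	/* e.g. a mounted filesystem owns the device */
	b->block_size = size;
	model_release(b);
	return 0;
}
```

When another holder exists, the request fails and the block size stays untouched, which is the behaviour the commit message describes for a mounted filesystem.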
2025-07-11 | Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linux | Linus Torvalds | 1 | -3/+3
Pull block fixes from Jens Axboe:
 - MD changes via Yu:
    - fix UAF due to stack memory used for bio mempool (Jinchao)
    - fix raid10/raid1 nowait IO error path (Nigel and Qixing)
    - fix kernel crash from reading bitmap sysfs entry (Håkon)
 - Fix for a UAF in the nbd connect error path
 - Fix for blocksize being bigger than pagesize, if THP isn't enabled
* tag 'block-6.16-20250710' of git://git.kernel.dk/linux:
 block: reject bs > ps block devices when THP is disabled
 nbd: fix uaf in nbd_genl_connect() error path
 md/md-bitmap: fix GPF in bitmap_get_stats()
 md/raid1,raid10: strip REQ_NOWAIT from member bios
 raid10: cleanup memleak at raid10_make_request
 md/raid1: Fix stack memory use after return in raid1_reshape
2025-07-10 | null_blk: use memzero_page() | Matthew Wilcox (Oracle) | 1 | -1/+1
memzero_page() is the new name for zero_user(). Link: https://lkml.kernel.org/r/20250612143443.2848197-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10 | nbd: fix lockdep deadlock warning | Ming Lei | 1 | -1/+11
nbd grabs the device lock nbd->config_lock when updating nr_hw_queues, which causes the following lock dependency:

-> #2 (&disk->open_mutex){+.+.}-{4:4}:
 lock_acquire kernel/locking/lockdep.c:5871 [inline]
 lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828
 __mutex_lock_common kernel/locking/mutex.c:602 [inline]
 __mutex_lock+0x166/0x1292 kernel/locking/mutex.c:747
 mutex_lock_nested+0x14/0x1c kernel/locking/mutex.c:799
 __del_gendisk+0x132/0xac6 block/genhd.c:706
 del_gendisk+0xf6/0x19a block/genhd.c:819
 nbd_dev_remove+0x3c/0xf2 drivers/block/nbd.c:268
 nbd_dev_remove_work+0x1c/0x26 drivers/block/nbd.c:284
 process_one_work+0x96a/0x1f32 kernel/workqueue.c:3238
 process_scheduled_works kernel/workqueue.c:3321 [inline]
 worker_thread+0x5ce/0xde8 kernel/workqueue.c:3402
 kthread+0x39c/0x7d4 kernel/kthread.c:464
 ret_from_fork_kernel+0x2a/0xbb2 arch/riscv/kernel/process.c:214
 ret_from_fork_kernel_asm+0x16/0x18 arch/riscv/kernel/entry.S:327

-> #1 (&set->update_nr_hwq_lock){++++}-{4:4}:
 lock_acquire kernel/locking/lockdep.c:5871 [inline]
 lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828
 down_write+0x9c/0x19a kernel/locking/rwsem.c:1577
 blk_mq_update_nr_hw_queues+0x3e/0xb86 block/blk-mq.c:5041
 nbd_start_device+0x140/0xb2c drivers/block/nbd.c:1476
 nbd_genl_connect+0xae0/0x1b24 drivers/block/nbd.c:2201
 genl_family_rcv_msg_doit+0x206/0x2e6 net/netlink/genetlink.c:1115
 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
 genl_rcv_msg+0x514/0x78e net/netlink/genetlink.c:1210
 netlink_rcv_skb+0x206/0x3be net/netlink/af_netlink.c:2534
 genl_rcv+0x36/0x4c net/netlink/genetlink.c:1219
 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline]
 netlink_unicast+0x4f0/0x82c net/netlink/af_netlink.c:1339
 netlink_sendmsg+0x85e/0xdd6 net/netlink/af_netlink.c:1883
 sock_sendmsg_nosec net/socket.c:712 [inline]
 __sock_sendmsg+0xcc/0x160 net/socket.c:727
 ____sys_sendmsg+0x63e/0x79c net/socket.c:2566
 ___sys_sendmsg+0x144/0x1e6 net/socket.c:2620
 __sys_sendmsg+0x188/0x246 net/socket.c:2652
 __do_sys_sendmsg net/socket.c:2657 [inline]
 __se_sys_sendmsg net/socket.c:2655 [inline]
 __riscv_sys_sendmsg+0x70/0xa2 net/socket.c:2655
 syscall_handler+0x94/0x118 arch/riscv/include/asm/syscall.h:112
 do_trap_ecall_u+0x396/0x530 arch/riscv/kernel/traps.c:341
 handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197

-> #0 (&nbd->config_lock){+.+.}-{4:4}:
 check_noncircular+0x132/0x146 kernel/locking/lockdep.c:2178
 check_prev_add kernel/locking/lockdep.c:3168 [inline]
 check_prevs_add kernel/locking/lockdep.c:3287 [inline]
 validate_chain kernel/locking/lockdep.c:3911 [inline]
 __lock_acquire+0x12b2/0x24ea kernel/locking/lockdep.c:5240
 lock_acquire kernel/locking/lockdep.c:5871 [inline]
 lock_acquire+0x1ac/0x448 kernel/locking/lockdep.c:5828
 __mutex_lock_common kernel/locking/mutex.c:602 [inline]
 __mutex_lock+0x166/0x1292 kernel/locking/mutex.c:747
 mutex_lock_nested+0x14/0x1c kernel/locking/mutex.c:799
 refcount_dec_and_mutex_lock+0x60/0xd8 lib/refcount.c:118
 nbd_config_put+0x3a/0x610 drivers/block/nbd.c:1423
 nbd_release+0x94/0x15c drivers/block/nbd.c:1735
 blkdev_put_whole+0xac/0xee block/bdev.c:721
 bdev_release+0x3fe/0x600 block/bdev.c:1144
 blkdev_release+0x1a/0x26 block/fops.c:684
 __fput+0x382/0xa8c fs/file_table.c:465
 ____fput+0x1c/0x26 fs/file_table.c:493
 task_work_run+0x16a/0x25e kernel/task_work.c:227
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 exit_to_user_mode_loop+0x118/0x134 kernel/entry/common.c:114
 exit_to_user_mode_prepare include/linux/entry-common.h:330 [inline]
 syscall_exit_to_user_mode_work include/linux/entry-common.h:414 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:449 [inline]
 do_trap_ecall_u+0x3f0/0x530 arch/riscv/kernel/traps.c:355
 handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197

It also isn't necessary to hold nbd->config_lock, because blk_mq_update_nr_hw_queues() grabs the tagset lock to synchronize everything.
Fix the issue by releasing ->config_lock and retrying in case of a concurrent update of nr_hw_queues. Fixes: 98e68f67020c ("block: prevent adding/deleting disk during updating nr_hw_queues") Reported-by: syzbot+2bcecf3c38cb3e8fdc8d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6855034f.a00a0220.137b3.0031.GAE@google.com Reviewed-by: Yu Kuai <yukuai3@huawei.com> Cc: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250709111744.2353050-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08drbd: add missing kref_get in handle_write_conflictsSarah Newman1-1/+5
With `two-primaries` enabled, DRBD tries to detect "concurrent" writes and handle write conflicts, so that even if you write to the same sector simultaneously on both nodes, they end up with the identical data once the writes are completed. In handling "superseded" writes, we forgot a kref_get, resulting in a premature drbd_destroy_device and use after free, and further to kernel crashes with symptoms. Relevance: No one should use DRBD as a random data generator, and apparently all users of "two-primaries" handle concurrent writes correctly one layer up. That is cluster file systems use some distributed lock manager, and live migration in virtualization environments stops writes on one node before starting writes on the other node. Which means that other than for "test cases", this code path is never taken in real life. FYI, in DRBD 9, things are handled differently nowadays. We still detect "write conflicts", but no longer try to be smart about them. We decided to disconnect hard instead: upper layers must not submit concurrent writes. If they do, that's their fault. Signed-off-by: Sarah Newman <srn@prgmr.com> Signed-off-by: Lars Ellenberg <lars@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20250627095728.800688-1-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08block: mtip32xx: Fix usage of dma_map_sg()Thomas Fourier1-10/+17
The dma_map_sg() can fail and, in case of failure, returns 0. If it fails, mtip_hw_submit_io() returns an error. The dma_unmap_sg() requires the nents parameter to be the same as the one passed to dma_map_sg(). This patch saves the nents in command->scatter_ents. Fixes: 88523a61558a ("block: Add driver for Micron RealSSD pcie flash cards") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250627121123.203731-2-fourier.thomas@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
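The pattern described in this message can be sketched in userspace C. This is only an illustration under stated assumptions: `fake_dma_map_sg()`, `fake_dma_unmap_sg()`, `struct fake_cmd`, `submit_io()` and `complete_io()` are hypothetical stand-ins rather than kernel or driver code, the halving in the fake map call merely models the fact that a mapping may cover fewer entries than were passed in, and the `-EIO` error code is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for dma_map_sg(): returns the number of mapped
 * entries, which may be smaller than nents, or 0 on failure. */
static int fake_dma_map_sg(int nents, int fail)
{
    return fail ? 0 : (nents + 1) / 2;
}

/* Records the nents value handed to the fake unmap call. */
static int last_unmap_nents = -1;

/* Hypothetical stand-in for dma_unmap_sg(). */
static void fake_dma_unmap_sg(int nents)
{
    last_unmap_nents = nents;
}

struct fake_cmd {
    int scatter_ents; /* nents originally passed to the map call */
};

/* Map the scatterlist; fail the submission if nothing was mapped. */
static int submit_io(struct fake_cmd *cmd, int nents, int fail)
{
    int mapped = fake_dma_map_sg(nents, fail);

    if (mapped == 0)
        return -EIO; /* illustrative error code, an assumption */

    /* Save the value passed to map, not the returned mapped count. */
    cmd->scatter_ents = nents;
    return 0;
}

/* Unmap with the same nents that was passed to the map call. */
static void complete_io(struct fake_cmd *cmd)
{
    fake_dma_unmap_sg(cmd->scatter_ents);
}
```

With these stand-ins, a successful `submit_io()` followed by `complete_io()` unmaps with the original entry count, and a failed mapping surfaces as an error instead of proceeding with zero mapped entries.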
2025-07-07nbd: fix uaf in nbd_genl_connect() error pathZheng Qixing1-3/+3
There is a use-after-free issue in nbd: block nbd6: Receive control failed (result -104) block nbd6: shutting down sockets ================================================================== BUG: KASAN: slab-use-after-free in recv_work+0x694/0xa80 drivers/block/nbd.c:1022 Write of size 4 at addr ffff8880295de478 by task kworker/u33:0/67 CPU: 2 UID: 0 PID: 67 Comm: kworker/u33:0 Not tainted 6.15.0-rc5-syzkaller-00123-g2c89c1b655c0 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Workqueue: nbd6-recv recv_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xc3/0x670 mm/kasan/report.c:521 kasan_report+0xe0/0x110 mm/kasan/report.c:634 check_region_inline mm/kasan/generic.c:183 [inline] kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189 instrument_atomic_read_write include/linux/instrumented.h:96 [inline] atomic_dec include/linux/atomic/atomic-instrumented.h:592 [inline] recv_work+0x694/0xa80 drivers/block/nbd.c:1022 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400 kthread+0x3c2/0x780 kernel/kthread.c:464 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> nbd_genl_connect() does not properly stop the device on certain error paths after nbd_start_device() has been called. This causes the error path to put nbd->config while recv_work continues to use the config after putting it, leading to use-after-free in recv_work. This patch moves nbd_start_device() after the backend file creation. 
Reported-by: syzbot+48240bab47e705c53126@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68227a04.050a0220.f2294.00b5.GAE@google.com/T/ Fixes: 6497ef8df568 ("nbd: provide a way for userspace processes to identify device backends") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250612132405.364904-1-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-07block: remove pktcdvd driverJens Axboe3-2960/+0
This driver has long outlived its utility, and it's broken and unloved. The main use case for this was direct mount with UDF of cd-rw drives that required 32kb packets. It would collect writes into that size and write them out in multiples of that. That's not a common use case anymore, the world has moved on from those kinds of media. To make matters worse, it's actively breaking setups where it's not even required or useful. Link: https://lore.kernel.org/linux-block/fxg6dksau4jsk3u5xldlyo2m7qgiux6vtdrz5rywseotsouqdv@urcrwz6qtd3r/ Link: https://lore.kernel.org/linux-block/dcc4836e-6da9-4208-ad27-bbd44b3a2063@kernel.dk/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-04Merge tag 'block-6.16-20250704' of git://git.kernel.dk/linuxLinus Torvalds2-7/+10
Pull block fixes from Jens Axboe: - NVMe fixes via Christoph: - fix incorrect cdw15 value in passthru error logging (Alok Tiwari) - fix memory leak of bio integrity in nvmet (Dmitry Bogdanov) - refresh visible attrs after being checked (Eugen Hristev) - fix suspicious RCU usage warning in the multipath code (Geliang Tang) - correctly account for namespace head reference counter (Nilay Shroff) - Fix for a regression introduced in ublk in this cycle, where it would attempt to queue a canceled request. - brd RCU sleeping fix, also introduced in this cycle. Bare bones fix, should be improved upon for the next release. * tag 'block-6.16-20250704' of git://git.kernel.dk/linux: brd: fix sleeping function called from invalid context in brd_insert_page() ublk: don't queue request if the associated uring_cmd is canceled nvme-multipath: fix suspicious RCU usage warning nvme-pci: refresh visible attrs after being checked nvmet: fix memory leak of bio integrity nvme: correctly account for namespace head reference counter nvme: Fix incorrect cdw15 value in passthru error logging
2025-07-04ublk: introduce and use ublk_set_canceling helperUday Shankar1-20/+34
For performance reasons (minimizing the number of cache lines accessed in the hot path), we store the "canceling" state redundantly - there is one flag in the device, which can be considered the source of truth, and per-queue copies of that flag. This redundancy can cause confusion, and opens the door to bugs where the state is set inconsistently. Try to guard against these bugs by introducing a ublk_set_canceling helper which is the sole mutator of both the per-device and per-queue canceling state. This helper always sets the state consistently. Use the helper in all places where we need to modify the canceling state. No functional changes are expected. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250703-ublk_too_many_quiesce-v2-2-3527b5339eeb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
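The "sole mutator" idea from this message can be illustrated with a small userspace sketch. Everything here is hypothetical: `struct sketch_dev`, `struct sketch_queue`, `sketch_set_canceling()` and `sketch_state_consistent()` are illustration-only names, the queue count is an arbitrary assumption, and the real helper's locking requirements are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_QUEUES 4 /* arbitrary queue count for the sketch */

struct sketch_queue {
    bool canceling; /* per-queue copy, read in the hot path */
};

struct sketch_dev {
    bool canceling; /* per-device flag, the source of truth */
    struct sketch_queue queues[SKETCH_NR_QUEUES];
};

/* The only function allowed to write either copy of the flag, so the
 * per-device and per-queue values cannot drift apart. */
static void sketch_set_canceling(struct sketch_dev *dev, bool canceling)
{
    dev->canceling = canceling;
    for (int i = 0; i < SKETCH_NR_QUEUES; i++)
        dev->queues[i].canceling = canceling;
}

/* Check the invariant: every per-queue copy matches the device flag. */
static bool sketch_state_consistent(const struct sketch_dev *dev)
{
    for (int i = 0; i < SKETCH_NR_QUEUES; i++)
        if (dev->queues[i].canceling != dev->canceling)
            return false;
    return true;
}
```

Routing every update through one function keeps the redundant copies for cache-friendliness while making an inconsistent state impossible to write by accident.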
2025-07-04ublk: speed up ublk server exit handlingUday Shankar1-15/+21
Recently, we've observed a few cases where a ublk server is able to complete restart more quickly than the driver can process the exit of the previous ublk server. The new ublk server comes up, attempts recovery of the preexisting ublk devices, and observes them still in state UBLK_S_DEV_LIVE. While this is possible due to the asynchronous nature of io_uring cleanup and should therefore be handled properly in the ublk server, it is still preferable to make ublk server exit handling faster if possible, as we should strive for it to not be a limiting factor in how fast a ublk server can restart and provide service again. Analysis of the issue showed that the vast majority of the time spent in handling the ublk server exit was in calls to blk_mq_quiesce_queue, which is essentially just a (relatively expensive) call to synchronize_rcu. The ublk server exit path currently issues an unnecessarily large number of calls to blk_mq_quiesce_queue, for two reasons: 1. It tries to call blk_mq_quiesce_queue once per ublk_queue. However, blk_mq_quiesce_queue targets the request_queue of the underlying ublk device, of which there is only one. So the number of calls is larger than necessary by a factor of nr_hw_queues. 2. In practice, it calls blk_mq_quiesce_queue _more_ than once per ublk_queue. This is because of a data race where we read ubq->canceling without any locking when deciding if we should call ublk_start_cancel. It is thus possible for two calls to ublk_uring_cmd_cancel_fn against the same ublk_queue to both call ublk_start_cancel against the same ublk_queue. Fix this by making the "canceling" flag a per-device state. This actually matches the existing code better, as there are several places where the flag is set or cleared for all queues simultaneously, and there is the general expectation that cancellation corresponds with ublk server exit. 
This per-device canceling flag is then checked under a (new) lock (addressing the data race (2) above), and the queue is only quiesced if it is cleared (addressing (1) above). The result is just one call to blk_mq_quiesce_queue per ublk device. To minimize the number of cache lines that are accessed in the hot path, the per-queue canceling flag is kept. The values of the per-device canceling flag and all per-queue canceling flags should always match. In our setup, where one ublk server handles I/O for 128 ublk devices, each having 24 hardware queues of depth 4096, here are the results before and after this patch, where teardown time is measured from the first call to io_ring_ctx_wait_and_kill to the return from the last ublk_ch_release: before after number of calls to blk_mq_quiesce_queue: 6469 256 teardown time: 11.14s 2.44s There are still some potential optimizations here, but this takes care of a big chunk of the ublk server exit handling delay. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250703-ublk_too_many_quiesce-v2-1-3527b5339eeb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
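A rough userspace model of the "check the per-device flag under a lock, quiesce only if it is clear" logic follows. It is a sketch under assumptions, not the driver code: `struct sketch_dev`, `sketch_dev_init()`, `sketch_quiesce()` and `sketch_start_cancel()` are hypothetical names, a pthread mutex stands in for the new kernel lock, and the expensive quiesce is reduced to a counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct sketch_dev {
    pthread_mutex_t cancel_lock; /* stands in for the new lock */
    bool canceling;              /* per-device canceling state */
    int quiesce_calls;           /* counts the expensive operation */
};

static void sketch_dev_init(struct sketch_dev *dev)
{
    pthread_mutex_init(&dev->cancel_lock, NULL);
    dev->canceling = false;
    dev->quiesce_calls = 0;
}

/* Stand-in for the (relatively expensive) quiesce of the request queue. */
static void sketch_quiesce(struct sketch_dev *dev)
{
    dev->quiesce_calls++;
}

/* May be called once per queue, or several times per queue; only the
 * first caller that finds the flag clear pays for the quiesce. */
static void sketch_start_cancel(struct sketch_dev *dev)
{
    pthread_mutex_lock(&dev->cancel_lock);
    if (!dev->canceling) {
        sketch_quiesce(dev);
        dev->canceling = true;
    }
    pthread_mutex_unlock(&dev->cancel_lock);
}
```

Because the flag is tested and set inside the same critical section, repeated or concurrent cancel calls against one device collapse into a single quiesce in this model, regardless of the number of queues.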
2025-07-04zram: pass buffer offset to zcomp_available_show()Sergey Senozhatsky3-13/+13
In most cases zcomp_available_show() is the only emitting function that is called from sysfs read() handler, so it assumes that there is a whole PAGE_SIZE buffer to work with. There is an exception, however: recomp_algorithm_show(). In recomp_algorithm_show() we prepend the buffer with priority number before we pass it to zcomp_available_show(), so it cannot assume PAGE_SIZE anymore and must take recomp_algorithm_show() modifications into consideration. Therefore we need to pass buffer offset to zcomp_available_show(). Also convert it to use sysfs_emit_at(), to stay aligned with the rest of zram's sysfs read() handlers. In practice we are never even close to using the whole PAGE_SIZE buffer, so that's not a critical bug, but still. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/20250627071840.1394242-1-senozhatsky@chromium.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
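Why the offset has to be threaded through can be seen in a small userspace model. The names below (`sketch_emit_at()`, `sketch_available_show()`, `sketch_recomp_show()`, `SKETCH_PAGE_SIZE`) are hypothetical, the algorithm list and output format are invented for illustration, and `sketch_emit_at()` only approximates the bounded "write at offset" behaviour of a helper like sysfs_emit_at().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed buffer size for the sketch */

/* Format into buf starting at offset `at`, never writing past the
 * page-sized buffer. Returns the number of characters written. */
static int sketch_emit_at(char *buf, int at, const char *fmt, ...)
{
    va_list args;
    int room, len;

    if (at < 0 || at >= SKETCH_PAGE_SIZE)
        return 0;
    room = SKETCH_PAGE_SIZE - at;

    va_start(args, fmt);
    len = vsnprintf(buf + at, room, fmt, args);
    va_end(args);

    if (len < 0)
        return 0;
    return len >= room ? room - 1 : len; /* account for truncation */
}

/* List the algorithms, marking the selected one, starting at `at`.
 * Returns the new end offset, so callers may have prepended data. */
static int sketch_available_show(const char *selected, char *buf, int at)
{
    static const char *const algs[] = { "lzo", "zstd" };

    for (size_t i = 0; i < sizeof(algs) / sizeof(algs[0]); i++) {
        if (strcmp(algs[i], selected) == 0)
            at += sketch_emit_at(buf, at, "[%s] ", algs[i]);
        else
            at += sketch_emit_at(buf, at, "%s ", algs[i]);
    }
    at += sketch_emit_at(buf, at, "\n");
    return at;
}

/* Prepend a priority prefix, then hand the offset to the shared helper. */
static int sketch_recomp_show(char *buf)
{
    int at = sketch_emit_at(buf, 0, "#%d: ", 1);

    return sketch_available_show("zstd", buf, at);
}
```

The shared helper no longer assumes it owns the whole buffer: the remaining room is always computed from the offset it was given, so a prefix written by the caller is accounted for.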
2025-07-04block: zram: replace scnprintf() with sysfs_emit() in *_show() functionsRahul Kumar1-11/+11
Replace scnprintf() with sysfs_emit() or sysfs_emit_at() in sysfs *_show() functions in zram_drv.c to follow the kernel's guidelines from Documentation/filesystems/sysfs.rst. This improves consistency, safety, and makes the code easier to maintain and update in the future. Signed-off-by: Rahul Kumar <rk0006818@gmail.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Link: https://lore.kernel.org/r/20250627035256.1120740-1-rk0006818@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01virtio: blk/scsi: use block layer helpers to calculate num of queuesDaniel Wagner1-3/+2
The calculation of the upper limit for queues does not depend solely on the number of possible CPUs; for example, the isolcpus kernel command-line option must also be considered. To account for this, the block layer provides a helper function to retrieve the maximum number of queues. Use it to set an appropriate upper queue number limit. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-5-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01brd: fix sleeping function called from invalid context in brd_insert_page()Yu Kuai1-2/+4
__xa_cmpxchg() is called with rcu_read_lock() held, and it will allocate memory if necessary. Fix the problem by moving rcu_read_lock() after __xa_cmpxchg(); it must still be taken before xa_unlock() to prevent the returned page from being freed by a concurrent discard. Fixes: bbcacab2e8ee ("brd: avoid extra xarray lookups on first write") Reported-by: syzbot+ea4c8fd177a47338881a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/685ec4c9.a00a0220.129264.000c.GAE@google.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250630112828.421219-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: don't queue request if the associated uring_cmd is canceledMing Lei1-5/+6
Commit 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task") needs to dereference `io->cmd` to check whether the IO can be added to the current batch; see ublk_belong_to_same_batch() and io_uring_cmd_ctx_handle(). However, `io->cmd` may become invalid after the uring_cmd is canceled. Fix it by only allowing this IO to be queued when ublk_prep_req() returns `BLK_STS_OK`, in which case `io->cmd` is guaranteed to be valid. Reported-by: Changhui Zhong <czhong@redhat.com> Fixes: 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250701072325.1458109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: cache-align struct ublk_ioCaleb Sander Mateos1-1/+1
struct ublk_io is already 56 bytes on 64-bit architectures, so round it up to a full cache line (typically 64 bytes). This ensures a single ublk_io doesn't span multiple cache lines and prevents false sharing if consecutive ublk_io's are accessed by different daemon tasks. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-15-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
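The size arithmetic behind this change can be checked with a short C11 sketch. The structs here are hypothetical 56-byte stand-ins rather than the real struct ublk_io, the 64-byte line size is an assumption, and standard `alignas` is used in place of whatever annotation the kernel applies.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE 64 /* assumed cache line size */

/* 56-byte stand-in without any alignment annotation. */
struct sketch_io_plain {
    unsigned char payload[56];
};

/* Same payload, but the struct is aligned (and therefore padded)
 * to a full cache line. */
struct sketch_io_aligned {
    alignas(SKETCH_CACHE_LINE) unsigned char payload[56];
};

static_assert(sizeof(struct sketch_io_aligned) == SKETCH_CACHE_LINE,
              "aligned stand-in should occupy exactly one cache line");

/* For an array that starts on a cache line boundary, report whether
 * element `index` of size `elem_size` straddles two cache lines. */
static int sketch_spans_two_lines(size_t elem_size, size_t index)
{
    size_t first = index * elem_size;
    size_t last = first + elem_size - 1;

    return first / SKETCH_CACHE_LINE != last / SKETCH_CACHE_LINE;
}
```

With 56-byte elements, element 1 occupies bytes 56 through 111 and so touches two lines; with 64-byte elements every element sits in exactly one line, which is the property the commit relies on to avoid false sharing between tasks working on neighbouring entries.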
2025-07-01ublk: remove ubq checks from ublk_{get,put}_req_ref()Caleb Sander Mateos1-24/+11
ublk_get_req_ref() and ublk_put_req_ref() currently call ublk_need_req_ref(ubq) to check whether the ublk device features require reference counting of its requests. However, all callers already know that reference counting is required: - __ublk_check_and_get_req() is only called from ublk_check_and_get_req() if user copy is enabled, and from ublk_register_io_buf() if zero copy is enabled - ublk_io_release() is only called for requests registered by ublk_register_io_buf(), which requires zero copy - ublk_ch_read_iter() and ublk_ch_write_iter() only call ublk_put_req_ref() if ublk_check_and_get_req() succeeded, which requires user copy to be enabled So drop the ublk_need_req_ref() check and the ubq argument in ublk_get_req_ref() and ublk_put_req_ref(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-14-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon taskCaleb Sander Mateos1-1/+8
ublk_io_release() performs an expensive atomic refcount decrement. This atomic operation is unnecessary in the common case where the request's buffer is registered and unregistered on the daemon task before handling UBLK_IO_COMMIT_AND_FETCH_REQ for the I/O. So if ublk_io_release() is called on the daemon task and task_registered_buffers is positive, just decrement task_registered_buffers (nonatomically). ublk_sub_req_ref() will apply this decrement when it atomically subtracts from io->ref. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-13-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon taskCaleb Sander Mateos1-9/+61
ublk_register_io_buf() performs an expensive atomic refcount increment, as well as a lot of pointer chasing to look up the struct request. Create a separate ublk_daemon_register_io_buf() for the daemon task to call. Initialize ublk_io's reference count to a large number, introduce a field task_registered_buffers to count the buffers registered on the daemon task, and atomically subtract the large number minus task_registered_buffers in ublk_commit_and_fetch(). Also obtain the struct request directly from ublk_io's req field instead of looking it up on the tagset. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-12-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
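The biased reference count described above can be modelled in userspace with C11 atomics. This is a sketch under assumptions, not the driver's implementation: `struct sketch_io`, `SKETCH_REF_BIAS` and all `sketch_*` functions are hypothetical, the bias value is arbitrary, and request lookup and completion are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_REF_BIAS (1 << 30) /* the "large number"; value is arbitrary */

struct sketch_io {
    atomic_int ref;              /* shared count, starts at the bias */
    int task_registered_buffers; /* touched only by the daemon task */
};

/* Start of an I/O: arm the count with the bias. */
static void sketch_dispatch(struct sketch_io *io)
{
    atomic_store(&io->ref, SKETCH_REF_BIAS);
    io->task_registered_buffers = 0;
}

/* Daemon-task fast path: a plain increment, no atomic operation. */
static void sketch_daemon_register(struct sketch_io *io)
{
    io->task_registered_buffers++;
}

/* Slow path for any other task: an atomic increment. */
static void sketch_remote_register(struct sketch_io *io)
{
    atomic_fetch_add(&io->ref, 1);
}

/* Drop one buffer reference; true when the count reaches zero. */
static bool sketch_release(struct sketch_io *io)
{
    return atomic_fetch_sub(&io->ref, 1) == 1;
}

/* Commit: subtract the bias minus the locally counted registrations in
 * one atomic step, leaving exactly the outstanding references. */
static bool sketch_commit(struct sketch_io *io)
{
    int sub = SKETCH_REF_BIAS - io->task_registered_buffers;

    io->task_registered_buffers = 0;
    return atomic_fetch_sub(&io->ref, sub) == sub;
}
```

For example, two daemon-side registrations followed by a commit leave the shared count at 2 in this model, so the I/O is only finished after both buffers are released; with no registrations at all, the commit itself takes the count to zero.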
2025-07-01ublk: return early if blk_should_fake_timeout()Caleb Sander Mateos1-2/+3
Make the unlikely case blk_should_fake_timeout() return early to reduce the indentation of the successful path. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-11-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: allow UBLK_IO_(UN)REGISTER_IO_BUF on any taskCaleb Sander Mateos1-5/+22
Currently, UBLK_IO_REGISTER_IO_BUF and UBLK_IO_UNREGISTER_IO_BUF are only permitted on the ublk_io's daemon task. But this restriction is unnecessary. ublk_register_io_buf() calls __ublk_check_and_get_req() to look up the request from the tagset and atomically take a reference on the request without accessing the ublk_io. ublk_unregister_io_buf() doesn't use the q_id or tag at all. So allow these opcodes even on tasks other than io->task. Handle UBLK_IO_UNREGISTER_IO_BUF before obtaining the ubq and io since the buffer index being unregistered is not necessarily related to the specified q_id and tag. Add a feature flag UBLK_F_BUF_REG_OFF_DAEMON that userspace can use to determine whether the kernel supports off-daemon buffer registration. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-10-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: don't take ublk_queue in ublk_unregister_io_buf()Caleb Sander Mateos1-3/+3
UBLK_IO_UNREGISTER_IO_BUF currently requires a valid q_id and tag to be passed in the ublksrv_io_cmd. However, only the addr (registered buffer index) is actually used to unregister the buffer. There is no check that the q_id and tag are for the ublk request whose buffer is registered at the given index. To prepare to allow userspace to omit the q_id and tag, check the UBLK_F_SUPPORT_ZERO_COPY flag on the ublk_device instead of the ublk_queue. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-9-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: consolidate UBLK_IO_FLAG_{ACTIVE,OWNED_BY_SRV} checksCaleb Sander Mateos1-4/+1
UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_OWNED_BY_SRV are mutually exclusive. So just check that UBLK_IO_FLAG_OWNED_BY_SRV is set in __ublk_ch_uring_cmd(); that implies UBLK_IO_FLAG_ACTIVE is unset. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-7-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: remove task variable from __ublk_ch_uring_cmd()Caleb Sander Mateos1-3/+1
The variable is computed from a simple expression and used once, so just replace it with the expression. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-6-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01ublk: handle UBLK_IO_FETCH_REQ earlierCaleb Sander Mateos1-9/+12
Check for UBLK_IO_FETCH_REQ early in __ublk_ch_uring_cmd() and skip the rest of the checks in this case. This allows removing the checks for NULL io->task and UBLK_IO_FLAG_OWNED_BY_SRV unset in io->flags, which are only allowed for FETCH. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250620151008.3976463-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>