summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-07-10xen/blkback: add missing MODULE_DESCRIPTION() macroJeff Johnson1-0/+1
make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/xen-blkback/xen-blkback.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240602-md-block-xen-blkback-v1-1-6ff5b58bdee1@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09block/rnbd: Constify struct kobj_typeChristophe JAILLET2-3/+3
'struct kobj_type' is not modified in this driver. It is only used with kobject_init_and_add() which takes a "const struct kobj_type *" parameter. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 4082 792 8 4882 1312 drivers/block/rnbd/rnbd-srv-sysfs.o After: ===== text data bss dec hex filename 4210 672 8 4890 131a drivers/block/rnbd/rnbd-srv-sysfs.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/e3d454173ffad30726c9351810d3aa7b75122711.1720462252.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09block: take offset into account in blk_bvec_map_sg againChristoph Hellwig1-2/+2
The rebase of commit 09595e0c9d65 ("block: pass a phys_addr_t to get_max_segment_size") lost adding the total to to the offset in blk_bvec_map_sg. Add it back. Fixes: 09595e0c9d65 ("block: pass a phys_addr_t to get_max_segment_size") Reported-by: Yi Zhang <yi.zhang@redhat.com> Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240709070126.3019940-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09block: fix get_max_segment_size() warningChaitanya Kulkarni1-1/+1
Correct the parameter name in the comment of get_max_segment_size() to fix following warning:- block/blk-merge.c:220: warning: Function parameter or struct member 'len' not described in 'get_max_segment_size' block/blk-merge.c:220: warning: Excess function parameter 'max_len' description in 'get_max_segment_size' Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240709045432.8688-1-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09loop: Don't bother validating blocksizeJohn Garry1-11/+1
The block queue limits validation does this for us now. The loop_configure() -> WARN_ON_ONCE() call is dropped, as an invalid block size would trigger this now. We don't want userspace to be able to directly trigger WARNs. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Link: https://lore.kernel.org/r/20240708091651.177447-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09virtio_blk: Don't bother validating blocksizeJohn Garry1-10/+1
The block queue limits validation does this for us now. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240708091651.177447-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09null_blk: Don't bother validating blocksizeJohn Garry1-3/+0
The block queue limits validation does this for us now. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Link: https://lore.kernel.org/r/20240708091651.177447-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09block: Validate logical block size in blk_validate_limits()John Garry2-0/+5
Some drivers validate that their own logical block size. It is no harm to always do this, so validate in blk_validate_limits(). This allows us to remove the validation in most of those drivers. Add a comment to blk_validate_block_size() to inform users that self- validation of LBS is usually unnecessary. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240708091651.177447-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09virtio_blk: Fix default logical block size fallbackJohn Garry1-13/+13
If we fail to read a logical block size in virtblk_read_limits() -> virtio_cread_feature(), then we default to what is in lim->logical_block_size, but that would be 0. We can deal with lim->logical_block_size = 0 later in the blk_mq_alloc_disk(), but the code in virtblk_read_limits() needs a proper default, so give a default of SECTOR_SIZE. Fixes: 27e32cd23fed ("block: pass a queue_limits argument to blk_mq_alloc_disk") Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240708091651.177447-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09Merge tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme into ↵Jens Axboe36-229/+902
for-6.11/block Pull NVMe updates from Keith: "nvme updates for Linux 6.11 - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng)" * tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme: (21 commits) nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id nvme-multipath: implement "queue-depth" iopolicy nvme-multipath: prepare for "queue-depth" iopolicy nvme-pci: do not directly handle subsys reset fallout lpfc_nvmet: implement 'host_traddr' nvme-fcloop: implement 'host_traddr' nvmet-fc: implement host_traddr() nvmet-rdma: implement host_traddr() nvmet-tcp: implement host_traddr() nvmet: add 'host_traddr' callback for debugfs nvmet: add debugfs support mailmap: add entry for Weiwen Hu nvme: rename CDR/MORE/DNR to NVME_STATUS_* nvme: fix status magic numbers nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_err nvme: split device add from initialization nvme: fc: split controller bringup handling nvme: rdma: split controller bringup handling nvme: tcp: split controller bringup handling ...
2024-07-08nvmet-auth: fix nvmet_auth hash error handlingGaosheng Cui1-6/+8
If we fail to call nvme_auth_augmented_challenge, or fail to kmalloc for shash, we should free the memory allocation for challenge, so add err path out_free_challenge to fix the memory leak. Fixes: 7a277c37d352 ("nvmet-auth: Diffie-Hellman key exchange support") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-08nvme: implement ->get_unique_idChristoph Hellwig3-0/+46
Implement the get_unique_id method to allow pNFS SCSI layout access to NVMe namespaces. This is the server side implementation of RFC 9561 "Using the Parallel NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe) Storage Devices". Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-08block: pass a phys_addr_t to get_max_segment_sizeChristoph Hellwig1-14/+11
Work on a single address to simplify the logic, and prepare the callers from using better helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240706075228.2350978-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-08block: add a bvec_phys helperChristoph Hellwig4-4/+18
Get callers out of poking into bvec internals a bit more. Not a huge win right now, but with the proposed new DMA mapping API we might end up with a lot more of this otherwise. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240706075228.2350978-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05blk-lib: check for kill signal in ioctl BLKZEROOUTChristoph Hellwig3-24/+45
Zeroout can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptable, they have to wait for it to finish, which could be many hours. Add a new BLKDEV_ZERO_KILLABLE flag for blkdev_issue_zeroout that checks for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Heavily based on an earlier patch from Keith Busch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240701165219.1571322-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05block: limit the Write Zeroes to manually writing zeroes fallbackChristoph Hellwig1-1/+1
Only fall back from hardware Write Zeroes failures when blkdev_issue_write_zeroes returns -EOPNOTSUPP; Note that blkdev_issue_write_zeroes turns any failure into -EOPNOTSUPP when the write zeroes queue limit has been cleared to 0, so this still catches all I/O errors where the driver detected missing support for the hardware acceleration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240701165219.1571322-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05block: refacto blkdev_issue_zerooutChristoph Hellwig1-39/+55
Split out two well-defined helpers for hardware supported Write Zeroes and manually writing zeroes using the Write command. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240701165219.1571322-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05block: move read-only and supported checks into (__)blkdev_issue_zerooutChristoph Hellwig1-28/+23
Move these checks out of the lower level helpers and into the higher level ones to prepare for refactoring. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240701165219.1571322-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05block: remove the LBA alignment check in __blkdev_issue_zerooutChristoph Hellwig1-5/+0
__blkdev_issue_zeroout is a purely kernel internal API and thus can rely on the block layer sector alignment checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240701165219.1571322-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05block: factor out a blk_write_zeroes_limit helperChristoph Hellwig1-7/+11
Contrary to the comment in __blkdev_issue_write_zeroes, nothing here checks for a potential bi_size overflow. Add a helper mirroring the secure erase code for the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240701165219.1571322-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05block: Remove blk_alloc_zone_bitmap()Damien Le Moal1-10/+2
Remove the helper function blk_alloc_zone_bitmap() and replace its single call site with a call to bitmap_alloc(). To be consistent with this change, use bitmap_free() to free a disk convnetional zone bitmap. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05block: Remove REQ_OP_ZONE_RESET_ALL emulationDamien Le Moal8-87/+9
Now that device mapper can handle resetting all zones of a mapped zoned device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers support this operation. With this, the request queue feature BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in blk-zone.c can be removed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05dm: handle REQ_OP_ZONE_RESET_ALLDamien Le Moal4-5/+190
This commit implements processing of the REQ_OP_ZONE_RESET_ALL operation for zoned mapped devices. Given that this operation always has a BIO sector of 0 and a 0 size, processing through the regular BIO __split_and_process_bio() function does not work because this function would always select the first target. Instead, handling of this operation is implemented using the function __send_zone_reset_all(). Similarly to the __send_empty_flush() function, the new __send_zone_reset_all() function manually goes through all targets of a mapped device table doing the following: 1) If the target can natively support REQ_OP_ZONE_RESET_ALL, __send_duplicate_bios() is used to forward the reset all operation to the target. This case is handled with the __send_zone_reset_all_native() function. 2) For other targets, the function __send_zone_reset_all_emulated() is executed to emulate the execution of REQ_OP_ZONE_RESET_ALL using regular REQ_OP_ZONE_RESET operations. Targets that can natively support REQ_OP_ZONE_RESET_ALL are identified using the new target field zone_reset_all_supported. This boolean is set to true in for targets that have reliable zone limits, that is, targets that map all sequential write required zones of their zoned device(s). Setting this field is handled in dm_set_zones_restrictions() and device_get_zone_resource_limits(). For targets with unreliable zone limits, REQ_OP_ZONE_RESET_ALL must be emulated (case 2 above). This is implemented with __send_zone_reset_all_emulated() and is similar to the block layer function blkdev_zone_reset_all_emulated(): first a report zones is done for the zones of the target to identify zones that need reset, that is, any sequential write required zone that is not already empty. This is done using a bitmap and the function dm_zone_get_reset_bitmap() which sets to 1 the bit corresponding to a zone that needs reset. Next, this zone bitmap is inspected and a clone BIO modified to use the REQ_OP_ZONE_RESET operation issued for any zone with its bit set in the zone bitmap. This implementation is more efficient than what the block layer does with blkdev_zone_reset_all_emulated(), which is always used for DM zoned devices currently: as we can natively use REQ_OP_ZONE_RESET_ALL on targets mapping all sequential write required zones, resetting all zones of a zoned mapped device can be much faster compared to always emulating this operation using regular per-zone reset. In the worst case, this implementation is as-efficient as the block layer emulation. This reduction in the time it takes to reset all zones of a zoned mapped device depends directly on the mapped device targets mapping (reliable zone limits or not). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05dm: Refactor is_abnormal_io()Damien Le Moal1-13/+11
Use a single switch-case to simplify is_abnormal_io() and make this function more readable and easier to modify. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05null_blk: Introduce the zone_full parameterDamien Le Moal3-3/+17
Allow creating a zoned null_blk device with the initial state of its sequential write required zones to be FULL. This is convenient to avoid having to first write these zones to perform read performance evaluation or test zone management operations such as zone reset (and zone reset all). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05loop: remove the unused inode variable in loop_configureChristoph Hellwig1-2/+0
Remove the inode variable now that the last user is gone. Fixes: a17ece76bcfe ("loop: regularize upgrading the block size for direct I/O") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240705053114.2042976-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04block: reuse original bio_vec array for integrity during cloneAnuj Gupta1-6/+3
Modify bio_integrity_clone to reuse the original bvec array instead of allocating and copying it, similar to how bio data path is cloned. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240702100753.2168-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04null_blk: don't initialize static 'g_virt_boundary' to falseZhu Yanjun1-1/+1
No functional changes intended. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://lore.kernel.org/r/20240704010638.324349-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04Merge tag 'md-6.11-20240704' of ↵Jens Axboe4-42/+52
git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.11/block Merge MD fixes from Song: "This PR contains various small fixes by Yu Kuai, Benjamin Marzinski, Christophe JAILLET, and Yang Li." * tag 'md-6.11-20240704' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid5: recheck if reshape has finished with device_lock held md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl md-cluster: Constify struct md_cluster_operations md: Remove unneeded semicolon md/raid5: fix spares errors about rcu usage
2024-07-04block: t10-pi: Return correct ref tag when queue has no integrity profileAnuj Gupta1-2/+8
Commit c6e56cf6b2e7 ("block: move integrity information into queue_limits") changed the ref tag calculation logic. It would break if there is no integrity profile. This in turn causes read/write failures for such cases. Fixes: c6e56cf6b2e7 ("block: move integrity information into queue_limits") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20240704061515.282343-1-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-04md/raid5: recheck if reshape has finished with device_lock heldBenjamin Marzinski1-23/+41
When handling an IO request, MD checks if a reshape is currently happening, and if so, where the IO sector is in relation to the reshape progress. MD uses conf->reshape_progress for both of these tasks. When the reshape finishes, conf->reshape_progress is set to MaxSector. If this occurs after MD checks if the reshape is currently happening but before it calls ahead_of_reshape(), then ahead_of_reshape() will end up comparing the IO sector against MaxSector. During a backwards reshape, this will make MD think the IO sector is in the area not yet reshaped, causing it to use the previous configuration, and map the IO to the sector where that data was before the reshape. This bug can be triggered by running the lvm2 lvconvert-raid-reshape-linear_to_raid6-single-type.sh test in a loop, although it's very hard to reproduce. Fix this by factoring the code that checks where the IO sector is in relation to the reshape out to a helper called get_reshape_loc(), which reads reshape_progress and reshape_safe while holding the device_lock, and then rechecks if the reshape has finished before calling ahead_of_reshape with the saved values. Also use the helper during the REQ_NOWAIT check to see if the location is inside of the reshape region. Fixes: fef9c61fdfabf ("md/raid5: change reshape-progress measurement to cope with reshaping backwards.") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240702151802.1632010-1-bmarzins@redhat.com
2024-07-04md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctlYu Kuai1-6/+0
Commit 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.") explained in the commit message that failed device must be reomoved from the personality first by md_check_recovery(), before it can be removed from the array. That's the reason the commit add the code to wait for MD_RECOVERY_NEEDED. However, this is not the case now, because remove_and_add_spares() is called directly from hot_remove_disk() from ioctl path, hence failed device(marked faulty) can be removed from the personality by ioctl. On the other hand, the commit introduced a performance problem that if MD_RECOVERY_NEEDED is set and the array is not running, ioctl will wait for 5s before it can return failure to user. Since the waiting is not needed now, fix the problem by removing the waiting. Fixes: 90f5f7ad4f38 ("md: Wait for md_check_recovery before attempting device removal.") Reported-by: Mateusz Kusiak <mateusz.kusiak@linux.intel.com> Closes: https://lore.kernel.org/all/814ff6ee-47a2-4ba0-963e-cf256ee4ecfa@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240627112321.3044744-1-yukuai1@huaweicloud.com
2024-07-04md-cluster: Constify struct md_cluster_operationsChristophe JAILLET3-5/+5
'struct md_cluster_operations' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 51941 1442 80 53463 d0d7 drivers/md/md-cluster.o After: ===== text data bss dec hex filename 52133 1246 80 53459 d0d3 drivers/md/md-cluster.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.1719173852.git.christophe.jaillet@wanadoo.fr
2024-07-04md: Remove unneeded semicolonYang Li1-1/+1
./drivers/md/md.c:630:21-22: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9344 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240618010759.85416-1-yang.lee@linux.alibaba.com
2024-07-04md/raid5: fix spares errors about rcu usageYu Kuai1-7/+5
As commit ad8606702f26 ("md/raid5: remove rcu protection to access rdev from conf") explains, rcu protection can be removed, however, there are three places left, there won't be any real problems. drivers/md/raid5.c:8071:24: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:8071:24: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:8071:24: struct md_rdev * drivers/md/raid5.c:7569:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7569:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7569:25: struct md_rdev * drivers/md/raid5.c:7573:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7573:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7573:25: struct md_rdev * Fixes: ad8606702f26 ("md/raid5: remove rcu protection to access rdev from conf") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240615085143.1648223-1-yukuai1@huaweicloud.com
2024-07-02xen-blkfront: fix sector_size propagation to the block layerChristoph Hellwig1-11/+5
Ensure that info->sector_size and info->physical_sector_size are set before the call to blkif_set_queue_limits by doing away with the local variables and arguments that propagate them. Thanks to Marek Marczykowski-Górecki and Jürgen Groß for root causing the issue. Fixes: ba3f67c11638 ("xen-blkfront: atomically update queue limits") Reported-by: Rusty Bird <rustybird@net-c.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20240625055238.7934-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02null_blk: Fix description of the fua parameterDamien Le Moal1-1/+1
The description of the fua module parameter is defined using MODULE_PARM_DESC() with the first argument passed being "zoned". That is the wrong name, obviously. Fix that by using the correct "fua" parameter name so that "modinfo null_blk" displays correct information. Fixes: f4f84586c8b9 ("null_blk: Introduce fua attribute") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240702073234.206458-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02block/mq-deadline: Fix the tag reservation codeBart Van Assche1-3/+17
The current tag reservation code is based on a misunderstanding of the meaning of data->shallow_depth. Fix the tag reservation code as follows: * By default, do not reserve any tags for synchronous requests because for certain use cases reserving tags reduces performance. See also Harshit Mogalapalli, [bug-report] Performance regression with fio sequential-write on a multipath setup, 2024-03-07 (https://lore.kernel.org/linux-block/5ce2ae5d-61e2-4ede-ad55-551112602401@oracle.com/) * Reduce min_shallow_depth to one because min_shallow_depth must be less than or equal any shallow_depth value. * Scale dd->async_depth from the range [1, nr_requests] to [1, bits_per_sbitmap_word]. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Zhiguo Niu <zhiguo.niu@unisoc.com> Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240509170149.7639-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02block: Call .limit_depth() after .hctx has been setBart Van Assche1-6/+6
Call .limit_depth() after data->hctx has been set such that data->hctx can be used in .limit_depth() implementations. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Zhiguo Niu <zhiguo.niu@unisoc.com> Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240509170149.7639-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-02nvme-multipath: implement "queue-depth" iopolicyThomas Song3-5/+87
The round-robin path selector is inefficient in cases where there is a difference in latency between paths. In the presence of one or more high latency paths the round-robin selector continues to use the high latency path equally. This results in a bias towards the highest latency path and can cause a significant decrease in overall performance as IOs pile on the highest latency path. This problem is acute with NVMe-oF controllers. The queue-depth path selector sends I/O down the path with the lowest number of requests in its request queue. Paths with lower latency will clear requests more quickly and have less requests queued compared to higher latency paths. The goal of this path selector is to make more use of lower latency paths which will bring down overall IO latency and increase throughput and performance. Signed-off-by: Thomas Song <tsong@purestorage.com> [emilne: commandeered patch developed by Thomas Song @ Pure Storage] Co-developed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ewan D. Milne <emilne@redhat.com> Co-developed-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com> Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/ Tested-by: Marco Patalano <mpatalan@redhat.com> Tested-by: Jyoti Rani <jrani@purestorage.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Randy Jennings <randyj@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-01nvme: don't set io_opt if NOWS is zeroChristoph Hellwig1-1/+2
NOWS is one of the annoying "0's based values" in NVMe, where 0 means one and we thus can't detect if it isn't set. Thus a NOWS value of 0 means that the Namespace Optimal Write Size is a single LBA, which is clearly bogus. Ignore the value in that case and don't propagate an io_opt value to the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240701051800.1245240-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01block: don't reduce max_sectors based on io_optChristoph Hellwig1-1/+1
Don't reduce the max_sectors value below the normal cap when the driver advertsizes a very low io_opt. This restores the behavior we had before the recent changes to the max_sectors calculation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240701051800.1245240-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01block: remove a duplicate io_min check in blk_validate_limitsChristoph Hellwig1-2/+1
If io_min is larger than the cap, it must by definition be non-zero. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240701051800.1245240-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-01blk-wbt: don't throttle swap writes in direct reclaimBaokun Li1-9/+9
Now we avoid throttling swap writes by determining whether the current process is kswapd (aka current_is_kswapd()), but swap writes can come from either kswapd or direct reclaim, so the swap writes from direct reclaim will still be throttled. When a process holds a lock to allocate a free page, and enters direct reclaim because there is no free memory, then it might trigger a hung due to the wbt throttling that causes other processes to fail to get the lock. Both kswapd and direct reclaim set the REQ_SWAP flag, so use REQ_SWAP instead of current_is_kswapd() to avoid throttling swap writes. Also renamed WBT_KSWAPD to WBT_SWAP and WBT_RWQ_KSWAPD to WBT_RWQ_SWAP. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240604030522.3686177-1-libaokun@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-29block: pass a gendisk to the queue_sysfs_entry methodsChristoph Hellwig3-97/+96
The kobject for the queue entries is embedded into a struct gendisk. Pass it to the sysfs methods instead of the request_queue derived from it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240627111407.476276-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-29block: add helper macros to de-duplicate the queue sysfs attributesChristoph Hellwig1-173/+82
A lof the code to implement the queue sysfs attributes is repetitive. Add a few macros to generate the common cases. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240627111407.476276-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-29block: simplify queue_logical_block_sizeChristoph Hellwig1-6/+1
queue_logical_block_size is never called with a 0 queue, and the logical_block_size field in queue_limits is always initialized for a live queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240627111407.476276-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28blk-throttle: fix lower control under super low iops limitYu Kuai1-0/+3
User will configure allowed iops limit in 1s, and calculate_io_allowed() will calculate allowed iops in the slice by: limit * HZ / throtl_slice However, if limit is quite low, the result can be 0, then allowed IO in the slice is 0, this will cause missing dispatch and control will be lower than limit. For example, set iops_limit to 5 with HD disk, and test will found that iops will be 3. This is usually not a big deal, because user will unlikely to configure such low iops limit, however, this is still a problem in the extreme scene. Fix the problem by making sure the wait time calculated by tg_within_iops_limit() should allow at least one IO to be dispatched. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240618062108.3680835-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28block: set bip_vcnt correctlyAnuj Gupta1-0/+2
Set the bip_vcnt correctly in bio_integrity_init_user and bio_integrity_copy_user. If the bio gets split at a later point, this value is required to set the right bip_vcnt in the cloned bio. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240626100700.3629-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28rust: block: fix generated bindings after refactoring of featuresAndreas Hindborg1-0/+2
Block device features and flags were refactored from `enum` to `#define`. This broke Rust binding generation. This patch fixes the binding generation. Fixes: fcf865e357f8 ("block: convert features and flags to __bitwise types") Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Link: https://lore.kernel.org/r/20240628091152.2185241-1-nmi@metaspace.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>