summaryrefslogtreecommitdiff
path: root/block/ioctl.c
AgeCommit message (Collapse)AuthorFilesLines
2024-09-11block: implement async io_uring discard cmdPavel Begunkov1-0/+112
io_uring allows implementing custom file specific asynchronous operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD requests or just io_uring commands. Use it to add support for async discards. Normally, it first tries to queue up bios in a non-blocking context, and if that fails, we'd retry from a blocking context by returning -EAGAIN to the core io_uring. We always get the result from bios asynchronously by setting a custom bi_end_io callback, at which point we drag the request into the task context to either reissue or complete it and post a completion to the user. Unlike ioctl(BLKDISCARD) with stronger guarantees against races, we only do a best effort attempt to invalidate page cache, and it can race with any writes and reads and leave page cache stale. It's the same kind of races we allow to direct writes. Also, apart from cases where discarding is not allowed at all, e.g. discards are not supported or the file/device is read only, the user should assume that the sector range on disk is not valid anymore, even when an error was returned to the user. Suggested-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11block: introduce blk_validate_byte_range()Pavel Begunkov1-19/+32
In preparation to further changes extract a helper function out of blk_ioctl_discard() that validates if we can do IO against the given range of disk byte addresses. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/19a7779323c71e742a2f511e4cf49efcfd68cfd4.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-04block: fix integer overflow in BLKSECDISCARDAlexey Dobriyan1-4/+5
I independently rediscovered commit 22d24a544b0d49bbcbd61c8c0eaf77d3c9297155 block: fix overflow in blk_ioctl_discard() but for secure erase. Same problem: uint64_t r[2] = {512, 18446744073709551104ULL}; ioctl(fd, BLKSECDISCARD, r); will enter near infinite loop inside blkdev_issue_secure_erase(): a.out: attempt to access beyond end of device loop0: rw=5, sector=3399043073, nr_sectors = 1024 limit=2048 bio_check_eod: 3286214 callbacks suppressed Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Link: https://lore.kernel.org/r/9e64057f-650a-46d1-b9f7-34af391536ef@p183 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05blk-lib: check for kill signal in ioctl BLKZEROOUTChristoph Hellwig1-1/+1
Zeroout can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptable, they have to wait for it to finish, which could be many hours. Add a new BLKDEV_ZERO_KILLABLE flag for blkdev_issue_zeroout that checks for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Heavily based on an earlier patch from Keith Busch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240701165219.1571322-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-21Merge tag 'pull-bd_flags-2' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev flags update from Al Viro: "Compactifying bdev flags. We can easily have up to 24 flags with sane atomicity, _without_ pushing anything out of the first cacheline of struct block_device" * tag 'pull-bd_flags-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: bdev: move ->bd_make_it_fail to ->__bd_flags bdev: move ->bd_ro_warned to ->__bd_flags bdev: move ->bd_has_subit_bio to ->__bd_flags bdev: move ->bd_write_holder into ->__bd_flags bdev: move ->bd_read_only to ->__bd_flags bdev: infrastructure for flags wrapper for access to ->bd_partno Use bdev_is_paritition() instead of open-coding it
2024-05-21Merge tag 'pull-bd_inode-1' of ↵Linus Torvalds1-8/+6
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number
2024-05-21Merge tag 'pull-set_blocksize' of ↵Linus Torvalds1-9/+12
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file *' and verifies that the caller has the device opened exclusively" * tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()
2024-05-13Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linuxLinus Torvalds1-6/+36
Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...
2024-05-07blk-lib: check for kill signal in ioctl BLKDISCARDChristoph Hellwig1-3/+30
Discards can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptable, they have to wait for it to finish, which could be many hours. Open code blkdev_issue_discard in the BLKDISCARD ioctl handler and check for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Heavily based on an earlier patch from Keith Busch. Reported-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block: move discard checks into the ioctl handlerChristoph Hellwig1-2/+5
Most bio operations get basic sanity checking in submit_bio and anything more complicated than that is done in the callers. Discards are a bit different from that in that a lot of checking is done in __blkdev_issue_discard, and the specific errnos for that are returned to userspace. Move the checks that require specific errnos to the ioctl handler instead, and just leave the basic sanity checking in submit_bio for the other handlers. This introduces two changes in behavior: 1) the logical block size alignment check of the start and len is lost for non-ioctl callers. This matches what is done for other operations including reads and writes. We should probably verify this for all bios, but for now make discards match the normal flow. 2) for non-ioctl callers all errors are reported on I/O completion now instead of synchronously. Callers in general mostly ignore or log errors so this will actually simplify the code once cleaned up Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240506042027.2289826-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-07block/ioctl: prefer different overflow checkJustin Stitt1-1/+1
Running syzkaller with the newly reintroduced signed integer overflow sanitizer shows this report: [ 62.982337] ------------[ cut here ]------------ [ 62.985692] cgroup: Invalid name [ 62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46 [ 62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1 [ 62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long' [ 62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1 [ 62.999369] random: crng reseeded on system resumption [ 63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000) [ 63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1 [ 63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 63.000682] Call Trace: [ 63.000686] <TASK> [ 63.000731] dump_stack_lvl+0x93/0xd0 [ 63.000919] __get_user_pages+0x903/0xd30 [ 63.001030] __gup_longterm_locked+0x153e/0x1ba0 [ 63.001041] ? _raw_read_unlock_irqrestore+0x17/0x50 [ 63.001072] ? try_get_folio+0x29c/0x2d0 [ 63.001083] internal_get_user_pages_fast+0x1119/0x1530 [ 63.001109] iov_iter_extract_pages+0x23b/0x580 [ 63.001206] bio_iov_iter_get_pages+0x4de/0x1220 [ 63.001235] iomap_dio_bio_iter+0x9b6/0x1410 [ 63.001297] __iomap_dio_rw+0xab4/0x1810 [ 63.001316] iomap_dio_rw+0x45/0xa0 [ 63.001328] ext4_file_write_iter+0xdde/0x1390 [ 63.001372] vfs_write+0x599/0xbd0 [ 63.001394] ksys_write+0xc8/0x190 [ 63.001403] do_syscall_64+0xd4/0x1b0 [ 63.001421] ? arch_exit_to_user_mode_prepare+0x3a/0x60 [ 63.001479] entry_SYSCALL_64_after_hwframe+0x6f/0x77 [ 63.001535] RIP: 0033:0x7f7fd3ebf539 [ 63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539 [ 63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004 [ 63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000 [ 63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8 ... [ 63.018142] ---[ end trace ]--- Historically, the signed integer overflow sanitizer did not work in the kernel due to its interaction with `-fwrapv` but this has since been changed [1] in the newest version of Clang; It was re-enabled in the kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow sanitizer"). Let's rework this overflow checking logic to not actually perform an overflow during the check itself, thus avoiding the UBSAN splat. [1]: https://github.com/llvm/llvm-project/pull/82432 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...Al Viro1-6/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-6-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03use ->bd_mapping instead of ->bd_inode->i_mappingAl Viro1-2/+2
Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-03bdev: move ->bd_read_only to ->__bd_flagsAl Viro1-1/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-03set_blocksize(): switch to passing struct file *Al Viro1-9/+12
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-18block: propagate partition scanning errors to the BLKRRPART ioctlChristoph Hellwig1-1/+2
Commit 4601b4b130de ("block: reopen the device in blkdev_reread_part") lost the propagation of I/O errors from the low-level read of the partition table to the user space caller of the BLKRRPART. Apparently some user space relies on, so restore the propagation. This isn't exactly pretty as other block device open calls explicitly do not are about these errors, so add a new BLK_OPEN_STRICT_SCAN to opt into the error propagation. Fixes: 4601b4b130de ("block: reopen the device in blkdev_reread_part") Reported-by: Saranya Muruganandam <saranyamohan@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20240417144743.2277601-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02block: fix overflow in blk_ioctl_discard()Li Nan1-2/+3
There is no check for overflow of 'start + len' in blk_ioctl_discard(). Hung task occurs if submit an discard ioctl with the following param: start = 0x80000000000ff000, len = 0x8000000000fff000; Add the overflow validation now. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240329012319.2034550-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linuxLinus Torvalds1-1/+8
Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds1-5/+4
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-03-06block: move capacity validation to blkpg_do_ioctl()Li Lingfeng1-1/+8
Commit 6d4e80db4ebe ("block: add capacity validation in bdev_add_partition()") add check of partition's start and end sectors to prevent exceeding the size of the disk when adding partitions. However, there is still no check for resizing partitions now. Move the check to blkpg_do_ioctl() to cover resizing partitions. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240305032132.548958-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-25block/ioctl: port blkdev_bszset() to fileChristian Brauner1-5/+4
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-2-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-22block: Move checking GENHD_FL_NO_PART to bdev_add_partition()Li Lingfeng1-2/+0
Commit 1a721de8489f ("block: don't add or resize partition on the disk with GENHD_FL_NO_PART") prevented all operations about partitions on disks with GENHD_FL_NO_PART in blkpg_do_ioctl() since they are meaningless. However, it changed error code in some scenarios. So move checking GENHD_FL_NO_PART to bdev_add_partition() to eliminate impact. Fixes: 1a721de8489f ("block: don't add or resize partition on the disk with GENHD_FL_NO_PART") Reported-by: Allison Karlitskaya <allison.karlitskaya@redhat.com> Closes: https://lore.kernel.org/all/CAOYeF9VsmqKMcQjo1k6YkGNujwN-nzfxY17N3F-CMikE1tYp+w@mail.gmail.com/ Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240118130401.792757-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-13block: add check that partition length needs to be aligned with block sizeMin Li1-4/+7
Before calling add partition or resize partition, there is no check on whether the length is aligned with the logical block size. If the logical block size of the disk is larger than 512 bytes, then the partition size maybe not the multiple of the logical block size, and when the last sector is read, bio_truncate() will adjust the bio size, resulting in an IO error if the size of the read command is smaller than the logical block size.If integrity data is supported, this will also result in a null pointer dereference when calling bio_integrity_free. Cc: <stable@vger.kernel.org> Signed-off-by: Min Li <min15.li@samsung.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230629142517.121241-1-min15.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-28fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lockJan Kara1-2/+3
The implementation of bdev holder operations such as fs_bdev_mark_dead() and fs_bdev_sync() grab sb->s_umount semaphore under bdev->bd_holder_lock. This is problematic because it leads to disk->open_mutex -> sb->s_umount lock ordering which is counterintuitive (usually we grab higher level (e.g. filesystem) locks first and lower level (e.g. block layer) locks later) and indeed makes lockdep complain about possible locking cycles whenever we open a block device while holding sb->s_umount semaphore. Implement a function bdev_super_lock_shared() which safely transitions from holding bdev->bd_holder_lock to holding sb->s_umount on alive superblock without introducing the problematic lock dependency. We use this function fs_bdev_sync() and fs_bdev_mark_dead(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231018152924.3858-1-jack@suse.cz Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28block: Use bdev_open_by_dev() in disk_scan_partitions() and blkdev_bszset()Jan Kara1-2/+4
Convert disk_scan_partitions() and blkdev_bszset() to use bdev_open_by_dev(). Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-3-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-31block: don't add or resize partition on the disk with GENHD_FL_NO_PARTLi Lingfeng1-0/+2
Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") remove disk_expand_part_tbl() in add_partition(), which means all kinds of devices will support extended dynamic `dev_t`. However, some devices with GENHD_FL_NO_PART are not expected to add or resize partition. Fix this by adding check of GENHD_FL_NO_PART before add or resize partition. Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230831075900.1725842-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-21block: call into the file system for ioctl BLKFLSBUFChristoph Hellwig1-1/+8
BLKFLSBUF is a historic ioctl that is called on a file handle to a block device and syncs either the file system mounted on that block device if there is one, or otherwise the just the data on the block device. Replace the get_super based syncing with a holder operation to remove the last usage of get_super, and to also support syncing the file system if the block device is not the main block device stored in s_dev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-16-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-20block: fine-granular CAP_SYS_ADMIN for Persistent ReservationJingbo Xu1-19/+22
Allow of unprivileged Persistent Reservation operations on devices if the write permission check on the device node has passed. brw-rw---- 1 root disk 259, 0 Jun 13 07:09 /dev/nvme0n1 In the example above, the "disk" group of nvme0n1 is also allowed to make reservations on the device even without CAP_SYS_ADMIN. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230613084008.93795-3-jefflexu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20block: disallow Persistent Reservation on partitionsJingbo Xu1-5/+17
Refuse Persistent Reservation operations on partitions as reservation on partitions doesn't make sense. Besides, introduce blkdev_pr_allowed() helper, where more policies could be placed here later. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230613084008.93795-2-jefflexu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: replace fmode_t with a block-specific type for block open flagsChristoph Hellwig1-30/+12
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: remove unused fmode_t arguments from ioctl handlersChristoph Hellwig1-7/+7
A few ioctl handlers have fmode_t arguments that are entirely unused, remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230608110258.189493-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: use the holder as indication for exclusive opensChristoph Hellwig1-3/+2
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: introduce holder opsChristoph Hellwig1-1/+2
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-17block: fix scan partition for exclusively open device againYu Kuai1-1/+1
As explained in commit 36369f46e917 ("block: Do not reread partition table on exclusively open device"), reread partition on the device that is exclusively opened by someone else is problematic. This patch will make sure partition scan will only be proceed if current thread open the device exclusively, or the device is not opened exclusively, and in the later case, other scanners and exclusive openers will be blocked temporarily until partition scan is done. Fixes: 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions") Cc: <stable@vger.kernel.org> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230217022200.3092987-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-17block: Revert "block: Do not reread partition table on exclusively open device"Yu Kuai1-7/+6
This reverts commit 36369f46e91785688a5f39d7a5590e3f07981316. This patch can't fix the problem in a corner case that device can be opened exclusively after the checking and before blkdev_get_by_dev(). We'll use a new solution to fix the problem in the next patch, and the new solution doesn't need to change apis. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230217022200.3092987-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-01block: Do not reread partition table on exclusively open deviceJan Kara1-5/+7
Since commit 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions") we allow rereading of partition table although there are users of the block device. This has an undesirable consequence that e.g. if sda and sdb are assembled to a RAID1 device md0 with partitions, BLKRRPART ioctl on sda will rescan partition table and create sda1 device. This partition device under a raid device confuses some programs (such as libstorage-ng used for initial partitioning for distribution installation) leading to failures. Fix the problem refusing to rescan partitions if there is another user that has the block device exclusively open. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221130135344.2ul4cyfstfs3znxg@quack3 Fixes: 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221130175653.24299-1-jack@suse.cz [axboe: fold in followup fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06block: replace blkdev_nr_zones with bdev_nr_zonesChristoph Hellwig1-1/+1
Pass a block_device instead of a request_queue as that is what most callers have at hand. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-18block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARDChristoph Hellwig1-8/+35
Secure erase is a very different operation from discard in that it is a data integrity operation vs hint. Fully split the limits and helper infrastructure to make the separation more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-18block: remove QUEUE_FLAG_DISCARDChristoph Hellwig1-2/+1
Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-18block: add a bdev_nonrot helperChristoph Hellwig1-1/+1
Add a helper to check the nonrot flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-15block/compat_ioctl: fix range check in BLKGETSIZEKhazhismel Kumykov1-1/+1
kernel ulong and compat_ulong_t may not be same width. Use type directly to eliminate mismatches. This would result in truncation rather than EFBIG for 32bit mode for large disks. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220414224056.2875681-1-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29block: merge disk_scan_partitions and blkdev_reread_partChristoph Hellwig1-26/+5
Unify the functionality that implements a partition rescan for a gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211122130625.1136848-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09block: Hold invalidate_lock in BLKZEROOUT ioctlShin'ichiro Kawasaki1-3/+9
When BLKZEROOUT ioctl and data read race, the data read leaves stale page cache. To avoid the stale page cache, hold invalidate_lock of the block device file mapping. The stale page cache is observed when blktests test case block/009 is modified to call "blkdiscard -z" command and repeated hundreds of times. This patch can be applied back to the stable kernel version v5.15.y. Rework is required for older stable kernels. Fixes: 22dd6d356628 ("block: invalidate the page cache when issuing BLKZEROOUT") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: stable@vger.kernel.org # v5.15 Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09block: Hold invalidate_lock in BLKDISCARD ioctlShin'ichiro Kawasaki1-3/+9
When BLKDISCARD ioctl and data read race, the data read leaves stale page cache. To avoid the stale page cache, hold invalidate_lock of the block device file mapping. The stale page cache is observed when blktests test case block/009 is repeated hundreds of times. This patch can be applied back to the stable kernel version v5.15.y with slight patch edit. Rework is required for older stable kernels. Fixes: 351499a172c0 ("block: Invalidate cache on discard v2") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: stable@vger.kernel.org # v5.15 Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211109104723.835533-2-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19block/ioctl: use bdev_nr_sectors and bdev_nr_bytesChristoph Hellwig1-12/+8
Use the proper helper to read the block device size. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211019062024.2171074-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: merge block_ioctl into blkdev_ioctlChristoph Hellwig1-4/+14
Simplify the ioctl path and match the code structure on the compat side. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012104450.659013-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: unexport blkdev_ioctlChristoph Hellwig1-1/+0
With the raw driver gone, there is no modular user left. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211012104450.659013-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: pass a gendisk to bdev_resize_partitionChristoph Hellwig1-1/+1
bdev_resize_partition can only operate on the whole device. Make that clear by passing a gendisk instead of a block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210810154512.1809898-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: pass a gendisk to bdev_del_partitionChristoph Hellwig1-1/+1
bdev_del_partition can only operate on the whole device. Make that clear by passing a gendisk instead of a block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210810154512.1809898-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12block: pass a gendisk to bdev_add_partitionChristoph Hellwig1-1/+2
bdev_add_partition can only operate on the whole device. Make that clear by passing a gendisk instead of a block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210810154512.1809898-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>