summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)AuthorFilesLines
2024-07-02dm-verity: move hash algorithm setup into its own functionEric Biggers1-31/+39
Move the code that sets up the hash transformation into its own function. No change in behavior. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm init: Handle minors larger than 255Benjamin Marzinski1-1/+3
dm_parse_device_entry() simply copies the minor number into dmi.dev, but the dev_t format splits the minor number between the lowest 8 bytes and highest 12 bytes. If the minor number is larger than 255, part of it will end up getting treated as the major number Fix this by checking that the minor number is valid and then encoding it as a dev_t. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm cache metadata: remove unused struct 'thunk'Dr. David Alan Gilbert1-9/+0
'thunk' has been unused since commit f177940a8091 ("dm cache metadata: switch to using the new cursor api for loading metadata"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm io: remove code duplication between sync_io and aysnc_ioBenjamin Marzinski1-36/+23
The only difference between the code to setup and dispatch the io in sync_io() and async_io() is the sync argument to dispatch_io(), which is used to update the opf argument. Update the opf argument direcly in sync_io(), and remove the sync argument from dispatch_io(). Then, make sync_io() call async_io() instead of duplicting all of its code. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm io: don't call the async_io notify.fn on invalid num_regionsBenjamin Marzinski1-19/+12
If dm_io() returned an error, callers that set a notify.fn and wanted it called on an error need to check the return value and call notify.fn themselves if it was -EINVAL but not if it was -EIO. None of them do this (granted, all the existing async_io users of dm_io call it in a way that is guaranteed to not return an error). Simplify the interface by never calling the notify.fn if dm_io returns an error. This works with the existing dm_io callers which check for an error and handle it using the same methods as the notify.fn. This also allows us to move the now equivalent num_regions checks out of sync_io() and async_io() and into dm_io() itself. Additionally, change async_io() into a void function, since it can no longer fail. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-07-02dm io: bump num_bvecs to handle offset memoryBenjamin Marzinski1-1/+1
If dp->get_page() returns a non-zero offset, the bio might need an additional bvec to deal with the offset. For example, if remaining is exactly one page size, but there is an offset, the memory will span two pages. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-06-28bcache: work around a __bitwise to bool conversion sparse warningChristoph Hellwig1-2/+2
Sparse is a bit dumb about bitwise operation on __bitwise types used in boolean contexts. Add a !! to explicitly propagate to boolean without a warning. Fixes: fcf865e357f8 ("block: convert features and flags to __bitwise types") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240628131657.667797-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26md: set md-specific flags for all queue limitsChristoph Hellwig6-9/+14
The md driver wants to enforce a number of flags for all devices, even when not inheriting them from the underlying devices. To make sure these flags survive the queue_limits_set calls that md uses to update the queue limits without deriving them form the previous limits add a new md_init_stacking_limits helper that calls blk_set_stacking_limits and sets these flags. Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26dm: optimize flushesMikulas Patocka5-12/+46
Device mapper sends flush bios to all the targets and the targets send it to the underlying device. That may be inefficient, for example if a table contains 10 linear targets pointing to the same physical device, then device mapper would send 10 flush bios to that device - despite the fact that only one bio would be sufficient. This commit optimizes the flush behavior. It introduces a per-target variable flush_bypasses_map - it is set when the target supports flush optimization - currently, the dm-linear and dm-stripe targets support it. When all the targets in a table have flush_bypasses_map, flush_bypasses_map on the table is set. __send_empty_flush tests if the table has flush_bypasses_map - and if it has, no flush bios are sent to the targets via the "map" method and the list dm_table->devices is iterated and the flush bios are sent to each member of the list. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Suggested-by: Yang Yang <yang.yang@vivo.com>
2024-06-25bcache: remove heap-related macros and switch to generic min_heapKuan-Wei Chiu11-217/+263
Drop the heap-related macros from bcache and replacing them with the generic min_heap implementation from include/linux. By doing so, code readability is improved by using functions instead of macros. Moreover, the min_heap implementation in include/linux adopts a bottom-up variation compared to the textbook version currently used in bcache. This bottom-up variation allows for approximately 50% reduction in the number of comparison operations during heap siftdown, without changing the number of swaps, thus making it more efficient. Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu Link: https://lkml.kernel.org/r/20240524152958.919343-16-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Coly Li <colyli@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-25lib min_heap: rename min_heapify() to min_heap_sift_down()Kuan-Wei Chiu1-1/+1
After adding min_heap_sift_up(), the naming convention has been adjusted to maintain consistency with the min_heap_sift_up(). Consequently, min_heapify() has been renamed to min_heap_sift_down(). Link: https://lkml.kernel.org/CAP-5=fVcBAxt8Mw72=NCJPRJfjDaJcqk4rjbadgouAEAHz_q1A@mail.gmail.com Link: https://lkml.kernel.org/r/20240524152958.919343-13-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-25lib min_heap: add args for min_heap_callbacksKuan-Wei Chiu2-9/+10
Add a third parameter 'args' for the 'less' and 'swp' functions in the 'struct min_heap_callbacks'. This additional parameter allows these comparison and swap functions to handle extra arguments when necessary. Link: https://lkml.kernel.org/r/20240524152958.919343-9-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-25lib min_heap: add type safe interfaceKuan-Wei Chiu2-7/+7
Implement a type-safe interface for min_heap using strong type pointers instead of void * in the data field. This change includes adding small macro wrappers around functions, enabling the use of __minheap_cast and __minheap_obj_size macros for type casting and obtaining element size. This implementation removes the necessity of passing element size in min_heap_callbacks. Additionally, introduce the MIN_HEAP_PREALLOCATED macro for preallocating some elements. Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu Link: https://lkml.kernel.org/r/20240524152958.919343-5-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-25bcache: fix typoKuan-Wei Chiu1-1/+1
Replace 'utiility' with 'utility'. Link: https://lkml.kernel.org/r/20240524152958.919343-3-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-21block: Generalize chunk_sectors support as boundary supportJohn Garry1-1/+1
The purpose of the chunk_sectors limit is to ensure that a mergeble request fits within the boundary of the chunck_sector value. Such a feature will be useful for other request_queue boundary limits, so generalize the chunk_sectors merge code. This idea was proposed by Hannes Reinecke. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe5-6/+3
Merge in queue limits cleanups. * for-6.11/block-limits: block: move the raid_partial_stripes_expensive flag into the features field block: remove the discard_alignment flag block: move the misaligned flag into the features field block: renumber and rename the cache disabled flag block: fix spelling and grammar for in writeback_cache_control.rst block: remove the unused blk_bounce enum
2024-06-20block: move the raid_partial_stripes_expensive flag into the features fieldChristoph Hellwig2-3/+3
Move the raid_partial_stripes_expensive flags into the features field to reclaim a little bit of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240619154623.450048-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20block: remove the discard_alignment flagChristoph Hellwig3-3/+0
queue_limits.discard_alignment is never read except in the places where it is stacked into another limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240619154623.450048-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe7-202/+59
Merge in last round of queue limits changes from Christoph. * for-6.11/block-limits: (26 commits) block: move the bounce flag into the features field block: move the skip_tagset_quiesce flag to queue_limits block: move the pci_p2pdma flag to queue_limits block: move the zone_resetall flag to queue_limits block: move the zoned flag into the features field block: move the poll flag to queue_limits block: move the dax flag to queue_limits block: move the nowait flag to queue_limits block: move the synchronous flag to queue_limits block: move the stable_writes flag to queue_limits block: move the io_stat flag setting to queue_limits block: move the add_random flag to queue_limits block: move the nonrot flag to queue_limits block: move cache control settings out of queue->flags block: remove blk_flush_policy block: freeze the queue in queue_attr_store nbd: move setting the cache control flags to __nbd_set_size virtio_blk: remove virtblk_update_cache_mode loop: fold loop_update_rotational into loop_reconfigure_limits loop: also use the default block size from an underlying block device ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the zoned flag into the features fieldChristoph Hellwig3-7/+8
Move the zoned flags into the features field to reclaim a little bit of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-23-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the poll flag to queue_limitsChristoph Hellwig1-41/+13
Move the poll flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Stacking drivers are simplified in that they now can simply set the flag, and blk_stack_limits will clear it when the features is not supported by any of the underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-22-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the dax flag to queue_limitsChristoph Hellwig1-2/+2
Move the dax flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the nowait flag to queue_limitsChristoph Hellwig2-32/+4
Move the nowait flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Stacking drivers are simplified in that they now can simply set the flag, and blk_stack_limits will clear it when the features is not supported by any of the underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the stable_writes flag to queue_limitsChristoph Hellwig2-21/+4
Move the stable_writes flag into the queue_limits feature field so that it can be set atomically with the queue frozen. The flag is now inherited by blk_stack_limits, which greatly simplifies the code in dm, and fixed md which previously did not pass on the flag set on lower devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the io_stat flag setting to queue_limitsChristoph Hellwig3-16/+14
Move the io_stat flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Simplify md and dm to set the flag unconditionally instead of avoiding setting a simple flag for cases where it already is set by other means, which is a bit pointless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the add_random flag to queue_limitsChristoph Hellwig1-18/+0
Move the add_random flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Note that this also removes code from dm to clear the flag based on the underlying devices, which can't be reached as dm devices will always start out without the flag set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the nonrot flag to queue_limitsChristoph Hellwig3-27/+0
Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface. For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd). The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move cache control settings out of queue->flagsChristoph Hellwig3-40/+14
Move the cache control settings into the queue_limits so that the flags can be set atomically with the device queue frozen. Add new features and flags field for the driver set flags, and internal (usually sysfs-controlled) flags in the block layer. Note that we'll eventually remove enough field from queue_limits to bring it back to the previous size. The disable flag is inverted compared to the previous meaning, which means it now survives a rescan, similar to the max_sectors and max_discard_sectors user limits. The FLUSH and FUA flags are now inherited by blk_stack_limits, which simplified the code in dm a lot, but also causes a slight behavior change in that dm-switch and dm-unstripe now advertise a write cache despite setting num_flush_bios to 0. The I/O path will handle this gracefully, but as far as I can tell the lack of num_flush_bios and thus flush support is a pre-existing data integrity bug in those targets that really needs fixing, after which a non-zero num_flush_bios should be required in dm for targets that map to underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16dm: Remove unused macro DM_ZONE_INVALID_WP_OFSTDamien Le Moal1-2/+0
With the switch to using the zone append emulation of the block layer zone write plugging, the macro DM_ZONE_INVALID_WP_OFST is no longer used in dm-zone.c. Remove its definition. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16dm: Improve zone resource limits handlingDamien Le Moal1-30/+150
The generic stacking of limits implemented in the block layer cannot correctly handle stacking of zone resource limits (max open zones and max active zones) because these limits are for an entire device but the stacking may be for a portion of that device (e.g. a dm-linear target that does not cover an entire block device). As a result, when DM devices are created on top of zoned block devices, the DM device never has any zone resource limits advertized, which is only correct if all underlying target devices also have no zone resource limits. If at least one target device has resource limits, the user may see either performance issues (if the max open zone limit of the device is exceeded) or write I/O errors if the max active zone limit of one of the underlying target devices is exceeded. While it is very difficult to correctly and reliably stack zone resource limits in general, cases where targets are not sharing zone resources of the same device can be dealt with relatively easily. Such situation happens when a target maps all sequential zones of a zoned block device: for such mapping, other targets mapping other parts of the same zoned block device can only contain conventional zones and thus will not require any zone resource to correctly handle write operations. For a mapped device constructed with such targets, which includes mapped devices constructed with targets mapping entire zoned block devices, the zone resource limits can be reliably determined using the non-zero minimum of the zone resource limits of all targets. For mapped devices that include targets partially mapping the set of sequential write required zones of zoned block devices, instead of advertizing no zone resource limits, it is also better to set the mapped device limits to the non-zero minimum of the limits of all targets. In this case the limits for a target depend on the number of sequential zones being mapped: if this number of zone is larger than the limits, then the limits of the device apply and can be used. If on the other hand the target maps a number of zones smaller than the limits, then no limits is needed and we can assume that the target has no limits (limits set to 0). This commit improves zone resource limits handling as described above by modifying dm_set_zones_restrictions() to iterate the targets of a mapped device to evaluate the max open and max active zone limits. This relies on an internal "stacking" of the limits of the target devices combined with a direct counting of the number of sequential zones mapped by the targets. 1) For a target mapping an entire zoned block device, the limits for the target are set to the limits of the device. 2) For a target partially mapping a zoned block device, the number of mapped sequential zones is used to determine the limits: if the target maps more sequential write required zones than the device limits, then the limits of the device are used as-is. If the number of mapped sequential zones is lower than the limits, then we assume that the target has no limits (limits set to 0). As this evaluation is done for each target, the zone resource limits for the mapped device are evaluated as the non-zero minimum of the limits of all the targets. For configurations resulting in unreliable limits, i.e. a table containing a target partially mapping a zoned device, a warning message is issued. The counting of mapped sequential zones for the target is done using the new function dm_device_count_zones() which performs a report zones on the entire block device with the callback dm_device_count_zones_cb(). This count of mapped sequential zones is also used to determine if the mapped device contains only conventional zones. This allows simplifying dm_set_zones_restrictions() to not do a report zones just for this. For mapped devices mapping only conventional zones, as before, the mapped device is changed to a regular device by setting its zoned limit to false and clearing all its zone related limits. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16dm: Call dm_revalidate_zones() after setting the queue limitsDamien Le Moal3-19/+22
dm_revalidate_zones() is called from dm_set_zone_restrictions() when the mapped device queue limits are not yet set. However, dm_revalidate_zones() calls blk_revalidate_disk_zones() and this function consults and modifies the mapped device queue limits. Thus, currently, blk_revalidate_disk_zones() operates on limits that are not yet initialized. Fix this by moving the call to dm_revalidate_zones() out of dm_set_zone_restrictions() and into dm_table_set_restrictions() after executing queue_limits_set(). To further cleanup dm_set_zones_restrictions(), the message about the type of zone append (native or emulated) is also moved inside dm_revalidate_zones(). Fixes: 1c0e720228ad ("dm: use queue_limits_set") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe10-240/+79
Pull in block limits branch, which exists as a shared branch for both the block and SCSI tree. * for-6.11/block-limits: (26 commits) block: move integrity information into queue_limits block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags block: bypass the STABLE_WRITES flag for protection information block: don't require stable pages for non-PI metadata block: use kstrtoul in flag_store block: factor out flag_{store,show} helper for integrity block: remove the blk_flush_integrity call in blk_integrity_unregister block: remove the blk_integrity_profile structure dm-integrity: use the nop integrity profile md/raid1: don't free conf on raid0_run failure md/raid0: don't free conf on raid0_run failure block: initialize integrity buffer to zero before writing it to media block: add special APIs for run-time disabling of discard and friends block: remove unused queue limits API sr: convert to the atomic queue limits API sd: convert to the atomic queue limits API sd: cleanup zoned queue limits initialization sd: factor out a sd_discard_mode helper sd: simplify the disable case in sd_config_discard sd: add a sd_disable_write_same helper ...
2024-06-14block: move integrity information into queue_limitsChristoph Hellwig9-218/+77
Move the integrity information into the queue limits so that it can be set atomically with other queue limits, and that the sysfs changes to the read_verify and write_generate flags are properly synchronized. This also allows to provide a more useful helper to stack the integrity fields, although it still is separate from the main stacking function as not all stackable devices want to inherit the integrity settings. Even with that it greatly simplifies the code in md and dm. Note that the integrity field is moved as-is into the queue limits. While there are good arguments for removing the separate blk_integrity structure, this would cause a lot of churn and might better be done at a later time if desired. However the integrity field in the queue_limits structure is now unconditional so that various ifdefs can be avoided or replaced with IS_ENABLED(). Given that tiny size of it that seems like a worthwhile trade off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: remove the blk_integrity_profile structureChristoph Hellwig1-1/+1
Block layer integrity configuration is a bit complex right now, as it indirects through operation vectors for a simple two-dimensional configuration: a) the checksum type of none, ip checksum, crc, crc64 b) the presence or absence of a reference tag Remove the integrity profile, and instead add a separate csum_type flag which replaces the existing ip-checksum field and a new flag that indicates the presence of the reference tag. This removes up to two layers of indirect calls, remove the need to offload the no-op verification of non-PI metadata to a workqueue and generally simplifies the code. The downside is that block/t10-pi.c now has to be built into the kernel when CONFIG_BLK_DEV_INTEGRITY is supported. Given that both nvme and SCSI require t10-pi.ko, it is loaded for all usual configurations that enabled CONFIG_BLK_DEV_INTEGRITY already, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14dm-integrity: use the nop integrity profileChristoph Hellwig2-22/+2
Use the block layer built-in nop profile instead of duplicating it. Tested by: $ dd if=/dev/urandom of=key.bin bs=512 count=1 $ cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 \ --integrity-no-wipe /dev/nvme0n1 key.bin $ cryptsetup luksOpen /dev/nvme0n1 luks-integrity --key-file key.bin and then doing mkfs.xfs and simple I/O on the mount file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Milan Broz <gmazyland@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14md/raid1: don't free conf on raid0_run failureChristoph Hellwig1-11/+3
The core md code calls the ->free method which already frees conf. Fixes: 07f1a6850c5d ("md/raid1: fail run raid1 array when active disk less than one") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14md/raid0: don't free conf on raid0_run failureChristoph Hellwig1-16/+5
The core md code calls the ->free method which already frees conf. Fixes: 0c031fd37f69 ("md: Move alloc/free acct bioset in to personality") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240613084839.1044015-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12md/raid5: avoid BUG_ON() while continue reshape after reassemblingYu Kuai1-7/+13
Currently, mdadm support --revert-reshape to abort the reshape while reassembling, as the test 07revert-grow. However, following BUG_ON() can be triggerred by the test: kernel BUG at drivers/md/raid5.c:6278! invalid opcode: 0000 [#1] PREEMPT SMP PTI irq event stamp: 158985 CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94 RIP: 0010:reshape_request+0x3f1/0xe60 Call Trace: <TASK> raid5_sync_request+0x43d/0x550 md_do_sync+0xb7a/0x2110 md_thread+0x294/0x2b0 kthread+0x147/0x1c0 ret_from_fork+0x59/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> Root cause is that --revert-reshape update the raid_disks from 5 to 4, while reshape position is still set, and after reassembling the array, reshape position will be read from super block, then during reshape the checking of 'writepos' that is caculated by old reshape position will fail. Fix this panic the easy way first, by converting the BUG_ON() to WARN_ON(), and stop the reshape if checkings fail. Noted that mdadm must fix --revert-shape as well, and probably md/raid should enhance metadata validation as well, however this means reassemble will fail and there must be user tools to fix the wrong metadata. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com
2024-06-12md: pass in max_sectors for pers->sync_request()Yu Kuai5-14/+10
For different sync_action, sync_thread will use different max_sectors, see details in md_sync_max_sectors(), currently both md_do_sync() and pers->sync_request() in eatch iteration have to get the same max_sectors. Hence pass in max_sectors for pers->sync_request() to prevent redundant code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-12-yukuai1@huaweicloud.com
2024-06-12md: factor out helpers for different sync_action in md_do_sync()Yu Kuai1-50/+73
Make code cleaner by replacing if else if with switch, and it's more obvious now what is doing for each sync_action. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-11-yukuai1@huaweicloud.com
2024-06-12md: replace last_sync_action with new enum typeYu Kuai3-9/+9
The only difference is that "none" is removed and initial last_sync_action will be idle. On the one hand, this value is introduced by commit c4a395514516 ("MD: Remember the last sync operation that was performed"), and the usage described in commit message is not affected. On the other hand, last_sync_action is not used in mdadm or mdmon, and none of the tests that I can find. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-10-yukuai1@huaweicloud.com
2024-06-12md: use new helpers in md_do_sync()Yu Kuai2-17/+6
Make code cleaner. and also use the action_name directly in kernel log: - "check" instead of "data-check" - "repair" instead of "requested-resync" Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-9-yukuai1@huaweicloud.com
2024-06-12md: don't fail action_store() if sync_thread is not registeredYu Kuai2-54/+33
MD_RECOVERY_RUNNING will always be set when trying to register a new sync_thread, however, if md_start_sync() turns out to do nothing, MD_RECOVERY_RUNNING will be cleared in this case. And during the race window, action_store() will return -EBUSY, which will cause some mdadm tests to fail. For example: The test 07reshape5intr will add a new disk to array, then start reshape: mdadm /dev/md0 --add /dev/xxx mdadm --grow /dev/md0 -n 3 And add_bound_rdev() from mdadm --add will set MD_RECOVERY_NEEDED, then during the race windown, mdadm --grow will fail. Fix the problem by waiting in action_store() during the race window, fail only if sync_thread is registered. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-8-yukuai1@huaweicloud.com
2024-06-12md: remove parameter check_seq for stop_sync_thread()Yu Kuai1-15/+11
Caller will always set MD_RECOVERY_FROZEN if check_seq is true, and always clear MD_RECOVERY_FROZEN if check_seq is false, hence replace the parameter with test_bit() to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-7-yukuai1@huaweicloud.com
2024-06-12md: replace sysfs api sync_action with new helpersYu Kuai1-42/+52
To get rid of extrem long if else if usage, and make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-6-yukuai1@huaweicloud.com
2024-06-12md: factor out helper to start reshape from action_store()Yu Kuai1-24/+41
There are no functional changes, just to make code cleaner and prepare for following refactor. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-5-yukuai1@huaweicloud.com
2024-06-12md: add new helpers for sync_actionYu Kuai2-0/+82
The new helpers will get current sync_action of the array, will be used in later patches to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-4-yukuai1@huaweicloud.com
2024-06-12md: add a new enum type sync_actionYu Kuai1-1/+56
In order to make code related to sync_thread cleaner in following patches, also add detail comment about each sync action. And also prepare to remove the related recovery_flags in the fulture. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-3-yukuai1@huaweicloud.com
2024-06-12md: rearrange recovery_flagsYu Kuai1-14/+38
Currently there are lots of flags with the same confusing prefix "MD_REOCVERY_", and there are two main types of flags, sync thread runnng status, I prefer prefix "SYNC_THREAD_", and sync thread action, I perfer prefix "SYNC_ACTION_". For now, rearrange and update comment to improve code readability, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-2-yukuai1@huaweicloud.com
2024-06-12md/md-bitmap: fix writing non bitmap pagesOfir Gal1-3/+3
__write_sb_page() rounds up the io size to the optimal io size if it doesn't exceed the data offset, but it doesn't check the final size exceeds the bitmap length. For example: page count - 1 page size - 4K data offset - 1M optimal io size - 256K The final io size would be 256K (64 pages) but md_bitmap_storage_alloc() allocated 1 page, the IO would write 1 valid page and 63 pages that happens to be allocated afterwards. This leaks memory to the raid device superblock. This issue caused a data transfer failure in nvme-tcp. The network drivers checks the first page of an IO with sendpage_ok(), it returns true if the page isn't a slabpage and refcount >= 1. If the page !sendpage_ok() the network driver disables MSG_SPLICE_PAGES. As of now the network layer assumes all the pages of the IO are sendpage_ok() when MSG_SPLICE_PAGES is on. The bitmap pages aren't slab pages, the first page of the IO is sendpage_ok(), but the additional pages that happens to be allocated after the bitmap pages might be !sendpage_ok(). That cause skb_splice_from_iter() to stop the data transfer, in the case below it hangs 'mdadm --create'. The bug is reproducible, in order to reproduce we need nvme-over-tcp controllers with optimal IO size bigger than PAGE_SIZE. Creating a raid with bitmap over those devices reproduces the bug. In order to simulate large optimal IO size you can use dm-stripe with a single device. Script to reproduce the issue on top of brd devices using dm-stripe is attached below (will be added to blktest). I have added some logs to test the theory: ... md: created bitmap (1 pages) for device md127 __write_sb_page before md_super_write offset: 16, size: 262144. pfn: 0x53ee === __write_sb_page before md_super_write. logging pages === pfn: 0x53ee, slab: 0 <-- the only page that allocated for the bitmap pfn: 0x53ef, slab: 1 pfn: 0x53f0, slab: 0 pfn: 0x53f1, slab: 0 pfn: 0x53f2, slab: 0 pfn: 0x53f3, slab: 1 ... nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0 skbuff: before sendpage_ok() - pfn: 0x53ee skbuff: before sendpage_ok() - pfn: 0x53ef WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450 skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1 ... Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ofir Gal <ofir.gal@volumez.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240607072748.3182199-1-ofir.gal@volumez.com