| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit e9b004ff83067cdf96774b45aea4b239ace99a2f ]
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:
- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered
syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.
wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.
Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.
Reported-by: syzbot+71fcf20f7c1e5043d78c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71fcf20f7c1e5043d78c
Fixes: 41afaeeda509 ("blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://patch.msgid.link/20260316070358.65225-2-ytohnuki@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 23308af722fefed00af5f238024c11710938fba3 ]
Add the missing put_disk() on the error path in
blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or
blkg_tryget() fails, the function jumps to the out label which only
calls rcu_read_unlock() but does not release the disk reference acquired
by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk
is already set to NULL before the lookup, blkcg_exit() cannot release
this reference either, causing the disk to never be freed.
Restore the reference release that was present as blk_put_queue() in the
original code but was inadvertently dropped during the conversion from
request_queue to gendisk.
Fixes: f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2a2f520fda824b5a25c93f2249578ea150c24e06 ]
When blk_revalidate_disk_zones() fails after disk_revalidate_zone_resources()
has allocated args.zones_cond, the memory is leaked because no error path
frees it.
Fixes: 6e945ffb6555 ("block: use zone condition to determine conventional zones")
Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Link: https://patch.msgid.link/20260331111216.24242-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 267ec4d7223a783f029a980f41b93c39b17996da ]
When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following
sequence occurs:
1. disk_force_media_change() sets GD_NEED_PART_SCAN
2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
3. loop_global_unlock() releases the lock
4. loop_reread_partitions() calls bdev_disk_changed() to scan
There is a race between steps 2 and 4: when udev receives the uevent
and opens the device before loop_reread_partitions() runs,
blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls
bdev_disk_changed() for a first scan. Then loop_reread_partitions()
does a second scan. The open_mutex serializes these two scans, but
does not prevent both from running.
The second scan in bdev_disk_changed() drops all partition devices
from the first scan (via blk_drop_partitions()) before re-adding
them, causing partition block devices to briefly disappear. This
breaks any systemd unit with BindsTo= on the partition device: systemd
observes the device going dead, fails the dependent units, and does
not retry them when the device reappears.
Fix this by removing the GD_NEED_PART_SCAN set from
disk_force_media_change() entirely. None of the current callers need
the lazy on-open partition scan triggered by this flag:
- floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always
false and GD_NEED_PART_SCAN has no effect.
- loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is
set, loop_reread_partitions() performs an explicit scan. When not
set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
- loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if
LO_FLAGS_PARTSCAN is set.
- nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately
after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.
With GD_NEED_PART_SCAN no longer set by disk_force_media_change(),
udev opening the loop device after the uevent no longer triggers a
redundant scan in blkdev_get_whole(), and only the single explicit
scan from loop_reread_partitions() runs.
A regression test for this bug has been submitted to blktests:
https://github.com/linux-blktests/blktests/pull/240.
Fixes: 9f65c489b68d ("loop: raise media_change event")
Signed-off-by: Daan De Meyer <daan@amutable.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3dbaacf6ab68f81e3375fe769a2ecdbd3ce386fd ]
When a queue is shared across disk rebind (e.g., SCSI unbind/bind), the
previous disk's blkcg state is cleaned up asynchronously via
disk_release() -> blkcg_exit_disk(). If the new disk's blkcg_init_disk()
runs before that cleanup finishes, we may overwrite q->root_blkg while
the old one is still alive, and radix_tree_insert() in blkg_create()
fails with -EEXIST because the old blkg entries still occupy the same
queue id slot in blkcg->blkg_tree. This causes the sd probe to fail
with -ENOMEM.
Fix it by waiting in blkcg_init_disk() for root_blkg to become NULL,
which indicates the previous disk's blkcg cleanup has completed.
Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler")
Cc: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260311032837.2368714-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 212ec34e4e726e8cd4af7bea4740db24de8a9dab upstream.
This passthrough helper currently only supports discards. Part of that
command is the start and length, which is read from the SQE. It does
so on every invocation, where it really should just make it stable
on the first invocation. This avoids needing to copy the SQE upfront,
as we only really need those two 8b values stored in our per-req
payload.
Cc: stable@vger.kernel.org # 6.17+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b7d4ffb510373cc6ecf16022dd0e510a023034fb upstream.
Commit 7b295187287e ("block: Do not remove zone write plugs still in
use") modified disk_should_remove_zone_wplug() to add a check on the
reference count of a zone write plug to prevent removing zone write
plugs from a disk hash table when the plugs are still being referenced
by BIOs or requests in-flight. However, this check does not take into
account that a BIO completion may happen right after its submission by
a zone write plug BIO work, and before the zone write plug BIO work
releases the zone write plug reference count. This situation leads to
disk_should_remove_zone_wplug() returning false as in this case the zone
write plug reference count is at least equal to 3. If the BIO that
completes in such manner transitioned the zone to the FULL condition,
the zone write plug for the FULL zone will remain in the disk hash
table.
Furthermore, relying on a particular value of a zone write plug
reference count to set the BLK_ZONE_WPLUG_UNHASHED flag is fragile as
reading the atomic reference count and doing a comparison with some
value is not overall atomic at all.
Address these issues by reworking the reference counting of zone write
plugs so that removing plugs from a disk hash table can be done
directly from disk_put_zone_wplug() when the last reference on a plug
is dropped.
To do so, replace the function disk_remove_zone_wplug() with
disk_mark_zone_wplug_dead(). This new function sets the zone write plug
flag BLK_ZONE_WPLUG_DEAD (which replaces BLK_ZONE_WPLUG_UNHASHED) and
drops the initial reference on the zone write plug taken when the plug
was added to the disk hash table. This function is called either for
zones that are empty or full, or directly in the case of a forced plug
removal (e.g. when the disk hash table is being destroyed on disk
removal). With this change, disk_should_remove_zone_wplug() is also
removed.
disk_put_zone_wplug() is modified to call the function
disk_free_zone_wplug() to remove a zone write plug from a disk hash
table and free the plug structure (with a call_rcu()), when the last
reference on a zone write plug is dropped. disk_free_zone_wplug()
always checks that the BLK_ZONE_WPLUG_DEAD flag is set.
In order to avoid having multiple zone write plugs for the same zone in
the disk hash table, disk_get_and_lock_zone_wplug() checked for the
BLK_ZONE_WPLUG_UNHASHED flag. This check is removed and a check for
the new BLK_ZONE_WPLUG_DEAD flag is added to
blk_zone_wplug_handle_write(). With this change, we continue preventing
adding multiple zone write plugs for the same zone and at the same time
re-inforce checks on the user behavior by failing new incoming write
BIOs targeting a zone that is marked as dead. This case can happen only
if the user erroneously issues write BIOs to zones that are full, or to
zones that are currently being reset or finished.
Fixes: 7b295187287e ("block: Do not remove zone write plugs still in use")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 13920e4b7b784b40cf4519ff1f0f3e513476a499 upstream.
biovec_phys_mergeable() is used by the request merge, DMA mapping,
and integrity merge paths to decide if two physically contiguous
bvec segments can be coalesced into one. It currently has no check
for whether the segments belong to different dev_pagemaps.
When zone device memory is registered in multiple chunks, each chunk
gets its own dev_pagemap. A single bio can legitimately contain
bvecs from different pgmaps -- iov_iter_extract_bvecs() breaks at
pgmap boundaries but the outer loop in bio_iov_iter_get_pages()
continues filling the same bio. If such bvecs are physically
contiguous, biovec_phys_mergeable() will coalesce them, making it
impossible to recover the correct pgmap for the merged segment
via page_pgmap().
Add a zone_device_pages_have_same_pgmap() check to prevent merging
bvec segments that span different pgmaps.
Fixes: 49580e690755 ("block: add check when merging zone device pages")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Link: https://patch.msgid.link/20260410153414.4159050-2-namjain@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 41c665aae2b5dbecddddcc8ace344caf630cc7a4 upstream.
bio_add_page() and bio_integrity_add_page() reject pages from different
dev_pagemaps entirely, returning 0 even when those pages have compatible
DMA mapping requirements. This forces callers to start a new bio when
buffers span pgmap boundaries, even though the pages could safely coexist
as separate bvec entries.
This matters for guests where memory is registered through
devm_memremap_pages() with MEMORY_DEVICE_GENERIC in multiple calls,
creating separate dev_pagemaps for each chunk. When a direct I/O buffer
spans two such chunks, bio_add_page() rejects the second page, forcing an
unnecessary bio split or I/O failure.
Introduce zone_device_pages_compatible() in blk.h to check whether two
pages can coexist in the same bio as separate bvec entries. The block DMA
iterator (blk_dma_map_iter_start) caches the P2PDMA mapping state from the
first segment and applies it to all others, so P2PDMA pages from different
pgmaps must not be mixed, and neither must P2PDMA and non-P2PDMA pages.
All other combinations (MEMORY_DEVICE_GENERIC pages from different pgmaps,
or MEMORY_DEVICE_GENERIC with normal RAM) use the same dma_map_phys path
and are safe.
Replace the blanket zone_device_pages_have_same_pgmap() rejection with
zone_device_pages_compatible(), while keeping
zone_device_pages_have_same_pgmap() as a merge guard.
Pages from different pgmaps can be added as separate bvec entries but
must not be coalesced into the same segment, as that would make
it impossible to recover the correct pgmap via page_pgmap().
Fixes: 49580e690755 ("block: add check when merging zone device pages")
Cc: stable@vger.kernel.org
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260410153414.4159050-3-namjain@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
disk_zone_wplug_schedule_bio_work()
commit 0a8b8af896e0ef83e188e1fe20f98f2bbb1c2459 upstream.
The function disk_zone_wplug_schedule_bio_work() always takes a
reference on the zone write plug of the BIO work being scheduled. This
ensures that the zone write plug cannot be freed while the BIO work is
being scheduled but has not run yet. However, this unconditional
reference taking is fragile since the reference taken is released by the
BIO work blk_zone_wplug_bio_work() function, which implies that there
always must be a 1:1 relation between the work being scheduled and the
work running.
Make sure to drop the reference taken when scheduling the BIO work if
the work is already scheduled, that is, when queue_work() returns false.
Fixes: 9e78c38ab30b ("block: Hold a reference on zone write plugs to schedule submission")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Improve quirk visibility and configurability (Maurizio)
- Fix runtime user modification to queue setup (Keith)
- Fix multipath leak on try_module_get failure (Keith)
- Ignore ambiguous spec definitions for better atomics support
(John)
- Fix admin queue leak on controller reset (Ming)
- Fix large allocation in persistent reservation read keys
(Sungwoo Kim)
- Fix fcloop callback handling (Justin)
- Securely free DHCHAP secrets (Daniel)
- Various cleanups and typo fixes (John, Wilfred)
- Avoid a circular lock dependency issue in the sysfs nr_requests or
scheduler store handling
- Fix a circular lock dependency with the pcpu mutex and the queue
freeze lock
- Cleanup for bio_copy_kern(), using __bio_add_page() rather than the
bio_add_page(), as adding a page here cannot fail. The exiting code
had broken cleanup for the error condition, so make it clear that the
error condition cannot happen
- Fix for a __this_cpu_read() in preemptible context splat
* tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: use trylock to avoid lockdep circular dependency in sysfs
nvme: fix memory allocation in nvme_pr_read_keys()
block: use __bio_add_page in bio_copy_kern
block: break pcpu_alloc_mutex dependency on freeze_lock
blktrace: fix __this_cpu_read/write in preemptible context
nvme-multipath: fix leak on try_module_get failure
nvmet-fcloop: Check remoteport port_state before calling done callback
nvme-pci: do not try to add queue maps at runtime
nvme-pci: cap queue creation to used queues
nvme-pci: ensure we're polling a polled queue
nvme: fix memory leak in quirks_param_set()
nvme: correct comment about nvme_ns_remove()
nvme: stop setting namespace gendisk device driver data
nvme: add support for dynamic quirk configuration via module parameter
nvme: fix admin queue leak on controller reset
nvme-fabrics: use kfree_sensitive() for DHCHAP secrets
nvme: stop using AWUPF
nvme: expose active quirks in sysfs
nvme/host: fixup some typos
|
|
Use trylock instead of blocking lock acquisition for update_nr_hwq_lock
in queue_requests_store() and elv_iosched_store() to avoid circular lock
dependency with kernfs active reference during concurrent disk deletion:
update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
kn->active -> update_nr_hwq_lock (via sysfs write path)
Return -EBUSY when the lock is not immediately available.
Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAHj4cs-em-4acsHabMdT=jJhXkCzjnprD-aQH1OgrZo4nTnmMw@mail.gmail.com/
Fixes: 626ff4f8ebcb ("blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since the bio is allocated with the exact number of pages needed via
blk_rq_map_bio_alloc(), and the loop iterates exactly that many times,
bio_add_page() cannot fail due to insufficient space. Switch to
__bio_add_page() and remove the dead error handling code.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
after ->freeze_lock is acquired or queue is frozen. This potentially
creates a circular dependency involving ->fs_reclaim if reclaim is
triggered simultaneously in a code path which first acquires ->pcpu_
alloc_mutex. As the queue is already frozen while nr_hw_queue update
allocates tagsets, the reclaim can't forward progress and thus it could
cause a potential deadlock as reported in lockdep splat[1].
Fix this by pre-allocating tagset tags before we freeze queue during
nr_hw_queue update. Later the allocated tagset tags could be safely
installed and used after queue is frozen.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
[axboe: fix brace style issue]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull more block updates from Jens Axboe:
- Fix partial IOVA mapping cleanup in error handling
- Minor prep series ignoring discard return value, as
the inline value is always known
- Ensure BLK_FEAT_STABLE_WRITES is set for drbd
- Fix leak of folio in bio_iov_iter_bounce_read()
- Allow IOC_PR_READ_* for read-only open
- Another debugfs deadlock fix
- A few doc updates
* tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
blk-mq: use NOIO context to prevent deadlock during debugfs creation
blk-stat: convert struct blk_stat_callback to kernel-doc
block: fix enum descriptions kernel-doc
block: update docs for bio and bvec_iter
block: change return type to void
nvmet: ignore discard return value
md: ignore discard return value
block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova
block: fix folio leak in bio_iov_iter_bounce_read()
block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ
drbd: always set BLK_FEAT_STABLE_WRITES
|
|
Creating debugfs entries can trigger fs reclaim, which can enter back
into the block layer request_queue. This can cause deadlock if the
queue is frozen.
Previously, a WARN_ON_ONCE check was used in debugfs_create_files()
to detect this condition, but it was racy since the queue can be frozen
from another context at any time.
Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine
the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs
reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave()
and blk_debugfs_unlock_nomemrestore() variants for callers that don't
need NOIO protection (e.g., debugfs removal or read-only operations).
Replace all raw debugfs_mutex lock/unlock pairs with these helpers,
using the _nomemsave/_nomemrestore variants where appropriate.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Most of struct blk_stat_callback documentation is already in kernel-doc
format. Convert the remaining struct members to kernel-doc to avoid
kernel-doc warnings:
Warning: block/blk-stat.h:62 struct member 'list' not described
in 'blk_stat_callback'
Warning: block/blk-stat.h:62 struct member 'timer_fn' not described
in 'blk_stat_callback'
Warning: block/blk-stat.h:62 struct member 'rcu' not described
in 'blk_stat_callback'
Warning: block/blk-stat.h:133 No description found for return value of
'blk_stat_is_active'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "powerpc/64s: do not re-activate batched TLB flush" makes
arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev)
It adds a generic enter/leave layer and switches architectures to use
it. Various hacks were removed in the process.
- "zram: introduce compressed data writeback" implements data
compression for zram writeback (Richard Chang and Sergey Senozhatsky)
- "mm: folio_zero_user: clear page ranges" adds clearing of contiguous
page ranges for hugepages. Large improvements during demand faulting
are demonstrated (David Hildenbrand)
- "memcg cleanups" tidies up some memcg code (Chen Ridong)
- "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos
stats" improves DAMOS stat's provided information, deterministic
control, and readability (SeongJae Park)
- "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few
issues in the hugetlb cgroup charging selftests (Li Wang)
- "Fix va_high_addr_switch.sh test failure - again" addresses several
issues in the va_high_addr_switch test (Chunyu Hu)
- "mm/damon/tests/core-kunit: extend existing test scenarios" improves
the KUnit test coverage for DAMON (Shu Anzai)
- "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a
glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to
transiently return -EAGAIN (Shivank Garg)
- "arch, mm: consolidate hugetlb early reservation" reworks and
consolidates a pile of straggly code related to reservation of
hugetlb memory from bootmem and creation of CMA areas for hugetlb
(Mike Rapoport)
- "mm: clean up anon_vma implementation" cleans up the anon_vma
implementation in various ways (Lorenzo Stoakes)
- "tweaks for __alloc_pages_slowpath()" does a little streamlining of
the page allocator's slowpath code (Vlastimil Babka)
- "memcg: separate private and public ID namespaces" cleans up the
memcg ID code and prevents the internal-only private IDs from being
exposed to userspace (Shakeel Butt)
- "mm: hugetlb: allocate frozen gigantic folio" cleans up the
allocation of frozen folios and avoids some atomic refcount
operations (Kefeng Wang)
- "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement
of memory betewwn the active and inactive LRUs and adds auto-tuning
of the ratio-based quotas and of monitoring intervals (SeongJae Park)
- "Support page table check on PowerPC" makes
CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan)
- "nodemask: align nodes_and{,not} with underlying bitmap ops" makes
nodes_and() and nodes_andnot() propagate the return values from the
underlying bit operations, enabling some cleanup in calling code
(Yury Norov)
- "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up
some DAMON internal interfaces (SeongJae Park)
- "mm/khugepaged: cleanups and scan limit fix" does some cleanup work
in khupaged and fixes a scan limit accounting issue (Shivank Garg)
- "mm: balloon infrastructure cleanups" goes to town on the balloon
infrastructure and its page migration function. Mainly cleanups, also
some locking simplification (David Hildenbrand)
- "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds
additional tracepoints to the page reclaim code (Jiayuan Chen)
- "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is
part of Marco's kernel-wide migration from the legacy workqueue APIs
over to the preferred unbound workqueues (Marco Crivellari)
- "Various mm kselftests improvements/fixes" provides various unrelated
improvements/fixes for the mm kselftests (Kevin Brodsky)
- "mm: accelerate gigantic folio allocation" greatly speeds up gigantic
folio allocation, mainly by avoiding unnecessary work in
pfn_range_valid_contig() (Kefeng Wang)
- "selftests/damon: improve leak detection and wss estimation
reliability" improves the reliability of two of the DAMON selftests
(SeongJae Park)
- "mm/damon: cleanup kdamond, damon_call(), damos filter and
DAMON_MIN_REGION" does some cleanup work in the core DAMON code
(SeongJae Park)
- "Docs/mm/damon: update intro, modules, maintainer profile, and misc"
performs maintenance work on the DAMON documentation (SeongJae Park)
- "mm: add and use vma_assert_stabilised() helper" refactors and cleans
up the core VMA code. The main aim here is to be able to use the mmap
write lock's lockdep state to perform various assertions regarding
the locking which the VMA code requires (Lorenzo Stoakes)
- "mm, swap: swap table phase II: unify swapin use" removes some old
swap code (swap cache bypassing and swap synchronization) which
wasn't working very well. Various other cleanups and simplifications
were made. The end result is a 20% speedup in one benchmark (Kairui
Song)
- "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM
available on 64-bit alpha, loongarch, mips, parisc, and um. Various
cleanups were performed along the way (Qi Zheng)
* tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits)
mm/memory: handle non-split locks correctly in zap_empty_pte_table()
mm: move pte table reclaim code to memory.c
mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE
mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config
um: mm: enable MMU_GATHER_RCU_TABLE_FREE
parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles
zsmalloc: make common caches global
mm: add SPDX id lines to some mm source files
mm/zswap: use %pe to print error pointers
mm/vmscan: use %pe to print error pointers
mm/readahead: fix typo in comment
mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file()
mm: refactor vma_map_pages to use vm_insert_pages
mm/damon: unify address range representation with damon_addr_range
mm/cma: replace snprintf with strscpy in cma_new_area
...
|
|
Now that all the callers of __blkdev_issue_discard() have been changed
to ignore its return value, change its return type from int to void.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When dma_iova_link() fails partway through mapping a request's bvec
list, the function breaks out of the loop without cleaning up
already mapped segments. Similarly, if dma_iova_sync() fails after
linking all segments, no cleanup is performed.
This leaves partial IOVA mappings in place. The completion path
attempts to unmap the full expected size via dma_iova_destroy() or
nvme_unmap_data(), but only a partial size was actually mapped,
leading to incorrect unmap operations.
Add an out_unlink error path that calls dma_iova_destroy() to clean
up partial mappings before returning failure. The dma_iova_destroy()
function handles both partial unlink and IOVA space freeing. It
correctly handles the mapped_len == 0 case (first dma_iova_link()
failure) by only freeing the IOVA allocation without attempting to
unmap.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If iov_iter_extract_bvecs() returns an error or zero bytes extracted,
then the folio allocated is leaked on return. Ensure it's put before
returning.
Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The recently added IOC_PR_READ_* ioctls require the same BLK_OPEN_WRITE
permission as the older persistent reservation ioctls. This has the
drawback that udev triggers when the file descriptor is closed,
resulting in unnecessary activity like scanning partitions even though
these read-only ioctls do not modify the device.
Change IOC_PR_READ_KEYS and IOC_PR_READ_RESERVATION to require
BLK_OPEN_READ. This prevents unnecessary activity every time `blkpr
--read-keys` or `blkpr --read-reservation` is invoked by shell scripts,
for example.
It is safe to reduce the permission requirement from BLK_OPEN_WRITE to
BLK_OPEN_READ since these two ioctls do not modify the persistent
reservation state. Userspace cannot use the information fetched by these
ioctls to make changes to the device unless it later opens the device
with BLK_OPEN_WRITE.
Fixes: 3e2cb9ee76c2 ("block: add IOC_PR_READ_RESERVATION ioctl")
Fixes: 22a1ffea5f80 ("block: add IOC_PR_READ_KEYS ioctl")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
Suggested-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"The kthread code provides an infrastructure which manages the
preferred affinity of unbound kthreads (node or custom cpumask)
against housekeeping (CPU isolation) constraints and CPU hotplug
events.
One crucial missing piece is the handling of cpuset: when an isolated
partition is created, deleted, or its CPUs updated, all the unbound
kthreads in the top cpuset become indifferently affine to _all_ the
non-isolated CPUs, possibly breaking their preferred affinity along
the way.
Solve this with performing the kthreads affinity update from cpuset to
the kthreads consolidated relevant code instead so that preferred
affinities are honoured and applied against the updated cpuset
isolated partitions.
The dispatch of the new isolated cpumasks to timers, workqueues and
kthreads is performed by housekeeping, as per the nice Tejun's
suggestion.
As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
from boot defined domain isolation (through isolcpus=) and cpuset
isolated partitions. Housekeeping cpumasks are now modifiable with a
specific RCU based synchronization. A big step toward making
nohz_full= also mutable through cpuset in the future"
* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
doc: Add housekeeping documentation
kthread: Document kthread_affine_preferred()
kthread: Comment on the purpose and placement of kthread_affine_node() call
kthread: Honour kthreads preferred affinity after cpuset changes
sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
kthread: Include kthreadd to the managed affinity list
kthread: Include unbound kthreads in the managed affinity list
kthread: Refine naming of affinity related fields
PCI: Remove superfluous HK_TYPE_WQ check
sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
cpuset: Remove cpuset_cpu_is_isolated()
timers/migration: Remove superfluous cpuset isolation test
cpuset: Propagate cpuset isolation update to timers through housekeeping
cpuset: Propagate cpuset isolation update to workqueue through housekeeping
PCI: Flush PCI probe workqueue on cpuset isolated partition change
sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
sched/isolation: Flush memcg workqueues on cpuset isolated partition change
cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull bounce buffer dio for stable pages from Jens Axboe:
"This adds support for bounce buffering of dio for stable pages. This
was all done by Christoph. In his words:
This series tries to address the problem that under I/O pages can be
modified during direct I/O, even when the device or file system
require stable pages during I/O to calculate checksums, parity or data
operations. It does so by adding block layer helpers to bounce buffer
an iov_iter into a bio, then wires that up in iomap and ultimately
XFS.
The reason that the file system even needs to know about it, is
because reads need a user context to copy the data back, and the
infrastructure to defer ioends to a workqueue currently sits in XFS.
I'm going to look into moving that into ioend and enabling it for
other file systems. Additionally btrfs already has it's own
infrastructure for this, and actually an urgent need to bounce buffer,
so this should be useful there and could be wire up easily. In fact
the idea comes from patches by Qu that did this in btrfs.
This patch fixes all but one xfstests failures on T10 PI capable
devices (generic/095 seems to have issues with a mix of mmap and
splice still, I'm looking into that separately), and make qemu VMs
running Windows, or Linux with swap enabled fine on an XFS file on a
device using PI.
Performance numbers on my (not exactly state of the art) NVMe PI test
setup:
Sequential reads using io_uring, QD=16.
Bandwidth and CPU usage (usr/sys):
| size | zero copy | bounce |
+------+--------------------------+--------------------------+
| 4k | 1316MiB/s (12.65/55.40%) | 1081MiB/s (11.76/49.78%) |
| 64K | 3370MiB/s ( 5.46/18.20%) | 3365MiB/s ( 4.47/15.68%) |
| 1M | 3401MiB/s ( 0.76/23.05%) | 3400MiB/s ( 0.80/09.06%) |
+------+--------------------------+--------------------------+
Sequential writes using io_uring, QD=16.
Bandwidth and CPU usage (usr/sys):
| size | zero copy | bounce |
+------+--------------------------+--------------------------+
| 4k | 882MiB/s (11.83/33.88%) | 750MiB/s (10.53/34.08%) |
| 64K | 2009MiB/s ( 7.33/15.80%) | 2007MiB/s ( 7.47/24.71%) |
| 1M | 1992MiB/s ( 7.26/ 9.13%) | 1992MiB/s ( 9.21/19.11%) |
+------+--------------------------+--------------------------+
Note that the 64k read numbers look really odd to me for the baseline
zero copy case, but are reproducible over many repeated runs.
The bounce read numbers should further improve when moving the PI
validation to the file system and removing the double context switch,
which I have patches for that will sent out soon"
* tag 'for-7.0/block-stable-pages-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
xfs: use bounce buffering direct I/O when the device requires stable pages
iomap: add a flag to bounce buffer direct I/O
iomap: support ioends for direct reads
iomap: rename IOMAP_DIO_DIRTY to IOMAP_DIO_USER_BACKED
iomap: free the bio before completing the dio
iomap: share code between iomap_dio_bio_end_io and iomap_finish_ioend_direct
iomap: split out the per-bio logic from iomap_dio_bio_iter
iomap: simplify iomap_dio_bio_iter
iomap: fix submission side handling of completion side errors
block: add helpers to bounce buffer an iov_iter into bios
block: remove bio_release_page
iov_iter: extract a iov_iter_extract_bvecs helper from bio code
block: open code bio_add_page and fix handling of mismatching P2P ranges
block: refactor get_contig_folio_len
block: add a BIO_MAX_SIZE constant and use it
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Support for batch request processing for ublk, improving the
efficiency of the kernel/ublk server communication. This can yield
nice 7-12% performance improvements
- Support for integrity data for ublk
- Various other ublk improvements and additions, including a ton of
selftests additions and updated
- Move the handling of blk-crypto software fallback from below the
block layer to above it. This reduces the complexity of dealing with
bio splitting
- Series fixing a number of potential deadlocks in blk-mq related to
the queue usage counter and writeback throttling and rq-qos debugfs
handling
- Add an async_depth queue attribute, to resolve a performance
regression that's been around for a qhilw related to the scheduler
depth handling
- Only use task_work for IOPOLL completions on NVMe, if it is necessary
to do so. An earlier fix for an issue resulted in all these
completions being punted to task_work, to guarantee that completions
were only run for a given io_uring ring when it was local to that
ring. With the new changes, we can detect if it's necessary to use
task_work or not, and avoid it if possible.
- rnbd fixes:
- Fix refcount underflow in device unmap path
- Handle PREFLUSH and NOUNMAP flags properly in protocol
- Fix server-side bi_size for special IOs
- Zero response buffer before use
- Fix trace format for flags
- Add .release to rnbd_dev_ktype
- MD pull requests via Yu Kuai
- Fix raid5_run() to return error when log_init() fails
- Fix IO hang with degraded array with llbitmap
- Fix percpu_ref not resurrected on suspend timeout in llbitmap
- Fix GPF in write_page caused by resize race
- Fix NULL pointer dereference in process_metadata_update
- Fix hang when stopping arrays with metadata through dm-raid
- Fix any_working flag handling in raid10_sync_request
- Refactor sync/recovery code path, improve error handling for
badblocks, and remove unused recovery_disabled field
- Consolidate mddev boolean fields into mddev_flags
- Use mempool to allocate stripe_request_ctx and make sure
max_sectors is not less than io_opt in raid5
- Fix return value of mddev_trylock
- Fix memory leak in raid1_run()
- Add Li Nan as mdraid reviewer
- Move phys_vec definitions to the kernel types, mostly in preparation
for some VFIO and RDMA changes
- Improve the speed for secure erase for some devices
- Various little rust updates
- Various other minor fixes, improvements, and cleanups
* tag 'for-7.0/block-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits)
blk-mq: ABI/sysfs-block: fix docs build warnings
selftests: ublk: organize test directories by test ID
block: decouple secure erase size limit from discard size limit
block: remove redundant kill_bdev() call in set_blocksize()
blk-mq: add documentation for new queue attribute async_dpeth
block, bfq: convert to use request_queue->async_depth
mq-deadline: covert to use request_queue->async_depth
kyber: covert to use request_queue->async_depth
blk-mq: add a new queue sysfs attribute async_depth
blk-mq: factor out a helper blk_mq_limit_depth()
blk-mq-sched: unify elevators checking for async requests
block: convert nr_requests to unsigned int
block: don't use strcpy to copy blockdev name
blk-mq-debugfs: warn about possible deadlock
blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
blk-mq-debugfs: remove blk_mq_debugfs_unregister_rqos()
blk-mq-debugfs: make blk_mq_debugfs_register_rqos() static
blk-rq-qos: fix possible debugfs_mutex deadlock
blk-mq-debugfs: factor out a helper to register debugfs for all rq_qos
blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter
...
|
|
Pull xfs updates from Carlos Maiolino:
"This contains several improvements to zoned device support,
performance improvements for the parent pointers, and a new health
monitoring feature. There are some improvements in the journaling code
too but no behavior change expected.
Last but not least, some code refactoring and bug fixes are also
included in this series"
* tag 'xfs-merge-7.0' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (67 commits)
xfs: add sysfs stats for zoned GC
xfs: give the defer_relog stat a xs_ prefix
xfs: add zone reset error injection
xfs: refactor zone reset handling
xfs: don't mark all discard issued by zoned GC as sync
xfs: allow setting errortags at mount time
xfs: use WRITE_ONCE/READ_ONCE for m_errortag
xfs: move the guts of XFS_ERRORTAG_DELAY out of line
xfs: don't validate error tags in the I/O path
xfs: allocate m_errortag early
xfs: fix the errno sign for the xfs_errortag_{add,clearall} stubs
xfs: validate log record version against superblock log version
xfs: fix spacing style issues in xfs_alloc.c
xfs: remove xfs_zone_gc_space_available
xfs: use a seprate member to track space availabe in the GC scatch buffer
xfs: check for deleted cursors when revalidating two btrees
xfs: fix UAF in xchk_btree_check_block_owner
xfs: check return value of xchk_scrub_create_subord
xfs: only call xf{array,blob}_destroy if we have a valid pointer
xfs: get rid of the xchk_xfile_*_descr calls
...
|
|
Secure erase should use max_secure_erase_sectors instead of being limited
by max_discard_sectors. Separate the handling of REQ_OP_SECURE_ERASE from
REQ_OP_DISCARD to allow each operation to use its own size limit.
Signed-off-by: Luke Wang <ziniu.wang_1@nxp.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The second kill_bdev() call in set_blocksize() is redundant as the first
call already clears all buffers and pagecache, and locks prevent new
pagecache creation between the calls.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The default limits is unchanged, and user can configure async_depth now.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In downstream kernel, we test with mq-deadline with many fio workloads, and
we found a performance regression after commit 39823b47bbd4
("block/mq-deadline: Fix the tag reservation code") with following test:
[global]
rw=randread
direct=1
ramp_time=1
ioengine=libaio
iodepth=1024
numjobs=24
bs=1024k
group_reporting=1
runtime=60
[job1]
filename=/dev/sda
Root cause is that mq-deadline now support configuring async_depth,
although the default value is nr_request, however the minimal value is
1, hence min_shallow_depth is set to 1, causing wake_batch to be 1. For
consequence, sbitmap_queue will be waken up after each IO instead of
8 IO.
In this test case, sda is HDD and max_sectors is 128k, hence each
submitted 1M io will be splited into 8 sequential 128k requests, however
due to there are 24 jobs and total tags are exhausted, the 8 requests are
unlikely to be dispatched sequentially, and changing wake_batch to 1
will make this much worse, accounting blktrace D stage, the percentage
of sequential io is decreased from 8% to 0.8%.
Fix this problem by converting to request_queue->async_depth, where
min_shallow_depth is set each time async_depth is updated.
Noted elevator attribute async_depth is now removed, queue attribute
with the same name is used instead.
Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of the internal async_depth, remove kqd->async_depth and related
helpers.
Noted elevator attribute async_depth is now removed, queue attribute
with the same name is used instead.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new field async_depth to request_queue and related APIs, this is
currently not used, following patches will convert elevators to use
this instead of internal async_depth.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are no functional changes, just make code cleaner.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq and mq-deadline consider sync writes as async requests and only
reserve tags for sync reads by async_depth, however, kyber doesn't
consider sync writes as async requests for now.
Consider the case there are lots of dirty pages, and user use fsync to
flush dirty pages. In this case sched_tags can be exhausted by sync writes
and sync reads can stuck waiting for tag. Hence let kyber follow what
mq-deadline and bfq did, and unify async requests checking for all
elevators.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The block subsystem prevents running the workqueue to isolated CPUs,
including those defined by cpuset isolated partitions. Since
HK_TYPE_DOMAIN will soon contain both and be subject to runtime
modifications, synchronize against housekeeping using the relevant lock.
For full support of cpuset changes, the block subsystem may need to
propagate changes to isolated cpumask through the workqueue in the
future.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-block@vger.kernel.org
|
|
Creating new debugfs entries can trigger fs reclaim, hence we can't do
this with queue frozen, meanwhile, other locks that can be held while
queue is frozen should not be held as well.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In blk_mq_update_nr_hw_queues(), debugfs_mutex is not held while
creating debugfs entries for hctxs. Hence add debugfs_mutex there,
it's safe because queue is not frozen.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Because this helper is only used by iocost and iolatency, while they
don't have debugfs entries.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Because it's only used inside blk-mq-debugfs.c now.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently rq-qos debugfs entries are created from rq_qos_add(), while
rq_qos_add() can be called while queue is still frozen. This can
deadlock because creating new entries can trigger fs reclaim.
Fix this problem by delaying creating rq-qos debugfs entries after queue
is unfrozen.
- For wbt, 1) it can be initialized by default, fix it by calling new
helper after wbt_init() from wbt_init_enable_default(); 2) it can be
initialized by sysfs, fix it by calling new helper after queue is
unfrozen from wbt_set_lat().
- For iocost and iolatency, they can only be initialized by blkcg
configuration, however, they don't have debugfs entries for now, hence
they are not handled yet.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is already a helper blk_mq_debugfs_register_rqos() to register
one rqos, however this helper is called synchronously when the rqos is
created with queue frozen.
Prepare to fix possible deadlock to create blk-mq debugfs entries while
queue is still frozen.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If wbt is disabled by default and user configures wbt by sysfs, queue
will be frozen first and then pcpu_alloc_mutex will be held in
blk_stat_alloc_callback().
Fix this problem by allocating memory first before queue frozen.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To move implementation details inside blk-wbt.c, prepare to fix possible
deadlock to call wbt_init() while queue is frozen in the next patch.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The locking ranges count and the array items are always ignored unless
Single User Mode (SUM) is requested in the activate method.
It is useless to enforce limits of unused array in the non-SUM case.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Introduce the helper function bdev_rot() to test if a block device is a
rotational one. The existing function bdev_nonrot() which tests for the
opposite condition is redefined using this new helper.
This avoids the double negation (operator and name) that appears when
testing if a block device is a rotational device, thus making the code a
little easier to read.
Call sites of bdev_nonrot() in the block layer are updated to use this
new helper. Remaining users in other subsystems are left unchanged for
now.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To check if a request queue is for a rotational device, a double
negation is needed with the pattern "!blk_queue_nonrot(q)". Simplify
this with the introduction of the helper blk_queue_rot() which tests
if a requests queue limit has the BLK_FEAT_ROTATIONAL feature set.
All call sites of blk_queue_nonrot() are modified to use blk_queue_rot()
and blk_queue_nonrot() definition removed.
No functional changes.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|