| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Two rstat fixes:
- Out-of-bounds access in the css_rstat_updated() BPF kfunc when
called with an unchecked user-supplied cpu
- Over-strict NMI guard after the recent switch to try_cmpxchg left
sparc and ppc64 unable to queue rstat updates from NMI"
* tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: relax NMI guard after switch to try_cmpxchg
cgroup/rstat: validate cpu before css_rstat_cpu() access
|
|
The function disk_update_zone_resources() may call
disk_free_zone_resources() in case of error, and following this,
blk_revalidate_disk_zones() will again calls disk_free_zone_resources() if
disk_update_zone_resources() failed. If a zone worker thread is being used
(which is the default for a rotational media zoned device),
disk_free_zone_resources() will try to stop the zone worker thread twice
because disk->zone_wplugs_worker is not reset to NULL when the worker
thread is stopped the first time.
In disk_free_zone_resources(), fix this by correctly clearing
disk->zone_wplugs_worker to NULL when the worker thread is stopped.
And while at it, since disk_free_zone_resources() is always called after a
failed call to disk_update_zone_resources(), remove the unnecessary call
to disk_free_zone_resources() in disk_update_zone_resources().
Fixes: 1365b6904fd0 ("block: allow submitting all zone writes from a single context")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260522115622.588535-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When submitting a bio to blk-mq, if the task should sleep after peeking
a cached request, but before it pops it, the plug flushes and calls
blk_mq_free_plug_rqs, freeing the cached_rqs. This creates a
use-after-free bug. Fix this by popping the cached request before any
possible blocking calls if it is suitable for use.
Popping this request first holds a queue reference, so avoid any
serialization races with queue freezes and can safely proceed with
dispatching that request to the driver. This potentially increases a
timing window from when a driver wants to freeze its queue to when
requests stop being dispatched. That scenario is off the fast path
though, and drivers need to appropriately handle requests during a
freeze request anyway.
The downside is the popped element needs to be individually freed when
we performed a bio plug merge. The cached request would have had to be
freed later anyway, but this patch does it inline with building the plug
list instead of after flushing it.
Fixes: b0077e269f6c1 ("blk-mq: make sure active queue usage is held for bio_integrity_prep()")
Fixes: 7b4f36cd22a65 ("block: ensure we hold a queue reference when using queue limits")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260521190253.242065-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bio_integrity_verify() expects the passed struct bvec_iter to be an
iterator over bio data, not integrity. So construct a separate data
bvec_iter without the bio_integrity_bytes() conversion and pass it to
bio_integrity_verify() instead of bip_iter.
Fixes: 0bde8a12b554 ("block: add fs_bio_integrity helpers")
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260513182924.1753582-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
css_rstat_updated() is exposed as a BPF kfunc and accepts a
caller-provided cpu argument. The function uses cpu for per-cpu rstat
lookups without checking whether it refers to a valid possible CPU.
A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an
invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu ==
0x7fffffff triggers:
UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9
index 2147483647 is out of range for type 'long unsigned int [64]'
Call Trace:
css_rstat_updated
bpf_iter_run_prog
cgroup_iter_seq_show
bpf_seq_read
Add cpu validation to the BPF-facing css_rstat_updated() kfunc and
move the common implementation to __css_rstat_updated() for in-kernel
callers.
Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat")
Signed-off-by: Qing Ming <a0yami@mailbox.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Just like for the extract user pages path, we need to align down the
size to the supported boundary.
Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260507050153.1298375-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When bouncing for block size > PAGE_SIZE file systems that require
file system block size alignment (e.g. zoned XFS), the bio needs to
be big enough to fit an entire block.
Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260507050153.1298375-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Shin'ichiro reported hard to reproduce unaligned write errors with zoned
block devices. Under normal operation conditions (e.g. running XFS on an
SMR disk), these errors are nearly impossible to trigger. But using a
"slow" kernel with many debug options enables and some specific use
cases (e.g. fio zbd test case 46), the errors can be reproduced fairly
easily.
The unaligned write errors come from mishandling a valid reference
counting pattern of zone write plugs. Such pattern triggers for instance
if a process A writes a zone (not necessarilly to the full state),
another process B immediately resets the zone and immediately following
the completion of the zone reset, starts issuing writes to the zone.
With such pattern, in some cases, the zone write plugs worker thread of
the device may still be holding a reference to the zone write plug of
the zone taken when process A was writing to the zone. The following
zone reset from process B marks the zone as dead but does not remove the
zone write plug from the device hash table as a reference to the plug
still exist. Once process B starts issuing new writes, the zone write
plug is seen as dead and the writes from process B are immediately
failed, despite this write pattern being perfectly legal.
Fix this by allowing restoring a dead zone write plug to a live state if
a write is issued to the zone when the zone is: marked as dead, empty
and the write sector corresponds to the first sector of the zone (that
is, the write is aligned to the zone write pointer). This is done with
the new helper function disk_check_zone_wplug_dead(), which restores a
dead zone write plug to a live state by clearing the BLK_ZONE_WPLUG_DEAD
flag and restoring the initial reference to the zone write plug taken
when the plug was added to the device hash table.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: b7d4ffb51037 ("block: fix zone write plug removal")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://patch.msgid.link/20260513111129.108809-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
pin_user_pages_fast() can partially succeed and return the number of
pages that were actually pinned. However, the bio_integrity_map_user()
does not handle this partial pinning. This leads to a general protection
fault since bvec_from_pages() dereferences an unpinned page address,
which is 0.
To fix this, add a check to verify that all requested memory is pinned.
If partial pinning occurs, unpin the memory and return -EFAULT.
Kernel Oops:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 UID: 0 PID: 1061 Comm: nvme-passthroug Not tainted 7.0.0-11783-g90957f9314e8-dirty #16 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
RIP: 0010:bio_integrity_map_user.cold+0x1b0/0x9d6
Fixes: 492c5d455969 ("block: bio-integrity: directly map user buffers")
Acked-by: Chao Shi <cshi008@fiu.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Acked-by: Dave Tian <daveti@purdue.edu>
Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://github.com/linux-blktests/blktests/pull/244
Link: https://patch.msgid.link/20260512050929.541397-2-iam@sung-woo.kim
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_insert_cloned_request() already recomputes nr_phys_segments
against the bottom queue, because "the queue settings related to
segment counting may differ from the original queue." The exact same
reasoning applies to integrity segments: a stacked driver's underlying
queue can have tighter virt_boundary_mask, seg_boundary_mask, or
max_segment_size than the top queue, in which case
blk_rq_count_integrity_sg() against the bottom queue produces a
different count than the cached rq->nr_integrity_segments inherited
from the source request by blk_rq_prep_clone().
When the cached count is lower than the bottom queue's actual count,
blk_rq_map_integrity_sg() trips
BUG_ON(segments > rq->nr_integrity_segments);
on dispatch. The same families of stacked setups that motivated the
existing nr_phys_segments recompute -- dm-multipath fanning out to
nvme-rdma in particular -- can produce this.
Mirror the nr_phys_segments handling: when the request carries
integrity, recompute nr_integrity_segments against the bottom queue
and reject the request if it exceeds the bottom queue's
max_integrity_segments. blk_rq_count_integrity_sg() and
queue_max_integrity_segments() are both already available via
<linux/blk-integrity.h>, which blk-mq.c includes.
This closes a latent gap in the stacking contract and brings the
integrity-segment accounting in line with the existing
phys-segment accounting.
Fixes: 76c313f658d2 ("blk-integrity: improved sg segment mapping")
Signed-off-by: Casey Chen <cachen@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260511212230.27511-1-cachen@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bio_integrity_add_page() already sets bip_vcnt to 1 for the bounce
segment. Overwriting it with nr_vecs breaks bip_vcnt <= bip_max_vcnt
on WRITE (bip_max_vcnt is 1), so the gap-merge checks in block/blk.h
read past the bip_vec[] flex array. On READ the read is in bounds
but lands on a saved user bvec instead of the bounce.
The line was added for split propagation, but bio_integrity_clone()
doesn't copy bip_vcnt and BIP_CLONE_FLAGS excludes BIP_COPY_USER.
Fixes: 3991657ae707 ("block: set bip_vcnt correctly")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260511215151.346228-1-devnexen@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This passthrough helper currently only supports discards. Part of that
command is the start and length, which is read from the SQE. It does
so on every invocation, where it really should just make it stable
on the first invocation. This avoids needing to copy the SQE upfront,
as we only really need those two 8b values stored in our per-req
payload.
Cc: stable@vger.kernel.org # 6.17+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the caller is asking for a non-blocking allocation, we should not
further restrict the gfp mask, which just increases the likelihood
of failures.
Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20260415060813.807659-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: cgroups@vger.kernel.org
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20260223092920.60424-3-marco.crivellari@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20260223092920.60424-2-marco.crivellari@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bio_add_page() and bio_integrity_add_page() reject pages from different
dev_pagemaps entirely, returning 0 even when those pages have compatible
DMA mapping requirements. This forces callers to start a new bio when
buffers span pgmap boundaries, even though the pages could safely coexist
as separate bvec entries.
This matters for guests where memory is registered through
devm_memremap_pages() with MEMORY_DEVICE_GENERIC in multiple calls,
creating separate dev_pagemaps for each chunk. When a direct I/O buffer
spans two such chunks, bio_add_page() rejects the second page, forcing an
unnecessary bio split or I/O failure.
Introduce zone_device_pages_compatible() in blk.h to check whether two
pages can coexist in the same bio as separate bvec entries. The block DMA
iterator (blk_dma_map_iter_start) caches the P2PDMA mapping state from the
first segment and applies it to all others, so P2PDMA pages from different
pgmaps must not be mixed, and neither must P2PDMA and non-P2PDMA pages.
All other combinations (MEMORY_DEVICE_GENERIC pages from different pgmaps,
or MEMORY_DEVICE_GENERIC with normal RAM) use the same dma_map_phys path
and are safe.
Replace the blanket zone_device_pages_have_same_pgmap() rejection with
zone_device_pages_compatible(), while keeping
zone_device_pages_have_same_pgmap() as a merge guard.
Pages from different pgmaps can be added as separate bvec entries but
must not be coalesced into the same segment, as that would make
it impossible to recover the correct pgmap via page_pgmap().
Fixes: 49580e690755 ("block: add check when merging zone device pages")
Cc: stable@vger.kernel.org
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260410153414.4159050-3-namjain@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
biovec_phys_mergeable() is used by the request merge, DMA mapping,
and integrity merge paths to decide if two physically contiguous
bvec segments can be coalesced into one. It currently has no check
for whether the segments belong to different dev_pagemaps.
When zone device memory is registered in multiple chunks, each chunk
gets its own dev_pagemap. A single bio can legitimately contain
bvecs from different pgmaps -- iov_iter_extract_bvecs() breaks at
pgmap boundaries but the outer loop in bio_iov_iter_get_pages()
continues filling the same bio. If such bvecs are physically
contiguous, biovec_phys_mergeable() will coalesce them, making it
impossible to recover the correct pgmap for the merged segment
via page_pgmap().
Add a zone_device_pages_have_same_pgmap() check to prevent merging
bvec segments that span different pgmaps.
Fixes: 49580e690755 ("block: add check when merging zone device pages")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Link: https://patch.msgid.link/20260410153414.4159050-2-namjain@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
copies between kernel and userspace by matching registered buffer
PFNs at I/O time. Includes selftests.
- Refactor bio integrity to support filesystem initiated integrity
operations and arbitrary buffer alignment.
- Clean up bio allocation, splitting bio_alloc_bioset() into clear fast
and slow paths. Add bio_await() and bio_submit_or_kill() helpers,
unify synchronous bi_end_io callbacks.
- Fix zone write plug refcount handling and plug removal races. Add
support for serializing zone writes at QD=1 for rotational zoned
devices, yielding significant throughput improvements.
- Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET
command.
- Add io_uring passthrough (uring_cmd) support to the BSG layer.
- Replace pp_buf in partition scanning with struct seq_buf.
- zloop improvements and cleanups.
- drbd genl cleanup, switching to pre_doit/post_doit.
- NVMe pull request via Keith:
- Fabrics authentication updates
- Enhanced block queue limits support
- Workqueue usage updates
- A new write zeroes device quirk
- Tagset cleanup fix for loop device
- MD pull requests via Yu Kuai:
- Fix raid5 soft lockup in retry_aligned_read()
- Fix raid10 deadlock with check operation and nowait requests
- Fix raid1 overlapping writes on writemostly disks
- Fix sysfs deadlock on array_state=clear
- Proactive RAID-5 parity building with llbitmap, with
write_zeroes_unmap optimization for initial sync
- Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
version mismatch fallback
- Fix bcache use-after-free and uninitialized closure
- Validate raid5 journal metadata payload size
- Various cleanups
- Various other fixes, improvements, and cleanups
* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
block: refactor blkdev_zone_mgmt_ioctl
MAINTAINERS: update ublk driver maintainer email
Documentation: ublk: address review comments for SHMEM_ZC docs
ublk: allow buffer registration before device is started
ublk: replace xarray with IDA for shmem buffer index allocation
ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
ublk: verify all pages in multi-page bvec fall within registered range
ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
xfs: use bio_await in xfs_zone_gc_reset_sync
block: add a bio_submit_or_kill helper
block: factor out a bio_await helper
block: unify the synchronous bi_end_io callbacks
xfs: fix number of GC bvecs
selftests/ublk: add read-only buffer registration test
selftests/ublk: add filesystem fio verify test for shmem_zc
selftests/ublk: add hugetlbfs shmem_zc test for loop target
selftests/ublk: add shared memory zero-copy test
selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs buffer_head updates from Christian Brauner:
"This cleans up the mess that has accumulated over the years in
metadata buffer_head tracking for inodes.
It moves the tracking into dedicated structure in filesystem-private
part of the inode (so that we don't use private_list, private_data,
and private_lock in struct address_space), and also moves couple other
users of private_data and private_list so these are removed from
struct address_space saving 3 longs in struct inode for 99% of inodes"
* tag 'vfs-7.1-rc1.bh.metadata' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
fs: Drop i_private_list from address_space
fs: Drop mapping_metadata_bhs from address space
ext4: Track metadata bhs in fs-private inode part
minix: Track metadata bhs in fs-private inode part
udf: Track metadata bhs in fs-private inode part
fat: Track metadata bhs in fs-private inode part
bfs: Track metadata bhs in fs-private inode part
affs: Track metadata bhs in fs-private inode part
ext2: Track metadata bhs in fs-private inode part
fs: Provide functions for handling mapping_metadata_bhs directly
fs: Switch inode_has_buffers() to take mapping_metadata_bhs
fs: Make bhs point to mapping_metadata_bhs
fs: Move metadata bhs tracking to a separate struct
fs: Fold fsync_buffers_list() into sync_mapping_buffers()
fs: Drop osync_buffers_list()
kvm: Use private inode list instead of i_private_list
fs: Remove i_private_data
aio: Stop using i_private_data and i_private_lock
hugetlbfs: Stop using i_private_data
fs: Stop using i_private_data for metadata bh tracking
...
|
|
Split the zone reset case into a separate helper so that the conditional
locking goes away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/20260327090032.3722065-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Factor the common logic for the ioctl helpers to either submit a bio or
end if the process is being killed. As this is now the only user of
bio_await_chain, open code that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260407140538.633364-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new helper to wait for a bio and anything chained off it to
complete synchronously after submitting it. This factors common code out
of submit_bio_wait and bio_await_chain and will also be useful for
file system code and thus is exported.
Note that this will now set REQ_SYNC also for the bio_await case for
consistency. Nothing should look at the flag in the end_io handler,
but if something does having the flag set makes more sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260407140538.633364-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Put the bio in bio_await_chain after waiting for the completion, and
share the now identical callbacks between submit_bio_wait and
bio_await_chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260407140538.633364-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:
- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered
syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.
wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.
Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.
Reported-by: syzbot+71fcf20f7c1e5043d78c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71fcf20f7c1e5043d78c
Fixes: 41afaeeda509 ("blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://patch.msgid.link/20260316070358.65225-2-ytohnuki@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace sprintf() with sysfs_emit() in sysfs show functions.
sysfs_emit() is preferred for formatting sysfs output because it
provides safer bounds checking.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260402164958.894879-4-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a bio is allocated from the mempool with REQ_ALLOC_CACHE set and
later completed, bio_put() places it into the per-cpu bio_alloc_cache
via bio_put_percpu_cache() instead of freeing it back to the
mempool/slab. The slab allocation remains tracked by kmemleak, but the
only reference to the bio is through the percpu cache's free_list,
which kmemleak fails to trace through percpu memory. This causes
kmemleak to report the cached bios as unreferenced objects.
Use symmetric kmemleak_free()/kmemleak_alloc() calls to properly track
bios across percpu cache transitions:
- bio_put_percpu_cache: call kmemleak_free() when a bio enters the
cache, unregistering it from kmemleak tracking.
- bio_alloc_percpu_cache: call kmemleak_alloc() when a bio is taken
from the cache for reuse, re-registering it so that genuine leaks
of reused bios remain detectable.
- __bio_alloc_cache_prune: call kmemleak_alloc() before bio_free() so
that kmem_cache_free()'s internal kmemleak_free() has a matching
allocation to pair with.
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260326144058.2392319-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a disk is saturated, it is common for no IOs to complete within a
timer period. Currently, in this case, rq_wait_pct and missed_ppm are
calculated as 0, the iocost incorrectly interprets this as meeting QoS
targets and resets busy_level to 0.
This reset prevents busy_level from reaching the threshold (4) needed
to reduce vrate. On certain cloud storage, such as Azure Premium SSD,
we observed that iocost may fail to reduce vrate for tens of seconds
during saturation, failing to mitigate noisy neighbor issues.
Fix this by tracking the number of IO completions (nr_done) in a period.
If nr_done is 0 and there are lagging IOs, the saturation status is
unknown, so we keep busy_level unchanged.
The issue is consistently reproducible on Azure Standard_D8as_v5 (Dasv5)
VMs with 512GB Premium SSD (P20) using the script below. It was not
observed on GCP n2d VMs (with 100G pd-ssd and 1.5T local-ssd), and no
regressions were found with this patch. In this script, cgA performs
large IOs with iodepth=128, while cgB performs small IOs with iodepth=1
rate_iops=100 rw=randrw. With iocost enabled, we expect it to throttle
cgA, the submission latency (slat) of cgA should be significantly higher,
cgB can reach 200 IOPS and the completion latency (clat) should below.
BLK_DEVID="8:0"
MODEL="rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3566"
QOS="rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000"
echo "$BLK_DEVID ctrl=user model=linear $MODEL" > /sys/fs/cgroup/io.cost.model
echo "$BLK_DEVID enable=1 ctrl=user $QOS" > /sys/fs/cgroup/io.cost.qos
CG_A="/sys/fs/cgroup/cgA"
CG_B="/sys/fs/cgroup/cgB"
FILE_A="/path/to/sda/A.fio.testfile"
FILE_B="/path/to/sda/B.fio.testfile"
RESULT_DIR="./iocost_results_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$CG_A" "$CG_B" "$RESULT_DIR"
get_result() {
local file=$1
local label=$2
local results=$(jq -r '
.jobs[0].mixed |
( .iops | tonumber | round ) as $iops |
( .bw_bytes / 1024 / 1024 ) as $bps |
( .slat_ns.mean / 1000000 ) as $slat |
( .clat_ns.mean / 1000000 ) as $avg |
( .clat_ns.max / 1000000 ) as $max |
( .clat_ns.percentile["90.000000"] / 1000000 ) as $p90 |
( .clat_ns.percentile["99.000000"] / 1000000 ) as $p99 |
( .clat_ns.percentile["99.900000"] / 1000000 ) as $p999 |
( .clat_ns.percentile["99.990000"] / 1000000 ) as $p9999 |
"\($iops)|\($bps)|\($slat)|\($avg)|\($max)|\($p90)|\($p99)|\($p999)|\($p9999)"
' "$file")
IFS='|' read -r iops bps slat avg max p90 p99 p999 p9999 <<<"$results"
printf "%-8s %-6s %-7.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f %-8.2f\n" \
"$label" "$iops" "$bps" "$slat" "$avg" "$max" "$p90" "$p99" "$p999" "$p9999"
}
run_fio() {
local cg_path=$1
local filename=$2
local name=$3
local bs=$4
local qd=$5
local out=$6
shift 6
local extra=$@
(
pid=$(sh -c 'echo $PPID')
echo $pid >"${cg_path}/cgroup.procs"
fio --name="$name" --filename="$filename" --direct=1 --rw=randrw --rwmixread=50 \
--ioengine=libaio --bs="$bs" --iodepth="$qd" --size=4G --runtime=10 \
--time_based --group_reporting --unified_rw_reporting=mixed \
--output-format=json --output="$out" $extra >/dev/null 2>&1
) &
}
echo "Starting Test ..."
for bs_b in "4k" "32k" "256k"; do
echo "Running iteration: BS=$bs_b"
out_a="${RESULT_DIR}/cgA_1m.json"
out_b="${RESULT_DIR}/cgB_${bs_b}.json"
# cgA: Heavy background (BS 1MB, QD 128)
run_fio "$CG_A" "$FILE_A" "cgA" "1m" 128 "$out_a"
# cgB: Latency sensitive (Variable BS, QD 1, Read/Write IOPS limit 100)
run_fio "$CG_B" "$FILE_B" "cgB" "$bs_b" 1 "$out_b" "--rate_iops=100"
wait
SUMMARY_DATA+="$(get_result "$out_a" "cgA-1m")"$'\n'
SUMMARY_DATA+="$(get_result "$out_b" "cgB-$bs_b")"$'\n\n'
done
echo -e "\nFinal Results Summary:\n"
printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n" \
"" "" "" "slat" "clat" "clat" "clat" "clat" "clat" "clat"
printf "%-8s %-6s %-7s %-8s %-8s %-8s %-8s %-8s %-8s %-8s\n\n" \
"CGROUP" "IOPS" "MB/s" "avg(ms)" "avg(ms)" "max(ms)" "P90(ms)" "P99" "P99.9" "P99.99"
echo "$SUMMARY_DATA"
echo "Results saved in $RESULT_DIR"
Before:
slat clat clat clat clat clat clat
CGROUP IOPS MB/s avg(ms) avg(ms) max(ms) P90(ms) P99 P99.9 P99.99
cgA-1m 166 166.37 3.44 748.95 1298.29 977.27 1233.13 1300.23 1300.23
cgB-4k 5 0.02 0.02 181.74 761.32 742.39 759.17 759.17 759.17
cgA-1m 167 166.51 1.98 748.68 1549.41 809.50 1451.23 1551.89 1551.89
cgB-32k 6 0.18 0.02 169.98 761.76 742.39 759.17 759.17 759.17
cgA-1m 166 165.55 2.89 750.89 1540.37 851.44 1451.23 1535.12 1535.12
cgB-256k 5 1.30 0.02 191.35 759.51 750.78 759.17 759.17 759.17
After:
slat clat clat clat clat clat clat
CGROUP IOPS MB/s avg(ms) avg(ms) max(ms) P90(ms) P99 P99.9 P99.99
cgA-1m 162 162.48 6.14 749.69 850.02 826.28 834.67 843.06 851.44
cgB-4k 199 0.78 0.01 1.95 42.12 2.57 7.50 34.87 42.21
cgA-1m 146 146.20 6.83 833.04 908.68 893.39 901.78 910.16 910.16
cgB-32k 200 6.25 0.01 2.32 31.40 3.06 7.50 16.58 31.33
cgA-1m 110 110.46 9.04 1082.67 1197.91 1182.79 1199.57 1199.57 1199.57
cgB-256k 200 49.98 0.02 3.69 22.20 4.88 9.11 20.05 22.15
Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://patch.msgid.link/20260331100509.182882-1-wjl.linux@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add the missing put_disk() on the error path in
blkcg_maybe_throttle_current(). When blkcg lookup, blkg lookup, or
blkg_tryget() fails, the function jumps to the out label which only
calls rcu_read_unlock() but does not release the disk reference acquired
by blkcg_schedule_throttle() via get_device(). Since current->throttle_disk
is already set to NULL before the lookup, blkcg_exit() cannot release
this reference either, causing the disk to never be freed.
Restore the reference release that was present as blk_put_queue() in the
original code but was inadvertently dropped during the conversion from
request_queue to gendisk.
Fixes: f05837ed73d0 ("blk-cgroup: store a gendisk to throttle in struct task_struct")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260331085054.46857-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When blk_revalidate_disk_zones() fails after disk_revalidate_zone_resources()
has allocated args.zones_cond, the memory is leaked because no error path
frees it.
Fixes: 6e945ffb6555 ("block: use zone condition to determine conventional zones")
Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Link: https://patch.msgid.link/20260331111216.24242-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following
sequence occurs:
1. disk_force_media_change() sets GD_NEED_PART_SCAN
2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
3. loop_global_unlock() releases the lock
4. loop_reread_partitions() calls bdev_disk_changed() to scan
There is a race between steps 2 and 4: when udev receives the uevent
and opens the device before loop_reread_partitions() runs,
blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls
bdev_disk_changed() for a first scan. Then loop_reread_partitions()
does a second scan. The open_mutex serializes these two scans, but
does not prevent both from running.
The second scan in bdev_disk_changed() drops all partition devices
from the first scan (via blk_drop_partitions()) before re-adding
them, causing partition block devices to briefly disappear. This
breaks any systemd unit with BindsTo= on the partition device: systemd
observes the device going dead, fails the dependent units, and does
not retry them when the device reappears.
Fix this by removing the GD_NEED_PART_SCAN set from
disk_force_media_change() entirely. None of the current callers need
the lazy on-open partition scan triggered by this flag:
- floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always
false and GD_NEED_PART_SCAN has no effect.
- loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is
set, loop_reread_partitions() performs an explicit scan. When not
set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
- loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if
LO_FLAGS_PARTSCAN is set.
- nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately
after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.
With GD_NEED_PART_SCAN no longer set by disk_force_media_change(),
udev opening the loop device after the uevent no longer triggers a
redundant scan in blkdev_get_whole(), and only the single explicit
scan from loop_reread_partitions() runs.
A regression test for this bug has been submitted to blktests:
https://github.com/linux-blktests/blktests/pull/240.
Fixes: 9f65c489b68d ("loop: raise media_change event")
Signed-off-by: Daan De Meyer <daan@amutable.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20260331105130.1077599-1-daan@amutable.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The TCG Opal device could enter a state where no new session can be
created, blocking even Discovery or PSID reset. While a power cycle
or waiting for the timeout should work, there is another possibility
for recovery: using the Stack Reset command.
The Stack Reset command is defined in the TCG Storage Architecture Core
Specification and is mandatory for all Opal devices (see Section 3.3.6
of the Opal SSC specification).
This patch implements the Stack Reset command. Sending it should clear
all active sessions immediately, allowing subsequent commands to run
successfully. While it is a TCG transport layer command, the Linux
kernel implements only Opal ioctls, so it makes sense to use the
IOC_OPAL ioctl interface.
The Stack Reset takes no arguments; the response can be success or pending.
If the command reports a pending state, userspace can try to repeat it;
in this case, the code returns -EBUSY.
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Ondrej Kozina <okozina@redhat.com>
Link: https://patch.msgid.link/20260310095349.411287-1-gmazyland@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Nobody is calling mark_buffer_dirty_inode() with internal bdev inode and
it doesn't make sense for internal bdev inode to have any metadata
buffer heads. Just drop the pointless invalidate_inode_buffers() call
and consequently the whole bdev_evict_inode() because generic code takes
care of the rest.
CC: linux-block@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260326095354.16340-47-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
bio_alloc_bioset() first strips __GFP_DIRECT_RECLAIM from the optimistic
fast allocation attempt with try_alloc_gfp(). If that fast path fails,
the slowpath checks saved_gfp to decide whether blocking allocation is
allowed, but then still calls mempool_alloc() with the stripped gfp mask.
That can lead to a NULL bio pointer being passed into bio_init().
Fix the slowpath by using saved_gfp for the bio and bvec mempool
allocations.
Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: syzbot+09ddb593eea76a158f42@syzkaller.appspotmail.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/p01.gc6e9ad5845ad.ttca29g@ub.hpns
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for removing the strlcat API[1], replace the char *pp_buf
with a struct seq_buf, which tracks the current write position and
remaining space internally. This allows for:
- Direct use of seq_buf_printf() in place of snprintf()+strlcat()
pairs, eliminating local tmp buffers throughout.
- Adjacent strlcat() calls that build strings piece-by-piece
(e.g., strlcat("["); strlcat(name); strlcat("]")) to be collapsed
into single seq_buf_printf() calls.
- Simpler call sites: seq_buf_puts() takes only the buffer and string,
with no need to pass PAGE_SIZE at every call.
The backing buffer allocation is unchanged (__get_free_page), and the
output path uses seq_buf_str() to NUL-terminate before passing to
printk().
Link: https://github.com/KSPP/linux/issues/370 [1]
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Josh Law <objecting@objecting.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Josh Law <objecting@objecting.org>
Link: https://patch.msgid.link/20260321004840.work.670-kees@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add an io_uring command handler to the generic BSG layer. The new
.uring_cmd file operation validates io_uring features and delegates
handling to a per-queue bsg_uring_cmd_fn callback.
Extend bsg_register_queue() so transport drivers can register both
sg_io and io_uring command handlers.
Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260317072226.2598233-3-yangxiuwei@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The function bio_add_page() returns the number of bytes added to the
bio, and if that failed it should return 0.
However there is a special quirk, if a caller is passing a page with
length 0, that function will always return 0 but with different results:
- The page is added to the bio
If there is enough bvec slot or the folio can be merged with the last
bvec.
The return value 0 is just the length passed in, which is also 0.
- The page is not added to the bio
If the page is not mergeable with the last bvec, or there is no bvec
slot available.
The return value 0 means page is not added into the bio.
Unfortunately the caller is not able to distinguish the above two cases,
and will treat the 0 return value as page addition failure.
In that case, this can lead to the double releasing of the last page:
- By the bio cleanup
Which normally goes through every page of the bio, including the last
page which is added into the bio.
- By the caller
Which believes the page is not added into the bio, thus would manually
release the page.
I do not think anyone should call bio_add_folio()/bio_add_page() with zero
length, but idiots like me can still show up.
So add an extra WARN_ON_ONCE() check for zero length and rejects it
early to avoid double freeing.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Link: https://patch.msgid.link/bc2223c080f38d0b63f968f605c918181c840f40.1773734749.git.wqu@suse.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The blk_mq_hw_ctx_sysfs_entry structures are never modified,
mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-4-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The blk_crypto_attrs structures are never modified, mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-3-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The blk_ia_range_sysfs_entry structures are never modified,
mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-2-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The queue_sysfs_entry structures are never modified, mark them as const.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260316-b4-sysfs-const-attr-block-v1-1-a35d73b986b0@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bvec_free is only called by bio_free, so inline it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bio_alloc_bioset tries non-waiting slab allocations first for the bio and
bvec array, but does so in a somewhat convoluted way.
Restructure the function so that it first open codes these slab
allocations, and then falls back to the mempools with the original
gfp mask.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Only used in bio.c these days.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> -ck
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://patch.msgid.link/20260316161144.1607877-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A bio segment may have partial interval block data with the rest
continuing into the next segments because direct-io data payloads only
need to align in memory to the device's DMA limits.
At the same time, the protection information may also be split in
multiple segments. The most likely way that may happen is if two
requests merge, or if we're directly using the io_uring user metadata.
The generate/verify, however, only ever accessed the first bip_vec.
Further, it may be possible to unalign the protection fields from the
user space buffer, or if there are odd additional opaque bytes in front
or in back of the protection information metadata region.
Change up the iteration to allow spanning multiple segments. This patch
is mostly a re-write of the protection information handling to allow any
arbitrary alignments, so it's probably easier to review the end result
rather than the diff.
Many controllers are not able to handle interval data composed of
multiple segments when PI is used, so this patch introduces a new
integrity limit that a low level driver can set to notify that it is
capable, default to false. The nvme driver is the first one to enable it
in this patch. Everyone else will force DMA alignment to the logical
block size as before to ensure interval data is always aligned within a
single segment.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260313144701.1221652-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a queue is shared across disk rebind (e.g., SCSI unbind/bind), the
previous disk's blkcg state is cleaned up asynchronously via
disk_release() -> blkcg_exit_disk(). If the new disk's blkcg_init_disk()
runs before that cleanup finishes, we may overwrite q->root_blkg while
the old one is still alive, and radix_tree_insert() in blkg_create()
fails with -EEXIST because the old blkg entries still occupy the same
queue id slot in blkcg->blkg_tree. This causes the sd probe to fail
with -ENOMEM.
Fix it by waiting in blkcg_init_disk() for root_blkg to become NULL,
which indicates the previous disk's blkcg cleanup has completed.
Fixes: 1059699f87eb ("block: move blkcg initialization/destroy into disk allocation/release handler")
Cc: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260311032837.2368714-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_steal_bios() transfers bios from a request to a bio_list when the
request is requeued to a different queue. The NVMe multipath failover
path (nvme_failover_req) currently open-codes clearing of REQ_POLLED,
bi_cookie, and REQ_NOWAIT on each bio before calling blk_steal_bios().
Move these fixups into blk_steal_bios() itself so that any caller
automatically gets correct flag state when bios cross queue boundaries.
Simplify nvme_failover_req() accordingly.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260226031243.87200-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Bring in the shared branch with the block layer.
* 'for-7.1/block-integrity' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: pass a maxlen argument to bio_iov_iter_bounce
block: add fs_bio_integrity helpers
block: make max_integrity_io_size public
block: prepare generation / verification helpers for fs usage
block: add a bdev_has_integrity_csum helper
block: factor out a bio_integrity_setup_default helper
block: factor out a bio_integrity_action helper
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Merge in integrity changes which are also landing in the VFS tree as
dependencies for fs related changes.
* for-7.1/block-integrity:
block: pass a maxlen argument to bio_iov_iter_bounce
block: add fs_bio_integrity helpers
block: make max_integrity_io_size public
block: prepare generation / verification helpers for fs usage
block: add a bdev_has_integrity_csum helper
block: factor out a bio_integrity_setup_default helper
block: factor out a bio_integrity_action helper
|
|
Correct the comments that the cloned bio must be freed before the memory
pointed to by @bio_src->bi_io_vecs (is freed).
Christoph Hellwig contributed most the of the update wording.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|