Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 70904263512a74a3b8941dd9e6e515ca6fc57821 ]
We have seen rare IO stalls as follows:
* blk_mq_plug_issue_direct() is entered with an mq_list containing two
requests.
* For the first request, it sets last == false and enters the driver's
queue_rq callback.
* The driver queue_rq callback indirectly calls schedule() which calls
blk_flush_plug(). This may happen if the driver has the
BLK_MQ_F_BLOCKING flag set and is allowed to sleep in ->queue_rq.
* blk_flush_plug() handles the remaining request in the mq_list. mq_list
is now empty.
* The original call to queue_rq resumes (with last == false).
* The loop in blk_mq_plug_issue_direct() terminates because there are no
remaining requests in mq_list.
The IO is now stalled because the last request submitted to the driver
had last == false and there was no subsequent call to commit_rqs().
Fix this by returning early in blk_mq_flush_plug_list() if rq_count is 0
which it will be in the recursive case, rather than checking if the
mq_list is empty. At the same time, adjust one of the callers to skip
the mq_list empty check as it is not necessary.
Fixes: dc5fc361d891 ("block: attempt direct issue of plug list")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230714101106.3635611-1-ross.lagerwall@citrix.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 245165658e1c9f95c0fecfe02b9b1ebd30a1198a ]
After grabbing q->sysfs_lock, q->elevator may become NULL because of
elevator switch.
Fix the NULL dereference on q->elevator by checking it with lock.
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2fb48d88e77f29bf9d278f25bcfe82cf59a0e09b ]
When a device-mapper device is passing through the inline encryption
support of an underlying device, calls to blk_crypto_evict_key() take
the blk_crypto_profile::lock of the device-mapper device, then take the
blk_crypto_profile::lock of the underlying device (nested). This isn't
a real deadlock, but it causes a lockdep report because there is only
one lock class for all instances of this lock.
Lockdep subclasses don't really work here because the hierarchy of block
devices is dynamic and could have more than 2 levels.
Instead, register a dynamic lock class for each blk_crypto_profile, and
associate that with the lock.
This avoids false-positive lockdep reports like the following:
============================================
WARNING: possible recursive locking detected
6.4.0-rc5 #2 Not tainted
--------------------------------------------
fscryptctl/1421 is trying to acquire lock:
ffffff80829ca418 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0x44/0x1c0
but task is already holding lock:
ffffff8086b68ca8 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0xc8/0x1c0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&profile->lock);
lock(&profile->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
Fixes: 1b2628397058 ("block: Keyslot Manager for Inline Encryption")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230610061139.212085-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 7eb1e47696aa231b1a567846bbe3a1e1befe1854 upstream.
Making 'blk' sector_t (i.e. 64 bit if LBD support is active) fails the
'blk>0' test in the partition block loop if a value of (signed int) -1 is
used to mark the end of the partition block list.
Explicitly cast 'blk' to signed int to allow use of -1 to terminate the
partition block linked list.
Fixes: b6f3f28f604b ("block: add overflow checks for Amiga partition support")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Link: https://lore.kernel.org/r/024ce4fa-cc6d-50a2-9aae-3701d0ebf668@xenosoft.de
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Reviewed-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ad7c3b41e86b59943a903d23c7b037d820e6270c ]
After commit f382fb0bcef4 ("block: remove legacy IO schedulers"),
blkio.throttle.io_serviced and blkio.throttle.io_service_bytes become
the only stable io stats interface of cgroup v1, and these statistics
are done in the blk-throttle code. But the current code only counts the
bios that are actually throttled. When the user does not add the throttle
limit, the io stats for cgroup v1 has nothing. I fix it according to the
statistical method of v2, and made it count all ios accurately.
Fixes: a7b36ee6ba29 ("block: move blk-throtl fast path inline")
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230507170631.89607-1-hanjinke.666@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit b90ecc0379eb7bbe79337b0c7289390a98752646 upstream.
Currently, associating a loop device with a different file descriptor
does not increment its diskseq. This allows the following race
condition:
1. Program X opens a loop device
2. Program X gets the diskseq of the loop device.
3. Program X associates a file with the loop device.
4. Program X passes the loop device major, minor, and diskseq to
something.
5. Program X exits.
6. Program Y detaches the file from the loop device.
7. Program Y attaches a different file to the loop device.
8. The opener finally gets around to opening the loop device and checks
that the diskseq is what it expects it to be. Even though the
diskseq is the expected value, the result is that the opener is
accessing the wrong file.
From discussions with Christoph Hellwig, it appears that
disk_force_media_change() was supposed to call inc_diskseq(), but in
fact it does not. Adding a Fixes: tag to indicate this. Christoph's
Reported-by is because he stated that disk_force_media_change()
calls inc_diskseq(), which is what led me to discover that it should but
does not.
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Fixes: e6138dc12de9 ("block: add a helper to raise a media changed event")
Cc: stable@vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230607170837.1559-1-demi@invisiblethingslab.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b6f3f28f604ba3de4724ad82bea6adb1300c0b5f upstream.
The Amiga partition parser module uses signed int for partition sector
address and count, which will overflow for disks larger than 1 TB.
Use u64 as type for sector address and size to allow using disks up to
2 TB without LBD support, and disks larger than 2 TB with LBD. The RBD
format allows to specify disk sizes up to 2^128 bytes (though native
OS limitations reduce this somewhat, to max 2^68 bytes), so check for
u64 overflow carefully to protect against overflowing sector_t.
Bail out if sector addresses overflow 32 bits on kernels without LBD
support.
This bug was reported originally in 2012, and the fix was created by
the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been
discussed and reviewed on linux-m68k at that time but never officially
submitted (now resubmitted as patch 1 in this series).
This patch adds additional error checking and warning messages.
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Message-ID: <201206192146.09327.Martin@lichtvoll.de>
Cc: <stable@vger.kernel.org> # 5.2
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/r/20230620201725.7020-4-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fc3d092c6bb48d5865fec15ed5b333c12f36288c upstream.
The Amiga partition parser module uses signed int for partition sector
address and count, which will overflow for disks larger than 1 TB.
Use sector_t as type for sector address and size to allow using disks
up to 2 TB without LBD support, and disks larger than 2 TB with LBD.
This bug was reported originally in 2012, and the fix was created by
the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been
discussed and reviewed on linux-m68k at that time but never officially
submitted. This patch differs from Joanne's patch only in its use of
sector_t instead of unsigned int. No checking for overflows is done
(see patch 3 of this series for that).
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Message-ID: <201206192146.09327.Martin@lichtvoll.de>
Cc: <stable@vger.kernel.org> # 5.2
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Tested-by: Martin Steigerwald <Martin@lichtvoll.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620201725.7020-2-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 2293cae703cda162684ae966db6b1b4a11b5e88f ]
In case of real io scheduler, q->elevator is set, so blk_mq_run_hw_queue()
may just check if scheduler queue has request to dispatch, see
__blk_mq_sched_dispatch_requests(). Then IO hang may be caused because
all passthorugh requests may stay in sw queue.
And any passthrough request should have been inserted to hctx->dispatch
always.
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Fixes: d97217e7f024 ("blk-mq: don't queue plugged passthrough requests into scheduler")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230621132208.1142318-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dd7de3704af9989b780693d51eaea49a665bd9c2 ]
Commit 99d055b4fd4b ("block: remove per-disk debugfs files in
blk_unregister_queue") moves blk_trace_shutdown() from
blk_release_queue() to blk_unregister_queue(), this is safe if blktrace
is created through sysfs, however, there is a regression in corner
case.
blktrace can still be enabled after del_gendisk() through ioctl if
the disk is opened before del_gendisk(), and if blktrace is not shutdown
through ioctl before closing the disk, debugfs entries will be leaked.
Fix this problem by shutdown blktrace in disk_release(), this is safe
because blk_trace_remove() is reentrant.
Fixes: 99d055b4fd4b ("block: remove per-disk debugfs files in blk_unregister_queue")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4f1731df60f9033669f024d06ae26a6301260b55 ]
In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating
'wake_batch' is not atomic:
t1: t2:
_blk_mq_tag_busy blk_mq_tag_busy
inc active_queues
// assume 1->2
inc active_queues
// 2 -> 3
blk_mq_update_wake_batch
// calculate based on 3
blk_mq_update_wake_batch
/* calculate based on 2, while active_queues is actually 3. */
Fix this problem by protecting them wih 'tags->lock', this is not a hot
path, so performance should not be concerned. And now that all writers
are inside the lock, switch 'actives_queues' from atomic to unsigned
int.
Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3d2af77e31ade05ff7ccc3658c3635ec1bea0979 ]
When blkg_alloc() is called to allocate a blkcg_gq structure
with the associated blkg_iostat_set's, there are 2 fields within
blkg_iostat_set that requires proper initialization - blkg & sync.
The former field was introduced by commit 3b8cc6298724 ("blk-cgroup:
Optimize blkcg_rstat_flush()") while the later one was introduced by
commit f73316482977 ("blk-cgroup: reimplement basic IO stats using
cgroup rstat").
Unfortunately those fields in the blkg_iostat_set's are not properly
re-initialized when they are cleared in v1's blkcg_reset_stats(). This
can lead to a kernel panic due to NULL pointer access of the blkg
pointer. The missing initialization of sync is less problematic and
can be a problem in a debug kernel due to missing lockdep initialization.
Fix these problems by re-initializing them after memory clearing.
Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230606180724.2455066-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8d211554679d0b23702bd32ba04aeac0c1c4f660 ]
adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled
when unlock. DEADLOCK might happen if we have held other locks and disabled
IRQ before invoking it.
Fix it by using spin_lock_irqsave() instead, which can keep IRQ state
consistent with before when unlock.
================================
WARNING: inconsistent lock state
5.10.0-02758-g8e5f91fd772f #26 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes:
ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390
{IN-HARDIRQ-W} state was registered at:
__lock_acquire+0x3d7/0x1070
lock_acquire+0x197/0x4a0
__raw_spin_lock_irqsave
_raw_spin_lock_irqsave+0x3b/0x60
bfq_idle_slice_timer_body
bfq_idle_slice_timer+0x53/0x1d0
__run_hrtimer+0x477/0xa70
__hrtimer_run_queues+0x1c6/0x2d0
hrtimer_interrupt+0x302/0x9e0
local_apic_timer_interrupt
__sysvec_apic_timer_interrupt+0xfd/0x420
run_sysvec_on_irqstack_cond
sysvec_apic_timer_interrupt+0x46/0xa0
asm_sysvec_apic_timer_interrupt+0x12/0x20
irq event stamp: 837522
hardirqs last enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore
hardirqs last enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40
hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq
hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50
softirqs last enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec
softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&bfqd->lock);
<Interrupt>
lock(&bfqd->lock);
*** DEADLOCK ***
3 locks held by kworker/2:3/388:
#0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: process_one_work+0x742/0x13f0
#1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0
#2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
#2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390
stack backtrace:
CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: kthrotld blk_throtl_dispatch_work_fn
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x107/0x167
print_usage_bug
valid_state
mark_lock_irq.cold+0x32/0x3a
mark_lock+0x693/0xbc0
mark_held_locks+0x9e/0xe0
__trace_hardirqs_on_caller
lockdep_hardirqs_on_prepare.part.0+0x151/0x360
trace_hardirqs_on+0x5b/0x180
__raw_spin_unlock_irq
_raw_spin_unlock_irq+0x24/0x40
spin_unlock_irq
adjust_inuse_and_calc_cost+0x4fb/0x970
ioc_rqos_merge+0x277/0x740
__rq_qos_merge+0x62/0xb0
rq_qos_merge
bio_attempt_back_merge+0x12c/0x4a0
blk_mq_sched_try_merge+0x1b6/0x4d0
bfq_bio_merge+0x24a/0x390
__blk_mq_sched_bio_merge+0xa6/0x460
blk_mq_sched_bio_merge
blk_mq_submit_bio+0x2e7/0x1ee0
__submit_bio_noacct_mq+0x175/0x3b0
submit_bio_noacct+0x1fb/0x270
blk_throtl_dispatch_work_fn+0x1ef/0x2b0
process_one_work+0x83e/0x13f0
process_scheduled_works
worker_thread+0x7e3/0xd80
kthread+0x353/0x470
ret_from_fork+0x1f/0x30
Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Li Nan <linan122@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a13bd91be22318768d55470cbc0b0f4488ef9edf ]
commit 50e34d78815e ("block: disable the elevator int del_gendisk")
move rq_qos_exit() from disk_release() to del_gendisk(), this will
introduce some problems:
1) If rq_qos_add() is triggered by enabling iocost/iolatency through
cgroupfs, then it can concurrent with del_gendisk(), it's not safe to
write 'q->rq_qos' concurrently.
2) Activate cgroup policy that is relied on rq_qos will call
rq_qos_add() and blkcg_activate_policy(), and if rq_qos_exit() is
called in the middle, null-ptr-dereference will be triggered in
blkcg_activate_policy().
3) blkg_conf_open_bdev() can call blkdev_get_no_open() first to find the
disk, then if rq_qos_exit() from del_gendisk() is done before
rq_qos_add(), then memory will be leaked.
This patch add a new disk level mutex 'rq_qos_mutex':
1) The lock will protect rq_qos_exit() directly.
2) For wbt that doesn't relied on blk-cgroup, rq_qos_add() can only be
called from disk initialization for now because wbt can't be
destructed until rq_qos_exit(), so it's safe not to protect wbt for
now. Hoever, in case that rq_qos dynamically destruction is supported
in the furture, this patch also protect rq_qos_add() from wbt_init()
directly, this is enough because blk-sysfs already synchronize
writers with disk removal.
3) For iocost and iolatency, in order to synchronize disk removal and
cgroup configuration, the lock is held after blkdev_get_no_open()
from blkg_conf_open_bdev(), and is released in blkg_conf_exit().
In order to fix the above memory leak, disk_live() is checked after
holding the new lock.
Fixes: 50e34d78815e ("block: disable the elevator int del_gendisk")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230414084008.2085155-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d97217e7f024bbe9aa62aea070771234c2879358 ]
Passthrough requests should never be queued to the I/O scheduler,
as scheduling these opaque requests doesn't make sense, and I/O
schedulers might require req->bio to be always valid.
We never let passthrough requests insert into the scheduler before
commit 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging"),
restore this behavior even for passthrough requests issued under a plug.
[hch: use blk_mq_insert_requests for passthrough requests,
fix up the commit message and comments]
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAGS2=YosaYaUTEMU3uaf+y=8MqSrhL7sYsJn8EwbaM=76p_4Qg@mail.gmail.com/
Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230518053101.760632-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
When __blkcg_rstat_flush() is called from cgroup_rstat_flush*() code
path, interrupt is always disabled.
When we start to flush blkcg per-cpu stats list in __blkg_release()
for avoiding to leak blkcg_gq's reference in commit 20cb1c2fb756
("blk-cgroup: Flush stats before releasing blkcg_gq"), local irq
isn't disabled yet, then lockdep warning may be triggered because
the dependent cgroup locks may be acquired from irq(soft irq) handler.
Fix the issue by disabling local irq always.
Fixes: 20cb1c2fb756 ("blk-cgroup: Flush stats before releasing blkcg_gq")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/pz2wzwnmn5tk3pwpskmjhli6g3qly7eoknilb26of376c7kwxy@qydzpvt6zpis/T/#u
Cc: stable@vger.kernel.org
Cc: Jay Shin <jaeshin@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20230622084249.1208005-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As noted by Michal, the blkg_iostat_set's in the lockless list hold
reference to blkg's to protect against their removal. Those blkg's
hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which
is called when the blkcg reference count reaches 0. This circular
dependency will prevent blkcg and some blkgs from being freed after
they are made offline.
It is less a problem if the cgroup to be destroyed also has other
controllers like memory that will call cgroup_rstat_flush() which will
clean up the reference count. If block is the only controller that uses
rstat, these offline blkcg and blkgs may never be freed leaking more
and more memory over time.
To prevent this potential memory leak:
- flush blkcg per-cpu stats list in __blkg_release(), when no new stat
can be added
- add global blkg_stat_lock for covering concurrent parent blkg stat
update
- don't grab bio->bi_blkg reference when adding the stats into blkcg's
per-cpu stat list since all stats are guaranteed to be consumed before
releasing blkg instance, and grabbing blkg reference for stats was the
most fragile part of original patch
Based on Waiman's patch:
https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@redhat.com/
Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Cc: stable@vger.kernel.org
Reported-by: Jay Shin <jaeshin@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: mkoutny@suse.com
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230609234249.1412858-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The nr_active counter continues to increase over time which causes the
blk_mq_get_tag to hang until the thread is rescheduled to a different
core despite there are still tags available.
kernel-stack
INFO: task inboundIOReacto:3014879 blocked for more than 2 seconds
Not tainted 6.1.15-amd64 #1 Debian 6.1.15~debian11
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:inboundIOReacto state:D stack:0 pid:3014879 ppid:4557 flags:0x00000000
Call Trace:
<TASK>
__schedule+0x351/0xa20
scheduler+0x5d/0xe0
io_schedule+0x42/0x70
blk_mq_get_tag+0x11a/0x2a0
? dequeue_task_stop+0x70/0x70
__blk_mq_alloc_requests+0x191/0x2e0
kprobe output showing RQF_MQ_INFLIGHT bit is not cleared before
__blk_mq_free_request being called.
320 320 kworker/29:1H __blk_mq_free_request rq_flags 0x220c0 in-flight 1
b'__blk_mq_free_request+0x1 [kernel]'
b'bt_iter+0x50 [kernel]'
b'blk_mq_queue_tag_busy_iter+0x318 [kernel]'
b'blk_mq_timeout_work+0x7c [kernel]'
b'process_one_work+0x1c4 [kernel]'
b'worker_thread+0x4d [kernel]'
b'kthread+0xe6 [kernel]'
b'ret_from_fork+0x1f [kernel]'
Signed-off-by: Tian Lan <tian.lan@twosigma.com>
Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230513221227.497327-1-tilan7663@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The scsi driver function sd_read_block_characteristics() always calls
disk_set_zoned() to a disk zoned model correctly, in case the device
model changed. This is done even for regular disks to set the zoned
model to BLK_ZONED_NONE and free any zone related resources if the drive
previously was zoned.
This behavior significantly impact the time it takes to revalidate disks
on a large system as the call to disk_clear_zone_settings() done from
disk_set_zoned() for the BLK_ZONED_NONE case results in the device
request queued to be frozen, even if there are no zone resources to
free.
Avoid this overhead for non-zoned devices by not calling
disk_clear_zone_settings() in disk_set_zoned() if the device model
was already set to BLK_ZONED_NONE, which is always the case for regular
devices.
Reported by: Brian Bunker <brian@purestorage.com>
Fixes: 508aebb80527 ("block: introduce blk_queue_clear_zone_settings()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230529073237.1339862-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since the dawn of time bio_check_eod has a check for a non-zero size of
the device. This doesn't really make any sense as we never want to send
I/O to a device that's been set to zero size, or never moved out of that.
I am a bit surprised we haven't caught this for a long time, but the
removal of the extra validation inside of zram caused syzbot to trip
over this issue recently. I've added a Fixes tag for that commit, but
the issue really goes back way before git history.
Fixes: 9fe95babc742 ("zram: remove valid_io_request")
Reported-by: syzbot+b8d61a58b7c7ebd2c8e0@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230524060538.1593686-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
commit <8af870aa5b847> ("block: enable bio caching use for passthru IO")
introduced bio-cache for passthru IO. In case when nr_vecs are greater
than BIO_INLINE_VECS, bio and bvecs are allocated from mempool (instead
of percpu cache) and REQ_ALLOC_CACHE is cleared. This causes the side
effect of not freeing bio/bvecs into mempool on completion.
This patch lets the passthru IO fallback to allocation using bio_kmalloc
when nr_vecs are greater than BIO_INLINE_VECS. The corresponding bio
is freed during call to blk_mq_map_bio_put during completion.
Cc: stable@vger.kernel.org # 6.1
fixes <8af870aa5b847> ("block: enable bio caching use for passthru IO")
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20230523111709.145676-1-anuj20.g@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If multiple CPUs are sharing the same hardware queue, it can
cause leak in the active queue counter tracking when __blk_mq_tag_busy()
is executed simultaneously.
Fixes: ee78ec1077d3 ("blk-mq: blk_mq_tag_busy is no need to return a value")
Signed-off-by: Tian Lan <tian.lan@twosigma.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20230522210555.794134-1-tilan7663@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
commit b11d31ae01e6 ("blk-wbt: remove unnecessary check in
wbt_enable_default()") removes the checking of CONFIG_BLK_WBT_MQ by
mistake, which is used to control enable or disable wbt by default.
Fix the problem by adding back the checking. This patch also do a litter
cleanup to make related code more readable.
Fixes: b11d31ae01e6 ("blk-wbt: remove unnecessary check in wbt_enable_default()")
Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/lkml/CAKXUXMzfKq_J9nKHGyr5P5rvUETY4B-fxoQD4sO+NYjFOfVtZA@mail.gmail.com/t/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230522121854.2928880-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
User should not be able to write block device if it is read-only at
block level (e.g force_ro attribute). This is ensured in the regular
fops write operation (blkdev_write_iter) but not when writing via
user mapping (mmap), allowing user to actually write a read-only
block device via a PROT_WRITE mapping.
Example: This can lead to integrity issue of eMMC boot partition
(e.g mmcblk0boot0) which is read-only by default.
To fix this issue, simply deny shared writable mapping if the block
is readonly.
Note: Block remains writable if switch to read-only is performed
after the initial mapping, but this is expected behavior according
to commit a32e236eb93e ("Partially revert "block: fail op_is_write()
requests to read-only partitions"")'.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230510074223.991297-1-loic.poulain@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull more block updates from Jens Axboe:
- MD pull request via Song:
- Improve raid5 sequential IO performance on spinning disks, which
fixes a regression since v6.0 (Jan Kara)
- Fix bitmap offset types, which fixes an issue introduced in this
merge window (Jonathan Derrick)
- Cleanup of hweight type used for cgroup writeback (Maxim)
- Fix a regression with the "has_submit_bio" changes across partitions
(Ming)
- Cleanup of QUEUE_FLAG_ADD_RANDOM clearing.
We used to set this flag on queues non blk-mq queues, and hence some
drivers clear it unconditionally. Since all of these have since been
converted to true blk-mq drivers, drop the useless clear as the bit
is not set (Chaitanya)
- Fix the flags being set in a bio for a flush for drbd (Christoph)
- Cleanup and deduplication of the code handling setting block device
capacity (Damien)
- Fix for ublk handling IO timeouts (Ming)
- Fix for a regression in blk-cgroup teardown (Tao)
- NBD documentation and code fixes (Eric)
- Convert blk-integrity to using device_attributes rather than a second
kobject to manage lifetimes (Thomas)
* tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux:
ublk: add timeout handler
drbd: correctly submit flush bio on barrier
mailmap: add mailmap entries for Jens Axboe
block: Skip destroyed blkg when restart in blkg_destroy_all()
writeback: fix call of incorrect macro
md: Fix bitmap offset type in sb writer
md/raid5: Improve performance for sequential IO
docs nbd: userspace NBD now favors github over sourceforge
block nbd: use req.cookie instead of req.handle
uapi nbd: add cookie alias to handle
uapi nbd: improve doc links to userspace spec
blk-integrity: register sysfs attributes on struct device
blk-integrity: convert to struct device_attribute
blk-integrity: use sysfs_emit
block/drivers: remove dead clear of random flag
block: sync part's ->bd_has_submit_bio with disk's
block: Cleanup set_capacity()/bdev_set_nr_sectors()
|
|
Kernel hang in blkg_destroy_all() when total blkg greater than
BLKG_DESTROY_BATCH_SIZE, because of not removing destroyed blkg in
blkg_list. So the size of blkg_list is same after destroying a
batch of blkg, and the infinite 'restart' occurs.
Since blkg should stay on the queue list until blkg_free_workfn(),
skip destroyed blkg when restart a new round, which will solve this
kernel hang issue and satisfy the previous will to restart.
Reported-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230428045149.1310073-1-tao1.su@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening
in the driver core in the quest to be able to move "struct bus" and
"struct class" into read-only memory, a task now complete with these
changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules
for all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most
of them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
device property: make device_property functions take const device *
driver core: update comments in device_rename()
driver core: Don't require dynamic_debug for initcall_debug probe timing
firmware_loader: rework crypto dependencies
firmware_loader: Strip off \n from customized path
zram: fix up permission for the hot_add sysfs file
cacheinfo: Add use_arch[|_cache]_info field/function
arch_topology: Remove early cacheinfo error message if -ENOENT
cacheinfo: Check cache properties are present in DT
cacheinfo: Check sib_leaf in cache_leaves_are_shared()
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
cacheinfo: Add arm64 early level initializer implementation
cacheinfo: Add arch specific early level initializer
tty: make tty_class a static const structure
driver core: class: remove struct class_interface * from callbacks
driver core: class: mark the struct class in struct class_interface constant
driver core: class: make class_register() take a const *
driver core: class: mark class_release() as taking a const *
driver core: remove incorrect comment for device_create*
MIPS: vpe-cmp: remove module owner pointer from struct class usage.
...
|
|
The "integrity" kobject only acted as a holder for static sysfs entries.
It also was embedded into struct gendisk without managing it, violating
assumptions of the driver core.
Instead register the sysfs entries directly onto the struct device.
Also drop the now unused member integrity_kobj from struct gendisk.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-3-ceccb4493c46@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
An upcoming patch will register the integrity attributes directly with
the struct device kobject.
For this the attributes have to be implemented in terms of
struct device_attribute.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-2-ceccb4493c46@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The correct way to emit data into sysfs is via sysfs_emit(), use it.
Also perform some trivial syntactic cleanups.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-1-ceccb4493c46@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block updates from Jens Axboe:
- drbd patches, bringing us closer to unifying the out-of-tree version
and the in tree one (Andreas, Christoph)
- support for auto-quiesce for the s390 dasd driver (Stefan)
- MD pull request via Song:
- md/bitmap: Optimal last page size (Jon Derrick)
- Various raid10 fixes (Yu Kuai, Li Nan)
- md: add error_handlers for raid0 and linear (Mariusz Tkaczyk)
- NVMe pull request via Christoph:
- Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas)
- Validate nvmet module parameters (Chaitanya Kulkarni)
- Fence TCP socket on receive error (Chris Leech)
- Fix async event trace event (Keith Busch)
- Minor cleanups (Chaitanya Kulkarni, zhenwei pi)
- Fix and cleanup nvmet Identify handling (Damien Le Moal,
Christoph Hellwig)
- Fix double blk_mq_complete_request race in the timeout handler
(Lei Yin)
- Fix irq locking in nvme-fcloop (Ming Lei)
- Remove queue mapping helper for rdma devices (Sagi Grimberg)
- use structured request attribute checks for nbd (Jakub)
- fix blk-crypto race conditions between keyslot management (Eric)
- add sed-opal support for reading read locking range attributes
(Ondrej)
- make fault injection configurable for null_blk (Akinobu)
- clean up the request insertion API (Christoph)
- clean up the queue running API (Christoph)
- blkg config helper cleanups (Tejun)
- lazy init support for blk-iolatency (Tejun)
- various fixes and tweaks to ublk (Ming)
- remove hybrid polling. It hasn't really been useful since we got
async polled IO support, and these days we don't support sync polled
IO at all (Keith)
- misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming,
Chaitanya, me)
* tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits)
nbd: fix incomplete validation of ioctl arg
ublk: don't return 0 in case of any failure
sed-opal: geometry feature reporting command
null_blk: Always check queue mode setting from configfs
block: ublk: switch to ioctl command encoding
blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush
block, bfq: Fix division by zero error on zero wsum
fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m
block: store bdev->bd_disk->fops->submit_bio state in bdev
block: re-arrange the struct block_device fields for better layout
md/raid5: remove unused working_disks variable
md/raid10: don't call bio_start_io_acct twice for bio which experienced read error
md/raid10: fix memleak of md thread
md/raid10: fix memleak for 'conf->bio_split'
md/raid10: fix leak of 'r10bio->remaining' for recovery
md/raid10: don't BUG_ON() in raise_barrier()
md: fix soft lockup in status_resync
md: add error_handlers for raid0 and linear
md: Use optimal I/O size for last bitmap page
md: Fix types in sb writer
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly core changes and cleanups, some notable fixes and two
performance improvements in directory logging.
The IO path cleanups are removing or refactoring old code, scrub main
loop has been completely rewritten also refactoring old code.
There are some changes to non-btrfs code, mostly trivial, the cgroup
punt bio logic is only moved from generic code.
Performance improvements:
- improve logging changes in a directory during one transaction,
avoid iterating over items and reduce lock contention (fsync time
4x lower)
- when logging directory entries during one transaction, reduce
locking of subvolume trees by checking tree-log instead
(improvement in throughput and latency for concurrent access to a
subvolume)
Notable fixes:
- dev-replace:
- properly honor read mode when requested to avoid reading from
source device
- target device won't be used for eventual read repair, this is
unreliable for NODATASUM files
- when there are unpaired (and unrepairable) metadata during
replace, exit early with error and don't try to finish whole
operation
- scrub ioctl properly rejects unknown flags
- fix global block reserve calculations
- fix partial direct io write when there's a page fault in the
middle, iomap will try to continue with partial request but the
btrfs part did not match that, this can lead to zeros written
instead of data
Core changes:
- io path:
- continued cleanups and refactoring around bio handling
- extent io submit path simplifications and cleanups
- flush write path simplifications and cleanups
- rework logic of passing sync mode of bio, with further cleanups
- rewrite scrub code flow, restructure how the stripes are enumerated
and verified in a more unified way
- allow to set lower threshold for block group reclaim in debug mode
to aid zoned mode testing
- remove obsolete time-based delayed ref throttling logic when
truncating items
- DREW locks are not using percpu variables anymore
- more warning fixes (-Wmaybe-uninitialized)
- u64 division simplifications
- error handling improvements
Non-btrfs code changes:
- push cgroup punt bio logic to btrfs code (there was no other user
of that), the functionality can be now selected separately by
BLK_CGROUP_PUNT_BIO
- crc32c_impl removed after removing last uses in btrfs code
- add btrfs_assertfail() to objtool table"
* tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: mark btrfs_assertfail() __noreturn
btrfs: fix uninitialized variable warnings
btrfs: use log root when iterating over index keys when logging directory
btrfs: avoid iterating over all indexes when logging directory
btrfs: dev-replace: error out if we have unrepaired metadata error during
btrfs: remove pointless loop at btrfs_get_next_valid_item()
btrfs: scrub: reject unsupported scrub flags
btrfs: reinterpret async discard iops_limit=0 as no delay
btrfs: set default discard iops_limit to 1000
btrfs: remove unused raid56 functions which were dedicated for scrub
btrfs: scrub: remove scrub_bio structure
btrfs: scrub: remove scrub_block and scrub_sector structures
btrfs: scrub: remove the old scrub recheck code
btrfs: scrub: remove the old writeback infrastructure
btrfs: scrub: remove scrub_parity structure
btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub
btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure
btrfs: scrub: introduce helper to queue a stripe for scrub
btrfs: scrub: introduce error reporting functionality for scrub_stripe
btrfs: scrub: introduce a writeback helper for scrub_stripe
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"There are a number of major cleanups in ext4 this cycle:
- The data=journal writepath has been significantly cleaned up and
simplified, and reduces a large number of data=journal special
cases by Jan Kara.
- Ojaswin Muhoo has replaced linked list used to track extents that
have been used for inode preallocation with a red-black tree in the
multi-block allocator. This improves performance for workloads
which do a large number of random allocating writes.
- Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
multi-block allocator.
- Matthew wilcox has converted the code paths for reading and writing
ext4 pages to use folios.
- Jason Yan has continued to factor out ext4_fill_super() into
smaller functions for improve ease of maintenance and
comprehension.
- Josh Triplett has created an uapi header for ext4 userspace API's"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (105 commits)
ext4: Add a uapi header for ext4 userspace APIs
ext4: remove useless conditional branch code
ext4: remove unneeded check of nr_to_submit
ext4: move dax and encrypt checking into ext4_check_feature_compatibility()
ext4: factor out ext4_block_group_meta_init()
ext4: move s_reserved_gdt_blocks and addressable checking into ext4_check_geometry()
ext4: rename two functions with 'check'
ext4: factor out ext4_flex_groups_free()
ext4: use ext4_group_desc_free() in ext4_put_super() to save some duplicated code
ext4: factor out ext4_percpu_param_init() and ext4_percpu_param_destroy()
ext4: factor out ext4_hash_info_init()
Revert "ext4: Fix warnings when freezing filesystem with journaled data"
ext4: Update comment in mpage_prepare_extent_to_map()
ext4: Simplify handling of journalled data in ext4_bmap()
ext4: Drop special handling of journalled data from ext4_quota_on()
ext4: Drop special handling of journalled data from ext4_evict_inode()
ext4: Fix special handling of journalled data from extent zeroing
ext4: Drop special handling of journalled data from extent shifting operations
ext4: Drop special handling of journalled data from ext4_sync_file()
ext4: Commit transaction before writing back pages in data=journal mode
...
|
|
submit_bio() always uses bio->bi_bdev->bd_has_submit_bio to decide if
disk's ->submit_bio() is called, and bio->bi_bdev could point to one
partition device.
So we have to sync part bdev's ->bd_has_submit_bio with disk's.
Reported-by: Changhui Zhong <czhong@redhat.com>
Link: https://lore.kernel.org/linux-block/ZEdItaPqif8fp85H@ovpn-8-24.pek2.redhat.com/T/#t
Fixes: 9f4107b07b17 ("block: store bdev->bd_disk->fops->submit_bio state in bdev")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230425034154.110099-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull ITER_UBUF updates from Jens Axboe:
"This turns singe vector imports into ITER_UBUF, rather than
ITER_IOVEC.
The former is more trivial to iterate and advance, and hence a bit
more efficient. From some very unscientific testing, ~60% of all iovec
imports are single vector"
* tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux:
iov_iter: Mark copy_compat_iovec_from_user() noinline
iov_iter: import single vector iovecs as ITER_UBUF
iov_iter: convert import_single_range() to ITER_UBUF
iov_iter: overlay struct iovec and ubuf/len
iov_iter: set nr_segs = 1 for ITER_UBUF
iov_iter: remove iov_iter_iovec()
iov_iter: add iter_iov_addr() and iter_iov_len() helpers
ALSA: pcm: check for user backed iterator, not specific iterator type
IB/qib: check for user backed iterator, not specific iterator type
IB/hfi1: check for user backed iterator, not specific iterator type
iov_iter: add iter_iovec() helper
block: ensure bio_alloc_map_data() deals with ITER_UBUF correctly
|
|
The code for setting a block device capacity (bd_nr_sectors field of
struct block_device) is duplicated in set_capacity() and
bdev_set_nr_sectors(). Clean this up by making bdev_set_nr_sectors()
a block layer internal function defined in block/bdev.c instead of
having this function statically defined in block/partitions/core.c.
With this change, set_capacity() implementation can be simplified to
only calling bdev_set_nr_sectors().
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230424131318.79935-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 23f3e3272e7a4d9fb870485cd6df1e4f9539282c.
blk-mq sched bio merge still needs request to grab queue usage counter,
so we can't simply call blk_mq_attempt_bio_merge() when queue usage
counter isn't held.
Fixes: 23f3e3272e7a ("block: Merge bio before checking ->cached_rq")
Cc: Xiao Ni <xni@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230420112018.1108058-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Locking range start and locking range length
attributes may be require to satisfy restrictions
exposed by OPAL2 geometry feature reporting.
Geometry reporting feature is described in TCG OPAL SSC,
section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity
and LowestAlignedLBA).
4.3.5.2.1.1 RangeStart Behavior:
[ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ]
When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:
a) the AlignmentRequired (ALIGN above) column in the LockingInfo
table is TRUE;
b) RangeStart is non-zero; and
c) StartAlignment is non-zero, then the method SHALL fail and
return an error status code INVALID_PARAMETER.
4.3.5.2.1.2 RangeLength Behavior:
If RangeStart is zero, then
[ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ]
If RangeStart is non-zero, then
[ LengthAlignment = (RangeLength modulo AlignmentGranularity) ]
When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:
a) the AlignmentRequired (ALIGN above) column in the LockingInfo
table is TRUE;
b) RangeLength is non-zero; and
c) LengthAlignment is non-zero, then the method SHALL fail and
return an error status code INVALID_PARAMETER
In userspace we stuck to logical block size reported by general
block device (via sysfs or ioctl), but we can not read
'AlignmentGranularity' or 'LowestAlignedLBA' anywhere else and
we need to get those values from sed-opal interface otherwise
we will not be able to report or avoid locking range setup
INVALID_PARAMETER errors above.
Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Link: https://lore.kernel.org/r/20230411090931.9193-2-okozina@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Guard all the code to punt bios to a per-cgroup submission helper by a
new CONFIG_BLK_CGROUP_PUNT_BIO symbol that is selected by btrfs.
This way non-btrfs kernel builds don't need to have this code.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
async_bio_lock is only taken from bio submission and workqueue context,
both are never in bottom halves.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path. To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly. Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit b12e5c6c755a accidentally changes blk_kick_flush to do a head
insert into the requeue list, fix this up.
Fixes: b12e5c6c755a ("blk-mq: pass a flags argument to blk_mq_add_to_requeue_list")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230416073553.966161-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When the weighted sum is zero the calculation of limit causes
a division by zero error. Fix this by continuing to the next level.
This was discovered by running as root:
stress-ng --ioprio 0
Fixes divison by error oops:
[ 521.450556] divide error: 0000 [#1] SMP NOPTI
[ 521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1
[ 521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[ 521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400
[ 521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44
[ 521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046
[ 521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000
[ 521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978
[ 521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0
[ 521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18
[ 521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970
[ 521.454549] FS: 00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000
[ 521.454912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0
[ 521.455491] PKRU: 55555554
[ 521.455619] Call Trace:
[ 521.455736] <TASK>
[ 521.455837] ? bfq_request_merge+0x3a/0xc0
[ 521.456027] ? elv_merge+0x115/0x140
[ 521.456191] bfq_limit_depth+0xc8/0x240
[ 521.456366] __blk_mq_alloc_requests+0x21a/0x2c0
[ 521.456577] blk_mq_submit_bio+0x23c/0x6c0
[ 521.456766] __submit_bio+0xb8/0x140
[ 521.457236] submit_bio_noacct_nocheck+0x212/0x300
[ 521.457748] submit_bio_noacct+0x1a6/0x580
[ 521.458220] submit_bio+0x43/0x80
[ 521.458660] ext4_io_submit+0x23/0x80
[ 521.459116] ext4_do_writepages+0x40a/0xd00
[ 521.459596] ext4_writepages+0x65/0x100
[ 521.460050] do_writepages+0xb7/0x1c0
[ 521.460492] __filemap_fdatawrite_range+0xa6/0x100
[ 521.460979] file_write_and_wait_range+0xbf/0x140
[ 521.461452] ext4_sync_file+0x105/0x340
[ 521.461882] __x64_sys_fsync+0x67/0x100
[ 521.462305] ? syscall_exit_to_user_mode+0x2c/0x1c0
[ 521.462768] do_syscall_64+0x3b/0xc0
[ 521.463165] entry_SYSCALL_64_after_hwframe+0x5a/0xc4
[ 521.463621] RIP: 0033:0x5640b6c56590
[ 521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We have a long chain of memory dereferencing just to whether or not
this disk has a special submit_bio helper. As that's not necessarily
the common case, add a bd_has_submit_bio state in the bdev to avoid
traversing this memory dependency chain if we don't need to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
for-6.4/block
Pull NVMe updates from Christoph:
"nvme updates for Linux 6.4
- drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas)
- validate nvmet module parameters (Chaitanya Kulkarni)
- fence TCP socket on receive error (Chris Leech)
- fix async event trace event (Keith Busch)
- minor cleanups (Chaitanya Kulkarni, zhenwei pi)
- fix and cleanup nvmet Identify handling (Damien Le Moal,
Christoph Hellwig)
- fix double blk_mq_complete_request race in the timeout handler
(Lei Yin)
- fix irq locking in nvme-fcloop (Ming Lei)
- remove queue mapping helper for rdma devices (Sagi Grimberg)"
* tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme:
nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage"
blk-mq-rdma: remove queue mapping helper for rdma devices
nvme-rdma: minor cleanup in nvme_rdma_create_cq()
nvme: fix double blk_mq_complete_request for timeout request with low probability
nvme: fix async event trace event
nvme-apple: return directly instead of else
nvme-apple: return directly instead of else
nvmet-tcp: validate idle poll modparam value
nvmet-tcp: validate so_priority modparam value
nvme-tcp: fence TCP socket on receive error
nvmet: remove nvmet_req_cns_error_complete
nvmet: rename nvmet_execute_identify_cns_cs_ns
nvmet: fix Identify Identification Descriptor List handling
nvmet: cleanup nvmet_execute_identify()
nvmet: fix I/O Command Set specific Identify Controller
nvmet: fix Identify Active Namespace ID list handling
nvmet: fix Identify Controller handling
nvmet: fix Identify Namespace handling
nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns()
nvme-pci: drop redundant pci_enable_pcie_error_reporting()
|
|
__blk_mq_run_hw_queue just contains a WARN_ON_ONCE for calls from
interrupt context and a blk_mq_run_dispatch_ops-protected call to
blk_mq_sched_dispatch_requests. Open code the call to
blk_mq_sched_dispatch_requests in both callers, and move the WARN_ON_ONCE
to blk_mq_run_hw_queue where it can be extended to all !async calls,
while the other call is from workqueue context and thus obviously does
not need the assert.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Only blk_mq_run_hw_queue can call __blk_mq_delay_run_hw_queue with
async=false, so move the handling there.
With this __blk_mq_delay_run_hw_queue can be merged into
blk_mq_delay_run_hw_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For the in-context dispatch, blk_mq_hctx_stopped is alredy checked in
blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection.
For the async dispatch case having a check before scheduling the work
still makes sense to avoid needless workqueue scheduling, so just keep it
for that case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests
under blk_mq_run_dispatch_ops() protection, so remove the duplicate check.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__blk_mq_sched_dispatch_requests currently has duplicated logic
for the cases where requests are on the hctx dispatch list or not.
Merge the two with a new need_dispatch variable and remove a few
pointless local variables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413060651.694656-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|