Age | Commit message (Collapse) | Author | Files | Lines |
|
Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.
* tag 'v5.2-rc6': (482 commits)
Linux 5.2-rc6
Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
Bluetooth: Fix regression with minimum encryption key size alignment
tcp: refine memory limit test in tcp_fragment()
x86/vdso: Prevent segfaults due to hoisted vclock reads
SUNRPC: Fix a credential refcount leak
Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
net :sunrpc :clnt :Fix xps refcount imbalance on the error path
NFS4: Only set creation opendata if O_CREAT
ARM: 8867/1: vdso: pass --be8 to linker if necessary
KVM: nVMX: reorganize initial steps of vmx_set_nested_state
KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
habanalabs: use u64_to_user_ptr() for reading user pointers
nfsd: replace Jeff by Chuck as nfsd co-maintainer
inet: clear num_timeout reqsk_alloc()
PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
net: mvpp2: debugfs: Add pmap to fs dump
ipv6: Default fib6_type to RTN_UNICAST when not set
net: hns3: Fix inconsistent indenting
net/af_iucv: always register net_device notifier
...
|
|
If we pass pages through an iov_iter we always already have a reference
in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
reference to pages by default for bvec backed iov_iters.
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In commit af7ddd8a627c
("Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In commit af7ddd8a627c
("Merge tag 'dma-mapping-4.21' of git://git.infradead.org/users/hch/dma-mapping"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In null_handle_cmd() when device is configured as zoned, variable op is
decalred as an int, where it is used to hold values of type
REQ_OP_XXX which is of type enum req_opf. Change the type from
int to enum req_opf.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now we have counters for how many times jouranl is reclaimed, how many
times cached dirty btree nodes are flushed, but we don't know how many
jouranl buckets are really reclaimed.
This patch adds reclaimed_journal_buckets into struct cache_set, this
is an increasing only counter, to tell how many journal buckets are
reclaimed since cache set runs. From all these three counters (reclaim,
reclaimed_journal_buckets, flush_write), we can have idea how well
current journal space reclaim code works.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch improves performance for btree_flush_write() in following
ways,
- Use another spinlock journal.flush_write_lock to replace the very
hot journal.lock. We don't have to use journal.lock here, selecting
candidate btree nodes takes a lot of time, hold journal.lock here will
block other jouranling threads and drop the overall I/O performance.
- Only select flushing btree node from c->btree_cache list. When the
machine has a large system memory, mca cache may have a huge number of
cached btree nodes. Iterating all the cached nodes will take a lot
of CPU time, and most of the nodes on c->btree_cache_freeable and
c->btree_cache_freed lists are cleared and have need to flush. So only
travel mca list c->btree_cache to select flushing btree node should be
enough for most of the cases.
- Don't iterate whole c->btree_cache list, only reversely select first
BTREE_FLUSH_NR btree nodes to flush. Iterate all btree nodes from
c->btree_cache and select the oldest journal pin btree nodes consumes
huge number of CPU cycles if the list is huge (push and pop a node
into/out of a heap is expensive). The last several dirty btree nodes
on the tail of c->btree_cache list are earlest allocated and cached
btree nodes, they are relative to the oldest journal pin btree nodes.
Therefore only flushing BTREE_FLUSH_NR btree nodes from tail of
c->btree_cache probably includes the oldest journal pin btree nodes.
In my testing, the above change decreases 50%+ CPU consumption when
journal space is full. Some times IOPS drops to 0 for 5-8 seconds,
comparing blocking I/O for 120+ seconds in previous code, this is much
better. Maybe there is room to improve in future, but at this momment
the fix looks fine and performs well in my testing.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is a race between mca_reap(), btree_node_free() and journal code
btree_flush_write(), which results very rare and strange deadlock or
panic and are very hard to reproduce.
Let me explain how the race happens. In btree_flush_write() one btree
node with oldest journal pin is selected, then it is flushed to cache
device, the select-and-flush is a two steps operation. Between these two
steps, there are something may happen inside the race window,
- The selected btree node was reaped by mca_reap() and allocated to
other requesters for other btree node.
- The slected btree node was selected, flushed and released by mca
shrink callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, firstly
b->write_lock is held by mutex_lock(). If the race happens and the
memory of selected btree node is allocated to other btree node, if that
btree node's write_lock is held already, a deadlock very probably
happens here. A worse case is the memory of the selected btree node is
released, then all references to this btree node (e.g. b->write_lock)
will trigger NULL pointer deference panic.
This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one in a quite long time period.
Such race is not easy to reproduce before. On a Lenovo SR650 server with
48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
device assembled by 3 NVMe SSDs as backing device, this race can be
observed around every 10,000 times btree_flush_write() gets called. Both
deadlock and kernel panic all happened as aftermath of the race.
The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
is set when selecting btree nodes, and cleared after btree nodes
flushed. Then when mca_reap() selects a btree node with this bit set,
this btree node will be skipped. Since mca_reap() only reaps btree node
without BTREE_NODE_journal_flush flag, such race is avoided.
Once corner case should be noticed, that is btree_node_free(). It might
be called in some error handling code path. For example the following
code piece from btree_split(),
2149 err_free2:
2150 bkey_put(b->c, &n2->key);
2151 btree_node_free(n2);
2152 rw_unlock(true, n2);
2153 err_free1:
2154 bkey_put(b->c, &n1->key);
2155 btree_node_free(n1);
2156 rw_unlock(true, n1);
At line 2151 and 2155, the btree node n2 and n1 are released without
mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
If btree_node_free() is called directly in such error handling path,
and the selected btree node has BTREE_NODE_journal_flush bit set, just
delay for 1 us and retry again. In this case this btree node won't
be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
and free the btree node memory.
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Coly Li <colyli@suse.de>
Reported-and-tested-by: kbuild test robot <lkp@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In struct cache_set, retry_flush_write is added for commit c4dc2497d50d
("bcache: fix high CPU occupancy during journal") which is reverted in
previous patch.
Now it is useless anymore, and this patch removes it from bcache code.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When accessing or modifying BTREE_NODE_dirty bit, it is not always
necessary to acquire b->write_lock. In bch_btree_cache_free() and
mca_reap() acquiring b->write_lock is necessary, and this patch adds
comments to explain why mutex_lock(&b->write_lock) is necessary for
checking or clearing BTREE_NODE_dirty bit there.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
always set no matter btree node is dirty or not. The code looks like
this,
if (btree_node_dirty(b))
btree_complete_write(b, btree_current_write(b));
clear_bit(BTREE_NODE_dirty, &b->flags);
Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
bit is cleared, then it is unnecessary to clear the bit again.
This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
true (the bit is set), to save a few CPU cycles.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit c4dc2497d50d9c6fb16aa0d07b6a14f3b2adb1e0.
This patch enlarges a race between normal btree flush code path and
flush_btree_write(), which causes deadlock when journal space is
exhausted. Reverts this patch makes the race window from 128 btree
nodes to only 1 btree nodes.
Fixes: c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Tang Junhui <tang.junhui.linux@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 6268dc2c4703aabfb0b35681be709acf4c2826c6.
This patch depends on commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal") which is reverted in previous patch. So
revert this one too.
Fixes: 6268dc2c4703 ("bcache: free heap cache_set->flush_btree in bch_journal_free")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When cache set starts, bch_btree_check() will check all bkeys on cache
device by calculating the checksum. This operation will consume a huge
number of system memory if there are a lot of data cached. Since bcache
uses its own mca cache to maintain all its read-in btree nodes, and only
releases the cache space when system memory manage code starts to shrink
caches. Then before memory manager code to call the mca cache shrinker
callback, bcache mca cache will compete memory resource with user space
application, which may have nagive effect to performance of user space
workloads (e.g. data base, or I/O service of distributed storage node).
This patch tries to call bcache mca shrinker routine to proactively
release mca cache memory, to decrease the memory pressure of system and
avoid negative effort of the overall system I/O performance.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In journal_read_bucket() when setting ja->seq[bucket_index], there might
be potential case that a later non-maximum overwrites a better sequence
number to ja->seq[bucket_index]. This patch adds a check to make sure
that ja->seq[bucket_index] will be only set a new value if it is bigger
then current value.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds more code comments in journal_read_bucket(), this is an
effort to make the code to be more understandable.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When enable lockdep and reboot system with a writeback mode bcache
device, the following potential deadlock warning is reported by lockdep
engine.
[ 101.536569][ T401] kworker/2:2/401 is trying to acquire lock:
[ 101.538575][ T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 101.542054][ T401]
[ 101.542054][ T401] but task is already holding lock:
[ 101.544587][ T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 101.548386][ T401]
[ 101.548386][ T401] which lock already depends on the new lock.
[ 101.548386][ T401]
[ 101.551874][ T401]
[ 101.551874][ T401] the existing dependency chain (in reverse order) is:
[ 101.555000][ T401]
[ 101.555000][ T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 101.557860][ T401] process_one_work+0x277/0x640
[ 101.559661][ T401] worker_thread+0x39/0x3f0
[ 101.561340][ T401] kthread+0x125/0x140
[ 101.562963][ T401] ret_from_fork+0x3a/0x50
[ 101.564718][ T401]
[ 101.564718][ T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 101.567701][ T401] lock_acquire+0xb4/0x1c0
[ 101.569651][ T401] flush_workqueue+0xae/0x4c0
[ 101.571494][ T401] drain_workqueue+0xa9/0x180
[ 101.573234][ T401] destroy_workqueue+0x17/0x250
[ 101.575109][ T401] cached_dev_free+0x44/0x120 [bcache]
[ 101.577304][ T401] process_one_work+0x2a4/0x640
[ 101.579357][ T401] worker_thread+0x39/0x3f0
[ 101.581055][ T401] kthread+0x125/0x140
[ 101.582709][ T401] ret_from_fork+0x3a/0x50
[ 101.584592][ T401]
[ 101.584592][ T401] other info that might help us debug this:
[ 101.584592][ T401]
[ 101.588355][ T401] Possible unsafe locking scenario:
[ 101.588355][ T401]
[ 101.590974][ T401] CPU0 CPU1
[ 101.592889][ T401] ---- ----
[ 101.594743][ T401] lock((work_completion)(&cl->work)#2);
[ 101.596785][ T401] lock((wq_completion)bcache_writeback_wq);
[ 101.600072][ T401] lock((work_completion)(&cl->work)#2);
[ 101.602971][ T401] lock((wq_completion)bcache_writeback_wq);
[ 101.605255][ T401]
[ 101.605255][ T401] *** DEADLOCK ***
[ 101.605255][ T401]
[ 101.608310][ T401] 2 locks held by kworker/2:2/401:
[ 101.610208][ T401] #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 101.613709][ T401] #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 101.617480][ T401]
[ 101.617480][ T401] stack backtrace:
[ 101.619539][ T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G W 5.2.0-rc4-lp151.20-default+ #1
[ 101.623225][ T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 101.627210][ T401] Workqueue: events cached_dev_free [bcache]
[ 101.629239][ T401] Call Trace:
[ 101.630360][ T401] dump_stack+0x85/0xcb
[ 101.631777][ T401] print_circular_bug+0x19a/0x1f0
[ 101.633485][ T401] __lock_acquire+0x16cd/0x1850
[ 101.635184][ T401] ? __lock_acquire+0x6a8/0x1850
[ 101.636863][ T401] ? lock_acquire+0xb4/0x1c0
[ 101.638421][ T401] ? find_held_lock+0x34/0xa0
[ 101.640015][ T401] lock_acquire+0xb4/0x1c0
[ 101.641513][ T401] ? flush_workqueue+0x87/0x4c0
[ 101.643248][ T401] flush_workqueue+0xae/0x4c0
[ 101.644832][ T401] ? flush_workqueue+0x87/0x4c0
[ 101.646476][ T401] ? drain_workqueue+0xa9/0x180
[ 101.648303][ T401] drain_workqueue+0xa9/0x180
[ 101.649867][ T401] destroy_workqueue+0x17/0x250
[ 101.651503][ T401] cached_dev_free+0x44/0x120 [bcache]
[ 101.653328][ T401] process_one_work+0x2a4/0x640
[ 101.655029][ T401] worker_thread+0x39/0x3f0
[ 101.656693][ T401] ? process_one_work+0x640/0x640
[ 101.658501][ T401] kthread+0x125/0x140
[ 101.660012][ T401] ? kthread_create_worker_on_cpu+0x70/0x70
[ 101.661985][ T401] ret_from_fork+0x3a/0x50
[ 101.691318][ T401] bcache: bcache_device_free() bcache0 stopped
Here is how the above potential deadlock may happen in reboot/shutdown
code path,
1) bcache_reboot() is called firstly in the reboot/shutdown code path,
then in bcache_reboot(), bcache_device_stop() is called.
2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->falgs, then call
closure_queue(&d->cl) to invoke cached_dev_flush(). And in turn
cached_dev_flush() calls cached_dev_free() via closure_at()
3) In cached_dev_free(), after stopped writebach kthread
dc->writeback_thread, the kwork dc->writeback_write_wq is stopping by
destroy_workqueue().
4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
is acquired by lock_map_acquire() in flush_workqueue(). After the
lock acquired the rest part of flush_workqueue() just wait for the
workqueue to complete.
5) Now we look back at writeback thread routine bch_writeback_thread(),
in the main while-loop, write_dirty() is called via continue_at() in
read_dirty_submit(), which is called via continue_at() in while-loop
level called function read_dirty(). Inside write_dirty() it may be
re-called on workqueeu dc->writeback_write_wq via continue_at().
It means when the writeback kthread is stopped in cached_dev_free()
there might be still one kworker queued on dc->writeback_write_wq
to execute write_dirty() again.
6) Now this kworker is scheduled on dc->writeback_write_wq to run by
process_one_work() (which is called by worker_thread()). Before
calling the kwork routine, wq->lockdep_map is acquired.
7) But wq->lockdep_map is acquired already in step 4), so a A-A lock
(lockdep terminology) scenario happens.
Indeed on multiple cores syatem, the above deadlock is very rare to
happen, just as the code comments in process_one_work() says,
2263 * AFAICT there is no possible deadlock scenario between the
2264 * flush_work() and complete() primitives (except for
single-threaded
2265 * workqueues), so hiding them isn't a problem.
But it is still good to fix such lockdep warning, even no one running
bcache on single core system.
The fix is simple. This patch solves the above potential deadlock by,
- Do not destroy workqueue dc->writeback_write_wq in cached_dev_free().
- Flush and destroy dc->writeback_write_wq in writebach kthread routine
bch_writeback_thread(), where after quit the thread main while-loop
and before cached_dev_put() is called.
By this fix, dc->writeback_write_wq will be stopped and destroy before
the writeback kthread stopped, so the chance for a A-A locking on
wq->lockdep_map is disappeared, such A-A deadlock won't happen
any more.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When enable lockdep engine, a lockdep warning can be observed when
reboot or shutdown system,
[ 3142.764557][ T1] bcache: bcache_reboot() Stopping all devices:
[ 3142.776265][ T2649]
[ 3142.777159][ T2649] ======================================================
[ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
[ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G W
[ 3142.785684][ T2649] ------------------------------------------------------
[ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
[ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 3142.794678][ T2649]
[ 3142.794678][ T2649] but task is already holding lock:
[ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.801462][ T2649]
[ 3142.801462][ T2649] which lock already depends on the new lock.
[ 3142.801462][ T2649]
[ 3142.805277][ T2649]
[ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
[ 3142.808902][ T2649]
[ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
[ 3142.812396][ T2649] __mutex_lock+0x7a/0x9d0
[ 3142.814184][ T2649] cached_dev_free+0x17/0x120 [bcache]
[ 3142.816415][ T2649] process_one_work+0x2a4/0x640
[ 3142.818413][ T2649] worker_thread+0x39/0x3f0
[ 3142.820276][ T2649] kthread+0x125/0x140
[ 3142.822061][ T2649] ret_from_fork+0x3a/0x50
[ 3142.823965][ T2649]
[ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 3142.827244][ T2649] process_one_work+0x277/0x640
[ 3142.829160][ T2649] worker_thread+0x39/0x3f0
[ 3142.830958][ T2649] kthread+0x125/0x140
[ 3142.832674][ T2649] ret_from_fork+0x3a/0x50
[ 3142.834915][ T2649]
[ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 3142.838121][ T2649] lock_acquire+0xb4/0x1c0
[ 3142.840025][ T2649] flush_workqueue+0xae/0x4c0
[ 3142.842035][ T2649] drain_workqueue+0xa9/0x180
[ 3142.844042][ T2649] destroy_workqueue+0x17/0x250
[ 3142.846142][ T2649] cached_dev_free+0x52/0x120 [bcache]
[ 3142.848530][ T2649] process_one_work+0x2a4/0x640
[ 3142.850663][ T2649] worker_thread+0x39/0x3f0
[ 3142.852464][ T2649] kthread+0x125/0x140
[ 3142.854106][ T2649] ret_from_fork+0x3a/0x50
[ 3142.855880][ T2649]
[ 3142.855880][ T2649] other info that might help us debug this:
[ 3142.855880][ T2649]
[ 3142.859663][ T2649] Chain exists of:
[ 3142.859663][ T2649] (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
[ 3142.859663][ T2649]
[ 3142.865424][ T2649] Possible unsafe locking scenario:
[ 3142.865424][ T2649]
[ 3142.868022][ T2649] CPU0 CPU1
[ 3142.869885][ T2649] ---- ----
[ 3142.871751][ T2649] lock(&bch_register_lock);
[ 3142.873379][ T2649] lock((work_completion)(&cl->work)#2);
[ 3142.876399][ T2649] lock(&bch_register_lock);
[ 3142.879727][ T2649] lock((wq_completion)bcache_writeback_wq);
[ 3142.882064][ T2649]
[ 3142.882064][ T2649] *** DEADLOCK ***
[ 3142.882064][ T2649]
[ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
[ 3142.887245][ T2649] #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.890815][ T2649] #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.894884][ T2649] #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.898797][ T2649]
[ 3142.898797][ T2649] stack backtrace:
[ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G W 5.2.0-rc4-lp151.20-default+ #1
[ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
[ 3142.911422][ T2649] Call Trace:
[ 3142.912656][ T2649] dump_stack+0x85/0xcb
[ 3142.914181][ T2649] print_circular_bug+0x19a/0x1f0
[ 3142.916193][ T2649] __lock_acquire+0x16cd/0x1850
[ 3142.917936][ T2649] ? __lock_acquire+0x6a8/0x1850
[ 3142.919704][ T2649] ? lock_acquire+0xb4/0x1c0
[ 3142.921335][ T2649] ? find_held_lock+0x34/0xa0
[ 3142.923052][ T2649] lock_acquire+0xb4/0x1c0
[ 3142.924635][ T2649] ? flush_workqueue+0x87/0x4c0
[ 3142.926375][ T2649] flush_workqueue+0xae/0x4c0
[ 3142.928047][ T2649] ? flush_workqueue+0x87/0x4c0
[ 3142.929824][ T2649] ? drain_workqueue+0xa9/0x180
[ 3142.931686][ T2649] drain_workqueue+0xa9/0x180
[ 3142.933534][ T2649] destroy_workqueue+0x17/0x250
[ 3142.935787][ T2649] cached_dev_free+0x52/0x120 [bcache]
[ 3142.937795][ T2649] process_one_work+0x2a4/0x640
[ 3142.939803][ T2649] worker_thread+0x39/0x3f0
[ 3142.941487][ T2649] ? process_one_work+0x640/0x640
[ 3142.943389][ T2649] kthread+0x125/0x140
[ 3142.944894][ T2649] ? kthread_create_worker_on_cpu+0x70/0x70
[ 3142.947744][ T2649] ret_from_fork+0x3a/0x50
[ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped
Here is how the deadlock happens.
1) bcache_reboot() calls bcache_device_stop(), then inside
bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags.
Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
2) In cached_dev_flush(), cached_dev_free() is called by continu_at().
3) In cached_dev_free(), when stopping the writeback kthread of the
cached device by kthread_stop(), dc->writeback_thread will be waken
up to quite the kthread while-loop, then cached_dev_put() is called
in bch_writeback_thread().
4) Calling cached_dev_put() in writeback kthread may drop dc->count to
0, then dc->detach kworker is scheduled, which is initialized as
cached_dev_detach_finish().
5) Inside cached_dev_detach_finish(), the last line of code is to call
closure_put(&dc->disk.cl), which drops the last reference counter of
closrure dc->disk.cl, then the callback cached_dev_flush() gets
called.
Now cached_dev_flush() is called for second time in the code path, the
first time is in step 2). And again bch_register_lock will be acquired
again, and a A-A lock (lockdep terminology) is happening.
The root cause of the above A-A lock is in cached_dev_free(), mutex
bch_register_lock is held before stopping writeback kthread and other
kworkers. Fortunately now we have variable 'bcache_is_reboot', which may
prevent device registration or unregistration during reboot/shutdown
time, so it is unncessary to hold bch_register_lock such early now.
This is how this patch fixes the reboot/shutdown time A-A lock issue:
After moving mutex_lock(&bch_register_lock) to a later location where
before atomic_read(&dc->running) in cached_dev_free(), such A-A lock
problem can be solved without any reboot time registration race.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now there is variable bcache_is_reboot to prevent device register or
unregister during reboot, it is unncessary to still hold mutex lock
bch_register_lock before stopping writeback_rate_update kworker and
writeback kthread. And if the stopping kworker or kthread holding
bch_register_lock inside their routine (we used to have such problem
in writeback thread, thanks to Junhui Wang fixed it), it is very easy
to introduce deadlock during reboot/shutdown procedure.
Therefore in this patch, the location to acquire bch_register_lock is
moved to the location before calling calc_cached_dev_sectors(). Which
is later then original location in cached_dev_detach_finish().
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It is quite frequently to observe deadlock in bcache_reboot() happens
and hang the system reboot process. The reason is, in bcache_reboot()
when calling bch_cache_set_stop() and bcache_device_stop() the mutex
bch_register_lock is held. But in the process to stop cache set and
bcache device, bch_register_lock will be acquired again. If this mutex
is held here, deadlock will happen inside the stopping process. The
aftermath of the deadlock is, whole system reboot gets hung.
The fix is to avoid holding bch_register_lock for the following loops
in bcache_reboot(),
list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
bch_cache_set_stop(c);
list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
bcache_device_stop(&dc->disk);
A module range variable 'bcache_is_reboot' is added, it sets to true
in bcache_reboot(). In register_bcache(), if bcache_is_reboot is checked
to be true, reject the registration by returning -EBUSY immediately.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bch_cached_dev_attach() after bch_cached_dev_writeback_start()
called, the wrireback kthread and writeback rate update kworker of the
cached device are created, if the following bch_cached_dev_run()
failed, bch_cached_dev_attach() will return with -ENOMEM without
stopping the writeback related kthread and kworker.
This patch stops writeback kthread and writeback rate update kworker
before returning -ENOMEM if bch_cached_dev_run() returns error.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 9baf30972b55 ("bcache: fix for gc and write-back race") added a
new work queue dc->writeback_write_wq, but forgot to destroy it in the
error condition when creating dc->writeback_thread failed.
This patch destroys dc->writeback_write_wq if kthread_create() returns
error pointer to dc->writeback_thread, then a memory leak is avoided.
Fixes: 9baf30972b55 ("bcache: fix for gc and write-back race")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bch_cached_dev_files[] from driver/md/bcache/sysfs.c, sysfs_errors is
incorrectly inserted in. The correct entry should be sysfs_io_errors.
This patch fixes the problem and now I/O errors of cached device can be
read from /sys/block/bcache<N>/bcache/io_errors.
Fixes: c7b7bd07404c5 ("bcache: add io_disable to struct cached_dev")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If a bcache device is in dirty state and its cache set is not
registered, this bcache device will not appear in /dev/bcache<N>,
and there is no way to stop it or remove the bcache kernel module.
This is an as-designed behavior, but sometimes people has to reboot
whole system to release or stop the pending backing device.
This sysfs interface may remove such pending bcache devices when
write anything into the sysfs file manually.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The purpose of following code in bset_search_tree() is to avoid a branch
instruction,
994 if (likely(f->exponent != 127))
995 n = j * 2 + (((unsigned int)
996 (f->mantissa -
997 bfloat_mantissa(search, f))) >> 31);
998 else
999 n = (bkey_cmp(tree_to_bkey(t, j), search) > 0)
1000 ? j * 2
1001 : j * 2 + 1;
This piece of code is not very clear to understand, even when I tried to
add code comment for it, I made mistake. This patch removes the implict
bit operation and uses explicit branch to calculate next location in
binary tree search.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In previous bcache patches for Linux v5.2, the failure code path of
run_cache_set() is tested and fixed. So now the following comment
line can be removed from run_cache_set(),
/* XXX: test this, it's broken */
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds more error message in bch_cached_dev_run() to indicate
the exact reason why an error value is returned. Please notice when
printing out the "is running already" message, pr_info() is used here,
because in this case also -EBUSY is returned, the bcache device can
continue to attach to the cache devince and run, so it won't be an
error level message in kernel message.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds more error message for attaching cached device, this is
helpful to debug code failure during bache device start up.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds more accurate error message for specific
ssyfs_create_link() call, to help debugging failure during
bcache device start tup.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When too many I/O errors happen on cache set and CACHE_SET_IO_DISABLE
bit is set, bch_journal() may continue to work because the journaling
bkey might be still in write set yet. The caller of bch_journal() may
believe the journal still work but the truth is in-memory journal write
set won't be written into cache device any more. This behavior may
introduce potential inconsistent metadata status.
This patch checks CACHE_SET_IO_DISABLE bit at the head of bch_journal(),
if the bit is set, bch_journal() returns NULL immediately to notice
caller to know journal does not work.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If CACHE_SET_IO_DISABLE of a cache set flag is set by too many I/O
errors, currently allocator routines can still continue allocate
space which may introduce inconsistent metadata state.
This patch checkes CACHE_SET_IO_DISABLE bit in following allocator
routines,
- bch_bucket_alloc()
- __bch_bucket_alloc_set()
Once CACHE_SET_IO_DISABLE is set on cache set, the allocator routines
may reject allocation request earlier to avoid potential inconsistent
metadata.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Function bch_btree_keys_init() initializes b->set[].size and
b->set[].data to zero. As the code comments indicates, these code indeed
is unncessary, because both struct btree_keys and struct bset_tree are
nested embedded into struct btree, when struct btree is filled with 0
bits by kzalloc() in mca_bucket_alloc(), b->set[].size and
b->set[].data are initialized to 0 (a.k.a NULL) already.
This patch removes the redundant code, and add comments in
bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch adds return value check to bch_cached_dev_run(), now if there
is error happens inside bch_cached_dev_run(), it can be catched.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The arrays (of strings) that are passed to __sysfs_match_string() are
static, so use sysfs_match_string() which does an implicit ARRAY_SIZE()
over these arrays.
Functionally, this doesn't change anything.
The change is more cosmetic.
It only shrinks the static arrays by 1 byte each.
Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In function bset_search_tree(), when p >= t->size, t->tree[0] will be
prefetched by the following code piece,
974 unsigned int p = n << 4;
975
976 p &= ((int) (p - t->size)) >> 31;
977
978 prefetch(&t->tree[p]);
The purpose of the above code is to avoid a branch instruction, but
when p >= t->size, prefetch(&t->tree[0]) has no positive performance
contribution at all. This patch avoids the unncessary prefetch by only
calling prefetch() when p < t->size.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When backing device super block is written by bch_write_bdev_super(),
the bio complete callback write_bdev_super_endio() simply ignores I/O
status. Indeed such write request also contribute to backing device
health status if the request failed.
This patch checkes bio->bi_status in write_bdev_super_endio(), if there
is error, bch_count_backing_io_errors() will be called to count an I/O
error to dc->io_errors.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When md raid device (e.g. raid456) is used as backing device, read-ahead
requests on a degrading and recovering md raid device might be failured
immediately by md raid code, but indeed this md raid array can still be
read or write for normal I/O requests. Therefore such failed read-ahead
request are not real hardware failure. Further more, after degrading and
recovering accomplished, read-ahead requests will be handled by md raid
array again.
For such condition, I/O failures of read-ahead requests don't indicate
real health status (because normal I/O still be served), they should not
be counted into I/O error counter dc->io_errors.
Since there is no simple way to detect whether the backing divice is a
md raid device, this patch simply ignores I/O failures for read-ahead
bios on backing device, to avoid bogus backing device failure on a
degrading md raid array.
Suggested-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When cache_set_flush() is called for too many I/O errors detected on
cache device and the cache set is retiring, inside the function it
doesn't make sense to flushing cached btree nodes from c->btree_cache
because CACHE_SET_IO_DISABLE is set on c->flags already and all I/Os
onto cache device will be rejected.
This patch checks in cache_set_flush() that whether CACHE_SET_IO_DISABLE
is set. If yes, then avoids to flush the cached btree nodes to reduce
more time and make cache set retiring more faster.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 6147305c73e4511ca1a975b766b97a779d442567.
Although this patch helps the failed bcache device to stop faster when
too many I/O errors detected on corresponding cached device, setting
CACHE_SET_IO_DISABLE bit to cache set c->flags was not a good idea. This
operation will disable all I/Os on cache set, which means other attached
bcache devices won't work neither.
Without this patch, the failed bcache device can also be stopped
eventually if internal I/O accomplished (e.g. writeback). Therefore here
I revert it.
Fixes: 6147305c73e4 ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
Reported-by: Yong Li <mr.liyong@qq.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When everything is OK in bch_journal_read(), finally the return value
is returned by,
return ret;
which assumes ret will be 0 here. This assumption is wrong when all
journal buckets as are full and filled with valid journal entries. In
such cache the last location referencess read_bucket() sets 'ret' to
1, which means new jset added into jset list. The jset list is list
'journal' in caller run_cache_set().
Return 1 to run_cache_set() means something wrong and the cache set
won't start, but indeed everything is OK.
This patch changes the line at end of bch_journal_read() to directly
return 0 since everything if verything is good. Then a bogus error
is fixed.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When system memory is in heavy pressure, bch_gc_thread_start() from
run_cache_set() may fail due to out of memory. In such condition,
c->gc_thread is assigned to -ENOMEM, not NULL pointer. Then in following
failure code path bch_cache_set_error(), when cache_set_flush() gets
called, the code piece to stop c->gc_thread is broken,
if (!IS_ERR_OR_NULL(c->gc_thread))
kthread_stop(c->gc_thread);
And KASAN catches such NULL pointer deference problem, with the warning
information:
[ 561.207881] ==================================================================
[ 561.207900] BUG: KASAN: null-ptr-deref in kthread_stop+0x3b/0x440
[ 561.207904] Write of size 4 at addr 000000000000001c by task kworker/15:1/313
[ 561.207913] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G W 5.0.0-vanilla+ #3
[ 561.207916] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[ 561.207935] Workqueue: events cache_set_flush [bcache]
[ 561.207940] Call Trace:
[ 561.207948] dump_stack+0x9a/0xeb
[ 561.207955] ? kthread_stop+0x3b/0x440
[ 561.207960] ? kthread_stop+0x3b/0x440
[ 561.207965] kasan_report+0x176/0x192
[ 561.207973] ? kthread_stop+0x3b/0x440
[ 561.207981] kthread_stop+0x3b/0x440
[ 561.207995] cache_set_flush+0xd4/0x6d0 [bcache]
[ 561.208008] process_one_work+0x856/0x1620
[ 561.208015] ? find_held_lock+0x39/0x1d0
[ 561.208028] ? drain_workqueue+0x380/0x380
[ 561.208048] worker_thread+0x87/0xb80
[ 561.208058] ? __kthread_parkme+0xb6/0x180
[ 561.208067] ? process_one_work+0x1620/0x1620
[ 561.208072] kthread+0x326/0x3e0
[ 561.208079] ? kthread_create_worker_on_cpu+0xc0/0xc0
[ 561.208090] ret_from_fork+0x3a/0x50
[ 561.208110] ==================================================================
[ 561.208113] Disabling lock debugging due to kernel taint
[ 561.208115] irq event stamp: 11800231
[ 561.208126] hardirqs last enabled at (11800231): [<ffffffff83008538>] do_syscall_64+0x18/0x410
[ 561.208127] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[ 561.208129] #PF error: [WRITE]
[ 561.312253] hardirqs last disabled at (11800230): [<ffffffff830052ff>] trace_hardirqs_off_thunk+0x1a/0x1c
[ 561.312259] softirqs last enabled at (11799832): [<ffffffff850005c7>] __do_softirq+0x5c7/0x8c3
[ 561.405975] PGD 0 P4D 0
[ 561.442494] softirqs last disabled at (11799821): [<ffffffff831add2c>] irq_exit+0x1ac/0x1e0
[ 561.791359] Oops: 0002 [#1] SMP KASAN NOPTI
[ 561.791362] CPU: 15 PID: 313 Comm: kworker/15:1 Tainted: G B W 5.0.0-vanilla+ #3
[ 561.791363] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[ 561.791371] Workqueue: events cache_set_flush [bcache]
[ 561.791374] RIP: 0010:kthread_stop+0x3b/0x440
[ 561.791376] Code: 00 00 65 8b 05 26 d5 e0 7c 89 c0 48 0f a3 05 ec aa df 02 0f 82 dc 02 00 00 4c 8d 63 20 be 04 00 00 00 4c 89 e7 e8 65 c5 53 00 <f0> ff 43 20 48 8d 7b 24 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
[ 561.791377] RSP: 0018:ffff88872fc8fd10 EFLAGS: 00010286
[ 561.838895] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838916] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838934] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838948] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838966] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838979] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 561.838996] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 563.067028] RAX: 0000000000000000 RBX: fffffffffffffffc RCX: ffffffff832dd314
[ 563.067030] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000297
[ 563.067032] RBP: ffff88872fc8fe88 R08: fffffbfff0b8213d R09: fffffbfff0b8213d
[ 563.067034] R10: 0000000000000001 R11: fffffbfff0b8213c R12: 000000000000001c
[ 563.408618] R13: ffff88dc61cc0f68 R14: ffff888102b94900 R15: ffff88dc61cc0f68
[ 563.408620] FS: 0000000000000000(0000) GS:ffff888f7dc00000(0000) knlGS:0000000000000000
[ 563.408622] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 563.408623] CR2: 000000000000001c CR3: 0000000f48a1a004 CR4: 00000000007606e0
[ 563.408625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 563.408627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 563.904795] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 563.915796] PKRU: 55555554
[ 563.915797] Call Trace:
[ 563.915807] cache_set_flush+0xd4/0x6d0 [bcache]
[ 563.915812] process_one_work+0x856/0x1620
[ 564.001226] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 564.033563] ? find_held_lock+0x39/0x1d0
[ 564.033567] ? drain_workqueue+0x380/0x380
[ 564.033574] worker_thread+0x87/0xb80
[ 564.062823] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 564.118042] ? __kthread_parkme+0xb6/0x180
[ 564.118046] ? process_one_work+0x1620/0x1620
[ 564.118048] kthread+0x326/0x3e0
[ 564.118050] ? kthread_create_worker_on_cpu+0xc0/0xc0
[ 564.167066] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 564.252441] ret_from_fork+0x3a/0x50
[ 564.252447] Modules linked in: msr rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib i40iw configfs iw_cm ib_cm libiscsi scsi_transport_iscsi mlx4_ib ib_uverbs mlx4_en ib_core nls_iso8859_1 nls_cp437 vfat fat intel_rapl skx_edac x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ses raid0 aesni_intel cdc_ether enclosure usbnet ipmi_ssif joydev aes_x86_64 i40e scsi_transport_sas mii bcache md_mod crypto_simd mei_me ioatdma crc64 ptp cryptd pcspkr i2c_i801 mlx4_core glue_helper pps_core mei lpc_ich dca wmi ipmi_si ipmi_devintf nd_pmem dax_pmem nd_btt ipmi_msghandler device_dax pcc_cpufreq button hid_generic usbhid mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect xhci_pci sysimgblt fb_sys_fops xhci_hcd ttm megaraid_sas drm usbcore nfit libnvdimm sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
[ 564.299390] bcache: bch_count_io_errors() nvme0n1: IO error on writing btree.
[ 564.348360] CR2: 000000000000001c
[ 564.348362] ---[ end trace b7f0e5cc7b2103b0 ]---
Therefore, it is not enough to only check whether c->gc_thread is NULL,
we should use IS_ERR_OR_NULL() to check both NULL pointer and error
value.
This patch changes the above buggy code piece in this way,
if (!IS_ERR_OR_NULL(c->gc_thread))
kthread_stop(c->gc_thread);
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When gc is running, user space I/O processes may wait inside
bcache code, so no new I/O coming. Indeed this is not a real idle
time, maximum writeback rate should not be set in such situation.
Otherwise a faster writeback thread may compete locks with gc thread
and makes garbage collection slower, which results a longer I/O
freeze period.
This patch checks c->gc_mark_valid in set_at_max_writeback_rate(). If
c->gc_mark_valid is 0 (gc running), set_at_max_writeback_rate() returns
false, then update_writeback_rate() will not set writeback rate to
maximum value even c->idle_counter reaches an idle threshold.
Now writeback thread won't interfere gc thread performance.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The WARN_ON() macro doesn't take an error message, it just takes a
condition. I've changed this to use WARN(1, "...") instead.
Fixes: 3e148a320979 ("md/raid1: fix potential data inconsistency issue with write behind device")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Pull NVMe updates from Christoph:
"A large chunk of NVMe updates for 5.3. Highlights:
- improved PCIe suspent support (Keith Busch)
- error injection support for the admin queue (Akinobu Mita)
- Fibre Channel discovery improvements (James Smart)
- tracing improvements including nvmetc tracing support (Minwoo Im)
- misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
Kulkarni)"
* 'nvme-5.3' of git://git.infradead.org/nvme: (26 commits)
Documentation: nvme: add an example for nvme fault injection
nvme: enable to inject errors into admin commands
nvme: prepare for fault injection into admin commands
nvmet: introduce target-side trace
nvme-trace: print result and status in hex format
nvme-trace: support for fabrics commands in host-side
nvme-trace: move opcode symbol print to nvme.h
nvme-trace: do not export nvme_trace_disk_name
nvme-pci: clean up nvme_remove_dead_ctrl a bit
nvme-pci: properly report state change failure in nvme_reset_work
nvme-pci: set the errno on ctrl state change error
nvme-pci: adjust irq max_vector using num_possible_cpus()
nvme-pci: remove queue_count_ops for write_queues and poll_queues
nvme-pci: remove unnecessary zero for static var
nvme-pci: use host managed power state for suspend
nvme: introduce nvme_is_fabrics to check fabrics cmd
nvme: export get and set features
nvme: fix possible io failures when removing multipathed ns
nvme-fc: add message when creating new association
lpfc: add sysfs interface to post NVME RSCN
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fix from Joerg Roedel:
"Revert a commit from the previous pile of fixes which causes new
lockdep splats. It is better to revert it for now and work on a better
and more well tested fix"
* tag 'iommu-fix-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
|
|
device_domain_lock"
This reverts commit 7560cc3ca7d9d11555f80c830544e463fcdb28b8.
With 5.2.0-rc5 I can easily trigger this with lockdep and iommu=pt:
======================================================
WARNING: possible circular locking dependency detected
5.2.0-rc5 #78 Not tainted
------------------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000ea2b3beb (&(&iommu->lock)->rlock){+.+.}, at: domain_context_mapping_one+0xa5/0x4e0
but task is already holding lock:
00000000a681907b (device_domain_lock){....}, at: domain_context_mapping_one+0x8d/0x4e0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (device_domain_lock){....}:
_raw_spin_lock_irqsave+0x3c/0x50
dmar_insert_one_dev_info+0xbb/0x510
domain_add_dev_info+0x50/0x90
dev_prepare_static_identity_mapping+0x30/0x68
intel_iommu_init+0xddd/0x1422
pci_iommu_init+0x16/0x3f
do_one_initcall+0x5d/0x2b4
kernel_init_freeable+0x218/0x2c1
kernel_init+0xa/0x100
ret_from_fork+0x3a/0x50
-> #0 (&(&iommu->lock)->rlock){+.+.}:
lock_acquire+0x9e/0x170
_raw_spin_lock+0x25/0x30
domain_context_mapping_one+0xa5/0x4e0
pci_for_each_dma_alias+0x30/0x140
dmar_insert_one_dev_info+0x3b2/0x510
domain_add_dev_info+0x50/0x90
dev_prepare_static_identity_mapping+0x30/0x68
intel_iommu_init+0xddd/0x1422
pci_iommu_init+0x16/0x3f
do_one_initcall+0x5d/0x2b4
kernel_init_freeable+0x218/0x2c1
kernel_init+0xa/0x100
ret_from_fork+0x3a/0x50
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(device_domain_lock);
lock(&(&iommu->lock)->rlock);
lock(device_domain_lock);
lock(&(&iommu->lock)->rlock);
*** DEADLOCK ***
2 locks held by swapper/0/1:
#0: 00000000033eb13d (dmar_global_lock){++++}, at: intel_iommu_init+0x1e0/0x1422
#1: 00000000a681907b (device_domain_lock){....}, at: domain_context_mapping_one+0x8d/0x4e0
stack backtrace:
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5 #78
Hardware name: LENOVO 20KGS35G01/20KGS35G01, BIOS N23ET50W (1.25 ) 06/25/2018
Call Trace:
dump_stack+0x85/0xc0
print_circular_bug.cold.57+0x15c/0x195
__lock_acquire+0x152a/0x1710
lock_acquire+0x9e/0x170
? domain_context_mapping_one+0xa5/0x4e0
_raw_spin_lock+0x25/0x30
? domain_context_mapping_one+0xa5/0x4e0
domain_context_mapping_one+0xa5/0x4e0
? domain_context_mapping_one+0x4e0/0x4e0
pci_for_each_dma_alias+0x30/0x140
dmar_insert_one_dev_info+0x3b2/0x510
domain_add_dev_info+0x50/0x90
dev_prepare_static_identity_mapping+0x30/0x68
intel_iommu_init+0xddd/0x1422
? printk+0x58/0x6f
? lockdep_hardirqs_on+0xf0/0x180
? do_early_param+0x8e/0x8e
? e820__memblock_setup+0x63/0x63
pci_iommu_init+0x16/0x3f
do_one_initcall+0x5d/0x2b4
? do_early_param+0x8e/0x8e
? rcu_read_lock_sched_held+0x55/0x60
? do_early_param+0x8e/0x8e
kernel_init_freeable+0x218/0x2c1
? rest_init+0x230/0x230
kernel_init+0xa/0x100
ret_from_fork+0x3a/0x50
domain_context_mapping_one() is taking device_domain_lock first then
iommu lock, while dmar_insert_one_dev_info() is doing the reverse.
That should be introduced by commit:
7560cc3ca7d9 ("iommu/vt-d: Fix lock inversion between iommu->lock and
device_domain_lock", 2019-05-27)
So far I still cannot figure out how the previous deadlock was
triggered (I cannot find iommu lock taken before calling of
iommu_flush_dev_iotlb()), however I'm pretty sure that that change
should be incomplete at least because it does not fix all the places
so we're still taking the locks in different orders, while reverting
that commit is very clean to me so far that we should always take
device_domain_lock first then the iommu lock.
We can continue to try to find the real culprit mentioned in
7560cc3ca7d9, but for now I think we should revert it to fix current
breakage.
CC: Joerg Roedel <joro@8bytes.org>
CC: Lu Baolu <baolu.lu@linux.intel.com>
CC: dave.jiang@intel.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fix from Bjorn Helgaas:
"If an IOMMU is present, ignore the P2PDMA whitelist we added for v5.2
because we don't yet know how to support P2PDMA in that case (Logan
Gunthorpe)"
* tag 'pci-v5.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three driver fixes (and one version number update): a suspend hang in
ufs, a qla hard lock on module removal and a qedi panic during
discovery"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla2xxx: Fix hardlockup in abort command during driver remove
scsi: ufs: Avoid runtime suspend possibly being blocked forever
scsi: qedi: update driver version to 8.37.0.20
scsi: qedi: Check targetname while finding boot target information
|
|
Pull networking fixes from David Miller:
1) Fix leak of unqueued fragments in ipv6 nf_defrag, from Guillaume
Nault.
2) Don't access the DDM interface unless the transceiver implements it
in bnx2x, from Mauro S. M. Rodrigues.
3) Don't double fetch 'len' from userspace in sock_getsockopt(), from
JingYi Hou.
4) Sign extension overflow in lio_core, from Colin Ian King.
5) Various netem bug fixes wrt. corrupted packets from Jakub Kicinski.
6) Fix epollout hang in hvsock, from Sunil Muthuswamy.
7) Fix regression in default fib6_type, from David Ahern.
8) Handle memory limits in tcp_fragment more appropriately, from Eric
Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits)
tcp: refine memory limit test in tcp_fragment()
inet: clear num_timeout reqsk_alloc()
net: mvpp2: debugfs: Add pmap to fs dump
ipv6: Default fib6_type to RTN_UNICAST when not set
net: hns3: Fix inconsistent indenting
net/af_iucv: always register net_device notifier
net/af_iucv: build proper skbs for HiperTransport
net/af_iucv: remove GFP_DMA restriction for HiperTransport
net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge()
hvsock: fix epollout hang from race condition
net/udp_gso: Allow TX timestamp with UDP GSO
net: netem: fix use after free and double free with packet corruption
net: netem: fix backlog accounting for corrupted GSO frames
net: lio_core: fix potential sign-extension overflow on large shift
tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb
ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL
ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL
tun: wake up waitqueues after IFF_UP is set
net: remove duplicate fetch in sock_getsockopt
tipc: fix issues with early FAILOVER_MSG from peer
...
|
|
Pull rdma fixes from Doug Ledford:
"This is probably our last -rc pull request. We don't have anything
else outstanding at the moment anyway, and with the summer months on
us and people taking trips, I expect the next weeks leading up to the
merge window to be pretty calm and sedate.
This has two simple, no brainer fixes for the EFA driver.
Then it has ten not quite so simple fixes for the hfi1 driver. The
problem with them is that they aren't simply one liner typo fixes.
They're still fixes, but they're more complex issues like livelock
under heavy load where the answer was to change work queue usage and
spinlock usage to resolve the problem, or issues with orphaned
requests during certain types of failures like link down which
required some more complex work to fix too. They all look like
legitimate fixes to me, they just aren't small like I wish they were.
Summary:
- 2 minor EFA fixes
- 10 hfi1 fixes related to scaling issues"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/efa: Handle mmap insertions overflow
RDMA/efa: Fix success return value in case of error
IB/hfi1: Handle port down properly in pio
IB/hfi1: Handle wakeup of orphaned QPs for pio
IB/hfi1: Wakeup QPs orphaned on wait list after flush
IB/hfi1: Use aborts to trigger RC throttling
IB/hfi1: Create inline to get extended headers
IB/hfi1: Silence txreq allocation warnings
IB/hfi1: Avoid hardlockup with flushlist_lock
IB/hfi1: Correct tid qp rcd to match verbs context
IB/hfi1: Close PSM sdma_progress sleep window
IB/hfi1: Validate fault injection opcode user input
|