<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/lib, branch v6.1.20</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.20</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.20'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2023-03-10T08:34:34+00:00</updated>
<entry>
<title>sbitmap: Try each queue to wake up at least one waiter</title>
<updated>2023-03-10T08:34:34+00:00</updated>
<author>
<name>Gabriel Krisman Bertazi</name>
<email>krisman@suse.de</email>
</author>
<published>2022-11-15T22:45:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ff9571a4dee91f0ca50bba996f470eb382a4fc3a'/>
<id>urn:sha1:ff9571a4dee91f0ca50bba996f470eb382a4fc3a</id>
<content type='text'>
commit 26edb30dd1c0c9be11fa676b4f330ada7b794ba6 upstream.

Jan reported that the new algorithm as merged might be problematic if
the queue being awakened becomes empty between the waitqueue_active
check inside sbq_wake_ptr and the wake up.  If that happens, wake_up_nr
will not wake up any waiter and we lose too many wake ups.  In order to
guarantee progress, we need to wake up at least one waiter here, if
there are any.  This now requires trying to wake up from every queue.

Instead of walking through all the queues in the caller of
sbq_wake_ptr, this patch moves the wake up inside that function.  In a
previous version of the patch, I found that updating wake_index several
times while walking through the queues had a measurable overhead.  This
version ensures we only update it once, at the end.
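
A sketch of the resulting flow (simplified, not the literal diff; the
helper below is a hypothetical condensation of the real
sbq_wake_ptr/wake-up interplay):

static void sbq_wake_one(struct sbitmap_queue *sbq)
{
        int wake_index = atomic_read(&amp;sbq-&gt;wake_index);
        int i;

        /* Try every wait queue once, starting at wake_index, so at
         * least one waiter is woken if any queue has waiters.
         */
        for (i = 0; i &lt; SBQ_WAIT_QUEUES; i++) {
                struct sbq_wait_state *ws = &amp;sbq-&gt;ws[wake_index];

                wake_index = sbq_index_inc(wake_index);
                if (waitqueue_active(&amp;ws-&gt;wait)) {
                        wake_up_nr(&amp;ws-&gt;wait, 1);
                        break;
                }
        }

        /* Write wake_index back only once, at the end. */
        if (wake_index != atomic_read(&amp;sbq-&gt;wake_index))
                atomic_set(&amp;sbq-&gt;wake_index, wake_index);
}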

Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags")
Reported-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Link: https://lore.kernel.org/r/20221115224553.23594-4-krisman@suse.de
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: Advance the queue index before waking up a queue</title>
<updated>2023-03-10T08:34:34+00:00</updated>
<author>
<name>Gabriel Krisman Bertazi</name>
<email>krisman@suse.de</email>
</author>
<published>2022-11-15T22:45:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=12815a7d8f8231f53e075087ed7c6423f8c74d20'/>
<id>urn:sha1:12815a7d8f8231f53e075087ed7c6423f8c74d20</id>
<content type='text'>
commit 976570b4ecd30d3ec6e1b0910da8e5edc591f2b6 upstream.

When a queue is awakened, the wake_index written by sbq_wake_ptr
currently keeps pointing to the same queue.  On the next wake up, it
will thus retry the same queue, which is unfair to other queues and can
lead to starvation.  This patch moves the index update to happen before
the queue is returned, such that the next wake up will try a different
queue first, improving fairness.
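
The gist of the reordering inside sbq_wake_ptr, sketched (the
surrounding function is condensed):

for (i = 0; i &lt; SBQ_WAIT_QUEUES; i++) {
        struct sbq_wait_state *ws = &amp;sbq-&gt;ws[wake_index];

        /*
         * Advance the index before checking the current queue, so
         * the next wake up starts from a different queue even when
         * this queue turns out to be the one returned.
         */
        wake_index = sbq_index_inc(wake_index);

        if (waitqueue_active(&amp;ws-&gt;wait)) {
                if (wake_index != atomic_read(&amp;sbq-&gt;wake_index))
                        atomic_set(&amp;sbq-&gt;wake_index, wake_index);
                return ws;
        }
}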

Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags")
Reported-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Link: https://lore.kernel.org/r/20221115224553.23594-2-krisman@suse.de
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG</title>
<updated>2023-03-10T08:33:47+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-01-26T15:08:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b78434f6eee0e6d7da3e301cfc33184a98edb08b'/>
<id>urn:sha1:b78434f6eee0e6d7da3e301cfc33184a98edb08b</id>
<content type='text'>
[ Upstream commit 5a5d7e9badd2cb8065db171961bd30bd3595e4b6 ]

In order to avoid WARN/BUG from generating nested or even recursive
warnings, force rcu_is_watching() to return true during
WARN/lockdep_rcu_suspicious().

Notably, things like unwinding the stack can trigger rcu_dereference()
warnings, which then trigger more unwinding, which then triggers more
warnings, and so on.
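
The shape of the usage, sketched against lockdep_rcu_suspicious()
(warn_rcu_enter()/warn_rcu_exit() are assumed to be the bracketing
helpers this patch introduces; their bodies are elided):

void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
{
        /* Force rcu_is_watching() true so that dumping the stack
         * below cannot itself trigger a nested rcu_dereference()
         * warning.
         */
        warn_rcu_enter();
        /* ... print the splat and dump the stack ... */
        warn_rcu_exit();
}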

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230126151323.408156109@infradead.org
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>kobject: Fix slab-out-of-bounds in fill_kobj_path()</title>
<updated>2023-03-10T08:33:30+00:00</updated>
<author>
<name>Wang Hai</name>
<email>wanghai38@huawei.com</email>
</author>
<published>2022-12-20T01:21:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fe4dd80d58ec5633daf5d50671d1341f738508bf'/>
<id>urn:sha1:fe4dd80d58ec5633daf5d50671d1341f738508bf</id>
<content type='text'>
[ Upstream commit 3bb2a01caa813d3a1845d378bbe4169ef280d394 ]

In kobject_get_path(), if kobj-&gt;name is changed between the calls to
get_kobj_path_length() and fill_kobj_path() and the length becomes
longer, then fill_kobj_path() will write out of bounds.

The problem currently manifests during ixgbe probe.

In ixgbe_mii_bus_init(), if netdev-&gt;dev.kobj.name becomes longer in
the meantime, an out-of-bounds write occurs.

cpu0                                         cpu1
ixgbe_probe
 register_netdev(netdev)
  netdev_register_kobject
   device_add
    kobject_uevent // Sending ADD events
                                             systemd-udevd // rename netdev
                                              dev_change_name
                                               device_rename
                                                kobject_rename
 ixgbe_mii_bus_init                             |
  mdiobus_register                              |
   __mdiobus_register                           |
    device_register                             |
     device_add                                 |
      kobject_uevent                            |
       kobject_get_path                         |
        len = get_kobj_path_length // old name  |
        path = kzalloc(len, gfp_mask);          |
                                                kobj-&gt;name = name;
                                                /* name length becomes
                                                 * longer
                                                 */
        fill_kobj_path /* kobj path length is
                        * longer than path,
                        * resulting in out of
                        * bounds when filling path
                        */

This is the kasan report:

==================================================================
BUG: KASAN: slab-out-of-bounds in fill_kobj_path+0x50/0xc0
Write of size 7 at addr ff1100090573d1fd by task kworker/28:1/673

 Workqueue: events work_for_cpu_fn
 Call Trace:
 &lt;TASK&gt;
 dump_stack_lvl+0x34/0x48
 print_address_description.constprop.0+0x86/0x1e7
 print_report+0x36/0x4f
 kasan_report+0xad/0x130
 kasan_check_range+0x35/0x1c0
 memcpy+0x39/0x60
 fill_kobj_path+0x50/0xc0
 kobject_get_path+0x5a/0xc0
 kobject_uevent_env+0x140/0x460
 device_add+0x5c7/0x910
 __mdiobus_register+0x14e/0x490
 ixgbe_probe.cold+0x441/0x574 [ixgbe]
 local_pci_probe+0x78/0xc0
 work_for_cpu_fn+0x26/0x40
 process_one_work+0x3b6/0x6a0
 worker_thread+0x368/0x520
 kthread+0x165/0x1a0
 ret_from_fork+0x1f/0x30

This reproducer triggers that bug:

while :
do
    rmmod ixgbe
    sleep 0.5
    modprobe ixgbe
    sleep 0.5
done

Fix this by making fill_kobj_path() return failure when the name
length of the kobj has grown in the meantime, and retrying from the
length calculation in that case.
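
Condensed, the resulting retry loop in kobject_get_path() looks like
this (a sketch of the fix, with fill_kobj_path() assumed to return an
error once the names no longer fit the buffer):

char *kobject_get_path(const struct kobject *kobj, gfp_t gfp_mask)
{
        char *path;
        int len;

retry:
        len = get_kobj_path_length(kobj);
        if (len == 0)
                return NULL;
        path = kzalloc(len, gfp_mask);
        if (!path)
                return NULL;
        /* A concurrent rename can grow a name after len was
         * computed; detect that and start over rather than writing
         * past the end of path.
         */
        if (fill_kobj_path(kobj, path, len)) {
                kfree(path);
                goto retry;
        }

        return path;
}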

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Wang Hai &lt;wanghai38@huawei.com&gt;
Link: https://lore.kernel.org/r/20221220012143.52141-1-wanghai38@huawei.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>kobject: modify kobject_get_path() to take a const *</title>
<updated>2023-03-10T08:33:29+00:00</updated>
<author>
<name>Greg Kroah-Hartman</name>
<email>gregkh@linuxfoundation.org</email>
</author>
<published>2022-10-01T16:53:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f3ffd86915ed478f83c57f1cc02ecc61a57637fe'/>
<id>urn:sha1:f3ffd86915ed478f83c57f1cc02ecc61a57637fe</id>
<content type='text'>
[ Upstream commit 33a0a1e3b3d17445832177981dc7a1c6a5b009f8 ]

kobject_get_path() does not modify the kobject passed to it, so make the
pointer constant.
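
The resulting prototype:

char *kobject_get_path(const struct kobject *kobj, gfp_t gfp_mask);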

Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Link: https://lore.kernel.org/r/20221001165315.2690141-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Stable-dep-of: 3bb2a01caa81 ("kobject: Fix slab-out-of-bounds in fill_kobj_path()")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>printf: fix errname.c list</title>
<updated>2023-03-10T08:33:27+00:00</updated>
<author>
<name>Arnd Bergmann</name>
<email>arnd@arndb.de</email>
</author>
<published>2023-02-06T19:40:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=adb2cfd3f2f412d617fa808f66ad8aac60496ba2'/>
<id>urn:sha1:adb2cfd3f2f412d617fa808f66ad8aac60496ba2</id>
<content type='text'>
[ Upstream commit 0c2baf6509af1d11310ae4c1c839481a6e9a4bc4 ]

On most architectures, gcc -Wextra warns about the list of error
numbers containing both EDEADLK and EDEADLOCK:

lib/errname.c:15:67: warning: initialized field overwritten [-Woverride-init]
   15 | #define E(err) [err + BUILD_BUG_ON_ZERO(err &lt;= 0 || err &gt; 300)] = "-" #err
      |                                                                   ^~~
lib/errname.c:172:2: note: in expansion of macro 'E'
  172 |  E(EDEADLK), /* EDEADLOCK */
      |  ^

On parisc, a similar error happens with -ECANCELLED, which is an
alias for ECANCELED.

Make the EDEADLK printing conditional on the number being distinct
from EDEADLOCK, and remove the -ECANCELLED bit completely as it
can never be hit.

To ensure these are correct, add static_assert lines that verify
all the remaining aliases are in fact identical to the canonical
name.
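
A sketch of the approach (the alias set is architecture-dependent, so
treat this as illustrative rather than the exact diff):

E(EDEADLK),
#if EDEADLK != EDEADLOCK /* the two are distinct only on some arches */
E(EDEADLOCK),
#endif

and, for the aliases that stay out of the table, a compile-time check
that they really equal the canonical name, e.g.:

static_assert(EAGAIN == EWOULDBLOCK);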

Fixes: 57f5677e535b ("printf: add support for printing symbolic error names")
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Suggested-by: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Acked-by: Uwe Kleine-König &lt;uwe@kleine-koenig.org&gt;
Reviewed-by: Andy Shevchenko &lt;andy.shevchenko@gmail.com&gt;
Link: https://lore.kernel.org/all/20210514213456.745039-1-arnd@kernel.org/
Link: https://lore.kernel.org/all/20210927123409.1109737-1-arnd@kernel.org/
Signed-off-by: Arnd Bergmann &lt;arnd@arndb.de&gt;
Reviewed-by: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Acked-by: Rasmus Villemoes &lt;linux@rasmusvillemoes.dk&gt;
Reviewed-by: Petr Mladek &lt;pmladek@suse.com&gt;
Signed-off-by: Petr Mladek &lt;pmladek@suse.com&gt;
Link: https://lore.kernel.org/r/20230206194126.380350-1-arnd@kernel.org
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/mpi: Fix buffer overrun when SG is too long</title>
<updated>2023-03-10T08:32:52+00:00</updated>
<author>
<name>Herbert Xu</name>
<email>herbert@gondor.apana.org.au</email>
</author>
<published>2022-12-27T14:27:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=553d8b25cc5ef1742459904bca468363c9b401ec'/>
<id>urn:sha1:553d8b25cc5ef1742459904bca468363c9b401ec</id>
<content type='text'>
[ Upstream commit 7361d1bc307b926cbca214ab67b641123c2d6357 ]

The helper mpi_read_raw_from_sgl sets the number of entries in
the SG list according to nbytes.  However, if the last entry
in the SG list contains more data than nbytes, then it may overrun
the buffer because it only allocates enough memory for nbytes.
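
The core of the fix, sketched (an sg_miter copy loop condensed from
mpi_read_raw_from_sgl(); variable names assumed):

while (sg_miter_next(&amp;miter)) {
        buff = miter.addr;
        /* Never consume more of an SG entry than the bytes still
         * owed, so the nbytes-sized allocation cannot be overrun by
         * an oversized final entry.
         */
        len = min_t(unsigned int, miter.length, nbytes);
        nbytes -= len;
        /* ... copy len bytes into the limb array ... */
}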

Fixes: 2d4d1eea540b ("lib/mpi: Add mpi sgl helpers")
Reported-by: Roberto Sassu &lt;roberto.sassu@huaweicloud.com&gt;
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Herbert Xu &lt;herbert@gondor.apana.org.au&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: correct wake_batch recalculation to avoid potential IO hung</title>
<updated>2023-03-10T08:32:42+00:00</updated>
<author>
<name>Kemeng Shi</name>
<email>shikemeng@huaweicloud.com</email>
</author>
<published>2023-01-16T20:50:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3e5ddf2b8438612a095cf886eeaa95dced99ebe8'/>
<id>urn:sha1:3e5ddf2b8438612a095cf886eeaa95dced99ebe8</id>
<content type='text'>
[ Upstream commit b5fcf7871acb7f9a3a8ed341a68bd86aba3e254a ]

Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that in the case of shared tags there could be just one real
active hctx(queue), because of the lazy detection of tag idling. Driver
tag allocation may then wait forever on this real active hctx(queue) if
wake_batch is &gt; hctx_max_depth, where hctx_max_depth is the tag depth
available to the active hctx(queue). However, the condition wake_batch
&gt; hctx_max_depth is not strong enough to avoid an IO hang, as
sbitmap_queue_wake_up will only wake up one wait queue per wake_batch
even when there is only one waiter in the woken wait queue. After that,
there is only one tag left to free, and wake_batch may never be reached
again. The hang is bounded: the inactive hctx(queue) becomes truly idle
after at most 30 seconds and then calls blk_mq_tag_wakeup_all to wake
one waiter per wait queue and break the hang. But an IO hang of 30
seconds is still not acceptable. Fix this potential IO hang by setting
the batch size small enough that the depth of the shared hctx(queue) is
enough to wake up all of the wait queues, as sbq_calc_wake_batch does.

Note that although hctx_max_depth is clamped to at least 4 while the
wake_batch recalculation does not apply that clamp, wake_batch will
always be recalculated to 1 when hctx_max_depth &lt;= 4.
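
A condensed sketch of the recalculation (mirroring the clamp in
sbq_calc_wake_batch(); treat the details as illustrative):

void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
                                          unsigned int users)
{
        unsigned int wake_batch;
        unsigned int depth = (sbq-&gt;sb.depth + users - 1) / users;

        /* One batch per wait queue must fit within the per-queue
         * depth, or a waiter can end up waiting on completions that
         * will never arrive.
         */
        wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES, 1, SBQ_WAKE_BATCH);
        WRITE_ONCE(sbq-&gt;wake_batch, wake_batch);
}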

Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: Use single per-bitmap counting to wake up queued tags</title>
<updated>2023-03-10T08:32:42+00:00</updated>
<author>
<name>Gabriel Krisman Bertazi</name>
<email>krisman@suse.de</email>
</author>
<published>2022-11-05T23:10:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0f312961c7999f8d500301658b6e6bff967c4889'/>
<id>urn:sha1:0f312961c7999f8d500301658b6e6bff967c4889</id>
<content type='text'>
[ Upstream commit 4f8126bb2308066b877859e4b5923ffb54143630 ]

sbitmap suffers from code complexity, as demonstrated by recent fixes,
and from occasional lost wake ups on nested I/O completion.  The latter
happens, from what I understand, due to the non-atomic nature of the
updates to wait_cnt, which needs to be decremented and eventually reset
when equal to zero.  This two-step process can miss an update when a
nested completion happens to interrupt the CPU in between the wait_cnt
updates.  This is very hard to fix, as shown by the recent changes to
this code.

The code complexity arises mostly from the corner cases to avoid missed
wakes in this scenario.  In addition, the handling of wake_batch
recalculation plus the synchronization with sbq_queue_wake_up is
non-trivial.

This patchset implements the idea originally proposed by Jan [1], which
removes the need for the two-step updates of wait_cnt.  This is done by
tracking the number of completions and wakeups in always increasing,
per-bitmap counters.  Instead of having to reset the wait_cnt when it
reaches zero, we simply keep counting, and attempt to wake up N threads
in a single wait queue whenever there is enough space for a batch.
Waking up fewer than wake_batch threads shouldn't be a problem, because
we haven't changed the conditions for wake up, and the existing batch
calculation guarantees at least enough remaining completions to wake up
a batch for each queue at any time.
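
In code, the heart of the scheme looks roughly like this (a sketch;
field and helper names are assumed):

void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
{
        unsigned int wake_batch = READ_ONCE(sbq-&gt;wake_batch);
        unsigned int wakeups;

        if (!atomic_read(&amp;sbq-&gt;ws_active))
                return; /* nobody is queued: skip the atomics */

        /* Both counters only ever grow, so there is no reset step
         * for a nested completion to race against.
         */
        atomic_add(nr, &amp;sbq-&gt;completion_cnt);
        wakeups = atomic_read(&amp;sbq-&gt;wakeup_cnt);

        do {
                if (atomic_read(&amp;sbq-&gt;completion_cnt) - wakeups &lt; wake_batch)
                        return;
        } while (!atomic_try_cmpxchg(&amp;sbq-&gt;wakeup_cnt,
                                     &amp;wakeups, wakeups + wake_batch));

        /* Enough completions banked: wake one batch of waiters. */
        __sbitmap_queue_wake_up(sbq, wake_batch);
}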

Performance-wise, one should expect very similar performance to the
original algorithm for the case where there is no queueing.  In both the
old algorithm and this implementation, the first thing is to check
ws_active, which bails out if there is no queueing to be managed. In the
new code, we took care to avoid accounting completions and wakeups when
there is no queueing, to not pay the cost of atomic operations
unnecessarily, since it doesn't skew the numbers.

For more interesting cases, where there is queueing, we need to take
into account the cross-communication of the atomic operations.  I've
been benchmarking by running parallel fio jobs against a single hctx
nullb in different hardware queue depth scenarios, and verifying both
IOPS and queueing.

Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
varying only the hardware queue length per test.

queue size 2                 4                 8                 16                 32                 64
6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)

The following is a similar experiment, run against a nullb with a
single bitmap shared by 20 hctx spread across 2 NUMA nodes, with 40
parallel fio jobs operating on the same device.

queue size 2                 4                 8                 16                 32                 64
6.1-rc2    1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K)    6178.2K (124.6K)   12227.9K (37.7K)   13286.6K (92.9K)
patched    1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K)    6151.4K (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)

It has also survived blktests and a 12h-stress run against nullb. I also
ran the code against nvme and a scsi SSD, and I didn't observe
performance regression in those. If there are other tests you think I
should run, please let me know and I will follow up with results.

[1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/

Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Keith Busch &lt;kbusch@kernel.org&gt;
Cc: Liu Song &lt;liusong@linux.alibaba.com&gt;
Suggested-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Stable-dep-of: b5fcf7871acb ("sbitmap: correct wake_batch recalculation to avoid potential IO hung")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: remove redundant check in __sbitmap_queue_get_batch</title>
<updated>2023-03-10T08:32:42+00:00</updated>
<author>
<name>Kemeng Shi</name>
<email>shikemeng@huaweicloud.com</email>
</author>
<published>2023-01-16T20:50:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b2fbd1c9bd30f917e4e042dcd993ffe907b78cba'/>
<id>urn:sha1:b2fbd1c9bd30f917e4e042dcd993ffe907b78cba</id>
<content type='text'>
[ Upstream commit 903e86f3a64d9573352bbab2f211fdbbaa5772b7 ]

Commit fbb564a557809 ("lib/sbitmap: Fix invalid loop in
__sbitmap_queue_get_batch()") argued for "Checking free bits when
setting the target bits. Otherwise, it may reuse the busying bits."
That commit added a check that all masked bits in the word are zero
before the cmpxchg, which makes the pre-existing check after the
cmpxchg (whether any masked bit in the word is zero) redundant.

Actually, the old value of the word before the cmpxchg is stored in
val, and busy bits in val are filtered out by "(get_mask &amp; ~val)"
after the cmpxchg, so the busy bits mentioned in commit fbb564a557809
("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") cannot
be reused. Revert the newly-added check to remove the redundancy.
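
The surviving logic, condensed (a sketch of the relevant lines of
__sbitmap_queue_get_batch() after the revert):

atomic_long_t *ptr = (atomic_long_t *)&amp;map-&gt;word;
unsigned long val = READ_ONCE(map-&gt;word);

while (!atomic_long_try_cmpxchg(ptr, &amp;val, get_mask | val))
        ;
/* val holds the word as it was when the cmpxchg succeeded, so any
 * bit that was already busy is filtered out here; no zero check
 * before the cmpxchg is needed.
 */
get_mask = (get_mask &amp; ~val);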

Fixes: fbb564a55780 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()")
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
</feed>
