<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/lib/sbitmap.c, branch v6.1.168</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.168</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.1.168'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2024-10-17T13:21:10+00:00</updated>
<entry>
<title>lib/sbitmap: define swap_lock as raw_spinlock_t</title>
<updated>2024-10-17T13:21:10+00:00</updated>
<author>
<name>Ming Lei</name>
<email>ming.lei@redhat.com</email>
</author>
<published>2024-09-19T02:17:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f892a0cffc85ca10f36a95088f5fcadc0c3d8b57'/>
<id>urn:sha1:f892a0cffc85ca10f36a95088f5fcadc0c3d8b57</id>
<content type='text'>
[ Upstream commit 65f666c6203600053478ce8e34a1db269a8701c9 ]

When called from sbitmap_queue_get(), sbitmap_deferred_clear() may run
with preemption disabled. On an RT kernel, spin_lock() can sleep, which
triggers the warning "BUG: sleeping function called from invalid
context".

Fix it by converting swap_lock to a raw_spinlock_t and taking it with
raw_spin_lock.
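
A minimal sketch of the conversion (illustrative, not the exact upstream
diff; the other sbitmap_deferred_clear() parameters are omitted):

  struct sbitmap_word {
          unsigned long word;
          unsigned long cleared ____cacheline_aligned_in_smp;
          raw_spinlock_t swap_lock;       /* was: spinlock_t swap_lock */
  } ____cacheline_aligned_in_smp;

  static inline bool sbitmap_deferred_clear(struct sbitmap_word *map)
  {
          unsigned long mask, flags;
          bool ret = false;

          /* raw_spin_lock never sleeps, so this is safe on PREEMPT_RT
           * even when the caller has preemption disabled. */
          raw_spin_lock_irqsave(&amp;map-&gt;swap_lock, flags);
          if (map-&gt;cleared) {
                  mask = xchg(&amp;map-&gt;cleared, 0);
                  atomic_long_andnot(mask, (atomic_long_t *)&amp;map-&gt;word);
                  ret = true;
          }
          raw_spin_unlock_irqrestore(&amp;map-&gt;swap_lock, flags);
          return ret;
  }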

Cc: Yang Yang &lt;yang.yang@vivo.com&gt;
Fixes: 72d04bdcf3f7 ("sbitmap: fix io hung due to race on sbitmap_word::cleared")
Signed-off-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Reviewed-by: Yang Yang &lt;yang.yang@vivo.com&gt;
Link: https://lore.kernel.org/r/20240919021709.511329-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: fix io hung due to race on sbitmap_word::cleared</title>
<updated>2024-08-03T06:49:29+00:00</updated>
<author>
<name>Yang Yang</name>
<email>yang.yang@vivo.com</email>
</author>
<published>2024-07-16T08:26:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=681583ad673186b5dec793f4b8c4f1dd242b89a5'/>
<id>urn:sha1:681583ad673186b5dec793f4b8c4f1dd242b89a5</id>
<content type='text'>
[ Upstream commit 72d04bdcf3f7d7e07d82f9757946f68802a7270a ]

Configuration for sbq:
  depth=64, wake_batch=6, shift=6, map_nr=1

1. There are 64 requests in progress:
  map-&gt;word = 0xFFFFFFFFFFFFFFFF
2. After all the 64 requests complete, and no more requests come:
  map-&gt;word = 0xFFFFFFFFFFFFFFFF, map-&gt;cleared = 0xFFFFFFFFFFFFFFFF
3. Now two tasks try to allocate requests:
  T1:                                       T2:
  __blk_mq_get_tag                          .
  __sbitmap_queue_get                       .
  sbitmap_get                               .
  sbitmap_find_bit                          .
  sbitmap_find_bit_in_word                  .
  __sbitmap_get_word  -&gt; nr=-1              __blk_mq_get_tag
  sbitmap_deferred_clear                    __sbitmap_queue_get
  /* map-&gt;cleared=0xFFFFFFFFFFFFFFFF */     sbitmap_find_bit
    if (!READ_ONCE(map-&gt;cleared))           sbitmap_find_bit_in_word
      return false;                         __sbitmap_get_word -&gt; nr=-1
    mask = xchg(&amp;map-&gt;cleared, 0)           sbitmap_deferred_clear
    atomic_long_andnot()                    /* map-&gt;cleared=0 */
                                              if (!(map-&gt;cleared))
                                                return false;
                                     /*
                                      * map-&gt;cleared is cleared by T1
                                      * T2 fail to acquire the tag
                                      */

4. T2 is the sole tag waiter. When T1 puts the tag, T2 cannot be woken
up because wake_batch is set to 6. If no more requests come, T2 will
wait here indefinitely.

This patch achieves two purposes:
1. The check on -&gt;cleared and the updates to both -&gt;cleared and -&gt;word
need to be done atomically; a spinlock is the simplest way to ensure
that.
2. Add an extra check in sbitmap_deferred_clear() to identify whether
-&gt;word still has free bits.
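
A minimal sketch of the fixed helper, assuming a new swap_lock field in
struct sbitmap_word (simplified; the upstream patch also threads
depth/alloc_hint/wrap through the callers):

  static inline bool sbitmap_deferred_clear(struct sbitmap_word *map,
                                            unsigned int depth)
  {
          unsigned long mask, flags;
          bool ret;

          /* Make the check on -&gt;cleared and the updates to -&gt;cleared
           * and -&gt;word atomic with respect to other callers. */
          spin_lock_irqsave(&amp;map-&gt;swap_lock, flags);

          if (!map-&gt;cleared) {
                  /* Nothing deferred: report whether -&gt;word still has
                   * free bits, so a racing caller retries instead of
                   * failing and waiting forever. */
                  ret = depth &amp;&amp;
                        (~map-&gt;word &amp; ((~0UL) &gt;&gt; (BITS_PER_LONG - depth)));
          } else {
                  mask = xchg(&amp;map-&gt;cleared, 0);
                  atomic_long_andnot(mask, (atomic_long_t *)&amp;map-&gt;word);
                  ret = true;
          }

          spin_unlock_irqrestore(&amp;map-&gt;swap_lock, flags);
          return ret;
  }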

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: Yang Yang &lt;yang.yang@vivo.com&gt;
Reviewed-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Reviewed-by: Bart Van Assche &lt;bvanassche@acm.org&gt;
Link: https://lore.kernel.org/r/20240716082644.659566-1-yang.yang@vivo.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: use READ_ONCE to access map-&gt;word</title>
<updated>2024-08-03T06:49:29+00:00</updated>
<author>
<name>linke li</name>
<email>lilinke99@qq.com</email>
</author>
<published>2024-04-26T10:34:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=aba6f11e23f89e63ae7f2ceb22ffc35b8f359cc6'/>
<id>urn:sha1:aba6f11e23f89e63ae7f2ceb22ffc35b8f359cc6</id>
<content type='text'>
[ Upstream commit 6ad0d7e0f4b68f87a98ea2b239123b7d865df86b ]

In __sbitmap_queue_get_batch(), map-&gt;word is read several times and then
updated atomically using atomic_long_try_cmpxchg(), but the first two
reads of map-&gt;word are not protected.

This patch moves the statement val = READ_ONCE(map-&gt;word) forward,
eliminating unprotected accesses to map-&gt;word within the function.
It is aimed at reducing the number of benign races reported by KCSAN in
order to focus future debugging effort on harmful races.
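
A sketch of the reworked hot path (illustrative; the names match
lib/sbitmap.c but the surrounding loop is trimmed):

  unsigned long val = READ_ONCE(map-&gt;word);   /* single annotated read */

  /* The early "word already full" test and the free-bit search below
   * now use the snapshot in val instead of re-reading map-&gt;word
   * without an annotation. */
  nr = find_first_zero_bit(&amp;val, map_depth);
  if (nr + nr_tags &lt;= map_depth) {
          atomic_long_t *ptr = (atomic_long_t *)&amp;map-&gt;word;

          get_mask = ((1UL &lt;&lt; nr_tags) - 1) &lt;&lt; nr;
          /* On failure, try_cmpxchg refreshes val with the current word,
           * so the loop retries against the latest value. */
          while (!atomic_long_try_cmpxchg(ptr, &amp;val, get_mask | val))
                  ;
  }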

Signed-off-by: linke li &lt;lilinke99@qq.com&gt;
Link: https://lore.kernel.org/r/tencent_0B517C25E519D3D002194E8445E86C04AD0A@qq.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Stable-dep-of: 72d04bdcf3f7 ("sbitmap: fix io hung due to race on sbitmap_word::cleared")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code</title>
<updated>2024-08-03T06:49:29+00:00</updated>
<author>
<name>Kemeng Shi</name>
<email>shikemeng@huaweicloud.com</email>
</author>
<published>2023-01-16T20:50:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0de2fb1f78e98e4654c9e4544118dd1d1cf5bc48'/>
<id>urn:sha1:0de2fb1f78e98e4654c9e4544118dd1d1cf5bc48</id>
<content type='text'>
[ Upstream commit 08470a98a7d7e32c787b23b87353f13b03c23195 ]

Rewrite sbitmap_find_bit_in_index as follows:
1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word.
2. Accept a "struct sbitmap_word *" directly instead of taking a
"struct sbitmap *" and an "int index" and deriving the word from them.
3. Take depth/shallow_depth and wrap from the caller and pass them on to
__sbitmap_get_word, so the helper serves both __sbitmap_get_shallow and
__sbitmap_get.

With the helper function sbitmap_find_bit_in_word, we can remove the
repeated code in __sbitmap_get_shallow that finds a bit while taking the
deferred clear into account, as sketched below.
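
A sketch of the resulting helper (illustrative; it relies on the
pre-existing __sbitmap_get_word and sbitmap_deferred_clear helpers):

  static int sbitmap_find_bit_in_word(struct sbitmap_word *map,
                                      unsigned int depth,
                                      unsigned int alloc_hint,
                                      bool wrap)
  {
          int nr;

          do {
                  nr = __sbitmap_get_word(&amp;map-&gt;word, depth,
                                          alloc_hint, wrap);
                  if (nr != -1)
                          break;
                  /* No free bit: fold -&gt;cleared back into -&gt;word and
                   * retry; give up once there is nothing left to clear. */
                  if (!sbitmap_deferred_clear(map))
                          break;
          } while (1);

          return nr;
  }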

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Stable-dep-of: 72d04bdcf3f7 ("sbitmap: fix io hung due to race on sbitmap_word::cleared")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow</title>
<updated>2024-08-03T06:49:29+00:00</updated>
<author>
<name>Kemeng Shi</name>
<email>shikemeng@huaweicloud.com</email>
</author>
<published>2023-01-16T20:50:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fda080767ccde23b07e614ba2e2ee9a34287137a'/>
<id>urn:sha1:fda080767ccde23b07e614ba2e2ee9a34287137a</id>
<content type='text'>
[ Upstream commit f1591a8bb3e02713f4ee2efe20df0d84ed80da48 ]

Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly
pointless and equivalent to setting alloc_hint to zero (because
SB_NR_TO_BIT() considers only the low sb-&gt;shift bits of alloc_hint). So
simplify the logic.
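
An illustrative before/after of the "jump to the next word" step in the
loop (simplified from lib/sbitmap.c):

  /* Before: recompute a hint that SB_NR_TO_BIT() masks down to the low
   * sb-&gt;shift bits, i.e. to 0 for any index-aligned value. */
  index++;
  alloc_hint = index &lt;&lt; sb-&gt;shift;
  if (index &gt;= sb-&gt;map_nr) {
          index = 0;
          alloc_hint = 0;
  }

  /* After: once we move past the starting word, just begin at bit 0. */
  alloc_hint = 0;
  if (++index &gt;= sb-&gt;map_nr)
          index = 0;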

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Stable-dep-of: 72d04bdcf3f7 ("sbitmap: fix io hung due to race on sbitmap_word::cleared")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: Try each queue to wake up at least one waiter</title>
<updated>2023-03-10T08:34:34+00:00</updated>
<author>
<name>Gabriel Krisman Bertazi</name>
<email>krisman@suse.de</email>
</author>
<published>2022-11-15T22:45:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ff9571a4dee91f0ca50bba996f470eb382a4fc3a'/>
<id>urn:sha1:ff9571a4dee91f0ca50bba996f470eb382a4fc3a</id>
<content type='text'>
commit 26edb30dd1c0c9be11fa676b4f330ada7b794ba6 upstream.

Jan reported that the new algorithm, as merged, might be problematic if
the queue being awakened becomes empty between the waitqueue_active
check inside sbq_wake_ptr and the wake up.  If that happens, wake_up_nr
will not wake up any waiter and we lose too many wake ups.  In order to
guarantee progress, we need to wake up at least one waiter here, if
there are any.  This now requires trying to wake up from every queue.

Instead of walking through all the queues with sbq_wake_ptr, this patch
moves the wake up inside that function.  In a previous version of the
patch, I found that updating wake_index several times when walking
through the queues had a measurable overhead.  This version ensures we
only update it once, at the end.
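
A rough sketch of the resulting wake-up walk (field names follow struct
sbitmap_queue; simplified from the upstream change, and it assumes
wake_up_nr() reports how many tasks it woke, which the same series adds):

  wake_index = READ_ONCE(sbq-&gt;wake_index);
  for (i = 0; i &lt; SBQ_WAIT_QUEUES; i++) {
          struct sbq_wait_state *ws = &amp;sbq-&gt;ws[wake_index];

          /* Advance before waking so the next wake up starts from a
           * different queue. */
          wake_index = sbq_index_inc(wake_index);

          /* Wake at least one waiter to guarantee forward progress,
           * even if an earlier queue went empty under us. */
          if (waitqueue_active(&amp;ws-&gt;wait) &amp;&amp;
              wake_up_nr(&amp;ws-&gt;wait, wake_batch))
                  break;
  }

  /* Write wake_index only once, after the walk, to avoid the overhead
   * of repeated updates. */
  if (wake_index != READ_ONCE(sbq-&gt;wake_index))
          WRITE_ONCE(sbq-&gt;wake_index, wake_index);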

Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags")
Reported-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Link: https://lore.kernel.org/r/20221115224553.23594-4-krisman@suse.de
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: Advance the queue index before waking up a queue</title>
<updated>2023-03-10T08:34:34+00:00</updated>
<author>
<name>Gabriel Krisman Bertazi</name>
<email>krisman@suse.de</email>
</author>
<published>2022-11-15T22:45:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=12815a7d8f8231f53e075087ed7c6423f8c74d20'/>
<id>urn:sha1:12815a7d8f8231f53e075087ed7c6423f8c74d20</id>
<content type='text'>
commit 976570b4ecd30d3ec6e1b0910da8e5edc591f2b6 upstream.

When a queue is awakened, the wake_index written by sbq_wake_ptr currently
keeps pointing to the same queue.  On the next wake up, it will thus
retry the same queue, which is unfair to other queues and can lead to
starvation.  This patch moves the index update so that it happens before
the queue is returned, such that the next wake up will try a different
queue first, improving fairness.
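
A sketch of the reordering inside sbq_wake_ptr (illustrative, not the
exact diff):

  for (i = 0; i &lt; SBQ_WAIT_QUEUES; i++) {
          struct sbq_wait_state *ws = &amp;sbq-&gt;ws[wake_index];

          /* Advance wake_index *before* inspecting this queue, so the
           * next wake up starts at the following queue even when we
           * end up returning this one. */
          wake_index = sbq_index_inc(wake_index);

          if (waitqueue_active(&amp;ws-&gt;wait)) {
                  if (wake_index != atomic_read(&amp;sbq-&gt;wake_index))
                          atomic_set(&amp;sbq-&gt;wake_index, wake_index);
                  return ws;
          }
  }

  return NULL;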

Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags")
Reported-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Link: https://lore.kernel.org/r/20221115224553.23594-2-krisman@suse.de
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: correct wake_batch recalculation to avoid potential IO hung</title>
<updated>2023-03-10T08:32:42+00:00</updated>
<author>
<name>Kemeng Shi</name>
<email>shikemeng@huaweicloud.com</email>
</author>
<published>2023-01-16T20:50:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3e5ddf2b8438612a095cf886eeaa95dced99ebe8'/>
<id>urn:sha1:3e5ddf2b8438612a095cf886eeaa95dced99ebe8</id>
<content type='text'>
[ Upstream commit b5fcf7871acb7f9a3a8ed341a68bd86aba3e254a ]

Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that, in the case of shared tags, there could be just one real
active hctx (queue) because of the lazy detection of tag idle. Driver tag
allocation may then wait forever on this real active hctx (queue) if
wake_batch is &gt; hctx_max_depth, where hctx_max_depth is the available
tag depth for the active hctx (queue). However, the condition wake_batch
&gt; hctx_max_depth is not strong enough to avoid an IO hang, as
sbitmap_queue_wake_up will only wake up one wait queue for each
wake_batch, even though there is only one waiter in the woken wait
queue. After that, there is only one tag left to free and wake_batch may
never be reached again.

Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
mentioned that driver tag allocation may wait forever. Actually, the
inactive hctx (queue) becomes truly idle after at most 30 seconds and
then calls blk_mq_tag_wakeup_all to wake one waiter per wait queue,
which breaks the hang. But an IO hang of 30 seconds is still not
acceptable. Fix this potential IO hang by setting the batch size small
enough that the depth of the shared hctx (queue) suffices to wake up all
of the queues, just as sbq_calc_wake_batch does.

Although hctx_max_depth will be clamped to at least 4 while the
wake_batch recalculation does not do that clamp, the wake_batch will
always be recalculated to 1 when hctx_max_depth &lt;= 4.
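
A sketch of the corrected recalculation, dividing the per-user depth
across the wait queues the same way sbq_calc_wake_batch does
(illustrative; the real helper lives in lib/sbitmap.c):

  void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
                                            unsigned int users)
  {
          unsigned int wake_batch;
          unsigned int depth = (sbq-&gt;sb.depth + users - 1) / users;

          /* Keep the batch small enough that the shared depth can wake
           * every wait queue, clamped to at least 1. */
          wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES, 1, SBQ_WAKE_BATCH);

          WRITE_ONCE(sbq-&gt;wake_batch, wake_batch);
  }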

Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Link: https://lore.kernel.org/r/20230116205059.3821738-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: Use single per-bitmap counting to wake up queued tags</title>
<updated>2023-03-10T08:32:42+00:00</updated>
<author>
<name>Gabriel Krisman Bertazi</name>
<email>krisman@suse.de</email>
</author>
<published>2022-11-05T23:10:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0f312961c7999f8d500301658b6e6bff967c4889'/>
<id>urn:sha1:0f312961c7999f8d500301658b6e6bff967c4889</id>
<content type='text'>
[ Upstream commit 4f8126bb2308066b877859e4b5923ffb54143630 ]

sbitmap suffers from code complexity, as demonstrated by recent fixes,
and from occasional lost wake ups on nested I/O completion.  The latter
happens, from what I understand, due to the non-atomic nature of the
updates to wait_cnt, which needs to be decremented and eventually reset
when it reaches zero.  This two-step process can miss an update when a
nested completion happens to interrupt the CPU in between the wait_cnt
updates.  This is very hard to fix, as shown by the recent changes to
this code.

The code complexity arises mostly from the corner cases to avoid missed
wakes in this scenario.  In addition, the handling of wake_batch
recalculation plus the synchronization with sbq_queue_wake_up is
non-trivial.

This patchset implements the idea originally proposed by Jan [1], which
removes the need for the two-step updates of wait_cnt.  This is done by
tracking the number of completions and wakeups in always-increasing,
per-bitmap counters.  Instead of having to reset wait_cnt when it
reaches zero, we simply keep counting and attempt to wake up N threads
in a single wait queue whenever there is enough space for a batch.
Waking up fewer than wake_batch threads shouldn't be a problem, because
we haven't changed the conditions for wake up, and the existing batch
calculation guarantees at least enough remaining completions to wake up
a batch for each queue at any time.
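
A condensed sketch of the mechanism (counter names follow the upstream
patch; locking details and the actual wake-up of waiters are trimmed):

  struct sbitmap_queue {
          /* ... existing fields ... */
          atomic_t completion_cnt;   /* total completions, never reset */
          atomic_t wakeup_cnt;       /* total wake ups issued, never reset */
  };

  static void __sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
  {
          unsigned int wake_batch = READ_ONCE(sbq-&gt;wake_batch);
          int wakeups;

          if (!atomic_read(&amp;sbq-&gt;ws_active))
                  return;         /* no one is queued, skip the atomics */

          atomic_add(nr, &amp;sbq-&gt;completion_cnt);
          wakeups = atomic_read(&amp;sbq-&gt;wakeup_cnt);

          /* Claim a batch only when enough completions have accumulated
           * since the last claimed batch; neither counter is ever reset. */
          do {
                  if (atomic_read(&amp;sbq-&gt;completion_cnt) - wakeups &lt; wake_batch)
                          return;
          } while (!atomic_try_cmpxchg(&amp;sbq-&gt;wakeup_cnt,
                                       &amp;wakeups, wakeups + wake_batch));

          /* ... wake up wake_batch waiters on one wait queue ... */
  }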

Performance-wise, one should expect very similar performance to the
original algorithm for the case where there is no queueing.  In both the
old algorithm and this implementation, the first thing done is to check
ws_active, which bails out if there is no queueing to be managed.  In the
new code, we took care to avoid accounting completions and wakeups when
there is no queueing, so as not to pay the cost of atomic operations
unnecessarily; skipping them does not skew the numbers.

For more interesting cases, where there is queueing, we need to take
into account the cross-communication of the atomic operations.  I've
been benchmarking by running parallel fio jobs against a single hctx
nullb in different hardware queue depth scenarios, and verifying both
IOPS and queueing.

Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
varying only the hardware queue length per test.

queue size 2                 4                 8                 16                 32                 64
6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)

The following is a similar experiment, ran against a nullb with a single
bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
parallel fio jobs operating on the same device

queue size 2                 4                 8                 16                 32                 64
6.1-rc2    1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K)    6178.2K (124.6K)   12227.9K (37.7K)   13286.6K (92.9K)
patched    1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K)    6151.4K (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)

It has also survived blktests and a 12h-stress run against nullb. I also
ran the code against nvme and a scsi SSD, and I didn't observe
performance regression in those. If there are other tests you think I
should run, please let me know and I will follow up with results.

[1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/

Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Keith Busch &lt;kbusch@kernel.org&gt;
Cc: Liu Song &lt;liusong@linux.alibaba.com&gt;
Suggested-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Stable-dep-of: b5fcf7871acb ("sbitmap: correct wake_batch recalculation to avoid potential IO hung")
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sbitmap: remove redundant check in __sbitmap_queue_get_batch</title>
<updated>2023-03-10T08:32:42+00:00</updated>
<author>
<name>Kemeng Shi</name>
<email>shikemeng@huaweicloud.com</email>
</author>
<published>2023-01-16T20:50:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b2fbd1c9bd30f917e4e042dcd993ffe907b78cba'/>
<id>urn:sha1:b2fbd1c9bd30f917e4e042dcd993ffe907b78cba</id>
<content type='text'>
[ Upstream commit 903e86f3a64d9573352bbab2f211fdbbaa5772b7 ]

Commit fbb564a557809 ("lib/sbitmap: Fix invalid loop in
__sbitmap_queue_get_batch()") mentioned that "Checking free bits when
setting the target bits. Otherwise, it may reuse the busying bits."
That commit added a check to make sure all masked bits in the word are
zero before the cmpxchg. With that check in place, the pre-existing
check after the cmpxchg, which tests whether any zero bit remains among
the masked bits of the word, becomes redundant.

Actually, the old value of the word before the cmpxchg is stored in val,
and busy bits are filtered out of val by "(get_mask &amp; ~val)" after the
cmpxchg. So we will not reuse the busy bits mentioned in commit
fbb564a557809 ("lib/sbitmap: Fix invalid loop in
__sbitmap_queue_get_batch()"). Revert the newly-added check to remove
the redundancy.
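
A sketch of the relevant part of __sbitmap_queue_get_batch() after the
revert (simplified; only the cmpxchg and the busy-bit filter are shown):

  get_mask = ((1UL &lt;&lt; nr_tags) - 1) &lt;&lt; nr;
  /* Publish the requested bits; on failure try_cmpxchg refreshes val
   * with the word's current value and the loop retries. */
  while (!atomic_long_try_cmpxchg(ptr, &amp;val, get_mask | val))
          ;
  /* Drop any bit that was already set before the cmpxchg, so busy bits
   * are never handed out twice; no pre-cmpxchg check is needed. */
  get_mask = (get_mask &amp; ~val) &gt;&gt; nr;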

Fixes: fbb564a55780 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()")
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
</feed>
