<feed xmlns='http://www.w3.org/2005/Atom'>
<title>BMC/Intel-BMC/linux.git/drivers/md, branch dev-5.14-intel</title>
<subtitle>Intel OpenBMC Linux kernel source tree (mirror)</subtitle>
<id>https://git.radix-linux.su/BMC/Intel-BMC/linux.git/atom?h=dev-5.14-intel</id>
<link rel='self' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/atom?h=dev-5.14-intel'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/'/>
<updated>2021-09-30T08:13:03+00:00</updated>
<entry>
<title>md: fix a lock order reversal in md_alloc</title>
<updated>2021-09-30T08:13:03+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2021-09-01T11:38:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=a46d5e3603bdc8e36b518a67cf9073a107906b68'/>
<id>urn:sha1:a46d5e3603bdc8e36b518a67cf9073a107906b68</id>
<content type='text'>
[ Upstream commit 7df835a32a8bedf7ce88efcfa7c9b245b52ff139 ]

Commit b0140891a8cea3 ("md: Fix race when creating a new md device.")
not only moved assigning mddev-&gt;gendisk before calling add_disk, which
fixes the races described in the commit log, but also added a
mddev-&gt;open_mutex critical section over add_disk and creation of the
md kobj.  Adding a kobject after add_disk is racy vs deleting the gendisk
right after adding it, but md already prevents against that by holding
a mddev-&gt;active reference.

On the other hand taking this lock added a lock order reversal with what
is not disk-&gt;open_mutex (used to be bdev-&gt;bd_mutex when the commit was
added) for partition devices, which need that lock for the internal open
for the partition scan, and a recent commit also takes it for
non-partitioned devices, leading to further lockdep splatter.

Fixes: b0140891a8ce ("md: Fix race when creating a new md device.")
Fixes: d62633873590 ("block: support delayed holder registration")
Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Reviewed-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()</title>
<updated>2021-09-18T11:43:23+00:00</updated>
<author>
<name>Arne Welzel</name>
<email>arne.welzel@corelight.com</email>
</author>
<published>2021-08-13T22:40:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=d09494a54240741ed62766a6b5919499e23c1067'/>
<id>urn:sha1:d09494a54240741ed62766a6b5919499e23c1067</id>
<content type='text'>
commit 528b16bfc3ae5f11638e71b3b63a81f9999df727 upstream.

On systems with many cores using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the page allocation limit
for a given device is reached or close to be reached. This is due
to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.

Switch to non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive() to avoid taking
the percpu_counter spinlock.

This may over/under estimate the actual number of allocated pages by at
most (batch-1) * num_online_cpus().

Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
allocations, this seems an acceptable error. Certainly preferred over
running into the spinlock contention.

This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
and 192GB RAM as follows, but can be provoked on systems with less CPUs
as well.

 * Disable swap
 * Tune vm settings to promote regular writeback
     $ echo 50 &gt; /proc/sys/vm/dirty_expire_centisecs
     $ echo 25 &gt; /proc/sys/vm/dirty_writeback_centisecs
     $ echo $((128 * 1024 * 1024)) &gt; /proc/sys/vm/dirty_background_bytes

 * Create 8 dmcrypt devices based on files on a tmpfs
 * Create and mount an ext4 filesystem on each crypt devices
 * Run stress-ng --hdd 8 within one of above filesystems

Total %system usage collected from sysstat goes to ~35%. Write throughput
on the underlying loop device is ~2GB/s. perf profiling an individual
kworker kcryptd thread shows the following profile, indicating spinlock
contention in percpu_counter_compare():

    99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
      |
      --ret_from_fork
        kthread
        worker_thread
        |
         --99.92%--process_one_work
            |
            |--80.52%--kcryptd_crypt
            |    |
            |    |--62.58%--mempool_alloc
            |    |  |
            |    |   --62.24%--crypt_page_alloc
            |    |     |
            |    |      --61.51%--__percpu_counter_compare
            |    |        |
            |    |         --61.34%--__percpu_counter_sum
            |    |           |
            |    |           |--58.68%--_raw_spin_lock_irqsave
            |    |           |  |
            |    |           |   --58.30%--native_queued_spin_lock_slowpath
            |    |           |
            |    |            --0.69%--cpumask_next
            |    |                |
            |    |                 --0.51%--_find_next_bit
            |    |
            |    |--10.61%--crypt_convert
            |    |          |
            |    |          |--6.05%--xts_crypt
            ...

After applying this patch and running the same test, %system usage is
lowered to ~7% and write throughput on the loop device increases
to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
in the profile and not hitting the percpu_counter() spinlock anymore.

    |--8.15%--mempool_alloc
    |    |
    |    |--3.93%--crypt_page_alloc
    |    |    |
    |    |     --3.75%--__alloc_pages
    |    |         |
    |    |          --3.62%--get_page_from_freelist
    |    |              |
    |    |               --3.22%--rmqueue_bulk
    |    |                   |
    |    |                    --2.59%--_raw_spin_lock
    |    |                      |
    |    |                       --2.57%--native_queued_spin_lock_slowpath
    |    |
    |     --3.05%--_raw_spin_lock_irqsave
    |               |
    |                --2.49%--native_queued_spin_lock_slowpath

Suggested-by: DJ Gregor &lt;dj@corelight.com&gt;
Reviewed-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Arne Welzel &lt;arne.welzel@corelight.com&gt;
Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard</title>
<updated>2021-09-15T08:02:35+00:00</updated>
<author>
<name>Xiao Ni</name>
<email>xni@redhat.com</email>
</author>
<published>2021-08-18T05:57:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=2539d2c07fb8bc44a9a043b0696d3a027ec4d1df'/>
<id>urn:sha1:2539d2c07fb8bc44a9a043b0696d3a027ec4d1df</id>
<content type='text'>
commit 46d4703b1db4c86ab5acb2331b10df999f005e8e upstream.

We are seeing the following warning in raid10_handle_discard.
[  695.110751] =============================
[  695.131439] WARNING: suspicious RCU usage
[  695.151389] 4.18.0-319.el8.x86_64+debug #1 Not tainted
[  695.174413] -----------------------------
[  695.192603] drivers/md/raid10.c:1776 suspicious
rcu_dereference_check() usage!
[  695.225107] other info that might help us debug this:
[  695.260940] rcu_scheduler_active = 2, debug_locks = 1
[  695.290157] no locks held by mkfs.xfs/10186.

In the first loop of function raid10_handle_discard. It already
determines which disk need to handle discard request and add the
rdev reference count rdev-&gt;nr_pending. So the conf-&gt;mirrors will
not change until all bios come back from underlayer disks. It
doesn't need to use rcu_dereference to get rdev.

Cc: stable@vger.kernel.org
Fixes: d30588b2731f ('md/raid10: improve raid10 discard request')
Signed-off-by: Xiao Ni &lt;xni@redhat.com&gt;
Acked-by: Guoqing Jiang &lt;guoqing.jiang@linux.dev&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>raid1: ensure write behind bio has less than BIO_MAX_VECS sectors</title>
<updated>2021-09-15T08:02:33+00:00</updated>
<author>
<name>Guoqing Jiang</name>
<email>jiangguoqing@kylinos.cn</email>
</author>
<published>2021-08-24T01:16:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=17f220a74b51f8fbd878611266720ff95b859a5f'/>
<id>urn:sha1:17f220a74b51f8fbd878611266720ff95b859a5f</id>
<content type='text'>
commit 6607cd319b6b91bff94e90f798a61c031650b514 upstream.

We can't split write behind bio with more than BIO_MAX_VECS sectors,
otherwise the below call trace was triggered because we could allocate
oversized write behind bio later.

[ 8.097936] bvec_alloc+0x90/0xc0
[ 8.098934] bio_alloc_bioset+0x1b3/0x260
[ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
[ 8.100988] ? __bio_clone_fast+0xa8/0xe0
[ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
[ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
[ 8.104084] submit_bio_noacct+0x139/0x530
[ 8.105127] submit_bio+0x78/0x1d0
[ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
[ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
[ 8.108300] ? do_writepages+0x41/0x100
[ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
[ 8.110406] do_writepages+0x41/0x100
[ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
[ 8.112513] file_write_and_wait_range+0x61/0xb0
[ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
[ 8.114607] __x64_sys_fsync+0x33/0x60
[ 8.115635] do_syscall_64+0x33/0x40
[ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae

Thanks for the comment from Christoph.

[1]. https://bugs.archlinux.org/task/70992

Cc: stable@vger.kernel.org # v5.12+
Reported-by: Jens Stutte &lt;jens@chianterastutte.eu&gt;
Tested-by: Jens Stutte &lt;jens@chianterastutte.eu&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Guoqing Jiang &lt;jiangguoqing@kylinos.cn&gt;
Signed-off-by: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>bcache: add proper error unwinding in bcache_device_init</title>
<updated>2021-09-15T08:02:03+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2021-08-09T06:40:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=eeb7f00f6021626247d8870506edf5ae7b303e52'/>
<id>urn:sha1:eeb7f00f6021626247d8870506edf5ae7b303e52</id>
<content type='text'>
[ Upstream commit 224b0683228c5f332f9cee615d85e75e9a347170 ]

Except for the IDA none of the allocations in bcache_device_init is
unwound on error, fix that.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Coly Li &lt;colyli@suse.de&gt;
Link: https://lore.kernel.org/r/20210809064028.1198327-7-hch@lst.de
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-block</title>
<updated>2021-08-07T17:26:21+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-08-07T17:26:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=6bbf59145c4b29a384b0a66d63ddfbf55eeb91c4'/>
<id>urn:sha1:6bbf59145c4b29a384b0a66d63ddfbf55eeb91c4</id>
<content type='text'>
Pull block fixes from Jens Axboe:
 "A few minor fixes:

   - Fix ldm kernel-doc warning (Bart)

   - Fix adding offset twice for DMA address in n64cart (Christoph)

   - Fix use-after-free in dasd path handling (Stefan)

   - Order kyber insert trace correctly (Vincent)

   - raid1 errored write handling fix (Wei)

   - Fix blk-iolatency queue get failure handling (Yu)"

* tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-block:
  kyber: make trace_block_rq call consistent with documentation
  block/partitions/ldm.c: Fix a kernel-doc warning
  blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit()
  n64cart: fix the dma address in n64cart_do_bvec
  s390/dasd: fix use after free in dasd path handling
  md/raid10: properly indicate failure when ending a failed write request
</content>
</entry>
<entry>
<title>Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.14</title>
<updated>2021-08-04T21:49:57+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2021-08-04T21:49:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=cc396d27d8d5884bbb555efd7783b9e9e2b41dc2'/>
<id>urn:sha1:cc396d27d8d5884bbb555efd7783b9e9e2b41dc2</id>
<content type='text'>
* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid10: properly indicate failure when ending a failed write request
</content>
</entry>
<entry>
<title>md/raid10: properly indicate failure when ending a failed write request</title>
<updated>2021-07-23T17:14:07+00:00</updated>
<author>
<name>Wei Shuyu</name>
<email>wsy@dogben.com</email>
</author>
<published>2021-06-28T07:15:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=5ba03936c05584b6f6f79be5ebe7e5036c1dd252'/>
<id>urn:sha1:5ba03936c05584b6f6f79be5ebe7e5036c1dd252</id>
<content type='text'>
Similar to [1], this patch fixes the same bug in raid10. Also cleanup the
comments.

[1] commit 2417b9869b81 ("md/raid1: properly indicate failure when ending
                         a failed write request")
Cc: stable@vger.kernel.org
Fixes: 7cee6d4e6035 ("md/raid10: end bio when the device faulty")
Signed-off-by: Wei Shuyu &lt;wsy@dogben.com&gt;
Acked-by: Guoqing Jiang &lt;jiangguoqing@kylinos.cn&gt;
Signed-off-by: Song Liu &lt;song@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'for-5.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm</title>
<updated>2021-07-01T01:19:39+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-07-01T01:19:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=2cfa582be80081fb8db02d4d9b44bff34b82ac54'/>
<id>urn:sha1:2cfa582be80081fb8db02d4d9b44bff34b82ac54</id>
<content type='text'>
Pull device mapper updates from Mike Snitzer:

 - Various DM persistent-data library improvements and fixes that
   benefit both the DM thinp and cache targets.

 - A few small DM kcopyd efficiency improvements.

 - Significant zoned related block core, DM core and DM zoned target
   changes that culminate with adding zoned append emulation (which is
   required to properly fix DM crypt's zoned support).

 - Various DM writecache target changes that improve efficiency. Adds an
   optional "metadata_only" feature that only promotes bios flagged with
   REQ_META. But the most significant improvement is writecache's
   ability to pause writeback, for a confiurable time, if/when the
   working set is larger than the cache (and the cache is full) -- this
   ensures performance is no worse than the slower origin device.

* tag 'for-5.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
  dm writecache: make writeback pause configurable
  dm writecache: pause writeback if cache full and origin being written directly
  dm io tracker: factor out IO tracker
  dm btree remove: assign new_root only when removal succeeds
  dm zone: fix dm_revalidate_zones() memory allocation
  dm ps io affinity: remove redundant continue statement
  dm writecache: add optional "metadata_only" parameter
  dm writecache: add "cleaner" and "max_age" to Documentation
  dm writecache: write at least 4k when committing
  dm writecache: flush origin device when writing and cache is full
  dm writecache: have ssd writeback wait if the kcopyd workqueue is busy
  dm writecache: use list_move instead of list_del/list_add in writecache_writeback()
  dm writecache: commit just one block, not a full page
  dm writecache: remove unused gfp_t argument from wc_add_block()
  dm crypt: Fix zoned block device support
  dm: introduce zone append emulation
  dm: rearrange core declarations for extended use from dm-zone.c
  block: introduce BIO_ZONE_WRITE_LOCKED bio flag
  block: introduce bio zone helpers
  block: improve handling of all zones reset operation
  ...
</content>
</entry>
<entry>
<title>Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block</title>
<updated>2021-06-30T19:21:16+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2021-06-30T19:21:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/BMC/Intel-BMC/linux.git/commit/?id=440462198d9c45e48f2d8d9b18c5702d92282f46'/>
<id>urn:sha1:440462198d9c45e48f2d8d9b18c5702d92282f46</id>
<content type='text'>
Pull block driver updates from Jens Axboe:
 "Pretty calm round, mostly just NVMe and a bit of MD:

   - NVMe updates (via Christoph)
        - improve the APST configuration algorithm (Alexey Bogoslavsky)
        - look for StorageD3Enable on companion ACPI device
          (Mario Limonciello)
        - allow selecting the network interface for TCP connections
          (Martin Belanger)
        - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
          Christoph)
        - move the ACPI StorageD3 code to drivers/acpi/ and add quirks
          for certain AMD CPUs (Mario Limonciello)
        - zoned device support for nvmet (Chaitanya Kulkarni)
        - fix the rules for changing the serial number in nvmet
          (Noam Gottlieb)
        - various small fixes and cleanups (Dan Carpenter, JK Kim,
          Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
          Uytterhoeven, Daniel Wagner)

   - MD updates (Via Song)
        - iostats rewrite (Guoqing Jiang)
        - raid5 lock contention optimization (Gal Ofri)

   - Fall through warning fix (Gustavo)

   - Misc fixes (Gustavo, Jiapeng)"

* tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
  nvmet: use NVMET_MAX_NAMESPACES to set nn value
  loop: Fix missing discard support when using LOOP_CONFIGURE
  nvme.h: add missing nvme_lba_range_type endianness annotations
  nvme: remove zeroout memset call for struct
  nvme-pci: remove zeroout memset call for struct
  nvmet: remove zeroout memset call for struct
  nvmet: add ZBD over ZNS backend support
  nvmet: add Command Set Identifier support
  nvmet: add nvmet_req_bio put helper for backends
  nvmet: add req cns error complete helper
  block: export blk_next_bio()
  nvmet: remove local variable
  nvmet: use nvme status value directly
  nvmet: use u32 type for the local variable nsid
  nvmet: use u32 for nvmet_subsys max_nsid
  nvmet: use req-&gt;cmd directly in file-ns fast path
  nvmet: use req-&gt;cmd directly in bdev-ns fast path
  nvmet: make ver stable once connection established
  nvmet: allow mn change if subsys not discovered
  nvmet: make sn stable once connection was established
  ...
</content>
</entry>
</feed>
