<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/fs/fs-writeback.c, branch v7.2-rc1</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v7.2-rc1</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v7.2-rc1'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-06-04T08:16:51+00:00</updated>
<entry>
<title>mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking</title>
<updated>2026-06-04T08:16:51+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@kernel.org</email>
</author>
<published>2026-05-11T11:58:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e1bf79628453e6afac81ffa57f4f40f28e5512ff'/>
<id>urn:sha1:e1bf79628453e6afac81ffa57f4f40f28e5512ff</id>
<content type='text'>
The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context.  Perf lock contention profiling shows that the
performance problem is not lock contention but the writeback submission
work itself: walking the page tree and submitting I/O blocks the writer
for milliseconds, inflating p99.9 latency from 23 ms (buffered) to 93 ms
(dontcache).

Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background.  This moves writeback submission
completely off the writer's hot path.

To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back.  The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.

Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations.  Use test_and_clear_bit to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.
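
The coalescing behaviour described above can be illustrated with a small
toy model in Python. It is only a sketch under stated assumptions: the
names KickFlag, ToyWb, dontcache_write and flusher_run are hypothetical
and are not kernel APIs, a lock stands in for the atomic bit operations,
and waking the flusher only on the clear-to-set transition is assumed
from the comparison with WB_start_all.

```python
import threading


class KickFlag:
    """Toy stand-in for a single state bit (hypothetical, not a kernel API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._set = False

    def test_and_set(self):
        """Set the bit and return its previous value."""
        with self._lock:
            old = self._set
            self._set = True
            return old

    def test_and_clear(self):
        """Clear the bit and return its previous value."""
        with self._lock:
            old = self._set
            self._set = False
            return old


class ToyWb:
    def __init__(self):
        self.kick = KickFlag()
        self.dirty = 0      # models the dontcache dirty-page counter
        self.wakeups = 0    # flusher wakeups actually issued
        self.written = 0

    def dontcache_write(self, pages):
        self.dirty += pages
        # Only the writer that flips the bit from clear to set wakes the flusher.
        if not self.kick.test_and_set():
            self.wakeups += 1

    def flusher_run(self):
        # Consume the request first, then read the counter, so a write that
        # lands after this point re-sets the bit and causes another run.
        if not self.kick.test_and_clear():
            return 0
        nr = self.dirty
        self.dirty = 0
        self.written += nr
        return nr


wb = ToyWb()
for _ in range(5):
    wb.dontcache_write(4)
first = wb.flusher_run()
wb.dontcache_write(2)       # arrives after the bit was consumed
second = wb.flusher_run()
print(wb.wakeups, first, second)
```

In this model five back-to-back writes produce a single wakeup, and the
write that arrives after the bit was consumed schedules a second run
instead of being lost.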

Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the flusher.
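
The difference between the two reads can be sketched with a toy batched
counter in Python. This is a simplified model, not the kernel's
percpu_counter: the class ToyPercpuCounter, its methods and the batch
value of 32 are assumptions made for illustration only.

```python
class ToyPercpuCounter:
    """Toy model of a batched per-CPU counter (names are hypothetical)."""

    def __init__(self, nr_cpus, batch):
        self.count = 0                  # global part, cheap to read
        self.deltas = [0] * nr_cpus     # per-CPU parts not yet folded in
        self.batch = batch

    def add(self, cpu, amount):
        self.deltas[cpu] += amount
        # Fold the local delta into the global count once it reaches the batch.
        if abs(self.deltas[cpu]) >= self.batch:
            self.count += self.deltas[cpu]
            self.deltas[cpu] = 0

    def read_fast(self):
        """Approximate read: global part only (the wb_stat() analogue)."""
        return self.count

    def read_sum(self):
        """Full read: global part plus every per-CPU delta (wb_stat_sum() analogue)."""
        return self.count + sum(self.deltas)


counter = ToyPercpuCounter(nr_cpus=4, batch=32)
counter.add(cpu=0, amount=3)    # small write, stays in CPU 0's delta
counter.add(cpu=2, amount=5)    # small write, stays in CPU 2's delta
print(counter.read_fast(), counter.read_sum())
```

Here the fast read still reports zero dirty pages, while the summed read
sees all eight, which is why a flusher sizing its work from the fast read
could skip small writes entirely.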

In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq() which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch. Pin the wb with
wb_get() inside the RCU critical section before calling wb_wakeup()
outside it, since cgroup bdi_writeback structures are RCU-freed and the
wb pointer could become invalid after unlocked_inode_to_wb_end() drops
the RCU read lock.

Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.

dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
xfs on NVMe, fio io_uring):

Buffered and direct I/O paths are unaffected by this patchset. All
improvements are confined to the dontcache path:

Single-stream throughput (MB/s):
                        Before    After    Change
  seq-write/dontcache      298      897    +201%
  rand-write/dontcache     131      236     +80%

Tail latency improvements (seq-write/dontcache):
  p99:    135,266 us  -&gt;  23,986 us   (-82%)
  p99.9: 8,925,479 us -&gt;  28,443 us   (-99.7%)

Multi-writer (4 jobs, sequential write):
                                Before    After    Change
  dontcache aggregate (MB/s)     2,529    4,532     +79%
  dontcache p99 (us)             8,553    1,002     -88%
  dontcache p99.9 (us)         109,314    1,057     -99%

  Dontcache multi-writer throughput now matches buffered (4,532 vs
  4,616 MB/s).

32-file write (Axboe test):
                                Before    After    Change
  dontcache aggregate (MB/s)     1,548    3,499    +126%
  dontcache p99 (us)            10,170      602     -94%
  Peak dirty pages (MB)          1,837      213     -88%

  Dontcache now reaches 81% of buffered throughput (was 35%).

Competing writers (dontcache vs buffered, separate files):
                                Before    After
  buffered writer                  868      433 MB/s
  dontcache writer                 415      433 MB/s
  Aggregate                      1,284      866 MB/s

  Previously the buffered writer starved the dontcache writer 2:1.
  With per-bdi_writeback tracking, both writers now receive equal
  bandwidth. The aggregate matches the buffered-vs-buffered baseline
  (863 MB/s), indicating fair sharing regardless of I/O mode.

  The dontcache writer's p99.9 latency collapsed from 119 ms to
  33 ms (-73%), eliminating the severe periodic stalls seen in the
  baseline. Both writers now share identical latency profiles,
  matching the buffered-vs-buffered pattern.

The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Link: https://patch.msgid.link/20260511-dontcache-v7-3-2848ddce8090@kernel.org
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Ritesh Harjani (IBM) &lt;ritesh.list@gmail.com&gt;
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm: track DONTCACHE dirty pages per bdi_writeback</title>
<updated>2026-06-04T08:16:50+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@kernel.org</email>
</author>
<published>2026-05-11T11:58:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=88d6f128d06d492b6d178c8e8c53db8c82305ae1'/>
<id>urn:sha1:88d6f128d06d492b6d178c8e8c53db8c82305ae1</id>
<content type='text'>
Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty
pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE
writes).

Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied()
when the folio has the dropbehind flag set, and decrement it in
folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it
when a non-DONTCACHE lookup atomically clears the dropbehind flag on a
dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind()
to prevent concurrent lookups from double-decrementing the counter, and
guarding the decrement with mapping_can_writeback() to match the increment
path.

Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so
that the stat is properly migrated when an inode switches cgroup writeback
domains.

The counter will be used by the writeback flusher to determine how many
pages to write back when expediting writeback for IOCB_DONTCACHE writes,
without flushing the entire BDI's dirty pages.

Suggested-by: Jan Kara &lt;jack@suse.cz&gt;
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Link: https://patch.msgid.link/20260511-dontcache-v7-2-2848ddce8090@kernel.org
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Ritesh Harjani (IBM) &lt;ritesh.list@gmail.com&gt;
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>writeback: use a per-sb counter to drain inode wb switches at umount</title>
<updated>2026-05-22T10:06:35+00:00</updated>
<author>
<name>Baokun Li</name>
<email>libaokun@linux.alibaba.com</email>
</author>
<published>2026-05-21T09:50:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=31c1d19ead2c26a63859a2757d8b786765ba9cdd'/>
<id>urn:sha1:31c1d19ead2c26a63859a2757d8b786765ba9cdd</id>
<content type='text'>
Tracking in-flight inode wb switches with a single global counter
(isw_nr_in_flight) plus a synchronize_rcu() based wait in
cgroup_writeback_umount() forces every umount to take a global hit
whenever any other superblock on the system has wb switches in flight,
even if the superblock being unmounted has none of its own.

Replace the global synchronize_rcu()/flush_workqueue() pair with a
per-sb counter, s_isw_nr_in_flight, plus three small helpers:

  - cgroup_writeback_pin(sb)   - increment counter
  - cgroup_writeback_unpin(sb) - decrement and wake drainer if last
  - cgroup_writeback_drain(sb) - wait for counter to reach zero

The wiring is:

  - inode_prepare_wbs_switch() pins before checking SB_ACTIVE and
    grabbing the inode; failure paths unpin before returning.  A
    lockless SB_ACTIVE check at the top of the function lets us skip
    the atomic_inc/smp_mb dance once SB_ACTIVE has been cleared (it
    is monotonic and never set back).
  - process_inode_switch_wbs() unpins after the matching iput().
  - cgroup_writeback_umount() drains the per-sb counter via
    wait_var_event().

The smp_mb() pair between inode_prepare_wbs_switch() and
cgroup_writeback_umount() keeps the SB_ACTIVE / counter ordering:
either the umounter sees a non-zero counter and waits, or the
switcher sees SB_ACTIVE cleared and aborts before grabbing the
inode.
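
The pin/unpin/drain wiring can be sketched in Python as follows. This is
a toy model under stated assumptions: ToySuperBlock and its methods are
hypothetical names, and a threading.Condition stands in for the atomics,
memory barriers and wait_var_event() used by the real code.

```python
import threading


class ToySuperBlock:
    """Toy model; a condition variable replaces atomics and barriers."""

    def __init__(self):
        self.active = True
        self.in_flight = 0
        self._cond = threading.Condition()

    def pin(self):
        with self._cond:
            self.in_flight += 1

    def unpin(self):
        with self._cond:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._cond.notify_all()   # wake a waiting drainer

    def drain(self):
        with self._cond:
            self._cond.wait_for(lambda: self.in_flight == 0)

    def prepare_switch(self):
        """Pin first, then check 'active'; undo the pin on failure."""
        if not self.active:        # cheap early exit, the flag never comes back
            return False
        self.pin()
        if not self.active:
            self.unpin()
            return False
        return True

    def umount(self):
        self.active = False
        self.drain()


sb = ToySuperBlock()
events = []
started = sb.prepare_switch()      # a switch is now in flight


def finish_switch():
    events.append("switch done")
    sb.unpin()


worker = threading.Timer(0.05, finish_switch)
worker.start()
sb.umount()                        # blocks until the switch unpins
events.append("umount done")
worker.join()

late = sb.prepare_switch()         # superblock is no longer active
print(events, late)
```

The umount side either observes the pinned counter and waits, or the
switcher observes the cleared flag and backs out; a switch on another
superblock never enters the picture because the counter is per-sb.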

The global isw_nr_in_flight is left in place, since it is still used
to throttle in-flight switches via WB_FRN_MAX_IN_FLIGHT.

The rcu_read_lock() extension in inode_switch_wbs() and
cleanup_offline_cgwb() that the race fix added is no longer needed
and is reverted; the synchronize_rcu() that the race fix added to
cgroup_writeback_umount() is dropped as well.

The following numbers were measured on a 16 vCPU QEMU guest with 4
background superblocks each churning "create memcg -&gt; write 1 MiB -&gt;
rmdir memcg" to keep the global isw_nr_in_flight non-zero.  Latencies
are wall-clock around umount(8); only the target sb's umount is
measured.

Target sb runs its own cgwb churn:

                              p50      p95      p99      max
  global synchronize_rcu()   67.6 ms  88.3 ms  88.3 ms  96.8 ms
  per-sb counter (this)       7.9 ms  10.0 ms  10.0 ms  10.1 ms

Idle target umount latency under cross-sb cgwb-switch pressure:

                              p50      p95      p99      max
  global synchronize_rcu()   62.7 ms  95.4 ms 108.1 ms 108.6 ms
  per-sb counter (this)       5.3 ms   6.9 ms   7.4 ms   7.4 ms
  no-pressure baseline        4.9 ms   5.9 ms   6.3 ms   6.7 ms

8 concurrent umounts of idle sbs under the same pressure:

                              p50      p95      max
  global synchronize_rcu()   61.3 ms  99.5 ms 113.7 ms
  per-sb counter (this)       8.1 ms   9.1 ms   9.5 ms

In-kernel cgroup_writeback_umount() time across the same run
(bpftrace, ~340 calls covering all scenarios):

  global synchronize_rcu()    12371 ms total (~36 ms / call)
  per-sb counter (this)        1.37 ms total ( ~4 us / call)
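
As a quick arithmetic check of the per-call figures above, the totals can
be divided by the approximate call count. The variable names are
illustrative, and 340 is the rounded call count quoted in the text.

```python
calls = 340                  # approximate number of calls, from the text
before_total_ms = 12371      # global synchronize_rcu() variant
after_total_ms = 1.37        # per-sb counter variant

before_per_call_ms = before_total_ms / calls
after_per_call_us = after_total_ms * 1000 / calls
print(round(before_per_call_ms), round(after_per_call_us))
```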

Suggested-by: Christian Brauner &lt;brauner@kernel.org&gt;
Link: https://lore.kernel.org/r/177910456953.488929.2169908940676707307.b4-review@b4
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Baokun Li &lt;libaokun@linux.alibaba.com&gt;
Link: https://patch.msgid.link/20260521095016.2791354-4-libaokun@linux.alibaba.com
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount()</title>
<updated>2026-05-22T10:06:35+00:00</updated>
<author>
<name>Baokun Li</name>
<email>libaokun@linux.alibaba.com</email>
</author>
<published>2026-05-21T09:50:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e90a6d668e26e00a72df2d09c173b563468f09c9'/>
<id>urn:sha1:e90a6d668e26e00a72df2d09c173b563468f09c9</id>
<content type='text'>
Commit e1b849cfa6b6 ("writeback: Avoid contention on wb-&gt;list_lock when
switching inodes") replaced the queue_rcu_work() based scheduling of
inode wb switches with a plain queue_work().  Since then no switcher
goes through call_rcu(), so rcu_barrier() in cgroup_writeback_umount()
has no callbacks of its own to wait for.  It still drains unrelated
call_rcu() callbacks from other subsystems on busy systems, which
incidentally slows umount down; drop it.

Fixes: e1b849cfa6b6 ("writeback: Avoid contention on wb-&gt;list_lock when switching inodes")
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Baokun Li &lt;libaokun@linux.alibaba.com&gt;
Link: https://patch.msgid.link/20260521095016.2791354-3-libaokun@linux.alibaba.com
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()</title>
<updated>2026-05-22T10:06:35+00:00</updated>
<author>
<name>Baokun Li</name>
<email>libaokun@linux.alibaba.com</email>
</author>
<published>2026-05-21T09:50:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cba38ec4cbd3a7b8b942a8d52531a05be8a9ff0d'/>
<id>urn:sha1:cba38ec4cbd3a7b8b942a8d52531a05be8a9ff0d</id>
<content type='text'>
When a container exits, the following BUG_ON() is occasionally triggered:

==================================================================
 VFS: Busy inodes after unmount of sdb (ext4)
 ------------[ cut here ]------------
 kernel BUG at fs/super.c:695!
 CPU: 3 PID: 6 Comm: containerd-shim Tainted: G OE K 6.6 #1
 pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
 pc : generic_shutdown_super+0xf0/0x100
 lr : generic_shutdown_super+0xf0/0x100
 Call trace:
  generic_shutdown_super+0xf0/0x100
  kill_block_super+0x20/0x48
  ext4_kill_sb+0x28/0x60
  deactivate_locked_super+0x54/0x130
  deactivate_super+0x84/0xa0
  cleanup_mnt+0xa4/0x140
  __cleanup_mnt+0x18/0x28
  task_work_run+0x78/0xe0
  do_notify_resume+0x204/0x240
==================================================================

The root cause is a race between cgroup_writeback_umount() and
inode_switch_wbs()/cleanup_offline_cgwb(). There is a window between
inode_prepare_wbs_switch() returning true and the subsequent
wb_queue_isw() call. The following sequence triggers the issue:

      CPU A (umount)           |          CPU B (writeback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                 inode_switch_wbs/cleanup_offline_cgwb
                                  atomic_inc(&amp;isw_nr_in_flight)
                                  inode_prepare_wbs_switch
                                   -&gt; passes SB_ACTIVE check
                                   __iget(inode)
 generic_shutdown_super
  sb-&gt;s_flags &amp;= ~SB_ACTIVE
  cgroup_writeback_umount(sb)
   smp_mb()
   atomic_read(&amp;isw_nr_in_flight)
   rcu_barrier()
    -&gt; no pending RCU callbacks
   flush_workqueue(isw_wq)
    -&gt; nothing queued, returns
  evict_inodes(sb)
   -&gt; Inode skipped as isw still holds a ref.
  sop-&gt;put_super(sb)
   /* destroys percpu counters */
  -&gt; VFS: Busy inodes after unmount!
                                  wb_queue_isw()
                                   queue_work(isw_wq, ...)
                                  /* later in work function */
                                  inode_switch_wbs_work_fn
                                   process_inode_switch_wbs
                                    iput() -&gt; evict
                                     percpu_counter_dec() // UAF!

Fix this by extending the RCU read-side critical section in
inode_switch_wbs() and cleanup_offline_cgwb() to cover from
inode_prepare_wbs_switch() through wb_queue_isw().  Since there is
no sleep in this window, rcu_read_lock() can be used.  Then add a
synchronize_rcu() in cgroup_writeback_umount() before the existing
rcu_barrier(), so that all in-flight switchers that have passed the
SB_ACTIVE check have completed queue_work() before flush_workqueue()
is called.

The existing rcu_barrier() is intentionally retained so this fix can
be backported unchanged to stable kernels (5.10.y, 6.6.y, ...) that
still queue switches via queue_rcu_work(). It is a no-op on current
mainline (since commit e1b849cfa6b6 ("writeback: Avoid contention on
wb-&gt;list_lock when switching inodes")) and is removed in a follow-up
patch.

Fixes: a1a0e23e4903 ("writeback: flush inode cgroup wb switches instead of pinning super_block")
Cc: stable@vger.kernel.org
Suggested-by: Jan Kara &lt;jack@suse.cz&gt;
Link: https://lore.kernel.org/all/mxnjq2l6guusfchvauxr3v7c4bwjasybxlleqbbh4efloeqspz@iqylk76ohufz
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Baokun Li &lt;libaokun@linux.alibaba.com&gt;
Link: https://patch.msgid.link/20260521095016.2791354-2-libaokun@linux.alibaba.com
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Christian Brauner (Amutable) &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'vfs-7.1-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2026-04-24T00:08:04+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-24T00:08:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=dd6c438c3e64a5ff0b5d7e78f7f9be547803ef1b'/>
<id>urn:sha1:dd6c438c3e64a5ff0b5d7e78f7f9be547803ef1b</id>
<content type='text'>
Pull vfs fixes from Christian Brauner:

 - eventpoll: fix ep_remove() UAF and follow-up cleanup

 - fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference
   error

 - writeback: Fix use after free in inode_switch_wbs_work_fn()

 - fuse: reject oversized dirents in page cache

 - fs: aio: reject partial mremap to avoid Null-pointer-dereference
   error

 - nstree: fix func. parameter kernel-doc warnings

 - fs: Handle multiply claimed blocks more gracefully with mmb

* tag 'vfs-7.1-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  eventpoll: drop vestigial epi-&gt;dying flag
  eventpoll: drop dead bool return from ep_remove_epi()
  eventpoll: refresh eventpoll_release() fast-path comment
  eventpoll: move f_lock acquisition into ep_remove_file()
  eventpoll: fix ep_remove struct eventpoll / struct file UAF
  eventpoll: move epi_fget() up
  eventpoll: rename ep_remove_safe() back to ep_remove()
  eventpoll: drop vestigial __ prefix from ep_remove_{file,epi}()
  eventpoll: kill __ep_remove()
  eventpoll: split __ep_remove()
  eventpoll: use hlist_is_singular_node() in __ep_remove()
  fs: Handle multiply claimed blocks more gracefully with mmb
  nstree: fix func. parameter kernel-doc warnings
  fs: aio: reject partial mremap to avoid Null-pointer-dereference error
  fuse: reject oversized dirents in page cache
  writeback: Fix use after free in inode_switch_wbs_work_fn()
  fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference error
</content>
</entry>
<entry>
<title>writeback: Fix use after free in inode_switch_wbs_work_fn()</title>
<updated>2026-04-23T22:34:58+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2026-04-13T09:36:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=6689f01d6740cf358932b3e97ee968c6099800d9'/>
<id>urn:sha1:6689f01d6740cf358932b3e97ee968c6099800d9</id>
<content type='text'>
inode_switch_wbs_work_fn() has a loop like:

  wb_get(new_wb);
  while (1) {
    list = llist_del_all(&amp;new_wb-&gt;switch_wbs_ctxs);
    /* Nothing to do? */
    if (!list)
      break;
    ... process the items ...
  }

Items are added to the list like this:

wb_queue_isw()
  if (llist_add(&amp;isw-&gt;list, &amp;wb-&gt;switch_wbs_ctxs))
    queue_work(isw_wq, &amp;wb-&gt;switch_work);

Because inode_switch_wbs_work_fn() loops when processing isw items, it
can happen that wb-&gt;switch_work is pending while wb-&gt;switch_wbs_ctxs is
empty. This is a problem because in that case wb can get freed (no isw
items -&gt; no wb reference) while the work is still pending, causing
use-after-free issues.

We cannot just fix this by cancelling the work when freeing wb because that
could still trigger problematic 0 -&gt; 1 transitions on the wb refcount due to
wb_get() in inode_switch_wbs_work_fn(). It could all be handled with
more careful code, but that seems unnecessarily complex, so let's avoid
that until it is proven that the looping actually brings practical
benefit. Just remove the loop from inode_switch_wbs_work_fn() instead.
That way when wb_queue_isw() queues work, we are guaranteed we have
added the first item to wb-&gt;switch_wbs_ctxs and nobody is going to
remove it (and drop the wb reference it holds) until the queued work
runs.
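
The invariant the fix relies on can be sketched with a toy model in
Python. The class ToyWb and its fields are hypothetical; the list stands
in for the switch_wbs_ctxs llist, and the "was empty" test models the
return value of llist_add().

```python
class ToyWb:
    """Toy model of the queueing scheme (not kernel code)."""

    def __init__(self):
        self.items = []            # stands in for the switch_wbs_ctxs list
        self.work_pending = False
        self.refs = 0              # references held by queued items

    def queue_isw(self, item):
        was_empty = not self.items
        self.items.append(item)
        self.refs += 1             # each queued item holds a wb reference
        if was_empty:              # only the first item queues the work
            self.work_pending = True

    def work_fn(self):
        """Single pass: take everything queued so far, then return."""
        self.work_pending = False
        batch, self.items = self.items, []
        for _ in batch:
            self.refs -= 1
        return batch


wb = ToyWb()
wb.queue_isw("a")
wb.queue_isw("b")
# Pending work implies a non-empty list, hence at least one reference
# keeping wb alive until the work function runs.
invariant_before = (not wb.work_pending) or (wb.refs > 0)
processed = wb.work_fn()
print(invariant_before, processed, wb.work_pending, wb.refs)
```

With a single pass, the work function never drains items whose queued
work has yet to run, so work cannot be left pending against an empty
list.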

Fixes: e1b849cfa6b6 ("writeback: Avoid contention on wb-&gt;list_lock when switching inodes")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Link: https://patch.msgid.link/20260413093618.17244-2-jack@suse.cz
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2026-04-19T15:01:17+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-19T15:01:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=40735a683bf844a453d7a0f91e5e3daa0abc659b'/>
<id>urn:sha1:40735a683bf844a453d7a0f91e5e3daa0abc659b</id>
<content type='text'>
Pull more MM updates from Andrew Morton:

 - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)

   Address the longstanding "dying memcg problem": a situation wherein a
   no-longer-used memory control group will hang around for an extended
   period, pointlessly consuming memory

 - "fix unexpected type conversions and potential overflows" (Qi Zheng)

   Fix a couple of potential 32-bit/64-bit issues which were identified
   during review of the "Eliminate Dying Memory Cgroup" series

 - "kho: history: track previous kernel version and kexec boot count"
   (Breno Leitao)

   Use Kexec Handover (KHO) to pass the previous kernel's version string
   and the number of kexec reboots since the last cold boot to the next
   kernel, and print it at boot time

 - "liveupdate: prevent double preservation" (Pasha Tatashin)

   Teach LUO to avoid managing the same file across different active
   sessions

 - "liveupdate: Fix module unloading and unregister API" (Pasha
   Tatashin)

   Address an issue with how LUO handles module reference counting and
   unregistration during module unloading

 - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)

   Simplify and clean up the zswap crypto compression handling and
   improve the lifecycle management of zswap pool's per-CPU acomp_ctx
   resources

 - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
   (SeongJae Park)

   Address unlikely but possible leaks and deadlocks in damon_call() and
   damos_walk()

 - "mm/damon/core: validate damos_quota_goal-&gt;nid" (SeongJae Park)

   Fix a couple of root-only wild pointer dereferences

 - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
   (SeongJae Park)

   Update the DAMON documentation to warn operators about potential
   races which can occur if the commit_inputs parameter is altered at
   the wrong time

 - "Minor hmm_test fixes and cleanups" (Alistair Popple)

   Bugfixes and a cleanup for the HMM kernel selftests

 - "Modify memfd_luo code" (Chenghao Duan)

   Cleanups, simplifications and speedups to the memfd_luo code

 - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)

   Support for userfaultfd in guest_memfd

 - "selftests/mm: skip several tests when thp is not available" (Chunyu
   Hu)

   Fix several issues in the selftests code which were causing breakage
   when the tests were run on CONFIG_THP=n kernels

 - "mm/mprotect: micro-optimization work" (Pedro Falcato)

   A couple of nice speedups for mprotect()

 - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)

   Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
   kexec, crash, kdump and probably other kexec-based things - they are
   being moved out of mm.git and into a new git tree

* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
  MAINTAINERS: add page cache reviewer
  mm/vmscan: avoid false-positive -Wuninitialized warning
  MAINTAINERS: update Dave's kdump reviewer email address
  MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
  MAINTAINERS: drop include/linux/kho/abi/ from KHO
  MAINTAINERS: update KHO and LIVE UPDATE maintainers
  MAINTAINERS: update kexec/kdump maintainers entries
  mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
  selftests: mm: skip charge_reserved_hugetlb without killall
  userfaultfd: allow registration of ranges below mmap_min_addr
  mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
  mm/hugetlb: fix early boot crash on parameters without '=' separator
  zram: reject unrecognized type= values in recompress_store()
  docs: proc: document ProtectionKey in smaps
  mm/mprotect: special-case small folios when applying permissions
  mm/mprotect: move softleaf code out of the main function
  mm: remove '!root_reclaim' checking in should_abort_scan()
  mm/sparse: fix comment for section map alignment
  mm/page_io: use sio-&gt;len for PSWPIN accounting in sio_read_complete()
  selftests/mm: transhuge_stress: skip the test when thp not available
  ...
</content>
</entry>
<entry>
<title>writeback: prevent memory cgroup release in writeback module</title>
<updated>2026-04-18T07:10:45+00:00</updated>
<author>
<name>Muchun Song</name>
<email>songmuchun@bytedance.com</email>
</author>
<published>2026-03-05T11:52:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=49717c7bd6b8e14329c2d04b1e8ec691175b6f4e'/>
<id>urn:sha1:49717c7bd6b8e14329c2d04b1e8ec691175b6f4e</id>
<content type='text'>
In the near future, a folio will no longer pin its corresponding memory
cgroup.  To ensure safety, it will only be appropriate to hold the rcu
read lock or acquire a reference to the memory cgroup returned by
folio_memcg(), thereby preventing it from being released.

In the current patch, the function get_mem_cgroup_css_from_folio() and the
rcu read lock are employed to safeguard against the release of the memory
cgroup.

This serves as a preparatory measure for the reparenting of the
LRU pages.

Link: https://lore.kernel.org/645f99bc344575417f67def3744f975596df2793.1772711148.git.zhengqi.arch@bytedance.com
Signed-off-by: Muchun Song &lt;songmuchun@bytedance.com&gt;
Signed-off-by: Qi Zheng &lt;zhengqi.arch@bytedance.com&gt;
Reviewed-by: Harry Yoo &lt;harry.yoo@oracle.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Allen Pais &lt;apais@linux.microsoft.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Chengming Zhou &lt;chengming.zhou@linux.dev&gt;
Cc: Chen Ridong &lt;chenridong@huawei.com&gt;
Cc: David Hildenbrand &lt;david@kernel.org&gt;
Cc: Hamza Mahfooz &lt;hamzamahfooz@linux.microsoft.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Imran Khan &lt;imran.f.khan@oracle.com&gt;
Cc: Kamalesh Babulal &lt;kamalesh.babulal@oracle.com&gt;
Cc: Lance Yang &lt;lance.yang@linux.dev&gt;
Cc: Liam Howlett &lt;Liam.Howlett@oracle.com&gt;
Cc: Lorenzo Stoakes (Oracle) &lt;ljs@kernel.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Michal Koutný &lt;mkoutny@suse.com&gt;
Cc: Mike Rapoport &lt;rppt@kernel.org&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: Roman Gushchin &lt;roman.gushchin@linux.dev&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Usama Arif &lt;usamaarif642@gmail.com&gt;
Cc: Vlastimil Babka &lt;vbabka@kernel.org&gt;
Cc: Wei Xu &lt;weixugc@google.com&gt;
Cc: Yosry Ahmed &lt;yosry@kernel.org&gt;
Cc: Yuanchu Xie &lt;yuanchu@google.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>writeback: don't block sync for filesystems with no data integrity guarantees</title>
<updated>2026-03-20T13:18:56+00:00</updated>
<author>
<name>Joanne Koong</name>
<email>joannelkoong@gmail.com</email>
</author>
<published>2026-03-20T00:51:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=76f9377cd2ab7a9220c25d33940d9ca20d368172'/>
<id>urn:sha1:76f9377cd2ab7a9220c25d33940d9ca20d368172</id>
<content type='text'>
Add a SB_I_NO_DATA_INTEGRITY superblock flag for filesystems that cannot
guarantee data persistence on sync (e.g. fuse). For superblocks with this
flag set, sync kicks off writeback of dirty inodes but does not wait
for the flusher threads to complete the writeback.

This replaces the per-inode AS_NO_DATA_INTEGRITY mapping flag added in
commit f9a49aa302a0 ("fs/writeback: skip AS_NO_DATA_INTEGRITY mappings
in wait_sb_inodes()"). The flag belongs at the superblock level because
data integrity is a filesystem-wide property, not a per-inode one.
Having this flag at the superblock level also allows us to skip having
to iterate every dirty inode in wait_sb_inodes() only to skip each inode
individually.

Prior to this commit, mappings with no data integrity guarantees skipped
waiting on writeback completion but still waited on the flusher threads
to finish initiating the writeback. Waiting on the flusher threads is
unnecessary. This commit kicks off writeback but does not wait on the
flusher threads. This change properly addresses a recent report [1] of
a suspend-to-RAM hang seen on fuse-overlayfs that was caused by waiting
on the flusher threads to finish:

Workqueue: pm_fs_sync pm_fs_sync_work_fn
Call Trace:
 &lt;TASK&gt;
 __schedule+0x457/0x1720
 schedule+0x27/0xd0
 wb_wait_for_completion+0x97/0xe0
 sync_inodes_sb+0xf8/0x2e0
 __iterate_supers+0xdc/0x160
 ksys_sync+0x43/0xb0
 pm_fs_sync_work_fn+0x17/0xa0
 process_one_work+0x193/0x350
 worker_thread+0x1a1/0x310
 kthread+0xfc/0x240
 ret_from_fork+0x243/0x280
 ret_from_fork_asm+0x1a/0x30
 &lt;/TASK&gt;

On fuse this is problematic because there are paths that may cause the
flusher thread to block. For example, systemd may freeze the user session
cgroups first, which freezes the fuse daemon, before invoking the kernel
suspend; the kernel suspend then triggers -&gt;write_inode(), which on fuse
issues a synchronous setattr request that cannot be processed since the
daemon is frozen. Alternatively, if the daemon is buggy and cannot properly
complete writeback, initiating writeback on a dirty folio already under
writeback leads to writeback_get_folio() -&gt; folio_prepare_writeback() -&gt;
an unconditional wait on writeback to finish, which will cause a hang.
This commit restores fuse to its prior behavior before tmp folios were
removed, where sync was essentially a no-op.

[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1a-asuvfrbKXbEwwDSctvemF+6zfhdnuzO65Pt8HsFSRw@mail.gmail.com/T/#m632c4648e9cafc4239299887109ebd880ac6c5c1

Fixes: 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree")
Reported-by: John &lt;therealgraysky@proton.me&gt;
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong &lt;joannelkoong@gmail.com&gt;
Link: https://patch.msgid.link/20260320005145.2483161-2-joannelkoong@gmail.com
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: David Hildenbrand (Arm) &lt;david@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
</feed>
