kernel/linux.git/fs/fs-writeback.c, branch v7.2-rc1

mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking

2026-06-04T08:16:51+00:00

The IOCB_DONTCACHE writeback path in generic_write_sync() calls filemap_flush_range() on every write, submitting writeback inline in the writer's context. Perf lock contention profiling shows the performance problem is not lock contention but the writeback submission work itself: walking the page tree and submitting I/O blocks the writer for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms (dontcache).

Replace the inline filemap_flush_range() call with a flusher kick that drains dirty pages in the background. This moves writeback submission completely off the writer's hot path.

To avoid flushing unrelated buffered dirty data, add a dedicated WB_start_dontcache bit and wb_check_start_dontcache() handler that uses the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to write back. The flusher writes back that many pages from the oldest dirty inodes (not restricted to dontcache-specific inodes). This helps preserve I/O batching while limiting the scope of expedited writeback.

Like WB_start_all, the WB_start_dontcache bit coalesces multiple DONTCACHE writes into a single flusher wakeup without per-write allocations. Use test_and_clear_bit to atomically consume the kick request before reading the dirty counter and starting writeback, so that concurrent DONTCACHE writes during writeback can re-set the bit and schedule a follow-up flusher run.

Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches) rather than wb_stat() (which reads only the global counter) to ensure small writes below the percpu batch threshold are visible to the flusher.

In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit inside the unlocked_inode_to_wb_begin/end section for correct cgroup writeback domain targeting, but defer the wb_wakeup() call until after the section ends, since wb_wakeup() uses spin_unlock_irq() which would unconditionally re-enable interrupts while the i_pages xa_lock may still be held under irqsave during a cgroup writeback switch. Pin the wb with wb_get() inside the RCU critical section before calling wb_wakeup() outside it, since cgroup bdi_writeback structures are RCU-freed and the wb pointer could become invalid after unlocked_inode_to_wb_end() drops the RCU read lock.

Also add WB_REASON_DONTCACHE as a new writeback reason for tracing visibility.

dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM, xfs on NVMe, fio io_uring):

Buffered and direct I/O paths are unaffected by this patchset. All improvements are confined to the dontcache path:

Single-stream throughput (MB/s):
                            Before    After    Change
  seq-write/dontcache          298      897     +201%
  rand-write/dontcache         131      236      +80%

Tail latency improvements (seq-write/dontcache):
  p99:     135,266 us ->  23,986 us  (-82%)
  p99.9: 8,925,479 us ->  28,443 us  (-99.7%)

Multi-writer (4 jobs, sequential write):
                                Before    After    Change
  dontcache aggregate (MB/s)     2,529    4,532      +79%
  dontcache p99 (us)             8,553    1,002      -88%
  dontcache p99.9 (us)         109,314    1,057      -99%

Dontcache multi-writer throughput now matches buffered (4,532 vs 4,616 MB/s).

32-file write (Axboe test):
                                Before    After    Change
  dontcache aggregate (MB/s)     1,548    3,499     +126%
  dontcache p99 (us)            10,170      602      -94%
  Peak dirty pages (MB)          1,837      213      -88%

Dontcache now reaches 81% of buffered throughput (was 35%).

Competing writers (dontcache vs buffered, separate files):
                      Before    After
  buffered writer        868      433  MB/s
  dontcache writer       415      433  MB/s
  Aggregate            1,284      866  MB/s

Previously the buffered writer starved the dontcache writer 2:1. With per-bdi_writeback tracking, both writers now receive equal bandwidth. The aggregate matches the buffered-vs-buffered baseline (863 MB/s), indicating fair sharing regardless of I/O mode.

The dontcache writer's p99.9 latency collapsed from 119 ms to 33 ms (-73%), eliminating the severe periodic stalls seen in the baseline. Both writers now share identical latency profiles, matching the buffered-vs-buffered pattern.

The per-bdi_writeback dirty tracking dramatically reduces peak dirty pages in dontcache workloads, with the 32-file test dropping from 1.8 GB to 213 MB. Dontcache sequential write throughput triples and multi-writer throughput reaches parity with buffered I/O, with tail latencies collapsing by 1-2 orders of magnitude.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton
Link: https://patch.msgid.link/20260511-dontcache-v7-3-2848ddce8090@kernel.org
Reviewed-by: Jan Kara
Reviewed-by: Ritesh Harjani (IBM)
Signed-off-by: Christian Brauner (Amutable)
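
The kick/consume protocol described in the commit above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: C11 atomics stand in for the kernel's bit operations, the names struct wb_model, wb_model_init(), writer_dirty_and_kick() and flusher_check_start_dontcache() are hypothetical, and the "wake only on the 0 -> 1 transition" behaviour is assumed by analogy with WB_start_all rather than taken from the patch itself.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the parts of struct bdi_writeback
 * that matter here; this is not the kernel's data structure. */
struct wb_model {
	atomic_ulong state;          /* bit 0 models WB_start_dontcache */
	atomic_long dontcache_dirty; /* models the WB_DONTCACHE_DIRTY counter */
	long wakeups;                /* how often a writer woke the flusher */
};

#define START_DONTCACHE_BIT 1UL

static void wb_model_init(struct wb_model *wb)
{
	atomic_init(&wb->state, 0UL);
	atomic_init(&wb->dontcache_dirty, 0L);
	wb->wakeups = 0;
}

/* Writer side: account the dirtied pages, then request a flusher run.
 * Assumption: only the 0 -> 1 transition of the bit triggers a wakeup,
 * so back-to-back writes coalesce into one flusher run. */
static void writer_dirty_and_kick(struct wb_model *wb, long pages)
{
	unsigned long old;

	atomic_fetch_add(&wb->dontcache_dirty, pages);
	old = atomic_fetch_or(&wb->state, START_DONTCACHE_BIT);
	if (!(old & START_DONTCACHE_BIT))
		wb->wakeups++;
}

/* Flusher side: consume the request first (test-and-clear), then read the
 * counter. A writer that dirties pages after the clear sets the bit again
 * and gets a follow-up run. Returns the number of pages to write back. */
static long flusher_check_start_dontcache(struct wb_model *wb)
{
	unsigned long old = atomic_fetch_and(&wb->state, ~START_DONTCACHE_BIT);

	if (!(old & START_DONTCACHE_BIT))
		return 0;
	return atomic_load(&wb->dontcache_dirty);
}
```

In this model two writes before the flusher runs produce a single wakeup, and a write that arrives after the bit was consumed produces a second one. The model never subtracts written-back pages from the counter, so it says nothing about how the kernel accounts completion.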

mm: track DONTCACHE dirty pages per bdi_writeback

2026-06-04T08:16:50+00:00

Add a per-wb WB_DONTCACHE_DIRTY counter that tracks the number of dirty pages with the dropbehind flag set (i.e., pages dirtied via RWF_DONTCACHE writes).

Increment the counter alongside WB_RECLAIMABLE in folio_account_dirtied() when the folio has the dropbehind flag set, and decrement it in folio_clear_dirty_for_io() and folio_account_cleaned(). Also decrement it when a non-DONTCACHE lookup atomically clears the dropbehind flag on a dirty folio in __filemap_get_folio_mpol(), using folio_test_clear_dropbehind() to prevent concurrent lookups from double-decrementing the counter, and guarding the decrement with mapping_can_writeback() to match the increment path.

Transfer the counter alongside WB_RECLAIMABLE in inode_do_switch_wbs() so that the stat is properly migrated when an inode switches cgroup writeback domains.

The counter will be used by the writeback flusher to determine how many pages to write back when expediting writeback for IOCB_DONTCACHE writes, without flushing the entire BDI's dirty pages.

Suggested-by: Jan Kara
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton
Link: https://patch.msgid.link/20260511-dontcache-v7-2-2848ddce8090@kernel.org
Reviewed-by: Jan Kara
Reviewed-by: Ritesh Harjani (IBM)
Signed-off-by: Christian Brauner (Amutable)
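
The reason for using an atomic test-and-clear on the dropbehind flag can be shown with a short userspace model. This is a sketch, not the kernel code: struct folio_model, account_dirtied(), test_clear_dropbehind() and non_dontcache_lookup() are hypothetical names, atomic_exchange() stands in for folio_test_clear_dropbehind(), and the can_writeback field stands in for the mapping_can_writeback() guard.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the folio state relevant to the counter. */
struct folio_model {
	atomic_bool dropbehind; /* models the dropbehind folio flag */
	bool dirty;
	bool can_writeback;     /* models mapping_can_writeback() */
};

/* Models the per-wb WB_DONTCACHE_DIRTY counter. */
static atomic_long dontcache_dirty;

/* Dirtying path: count the folio only if it is a dropbehind folio on a
 * mapping that can do writeback. */
static void account_dirtied(struct folio_model *f)
{
	f->dirty = true;
	if (f->can_writeback && atomic_load(&f->dropbehind))
		atomic_fetch_add(&dontcache_dirty, 1);
}

/* Atomically clear the flag and report whether it was set before. */
static bool test_clear_dropbehind(struct folio_model *f)
{
	return atomic_exchange(&f->dropbehind, false);
}

/* Non-DONTCACHE lookup: only the caller that actually cleared the flag
 * decrements, so repeated or concurrent lookups cannot decrement twice. */
static void non_dontcache_lookup(struct folio_model *f)
{
	if (test_clear_dropbehind(f) && f->dirty && f->can_writeback)
		atomic_fetch_sub(&dontcache_dirty, 1);
}
```

With a plain "test, then clear" sequence, two lookups could both observe the flag set and both decrement; the exchange makes exactly one of them the winner.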

writeback: use a per-sb counter to drain inode wb switches at umount

2026-05-22T10:06:35+00:00

Tracking in-flight inode wb switches with a single global counter (isw_nr_in_flight) plus a synchronize_rcu() based wait in cgroup_writeback_umount() forces every umount to take a global hit whenever any other superblock on the system has wb switches in flight, even if the superblock being unmounted has none of its own.

Replace the global synchronize_rcu()/flush_workqueue() pair with a per-sb counter, s_isw_nr_in_flight, plus three small helpers:

 - cgroup_writeback_pin(sb)   - increment counter
 - cgroup_writeback_unpin(sb) - decrement and wake drainer if last
 - cgroup_writeback_drain(sb) - wait for counter to reach zero

The wiring is:

 - inode_prepare_wbs_switch() pins before checking SB_ACTIVE and grabbing the inode; failure paths unpin before returning. A lockless SB_ACTIVE check at the top of the function lets us skip the atomic_inc/smp_mb dance once SB_ACTIVE has been cleared (it is monotonic and never set back).
 - process_inode_switch_wbs() unpins after the matching iput().
 - cgroup_writeback_umount() drains the per-sb counter via wait_var_event().

The smp_mb() pair between inode_prepare_wbs_switch() and cgroup_writeback_umount() keeps the SB_ACTIVE / counter ordering: either the umounter sees a non-zero counter and waits, or the switcher sees SB_ACTIVE cleared and aborts before grabbing the inode.

The global isw_nr_in_flight is left in place, since it is still used to throttle in-flight switches via WB_FRN_MAX_IN_FLIGHT.

The rcu_read_lock() extension in inode_switch_wbs() and cleanup_offline_cgwb() that the race fix added is no longer needed and is reverted; the synchronize_rcu() that the race fix added to cgroup_writeback_umount() is dropped as well.

The following numbers were measured on a 16 vCPU QEMU guest with 4 background superblocks each churning "create memcg -> write 1 MiB -> rmdir memcg" to keep the global isw_nr_in_flight non-zero. Latencies are wall-clock around umount(8); only the target sb's umount is measured.

Target sb runs its own cgwb churn:
                               p50       p95       p99       max
  global synchronize_rcu()   67.6 ms   88.3 ms   88.3 ms   96.8 ms
  per-sb counter (this)       7.9 ms   10.0 ms   10.0 ms   10.1 ms

Idle target umount latency under cross-sb cgwb-switch pressure:
                               p50       p95       p99       max
  global synchronize_rcu()   62.7 ms   95.4 ms  108.1 ms  108.6 ms
  per-sb counter (this)       5.3 ms    6.9 ms    7.4 ms    7.4 ms
  no-pressure baseline        4.9 ms    5.9 ms    6.3 ms    6.7 ms

8 concurrent umounts of idle sbs under the same pressure:
                               p50       p95       max
  global synchronize_rcu()   61.3 ms   99.5 ms  113.7 ms
  per-sb counter (this)       8.1 ms    9.1 ms    9.5 ms

In-kernel cgroup_writeback_umount() time across the same run (bpftrace, ~340 calls covering all scenarios):
  global synchronize_rcu()   12371 ms total  (~36 ms / call)
  per-sb counter (this)       1.37 ms total  ( ~4 us / call)

Suggested-by: Christian Brauner
Link: https://lore.kernel.org/r/177910456953.488929.2169908940676707307.b4-review@b4
Reviewed-by: Jan Kara
Signed-off-by: Baokun Li
Link: https://patch.msgid.link/20260521095016.2791354-4-libaokun@linux.alibaba.com
Acked-by: Tejun Heo
Signed-off-by: Christian Brauner (Amutable)
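
The pin / check / unpin ordering from the commit above can be sketched in userspace as follows. This is an illustrative model only: struct sb_model and the function names are hypothetical, sequentially consistent C11 atomics stand in for the atomic_inc()/smp_mb() pairing, and the drain is reduced to a predicate instead of a real wait_var_event() sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the superblock state involved in the handshake. */
struct sb_model {
	atomic_bool active;       /* models SB_ACTIVE */
	atomic_int isw_in_flight; /* models s_isw_nr_in_flight */
};

static void sb_pin(struct sb_model *sb)
{
	atomic_fetch_add(&sb->isw_in_flight, 1);
}

/* Returns true when this was the last pin, i.e. when a waiting drainer
 * would be woken in the real code. */
static bool sb_unpin(struct sb_model *sb)
{
	return atomic_fetch_sub(&sb->isw_in_flight, 1) == 1;
}

/* Switcher side: pin first, then check the active flag. */
static bool prepare_switch(struct sb_model *sb)
{
	/* Lockless early exit: the flag is never set back once cleared. */
	if (!atomic_load_explicit(&sb->active, memory_order_relaxed))
		return false;

	sb_pin(sb); /* seq_cst RMW stands in for atomic_inc + smp_mb() */
	if (!atomic_load(&sb->active)) {
		sb_unpin(sb); /* failure path drops the pin again */
		return false;
	}
	return true; /* caller may now grab the inode */
}

/* Called after the switch work has dropped its inode reference. */
static void finish_switch(struct sb_model *sb)
{
	sb_unpin(sb);
}

/* Umount side: clear the flag, then look at the counter. Returns true if
 * the umounter would have to wait for in-flight switches of this sb. */
static bool umount_must_wait(struct sb_model *sb)
{
	atomic_store(&sb->active, false);
	return atomic_load(&sb->isw_in_flight) != 0;
}
```

Either the umounter observes the pin and waits, or the switcher observes the cleared flag and backs out; only switches on the same superblock influence the result, which is the point of moving the counter into the sb.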

writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount()

2026-05-22T10:06:35+00:00

Commit e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes") replaced the queue_rcu_work() based scheduling of inode wb switches with a plain queue_work(). Since then no switcher goes through call_rcu(), so rcu_barrier() in cgroup_writeback_umount() has no callbacks of its own to wait for.

It still drains unrelated call_rcu() callbacks from other subsystems on busy systems, which incidentally slows umount down; drop it.

Fixes: e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes")
Reviewed-by: Jan Kara
Signed-off-by: Baokun Li
Link: https://patch.msgid.link/20260521095016.2791354-3-libaokun@linux.alibaba.com
Acked-by: Tejun Heo
Signed-off-by: Christian Brauner (Amutable)

writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()

2026-05-22T10:06:35+00:00

When a container exits, the following BUG_ON() is occasionally triggered:

==================================================================
VFS: Busy inodes after unmount of sdb (ext4)
------------[ cut here ]------------
kernel BUG at fs/super.c:695!
CPU: 3 PID: 6 Comm: containerd-shim Tainted: G OE K 6.6 #1
pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : generic_shutdown_super+0xf0/0x100
lr : generic_shutdown_super+0xf0/0x100
Call trace:
 generic_shutdown_super+0xf0/0x100
 kill_block_super+0x20/0x48
 ext4_kill_sb+0x28/0x60
 deactivate_locked_super+0x54/0x130
 deactivate_super+0x84/0xa0
 cleanup_mnt+0xa4/0x140
 __cleanup_mnt+0x18/0x28
 task_work_run+0x78/0xe0
 do_notify_resume+0x204/0x240
==================================================================

The root cause is a race between cgroup_writeback_umount() and inode_switch_wbs()/cleanup_offline_cgwb(). There is a window between inode_prepare_wbs_switch() returning true and the subsequent wb_queue_isw() call. Following is the process that triggers the issue:

CPU A (umount)                        | CPU B (writeback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                       inode_switch_wbs/cleanup_offline_cgwb
                                        atomic_inc(&isw_nr_in_flight)
                                        inode_prepare_wbs_switch
                                         -> passes SB_ACTIVE check
                                         __iget(inode)
generic_shutdown_super
 sb->s_flags &= ~SB_ACTIVE
 cgroup_writeback_umount(sb)
  smp_mb()
  atomic_read(&isw_nr_in_flight)
  rcu_barrier()
   -> no pending RCU callbacks
  flush_workqueue(isw_wq)
   -> nothing queued, returns
 evict_inodes(sb)
  -> Inode skipped as isw still holds a ref.
 sop->put_super(sb)
  /* destroys percpu counters */
 -> VFS: Busy inodes after unmount!
                                        wb_queue_isw()
                                         queue_work(isw_wq, ...)
                                       /* later in work function */
                                       inode_switch_wbs_work_fn
                                        process_inode_switch_wbs
                                         iput() -> evict
                                          percpu_counter_dec() // UAF!

Fix this by extending the RCU read-side critical section in inode_switch_wbs() and cleanup_offline_cgwb() to cover from inode_prepare_wbs_switch() through wb_queue_isw(). Since there is no sleep in this window, rcu_read_lock() can be used. Then add a synchronize_rcu() in cgroup_writeback_umount() before the existing rcu_barrier(), so that all in-flight switchers that have passed the SB_ACTIVE check have completed queue_work() before flush_workqueue() is called.

The existing rcu_barrier() is intentionally retained so this fix can be backported unchanged to stable kernels (5.10.y, 6.6.y, ...) that still queue switches via queue_rcu_work(). It is a no-op on current mainline (since commit e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes")) and is removed in a follow-up patch.

Fixes: a1a0e23e4903 ("writeback: flush inode cgroup wb switches instead of pinning super_block")
Cc: stable@vger.kernel.org
Suggested-by: Jan Kara
Link: https://lore.kernel.org/all/mxnjq2l6guusfchvauxr3v7c4bwjasybxlleqbbh4efloeqspz@iqylk76ohufz
Reviewed-by: Jan Kara
Signed-off-by: Baokun Li
Link: https://patch.msgid.link/20260521095016.2791354-2-libaokun@linux.alibaba.com
Acked-by: Tejun Heo
Signed-off-by: Christian Brauner (Amutable)

Merge tag 'vfs-7.1-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-04-24T00:08:04+00:00

Pull vfs fixes from Christian Brauner:

 - eventpoll: fix ep_remove() UAF and follow-up cleanup
 - fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference error
 - writeback: Fix use after free in inode_switch_wbs_work_fn()
 - fuse: reject oversized dirents in page cache
 - fs: aio: reject partial mremap to avoid Null-pointer-dereference error
 - nstree: fix func. parameter kernel-doc warnings
 - fs: Handle multiply claimed blocks more gracefully with mmb

* tag 'vfs-7.1-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  eventpoll: drop vestigial epi->dying flag
  eventpoll: drop dead bool return from ep_remove_epi()
  eventpoll: refresh eventpoll_release() fast-path comment
  eventpoll: move f_lock acquisition into ep_remove_file()
  eventpoll: fix ep_remove struct eventpoll / struct file UAF
  eventpoll: move epi_fget() up
  eventpoll: rename ep_remove_safe() back to ep_remove()
  eventpoll: drop vestigial __ prefix from ep_remove_{file,epi}()
  eventpoll: kill __ep_remove()
  eventpoll: split __ep_remove()
  eventpoll: use hlist_is_singular_node() in __ep_remove()
  fs: Handle multiply claimed blocks more gracefully with mmb
  nstree: fix func. parameter kernel-doc warnings
  fs: aio: reject partial mremap to avoid Null-pointer-dereference error
  fuse: reject oversized dirents in page cache
  writeback: Fix use after free in inode_switch_wbs_work_fn()
  fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference error

writeback: Fix use after free in inode_switch_wbs_work_fn()

2026-04-23T22:34:58+00:00

inode_switch_wbs_work_fn() has a loop like:

  wb_get(new_wb);
  while (1) {
          list = llist_del_all(&new_wb->switch_wbs_ctxs);
          /* Nothing to do? */
          if (!list)
                  break;
          ... process the items ...
  }

Now adding of items to the list looks like:

  wb_queue_isw()
    if (llist_add(&isw->list, &wb->switch_wbs_ctxs))
            queue_work(isw_wq, &wb->switch_work);

Because inode_switch_wbs_work_fn() loops when processing isw items, it can happen that wb->switch_work is pending while wb->switch_wbs_ctxs is empty. This is a problem because in that case wb can get freed (no isw items -> no wb reference) while the work is still pending causing use-after-free issues.

We cannot just fix this by cancelling work when freeing wb because that could still trigger problematic 0 -> 1 transitions on wb refcount due to wb_get() in inode_switch_wbs_work_fn(). It could be all handled with more careful code but that seems unnecessarily complex so let's avoid that until it is proven that the looping actually brings practical benefit. Just remove the loop from inode_switch_wbs_work_fn() instead. That way when wb_queue_isw() queues work, we are guaranteed we have added the first item to wb->switch_wbs_ctxs and nobody is going to remove it (and drop the wb reference it holds) until the queued work runs.

Fixes: e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara
Link: https://patch.msgid.link/20260413093618.17244-2-jack@suse.cz
Acked-by: Tejun Heo
Signed-off-by: Christian Brauner
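
The "queue work only when the list was empty" rule that the fix relies on can be modelled in userspace. This is a sketch under assumptions: model_llist_add() and model_llist_del_all() are hypothetical C11 re-implementations that mimic the return conventions of the kernel's llist_add() and llist_del_all(), and struct isw_wb_model with its work_queued counter is an invented stand-in for the wb and its work item.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct lhead {
	_Atomic(struct node *) first;
};

/* Lock-free push. Returns true if the list was empty before the add,
 * which is the condition used to decide whether to queue work. */
static bool model_llist_add(struct node *n, struct lhead *h)
{
	struct node *old = atomic_load(&h->first);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&h->first, &old, n));
	return old == NULL;
}

/* Detach the whole list in one step and return its former head. */
static struct node *model_llist_del_all(struct lhead *h)
{
	return atomic_exchange(&h->first, (struct node *)NULL);
}

/* Hypothetical stand-in for the wb: a list plus a count of queued work. */
struct isw_wb_model {
	struct lhead ctxs;
	int work_queued;
};

static void isw_wb_model_init(struct isw_wb_model *wb)
{
	atomic_init(&wb->ctxs.first, (struct node *)NULL);
	wb->work_queued = 0;
}

static void queue_isw(struct isw_wb_model *wb, struct node *isw)
{
	if (model_llist_add(isw, &wb->ctxs))
		wb->work_queued++; /* models queue_work() */
}

/* Single-pass work function, as after the fix: take the list once and
 * process it, without looping back to look for more items. Returns the
 * number of items processed. */
static int work_fn(struct isw_wb_model *wb)
{
	int n = 0;

	for (struct node *p = model_llist_del_all(&wb->ctxs); p; p = p->next)
		n++;
	return n;
}
```

Because the work function takes the list exactly once, every queued work item corresponds to a list that was non-empty when it was queued, and an item added after the detach triggers a fresh queue_work() via the empty-to-non-empty transition.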

Merge tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2026-04-19T15:01:17+00:00

Pull more MM updates from Andrew Morton:

 - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)

   Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory

 - "fix unexpected type conversions and potential overflows" (Qi Zheng)

   Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series

 - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao)

   Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time

 - "liveupdate: prevent double preservation" (Pasha Tatashin)

   Teach LUO to avoid managing the same file across different active sessions

 - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin)

   Address an issue with how LUO handles module reference counting and unregistration during module unloading

 - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)

   Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources

 - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park)

   Address unlikely but possible leaks and deadlocks in damon_call() and damos_walk()

 - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)

   Fix a couple of root-only wild pointer dereferences

 - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park)

   Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time

 - "Minor hmm_test fixes and cleanups" (Alistair Popple)

   Bugfixes and a cleanup for the HMM kernel selftests

 - "Modify memfd_luo code" (Chenghao Duan)

   Cleanups, simplifications and speedups to the memfd_luo code

 - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)

   Support for userfaultfd in guest_memfd

 - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu)

   Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels

 - "mm/mprotect: micro-optimization work" (Pedro Falcato)

   A couple of nice speedups for mprotect()

 - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)

   Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree

* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
  MAINTAINERS: add page cache reviewer
  mm/vmscan: avoid false-positive -Wuninitialized warning
  MAINTAINERS: update Dave's kdump reviewer email address
  MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
  MAINTAINERS: drop include/linux/kho/abi/ from KHO
  MAINTAINERS: update KHO and LIVE UPDATE maintainers
  MAINTAINERS: update kexec/kdump maintainers entries
  mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
  selftests: mm: skip charge_reserved_hugetlb without killall
  userfaultfd: allow registration of ranges below mmap_min_addr
  mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
  mm/hugetlb: fix early boot crash on parameters without '=' separator
  zram: reject unrecognized type= values in recompress_store()
  docs: proc: document ProtectionKey in smaps
  mm/mprotect: special-case small folios when applying permissions
  mm/mprotect: move softleaf code out of the main function
  mm: remove '!root_reclaim' checking in should_abort_scan()
  mm/sparse: fix comment for section map alignment
  mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
  selftests/mm: transhuge_stress: skip the test when thp not available
  ...

writeback: prevent memory cgroup release in writeback module

2026-04-18T07:10:45+00:00

In the near future, a folio will no longer pin its corresponding memory cgroup. To ensure safety, it will only be appropriate to hold the rcu read lock or acquire a reference to the memory cgroup returned by folio_memcg(), thereby preventing it from being released.

In the current patch, the function get_mem_cgroup_css_from_folio() and the rcu read lock are employed to safeguard against the release of the memory cgroup.

This serves as a preparatory measure for the reparenting of the LRU pages.

Link: https://lore.kernel.org/645f99bc344575417f67def3744f975596df2793.1772711148.git.zhengqi.arch@bytedance.com
Signed-off-by: Muchun Song
Signed-off-by: Qi Zheng
Reviewed-by: Harry Yoo
Acked-by: Johannes Weiner
Acked-by: Shakeel Butt
Cc: Allen Pais
Cc: Axel Rasmussen
Cc: Baoquan He
Cc: Chengming Zhou
Cc: Chen Ridong
Cc: David Hildenbrand
Cc: Hamza Mahfooz
Cc: Hugh Dickins
Cc: Imran Khan
Cc: Kamalesh Babulal
Cc: Lance Yang
Cc: Liam Howlett
Cc: Lorenzo Stoakes (Oracle)
Cc: Michal Hocko
Cc: Michal Koutný
Cc: Mike Rapoport
Cc: Muchun Song
Cc: Nhat Pham
Cc: Roman Gushchin
Cc: Suren Baghdasaryan
Cc: Usama Arif
Cc: Vlastimil Babka
Cc: Wei Xu
Cc: Yosry Ahmed
Cc: Yuanchu Xie
Cc: Zi Yan
Signed-off-by: Andrew Morton

writeback: don't block sync for filesystems with no data integrity guarantees

2026-03-20T13:18:56+00:00

Add a SB_I_NO_DATA_INTEGRITY superblock flag for filesystems that cannot guarantee data persistence on sync (e.g. fuse). For superblocks with this flag set, sync kicks off writeback of dirty inodes but does not wait for the flusher threads to complete the writeback.

This replaces the per-inode AS_NO_DATA_INTEGRITY mapping flag added in commit f9a49aa302a0 ("fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes()"). The flag belongs at the superblock level because data integrity is a filesystem-wide property, not a per-inode one. Having this flag at the superblock level also allows us to skip having to iterate every dirty inode in wait_sb_inodes() only to skip each inode individually.

Prior to this commit, mappings with no data integrity guarantees skipped waiting on writeback completion but still waited on the flusher threads to finish initiating the writeback. Waiting on the flusher threads is unnecessary. This commit kicks off writeback but does not wait on the flusher threads. This change properly addresses a recent report [1] for a suspend-to-RAM hang seen on fuse-overlayfs that was caused by waiting on the flusher threads to finish:

Workqueue: pm_fs_sync pm_fs_sync_work_fn
Call Trace:
 __schedule+0x457/0x1720
 schedule+0x27/0xd0
 wb_wait_for_completion+0x97/0xe0
 sync_inodes_sb+0xf8/0x2e0
 __iterate_supers+0xdc/0x160
 ksys_sync+0x43/0xb0
 pm_fs_sync_work_fn+0x17/0xa0
 process_one_work+0x193/0x350
 worker_thread+0x1a1/0x310
 kthread+0xfc/0x240
 ret_from_fork+0x243/0x280
 ret_from_fork_asm+0x1a/0x30

On fuse this is problematic because there are paths that may cause the flusher thread to block (e.g. if systemd freezes the user session cgroups first, which freezes the fuse daemon, before invoking the kernel suspend. The kernel suspend triggers ->write_node() which on fuse issues a synchronous setattr request, which cannot be processed since the daemon is frozen. Or if the daemon is buggy and cannot properly complete writeback, initiating writeback on a dirty folio already under writeback leads to writeback_get_folio() -> folio_prepare_writeback() -> unconditional wait on writeback to finish, which will cause a hang).

This commit restores fuse to its prior behavior before tmp folios were removed, where sync was essentially a no-op.

[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1a-asuvfrbKXbEwwDSctvemF+6zfhdnuzO65Pt8HsFSRw@mail.gmail.com/T/#m632c4648e9cafc4239299887109ebd880ac6c5c1

Fixes: 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree")
Reported-by: John
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong
Link: https://patch.msgid.link/20260320005145.2483161-2-joannelkoong@gmail.com
Reviewed-by: Jan Kara
Reviewed-by: David Hildenbrand (Arm)
Signed-off-by: Christian Brauner
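
The behavioural difference the flag introduces can be summarised as a small decision function. This is a deliberately simplified sketch, not the sync code itself: MODEL_NO_DATA_INTEGRITY, struct sync_plan and plan_sync() are hypothetical names, and the three booleans only label the steps named in the commit message (kick writeback, wait for the flusher threads, wait on inodes under writeback).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit modelling SB_I_NO_DATA_INTEGRITY. */
#define MODEL_NO_DATA_INTEGRITY (1u << 0)

/* Which steps a sync of one superblock would perform in this model. */
struct sync_plan {
	bool kick_writeback;   /* start writeback of dirty inodes */
	bool wait_for_flusher; /* wait for flusher threads to finish */
	bool wait_on_inodes;   /* walk inodes and wait on their writeback */
};

static struct sync_plan plan_sync(unsigned int sb_iflags)
{
	struct sync_plan plan = {
		.kick_writeback = true,
		.wait_for_flusher = true,
		.wait_on_inodes = true,
	};

	/* No integrity guarantee: still kick writeback, but never block on
	 * the flusher or on individual inodes. */
	if (sb_iflags & MODEL_NO_DATA_INTEGRITY) {
		plan.wait_for_flusher = false;
		plan.wait_on_inodes = false;
	}
	return plan;
}
```

Making the decision once per superblock, rather than per mapping, is what lets the real code avoid iterating every dirty inode only to skip each one.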