kernel/linux.git/fs/fuse/file.c, branch v6.1.168

fuse: fix readahead reclaim deadlock

2026-01-11T14:19:20+00:00

[ Upstream commit bd5603eaae0aabf527bfb3ce1bb07e979ce5bd50 ] Commit e26ee4efbc79 ("fuse: allocate ff->release_args only if release is needed") skips allocating ff->release_args if the server does not implement open. However in doing so, fuse_prepare_release() now skips grabbing the reference on the inode, which makes it possible for an inode to be evicted from the dcache while there are inflight readahead requests. This causes a deadlock if the server triggers reclaim while servicing the readahead request and reclaim attempts to evict the inode of the file being read ahead. Since the folio is locked during readahead, when reclaim evicts the fuse inode and fuse_evict_inode() attempts to remove all folios associated with the inode from the page cache (truncate_inode_pages_range()), reclaim will block forever waiting for the lock since readahead cannot relinquish the lock because it is itself blocked in reclaim: >>> stack_trace(1504735) folio_wait_bit_common (mm/filemap.c:1308:4) folio_lock (./include/linux/pagemap.h:1052:3) truncate_inode_pages_range (mm/truncate.c:336:10) fuse_evict_inode (fs/fuse/inode.c:161:2) evict (fs/inode.c:704:3) dentry_unlink_inode (fs/dcache.c:412:3) __dentry_kill (fs/dcache.c:615:3) shrink_kill (fs/dcache.c:1060:12) shrink_dentry_list (fs/dcache.c:1087:3) prune_dcache_sb (fs/dcache.c:1168:2) super_cache_scan (fs/super.c:221:10) do_shrink_slab (mm/shrinker.c:435:9) shrink_slab (mm/shrinker.c:626:10) shrink_node (mm/vmscan.c:5951:2) shrink_zones (mm/vmscan.c:6195:3) do_try_to_free_pages (mm/vmscan.c:6257:3) do_swap_page (mm/memory.c:4136:11) handle_pte_fault (mm/memory.c:5562:10) handle_mm_fault (mm/memory.c:5870:9) do_user_addr_fault (arch/x86/mm/fault.c:1338:10) handle_page_fault (arch/x86/mm/fault.c:1481:3) exc_page_fault (arch/x86/mm/fault.c:1539:2) asm_exc_page_fault+0x22/0x27 Fix this deadlock by allocating ff->release_args and grabbing the reference on the inode when preparing the file for release even if the server does not implement open. The inode reference will be dropped when the last reference on the fuse file is dropped (see fuse_file_put() -> fuse_release_end()). Fixes: e26ee4efbc79 ("fuse: allocate ff->release_args only if release is needed") Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong Reported-by: Omar Sandoval Signed-off-by: Miklos Szeredi Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

fuse: fix livelock in synchronous file put from fuseblk workers

2025-10-29T13:04:40+00:00

[ Upstream commit 26e5c67deb2e1f42a951f022fdf5b9f7eb747b01 ] I observed a hang when running generic/323 against a fuseblk server. This test opens a file, initiates a lot of AIO writes to that file descriptor, and closes the file descriptor before the writes complete. Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for responses from the fuseblk server: # cat /proc/372265/task/372313/stack [<0>] request_wait_answer+0x1fe/0x2a0 [fuse] [<0>] __fuse_simple_request+0xd3/0x2b0 [fuse] [<0>] fuse_do_getattr+0xfc/0x1f0 [fuse] [<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse] [<0>] aio_read+0x130/0x1e0 [<0>] io_submit_one+0x542/0x860 [<0>] __x64_sys_io_submit+0x98/0x1a0 [<0>] do_syscall_64+0x37/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 But the /weird/ part is that the fuseblk server threads are waiting for responses from itself: # cat /proc/372210/task/372232/stack [<0>] request_wait_answer+0x1fe/0x2a0 [fuse] [<0>] __fuse_simple_request+0xd3/0x2b0 [fuse] [<0>] fuse_file_put+0x9a/0xd0 [fuse] [<0>] fuse_release+0x36/0x50 [fuse] [<0>] __fput+0xec/0x2b0 [<0>] task_work_run+0x55/0x90 [<0>] syscall_exit_to_user_mode+0xe9/0x100 [<0>] do_syscall_64+0x43/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 The fuseblk server is fuse2fs so there's nothing all that exciting in the server itself. So why is the fuse server calling fuse_file_put? The commit message for the fstest sheds some light on that: "By closing the file descriptor before calling io_destroy, you pretty much guarantee that the last put on the ioctx will be done in interrupt context (during I/O completion). Aha. AIO fgets a new struct file from the fd when it queues the ioctx. The completion of the FUSE_WRITE command from userspace causes the fuse server to call the AIO completion function. The completion puts the struct file, queuing a delayed fput to the fuse server task. When the fuse server task returns to userspace, it has to run the delayed fput, which in the case of a fuseblk server, it does synchronously. Sending the FUSE_RELEASE command sychronously from fuse server threads is a bad idea because a client program can initiate enough simultaneous AIOs such that all the fuse server threads end up in delayed_fput, and now there aren't any threads left to handle the queued fuse commands. Fix this by only using asynchronous fputs when closing files, and leave a comment explaining why. Cc: stable@vger.kernel.org # v2.6.38 Fixes: 5a18ec176c934c ("fuse: fix hang of single threaded fuseblk filesystem") Signed-off-by: Darrick J. Wong Signed-off-by: Miklos Szeredi Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

fuse: allocate ff->release_args only if release is needed

2025-10-29T13:04:40+00:00

[ Upstream commit e26ee4efbc79610b20e7abe9d96c87f33dacc1ff ] This removed the need to pass isdir argument to fuse_put_file(). Signed-off-by: Amir Goldstein Signed-off-by: Miklos Szeredi Stable-dep-of: 26e5c67deb2e ("fuse: fix livelock in synchronous file put from fuseblk workers") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

fuse: prevent overflow in copy_file_range return value

2025-09-19T14:29:57+00:00

commit 1e08938c3694f707bb165535df352ac97a8c75c9 upstream. The FUSE protocol uses struct fuse_write_out to convey the return value of copy_file_range, which is restricted to uint32_t. But the COPY_FILE_RANGE interface supports a 64-bit size copies. Currently the number of bytes copied is silently truncated to 32-bit, which may result in poor performance or even failure to copy in case of truncation to zero. Reported-by: Florian Weimer Closes: https://lore.kernel.org/all/lhuh5ynl8z5.fsf@oldenburg.str.redhat.com/ Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Cc: # v4.20 Signed-off-by: Miklos Szeredi Signed-off-by: Greg Kroah-Hartman

fuse: check if copy_file_range() returns larger than requested size

2025-09-19T14:29:57+00:00

commit e5203209b3935041dac541bc5b37efb44220cc0b upstream. Just like write(), copy_file_range() should check if the return value is less or equal to the requested number of bytes. Reported-by: Chunsheng Luo Closes: https://lore.kernel.org/all/20250807062425.694-1-luochunsheng@ustc.edu/ Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Cc: # v4.20 Signed-off-by: Miklos Szeredi Signed-off-by: Greg Kroah-Hartman

fuse: fix dax truncate/punch_hole fault path

2025-04-10T12:33:36+00:00

[ Upstream commit 7851bf649d423edd7286b292739f2eefded3d35c ] Patch series "fs/dax: Fix ZONE_DEVICE page reference counts", v9. Device and FS DAX pages have always maintained their own page reference counts without following the normal rules for page reference counting. In particular pages are considered free when the refcount hits one rather than zero and refcounts are not added when mapping the page. Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary mechanism for allowing GUP to hold references on the page (see get_dev_pagemap). However there doesn't seem to be any reason why FS DAX pages need their own reference counting scheme. By treating the refcounts on these pages the same way as normal pages we can remove a lot of special checks. In particular pXd_trans_huge() becomes the same as pXd_leaf(), although I haven't made that change here. It also frees up a valuable SW define PTE bit on architectures that have devmap PTE bits defined. It also almost certainly allows further clean-up of the devmap managed functions, but I have left that as a future improvment. It also enables support for compound ZONE_DEVICE pages which is one of my primary motivators for doing this work. This patch (of 20): FS DAX requires file systems to call into the DAX layout prior to unlinking inodes to ensure there is no ongoing DMA or other remote access to the direct mapped page. The fuse file system implements fuse_dax_break_layouts() to do this which includes a comment indicating that passing dmap_end == 0 leads to unmapping of the whole file. However this is not true - passing dmap_end == 0 will not unmap anything before dmap_start, and further more dax_layout_busy_page_range() will not scan any of the range to see if there maybe ongoing DMA access to the range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts() which will invalidate the entire file range to dax_layout_busy_page_range(). Link: https://lkml.kernel.org/r/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/f09a34b6c40032022e4ddee6fadb7cc676f08867.1740713401.git-series.apopple@nvidia.com Fixes: 6ae330cad6ef ("virtiofs: serialize truncate/punch_hole and dax fault path") Signed-off-by: Alistair Popple Co-developed-by: Dan Williams Signed-off-by: Dan Williams Reviewed-by: Balbir Singh Tested-by: Alison Schofield Cc: Vivek Goyal Cc: Alexander Gordeev Cc: Asahi Lina Cc: Bjorn Helgaas Cc: Catalin Marinas Cc: Christian Borntraeger Cc: Christoph Hellwig Cc: Chunyan Zhang Cc: "Darrick J. Wong" Cc: Dave Chinner Cc: Dave Hansen Cc: Dave Jiang Cc: David Hildenbrand Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Huacai Chen Cc: Ira Weiny Cc: Jan Kara Cc: Jason Gunthorpe Cc: Jason Gunthorpe Cc: John Hubbard Cc: linmiaohe Cc: Logan Gunthorpe Cc: Matthew Wilcow (Oracle) Cc: Michael "Camp Drill Sergeant" Ellerman Cc: Nicholas Piggin Cc: Peter Xu Cc: Sven Schnelle Cc: Ted Ts'o Cc: Vasily Gorbik Cc: Vishal Verma Cc: WANG Xuerui Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Sasha Levin

fuse: allow non-extending parallel direct writes on the same file

2024-09-12T09:10:27+00:00

[ Upstream commit 153524053bbb0d27bb2e0be36d1b46862e9ce74c ] In general, as of now, in FUSE, direct writes on the same file are serialized over inode lock i.e we hold inode lock for the full duration of the write request. I could not find in fuse code and git history a comment which clearly explains why this exclusive lock is taken for direct writes. Following might be the reasons for acquiring an exclusive lock but not be limited to 1) Our guess is some USER space fuse implementations might be relying on this lock for serialization. 2) The lock protects against file read/write size races. 3) Ruling out any issues arising from partial write failures. This patch relaxes the exclusive lock for direct non-extending writes only. File size extending writes might not need the lock either, but we are not entirely sure if there is a risk to introduce any kind of regression. Furthermore, benchmarking with fio does not show a difference between patch versions that take on file size extension a) an exclusive lock and b) a shared lock. A possible example of an issue with i_size extending writes are write error cases. Some writes might succeed and others might fail for file system internal reasons - for example ENOSPACE. With parallel file size extending writes it _might_ be difficult to revert the action of the failing write, especially to restore the right i_size. With these changes, we allow non-extending parallel direct writes on the same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (flag is passed from libfuse to fuse kernel as part of file open/create), we do not take exclusive lock anymore, but instead use a shared lock that allows non-extending writes to run in parallel. FUSE implementations which rely on this inode lock for serialization can continue to do so and serialized direct writes are still the default. Implementations that do not do write serialization need to be updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file open/create reply. On patch review there were concerns that network file systems (or vfs multiple mounts of the same file system) might have issues with parallel writes. We believe this is not the case, as this is just a local lock, which network file systems could not rely on anyway. I.e. this lock is just for local consistency. Signed-off-by: Dharmendra Singh Signed-off-by: Bernd Schubert Signed-off-by: Miklos Szeredi Stable-dep-of: 3002240d1649 ("fuse: fix memory leak in fuse_create_open") Signed-off-by: Sasha Levin

fuse: update stats for pages in dropped aux writeback list

2024-09-12T09:10:17+00:00

commit f7790d67785302b3116bbbfda62a5a44524601a3 upstream. In the case where the aux writeback list is dropped (e.g. the pages have been truncated or the connection is broken), the stats for its pages and backing device info need to be updated as well. Fixes: e2653bd53a98 ("fuse: fix leaked aux requests") Signed-off-by: Joanne Koong Reviewed-by: Josef Bacik Cc: # v5.1 Signed-off-by: Miklos Szeredi Signed-off-by: Greg Kroah-Hartman

attr: use consistent sgid stripping checks

2023-03-03T10:52:25+00:00

commit ed5a7047d2011cb6b2bf84ceb6680124cc6a7d95 upstream. Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. Reviewed-by: Amir Goldstein Signed-off-by: Christian Brauner (Microsoft) Signed-off-by: Amir Goldstein Signed-off-by: Greg Kroah-Hartman

fuse: lock inode unconditionally in fuse_fallocate()

2022-11-23T08:10:42+00:00

file_modified() must be called with inode lock held. fuse_fallocate() didn't lock the inode in case of just FALLOC_KEEP_SIZE flags value, which resulted in a kernel Warning in notify_change(). Lock the inode unconditionally, like all other fallocate implementations do. Reported-by: Pengfei Xu Reported-and-tested-by: syzbot+462da39f0667b357c4b6@syzkaller.appspotmail.com Fixes: 4a6f278d4827 ("fuse: add file_modified() to fallocate") Signed-off-by: Miklos Szeredi