summaryrefslogtreecommitdiff
path: root/include/linux/fs.h
AgeCommit message (Collapse)AuthorFilesLines
2026-04-19Merge tag 'mm-stable-2026-04-18-02-14' of ↵Linus Torvalds1-8/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song) Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory - "fix unexpected type conversions and potential overflows" (Qi Zheng) Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao) Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time - "liveupdate: prevent double preservation" (Pasha Tatashin) Teach LUO to avoid managing the same file across different active sessions - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin) Address an issue with how LUO handles module reference counting and unregistration during module unloading - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar) Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park) Address unlikely but possible leaks and deadlocks in damon_call() and damon_walk() - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park) Fix a couple of root-only wild pointer dereferences - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park) Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time - "Minor hmm_test fixes and cleanups" (Alistair Popple) Bugfixes and a cleanup for the HMM kernel selftests - "Modify memfd_luo code" (Chenghao Duan) Cleanups, simplifications and speedups to the memfd_lou code - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport) Support for userfaultfd in guest_memfd - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu) Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels - "mm/mprotect: micro-optimization work" (Pedro Falcato) A couple of nice speedups for mprotect() - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav) Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree * tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits) MAINTAINERS: add page cache reviewer mm/vmscan: avoid false-positive -Wuninitialized warning MAINTAINERS: update Dave's kdump reviewer email address MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE MAINTAINERS: drop include/linux/kho/abi/ from KHO MAINTAINERS: update KHO and LIVE UPDATE maintainers MAINTAINERS: update kexec/kdump maintainers entries mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() selftests: mm: skip charge_reserved_hugetlb without killall userfaultfd: allow registration of ranges below mmap_min_addr mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update mm/hugetlb: fix early boot crash on parameters without '=' separator zram: reject unrecognized type= values in recompress_store() docs: proc: document ProtectionKey in smaps mm/mprotect: special-case small folios when applying permissions mm/mprotect: move softleaf code out of the main function mm: remove '!root_reclaim' checking in should_abort_scan() mm/sparse: fix comment for section map alignment mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() selftests/mm: transhuge_stress: skip the test when thp not available ...
2026-04-18mm/vma: remove __vma_check_mmap_hook()Lorenzo Stoakes1-8/+1
Commit c50ca15dd496 ("mm: add vm_ops->mapped hook") introduced __vma_check_mmap_hook() in order to assert that a driver doesn't incorrectly implement both an f_op->mmap() and a vm_ops->mapped hook, the latter of which would not ultimately get invoked. However, this did not correctly account for stacked drivers (or drivers that otherwise use the compatibility layer) which might recursively call an mmap_prepare hook via the compatibility layer. Thus the nested mmap_prepare() invocation might result in a VMA which has vm_ops->mapped set with an overlaying mmap() hook, causing the __vma_check_mmap_hook() to fail in vfs_mmap(), wrongly failing the operation. This patch resolves this by simply removing the check, as we can't be certain that an mmap() hook doesn't at some point invoke the compatibility layer, and it's not worth trying to track it. Link: https://lore.kernel.org/20260413105713.92625-1-ljs@kernel.org Fixes: c50ca15dd496 ("mm: add vm_ops->mapped hook") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/adx2ws5z0NMIe5Yj@shinmob/ Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-15Merge tag 'mm-stable-2026-04-13-21-45' of ↵Linus Torvalds1-3/+11
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...
2026-04-14Merge tag 'lsm-pr-20260410' of ↵Linus Torvalds1-0/+13
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull LSM updates from Paul Moore: "We only have five patches in the LSM tree, but three of the five are for an important bugfix relating to overlayfs and the mmap() and mprotect() access controls for LSMs. Highlights below: - Fix problems with the mmap() and mprotect() LSM hooks on overlayfs As we are dealing with problems both in mmap() and mprotect() there are essentially two components to this fix, spread across three patches with all marked for stable. The simplest portion of the fix is the creation of a new LSM hook, security_mmap_backing_file(), that is used to enforce LSM mmap() access controls on backing files in the stacked/overlayfs case. The existing security_mmap_file() does not have visibility past the user file. You can see from the associated SELinux hook callback the code is fairly straightforward. The mprotect() fix is a bit more complicated as there is no way in the mprotect() code path to inspect both the user and backing files, and bolting on a second file reference to vm_area_struct wasn't really an option. The solution taken here adds a LSM security blob and associated hooks to the backing_file struct that LSMs can use to capture and store relevant information from the user file. While the necessary SELinux information is relatively small, a single u32, I expect other LSMs to require more than that, and a dedicated backing_file LSM blob provides a storage mechanism without negatively impacting other filesystems. I want to note that other LSMs beyond SELinux have been involved in the discussion of the fixes presented here and they are working on their own related changes using these new hooks, but due to other issues those patches will be coming at a later date. - Use kstrdup_const()/kfree_const() for securityfs symlink targets - Resolve a handful of kernel-doc warnings in cred.h" * tag 'lsm-pr-20260410' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: selinux: fix overlayfs mmap() and mprotect() access checks lsm: add backing_file LSM hooks fs: prepare for adding LSM blob to backing_file securityfs: use kstrdup_const() to manage symlink targets cred: fix kernel-doc warnings in cred.h
2026-04-14Merge tag 'vfs-7.1-rc1.misc' of ↵Linus Torvalds1-3/+0
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - coredump: add tracepoint for coredump events - fs: hide file and bfile caches behind runtime const machinery Fixes: - fix architecture-specific compat_ftruncate64 implementations - dcache: Limit the minimal number of bucket to two - fs/omfs: reject s_sys_blocksize smaller than OMFS_DIR_START - fs/mbcache: cancel shrink work before destroying the cache - dcache: permit dynamic_dname()s up to NAME_MAX Cleanups: - remove or unexport unused fs_context infrastructure - trivial ->setattr cleanups - selftests/filesystems: Assume that TIOCGPTPEER is defined - writeback: fix kernel-doc function name mismatch for wb_put_many() - autofs: replace manual symlink buffer allocation in autofs_dir_symlink - init/initramfs.c: trivial fix: FSM -> Finite-state machine - fs: remove stale and duplicate forward declarations - readdir: Introduce dirent_size() - fs: Replace user_access_{begin/end} by scoped user access - kernel: acct: fix duplicate word in comment - fs: write a better comment in step_into() concerning .mnt assignment - fs: attr: fix comment formatting and spelling issues" * tag 'vfs-7.1-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) dcache: permit dynamic_dname()s up to NAME_MAX fs: attr: fix comment formatting and spelling issues fs: hide file and bfile caches behind runtime const machinery fs: write a better comment in step_into() concerning .mnt assignment proc: rename proc_notify_change to proc_setattr proc: rename proc_setattr to proc_nochmod_setattr affs: rename affs_notify_change to affs_setattr adfs: rename adfs_notify_change to adfs_setattr hfs: update comments on hfs_inode_setattr kernel: acct: fix duplicate word in comment fs: Replace user_access_{begin/end} by scoped user access readdir: Introduce dirent_size() coredump: add tracepoint for coredump events fs: remove do_sys_truncate fs: pass on FTRUNCATE_* flags to do_truncate fs: fix archiecture-specific compat_ftruncate64 fs: remove stale and duplicate forward declarations init/initramfs.c: trivial fix: FSM -> Finite-state machine autofs: replace manual symlink buffer allocation in autofs_dir_symlink fs/mbcache: cancel shrink work before destroying the cache ...
2026-04-13Merge tag 'vfs-7.1-rc1.bh.metadata' of ↵Linus Torvalds1-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs buffer_head updates from Christian Brauner: "This cleans up the mess that has accumulated over the years in metadata buffer_head tracking for inodes. It moves the tracking into dedicated structure in filesystem-private part of the inode (so that we don't use private_list, private_data, and private_lock in struct address_space), and also moves couple other users of private_data and private_list so these are removed from struct address_space saving 3 longs in struct inode for 99% of inodes" * tag 'vfs-7.1-rc1.bh.metadata' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits) fs: Drop i_private_list from address_space fs: Drop mapping_metadata_bhs from address space ext4: Track metadata bhs in fs-private inode part minix: Track metadata bhs in fs-private inode part udf: Track metadata bhs in fs-private inode part fat: Track metadata bhs in fs-private inode part bfs: Track metadata bhs in fs-private inode part affs: Track metadata bhs in fs-private inode part ext2: Track metadata bhs in fs-private inode part fs: Provide functions for handling mapping_metadata_bhs directly fs: Switch inode_has_buffers() to take mapping_metadata_bhs fs: Make bhs point to mapping_metadata_bhs fs: Move metadata bhs tracking to a separate struct fs: Fold fsync_buffers_list() into sync_mapping_buffers() fs: Drop osync_buffers_list() kvm: Use private inode list instead of i_private_list fs: Remove i_private_data aio: Stop using i_private_data and i_private_lock hugetlbfs: Stop using i_private_data fs: Stop using i_private_data for metadata bh tracking ...
2026-04-05mm: allow handling of stacked mmap_prepare hooks in more driversLorenzo Stoakes (Oracle)1-0/+3
While the conversion of mmap hooks to mmap_prepare is underway, we will encounter situations where mmap hooks need to invoke nested mmap_prepare hooks. The nesting of mmap hooks is termed 'stacking'. In order to flexibly facilitate the conversion of custom mmap hooks in drivers which stack, we must split up the existing __compat_vma_mmap() function into two separate functions: * compat_set_desc_from_vma() - This allows the setting of a vm_area_desc object's fields to the relevant fields of a VMA. * __compat_vma_mmap() - Once an mmap_prepare hook has been executed upon a vm_area_desc object, this function performs any mmap actions specified by the mmap_prepare hook and then invokes its vm_ops->mapped() hook if any were specified. In ordinary cases, where a file's f_op->mmap_prepare() hook simply needs to be invoked in a stacked mmap() hook, compat_vma_mmap() can be used. However some drivers define their own nested hooks, which are invoked in turn by another hook. A concrete example is vmbus_channel->mmap_ring_buffer(), which is invoked in turn by bin_attribute->mmap(): vmbus_channel->mmap_ring_buffer() has a signature of: int (*mmap_ring_buffer)(struct vmbus_channel *channel, struct vm_area_struct *vma); And bin_attribute->mmap() has a signature of: int (*mmap)(struct file *, struct kobject *, const struct bin_attribute *attr, struct vm_area_struct *vma); And so compat_vma_mmap() cannot be used here for incremental conversion of hooks from mmap() to mmap_prepare(). There are many such instances like this, where conversion to mmap_prepare would otherwise cascade to a huge change set due to nesting of this kind. The changes in this patch mean we could now instead convert vmbus_channel->mmap_ring_buffer() to vmbus_channel->mmap_prepare_ring_buffer(), and implement something like: struct vm_area_desc desc; int err; compat_set_desc_from_vma(&desc, file, vma); err = channel->mmap_prepare_ring_buffer(channel, &desc); if (err) return err; return __compat_vma_mmap(&desc, vma); Allowing us to incrementally update this logic, and other logic like it. Unfortunately, as part of this change, we need to be able to flexibly assign to the VMA descriptor, so have to remove some of the const declarations within the structure. Also update the VMA tests to reflect the changes. Link: https://lkml.kernel.org/r/24aac3019dd34740e788d169fccbe3c62781e648.1774045440.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bodo Stroesser <bostroesser@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Clemens Ladisch <clemens@ladisch.de> Cc: David Hildenbrand <david@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Long Li <longli@microsoft.com> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Richard Weinberger <richard@nod.at> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vignesh Raghavendra <vigneshr@ti.com> Cc: Wei Liu <wei.liu@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: add vm_ops->mapped hookLorenzo Stoakes (Oracle)1-1/+8
Previously, when a driver needed to do something like establish a reference count, it could do so in the mmap hook in the knowledge that the mapping would succeed. With the introduction of f_op->mmap_prepare this is no longer the case, as it is invoked prior to actually establishing the mapping. mmap_prepare is not appropriate for this kind of thing as it is called before any merge might take place, and after which an error might occur meaning resources could be leaked. To take this into account, introduce a new vm_ops->mapped callback which is invoked when the VMA is first mapped (though notably - not when it is merged - which is correct and mirrors existing mmap/open/close behaviour). We do better that vm_ops->open() here, as this callback can return an error, at which point the VMA will be unmapped. Note that vm_ops->mapped() is invoked after any mmap action is complete (such as I/O remapping). We intentionally do not expose the VMA at this point, exposing only the fields that could be used, and an output parameter in case the operation needs to update the vma->vm_private_data field. In order to deal with stacked filesystems which invoke inner filesystem's mmap() invocations, add __compat_vma_mapped() and invoke it on vfs_mmap() (via compat_vma_mmap()) to ensure that the mapped callback is handled when an mmap() caller invokes a nested filesystem's mmap_prepare() callback. Update the mmap_prepare documentation to describe the mapped hook and make it clear what its intended use is. The vm_ops->mapped() call is handled by the mmap complete logic to ensure the same code paths are handled by both the compatibility and VMA layers. Additionally, update VMA userland test headers to reflect the change. Link: https://lkml.kernel.org/r/4c5e98297eb0aae9565c564e1c296a112702f144.1774045440.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bodo Stroesser <bostroesser@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Clemens Ladisch <clemens@ladisch.de> Cc: David Hildenbrand <david@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Long Li <longli@microsoft.com> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Richard Weinberger <richard@nod.at> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vignesh Raghavendra <vigneshr@ti.com> Cc: Wei Liu <wei.liu@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm: various small mmap_prepare cleanupsLorenzo Stoakes (Oracle)1-2/+0
Patch series "mm: expand mmap_prepare functionality and usage", v4. This series expands the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. This series starts with some cleanup of existing mmap_prepare logic, then adds documentation for the mmap_prepare call to make it easier for filesystem and driver writers to understand how it works. It then importantly adds a vm_ops->mapped hook, a key feature that was missing from mmap_prepare previously - this is invoked when a driver which specifies mmap_prepare has successfully been mapped but not merged with another VMA. mmap_prepare is invoked prior to a merge being attempted, so you cannot manipulate state such as reference counts as if it were a new mapping. The vm_ops->mapped hook allows a driver to perform tasks required at this stage, and provides symmetry against subsequent vm_ops->open,close calls. The series uses this to correct the afs implementation which wrongly manipulated reference count at mmap_prepare time. It then adds an mmap_prepare equivalent of vm_iomap_memory() - mmap_action_simple_ioremap(), then uses this to update a number of drivers. It then splits out the mmap_prepare compatibility layer (which allows for invocation of mmap_prepare hooks in an mmap() hook) in such a way as to allow for more incremental implementation of mmap_prepare hooks. It then uses this to extend mmap_prepare usage in drivers. Finally it adds an mmap_prepare equivalent of vm_map_pages(), which lays the foundation for future work which will extend mmap_prepare to DMA coherent mappings. This patch (of 21): Rather than passing arbitrary fields, pass a vm_area_desc pointer to mmap prepare functions to mmap prepare, and an action and vma pointer to mmap complete in order to put all the action-specific logic in the function actually doing the work. Additionally, allow mmap prepare functions to return an error so we can error out as soon as possible if there is something logically incorrect in the input. Update remap_pfn_range_prepare() to properly check the input range for the CoW case. Also remove io_remap_pfn_range_complete(), as we can simply set up the fields correctly in io_remap_pfn_range_prepare() and use remap_pfn_range_complete() for this. While we're here, make remap_pfn_range_prepare_vma() a little neater, and pass mmap_action directly to call_action_complete(). Then, update compat_vma_mmap() to perform its logic directly, as __compat_vma_map() is not used by anything so we don't need to export it. Also update compat_vma_mmap() to use vfs_mmap_prepare() rather than calling the mmap_prepare op directly. Finally, update the VMA userland tests to reflect the changes. Link: https://lkml.kernel.org/r/cover.1774045440.git.ljs@kernel.org Link: https://lkml.kernel.org/r/99f408e4694f44ab12bdc55fe0bd9685d3bd1117.1774045440.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexandre Torgue <alexandre.torgue@foss.st.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bodo Stroesser <bostroesser@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Clemens Ladisch <clemens@ladisch.de> Cc: David Hildenbrand <david@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Long Li <longli@microsoft.com> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Richard Weinberger <richard@nod.at> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vignesh Raghavendra <vigneshr@ti.com> Cc: Wei Liu <wei.liu@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-03lsm: add backing_file LSM hooksPaul Moore1-0/+13
Stacked filesystems such as overlayfs do not currently provide the necessary mechanisms for LSMs to properly enforce access controls on the mmap() and mprotect() operations. In order to resolve this gap, a LSM security blob is being added to the backing_file struct and the following new LSM hooks are being created: security_backing_file_alloc() security_backing_file_free() security_mmap_backing_file() The first two hooks are to manage the lifecycle of the LSM security blob in the backing_file struct, while the third provides a new mmap() access control point for the underlying backing file. It is also expected that LSMs will likely want to update their security_file_mprotect() callback to address issues with their mprotect() controls, but that does not require a change to the security_file_mprotect() LSM hook. There are a three other small changes to support these new LSM hooks: * Pass the user file associated with a backing file down to alloc_empty_backing_file() so it can be included in the security_backing_file_alloc() hook. * Add getter and setter functions for the backing_file struct LSM blob as the backing_file struct remains private to fs/file_table.c. * Constify the file struct field in the LSM common_audit_data struct to better support LSMs that need to pass a const file struct pointer into the common LSM audit code. Thanks to Arnd Bergmann for identifying the missing EXPORT_SYMBOL_GPL() and supplying a fixup. Cc: stable@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-unionfs@vger.kernel.org Cc: linux-erofs@lists.ozlabs.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
2026-03-26fs: Drop i_private_list from address_spaceJan Kara1-2/+0
Nobody is using i_private_list anymore. Remove it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-84-jack@suse.cz Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Drop mapping_metadata_bhs from address spaceJan Kara1-1/+0
Nobody uses mapping_metadata_bhs in struct address_space anymore. Just remove it and with it all helper functions using it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-83-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Make bhs point to mapping_metadata_bhsJan Kara1-0/+1
Make buffer heads point to mapping_metadata_bhs instead of struct address_space. This makes the code more self contained. For the (only) case of IO error handling where we really need to reach struct address_space add a pointer to the mapping from mapping_metadata_bhs. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-73-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Move metadata bhs tracking to a separate structJan Kara1-0/+7
Instead of tracking metadata bhs for a mapping using i_private_list and i_private_lock create a dedicated mapping_metadata_bhs struct for it. So far this struct is embedded in address_space but that will be switched for per-fs private inode parts later in the series. This also changes the locking from bdev mapping's i_private_lock to a new lock embedded in mapping_metadata_bhs to untangle the i_private_lock locking for maintaining lists of metadata bhs and the locking for looking up / reclaiming bdev's buffer heads. The locking in remove_assoc_map() gets more complex due to this but overall this looks like a reasonable tradeoff. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-72-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Remove i_private_dataJan Kara1-2/+0
Nobody is using it anymore. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-68-jack@suse.cz Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26fs: Rename generic_file_fsync() to simple_fsync()Jan Kara1-2/+2
The implementation is now really basic so rename generic_file_fsync() simple_fsync() and __generic_file_fsync() to simple_fsync_noflush(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260326095354.16340-56-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-20fs: remove stale and duplicate forward declarationsYuto Ohnuki1-3/+0
Remove the following unnecessary forward declarations from fs.h, which improves maintainability. - struct hd_geometry: became unused in fs.h when block_device_operations was moved to blkdev.h in commit 08f858512151 ("[PATCH] move block_device_operations to blkdev.h"). The forward declaration is now added to blkdev.h where it is actually used. - struct iovec: became unused when aio_read/aio_write were removed in commit 8436318205b9 ("->aio_read and ->aio_write removed") - struct iov_iter: duplicate forward declaration. This removes the redundant second declaration, added in commit 293bc9822fa9 ("new methods: ->read_iter() and ->write_iter()") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512301303.s7YWTZHA-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202512302139.Wl0soAlz-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202512302105.pmzYfmcV-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202512302125.FNgHwu5z-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202512302108.nIV8r5ES-lkp@intel.com/ Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Link: https://patch.msgid.link/20260226201857.27310-2-ytohnuki@amazon.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-09vfs: remove externs from fs.h on functions modified by i_ino wideningJeff Layton1-31/+31
Christoph says, in response to one of the patches in the i_ino widening series, which changes the prototype of several functions in fs.h: "Can you please drop all these pointless externs while you're at it?" Remove extern keyword from functions touched by that patch (and a few that happened to be nearby). Also add missing argument names to declarations that lacked them. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260307-iino-u64-v2-1-a1a1696e0653@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-06treewide: change inode->i_ino from unsigned long to u64Jeff Layton1-1/+1
On 32-bit architectures, unsigned long is only 32 bits wide, which causes 64-bit inode numbers to be silently truncated. Several filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that exceed 32 bits, and this truncation can lead to inode number collisions and other subtle bugs on 32-bit systems. Change the type of inode->i_ino from unsigned long to u64 to ensure that inode numbers are always represented as 64-bit values regardless of architecture. Update all format specifiers treewide from %lu/%lx to %llu/%llx to match the new type, along with corresponding local variable types. This is the bulk treewide conversion. Earlier patches in this series handled trace events separately to allow trace field reordering for better struct packing on 32-bit. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-06vfs: widen inode hash/lookup functions to u64Jeff Layton1-24/+22
Change the inode hash/lookup VFS API functions to accept u64 parameters instead of unsigned long for inode numbers and hash values. This is preparation for widening i_ino itself to u64, which will allow filesystems to store full 64-bit inode numbers on 32-bit architectures. Since unsigned long implicitly widens to u64 on all architectures, this change is backward-compatible with all existing callers. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-1-2257ad83d372@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-18Merge tag 'sysctl-7.00-rc1' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove macros from proc handler converters Replace the proc converter macros with "regular" functions. Though it is more verbose than the macro version, it helps when debugging and better aligns with coding-style.rst. - General cleanup Remove superfluous ctl_table forward declarations. Const qualify the memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add missing kernel doc to proc_dointvec_conv. - Testing This series was run through sysctl selftests/kunit test suite in x86_64. And went into linux-next after rc4, giving it a good 3 weeks of testing * tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions sysctl: Replace unidirectional INT converter macros with functions sysctl: Add kernel doc to proc_douintvec_conv sysctl: Replace UINT converter macros with functions sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros sysctl: clarify proc_douintvec_minmax doc sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n sysctl: Remove unused ctl_table forward declarations loadpin: Implement custom proc_handler for enforce alloc_tag: move memory_allocation_profiling_sysctls into .rodata sysctl: Add missing kernel-doc for proc_dointvec_conv
2026-02-17Merge tag 'vfs-7.0-rc1.misc.2' of ↵Linus Torvalds1-2/+12
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull more misc vfs updates from Christian Brauner: "Features: - Optimize close_range() from O(range size) to O(active FDs) by using find_next_bit() on the open_fds bitmap instead of linearly scanning the entire requested range. This is a significant improvement for large-range close operations on sparse file descriptor tables. - Add FS_XFLAG_VERITY file attribute for fs-verity files, retrievable via FS_IOC_FSGETXATTR and file_getattr(). The flag is read-only. Add tracepoints for fs-verity enable and verify operations, replacing the previously removed debug printk's. - Prevent nfsd from exporting special kernel filesystems like pidfs and nsfs. These filesystems have custom ->open() and ->permission() export methods that are designed for open_by_handle_at(2) only and are incompatible with nfsd. Update the exportfs documentation accordingly. Fixes: - Fix KMSAN uninit-value in ovl_fill_real() where strcmp() was used on a non-null-terminated decrypted directory entry name from fscrypt. This triggered on encrypted lower layers when the decrypted name buffer contained uninitialized tail data. The fix also adds VFS-level name_is_dot(), name_is_dotdot(), and name_is_dot_dotdot() helpers, replacing various open-coded "." and ".." checks across the tree. - Fix read-only fsflags not being reset together with xflags in vfs_fileattr_set(). Currently harmless since no read-only xflags overlap with flags, but this would cause inconsistencies for any future shared read-only flag - Return -EREMOTE instead of -ESRCH from PIDFD_GET_INFO when the target process is in a different pid namespace. This lets userspace distinguish "process exited" from "process in another namespace", matching glibc's pidfd_getpid() behavior Cleanups: - Use C-string literals in the Rust seq_file bindings, replacing the kernel::c_str!() macro (available since Rust 1.77) - Fix typo in d_walk_ret enum comment, add porting notes for the readlink_copy() calling convention change" * tag 'vfs-7.0-rc1.misc.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add porting notes about readlink_copy() pidfs: return -EREMOTE when PIDFD_GET_INFO is called on another ns nfsd: do not allow exporting of special kernel filesystems exportfs: clarify the documentation of open()/permission() expotrfs ops fsverity: add tracepoints fs: add FS_XFLAG_VERITY for fs-verity files rust: seq_file: replace `kernel::c_str!` with C-Strings fs: dcache: fix typo in enum d_walk_ret comment ovl: use name_is_dot* helpers in readdir code fs: add helpers name_is_dot{,dot,_dotdot} ovl: Fix uninit-value in ovl_fill_real fs: reset read-only fsflags together with xflags fs/file: optimize close_range() complexity from O(N) to O(Sparse)
2026-02-10Merge tag 'pull-filename' of ↵Linus Torvalds1-13/+28
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs 'struct filename' updates from Al Viro: "[Mostly] sanitize struct filename handling" * tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits) sysfs(2): fs_index() argument is _not_ a pathname alpha: switch osf_mount() to strndup_user() ksmbd: use CLASS(filename_kernel) mqueue: switch to CLASS(filename) user_statfs(): switch to CLASS(filename) statx: switch to CLASS(filename_maybe_null) quotactl_block(): switch to CLASS(filename) chroot(2): switch to CLASS(filename) move_mount(2): switch to CLASS(filename_maybe_null) namei.c: switch user pathname imports to CLASS(filename{,_flags}) namei.c: convert getname_kernel() callers to CLASS(filename_kernel) do_f{chmod,chown,access}at(): use CLASS(filename_uflags) do_readlinkat(): switch to CLASS(filename_flags) do_sys_truncate(): switch to CLASS(filename) do_utimes_path(): switch to CLASS(filename_uflags) chdir(2): unspaghettify a bit... do_fchownat(): unspaghettify a bit... fspick(2): use CLASS(filename_flags) name_to_handle_at(): use CLASS(filename_uflags) vfs_open_tree(): use CLASS(filename_uflags) ...
2026-02-10Merge tag 'vfs-7.0-rc1.misc' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains a mix of VFS cleanups, performance improvements, API fixes, documentation, and a deprecation notice. Scalability and performance: - Rework pid allocation to only take pidmap_lock once instead of twice during alloc_pid(), improving thread creation/teardown throughput by 10-16% depending on false-sharing luck. Pad the namespace refcount to reduce false-sharing - Track file lock presence via a flag in ->i_opflags instead of reading ->i_flctx, avoiding false-sharing with ->i_readcount on open/close hot paths. Measured 4-16% improvement on 24-core open-in-a-loop benchmarks - Use a consume fence in locks_inode_context() to match the store-release/load-consume idiom, eliminating a hardware fence on some architectures - Annotate cdev_lock with __cacheline_aligned_in_smp to prevent false-sharing - Remove a redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu() that never fires since the caller already verifies it, eliminating a 100% mispredicted branch - Fix a 100% mispredicted likely() in devcgroup_inode_permission() that became wrong after a prior code reorder Bug fixes and correctness: - Make insert_inode_locked() wait for inode destruction instead of skipping, fixing a corner case where two matching inodes could exist in the hash - Move f_mode initialization before file_ref_init() in alloc_file() to respect the SLAB_TYPESAFE_BY_RCU ordering contract - Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with no buffers attached, preventing a null pointer dereference when AS_RELEASE_ALWAYS is set but no release_folio op exists - Fix select restart_block to store end_time as timespec64, avoiding truncation of tv_sec on 32-bit architectures - Make dump_inode() use get_kernel_nofault() to safely access inode and superblock fields, matching the dump_mapping() pattern API modernization: - Make posix_acl_to_xattr() allocate the buffer internally since every single caller was doing it anyway. Reduces boilerplate and unnecessary error checking across ~15 filesystems - Replace deprecated simple_strtoul() with kstrtoul() for the ihash_entries, dhash_entries, mhash_entries, and mphash_entries boot parameters, adding proper error handling - Convert chardev code to use guard(mutex) and __free(kfree) cleanup patterns - Replace min_t() with min() or umin() in VFS code to avoid silently truncating unsigned long to unsigned int - Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers already check the flag Deprecation: - Begin deprecating legacy BSD process accounting (acct(2)). The interface has numerous footguns and better alternatives exist (eBPF) Documentation: - Fix and complete kernel-doc for struct export_operations, removing duplicated documentation between ReST and source - Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait() Testing: - Add a kunit test for initramfs cpio handling of entries with filesize > PATH_MAX Misc: - Add missing <linux/init_task.h> include in fs_struct.c" * tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) posix_acl: make posix_acl_to_xattr() alloc the buffer fs: make insert_inode_locked() wait for inode destruction initramfs_test: kunit test for cpio.filesize > PATH_MAX fs: improve dump_inode() to safely access inode fields fs: add <linux/init_task.h> for 'init_fs' docs: exportfs: Use source code struct documentation fs: move initializing f_mode before file_ref_init() exportfs: Complete kernel-doc for struct export_operations exportfs: Mark struct export_operations functions at kernel-doc exportfs: Fix kernel-doc output for get_name() acct(2): begin the deprecation of legacy BSD process accounting device_cgroup: remove branch hint after code refactor VFS: fix __start_dirop() kernel-doc warnings fs: Describe @isnew parameter in ilookup5_nowait() fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS select: store end_time as timespec64 in restart block chardev: Switch to guard(mutex) and __free(kfree) namespace: Replace simple_strtoul with kstrtoul to parse boot params dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries ...
2026-02-10Merge tag 'vfs-7.0-rc1.namespace' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - statmount: accept fd as a parameter Extend struct mnt_id_req with a file descriptor field and a new STATMOUNT_BY_FD flag. When set, statmount() returns mount information for the mount the fd resides on — including detached mounts (unmounted via umount2(MNT_DETACH)). For detached mounts the STATMOUNT_MNT_POINT and STATMOUNT_MNT_NS_ID mask bits are cleared since neither is meaningful. The capability check is skipped for STATMOUNT_BY_FD since holding an fd already implies prior access to the mount and equivalent information is available through fstatfs() and /proc/pid/mountinfo without privilege. Includes comprehensive selftests covering both attached and detached mount cases. - fs: Remove internal old mount API code (1 patch) Now that every in-tree filesystem has been converted to the new mount API, remove all the legacy shim code in fs_context.c that handled unconverted filesystems. This deletes ~280 lines including legacy_init_fs_context(), the legacy_fs_context struct, and associated wrappers. The mount(2) syscall path for userspace remains untouched. Documentation references to the legacy callbacks are cleaned up. - mount: add OPEN_TREE_NAMESPACE to open_tree() Container runtimes currently use CLONE_NEWNS to copy the caller's entire mount namespace — only to then pivot_root() and recursively unmount everything they just copied. With large mount tables and thousands of parallel container launches this creates significant contention on the namespace semaphore. OPEN_TREE_NAMESPACE copies only the specified mount tree (like OPEN_TREE_CLONE) but returns a mount namespace fd instead of a detached mount fd. The new namespace contains the copied tree mounted on top of a clone of the real rootfs. This functions as a combined unshare(CLONE_NEWNS) + pivot_root() in a single syscall. Works with user namespaces: an unshare(CLONE_NEWUSER) followed by OPEN_TREE_NAMESPACE creates a mount namespace owned by the new user namespace. Mount namespace file mounts are excluded from the copy to prevent cycles. Includes ~1000 lines of selftests" * tag 'vfs-7.0-rc1.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/open_tree: add OPEN_TREE_NAMESPACE tests mount: add OPEN_TREE_NAMESPACE fs: Remove internal old mount API code selftests: statmount: tests for STATMOUNT_BY_FD statmount: accept fd as a parameter statmount: permission check should return EPERM
2026-02-10Merge tag 'vfs-7.0-rc1.atomic_open' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs atomic_open updates from Christian Brauner: "Allow knfsd to use atomic_open() While knfsd offers combined exclusive create and open results to clients, on some filesystems those results are not atomic. The separate vfs_create() + vfs_open() sequence in dentry_create() can produce races and unexpected errors. For example, open O_CREAT with mode 0 will succeed in creating the file but return -EACCES from vfs_open(). Additionally, network filesystems benefit from reducing remote round-trip operations by using a single atomic_open() call. Teach dentry_create() -- whose sole caller is knfsd -- to use atomic_open() for filesystems that support it" * tag 'vfs-7.0-rc1.atomic_open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs/namei: fix kernel-doc markup for dentry_create VFS/knfsd: Teach dentry_create() to use atomic_open() VFS: Prepare atomic_open() for dentry_create() VFS: move dentry_create() from fs/open.c to fs/namei.c
2026-02-10Merge tag 'vfs-7.0-rc1.btrfs' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs updates for btrfs from Christian Brauner: "This contains some changes for btrfs that are taken to the vfs tree to stop duplicating VFS code for subvolume/snapshot dentry Btrfs has carried private copies of the VFS may_delete() and may_create() functions in fs/btrfs/ioctl.c for permission checks during subvolume creation and snapshot destruction. These copies have drifted out of sync with the VFS originals — btrfs_may_delete() is missing the uid/gid validity check and btrfs_may_create() is missing the audit_inode_child() call. Export the VFS functions as may_{create,delete}_dentry() and switch btrfs to use them, removing ~70 lines of duplicated code" * tag 'vfs-7.0-rc1.btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: btrfs: use may_create_dentry() in btrfs_mksubvol() btrfs: use may_delete_dentry() in btrfs_ioctl_snap_destroy() fs: export may_create() as may_create_dentry() fs: export may_delete() as may_delete_dentry()
2026-02-09Merge tag 'vfs-7.0-rc1.leases' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs lease updates from Christian Brauner: "This contains updates for lease support to require filesystems to explicitly opt-in to lease support Currently kernel_setlease() falls through to generic_setlease() when a a filesystem does not define ->setlease(), silently granting lease support to every filesystem regardless of whether it is prepared for it. This is a poor default: most filesystems never intended to support leases, and the silent fallthrough makes it impossible to distinguish "supports leases" from "never thought about it". This inverts the default. It adds explicit .setlease = generic_setlease; assignments to every in-tree filesystem that should retain lease support, then changes kernel_setlease() to return -EINVAL when ->setlease is NULL. With the new default in place, simple_nosetlease() is redundant and is removed along with all references to it" * tag 'vfs-7.0-rc1.leases' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits) fuse: add setlease file operation fs: remove simple_nosetlease() filelock: default to returning -EINVAL when ->setlease operation is NULL xfs: add setlease file operation ufs: add setlease file operation udf: add setlease file operation tmpfs: add setlease file operation squashfs: add setlease file operation overlayfs: add setlease file operation orangefs: add setlease file operation ocfs2: add setlease file operation ntfs3: add setlease file operation nilfs2: add setlease file operation jfs: add setlease file operation jffs2: add setlease file operation gfs2: add a setlease file operation fat: add setlease file operation f2fs: add setlease file operation exfat: add setlease file operation ext4: add setlease file operation ...
2026-02-09Merge tag 'vfs-7.0-rc1.nonblocking_timestamps' of ↵Linus Torvalds1-11/+19
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This contains the changes to support non-blocking timestamp updates. Since commit 66fa3cedf16a ("fs: Add async write file modification handling") file_update_time_flags() unconditionally returns -EAGAIN when any timestamp needs updating and IOCB_NOWAIT is set. This makes non-blocking direct writes impossible on file systems with granular enough timestamps, which in practice means all of them. This reworks the timestamp update path to propagate IOCB_NOWAIT through ->update_time so that file systems which can update timestamps without blocking are no longer penalized. With that groundwork in place, the core change passes IOCB_NOWAIT into ->update_time and returns -EAGAIN only when the file system indicates it would block. XFS implements non-blocking timestamp updates by using the new ->sync_lazytime and open-coding generic_update_time without the S_NOWAIT check, since the lazytime path through the generic helpers can never block in XFS" * tag 'vfs-7.0-rc1.nonblocking_timestamps' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: enable non-blocking timestamp updates xfs: implement ->sync_lazytime fs: refactor file_update_time_flags fs: add support for non-blocking timestamp updates fs: add a ->sync_lazytime method fs: factor out a sync_lazytime helper fs: refactor ->update_time handling fat: cleanup the flags for fat_truncate_time nfs: split nfs_update_timestamps fs: allow error returns from generic_update_time fs: remove inode_update_time
2026-01-29fs: add helpers name_is_dot{,dot,_dotdot}Amir Goldstein1-2/+12
Rename the helper is_dot_dotdot() into the name_ namespace and add complementary helpers to check for dot and dotdot names individually. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260128132406.23768-3-amir73il@gmail.com Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-14fs: export may_create() as may_create_dentry()Filipe Manana1-0/+2
For many years btrfs as been using a copy of may_create() in fs/btrfs/ioctl.c:btrfs_may_create(). Everytime may_create() is updated we need to update the btrfs copy, and this is a maintenance burden. Currently there are minor differences between both because the btrfs side lacks updates done in may_create(). Export may_create() so that btrfs can use it and with the less generic name may_create_dentry(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Link: https://patch.msgid.link/ce5174bca079f4cdcbb8dd145f0924feb1f227cd.1768307858.git.fdmanana@suse.com Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-14fs: export may_delete() as may_delete_dentry()Filipe Manana1-0/+3
For many years btrfs as been using a copy of may_delete() in fs/btrfs/ioctl.c:btrfs_may_delete(). Everytime may_delete() is updated we need to update the btrfs copy, and this is a maintenance burden. Currently there are minor differences between both because the btrfs side lacks updates done in may_delete(). Export may_delete() so that btrfs can use it and with the less generic name may_delete_dentry(). While at it change the calls in vfs_rmdir() to pass a boolean literal instead of 1 and 0 as the last argument since the argument has a bool type. Signed-off-by: Filipe Manana <fdmanana@suse.com> Link: https://patch.msgid.link/e09128fd53f01b19d0a58f0e7d24739f79f47f6d.1768307858.git.fdmanana@suse.com Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-13struct filename ->refcnt doesn't need to be atomicAl Viro1-7/+1
... or visible outside of audit, really. Note that references held in delayed_filename always have refcount 1, and from the moment of complete_getname() or equivalent point in getname...() there won't be any references to struct filename instance left in places visible to other threads. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13allow incomplete imports of filenamesAl Viro1-0/+12
There are two filename-related problems in io_uring and its interplay with audit. Filenames are imported when request is submitted and used when it is processed. Unfortunately, the latter may very well happen in a different thread. In that case the reference to filename is put into the wrong audit_context - that of submitting thread, not the processing one. Audit logics is called by the latter, and it really wants to be able to find the names in audit_context current (== processing) thread. Another related problem is the headache with refcounts - normally all references to given struct filename are visible only to one thread (the one that uses that struct filename). io_uring violates that - an extra reference is stashed in audit_context of submitter. It gets dropped when submitter returns to userland, which can happen simultaneously with processing thread deciding to drop the reference it got. We paper over that by making refcount atomic, but that means pointless headache for everyone. Solution: the notion of partially imported filenames. Namely, already copied from userland, but *not* exposed to audit yet. io_uring can create that in submitter thread, and complete the import (obtaining the usual reference to struct filename) in processing thread. Object: struct delayed_filename. Primitives for working with it: delayed_getname(&delayed_filename, user_string) - copies the name from userland, returning 0 and stashing the address of (still incomplete) struct filename in delayed_filename on success and returning -E... on error. delayed_getname_uflags(&delayed_filename, user_string, atflags) - similar, in the same relation to delayed_getname() as getname_uflags() is to getname() complete_getname(&delayed_filename) - completes the import of filename stashed in delayed_filename and returns struct filename to caller, emptying delayed_filename. CLASS(filename_complete_delayed, name)(&delayed_filename) - variant of CLASS(filename) with complete_getname() for constructor. dismiss_delayed_filename(&delayed_filename) - destructor; drops whatever might be stashed in delayed_filename, emptying it. putname_to_delayed(&delayed_filename, name) - if name is shared, stashes its copy into delayed_filename and drops the reference to name, otherwise stashes the name itself in there. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13struct filename: saner handling of long namesAl Viro1-2/+8
Always allocate struct filename from names_cachep, long name or short; short names would be embedded into struct filename. Longer ones do not cannibalize the original struct filename - put them into kmalloc'ed buffers (PATH_MAX-sized for import from userland, strlen() + 1 - for ones originating kernel-side, where we know the length beforehand). Cutoff length for short names is chosen so that struct filename would be 192 bytes long - that's both a multiple of 64 and large enough to cover the majority of real-world uses. Simplifies logics in getname()/putname() and friends. [fixed an embarrassing braino in EMBEDDED_NAME_MAX, first reported by Dan Carpenter] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13struct filename: use names_cachep only for getname() and friendsAl Viro1-4/+2
Instances of struct filename come from names_cachep (via __getname()). That is done by getname_flags() and getname_kernel() and these two are the main callers of __getname(). However, there are other callers that simply want to allocate PATH_MAX bytes for uses that have nothing to do with struct filename. We want saner allocation rules for long pathnames, so that struct filename would *always* come from names_cachep, with the out-of-line pathname getting kmalloc'ed. For that we need to be able to change the size of objects allocated by getname_flags()/getname_kernel(). That requires the rest of __getname() users to stop using names_cachep; we could explicitly switch all of those to kmalloc(), but that would cause quite a bit of noise. So the plan is to switch getname_...() to new helpers and turn __getname() into a wrapper for kmalloc(). Remaining __getname() users could be converted to explicit kmalloc() at leisure, hopefully along with figuring out what size do they really want - PATH_MAX is an overkill for some of them, used out of laziness ("we have a convenient helper that does 4K allocations and that's large enough, let's use it"). As a side benefit, names_cachep is no longer used outside of fs/namei.c, so we can move it there and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13get rid of audit_reusename()Al Viro1-1/+0
Originally we tried to avoid multiple insertions into audit names array during retry loop by a cute hack - memorize the userland pointer and if there already is a match, just grab an extra reference to it. Cute as it had been, it had problems - two identical pointers had audit aux entries merged, two identical strings did not. Having different behaviour for syscalls that differ only by addresses of otherwise identical string arguments is obviously wrong - if nothing else, compiler can decide to merge identical string literals. Besides, this hack does nothing for non-audited processes - they get a fresh copy for retry. It's not time-critical, but having behaviour subtly differ that way is bogus. These days we have very few places that import filename more than once (9 functions total) and it's easy to massage them so we get rid of all re-imports. With that done, we don't need audit_reusename() anymore. There's no need to memorize userland pointer either. Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-13allow to use CLASS() for struct filename *Al Viro1-0/+6
Not all users match that model, but most of them do. By the end of the series we'll be left with very few irregular ones... Added: CLASS(filename, name)(user_path) => getname(user_path) CLASS(filename_kernel, name)(string) => getname_kernel(string) CLASS(filename_flags, name)(user_path, flags) => getname_flags(user_path, flags) CLASS(filename_uflags, name)(user_path, flags) => getname_uflags(user_path, flags) CLASS(filename_maybe_null, name)(user_path, flags) => getname_maybe_null(user_path, flags) all with putname() as destructor. "flags" in filename_flags is in LOOKUP_... space, only LOOKUP_EMPTY matters. "flags" in filename_uflags and filename_maybe_null is in AT_...... space, and only AT_EMPTY_PATH matters. filename_flags conventions might be worth reconsidering later (it might or might not be better off with boolean instead) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-01-12fs: add a ->sync_lazytime methodChristoph Hellwig1-0/+1
Allow the file system to explicitly implement lazytime syncing instead of pigging back on generic inode dirtying. This allows to simplify the XFS implementation and prepares for non-blocking lazytime timestamp updates. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-8-hch@lst.de Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: refactor ->update_time handlingChristoph Hellwig1-10/+18
Pass the type of update (atime vs c/mtime plus version) as an enum instead of a set of flags that caused all kinds of confusion. Because inode_update_timestamps now can't return a modified version of those flags, return the I_DIRTY_* flags needed to persist the update, which is what the main caller in generic_update_time wants anyway, and which is suitable for the other callers that only want to know if an update happened. The whole update_time path keeps the flags argument, which will be used to support non-blocking updates soon even if it is unused, and (the slightly renamed) inode_update_time also gains the possibility to return a negative errno to support this. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-6-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: allow error returns from generic_update_timeChristoph Hellwig1-1/+1
Now that no caller looks at the updated flags, switch generic_update_time to the same calling convention as the ->update_time method and return 0 or a negative errno. This prepares for adding non-blocking timestamp updates that could return -EAGAIN. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: remove inode_update_timeChristoph Hellwig1-1/+0
The only external user is gone now, open code it in the two VFS callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260108141934.2052404-2-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12readdir: require opt-in for d_type flagsAmir Goldstein1-1/+5
Commit c31f91c6af96 ("fuse: don't allow signals to interrupt getdents copying") introduced the use of high bits in d_type as flags. However, overlayfs was not adapted to handle this change. In ovl_cache_entry_new(), the code checks if d_type == DT_CHR to determine if an entry might be a whiteout. When fuse is used as the lower layer and sets high bits in d_type, this comparison fails, causing whiteout files to not be recognized properly and resulting in incorrect overlayfs behavior. Fix this by requiring callers of iterate_dir() to opt-in for getting flag bits in d_type outside of S_DT_MASK. Fixes: c31f91c6af96 ("fuse: don't allow signals to interrupt getdents copying") Link: https://lore.kernel.org/all/20260107034551.439-1-luochunsheng@ustc.edu/ Link: https://github.com/containerd/stargz-snapshotter/issues/2214 Reported-by: Chunsheng Luo <luochunsheng@ustc.edu> Reviewed-by: Chunsheng Luo <luochunsheng@ustc.edu> Tested-by: Chunsheng Luo <luochunsheng@ustc.edu> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20260108074522.3400998-1-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-12fs: remove simple_nosetlease()Jeff Layton1-1/+0
Setting ->setlease() to a NULL pointer now has the same effect as setting it to simple_nosetlease(). Remove all of the setlease file_operations that are set to simple_nosetlease, and the function itself. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260108-setlease-6-20-v1-24-ea4dec9b67fa@kernel.org Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-05sysctl: Remove unused ctl_table forward declarationsJoel Granados1-1/+0
Remove superfluous forward declarations of ctl_table from header files where they are no longer needed. These declarations were left behind after sysctl code refactoring and cleanup. Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-12-16shmem: fix recovery on rename failuresAl Viro1-1/+1
maple_tree insertions can fail if we are seriously short on memory; simple_offset_rename() does not recover well if it runs into that. The same goes for simple_offset_rename_exchange(). Moreover, shmem_whiteout() expects that if it succeeds, the caller will progress to d_move(), i.e. that shmem_rename2() won't fail past the successful call of shmem_whiteout(). Not hard to fix, fortunately - mtree_store() can't fail if the index we are trying to store into is already present in the tree as a singleton. For simple_offset_rename_exchange() that's enough - we just need to be careful about the order of operations. For simple_offset_rename() solution is to preinsert the target into the tree for new_dir; the rest can be done without any potentially failing operations. That preinsertion has to be done in shmem_rename2() rather than in simple_offset_rename() itself - otherwise we'd need to deal with the possibility of failure after successful shmem_whiteout(). Fixes: a2e459555c5f ("shmem: stable directory offsets") Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-12-15fs: Remove internal old mount API codeEric Sandeen1-2/+0
Now that the last in-tree filesystem has been converted to the new mount API, remove all legacy mount API code designed to handle un-converted filesystems, and remove associated documentation as well. (The code to handle the legacy mount(2) syscall from userspace is still in place, of course.) Tested with an allmodconfig build on x86_64, and a sanity check of an old mount(2) syscall mount. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://patch.msgid.link/20251212174403.2882183-1-sandeen@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-15fs: track the inode having file locks with a flag in ->i_opflagsMateusz Guzik1-0/+1
Opening and closing an inode dirties the ->i_readcount field. Depending on the alignment of the inode, it may happen to false-share with other fields loaded both for both operations to various extent. This notably concerns the ->i_flctx field. Since most inodes don't have the field populated, this bit can be managed with a flag in ->i_opflags instead which bypasses the problem. Here are results I obtained while opening a file read-only in a loop with 24 cores doing the work on Sapphire Rapids. Utilizing the flag as opposed to reading ->i_flctx field was toggled at runtime as the benchmark was running, to make sure both results come from the same alignment. before: 3233740 after: 3373346 (+4%) before: 3284313 after: 3518711 (+7%) before: 3505545 after: 4092806 (+16%) Or to put it differently, this varies wildly depending on how (un)lucky you get. The primary bottleneck before and after is the avoidable lockref trip in do_dentry_open(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251203094837.290654-2-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-15VFS/knfsd: Teach dentry_create() to use atomic_open()Benjamin Coddington1-1/+1
While knfsd offers combined exclusive create and open results to clients, on some filesystems those results may not be atomic. This behavior can be observed. For example, an open O_CREAT with mode 0 will succeed in creating the file but unexpectedly return -EACCES from vfs_open(). Additionally reducing the number of remote RPC calls required for O_CREAT on network filesystem provides a performance benefit in the open path. Teach knfsd's helper dentry_create() to use atomic_open() for filesystems that support it. The previously const @path is passed up to atomic_open() and may be modified depending on whether an existing entry was found or if the atomic_open() returned an error and consumed the passed-in dentry. Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com> Link: https://patch.msgid.link/8e449bfb64ab055abb9fd82641a171531415a88c.1764259052.git.bcodding@hammerspace.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-06Merge tag 'pull-persistency' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull persistent dentry infrastructure and conversion from Al Viro: "Some filesystems use a kinda-sorta controlled dentry refcount leak to pin dentries of created objects in dcache (and undo it when removing those). A reference is grabbed and not released, but it's not actually _stored_ anywhere. That works, but it's hard to follow and verify; among other things, we have no way to tell _which_ of the increments is intended to be an unpaired one. Worse, on removal we need to decide whether the reference had already been dropped, which can be non-trivial if that removal is on umount and we need to figure out if this dentry is pinned due to e.g. unlink() not done. Usually that is handled by using kill_litter_super() as ->kill_sb(), but there are open-coded special cases of the same (consider e.g. /proc/self). Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set claims responsibility for +1 in refcount. The end result this series is aiming for: - get these unbalanced dget() and dput() replaced with new primitives that would, in addition to adjusting refcount, set and clear persistency flag. - instead of having kill_litter_super() mess with removing the remaining "leaked" references (e.g. for all tmpfs files that hadn't been removed prior to umount), have the regular shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries, dropping the corresponding reference if it had been set. After that kill_litter_super() becomes an equivalent of kill_anon_super(). Doing that in a single step is not feasible - it would affect too many places in too many filesystems. It has to be split into a series. This work has really started early in 2024; quite a few preliminary pieces have already gone into mainline. This chunk is finally getting to the meat of that stuff - infrastructure and most of the conversions to it. Some pieces are still sitting in the local branches, but the bulk of that stuff is here" * tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) d_make_discardable(): warn if given a non-persistent dentry kill securityfs_recursive_remove() convert securityfs get rid of kill_litter_super() convert rust_binderfs convert nfsctl convert rpc_pipefs convert hypfs hypfs: swich hypfs_create_u64() to returning int hypfs: switch hypfs_create_str() to returning int hypfs: don't pin dentries twice convert gadgetfs gadgetfs: switch to simple_remove_by_name() convert functionfs functionfs: switch to simple_remove_by_name() functionfs: fix the open/removal races functionfs: need to cancel ->reset_work in ->kill_sb() functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}() functionfs: don't abuse ffs_data_closed() on fs shutdown convert selinuxfs ...