summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2025-07-10mm: restore documentation for __free_pages()Matthew Wilcox (Oracle)1-19/+19
The documentation was converted to be for ___free_pages(), which doesn't need documentation as it's static. Link: https://lkml.kernel.org/r/20250604190327.814086-1-willy@infradead.org Fixes: 8c57b687e833 (mm, bpf: Introduce free_pages_nolock()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10Revert "sched/numa: add statistics of numa balance task"Chen Yu2-4/+0
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382. This commit introduces per-memcg/task NUMA balance statistics, but unfortunately it introduced a NULL pointer exception due to the following race condition: After a swap task candidate was chosen, its mm_struct pointer was set to NULL due to task exit. Later, when performing the actual task swapping, the p->mm caused the problem. CPU0 CPU1 : ... task_numa_migrate task_numa_find_cpu task_numa_compare # a normal task p is chosen env->best_task = p # p exit: exit_signals(p); p->flags |= PF_EXITING exit_mm p->mm = NULL; migrate_swap_stop __migrate_swap_task((arg->src_task, arg->dst_cpu) count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL task_lock() should be held and the PF_EXITING flag needs to be checked to prevent this from happening. After discussion, the conclusion was that adding a lock is not worthwhile for some statistics calculations. Revert the change and rely on the tracepoint for this purpose. Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/ Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com> Reported-by: Suneeth D <Suneeth.D@amd.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Libo Chen <libo.chen@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10mm/damon: fix divide by zero in damon_get_intervals_score()Honggyu Kim1-0/+1
The current implementation allows having zero size regions with no special reasons, but damon_get_intervals_score() gets crashed by divide by zero when the region size is zero. [ 29.403950] Oops: divide error: 0000 [#1] SMP NOPTI This patch fixes the bug, but does not disallow zero size regions to keep the backward compatibility since disallowing zero size regions might be a breaking change for some users. In addition, the same crash can happen when intervals_goal.access_bp is zero so this should be fixed in stable trees as well. Link: https://lkml.kernel.org/r/20250702000205.1921-5-honggyu.kim@sk.com Fixes: f04b0fedbe71 ("mm/damon/core: implement intervals auto-tuning") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10kasan: remove kasan_find_vm_area() to prevent possible deadlockYeoreum Yun1-43/+2
find_vm_area() couldn't be called in atomic_context. If find_vm_area() is called to reports vm area information, kasan can trigger deadlock like: CPU0 CPU1 vmalloc(); alloc_vmap_area(); spin_lock(&vn->busy.lock) spin_lock_bh(&some_lock); <interrupt occurs> <in softirq> spin_lock(&some_lock); <access invalid address> kasan_report(); print_report(); print_address_description(); kasan_find_vm_area(); find_vm_area(); spin_lock(&vn->busy.lock) // deadlock! To prevent possible deadlock while kasan reports, remove kasan_find_vm_area(). Link: https://lkml.kernel.org/r/20250703181018.580833-1-yeoreum.yun@arm.com Fixes: c056a364e954 ("kasan: print virtual mapping info in reports") Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reported-by: Yunseong Kim <ysk@kzalloc.com> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10mm/migrate: fix do_pages_stat in compat modeChristoph Berg1-6/+8
For arrays with more than 16 entries, the old code would incorrectly advance the pages pointer by 16 words instead of 16 compat_uptr_t. Fix by doing the pointer arithmetic inside get_compat_pages_array where pages32 is already a correctly-typed pointer. Discovered while working on PostgreSQL 18's new NUMA introspection code. Link: https://lkml.kernel.org/r/aGREU0XTB48w9CwN@msg.df7cb.de Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages") Signed-off-by: Christoph Berg <myon@debian.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reported-by: Tomas Vondra <tomas@vondra.me> Closes: https://www.postgresql.org/message-id/flat/6342f601-77de-4ee0-8c2a-3deb50ceac5b%40vondra.me#86402e3d80c031788f5f55b42c459471 Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10mm/damon/core: handle damon_call_control as normal under kdmond deactivationSeongJae Park1-4/+3
DAMON sysfs interface internally uses damon_call() to update DAMON parameters as users requested, online. However, DAMON core cancels any damon_call() requests when it is deactivated by DAMOS watermarks. As a result, users cannot change DAMON parameters online while DAMON is deactivated. Note that users can turn DAMON off and on with different watermarks to work around. Since deactivated DAMON is nearly same to stopped DAMON, the work around should have no big problem. Anyway, a bug is a bug. There is no real good reason to cancel the damon_call() request under DAMOS deactivation. Fix it by simply handling the request as normal, rather than cancelling under the situation. Link: https://lkml.kernel.org/r/20250629204914.54114-1-sj@kernel.org Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10mm/rmap: fix potential out-of-bounds page table access during batched unmapLance Yang1-18/+28
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table. While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch(). The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios. Link: https://lkml.kernel.org/r/20250701143100.6970-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250630011305.23754-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250627062319.84936-1-lance.yang@linux.dev Link: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com [1] Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation") Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com Suggested-by: Barry Song <baohua@kernel.org> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10mm/hugetlb: don't crash when allocating a folio if there are no resvVivek Kasireddy1-3/+6
There are cases when we try to pin a folio but discover that it has not been faulted-in. So, we try to allocate it in memfd_alloc_folio() but there is a chance that we might encounter a fatal crash/failure (VM_BUG_ON(!h->resv_huge_pages) in alloc_hugetlb_folio_reserve()) if there are no active reservations at that instant. This issue was reported by syzbot: kernel BUG at mm/hugetlb.c:2403! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Therefore, prevent the above crash by removing the VM_BUG_ON() as there is no need to crash the system in this situation and instead we could just fail the allocation request. Furthermore, as described above, the specific situation where this happens is when we try to pin memfd folios before they are faulted-in. Although, this is a valid thing to do, it is not the regular or the common use-case. Let us consider the following scenarios: 1) hugetlbfs_file_mmap() memfd_alloc_folio() hugetlb_fault() 2) memfd_alloc_folio() hugetlbfs_file_mmap() hugetlb_fault() 3) hugetlbfs_file_mmap() hugetlb_fault() alloc_hugetlb_folio() 3) is the most common use-case where first a memfd is allocated followed by mmap(), user writes/updates and then the relevant folios are pinned (memfd_pin_folios()). The BUG this patch is fixing occurs in 2) because we try to pin the folios before hugetlbfs_file_mmap() is called. So, in this situation we try to allocate the folios before pinning them but since we did not make any reservations, resv_huge_pages would be 0, leading to this issue. Link: https://lkml.kernel.org/r/20250626191116.1377761-1-vivek.kasireddy@intel.com Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak") Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Closes: https://syzkaller.appspot.com/bug?extid=a504cb5bae4fe117ba94 Closes: https://lore.kernel.org/all/677928b5.050a0220.3b53b0.004d.GAE@google.com/T/ Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-10mm/vmalloc: leave lazy MMU mode on PTE mapping errorAlexander Gordeev1-7/+15
vmap_pages_pte_range() enters the lazy MMU mode, but fails to leave it in case an error is encountered. Link: https://lkml.kernel.org/r/20250623075721.2817094-1-agordeev@linux.ibm.com Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202506132017.T1l1l6ME-lkp@intel.com/ Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09vmscan: don't bother with debugfs_real_fops()Al Viro1-3/+5
... not when it's used only to check which file is used; debugfs_create_file_aux_num() allows to stash a number into debugfs entry and debugfs_get_aux_num() extracts it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Braino-spotted-by: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/r/20250702211739.GE3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-07secretmem: use SB_I_NOEXECChristian Brauner1-4/+9
Anonymous inodes may never ever be exectuable and the only way to enforce this is to raise SB_I_NOEXEC on the superblock which can never be unset. I've made the exec code yell at anyone who does not abide by this rule. For good measure also kill any pretense that device nodes are supported on the secretmem filesystem. > WARNING: fs/exec.c:119 at path_noexec+0x1af/0x200 fs/exec.c:118, CPU#1: syz-executor260/5835 > Modules linked in: > CPU: 1 UID: 0 PID: 5835 Comm: syz-executor260 Not tainted 6.16.0-rc4-next-20250703-syzkaller #0 PREEMPT(full) > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 > RIP: 0010:path_noexec+0x1af/0x200 fs/exec.c:118 > Code: 02 31 ff 48 89 de e8 f0 b1 89 ff d1 eb eb 07 e8 07 ad 89 ff b3 01 89 d8 5b 41 5e 41 5f 5d c3 cc cc cc cc cc e8 f2 ac 89 ff 90 <0f> 0b 90 e9 48 ff ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c a6 > RSP: 0018:ffffc90003eefbd8 EFLAGS: 00010293 > RAX: ffffffff8235f22e RBX: ffff888072be0940 RCX: ffff88807763bc00 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 > RBP: 0000000000080000 R08: ffff88807763bc00 R09: 0000000000000003 > R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000011 > R13: 1ffff920007ddf90 R14: 0000000000000000 R15: dffffc0000000000 > FS: 000055556832d380(0000) GS:ffff888125d1e000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f21e34810d0 CR3: 00000000718a8000 CR4: 00000000003526f0 > Call Trace: > <TASK> > do_mmap+0xa43/0x10d0 mm/mmap.c:472 > vm_mmap_pgoff+0x31b/0x4c0 mm/util.c:579 > ksys_mmap_pgoff+0x51f/0x760 mm/mmap.c:607 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f21e340a9f9 > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:00007ffd23ca3468 EFLAGS: 00000246 ORIG_RAX: 0000000000000009 > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f21e340a9f9 > RDX: 0000000000000000 RSI: 0000000000004000 RDI: 0000200000ff9000 > RBP: 00007f21e347d5f0 R08: 0000000000000003 R09: 0000000000000000 > R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000000001 > R13: 431bde82d7b634db R14: 0000000000000001 R15: 0000000000000001 Link: https://lore.kernel.org/686ba948.a00a0220.c7b3.0080.GAE@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04Merge tag 'vfs-6.16-rc5.fixes' of ↵Linus Torvalds1-8/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a regression caused by the anonymous inode rework. Making them regular files causes various places in the kernel to tip over starting with io_uring. Revert to the former status quo and port our assertion to be based on checking the inode so we don't lose the valuable VFS_*_ON_*() assertions that have already helped discover weird behavior our outright bugs. - Fix the the upper bound calculation in fuse_fill_write_pages() - Fix priority inversion issues in the eventpoll code - Make secretmen use anon_inode_make_secure_inode() to avoid bypassing the LSM layer - Fix a netfs hang due to missing case in final DIO read result collection - Fix a double put of the netfs_io_request struct - Provide some helpers to abstract out NETFS_RREQ_IN_PROGRESS flag wrangling - Fix infinite looping in netfs_wait_for_pause/request() - Fix a netfs ref leak on an extra subrequest inserted into a request's list of subreqs - Fix various cifs RPC callbacks to set NETFS_SREQ_NEED_RETRY if a subrequest fails retriably - Fix a cifs warning in the workqueue code when reconnecting a channel - Fix the updating of i_size in netfs to avoid a race between testing if we should have extended the file with a DIO write and changing i_size - Merge the places in netfs that update i_size on write - Fix coredump socket selftests * tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: anon_inode: rework assertions netfs: Update tracepoints in a number of ways netfs: Renumber the NETFS_RREQ_* flags to make traces easier to read netfs: Merge i_size update functions netfs: Fix i_size updating smb: client: set missing retry flag in cifs_writev_callback() smb: client: set missing retry flag in cifs_readv_callback() smb: client: set missing retry flag in smb2_writev_callback() netfs: Fix ref leak on inserted extra subreq in write retry netfs: Fix looping in wait functions netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling netfs: Fix double put of request netfs: Fix hang due to missing case in final DIO read result collection eventpoll: Fix priority inversion problem fuse: fix fuse_fill_write_pages() upper bound calculation fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass selftests/coredump: Fix "socket_detect_userspace_client" test failure
2025-07-04tree-wide: s/struct fileattr/struct file_kattr/gChristian Brauner1-2/+2
Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01docs: dma-api: replace consistent with coherentPetr Tesarik1-3/+3
For consistency, always use the term "coherent" when talking about memory that is not subject to CPU caching effects. The term "consistent" is a relic of a long-removed PCI DMA API (pci_alloc_consistent() and pci_free_consistent() functions). Signed-off-by: Petr Tesarik <ptesarik@suse.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250627101015.1600042-3-ptesarik@suse.com
2025-06-26mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on writeSeongJae Park1-0/+1
memcg_path_store() assigns a newly allocated memory buffer to filter->memcg_path, without deallocating the previously allocated and assigned memory buffer. As a result, users can leak kernel memory by continuously writing a data to memcg_path DAMOS sysfs file. Fix the leak by deallocating the previously set memory buffer. Link: https://lkml.kernel.org/r/20250619183608.6647-2-sj@kernel.org Fixes: 7ee161f18b5d ("mm/damon/sysfs-schemes: implement filter directory") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> [6.3.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-26mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the ↵Hao Ge1-0/+14
percpu variable tag->counters When loading a module, as long as the module has memory allocation operations, kmemleak produces a false positive report that resembles the following: unreferenced object (percpu) 0x7dfd232a1650 (size 16): comm "modprobe", pid 1301, jiffies 4294940249 hex dump (first 16 bytes on cpu 2): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 0): kmemleak_alloc_percpu+0xb4/0xd0 pcpu_alloc_noprof+0x700/0x1098 load_module+0xd4/0x348 codetag_module_init+0x20c/0x450 codetag_load_module+0x70/0xb8 load_module+0xef8/0x1608 init_module_from_file+0xec/0x158 idempotent_init_module+0x354/0x608 __arm64_sys_finit_module+0xbc/0x150 invoke_syscall+0xd4/0x258 el0_svc_common.constprop.0+0xb4/0x240 do_el0_svc+0x48/0x68 el0_svc+0x40/0xf8 el0t_64_sync_handler+0x10c/0x138 el0t_64_sync+0x1ac/0x1b0 This is because the module can only indirectly reference alloc_tag_counters through the alloc_tag section, which misleads kmemleak. However, we don't have a kmemleak ignore interface for percpu allocations yet. So let's create one and invoke it for tag->counters. [gehao@kylinos.cn: fix build error when CONFIG_DEBUG_KMEMLEAK=n, s/igonore/ignore/] Link: https://lkml.kernel.org/r/20250620093102.2416767-1-hao.ge@linux.dev Link: https://lkml.kernel.org/r/20250619183154.2122608-1-hao.ge@linux.dev Fixes: 12ca42c23775 ("alloc_tag: allocate percpu counters for module tags dynamically") Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Suren Baghdasaryan <surenb@google.com> [lib/alloc_tag.c] Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-26mm/hugetlb: remove unnecessary holding of hugetlb_lockGe Yang1-37/+17
In isolate_or_dissolve_huge_folio(), after acquiring the hugetlb_lock, it is only for the purpose of obtaining the correct hstate, which is then passed to alloc_and_dissolve_hugetlb_folio(). alloc_and_dissolve_hugetlb_folio() itself also acquires the hugetlb_lock. We can have alloc_and_dissolve_hugetlb_folio() obtain the hstate by itself, so that isolate_or_dissolve_huge_folio() no longer needs to acquire the hugetlb_lock. In addition, we keep the folio_test_hugetlb() check within isolate_or_dissolve_huge_folio(). By doing so, we can avoid disrupting the normal path by vainly holding the hugetlb_lock. replace_free_hugepage_folios() has the same issue, and we should address it as well. Addresses a possible performance problem which was added by the hotfix 113ed54ad276 ("mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios"). Link: https://lkml.kernel.org/r/1748317010-16272-1-git-send-email-yangge1116@126.com Fixes: 113ed54ad276 ("mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios") Signed-off-by: Ge Yang <yangge1116@126.com> Suggested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-23Merge tag 'mm-hotfixes-stable-2025-06-22-18-52' of ↵Linus Torvalds5-27/+69
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes. 7 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. Only 4 are for MM. - The series `Revert "bcache: update min_heap_callbacks to use default builtin swap"' from Kuan-Wei Chiu backs out the author's recent min_heap changes due to a performance regression. A fix for this regression has been developed but we felt it best to go back to the known-good version to give the new code more bake time. - A lot of MAINTAINERS maintenance. I like to get these changes upstreamed promptly because they can't break things and more accurate/complete MAINTAINERS info hopefully improves the speed and accuracy of our responses to submitters and reporters" * tag 'mm-hotfixes-stable-2025-06-22-18-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: add additional mmap-related files to mmap section MAINTAINERS: add memfd, shmem quota files to shmem section MAINTAINERS: add stray rmap file to mm rmap section MAINTAINERS: add hugetlb_cgroup.c to hugetlb section MAINTAINERS: add further init files to mm init block MAINTAINERS: update maintainers for HugeTLB maple_tree: fix MA_STATE_PREALLOC flag in mas_preallocate() MAINTAINERS: add missing test files to mm gup section MAINTAINERS: add missing mm/workingset.c file to mm reclaim section selftests/mm: skip uprobe vma merge test if uprobes are not enabled bcache: remove unnecessary select MIN_HEAP Revert "bcache: remove heap-related macros and switch to generic min_heap" Revert "bcache: update min_heap_callbacks to use default builtin swap" selftests/mm: add configs to fix testcase failure kho: initialize tail pages for higher order folios properly MAINTAINERS: add linux-mm@ list to Kexec Handover mm: userfaultfd: fix race of userfaultfd_move and swap cache mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked" selftests/mm: increase timeout from 180 to 900 seconds mm/shmem, swap: fix softlockup with mTHP swapin
2025-06-23fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypassShivank Garg1-8/+1
Export anon_inode_make_secure_inode() to allow KVM guest_memfd to create anonymous inodes with proper security context. This replaces the current pattern of calling alloc_anon_inode() followed by inode_init_security_anon() for creating security context manually. This change also fixes a security regression in secretmem where the S_PRIVATE flag was not cleared after alloc_anon_inode(), causing LSM/SELinux checks to be bypassed for secretmem file descriptors. As guest_memfd currently resides in the KVM module, we need to export this symbol for use outside the core kernel. In the future, guest_memfd might be moved to core-mm, at which point the symbols no longer would have to be exported. When/if that happens is still unclear. Fixes: 2bfe15c52612 ("mm: create security context for memfd_secret inodes") Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/20250620070328.803704-3-shivankg@amd.com Acked-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23Merge 6.16-rc3 into driver-core-nextGreg Kroah-Hartman6-57/+96
We need the driver-core fixes that are in 6.16-rc3 into here as well to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-20mm: userfaultfd: fix race of userfaultfd_move and swap cacheKairui Song1-2/+31
This commit fixes two kinds of races, they may have different results: Barry reported a BUG_ON in commit c50f8e6053b0, we may see the same BUG_ON if the filemap lookup returned NULL and folio is added to swap cache after that. If another kind of race is triggered (folio changed after lookup) we may see RSS counter is corrupted: [ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_ANONPAGES val:-1 [ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_SHMEMPAGES val:1 Because the folio is being accounted to the wrong VMA. I'm not sure if there will be any data corruption though, seems no. The issues above are critical already. On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache lookup, and tries to move the found folio to the faulting vma. Currently, it relies on checking the PTE value to ensure that the moved folio still belongs to the src swap entry and that no new folio has been added to the swap cache, which turns out to be unreliable. While working and reviewing the swap table series with Barry, following existing races are observed and reproduced [1]: In the example below, move_pages_pte is moving src_pte to dst_pte, where src_pte is a swap entry PTE holding swap entry S1, and S1 is not in the swap cache: CPU1 CPU2 userfaultfd_move move_pages_pte() entry = pte_to_swp_entry(orig_src_pte); // Here it got entry = S1 ... < interrupted> ... <swapin src_pte, alloc and use folio A> // folio A is a new allocated folio // and get installed into src_pte <frees swap entry S1> // src_pte now points to folio A, S1 // has swap count == 0, it can be freed // by folio_swap_swap or swap // allocator's reclaim. <try to swap out another folio B> // folio B is a folio in another VMA. <put folio B to swap cache using S1 > // S1 is freed, folio B can use it // for swap out with no problem. ... folio = filemap_get_folio(S1) // Got folio B here !!! ... < interrupted again> ... <swapin folio B and free S1> // Now S1 is free to be used again. <swapout src_pte & folio A using S1> // Now src_pte is a swap entry PTE // holding S1 again. folio_trylock(folio) move_swap_pte double_pt_lock is_pte_pages_stable // Check passed because src_pte == S1 folio_move_anon_rmap(...) // Moved invalid folio B here !!! The race window is very short and requires multiple collisions of multiple rare events, so it's very unlikely to happen, but with a deliberately constructed reproducer and increased time window, it can be reproduced easily. This can be fixed by checking if the folio returned by filemap is the valid swap cache folio after acquiring the folio lock. Another similar race is possible: filemap_get_folio may return NULL, but folio (A) could be swapped in and then swapped out again using the same swap entry after the lookup. In such a case, folio (A) may remain in the swap cache, so it must be moved too: CPU1 CPU2 userfaultfd_move move_pages_pte() entry = pte_to_swp_entry(orig_src_pte); // Here it got entry = S1, and S1 is not in swap cache folio = filemap_get_folio(S1) // Got NULL ... < interrupted again> ... <swapin folio A and free S1> <swapout folio A re-using S1> move_swap_pte double_pt_lock is_pte_pages_stable // Check passed because src_pte == S1 folio_move_anon_rmap(...) // folio A is ignored !!! Fix this by checking the swap cache again after acquiring the src_pte lock. And to avoid the filemap overhead, we check swap_map directly [2]. The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so far we don't need to worry about that, since folios can only be exposed to the swap cache in the swap out path, and this is covered in this patch by checking the swap cache again after acquiring the src_pte lock. Testing with a simple C program that allocates and moves several GB of memory did not show any observable performance change. Link: https://lkml.kernel.org/r/20250604151038.21968-1-ryncsn@gmail.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Kairui Song <kasong@tencent.com> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2] Reviewed-by: Lokesh Gidra <lokeshgidra@google.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chris Li <chrisl@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-20mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked"David Hildenbrand1-4/+10
After commit 1aaf8c122918 ("mm: gup: fix infinite loop within __get_longterm_locked") we are able to longterm pin folios that are not supposed to get longterm pinned, simply because they temporarily have the LRU flag cleared (esp. temporarily isolated). For example, two __get_longterm_locked() callers can race, or __get_longterm_locked() can race with anything else that temporarily isolates folios. The introducing commit mentions the use case of a driver that uses vm_ops->fault to insert pages allocated through cma_alloc() into the page tables, assuming they can later get longterm pinned. These pages/ folios would never have the LRU flag set and consequently cannot get isolated. There is no known in-tree user making use of that so far, fortunately. To handle that in the future -- and avoid retrying forever to isolate/migrate them -- we will need a different mechanism for the CMA area *owner* to indicate that it actually already allocated the page and is fine with longterm pinning it. The LRU flag is not suitable for that. Probably we can lookup the relevant CMA area and query the bitmap; we only have have to care about some races, probably. If already allocated, we could just allow longterm pinning) Anyhow, let's fix the "must not be longterm pinned" problem first by reverting the original commit. Link: https://lkml.kernel.org/r/20250611131314.594529-1-david@redhat.com Fixes: 1aaf8c122918 ("mm: gup: fix infinite loop within __get_longterm_locked") Signed-off-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/all/20250522092755.GA3277597@tiffany/ Reported-by: Hyesoo Yu <hyesoo.yu@samsung.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Peter Xu <peterx@redhat.com> Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Cc: Aijun Sun <aijun.sun@unisoc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-20mm/shmem, swap: fix softlockup with mTHP swapinKairui Song3-21/+28
Following softlockup can be easily reproduced on my test machine with: echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled swapon /dev/zram0 # zram0 is a 48G swap device mkdir -p /sys/fs/cgroup/memory/test echo 1G > /sys/fs/cgroup/test/memory.max echo $BASHPID > /sys/fs/cgroup/test/cgroup.procs while true; do dd if=/dev/zero of=/tmp/test.img bs=1M count=5120 cat /tmp/test.img > /dev/null rm /tmp/test.img done Then after a while: watchdog: BUG: soft lockup - CPU#0 stuck for 763s! [cat:5787] Modules linked in: zram virtiofs CPU: 0 UID: 0 PID: 5787 Comm: cat Kdump: loaded Tainted: G L 6.15.0.orig-gf3021d9246bc-dirty #118 PREEMPT(voluntary)· Tainted: [L]=SOFTLOCKUP Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015 RIP: 0010:mpol_shared_policy_lookup+0xd/0x70 Code: e9 b8 b4 ff ff 31 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 <48> 8b 1f 48 85 db 74 41 4c 8d 67 08 48 89 fb 48 89 f5 4c 89 e7 e8 RSP: 0018:ffffc90002b1fc28 EFLAGS: 00000202 RAX: 00000000001c20ca RBX: 0000000000724e1e RCX: 0000000000000001 RDX: ffff888118e214c8 RSI: 0000000000057d42 RDI: ffff888118e21518 RBP: 000000000002bec8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000bf4 R11: 0000000000000000 R12: 0000000000000001 R13: 00000000001c20ca R14: 00000000001c20ca R15: 0000000000000000 FS: 00007f03f995c740(0000) GS:ffff88a07ad9a000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f03f98f1000 CR3: 0000000144626004 CR4: 0000000000770eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> shmem_alloc_folio+0x31/0xc0 shmem_swapin_folio+0x309/0xcf0 ? filemap_get_entry+0x117/0x1e0 ? xas_load+0xd/0xb0 ? filemap_get_entry+0x101/0x1e0 shmem_get_folio_gfp+0x2ed/0x5b0 shmem_file_read_iter+0x7f/0x2e0 vfs_read+0x252/0x330 ksys_read+0x68/0xf0 do_syscall_64+0x4c/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f03f9a46991 Code: 00 48 8b 15 81 14 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 20 ad 01 00 f3 0f 1e fa 80 3d 35 97 10 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec RSP: 002b:00007fff3c52bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f03f9a46991 RDX: 0000000000040000 RSI: 00007f03f98ba000 RDI: 0000000000000003 RBP: 00007fff3c52bd50 R08: 0000000000000000 R09: 00007f03f9b9a380 R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000 R13: 00007f03f98ba000 R14: 0000000000000003 R15: 0000000000000000 </TASK> The reason is simple, readahead brought some order 0 folio in swap cache, and the swapin mTHP folio being allocated is in conflict with it, so swapcache_prepare fails and causes shmem_swap_alloc_folio to return -EEXIST, and shmem simply retries again and again causing this loop. Fix it by applying a similar fix for anon mTHP swapin. The performance change is very slight, time of swapin 10g zero folios with shmem (test for 12 times): Before: 2.47s After: 2.48s [kasong@tencent.com: add comment] Link: https://lkml.kernel.org/r/20250610181645.45922-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250610181645.45922-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250609171751.36305-1-ryncsn@gmail.com Fixes: 1dd44c0af4fa ("mm: shmem: skip swapcache for swapin of synchronous swap device") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-18slub: Fix a documentation build error for krealloc()Jonathan Corbet1-5/+5
The kerneldoc comment for krealloc() contains an unmarked literal block, leading to these warnings in the docs build: ./mm/slub.c:4936: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils] ./mm/slub.c:4936: ERROR: Undefined substitution referenced: "--------". [docutils] Mark up and indent the block properly to bring a bit of peace to our build logs. Fixes: 489a744e5fb1 (mm: krealloc: clarify valid usage of __GFP_ZERO) Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20250611155916.2579160-6-willy@infradead.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-06-18slab: Add SL_pfmemalloc flagMatthew Wilcox (Oracle)2-24/+21
Give slab its own name for this flag. Move the implementation from slab.h to slub.c since it's only used inside slub.c. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20250611155916.2579160-5-willy@infradead.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-06-18slab: Add SL_partial flagMatthew Wilcox (Oracle)1-12/+10
Give slab its own name for this flag. Keep the PG_workingset alias information in one place. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20250611155916.2579160-4-willy@infradead.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-06-18slab: Rename slab->__page_flags to slab->flagsMatthew Wilcox (Oracle)2-5/+17
Slab has its own reasons for using flag bits; they aren't just the page bits. Maybe this won't be the ultimate solution, but we should be clear that these bits are in use. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20250611155916.2579160-3-willy@infradead.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-06-18secretmem: move setting O_LARGEFILE and bumping users' count to the place ↵Al Viro1-4/+3
where we create the file Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-17memcg: cgroup: call css_rstat_updated irrespective of in_nmi()Shakeel Butt1-5/+5
css_rstat_updated() is nmi safe, so there is no need to avoid it in in_nmi(), so remove the check. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-17mm/filemap: introduce generic_file_*_mmap_prepare() helpersLorenzo Stoakes1-0/+29
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). The generic mmap handlers are very simple, so we can very easily convert these in advance of converting file systems which use them. This patch does so. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/30622c1f0b98c66840bc8c02668bda276a810b70.1750099179.git.lorenzo.stoakes@oracle.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17fs: consistently use can_mmap_file() helperLorenzo Stoakes2-2/+2
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). Additionally, commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems") permits the use of the .mmap_prepare() hook even in nested filesystems like overlayfs. There are a number of places where we check only for f_op->mmap - this is incorrect now mmap_prepare exists, so update all of these to use the general helper can_mmap_file(). Most notably, this updates the elf logic to allow for the ability to execute binaries on filesystems which have the .mmap_prepare hook, but additionally we update nested filesystems. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/b68145b609532e62bab603dd9686faa6562046ec.1750099179.git.lorenzo.stoakes@oracle.com Acked-by: Kees Cook <kees@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17mm/nommu: use file_has_valid_mmap_hooks() helperLorenzo Stoakes1-1/+1
Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"), the f_op->mmap() hook has been deprecated in favour of f_op->mmap_prepare(). Therefore, update the check for file operations supporting mmap() by using the file_has_valid_mmap_hooks() helper function, which checks for either f_op->mmap or f_op->mmap_prepare rather than checking only for f_op->mmap directly. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/5f120b644b5890d1b50202d0f0d4c9f0d6b62873.1750099179.git.lorenzo.stoakes@oracle.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepareLorenzo Stoakes2-2/+2
The call_mmap() function violates the existing convention in include/linux/fs.h whereby invocations of virtual file system hooks is performed by functions prefixed with vfs_xxx(). Correct this by renaming call_mmap() to vfs_mmap(). This also avoids confusion as to the fact that f_op->mmap_prepare may be invoked here. Also rename __call_mmap_prepare() function to vfs_mmap_prepare() and adjust to accept a file parameter, this is useful later for nested file systems. Finally, fix up the VMA userland tests and ensure the mmap_prepare -> mmap shim is implemented there. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/8d389f4994fa736aa8f9172bef8533c10a9e9011.1750099179.git.lorenzo.stoakes@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17mm, slab: use frozen pages for large kmallocVlastimil Babka1-3/+3
Since slab pages are now frozen, it makes sense to have large kmalloc() objects behave same as small kmalloc(), as the choice between the two is an implementation detail depending on allocation size. Notably, increasing refcount on a slab page containing kmalloc() object is not possible anymore, so it should be consistent for large kmalloc pages. Therefore, change large kmalloc to use the frozen pages API. Because of some unexpected fallout in the slab pages case (see commit b9c0e49abfca ("mm: decline to manipulate the refcount on a slab page"), implement the same kind of checks and warnings as part of this change. Notably, networking code using sendpage_ok() to determine whether the page refcount can be manipulated in the network stack should continue behaving correctly. Before this change, the function returns true for large kmalloc pages and page refcount can be manipulated. After this change, the function will return false. Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-06-17mm, slab: restore NUMA policy support for large kmallocVlastimil Babka1-1/+6
The slab allocator observes the task's NUMA policy in various places such as allocating slab pages. Large kmalloc() allocations used to do that too, until an unintended change by c4cab557521a ("mm/slab_common: cleanup kmalloc_large()") resulted in ignoring mempolicy and just preferring the local node. Restore the NUMA policy support. Fixes: c4cab557521a ("mm/slab_common: cleanup kmalloc_large()") Cc: <stable@vger.kernel.org> Acked-by: Christoph Lameter (Ampere) <cl@gentwo.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-06-17sysfs: treewide: switch back to attribute_group::bin_attrsThomas Weißschuh1-1/+1
The normal bin_attrs field can now handle const pointers. This makes the _new variant unnecessary. Switch all users back. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-4-724bfcf05b99@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-16Merge tag 'x86_urgent_for_6.16-rc3' of ↵Linus Torvalds1-37/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Dave Hansen: "This is a pretty scattered set of fixes. The majority of them are further fixups around the recent ITS mitigations. The rest don't really have a coherent story: - Some flavors of Xen PV guests don't support large pages, but the set_memory.c code assumes all CPUs support them. Avoid problems with a quick CPU feature check. - The TDX code has some wrappers to help retry calls to the TDX module. They use function pointers to assembly functions and the compiler usually generates direct CALLs. But some new compilers, plus -Os turned them in to indirect CALLs and the assembly code was not annotated for indirect calls. Force inlining of the helper to fix it up. - Last, a FRED issue showed up when single-stepping. It's fine when using an external debugger, but was getting stuck returning from a SIGTRAP handler otherwise. Clear the FRED 'swevent' bit to ensure that forward progress is made" * tag 'x86_urgent_for_6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "mm/execmem: Unify early execmem_cache behaviour" x86/its: explicitly manage permissions for ITS pages x86/its: move its_pages array to struct mod_arch_specific x86/Kconfig: only enable ROX cache in execmem when STRICT_MODULE_RWX is set x86/mm/pat: don't collapse pages without PSE set x86/virt/tdx: Avoid indirect calls to TDX assembly functions selftests/x86: Add a test to detect infinite SIGTRAP handler loop x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler
2025-06-14Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of ↵Linus Torvalds4-19/+93
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. Only 4 are for MM" * tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: add mmap_prepare() compatibility layer for nested file systems init: fix build warnings about export.h MAINTAINERS: add Barry as a THP reviewer drivers/rapidio/rio_cm.c: prevent possible heap overwrite mm: close theoretical race where stale TLB entries could linger mm/vma: reset VMA iterator on commit_merge() OOM failure docs: proc: update VmFlags documentation in smaps scatterlist: fix extraneous '@'-sign kernel-doc notation selftests/mm: skip failed memfd setups in gup_longterm
2025-06-13mm: add mmap_prepare() compatibility layer for nested file systemsLorenzo Stoakes3-1/+87
Nested file systems, that is those which invoke call_mmap() within their own f_op->mmap() handlers, may encounter underlying file systems which provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). We have a chicken-and-egg scenario here - until all file systems are converted to using .mmap_prepare(), we cannot convert these nested handlers, as we can't call f_op->mmap from an .mmap_prepare() hook. So we have to do it the other way round - invoke the .mmap_prepare() hook from an .mmap() one. in order to do so, we need to convert VMA state into a struct vm_area_desc descriptor, invoking the underlying file system's f_op->mmap_prepare() callback passing a pointer to this, and then setting VMA state accordingly and safely. This patch achieves this via the compat_vma_mmap_prepare() function, which we invoke from call_mmap() if f_op->mmap_prepare() is specified in the passed in file pointer. We place the fundamental logic into mm/vma.h where VMA manipulation belongs. We also update the VMA userland tests to accommodate the changes. The compat_vma_mmap_prepare() function and its associated machinery is temporary, and will be removed once the conversion of file systems is complete. We carefully place this code so it can be used with CONFIG_MMU and also with cutting edge nommu silicon. [akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build] [lorenzo.stoakes@oracle.com: remove unused declarations] Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-12mm: close theoretical race where stale TLB entries could lingerRyan Roberts1-0/+2
Commit 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries") described a theoretical race as such: """ Nadav Amit identified a theoretical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. """ The solution was to introduce flush_tlb_batched_pending() and call it under the PTL from mprotect/madvise/munmap/mremap to complete any pending tlb flushes. However, while madvise_free_pte_range() and madvise_cold_or_pageout_pte_range() were both retro-fitted to call flush_tlb_batched_pending() immediately after initially acquiring the PTL, they both temporarily release the PTL to split a large folio if they stumble upon one. In this case, where re-acquiring the PTL flush_tlb_batched_pending() must be called again, but it previously was not. Let's fix that. There are 2 Fixes: tags here: the first is the commit that fixed madvise_free_pte_range(). The second is the commit that added madvise_cold_or_pageout_pte_range(), which looks like it copy/pasted the faulty pattern from madvise_free_pte_range(). This is a theoretical bug discovered during code review. Link: https://lkml.kernel.org/r/20250606092809.4194056-1-ryan.roberts@arm.com Fixes: 3ea277194daa ("mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries") Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-12mm/vma: reset VMA iterator on commit_merge() OOM failureLorenzo Stoakes1-18/+4
While an OOM failure in commit_merge() isn't really feasible due to the allocation which might fail (a maple tree pre-allocation) being 'too small to fail', we do need to handle this case correctly regardless. In vma_merge_existing_range(), we can theoretically encounter failures which result in an OOM error in two ways - firstly dup_anon_vma() might fail with an OOM error, and secondly commit_merge() failing, ultimately, to pre-allocate a maple tree node. The abort logic for dup_anon_vma() resets the VMA iterator to the initial range, ensuring that any logic looping on this iterator will correctly proceed to the next VMA. However the commit_merge() abort logic does not do the same thing. This resulted in a syzbot report occurring because mlockall() iterates through VMAs, is tolerant of errors, but ended up with an incorrect previous VMA being specified due to incorrect iterator state. While making this change, it became apparent we are duplicating logic - the logic introduced in commit 41e6ddcaa0f1 ("mm/vma: add give_up_on_oom option on modify/merge, use in uffd release") duplicates the vmg->give_up_on_oom check in both abort branches. Additionally, we observe that we can perform the anon_dup check safely on dup_anon_vma() failure, as this will not be modified should this call fail. Finally, we need to reset the iterator in both cases, so now we can simply use the exact same code to abort for both. We remove the VM_WARN_ON(err != -ENOMEM) as it would be silly for this to be otherwise and it allows us to implement the abort check more neatly. Link: https://lkml.kernel.org/r/20250606125032.164249-1-lorenzo.stoakes@oracle.com Fixes: 47b16d0462a4 ("mm: abort vma_modify() on merge out of memory failure") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: syzbot+d16409ea9ecc16ed261a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/6842cc67.a00a0220.29ac89.003b.GAE@google.com/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-11shmem: no dentry retention past the refcount reaching zeroAl Viro1-1/+1
Just set DCACHE_DONTCACHE in ->s_d_flags and be done with that. Dentries there live for as long as they are pinned; once the refcount hits zero, that's it. The same, of course, goes for other tree-in-dcache filesystems - more in the next commits... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-11Revert "mm/execmem: Unify early execmem_cache behaviour"Mike Rapoport (Microsoft)1-37/+3
The commit d6d1e3e6580c ("mm/execmem: Unify early execmem_cache behaviour") changed early behaviour of execemem ROX cache to allow its usage in early x86 code that allocates text pages when CONFIG_MITGATION_ITS is enabled. The permission management of the pages allocated from execmem for ITS mitigation is now completely contained in arch/x86/kernel/alternatives.c and therefore there is no need to special case early allocations in execmem. This reverts commit d6d1e3e6580ca35071ad474381f053cbf1fb6414. Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250603111446.2609381-6-rppt@kernel.org
2025-06-11new helper: set_default_d_op()Al Viro1-1/+1
... to be used instead of manually assigning to ->s_d_op. All in-tree filesystem converted (and field itself is renamed, so any out-of-tree ones in need of conversion will be caught by compiler). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-10Revert "mm/damon/Kconfig: enable CONFIG_DAMON by default"Linus Torvalds1-1/+0
This reverts commit 28615e6eed152f2fda5486680090b74aeed7b554. No, we don't make random features default to being on. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-06-09slab: Decouple slab_debug and no_hash_pointersKees Cook1-3/+2
Some system owners use slab_debug=FPZ (or similar) as a hardening option, but do not want to be forced into having kernel addresses exposed due to the implicit "no_hash_pointers" boot param setting.[1] Introduce the "hash_pointers" boot param, which defaults to "auto" (the current behavior), but also includes "always" (forcing on hashing even when "slab_debug=..." is defined), and "never". The existing "no_hash_pointers" boot param becomes an alias for "hash_pointers=never". This makes it possible to boot with "slab_debug=FPZ hash_pointers=always". Link: https://github.com/KSPP/linux/issues/368 [1] Fixes: 792702911f58 ("slub: force on no_hash_pointers when slub_debug is enabled") Co-developed-by: Sergio Perez Gonzalez <sperezglz@gmail.com> Signed-off-by: Sergio Perez Gonzalez <sperezglz@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Rafael Aquini <raquini@redhat.com> Tested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://patch.msgid.link/20250415170232.it.467-kees@kernel.org [kees@kernel.org: Add note about hash_pointers into slab_debug kernel parameter documentation.] Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar1-2/+2
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-07Merge tag 'loongarch-6.16' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch updates from Huacai Chen: - Adjust the 'make install' operation - Support SCHED_MC (Multi-core scheduler) - Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS - Enable HAVE_ARCH_STACKLEAK - Increase max supported CPUs up to 2048 - Introduce the numa_memblks conversion - Add PWM controller nodes in dts - Some bug fixes and other small changes * tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: platform/loongarch: laptop: Unregister generic_sub_drivers on exit platform/loongarch: laptop: Add backlight power control support platform/loongarch: laptop: Get brightness setting from EC on probe LoongArch: dts: Add PWM support to Loongson-2K2000 LoongArch: dts: Add PWM support to Loongson-2K1000 LoongArch: dts: Add PWM support to Loongson-2K0500 LoongArch: vDSO: Correctly use asm parameters in syscall wrappers LoongArch: Fix panic caused by NULL-PMD in huge_pte_offset() LoongArch: Preserve firmware configuration when desired LoongArch: Avoid using $r0/$r1 as "mask" for csrxchg LoongArch: Introduce the numa_memblks conversion LoongArch: Increase max supported CPUs up to 2048 LoongArch: Enable HAVE_ARCH_STACKLEAK LoongArch: Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS LoongArch: Add SCHED_MC (Multi-core scheduler) support LoongArch: Add some annotations in archhelp LoongArch: Using generic scripts/install.sh in `make install` LoongArch: Add a default install.sh
2025-06-07Merge tag 'mm-stable-2025-06-06-16-09' of ↵Linus Torvalds11-11/+34
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: "The series 'Fix uprobe pte be overwritten when expanding vma' fixes a longstanding and quite obscure bug related to the vma merging of the uprobe mmap page" * tag 'mm-stable-2025-06-06-16-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: selftests/mm: add test about uprobe pte be orphan during vma merge selftests/mm: extract read_sysfs and write_sysfs into vm_util mm: expose abnormal new_pte during move_ptes mm: fix uprobe pte be overwritten when expanding vma mm/damon: s/primitives/code/ on comments
2025-06-07Merge tag 'mm-hotfixes-stable-2025-06-06-16-02' of ↵Linus Torvalds6-21/+64
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "13 hotfixes. 6 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 11 are for MM" * tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count MAINTAINERS: add mm swap section kmsan: test: add module description MAINTAINERS: add tlb trace events to MMU GATHER AND TLB INVALIDATION mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race mm/hugetlb: unshare page tables during VMA split, not before MAINTAINERS: add Alistair as reviewer of mm memory policy iov_iter: use iov_offset for length calculation in iov_iter_aligned_bvec mm/mempolicy: fix incorrect freeing of wi_kobj alloc_tag: handle module codetag load errors as module load failures mm/madvise: handle madvise_lock() failure during race unwinding mm: fix vmstat after removing NR_BOUNCE KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMY