kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2025-03-18	selftests/mm: add tests for folio_split(), buddy allocator like split	Zi Yan	1	-7/+27
	It splits page cache folios to orders from 0 to 8 at different in-folio offset. Link: https://lkml.kernel.org/r/20250307174001.242794-9-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-18	xarray: add xas_try_split() to split a multi-index entry	Zi Yan	1	-0/+1
	Patch series "Buddy allocator like (or non-uniform) folio split", v10. This patchset adds a new buddy allocator like (or non-uniform) large folio split from a order-n folio to order-m with m < n. It reduces 1. the total number of after-split folios from 2^(n-m) to n-m+1; 2. the amount of memory needed for multi-index xarray split from 2^(n/6-m/6) to n/6-m/6, assuming XA_CHUNK_SHIFT=6; 3. keep more large folios after a split from all order-m folios to order-(n-1) to order-m folios. For example, to split an order-9 to order-0, folio split generates 10 (or 11 for anonymous memory) folios instead of 512, allocates 1 xa_node instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2 order-0 folios (or 4 order-0 for anonymous memory) instead of 512 order-0 folios. Instead of duplicating existing split_huge_page() code, __folio_split() is introduced as the shared backend code for both split_huge_page_to_list_to_order() and folio_split(). __folio_split() can support both uniform split and buddy allocator like (or non-uniform) split. All existing split_huge_page() users can be gradually converted to use folio_split() if possible. In this patchset, I converted truncate_inode_partial_folio() to use folio_split(). xfstests quick group passed for both tmpfs and xfs. I also semi-replicated Hugh's test[12] and ran it without any issue for almost 24 hours. This patch (of 8): A preparation patch for non-uniform folio split, which always split a folio into half iteratively, and minimal xarray entry split. Currently, xas_split_alloc() and xas_split() always split all slots from a multi-index entry. They cost the same number of xa_node as the to-be-split slots. For example, to split an order-9 entry, which takes 2^(9-6)=8 slots, assuming XA_CHUNK_SHIFT is 6 (!CONFIG_BASE_SMALL), 8 xa_node are needed. Instead xas_try_split() is intended to be used iteratively to split the order-9 entry into 2 order-8 entries, then split one order-8 entry, based on the given index, to 2 order-7 entries, ..., and split one order-1 entry to 2 order-0 entries. When splitting the order-6 entry and a new xa_node is needed, xas_try_split() will try to allocate one if possible. As a result, xas_try_split() would only need 1 xa_node instead of 8. When a new xa_node is needed during the split, xas_try_split() can try to allocate one but no more. -ENOMEM will be return if a node cannot be allocated. -EINVAL will be return if a sibling node is split or cascade split happens, where two or more new nodes are needed, and these are not supported by xas_try_split(). xas_split_alloc() and xas_split() split an order-9 to order-0: --------------------------------- \| \| \| \| \| \| \| \| \| \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \| 6 \| 7 \| \| \| \| \| \| \| \| \| \| --------------------------------- \| \| \| \| ------- --- --- ------- \| \| ... \| \| V V V V ----------- ----------- ----------- ----------- \| xa_node \| \| xa_node \| ... \| xa_node \| \| xa_node \| ----------- ----------- ----------- ----------- xas_try_split() splits an order-9 to order-0: --------------------------------- \| \| \| \| \| \| \| \| \| \| 0 \| 1 \| 2 \| 3 \| 4 \| 5 \| 6 \| 7 \| \| \| \| \| \| \| \| \| \| --------------------------------- \| \| V ----------- \| xa_node \| ----------- Link: https://lkml.kernel.org/r/20250307174001.242794-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250307174001.242794-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	tools/selftests: add guard region test for /proc/$pid/pagemap	Lorenzo Stoakes	2	-0/+48
	Add a test to the guard region self tests to assert that the /proc/$pid/pagemap information now made availabile to the user correctly identifies and reports guard regions. As a part of this change, update vm_util.h to add the new bit (note there is no header file in the kernel where this is exposed, the user is expected to provide their own mask) and utilise the helper functions there for pagemap functionality. [lorenzo.stoakes@oracle.com: fixup define name] Link: https://lkml.kernel.org/r/32e83941-e6f5-42ee-9292-a44c16463cf1@lucifer.local Link: https://lkml.kernel.org/r/164feb0a43ae72650e6b20c3910213f469566311.1740139449.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm/mlock: print error on failure	Brendan Jackman	2	-3/+9
	It's not really possible to start diagnosing this without knowing the actual error. Also update the mlock2 helper to behave like libc would by setting errno and returning -1. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-12-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: skip mlock tests if nobody user can't read it	Brendan Jackman	1	-1/+1
	If running from a directory that can't be read by unprivileged users, executing on-fault-test via the nobody user will fail. The kselftest build does give the file the correct permissions, but after being installed it might be in a directory without global execute permissions. Since the script can't safely fix that, just skip if it happens. Note that the stderr of the `ls` command is unfiltered meaning the user sees a "permission denied" error that can help inform them why the test was skipped. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-11-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: ensure uffd-wp-mremap gets pages of each size	Brendan Jackman	1	-1/+22
	This test allocates a page of every available size and doesn't have any SKIP logic if the allocation fails. So, ensure it's available and skip the test if we can't do so. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-10-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: drop unnecessary sudo usage	Brendan Jackman	1	-1/+1
	This script must be run as root anyway (see all the writing to privileged files in /proc etc). Remove the unnecessary use of sudo to avoid breaking on single-user systems that don't have sudo. This also avoids confusing readers. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-9-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: skip gup_longterm tests on weird filesystems	Brendan Jackman	1	-1/+9
	Some filesystems don't support ftruncate()ing unlinked files. They return ENOENT. In that case, skip the test. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-8-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: skip map_populate on weird filesystems	Brendan Jackman	1	-0/+7
	It seems that 9pfs does not allow truncating unlinked files, Mark Brown has noted that NFS may also behave this way. It doesn't seem quite right to call this a "bug" but it's probably a special enough case that it makes sense for the test to just SKIP if it happens. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-7-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: don't fail uffd-stress if too many CPUs	Brendan Jackman	1	-1/+10
	This calculation divides a fixed parameter by an environment-dependent parameter i.e. the number of CPUs. The simple way to avoid machine-specific failures here is to just put a cap on the max value of the latter. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-6-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: print some details when uffd-stress gets bad params	Brendan Jackman	1	-1/+2
	So this can be debugged more easily. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-5-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm/uffd: rename nr_cpus -> nr_parallel	Brendan Jackman	4	-20/+20
	A later commit will bound this variable so it no longer necessarily matches the number of CPUs. Rename it appropriately. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-4-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: skip uffd-wp-mremap if userfaultfd not available	Brendan Jackman	1	-1/+4
	It's obvious that this should fail in that case, but still, save the reader the effort of figuring out that they've run into this by just SKIPping Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-3-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: skip uffd-stress if userfaultfd not available	Brendan Jackman	1	-2/+2
	It's pretty obvious that the test wouldn't work if you don't have the feature enabled. But, it's still useful to SKIP instead of failing so the reader can immediately tell that this is the reason why. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-2-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: report errno when things fail in gup_longterm	Brendan Jackman	1	-16/+21
	Patch series "selftests/mm: Some cleanups from trying to run them", v4. I never had much luck running mm selftests so I spent a few hours digging into why. Looks like most of the reason is missing SKIP checks, so this series is just adding a bunch of those that I found. I did not do anything like all of them, just the ones I spotted in gup_longterm, gup_test, mmap, userfaultfd and memfd_secret. It's a bit unfortunate to have to skip those tests when ftruncate() fails, but I don't have time to dig deep enough into it to actually make them pass. I have observed the issue on 9pfs and heard rumours that NFS has a similar problem. I'm now able to run these test groups successfully: - mmap - gup_test - compaction - migration - page_frag - userfaultfd - mlock I've never gone past "Waiting for hugetlb memory to get depleted", in the hugetlb tests. I don't know if they are stuck or if they would eventually work if I was patient enough (testing on a 1G machine). I have not investigated further. I had some issues with mlock tests failing due to -ENOSRCH from mlock2(), I can no longer reproduce that though, things work OK now. Of the remaining tests there may be others that work fine, but there's no convenient way to survey the whole output of run_vmtests.sh so I'm just going test by test here. In my spare moments I am slowly chipping away at a setup to run these tests continuously in a reasonably hermetic QEMU environment via virtme-ng: https://github.com/bjackman/linux/blob/5fad4b9c592290f38e0f8bc73c9abb9c99d8787c/README.md Hopefully that will eventually offer a way to provide a "canned" environment where the tests are known to work, which can be fairly easily reproduced by any developer. This patch (of 12): Just reporting failure doesn't tell you what went wrong. This can fail in different ways so report errno to help the reader get started debugging. Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-0-dec210a658f5@google.com Link: https://lkml.kernel.org/r/20250311-mm-selftests-v4-1-dec210a658f5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: fix spelling	Ujwal Kundur	1	-1/+1
	Fix misspelling flagged by codespell. Link: https://lkml.kernel.org/r/20250215081803.1793-1-ujwal.kundur@gmail.com Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: make vma cache SLAB_TYPESAFE_BY_RCU	Suren Baghdasaryan	2	-7/+7
	To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that object reuse before RCU grace period is over will be detected by lock_vma_under_rcu(). Current checks are sufficient as long as vma is detached before it is freed. The only place this is not currently happening is in exit_mmap(). Add the missing vma_mark_detached() in exit_mmap(). Another issue which might trick lock_vma_under_rcu() during vma reuse is vm_area_dup(), which copies the entire content of the vma into a new one, overriding new vma's vm_refcnt and temporarily making it appear as attached. This might trick a racing lock_vma_under_rcu() to operate on a reused vma if it found the vma before it got reused. To prevent this situation, we should ensure that vm_refcnt stays at detached state (0) when it is copied and advances to attached state only after it is added into the vma tree. Introduce vm_area_init_from() which preserves new vma's vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached state with no current readers when they are freed, lock_vma_under_rcu() will not be able to take vm_refcnt after vma got detached even if vma is reused. vma_mark_attached() in modified to include a release fence to ensure all stores to the vma happen before vm_refcnt gets initialized. Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and will minimize the number of call_rcu() calls. [surenb@google.com: remove atomic_set_release() usage in tools/] Link: https://lkml.kernel.org/r/20250217054351.2973666-1-surenb@google.com Link: https://lkml.kernel.org/r/20250213224655.1680278-18-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: move lesser used vma_area_struct members into the last cacheline	Suren Baghdasaryan	1	-19/+18
	Move several vma_area_struct members which are rarely or never used during page fault handling into the last cacheline to better pack vm_area_struct. As a result vm_area_struct will fit into 3 as opposed to 4 cachelines. New typical vm_area_struct layout: struct vm_area_struct { union { struct { long unsigned int vm_start; /* 0 8 / long unsigned int vm_end; / 8 8 / }; / 0 16 / freeptr_t vm_freeptr; / 0 8 / }; / 0 16 / struct mm_struct vm_mm; /* 16 8 / pgprot_t vm_page_prot; / 24 8 / union { const vm_flags_t vm_flags; / 32 8 / vm_flags_t __vm_flags; / 32 8 / }; / 32 8 / unsigned int vm_lock_seq; / 40 4 / / XXX 4 bytes hole, try to pack / struct list_head anon_vma_chain; / 48 16 / / --- cacheline 1 boundary (64 bytes) --- / struct anon_vma anon_vma; /* 64 8 / const struct vm_operations_struct vm_ops; /* 72 8 / long unsigned int vm_pgoff; / 80 8 / struct file vm_file; /* 88 8 / void vm_private_data; /* 96 8 / atomic_long_t swap_readahead_info; / 104 8 / struct mempolicy vm_policy; /* 112 8 / struct vma_numab_state numab_state; /* 120 8 / / --- cacheline 2 boundary (128 bytes) --- / refcount_t vm_refcnt (__aligned__(64)); / 128 4 / / XXX 4 bytes hole, try to pack / struct { struct rb_node rb (__aligned__(8)); / 136 24 / long unsigned int rb_subtree_last; / 160 8 / } __attribute__((__aligned__(8))) shared; / 136 32 / struct anon_vma_name anon_name; /* 168 8 / struct vm_userfaultfd_ctx vm_userfaultfd_ctx; / 176 8 / / size: 192, cachelines: 3, members: 18 / / sum members: 176, holes: 2, sum holes: 8 / / padding: 8 / / forced alignments: 2, forced holes: 1, sum forced holes: 4 */ } __attribute__((__aligned__(64))); Memory consumption per 1000 VMAs becomes 48 pages: slabinfo after vm_area_struct changes: <name> ... <objsize> <objperslab> <pagesperslab> : ... vm_area_struct ... 192 42 2 : ... Link: https://lkml.kernel.org/r/20250213224655.1680278-14-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: replace vm_lock and detached flag with a reference count	Suren Baghdasaryan	2	-33/+35
	rw_semaphore is a sizable structure of 40 bytes and consumes considerable space for each vm_area_struct. However vma_lock has two important specifics which can be used to replace rw_semaphore with a simpler structure: 1. Readers never wait. They try to take the vma_lock and fall back to mmap_lock if that fails. 2. Only one writer at a time will ever try to write-lock a vma_lock because writers first take mmap_lock in write mode. Because of these requirements, full rw_semaphore functionality is not needed and we can replace rw_semaphore and the vma->detached flag with a refcount (vm_refcnt). When vma is in detached state, vm_refcnt is 0 and only a call to vma_mark_attached() can take it out of this state. Note that unlike before, now we enforce both vma_mark_attached() and vma_mark_detached() to be done only after vma has been write-locked. vma_mark_attached() changes vm_refcnt to 1 to indicate that it has been attached to the vma tree. When a reader takes read lock, it increments vm_refcnt, unless the top usable bit of vm_refcnt (0x40000000) is set, indicating presence of a writer. When writer takes write lock, it sets the top usable bit to indicate its presence. If there are readers, writer will wait using newly introduced mm->vma_writer_wait. Since all writers take mmap_lock in write mode first, there can be only one writer at a time. The last reader to release the lock will signal the writer to wake up. refcount might overflow if there are many competing readers, in which case read-locking will fail. Readers are expected to handle such failures. In summary: 1. all readers increment the vm_refcnt; 2. writer sets top usable (writer) bit of vm_refcnt; 3. readers cannot increment the vm_refcnt if the writer bit is set; 4. in the presence of readers, writer must wait for the vm_refcnt to drop to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma with no readers; 5. vm_refcnt overflow is handled by the readers. While this vm_lock replacement does not yet result in a smaller vm_area_struct (it stays at 256 bytes due to cacheline alignment), it allows for further size optimization by structure member regrouping to bring the size of vm_area_struct below 192 bytes. [surenb@google.com: fix a crash due to vma_end_read() that should have been removed] Link: https://lkml.kernel.org/r/20250220200208.323769-1-surenb@google.com Link: https://lkml.kernel.org/r/20250213224655.1680278-13-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: introduce vma_iter_store_attached() to use with attached vmas	Suren Baghdasaryan	2	-9/+43
	vma_iter_store() functions can be used both when adding a new vma and when updating an existing one. However for existing ones we do not need to mark them attached as they are already marked that way. With vma->detached being a separate flag, double-marking a vmas as attached or detached is not an issue because the flag will simply be overwritten with the same value. However once we fold this flag into the refcount later in this series, re-attaching or re-detaching a vma becomes an issue since these operations will be incrementing/decrementing a refcount. Introduce vma_iter_store_new() and vma_iter_store_overwrite() to replace vma_iter_store() and avoid re-attaching a vma during vma update. Add assertions in vma_mark_attached()/vma_mark_detached() to catch invalid usage. Update vma tests to check for vma detached state correctness. Link: https://lkml.kernel.org/r/20250213224655.1680278-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: mark vma as detached until it's added into vma tree	Suren Baghdasaryan	1	-5/+12
	Current implementation does not set detached flag when a VMA is first allocated. This does not represent the real state of the VMA, which is detached until it is added into mm's VMA tree. Fix this by marking new VMAs as detached and resetting detached flag only after VMA is added into a tree. Introduce vma_mark_attached() to make the API more readable and to simplify possible future cleanup when vma->vm_mm might be used to indicate detached vma and vma_mark_attached() will need an additional mm parameter. Link: https://lkml.kernel.org/r/20250213224655.1680278-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: move per-vma lock into vm_area_struct	Suren Baghdasaryan	1	-26/+7
	Back when per-vma locks were introduces, vm_lock was moved out of vm_area_struct in [1] because of the performance regression caused by false cacheline sharing. Recent investigation [2] revealed that the regressions is limited to a rather old Broadwell microarchitecture and even there it can be mitigated by disabling adjacent cacheline prefetching, see [3]. Splitting single logical structure into multiple ones leads to more complicated management, extra pointer dereferences and overall less maintainable code. When that split-away part is a lock, it complicates things even further. With no performance benefits, there are no reasons for this split. Merging the vm_lock back into vm_area_struct also allows vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. Move vm_lock back into vm_area_struct, aligning it at the cacheline boundary and changing the cache to be cacheline-aligned as well. With kernel compiled using defconfig, this causes VMA memory consumption to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes: slabinfo before: <name> ... <objsize> <objperslab> <pagesperslab> : ... vma_lock ... 40 102 1 : ... vm_area_struct ... 160 51 2 : ... slabinfo after moving vm_lock: <name> ... <objsize> <objperslab> <pagesperslab> : ... vm_area_struct ... 256 32 2 : ... Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages, which is 5.5MB per 100000 VMAs. Note that the size of this structure is dependent on the kernel configuration and typically the original size is higher than 160 bytes. Therefore these calculations are close to the worst case scenario. A more realistic vm_area_struct usage before this change is: <name> ... <objsize> <objperslab> <pagesperslab> : ... vma_lock ... 40 102 1 : ... vm_area_struct ... 176 46 2 : ... Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages, which is 3.9MB per 100000 VMAs. This memory consumption growth can be addressed later by optimizing the vm_lock. [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/ [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/ [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250213224655.1680278-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	tools/selftests: add file/shmem-backed mapping guard region tests	Lorenzo Stoakes	1	-0/+595
	Extend the guard region self tests to explicitly assert that guard regions work correctly for functionality specific to file-backed and shmem mappings. In addition to testing all of the existing guard region functionality that is currently tested against anonymous mappings against file-backed and shmem mappings (except those which are exclusive to anonymous mapping), we now also: * Test that MADV_SEQUENTIAL does not cause unexpected readahead behaviour. * Test that MAP_PRIVATE behaves as expected with guard regions installed in both a shared and private mapping of an fd. * Test that a read-only file can correctly establish guard regions. * Test a probable fault-around case does not interfere with guard regions (or vice-versa). * Test that truncation does not eliminate guard regions. * Test that hole punching functions as expected in the presence of guard regions. * Test that a read-only mapping of a memfd write sealed mapping can have guard regions established within it and function correctly without violation of the seal. * Test that guard regions installed into a mapping of the anonymous zero page function correctly. Link: https://lkml.kernel.org/r/90c16bec5fcaafcd1700dfa3e9988c3e1aa9ac1d.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	tools/selftests: expand all guard region tests to file-backed	Lorenzo Stoakes	1	-85/+205
	Extend the guard region tests to allow for test fixture variants for anon, shmem, and local file files. This allows us to assert that each of the expected behaviours of anonymous memory also applies correctly to file-backed (both shmem and an a file created locally in the current working directory) and thus asserts the same correctness guarantees as all the remaining tests do. The fixture teardown is now performed in the parent process rather than child forked ones, meaning cleanup is always performed, including unlinking any generated temporary files. Additionally the variant fixture data type now contains an enum value indicating the type of backing store and the mmap() invocation is abstracted to allow for the mapping of whichever backing store the variant is testing. We adjust tests as necessary to account for the fact they may now reference files rather than anonymous memory. Link: https://lkml.kernel.org/r/ab42228d2bd9b8aa18e9faebcd5c88732a7e5820.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: rename guard-pages to guard-regions	Lorenzo Stoakes	4	-24/+24
	The feature formerly referred to as guard pages is more correctly referred to as 'guard regions', as in fact no pages are ever allocated in the process of installing the regions. To avoid confusion, rename the tests accordingly. [lorenzo.stoakes@oracle.com: fix guard regions invocation] Link: https://lkml.kernel.org/r/13426c71-d069-4407-9340-b227ff8b8736@lucifer.local Link: https://lkml.kernel.org/r/1c3cd04a3f69b5756b94bda701ac88325a9be18b.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: allow tests to run with no huge pages support	Mark Brown	1	-24/+42
	Currently the mm selftests refuse to run if huge pages are not available in the current system but this is an optional feature and not all the tests actually require them. Change the test during startup to be non-fatal and skip or omit tests which actually rely on having huge pages, allowing the other tests to be run. The gup_test does support using madvise() to configure huge pages but it ignores the error code so we just let it run. Link: https://lkml.kernel.org/r/20250212-kselftest-mm-no-hugepages-v1-2-44702f538522@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Mariano Pache <npache@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests: mm: fix typo	Eric Salem	1	-1/+1
	Fix misspelling. Link: https://lkml.kernel.org/r/77e0e915-36c3-4c95-84b8-0b73aaa17951@gmail.com Signed-off-by: Eric Salem <ericsalem@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: fix thuge-gen test name uniqueness	Mark Brown	1	-2/+2
	The thuge-gen test_mmap() and test_shmget() tests are repeatedly run for a variety of sizes but always report the result of their test with the same name, meaning that automated sysetms running the tests are unable to distinguish between the various tests. Add the supplied sizes to the logged test names to distinguish between runs. My test automation was getting pretty confused about what was going on - the test names are a pretty important external interface. Link: https://lkml.kernel.org/r/20250204-kselftest-mm-fix-dups-v1-1-6afe417ef4bb@kernel.org Fixes: b38bd9b2c448 ("selftests/mm: thuge-gen: conform to TAP format output") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: completely abstract unnecessary adj_start calculation	Lorenzo Stoakes	1	-2/+2
	The adj_start calculation has been a constant source of confusion in the VMA merge code. There are two cases to consider, one where we adjust the start of the vmg->middle VMA (i.e. the vmg->__adjust_middle_start merge flag is set), in which case adj_start is calculated as: (1) adj_start = vmg->end - vmg->middle->vm_start And the case where we adjust the start of the vmg->next VMA (i.e. the vmg->__adjust_next_start merge flag is set), in which case adj_start is calculated as: (2) adj_start = -(vmg->middle->vm_end - vmg->end) We apply (1) thusly: vmg->middle->vm_start = vmg->middle->vm_start + vmg->end - vmg->middle->vm_start Which simplifies to: vmg->middle->vm_start = vmg->end Similarly, we apply (2) as: vmg->next->vm_start = vmg->next->vm_start + -(vmg->middle->vm_end - vmg->end) Noting that for these VMAs to be mergeable vmg->middle->vm_end == vmg->next->vm_start and so this simplifies to: vmg->next->vm_start = vmg->next->vm_start + -(vmg->next->vm_start - vmg->end) Which simplifies to: vmg->next->vm_start = vmg->end Therefore in each case, we simply need to adjust the start of the VMA to vmg->end (!) and can do away with this adj_start calculation. The only caveat is that we must ensure we update the vm_pgoff field correctly. We therefore abstract this entire calculation to a new function vmg_adjust_set_range() which performs this calculation and sets the adjusted VMA's new range using the general vma_set_range() function. We also must update vma_adjust_trans_huge() which expects the now-abstracted adj_start parameter. It turns out this is wholly unnecessary. In vma_adjust_trans_huge() the relevant code is: if (adjust_next > 0) { struct vm_area_struct *next = find_vma(vma->vm_mm, vma->vm_end); unsigned long nstart = next->vm_start; nstart += adjust_next; split_huge_pmd_if_needed(next, nstart); } The only case where this is relevant is when vmg->__adjust_middle_start is specified (in which case adj_next would have been positive), i.e. the one in which the vma specified is vmg->prev and this the sought 'next' VMA would be vmg->middle. We can therefore eliminate the find_vma() invocation altogether and simply provide the vmg->middle VMA in this instance, or NULL otherwise. Again we have an adj_next offset calculation: next->vm_start + vmg->end - vmg->middle->vm_start Where next == vmg->middle this simplifies to vmg->end as previously demonstrated. Therefore nstart is equal to vmg->end, which is already passed to vma_adjust_trans_huge() via the 'end' parameter and so this code (rather delightfully) simplifies to: if (next) split_huge_pmd_if_needed(next, end); With these changes in place, it becomes silly for commit_merge() to return vmg->target, as it is always the same and threaded through vmg, so we finally change commit_merge() to return an error value once again. This patch has no change in functional behaviour. Link: https://lkml.kernel.org/r/7bce2cd4b5afb56211822835d145471280c3dccc.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: eliminate adj_start parameter from commit_merge()	Lorenzo Stoakes	1	-0/+2
	Introduce internal vmg->__adjust_middle_start and vmg->__adjust_next_start merge flags, enabling us to indicate to commit_merge() that we are performing a merge which either spans only part of vmg->middle, or part of vmg->next respectively. In the former instance, we change the start of vmg->middle to match the attributes of vmg->prev, without spanning all of vmg->middle. This implies that vmg->prev->vm_end and vmg->middle->vm_start are both increased to form the new merged VMA (vmg->prev) and the new subsequent VMA (vmg->middle). In the latter case, we change the end of vmg->middle to match the attributes of vmg->next, without spanning all of vmg->next. This implies that vmg->middle->vm_end and vmg->next->vm_start are both decreased to form the new merged VMA (vmg->next) and the new prior VMA (vmg->middle). Since we now have a stable set of prev, middle, next VMAs threaded through vmg and with these flags set know what is happening, we can perform the calculation in commit_merge() instead. This allows us to drop the confusing adj_start parameter and instead pass semantic information to commit_merge(). In the latter case the -(middle->vm_end - start) calculation becomes -(middle->vm-end - vmg->end), however this is correct as vmg->end is set to the start parameter. This is because in this case (rather confusingly), we manipulate vmg->middle, but ultimately return vmg->next, whose range will be correctly specified. At this point vmg->start, end is the new range for the prior VMA rather than the merged one. This patch has no change in functional behaviour. Link: https://lkml.kernel.org/r/bcec0cd980b373a5eb02236cb033034ce1effe42.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: further refactor commit_merge()	Lorenzo Stoakes	1	-2/+7
	The current VMA merge mechanism contains a number of confusing mechanisms around removal of VMAs on merge and the shrinking of the VMA adjacent to vma->target in the case of merges which result in a partial merge with that adjacent VMA. Since we now have a STABLE set of VMAs - prev, middle, next - we are now able to have the caller of commit_merge() explicitly tell us which VMAs need deleting, using newly introduced internal VMA merge flags. Doing so allows us to embed this state within the VMG and remove the confusing remove, remove2 parameters from commit_merge(). We additionally are able to eliminate the highly confusing and misleading 'expanded' parameter - a parameter that in reality refers to whether or not the return VMA is the target one or the one immediately adjacent. We can infer which is the case from whether or not the adj_start parameter is negative. This also allows us to simplify further logic around iterator configuration and VMA iterator stores. Doing so means we can also eliminate the adjust parameter, as we are able to infer which VMA ought to be adjusted from adj_start - a positive value implies we adjust the start of 'middle', a negative one implies we adjust the start of 'next'. We are then able to have commit_merge() explicitly return the target VMA, or NULL on inability to pre-allocate memory. Errors were previously filtered so behaviour does not change. We additionally move from the slightly odd use of a bitwise-flag enum vmg->merge_flags field to vmg bitfields. This patch has no change in functional behaviour. Link: https://lkml.kernel.org/r/7bf2ed24af68aac18672b7acebbd9102f48c5b03.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	mm: simplify vma merge structure and expand comments	Lorenzo Stoakes	1	-26/+26
	Patch series "mm: further simplify VMA merge operation", v3. While significant efforts have been made to improve the VMA merge operation, there remains remnants of the bad (or rather confusing) old days, which make the code difficult to understand, more bug prone and thus harder to modify. This series attempts to significantly improve matters in a number of respects - with a focus on simplifying the commit_merge() function which actually actions the merge operation - and importantly, adjusting the two most confusing merge cases - those in which we 'adjust' the VMA immediately adjacent to the one being merged. One source of confusion are the VMAs being threaded through the operation themselves - vmg->prev, vmg->vma and vmg->next. At the start of the operation, vmg->vma is either NULL if a new VMA is propose to be added, or if not then a pointer to an existing VMA being modified, and prev/next are (perhaps not present) VMAs sat immediately before and after the range specified in vmg->start, end, respectively. However, during the VMA merge operation, we change vmg->start, end and pgoff to span the newly merged range and vmg->vma to either be: a. The ultimately returned VMA (in most cases) or b. A VMA which we will manipulate, but ultimately instead return vmg->next. Case b. especially here is confusing for somebody reading this code, but the fact we update this state, along with vmg->start, end, pgoff only makes matters worse. We simplify things by replacing vmg->vma with vmg->middle and never changing it - this is always either NULL (for a new VMA) or the VMA being modified between vmg->prev and vmg->next. We further simplify by placing the merged VMA in a new vmg->target field - whether case b. above is the case or not. The reader of the code can now simply rely on vmg->middle being the middle VMA and vmg->target being the ultimately merged VMA. We additionally tackle the confusing cases where we 'adjust' VMAs other than the one we ultimately return as the merged VMA (this includes case b. above). These are: (1) merge <-----------> \|------\|\|--------\| \|------------\|---\| \| prev \|\| middle \| -> \| target \| m \| \|------\|\|--------\| \|------------\|---\| In which case middle must be adjusted so middle->vm_start is increased as well as performing the merge. (2) (equivalent to case b. above) <-------------> \|---------\|\|------\| \|---\|-------------\| \| middle \|\| next \| -> \| m \| target \| \|---------\|\|------\| \|---\|-------------\| In which case next must be adjusted so next->vm_start is decreased as well as performing the merge. This cases have previously been performed by calculating and passing around a dubious and confusing 'adj_start' parameter along side a pointer to an 'adjust' VMA indicating which VMA requires additional adjustment (middle in case 1 and next in case 2). With the VMG structure in place we are able to avoid this by simply setting a merge flag to describe each case: (1) Sets the vmg->__adjust_middle_start flag (2) Sets the vmg->__adjust_next_start flag By doing so it turns out we can vastly simplify the logic and calculate what is required to perform the operation. Taken together the refactorings make it far easier to understand what is being done even in these more confusing cases, make the code far more maintainable, debuggable, and testable, providing more internal state indicating what is happening in the merge operation. The changes have no functional net impact on the merge operation and everything should still behave as it did before. This patch (of 5): The merge code, while much improved, still has a number of points of confusion. As part of a broader series cleaning this up to make this more maintainable, we start by addressing some confusion around vma_merge_struct fields. So far, the caller either provides no vmg->vma (a new VMA) or supplies the existing VMA which is being altered, setting vmg->start,end,pgoff to the proposed VMA dimensions. vmg->vma is then updated, as are vmg->start,end,pgoff as the merge process proceeds and the appropriate merge strategy is determined. This is rather confusing, as vmg->vma starts off as the 'middle' VMA between vmg->prev,next, but becomes the 'target' VMA, except in one specific edge case (merge next, shrink middle). Int his patch we introduce vmg->middle to describe the VMA that is between vmg->prev and vmg->next, and does NOT change during the merge operation. We replace vmg->vma with vmg->target, and use this only during the merge operation itself. Aside from the merge right, shrink middle case, this becomes the VMA that forms the basis of the VMA that is returned. This edge case can be addressed in a future commit. We also add a number of comments to explain what is going on. Finally, we adjust the ASCII diagrams showing each merge case in vma_merge_existing_range() to be clearer - the arrow range previously showed the vmg->start, end spanned area, but it is clearer to change this to show the final merged VMA. This patch has no change in functional behaviour. Link: https://lkml.kernel.org/r/cover.1738326519.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/4dfe60f1419d55e5d0516f56349695d73a57184c.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: test splitting file-backed THP to any lower order	Zi Yan	1	-5/+6
	Now split_huge_page*() supports shmem THP split to any lower order. Test it. The test now reads file content out after split to check if the split corrupts the file data. Link: https://lkml.kernel.org/r/20250122161928.1240637-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: make file-backed THP split work by writing PMD size data	Zi Yan	1	-8/+44
	Commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") changes huge=always to allocate THP/mTHP based on write size and split_huge_page_test does not write PMD size data, so file-back THP is not created during the test. Fix it by writing PMD size data. Link: https://lkml.kernel.org/r/20250122161928.1240637-1-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17	selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculation	Rafael Aquini	1	-1/+3
	We noticed that uffd-stress test was always failing to run when invoked for the hugetlb profiles on x86_64 systems with a processor count of 64 or bigger: ... # ------------------------------------ # running ./uffd-stress hugetlb 128 32 # ------------------------------------ # ERROR: invalid MiB (errno=9, @uffd-stress.c:459) ... # [FAIL] not ok 3 uffd-stress hugetlb 128 32 # exit=1 ... The problem boils down to how run_vmtests.sh (mis)calculates the size of the region it feeds to uffd-stress. The latter expects to see an amount of MiB while the former is just giving out the number of free hugepages halved down. This measurement discrepancy ends up violating uffd-stress' assertion on number of hugetlb pages allocated per CPU, causing it to bail out with the error above. This commit fixes that issue by adjusting run_vmtests.sh's half_ufd_size_MB calculation so it properly renders the region size in MiB, as expected, while maintaining all of its original constraints in place. Link: https://lkml.kernel.org/r/20250218192251.53243-1-aquini@redhat.com Fixes: 2e47a445d7b3 ("selftests/mm: run_vmtests.sh: fix hugetlb mem size calculation") Signed-off-by: Rafael Aquini <raquini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-09	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm	Linus Torvalds	3	-9/+17
	Pull KVM fixes from Paolo Bonzini: "arm64: - Fix a couple of bugs affecting pKVM's PSCI relay implementation when running in the hVHE mode, resulting in the host being entered with the MMU in an unknown state, and EL2 being in the wrong mode x86: - Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow - Ensure DEBUGCTL is context switched on AMD to avoid running the guest with the host's value, which can lead to unexpected bus lock #DBs - Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly emulate BTF. KVM's lack of context switching has meant BTF has always been broken to some extent - Always save DR masks for SNP vCPUs if DebugSwap is supported, as the guest can enable DebugSwap without KVM's knowledge - Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO memory" phase without actually generating a write-protection fault - Fix a printf() goof in the SEV smoke test that causes build failures with -Werror - Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2 isn't supported by KVM" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Explicitly zero EAX and EBX when PERFMON_V2 isn't supported by KVM KVM: selftests: Fix printf() format goof in SEV smoke test KVM: selftests: Ensure all vCPUs hit -EFAULT during initial RO stage KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3 KVM: SVM: Save host DR masks on CPUs with DebugSwap KVM: arm64: Initialize SCTLR_EL1 in __kvm_hyp_init_cpu() KVM: arm64: Initialize HCR_EL2.E2H early KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabled KVM: x86: Snapshot the host's DEBUGCTL in common x86 KVM: SVM: Suppress DEBUGCTL.BTF on AMD KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective value KVM: selftests: Assert that STI blocking isn't set after event injection KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadow
2025-03-09	Merge tag 'kvm-x86-fixes-6.14-rcN.2' of https://github.com/kvm-x86/linux ↵	Paolo Bonzini	3	-9/+17
	into HEAD KVM x86 fixes for 6.14-rcN #2 - Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow. - Ensure DEBUGCTL is context switched on AMD to avoid running the guest with the host's value, which can lead to unexpected bus lock #DBs. - Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly emulate BTF. KVM's lack of context switching has meant BTF has always been broken to some extent. - Always save DR masks for SNP vCPUs if DebugSwap is supported, as the guest can enable DebugSwap without KVM's knowledge. - Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO memory" phase without actually generating a write-protection fault. - Fix a printf() goof in the SEV smoke test that causes build failures with -Werror. - Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2 isn't supported by KVM.
2025-03-09	Merge tag 'mm-hotfixes-stable-2025-03-08-16-27' of ↵	Linus Torvalds	12	-11/+71
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "33 hotfixes. 24 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 26 are for MM and 7 are for non-MM. - "mm: memory_failure: unmap poisoned folio during migrate properly" from Ma Wupeng fixes a couple of two year old bugs involving the migration of hwpoisoned folios. - "selftests/damon: three fixes for false results" from SeongJae Park fixes three one year old bugs in the SAMON selftest code. The remainder are singletons and doubletons. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2025-03-08-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (33 commits) mm/page_alloc: fix uninitialized variable rapidio: add check for rio_add_net() in rio_scan_alloc_net() rapidio: fix an API misues when rio_add_net() fails MAINTAINERS: .mailmap: update Sumit Garg's email address Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" mm: fix finish_fault() handling for large folios mm: don't skip arch_sync_kernel_mappings() in error paths mm: shmem: remove unnecessary warning in shmem_writepage() userfaultfd: fix PTE unmapping stack-allocated PTE copies userfaultfd: do not block on locking a large folio with raised refcount mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pages mm: shmem: fix potential data corruption during shmem swapin mm: fix kernel BUG when userfaultfd_move encounters swapcache selftests/damon/damon_nr_regions: sort collected regiosn before checking with min/max boundaries selftests/damon/damon_nr_regions: set ops update for merge results check to 100ms selftests/damon/damos_quota: make real expectation of quota exceeds include/linux/log2.h: mark is_power_of_2() with __always_inline NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback mm, swap: avoid BUG_ON in relocate_cluster() mm: swap: use correct step in loop to wait all clusters in wait_for_allocation() ...
2025-03-08	Merge tag 's390-6.14-6' of ↵	Linus Torvalds	1	-5/+5
	git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix return address recovery of traced function in ftrace to ensure reliable stack unwinding - Fix compiler warnings and runtime crashes of vDSO selftests on s390 by introducing a dedicated GNU hash bucket pointer with correct 32-bit entry size - Fix test_monitor_call() inline asm, which misses CC clobber, by switching to an instruction that doesn't modify CC * tag 's390-6.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/ftrace: Fix return address recovery of traced function selftests/vDSO: Fix GNU hash table entry size for s390x s390/traps: Fix test_monitor_call() inline assembly
2025-03-06	selftests/damon/damon_nr_regions: sort collected regiosn before checking ↵	SeongJae Park	1	-0/+1
	with min/max boundaries damon_nr_regions.py starts DAMON, periodically collect number of regions in snapshots, and see if it is in the requested range. The check code assumes the numbers are sorted on the collection list, but there is no such guarantee. Hence this can result in false positive test success. Sort the list before doing the check. Link: https://lkml.kernel.org/r/20250225222333.505646-4-sj@kernel.org Fixes: 781497347d1b ("selftests/damon: implement test for min/max_nr_regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-06	selftests/damon/damon_nr_regions: set ops update for merge results check to ↵	SeongJae Park	1	-0/+1
	100ms damon_nr_regions.py updates max_nr_regions to a number smaller than expected number of real regions and confirms DAMON respect the harsh limit. To give time for DAMON to make changes for the regions, 3 aggregation intervals (300 milliseconds) are given. The internal mechanism works with not only the max_nr_regions, but also sz_limit, though. It avoids merging region if that casn make region of size larger than sz_limit. In the test, sz_limit is set too small to achive the new max_nr_regions, unless it is updated for the new min_nr_regions. But the update is done only once per operations set update interval, which is one second by default. Hence, the test randomly incurs false positive failures. Fix it by setting the ops interval same to aggregation interval, to make sure sz_limit is updated by the time of the check. Link: https://lkml.kernel.org/r/20250225222333.505646-3-sj@kernel.org Fixes: 8bf890c81612 ("selftests/damon/damon_nr_regions: test online-tuned max_nr_regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-06	selftests/damon/damos_quota: make real expectation of quota exceeds	SeongJae Park	1	-3/+6
	Patch series "selftests/damon: three fixes for false results". Fix three DAMON selftest bugs that cause two and one false positive failures and successes. This patch (of 3): damos_quota.py assumes the quota will always exceeded. But whether quota will be exceeded or not depend on the monitoring results. Actually the monitored workload has chaning access pattern and hence sometimes the quota may not really be exceeded. As a result, false positive test failures happen. Expect how much time the quota will be exceeded by checking the monitoring results, and use it instead of the naive assumption. Link: https://lkml.kernel.org/r/20250225222333.505646-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250225222333.505646-2-sj@kernel.org Fixes: 51f58c9da14b ("selftests/damon: add a test for DAMOS quota") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-06	selftests/damon/damos_quota_goal: handle minimum quota that cannot be ↵	SeongJae Park	1	-0/+3
	further reduced damos_quota_goal.py selftest see if DAMOS quota goals tuning feature increases or reduces the effective size quota for given score as expected. The tuning feature sets the minimum quota size as one byte, so if the effective size quota is already one, we cannot expect it further be reduced. However the test is not aware of the edge case, and fails since it shown no expected change of the effective quota. Handle the case by updating the failure logic for no change to see if it was the case, and simply skips to next test input. Link: https://lkml.kernel.org/r/20250217182304.45215-1-sj@kernel.org Fixes: f1c07c0a1662 ("selftests/damon: add a test for DAMOS quota goal") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202502171423.b28a918d-lkp@intel.com Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> [6.10.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-06	Revert "selftests/mm: remove local __NR_* definitions"	John Hubbard	9	-8/+60
	This reverts commit a5c6bc590094a1a73cf6fa3f505e1945d2bf2461. The general approach described in commit e076eaca5906 ("selftests: break the dependency upon local header files") was taken one step too far here: it should not have been extended to include the syscall numbers. This is because doing so would require per-arch support in tools/include/uapi, and no such support exists. This revert fixes two separate reports of test failures, from Dave Hansen[1], and Li Wang[2]. An excerpt of Dave's report: Before this commit (a5c6bc590094a1a73cf6fa3f505e1945d2bf2461) things are fine. But after, I get: running PKEY tests for unsupported CPU/OS An excerpt of Li's report: I just found that mlock2_() return a wrong value in mlock2-test [1] https://lore.kernel.org/dc585017-6740-4cab-a536-b12b37a7582d@intel.com [2] https://lore.kernel.org/CAEemH2eW=UMu9+turT2jRie7+6ewUazXmA6kL+VBo3cGDGU6RA@mail.gmail.com Link: https://lkml.kernel.org/r/20250214033850.235171-1-jhubbard@nvidia.com Fixes: a5c6bc590094 ("selftests/mm: remove local __NR_* definitions") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Li Wang <liwang@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rich Felker <dalias@libc.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-04	selftests/vDSO: Fix GNU hash table entry size for s390x	Thomas Weißschuh	1	-5/+5
	Commit 14be4e6f3522 ("selftests: vDSO: fix ELF hash table entry size for s390x") changed the type of the ELF hash table entries to 64bit on s390x. However the GNU hash tables entries are always 32bit. The "bucket" pointer is shared between both hash algorithms. On s390, this caused the GNU hash algorithm to access its 32-bit entries as if they were 64-bit, triggering compiler warnings (assignment between "Elf64_Xword " and "Elf64_Word ") and runtime crashes. Introduce a new dedicated "gnu_bucket" pointer which is used by the GNU hash. Fixes: e0746bde6f82 ("selftests/vDSO: support DT_GNU_HASH") Reviewed-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Shuah Khan <skhan@linuxfoundation.org> Link: https://lore.kernel.org/r/20250217-selftests-vdso-s390-gnu-hash-v2-1-f6c2532ffe2a@linutronix.de Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-03	KVM: selftests: Fix printf() format goof in SEV smoke test	Sean Christopherson	1	-1/+2
	Print out the index of mismatching XSAVE bytes using unsigned decimal format. Some versions of clang complain about trying to print an integer as an unsigned char. x86/sev_smoke_test.c:55:51: error: format specifies type 'unsigned char' but the argument has type 'int' [-Werror,-Wformat] Fixes: 8c53183dbaa2 ("selftests: kvm: add test for transferring FPU state into VMSA") Link: https://lore.kernel.org/r/20250228233852.3855676-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03	KVM: selftests: Ensure all vCPUs hit -EFAULT during initial RO stage	Sean Christopherson	1	-8/+13
	During the initial mprotect(RO) stage of mmu_stress_test, keep vCPUs spinning until all vCPUs have hit -EFAULT, i.e. until all vCPUs have tried to write to a read-only page. If a vCPU manages to complete an entire iteration of the loop without hitting a read-only page, and the vCPU observes mprotect_ro_done before starting a second iteration, then the vCPU will prematurely fall through to GUEST_SYNC(3) (on x86 and arm64) and get out of sequence. Replace the "do-while (!r)" loop around the associated _vcpu_run() with a single invocation, as barring a KVM bug, the vCPU is guaranteed to hit -EFAULT, and retrying on success is super confusion, hides KVM bugs, and complicates this fix. The do-while loop was semi-unintentionally added specifically to fudge around a KVM x86 bug, and said bug is unhittable without modifying the test to force x86 down the !(x86\|\|arm64) path. On x86, if forced emulation is enabled, vcpu_arch_put_guest() may trigger emulation of the store to memory. Due a (very, very) longstanding bug in KVM x86's emulator, emulate writes to guest memory that fail during __kvm_write_guest_page() unconditionally return KVM_EXIT_MMIO. While that is desirable in the !memslot case, it's wrong in this case as the failure happens due to __copy_to_user() hitting a read-only page, not an emulated MMIO region. But as above, x86 only uses vcpu_arch_put_guest() if the __x86_64__ guards are clobbered to force x86 down the common path, and of course the unexpected MMIO is a KVM bug, i.e. should cause a test failure. Fixes: b6c304aec648 ("KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)") Reported-by: Yan Zhao <yan.y.zhao@intel.com> Closes: https://lore.kernel.org/all/20250208105318.16861-1-yan.y.zhao@intel.com Debugged-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20250228230804.3845860-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-01	Merge tag 'objtool-urgent-2025-02-28' of ↵	Linus Torvalds	3	-5/+6
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Ingo Molnar: "Fix an objtool false positive, and objtool related build warnings that happens on PIE-enabled architectures such as LoongArch" * tag 'objtool-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Add bch2_trans_unlocked_or_in_restart_error() to bcachefs noreturns objtool: Fix C jump table annotations for Clang vmlinux.lds: Ensure that const vars with relocations are mapped R/O
2025-03-01	Merge tag 'trace-v6.14-rc4' of ↵	Linus Torvalds	1	-7/+11
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix crash from bad histogram entry An error path in the histogram creation could leave an entry in a link list that gets freed. Then when a new entry is added it can cause a u-a-f bug. This is fixed by restructuring the code so that the histogram is consistent on failure and everything is cleaned up appropriately. - Fix fprobe self test The fprobe self test relies on no function being attached by ftrace. BPF programs can attach to functions via ftrace and systemd now does so. This causes those functions to appear in the enabled_functions list which holds all functions attached by ftrace. The selftest also uses that file to see if functions are being connected correctly. It counts the functions in the file, but if there's already functions in the file, it fails. Instead, add the number of functions in the file at the start of the test to all the calculations during the test. - Fix potential division by zero of the function profiler stddev The calculated divisor that calculates the standard deviation of the function times can overflow. If the overflow happens to land on zero, that can cause a division by zero. Check for zero from the calculation before doing the division. TODO: Catch when it ever overflows and report it accordingly. For now, just prevent the system from crashing. * tag 'trace-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Avoid potential division by zero in function_stat_show() selftests/ftrace: Let fprobe test consider already enabled functions tracing: Fix bad hist from corrupting named_triggers list
2025-02-28	KVM: selftests: Assert that STI blocking isn't set after event injection	Sean Christopherson	1	-0/+2
	Add an L1 (guest) assert to the nested exceptions test to verify that KVM doesn't put VMRUN in an STI shadow (AMD CPUs bleed the shadow into the guest's int_state if a #VMEXIT occurs before VMRUN fully completes). Add a similar assert to the VMX side as well, because why not. Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250224165442.2338294-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>