path: root/mm
Age | Commit message | Author | Files | Lines
3 days | ksm: use range-walk function to jump over holes in scan_get_next_rmap_item | Pedro Demarchi Gomes | 1 | -13/+113
commit f5548c318d6520d4fa3c5ed6003eeb710763cbc5 upstream. Currently, scan_get_next_rmap_item() walks every page address in a VMA to locate mergeable pages. This becomes highly inefficient when scanning large virtual memory areas that contain mostly unmapped regions, causing ksmd to use a large amount of CPU without deduplicating many pages. This patch replaces the per-address lookup with a range walk using walk_page_range(). The range walker allows KSM to skip over entire unmapped holes in a VMA, avoiding unnecessary lookups. This problem was previously discussed in [1]. Consider the following test program, which creates a 32 TiB mapping in the virtual address space but only populates a single page:

#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>

/* 32 TiB */
const size_t size = 32ul * 1024 * 1024 * 1024 * 1024;

int main() {
	char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0);
	if (area == MAP_FAILED) {
		perror("mmap() failed\n");
		return -1;
	}

	/* Populate a single page such that we get an anon_vma. */
	*area = 0;

	/* Enable KSM. */
	madvise(area, size, MADV_MERGEABLE);
	pause();
	return 0;
}

$ ./ksm-sparse &
$ echo 1 > /sys/kernel/mm/ksm/run

Without this patch ksmd uses 100% of the CPU for a long time (more than 1 hour on my test machine) scanning all of the 32 TiB virtual address space that contains only one mapped page. This leaves ksmd essentially deadlocked, not able to deduplicate anything of value. With this patch ksmd walks only the one mapped page and skips the rest of the 32 TiB virtual address space, making the scan fast and using little CPU. Link: https://lkml.kernel.org/r/20251023035841.41406-1-pedrodemargomes@gmail.com Link: https://lkml.kernel.org/r/20251022153059.22763-1-pedrodemargomes@gmail.com Link: https://lore.kernel.org/linux-mm/423de7a3-1c62-4e72-8e79-19a6413e420c@redhat.com/ [1] Fixes: 31dbd01f3143 ("ksm: Kernel SamePage Merging") Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: craftfever <craftfever@airmail.cc> Closes: https://lkml.kernel.org/r/020cf8de6e773bb78ba7614ef250129f11a63781@murena.io Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ change page to folios ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
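As an aside, a minimal sketch of what a walk_page_range() user looks like (kernel-style C for illustration only; count_ctx, count_pte_entry and count_present_ptes are hypothetical names, not code from this patch). Page-table levels that are entirely empty are skipped by the walker (or reported via an optional .pte_hole callback), which is what lets KSM avoid probing the unmapped bulk of a sparse VMA address by address:

/* Hypothetical pagewalk user, for illustration only. */
#include <linux/mm.h>
#include <linux/pagewalk.h>

struct count_ctx {
	unsigned long present;	/* number of present PTEs seen */
};

static int count_pte_entry(pte_t *pte, unsigned long addr,
			   unsigned long next, struct mm_walk *walk)
{
	struct count_ctx *ctx = walk->private;

	if (pte_present(ptep_get(pte)))
		ctx->present++;
	return 0;
}

static const struct mm_walk_ops count_ops = {
	.pte_entry = count_pte_entry,
};

/* Caller must hold mmap_read_lock(vma->vm_mm). */
static unsigned long count_present_ptes(struct vm_area_struct *vma)
{
	struct count_ctx ctx = { 0 };

	walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
			&count_ops, &ctx);
	return ctx.present;
}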
9 days | mm/damon/tests/core-kunit: handle alloc failures in damon_test_update_monitoring_result() | SeongJae Park | 1 | -0/+3
commit 8cf298c01b7fdb08eef5b6b26d0fe98d48134d72 upstream. damon_test_update_monitoring_result() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-12-sj@kernel.org Fixes: f4c978b6594b ("mm/damon/core-test: add a test for damon_update_monitoring_results()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [6.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
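The same pattern repeats across the DAMON kunit fixes in this batch. As a rough, hypothetical sketch (illustrative only; the test name, skip messages and exact cleanup are made up, not the upstream hunks), such a fix checks every allocation, frees what was already allocated, and skips the rest of the test case instead of dereferencing NULL:

#include <kunit/test.h>
#include <linux/damon.h>

/* Hypothetical example test, not an actual upstream test case. */
static void damon_test_example(struct kunit *test)
{
	struct damon_ctx *ctx = damon_new_ctx();
	struct damon_target *t;

	if (!ctx) {
		kunit_skip(test, "ctx alloc fail");
		return;
	}
	t = damon_new_target();
	if (!t) {
		damon_destroy_ctx(ctx);
		kunit_skip(test, "target alloc fail");
		return;
	}
	/* ... the test's real assertions would go here ... */
	damon_free_target(t);
	damon_destroy_ctx(ctx);
}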
9 days | mm/damon/tests/core-kunit: handle alloc failure on damon_test_set_attrs() | SeongJae Park | 1 | -0/+3
commit 915a2453d824a9b6bf724e3f970d86ae1d092a61 upstream. damon_test_set_attrs() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-13-sj@kernel.org Fixes: aa13779be6b7 ("mm/damon/core-test: add a test for damon_set_attrs()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle alloc failures in damon_test_ops_registration() | SeongJae Park | 1 | -0/+3
commit 4f835f4e8c863985f15abd69db033c2f66546094 upstream. damon_test_ops_registration() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-10-sj@kernel.org Fixes: 4f540f5ab4f2 ("mm/damon/core-test: add a kunit test case for ops registration") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle alloc failures in damon_test_set_regions() | SeongJae Park | 1 | -2/+15
commit 74d5969995d129fd59dd93b9c7daa6669cb6810f upstream. damon_test_set_regions() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-11-sj@kernel.org Fixes: 62f409560eb2 ("mm/damon/core-test: test damon_set_regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle alloc failures on damos_test_filter_out() | SeongJae Park | 1 | -0/+11
commit d14d5671e7c9cc788c5a1edfa94e6f9064275905 upstream. damon_test_filter_out() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-16-sj@kernel.org Fixes: 26713c890875 ("mm/damon/core-test: add a unit test for __damos_filter_out()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle memory alloc failure from damon_test_aggregate() | SeongJae Park | 1 | -0/+11
commit f79f2fc44ebd0ed655239046be3e80e8804b5545 upstream. damon_test_aggregate() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-5-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle alloc failures on damon_test_split_regions_of() | SeongJae Park | 1 | -0/+20
commit eded254cb69044bd4abde87394ea44909708d7c0 upstream. damon_test_split_regions_of() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-9-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle memory failure from damon_test_target() | SeongJae Park | 1 | -0/+7
commit fafe953de2c661907c94055a2497c6b8dbfd26f3 upstream. damon_test_target() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-4-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle alloc failures on damon_test_merge_two() | SeongJae Park | 1 | -0/+10
commit 3d443dd29a1db7efa587a4bb0c06a497e13ca9e4 upstream. damon_test_merge_two() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-7-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle alloc failures on damon_test_merge_regions_of() | SeongJae Park | 1 | -0/+6
commit 0998d2757218771c59d5ca59ccf13d1542a38f17 upstream. damon_test_merge_regions_of() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-8-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle alloc failures on damon_test_split_at() | SeongJae Park | 1 | -0/+11
commit 5e80d73f22043c59c8ad36452a3253937ed77955 upstream. damon_test_split_at() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-6-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle allocation failures in damon_test_regions() | SeongJae Park | 1 | -0/+6
commit e16fdd4f754048d6e23c56bd8d920b71e41e3777 upstream. damon_test_regions() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-3-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/core-kunit: handle alloc failures in damon_test_new_filter() | SeongJae Park | 1 | -0/+2
commit 28ab2265e9422ccd81e4beafc0ace90f78de04c4 upstream. damon_test_new_filter() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-14-sj@kernel.org Fixes: 2a158e956b98 ("mm/damon/core-test: add a test for damos_new_filter()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/vaddr-kunit: handle alloc failures on damon_test_split_evenly_succ() | SeongJae Park | 1 | -1/+8
commit 0a63a0e7570b9b2631dfb8d836dc572709dce39e upstream. damon_test_split_evenly_succ() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-20-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/vaddr-kunit: handle alloc failures on damon_do_test_apply_three_regions() | SeongJae Park | 1 | -0/+6
commit 2b22d0fcc6320ba29b2122434c1d2f0785fb0a25 upstream. damon_do_test_apply_three_regions() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-18-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/damon/tests/vaddr-kunit: handle alloc failures in damon_test_split_evenly_fail() | SeongJae Park | 1 | -1/+10
commit 7890e5b5bb6e386155c6e755fe70e0cdcc77f18e upstream. damon_test_split_evenly_fail() assumes all dynamic memory allocation in it will succeed. That is indeed likely in real use cases, since those allocations are too small to fail, but theoretically they could fail. In that case, an invalid memory access can happen. Fix it by appropriately cleaning up the pre-allocated memory and skipping the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-19-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/balloon_compaction: convert balloon_page_delete() to balloon_page_finalize() | David Hildenbrand | 1 | -1/+2
[ Upstream commit 15504b1163007bbfbd9a63460d5c14737c16e96d ] Let's move the removal of the page from the balloon list into the single caller, to remove the dependency on the PG_isolated flag and clarify locking requirements. Note that for now, balloon_page_delete() was used on two paths: (1) Removing a page from the balloon for deflation through balloon_page_list_dequeue() (2) Removing an isolated page from the balloon for migration in the per-driver migration handlers. Isolated pages were already removed from the balloon list during isolation. So instead of relying on the flag, we can just distinguish both cases directly and handle it accordingly in the caller. We'll shuffle the operations a bit such that they logically make more sense (e.g., remove from the list before clearing flags). In balloon migration functions we can now move the balloon_page_finalize() out of the balloon lock and perform the finalization just before dropping the balloon reference. Document that the page lock is currently required when modifying the movability aspects of a page; hopefully we can soon decouple this from the page lock. Link: https://lkml.kernel.org/r/20250704102524.326966-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 0da2ba35c0d5 ("powerpc/pseries/cmm: adjust BALLOON_MIGRATE when migrating pages") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/balloon_compaction: we cannot have isolated pages in the balloon list | David Hildenbrand | 1 | -6/+0
[ Upstream commit fb05f992b6bbb4702307d96f00703ee637b24dbf ] Patch series "mm/migration: rework movable_ops page migration (part 1)", v2. In the future, as we decouple "struct page" from "struct folio", pages that support "non-lru page migration" -- movable_ops page migration such as memory balloons and zsmalloc -- will no longer be folios. They will not have ->mapping, ->lru, and likely no refcount and no page lock. But they will have a type and flags 🙂 This is the first part (other parts not written yet) of decoupling movable_ops page migration from folio migration. In this series, we get rid of the ->mapping usage, and start cleaning up the code + separating it from folio migration. Migration core will have to be further reworked to not treat movable_ops pages like folios. This is the first step into that direction. This patch (of 29): The core will set PG_isolated only after mops->isolate_page() was called. In case of the balloon, that is where we will remove it from the balloon list. So we cannot have isolated pages in the balloon list. Let's drop this unnecessary check. Link: https://lkml.kernel.org/r/20250704102524.326966-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 0da2ba35c0d5 ("powerpc/pseries/cmm: adjust BALLOON_MIGRATE when migrating pages") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm: fix arithmetic for max_prop_frac when setting max_ratio | Jingbo Xu | 1 | -1/+2
commit fa151a39a6879144b587f35c0dfcc15e1be9450f upstream. Since now bdi->max_ratio is part per million, fix the wrong arithmetic for max_prop_frac when setting max_ratio. Otherwise the miscalculated max_prop_frac will affect the incrementing of writeout completion count when max_ratio is not 100%. Link: https://lkml.kernel.org/r/20231219142508.86265-3-jefflexu@linux.alibaba.com Fixes: efc3e6ad53ea ("mm: split off __bdi_set_max_ratio() function") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm: fix arithmetic for bdi min_ratio | Jingbo Xu | 1 | -1/+0
commit e0646b7590084a5bf3b056d3ad871d9379d2c25a upstream. Since now bdi->min_ratio is part per million, fix the wrong arithmetic. Otherwise it will fail with -EINVAL when setting a reasonable min_ratio, as it tries to set min_ratio to (min_ratio * BDI_RATIO_SCALE) in percentage unit, which exceeds 100% anyway.

# cat /sys/class/bdi/253\:0/min_ratio
0
# cat /sys/class/bdi/253\:0/max_ratio
100
# echo 1 > /sys/class/bdi/253\:0/min_ratio
-bash: echo: write error: Invalid argument

Link: https://lkml.kernel.org/r/20231219142508.86265-2-jefflexu@linux.alibaba.com Fixes: 8021fb3232f2 ("mm: split off __bdi_set_min_ratio() function") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 days | mm/ksm: fix exec/fork inheritance support for prctl | xu xin | 1 | -2/+16
[ Upstream commit 590c03ca6a3fbb114396673314e2aa483839608b ] Patch series "ksm: fix exec/fork inheritance", v2. This series fixes exec/fork inheritance. See the detailed description of the issue below. This patch (of 2):

Background
==========
commit d7597f59d1d33 ("mm: add new api to enable ksm per process") introduced MMF_VM_MERGE_ANY for mm->flags, and allowed users to set it by prctl() so that the process's VMAs are forcibly scanned by ksmd. Subsequently, commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") supported inheriting the MMF_VM_MERGE_ANY flag when a task calls execve(). Finally, commit 3a9e567ca45fb ("mm/ksm: fix ksm exec support for prctl") fixed the issue that ksmd doesn't scan an mm_struct with MMF_VM_MERGE_ANY by adding the mm_slot to ksm_mm_head in __bprm_mm_init().

Problem
=======
In some extreme scenarios, however, this inheritance of MMF_VM_MERGE_ANY during exec/fork can fail. For example, when the scanning frequency of ksmd is tuned extremely high, a process carrying MMF_VM_MERGE_ANY may still fail to pass it to the newly exec'd process. This happens because ksm_execve() is executed too early in the do_execve flow (prematurely adding the new mm_struct to the ksm_mm_slot list). As a result, before do_execve completes, ksmd may have already performed a scan and found that this new mm_struct has no VM_MERGEABLE VMAs, thus clearing its MMF_VM_MERGE_ANY flag. Consequently, when the new program executes, the MMF_VM_MERGE_ANY flag has not been inherited.

Root reason
===========
commit d7597f59d1d33 ("mm: add new api to enable ksm per process") clears the MMF_VM_MERGE_ANY flag when ksmd finds no VM_MERGEABLE VMAs.

Solution
========
Firstly, don't clear MMF_VM_MERGE_ANY when ksmd finds no VM_MERGEABLE VMAs, because perhaps the mm_struct has just been added to the ksm_mm_slot list and its process has not yet officially started running, or has not yet performed mmap/brk to allocate anonymous VMAs. Secondly, recheck MMF_VM_MERGEABLE again if a process carries MMF_VM_MERGE_ANY, and create a mm_slot and join it into ksm_scan_list again.

Link: https://lkml.kernel.org/r/20251007182504440BJgK8VXRHh8TD7IGSUIY4@zte.com.cn Link: https://lkml.kernel.org/r/20251007182821572h_SoFqYZXEP1mvWI4n9VL@zte.com.cn Fixes: 3c6f33b7273a ("mm/ksm: support fork/exec for prctl") Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") Signed-off-by: xu xin <xu.xin16@zte.com.cn> Cc: Stefan Roesch <shr@devkernel.io> Cc: David Hildenbrand <david@redhat.com> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ changed mm_flags_test() and mm_flags_clear() calls to test_bit() and clear_bit() ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
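From userspace, the inheritance that this patch repairs can be observed roughly as follows (an illustrative, self-contained sketch; the fallback #define values mirror include/uapi/linux/prctl.h):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#define PR_GET_MEMORY_MERGE 68
#endif

int main(int argc, char **argv)
{
	if (argc > 1) {	/* re-exec'd child: check whether the flag survived exec */
		printf("PR_GET_MEMORY_MERGE after exec: %d\n",
		       (int)prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
		return 0;
	}
	/* Enable process-wide KSM, then exec ourselves. */
	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
		perror("PR_SET_MEMORY_MERGE");
	execl("/proc/self/exe", argv[0], "child", (char *)NULL);
	perror("execl");
	return 1;
}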
9 days | kasan: refactor pcpu kasan vmalloc unpoison | Maciej Wieczor-Retman | 2 | -3/+18
commit 6f13db031e27e88213381039032a9cc061578ea6 upstream. A KASAN tag mismatch, possibly causing a kernel panic, can be observed on systems with a tag-based KASAN enabled and with multiple NUMA nodes. It was reported on arm64 and reproduced on x86. It can be explained in the following points: 1. There can be more than one virtual memory chunk. 2. Chunk's base address has a tag. 3. The base address points at the first chunk and thus inherits the tag of the first chunk. 4. The subsequent chunks will be accessed with the tag from the first chunk. 5. Thus, the subsequent chunks need to have their tag set to match that of the first chunk. Refactor code by reusing __kasan_unpoison_vmalloc in a new helper in preparation for the actual fix. Link: https://lkml.kernel.org/r/eb61d93b907e262eefcaa130261a08bcb6c5ce51.1764874575.git.m.wieczorretman@pm.me Fixes: 1d96320f8d53 ("kasan, vmalloc: add vmalloc tagging for SW_TAGS") Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-01 | mm/mempool: fix poisoning order>0 pages with HIGHMEM | Vlastimil Babka | 1 | -6/+26
[ Upstream commit ec33b59542d96830e3c89845ff833cf7b25ef172 ] The kernel test has reported:

BUG: unable to handle page fault for address: fffba000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pde = 03171067 *pte = 00000000
Oops: Oops: 0002 [#1]
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G T 6.18.0-rc2-00031-gec7f31b2a2d3 #1 NONE a1d066dfe789f54bc7645c7989957d2bdee593ca
Tainted: [T]=RANDSTRUCT
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
EIP: memset (arch/x86/include/asm/string_32.h:168 arch/x86/lib/memcpy_32.c:17)
Code: a5 8b 4d f4 83 e1 03 74 02 f3 a4 83 c4 04 5e 5f 5d 2e e9 73 41 01 00 90 90 90 3e 8d 74 26 00 55 89 e5 57 56 89 c6 89 d0 89 f7 <f3> aa 89 f0 5e 5f 5d 2e e9 53 41 01 00 cc cc cc 55 89 e5 53 57 56
EAX: 0000006b EBX: 00000015 ECX: 001fefff EDX: 0000006b ESI: fffb9000 EDI: fffba000 EBP: c611fbf0 ESP: c611fbe8
DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010287
CR0: 80050033 CR2: fffba000 CR3: 0316e000 CR4: 00040690
Call Trace:
 poison_element (mm/mempool.c:83 mm/mempool.c:102)
 mempool_init_node (mm/mempool.c:142 mm/mempool.c:226)
 mempool_init_noprof (mm/mempool.c:250 (discriminator 1))
 ? mempool_alloc_pages (mm/mempool.c:640)
 bio_integrity_initfn (block/bio-integrity.c:483 (discriminator 8))
 ? mempool_alloc_pages (mm/mempool.c:640)
 do_one_initcall (init/main.c:1283)

Christoph found out this is due to the poisoning code not dealing properly with CONFIG_HIGHMEM because only the first page is mapped but then the whole potentially high-order page is accessed. We could give up on HIGHMEM here, but it's straightforward to fix this with a loop that's mapping, poisoning or checking and unmapping individual pages. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202511111411.9ebfa1ba-lkp@intel.com Analyzed-by: Christoph Hellwig <hch@lst.de> Fixes: bdfedb76f4f5 ("mm, mempool: poison elements backed by slab allocator") Cc: stable@vger.kernel.org Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113-mempool-poison-v1-1-233b3ef984c3@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
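A hedged sketch of the per-page loop the fix describes (kernel-style C, illustrative only; poison_element_pages() is a hypothetical helper, not the exact upstream hunk). With CONFIG_HIGHMEM, kmap_local_page() maps exactly one page, so an order>0 mempool element has to be mapped, poisoned and unmapped page by page:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/poison.h>
#include <linux/string.h>

/* Hypothetical helper, for illustration only. */
static void poison_element_pages(struct page *page, unsigned int order)
{
	unsigned int i;

	for (i = 0; i < (1U << order); i++) {
		/* Map one page at a time; valid even for highmem pages. */
		void *addr = kmap_local_page(page + i);

		memset(addr, POISON_FREE, PAGE_SIZE);
		kunmap_local(addr);
	}
}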
2025-12-01 | mm/mempool: replace kmap_atomic() with kmap_local_page() | Fabio M. De Francesco | 1 | -4/+4
[ Upstream commit f2bcc99a5e901a13b754648d1dbab60f4adf9375 ] kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. The code blocks between the mappings and un-mappings don't rely on the above-mentioned side effects of kmap_atomic(), so that mere replacements of the old API with the new one is all that they require (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120142640.7077-1-fabio.maria.de.francesco@linux.intel.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: ec33b59542d9 ("mm/mempool: fix poisoning order>0 pages with HIGHMEM") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-01 | shmem: fix tmpfs reconfiguration (remount) when noswap is set | Mike Yuan | 1 | -8/+7
commit 3cd1548a278c7d6a9bdef1f1866e7cf66bfd3518 upstream. In systemd we're trying to switch the internal credentials setup logic to new mount API [1], and I noticed fsconfig(FSCONFIG_CMD_RECONFIGURE) consistently fails on tmpfs with noswap option. This can be trivially reproduced with the following:

```
int fs_fd = fsopen("tmpfs", 0);
fsconfig(fs_fd, FSCONFIG_SET_FLAG, "noswap", NULL, 0);
fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fsmount(fs_fd, 0, 0);
fsconfig(fs_fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0); <------ EINVAL
```

After some digging the culprit is shmem_reconfigure() rejecting !(ctx->seen & SHMEM_SEEN_NOSWAP) && sbinfo->noswap, which is bogus as ctx->seen serves as a mask for whether certain options are touched at all. On top of that, noswap option doesn't use fsparam_flag_no, hence it's not really possible to "reenable" swap to begin with. Drop the check and redundant SHMEM_SEEN_NOSWAP flag. [1] https://github.com/systemd/systemd/pull/39637 Fixes: 2c6efe9cf2d7 ("shmem: add support to ignore swap") Signed-off-by: Mike Yuan <me@yhndnzj.com> Link: https://patch.msgid.link/20251108190930.440685-1-me@yhndnzj.com Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
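For completeness, a self-contained version of the reproducer above (illustrative; it assumes kernel headers and a libc that define SYS_fsopen, SYS_fsconfig, SYS_fsmount and the FSCONFIG_* constants, and it needs CAP_SYS_ADMIN or a user namespace):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>

static int sys_fsopen(const char *fs, unsigned int flags)
{ return syscall(SYS_fsopen, fs, flags); }
static int sys_fsconfig(int fd, unsigned int cmd, const char *key,
			const void *val, int aux)
{ return syscall(SYS_fsconfig, fd, cmd, key, val, aux); }
static int sys_fsmount(int fd, unsigned int flags, unsigned int attr)
{ return syscall(SYS_fsmount, fd, flags, attr); }

int main(void)
{
	int fs_fd = sys_fsopen("tmpfs", 0);

	if (fs_fd < 0) {
		perror("fsopen");
		return 1;
	}
	sys_fsconfig(fs_fd, FSCONFIG_SET_FLAG, "noswap", NULL, 0);
	sys_fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	sys_fsmount(fs_fd, 0, 0);

	/* Before the fix, this reconfigure fails with EINVAL. */
	if (sys_fsconfig(fs_fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0) < 0)
		perror("FSCONFIG_CMD_RECONFIGURE");
	return 0;
}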
2025-11-24 | memcg: fix data-race KCSAN bug in rstats | Breno Leitao | 1 | -5/+7
commit 78ec6f9df6642418411c534683da6133e0962ec7 upstream. A data-race issue in memcg rstat occurs when two distinct code paths access the same 4-byte region concurrently. KCSAN detection triggers the following BUG as a result.

BUG: KCSAN: data-race in __count_memcg_events / mem_cgroup_css_rstat_flush

write to 0xffffe8ffff98e300 of 4 bytes by task 5274 on cpu 17:
 mem_cgroup_css_rstat_flush (mm/memcontrol.c:5850)
 cgroup_rstat_flush_locked (kernel/cgroup/rstat.c:243 (discriminator 7))
 cgroup_rstat_flush (./include/linux/spinlock.h:401 kernel/cgroup/rstat.c:278)
 mem_cgroup_flush_stats.part.0 (mm/memcontrol.c:767)
 memory_numa_stat_show (mm/memcontrol.c:6911)
<snip>

read to 0xffffe8ffff98e300 of 4 bytes by task 410848 on cpu 27:
 __count_memcg_events (mm/memcontrol.c:725 mm/memcontrol.c:962)
 count_memcg_event_mm.part.0 (./include/linux/memcontrol.h:1097 ./include/linux/memcontrol.h:1120)
 handle_mm_fault (mm/memory.c:5483 mm/memory.c:5622)
<snip>

value changed: 0x00000029 -> 0x00000000

The race occurs because two code paths access the same "stats_updates" location. Although "stats_updates" is a per-CPU variable, it is remotely accessed by another CPU at cgroup_rstat_flush_locked()->mem_cgroup_css_rstat_flush(), leading to the data race mentioned. Considering that memcg_rstat_updated() is in the hot code path, adding a lock to protect it may not be desirable, especially since this variable pertains solely to statistics. Therefore, annotating accesses to stats_updates with READ/WRITE_ONCE() can prevent KCSAN splats and potential partial reads/writes. Link: https://lkml.kernel.org/r/20240424125940.2410718-1-leitao@debian.org Fixes: 9cee7e8ef3e3 ("mm: memcg: optimize parent iteration in memcg_rstat_updated()") Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
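As a userspace analogue of the annotation (illustrative C11 sketch, not kernel code): relaxed atomics play the role of READ_ONCE()/WRITE_ONCE() here, making the concurrent access well-defined so the reader can never observe a partial update.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for the per-CPU stats_updates field; only the updater thread
 * writes it, the flusher thread merely reads it from "another CPU". */
static _Atomic unsigned int stats_updates;

static void *updater(void *arg)
{
	for (int i = 0; i < 1000; i++) {
		unsigned int v = atomic_load_explicit(&stats_updates,
						      memory_order_relaxed); /* ~READ_ONCE() */
		atomic_store_explicit(&stats_updates, v + 1,
				      memory_order_relaxed);		    /* ~WRITE_ONCE() */
	}
	return NULL;
}

static void *flusher(void *arg)
{
	unsigned int v = atomic_load_explicit(&stats_updates,
					      memory_order_relaxed);	    /* ~READ_ONCE() */
	printf("pending updates seen by the flusher: %u\n", v);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, updater, NULL);
	pthread_create(&b, NULL, flusher, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}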
2025-11-24 | mm: memcg: optimize parent iteration in memcg_rstat_updated() | Yosry Ahmed | 1 | -21/+35
commit 9cee7e8ef3e31ca25b40ca52b8585dc6935deff2 upstream. In memcg_rstat_updated(), we iterate the memcg being updated and its parents to update memcg->vmstats_percpu->stats_updates in the fast path (i.e. no atomic updates). According to my math, this is 3 memory loads (and potentially 3 cache misses) per memcg: - Load the address of memcg->vmstats_percpu. - Load vmstats_percpu->stats_updates (based on some percpu calculation). - Load the address of the parent memcg. Avoid most of the cache misses by caching a pointer from each struct memcg_vmstats_percpu to its parent on the corresponding CPU. In this case, for the first memcg we have 2 memory loads (same as above): - Load the address of memcg->vmstats_percpu. - Load vmstats_percpu->stats_updates (based on some percpu calculation). Then for each additional memcg, we need a single load to get the parent's stats_updates directly. This reduces the number of loads from O(3N) to O(2+N) -- where N is the number of memcgs we need to iterate. Additionally, stash a pointer to memcg->vmstats in each struct memcg_vmstats_percpu such that we can access the atomic counter that all CPUs fold into, memcg->vmstats->stats_updates. memcg_should_flush_stats() is changed to memcg_vmstats_needs_flush() to accept a struct memcg_vmstats pointer accordingly. In struct memcg_vmstats_percpu, make sure both pointers together with stats_updates live on the same cacheline. Finally, update mem_cgroup_alloc() to take in a parent pointer and initialize the new cache pointers on each CPU. The percpu loop in mem_cgroup_alloc() may look concerning, but there are multiple similar loops in the cgroup creation path (e.g. cgroup_rstat_init()), most of which are hidden within alloc_percpu(). According to Oliver's testing [1], this fixes multiple 30-38% regressions in vm-scalability, will-it-scale-tlb_flush2, and will-it-scale-fallocate1. This comes at a cost of 2 more pointers per CPU (<2KB on a machine with 128 CPUs). [1] https://lore.kernel.org/lkml/ZbDJsfsZt2ITyo61@xsang-OptiPlex-9020/ [yosryahmed@google.com: fix struct memcg_vmstats_percpu size and alignment] Link: https://lkml.kernel.org/r/20240203044612.1234216-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20240124100023.660032-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Fixes: 8d59d2214c23 ("mm: memcg: make stats flushing threshold per-memcg") Tested-by: kernel test robot <oliver.sang@intel.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202401221624.cb53a8ca-oliver.sang@intel.com Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
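An illustrative, non-kernel sketch of the data-structure change (the names here are made up): each per-CPU record keeps a cached pointer to its parent's record, so propagating an update costs one load per ancestor rather than re-deriving the per-CPU address from the cgroup each time.

#include <stdio.h>

/* Illustrative only: a per-CPU stats record caching its parent's record. */
struct vmstats_percpu {
	unsigned int stats_updates;
	struct vmstats_percpu *parent;	/* cached pointer to the parent's record */
};

static void propagate_update(struct vmstats_percpu *pcpu, unsigned int delta)
{
	for (; pcpu; pcpu = pcpu->parent)	/* one load per ancestor */
		pcpu->stats_updates += delta;
}

int main(void)
{
	struct vmstats_percpu root = { 0, NULL };
	struct vmstats_percpu child = { 0, &root };
	struct vmstats_percpu grandchild = { 0, &child };

	propagate_update(&grandchild, 3);
	printf("%u %u %u\n", grandchild.stats_updates, child.stats_updates,
	       root.stats_updates);	/* prints: 3 3 3 */
	return 0;
}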
2025-11-24 | memory tiers: use default_dram_perf_ref_source in log message | Ying Huang | 1 | -3/+3
commit a530bbc53826c607f64e8ee466c3351efaf6aea5 upstream. Commit 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT") added a default_dram_perf_ref_source variable that was initialized but never used. This causes kmemleak to report the following memory leak:

unreferenced object 0xff11000225a47b60 (size 16):
  comm "swapper/0", pid 1, jiffies 4294761654
  hex dump (first 16 bytes):
    41 43 50 49 20 48 4d 41 54 00 c1 4b 7d b7 75 7c  ACPI HMAT..K}.u|
  backtrace (crc e6d0e7b2):
    [<ffffffff95d5afdb>] __kmalloc_node_track_caller_noprof+0x36b/0x440
    [<ffffffff95c276d6>] kstrdup+0x36/0x60
    [<ffffffff95dfabfa>] mt_set_default_dram_perf+0x23a/0x2c0
    [<ffffffff9ad64733>] hmat_init+0x2b3/0x660
    [<ffffffff95203cec>] do_one_initcall+0x11c/0x5c0
    [<ffffffff9ac9cfc4>] do_initcalls+0x1b4/0x1f0
    [<ffffffff9ac9d52e>] kernel_init_freeable+0x4ae/0x520
    [<ffffffff97c789cc>] kernel_init+0x1c/0x150
    [<ffffffff952aecd1>] ret_from_fork+0x31/0x70
    [<ffffffff9520b18a>] ret_from_fork_asm+0x1a/0x30

This reminds us that we forgot to use the performance data source information. So, use the variable in the error log message to help identify the root cause of inconsistent performance numbers. Link: https://lkml.kernel.org/r/87y13mvo0n.fsf@yhuang6-desk2.ccr.corp.intel.com Fixes: 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Waiman Long <longman@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24 | cachestat: do not flush stats in recency check | Nhat Pham | 2 | -4/+15
commit 5a4d8944d6b1e1aaaa83ea42c116b520b4ed0394 upstream. syzbot detects that cachestat() is flushing stats, which can sleep, in its RCU read section (see [1]). This is done in the workingset_test_recent() step (which checks if the folio's eviction is recent). Move the stat flushing step to before the RCU read section of cachestat, and skip stat flushing during the recency check. [1]: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/ Link: https://lkml.kernel.org/r/20240627201737.3506959-1-nphamcs@gmail.com Fixes: b00684722262 ("mm: workingset: move the stats flush into workingset_test_recent()") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reported-by: syzbot+b7f13b2d0cc156edf61a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/ Debugged-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: <stable@vger.kernel.org> [6.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24 | mm/secretmem: fix use-after-free race in fault handler | Lance Yang | 1 | -1/+1
commit 6f86d0534fddfbd08687fa0f01479d4226bc3c3d upstream. When a page fault occurs in a secret memory file created with `memfd_secret(2)`, the kernel will allocate a new folio for it, mark the underlying page as not-present in the direct map, and add it to the file mapping. If two tasks cause a fault in the same page concurrently, both could end up allocating a folio and removing the page from the direct map, but only one would succeed in adding the folio to the file mapping. The task that failed undoes the effects of its attempt by (a) freeing the folio again and (b) putting the page back into the direct map. However, by doing these two operations in this order, the page becomes available to the allocator again before it is placed back in the direct mapping. If another task attempts to allocate the page between (a) and (b), and the kernel tries to access it via the direct map, it would result in a supervisor not-present page fault. Fix the ordering to restore the direct map before the folio is freed. Link: https://lkml.kernel.org/r/20251031120955.92116-1-lance.yang@linux.dev Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Lance Yang <lance.yang@linux.dev> Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com> Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
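The shape of the race from userspace (an illustrative sketch, not a reliable reproducer: a single run rarely hits the window; it assumes a libc that defines SYS_memfd_secret and a kernel with secretmem enabled, e.g. booted with secretmem.enable=1 where required):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static char *area;

static void *toucher(void *arg)
{
	area[0] = 1;		/* fault the same secretmem page from two threads */
	return NULL;
}

int main(void)
{
	int fd = syscall(SYS_memfd_secret, 0);
	pthread_t t1, t2;

	if (fd < 0) {
		perror("memfd_secret");
		return 1;
	}
	ftruncate(fd, 4096);
	area = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (area == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	pthread_create(&t1, NULL, toucher, NULL);
	pthread_create(&t2, NULL, toucher, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}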
2025-11-24 | mm/truncate: unmap large folio on split failure | Kiryl Shutsemau | 1 | -1/+26
commit fa04f5b60fda62c98a53a60de3a1e763f11feb41 upstream. Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are supposed to generate SIGBUS. This behavior might not be respected on truncation. During truncation, the kernel splits a large folio in order to reclaim memory. As a side effect, it unmaps the folio and destroys PMD mappings of the folio. The folio will be refaulted as PTEs and SIGBUS semantics are preserved. However, if the split fails, PMD mappings are preserved and the user will not receive SIGBUS on any accesses within the PMD. Unmap the folio on split failure. It will lead to refault as PTEs and preserve SIGBUS semantics. Make an exception for shmem/tmpfs that for long time intentionally mapped with PMDs across i_size. Link: https://lkml.kernel.org/r/20251027115636.82382-3-kirill@shutemov.name Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24 | mm/memory: do not populate page table entries beyond i_size | Kiryl Shutsemau | 2 | -6/+38
commit 74207de2ba10c2973334906822dc94d2e859ffc5 upstream. Patch series "Fix SIGBUS semantics with large folios", v3. Accessing memory within a VMA, but beyond i_size rounded up to the next page size, is supposed to generate SIGBUS. Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749 failed due to missing SIGBUS. This was caused by my recent changes that try to fault in the whole folio where possible: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()") 357b92761d94 ("mm/filemap: map entire large folio faultaround") These changes did not consider i_size when setting up PTEs, leading to xfstest breakage. However, the problem has been present in the kernel for a long time - since huge tmpfs was introduced in 2016. The kernel happily maps PMD-sized folios as PMD without checking i_size. And huge=always tmpfs allocates PMD-size folios on any writes. I considered this corner case when I implemented a large tmpfs, and my conclusion was that no one in their right mind should rely on receiving a SIGBUS signal when accessing beyond i_size. I cannot imagine how it could be useful for the workload. But apparently filesystem folks care a lot about preserving strict SIGBUS semantics. Generic/749 was introduced last year with reference to POSIX, but no real workloads were mentioned. It also acknowledged the tmpfs deviation from the test case. POSIX indeed says[3]: References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal. The patchset fixes the regression introduced by recent changes as well as more subtle SIGBUS breakage due to split failure on truncation. This patch (of 2): Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are supposed to generate SIGBUS. Recent changes attempted to fault in full folio where possible. They did not respect i_size, which led to populating PTEs beyond i_size and breaking SIGBUS semantics. Darrick reported generic/749 breakage because of this. However, the problem existed before the recent changes. With huge=always tmpfs, any write to a file leads to PMD-size allocation. Following the fault-in of the folio will install PMD mapping regardless of i_size. Fix filemap_map_pages() and finish_fault() to not install: - PTEs beyond i_size; - PMD mappings across i_size; Make an exception for shmem/tmpfs that for long time intentionally mapped with PMDs across i_size. Link: https://lkml.kernel.org/r/20251027115636.82382-1-kirill@shutemov.name Link: https://lkml.kernel.org/r/20251027115636.82382-2-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Fixes: 6795801366da ("xfs: Support large folios") Reported-by: "Darrick J. 
Wong" <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24 | filemap: cap PTE range to be created to allowed zero fill in folio_map_range() | Pankaj Raghav | 1 | -1/+6
commit 743a2753a02e805347969f6f89f38b736850d808 upstream. Usually the page cache does not extend beyond the size of the inode, therefore, no PTEs are created for folios that extend beyond the size. But with LBS support, we might extend page cache beyond the size of the inode as we need to guarantee folios of minimum order. While doing a read, do_fault_around() can create PTEs for pages that lie beyond the EOF leading to incorrect error return when accessing a page beyond the mapped file. Cap the PTE range to be created for the page cache up to the end of file(EOF) in filemap_map_pages() so that return error codes are consistent with POSIX[1] for LBS configurations. generic/749 has been created to trigger this edge case. This also fixes generic/749 for tmpfs with huge=always on systems with 4k base page size. [1](from mmap(2)) SIGBUS Attempted access to a page of the buffer that lies beyond the end of the mapped file. For an explanation of the treatment of the bytes in the page that corresponds to the end of a mapped file that is not a multiple of the page size, see NOTES. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-6-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24 | mm: memcg: restore subtree stats flushing | Yosry Ahmed | 3 | -33/+47
[ Upstream commit 7d7ef0a4686abe43cd76a141b340a348f45ecdf2 ] Stats flushing for memcg currently follows the following rules: - Always flush the entire memcg hierarchy (i.e. flush the root). - Only one flusher is allowed at a time. If someone else tries to flush concurrently, they skip and return immediately. - A periodic flusher flushes all the stats every 2 seconds. The reason this approach is followed is because all flushes are serialized by a global rstat spinlock. On the memcg side, flushing is invoked from userspace reads as well as in-kernel flushers (e.g. reclaim, refault, etc). This approach aims to avoid serializing all flushers on the global lock, which can cause a significant performance hit under high concurrency. This approach has the following problems: - Occasionally a userspace read of the stats of a non-root cgroup will be too expensive as it has to flush the entire hierarchy [1]. - Sometimes the stats accuracy are compromised if there is an ongoing flush, and we skip and return before the subtree of interest is actually flushed, yielding stale stats (by up to 2s due to periodic flushing). This is more visible when reading stats from userspace, but can also affect in-kernel flushers. The latter problem is particulary a concern when userspace reads stats after an event occurs, but gets stats from before the event. Examples: - When memory usage / pressure spikes, a userspace OOM handler may look at the stats of different memcgs to select a victim based on various heuristics (e.g. how much private memory will be freed by killing this). Reading stale stats from before the usage spike in this case may cause a wrongful OOM kill. - A proactive reclaimer may read the stats after writing to memory.reclaim to measure the success of the reclaim operation. Stale stats from before reclaim may give a false negative. - Reading the stats of a parent and a child memcg may be inconsistent (child larger than parent), if the flush doesn't happen when the parent is read, but happens when the child is read. As for in-kernel flushers, they will occasionally get stale stats. No regressions are currently known from this, but if there are regressions, they would be very difficult to debug and link to the source of the problem. This patch aims to fix these problems by restoring subtree flushing, and removing the unified/coalesced flushing logic that skips flushing if there is an ongoing flush. This change would introduce a significant regression with global stats flushing thresholds. With per-memcg stats flushing thresholds, this seems to perform really well. The thresholds protect the underlying lock from unnecessary contention. This patch was tested in two ways to ensure the latency of flushing is up to par, on a machine with 384 cpus: - A synthetic test with 5000 concurrent workers in 500 cgroups doing allocations and reclaim, as well as 1000 readers for memory.stat (variation of [2]). No regressions were noticed in the total runtime. Note that significant regressions in this test are observed with global stats thresholds, but not with per-memcg thresholds. - A synthetic stress test for concurrently reading memcg stats while memory allocation/freeing workers are running in the background, provided by Wei Xu [3]. With 250k threads reading the stats every 100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01% of reads take more than 1ms, and no reads take more than 100ms. 
[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/ [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/ [akpm@linux-foundation.org: fix mm/zswap.c] [yosryahmed@google.com: remove stats flushing mutex] Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
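A toy userspace model of the subtree flushing described above (this is not the kernel's rstat code; struct memcg, flush_subtree(), memcg_flush_stats() and FLUSH_THRESHOLD are all invented for the sketch): reading a cgroup folds in pending updates for that subtree only, and a per-memcg threshold decides whether the walk is worth taking at all.

#include <stdio.h>

#define FLUSH_THRESHOLD 64	/* invented; stands in for the per-memcg threshold */

/* Toy memcg node: pending stat updates plus a flushed total. */
struct memcg {
	const char *name;
	long pending;		/* updates not yet folded into total */
	long total;
	struct memcg *child;	/* first child */
	struct memcg *sibling;	/* next sibling */
};

/* Fold pending updates for this node and every descendant. */
static void flush_subtree(struct memcg *m)
{
	struct memcg *c;

	m->total += m->pending;
	m->pending = 0;
	for (c = m->child; c; c = c->sibling)
		flush_subtree(c);
}

/* Subtree flush: walk only the subtree being read, and only if dirty enough. */
static void memcg_flush_stats(struct memcg *m)
{
	if (m->pending < FLUSH_THRESHOLD)
		return;		/* threshold keeps the underlying lock uncontended */
	flush_subtree(m);
}

int main(void)
{
	struct memcg child = { "child", 100, 0, NULL, NULL };
	struct memcg parent = { "parent", 100, 0, &child, NULL };

	memcg_flush_stats(&child);	/* flushes only the child subtree */
	printf("%s total=%ld, %s total=%ld\n",
	       child.name, child.total, parent.name, parent.total);
	return 0;
}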
2025-11-24mm: workingset: move the stats flush into workingset_test_recent()Yosry Ahmed1-12/+24
[ Upstream commit b006847222623ac3cda8589d15379eac86a2bcb7 ] The workingset code flushes the stats in workingset_refault() to get accurate stats of the eviction memcg. In preparation for more scoped flushing and passing the eviction memcg to the flush call, move the call to workingset_test_recent() where we have a pointer to the eviction memcg. The flush call is sleepable, and cannot be made in an rcu read section. Hence, minimize the rcu read section by also moving it into workingset_test_recent(). Furthermore, instead of holding the rcu read lock throughout workingset_test_recent(), only hold it briefly to get a ref on the eviction memcg. This allows us to make the flush call after we get the eviction memcg. As for workingset_refault(), nothing else there appears to be protected by rcu. The memcg of the faulted folio (which is not necessarily the same as the eviction memcg) is protected by the folio lock, which is held from all callsites. Add a VM_BUG_ON() to make sure this doesn't change from under us. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
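The locking pattern the patch moves to can be sketched in userspace as follows (rcu_read_lock(), memcg_tryget() and friends below are stand-in stubs, not the kernel API; the point is only the ordering: pin the eviction memcg inside a short RCU section, then do the sleepable flush outside it):

#include <stdbool.h>
#include <stdio.h>

/* Userspace stubs standing in for the kernel primitives involved. */
static void rcu_read_lock(void)   { puts("rcu_read_lock"); }
static void rcu_read_unlock(void) { puts("rcu_read_unlock"); }

struct memcg { const char *name; int refs; };

static bool memcg_tryget(struct memcg *m) { m->refs++; return true; }
static void memcg_put(struct memcg *m)    { m->refs--; }

/* Sleepable: must not be called inside an RCU read-side section. */
static void flush_stats(struct memcg *m)
{
	printf("flushing %s (outside the RCU section)\n", m->name);
}

/* The ordering the patch establishes: hold the RCU section just long
 * enough to pin the eviction memcg, then flush after dropping it. */
static bool test_recent(struct memcg *eviction_memcg)
{
	rcu_read_lock();
	if (!memcg_tryget(eviction_memcg)) {
		rcu_read_unlock();
		return false;
	}
	rcu_read_unlock();

	flush_stats(eviction_memcg);	/* safe to sleep here */
	/* ... consult the now up-to-date stats ... */
	memcg_put(eviction_memcg);
	return true;
}

int main(void)
{
	struct memcg m = { "eviction-memcg", 0 };

	return test_recent(&m) ? 0 : 1;
}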
2025-11-24mm: memcg: make stats flushing threshold per-memcgYosry Ahmed1-16/+34
[ Upstream commit 8d59d2214c2362e7a9d185d80b613e632581af7b ] A global counter for the magnitude of memcg stats update is maintained on the memcg side to avoid invoking rstat flushes when the pending updates are not significant. This avoids unnecessary flushes, which are not very cheap even if there isn't a lot of stats to flush. It also avoids unnecessary lock contention on the underlying global rstat lock. Make this threshold per-memcg. The same scheme is followed: percpu (now also per-memcg) counters are incremented in the update path, and only propagated to per-memcg atomics when they exceed a certain threshold. This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be reached relatively fast, so guarding the underlying lock becomes less effective. Making the threshold per-memcg avoids this.
(b) Having a global threshold makes it hard to do subtree flushes, as we cannot reset the global counter except for a full flush. Per-memcg counters remove this as a blocker from doing subtree flushes, which helps avoid unnecessary work when the stats of a small subtree are needed.
Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes. The extra memory usage is insignificant.
(b) More work on the update side, although in the common case it will only be percpu counter updates. The amount of work scales with the number of ancestors (i.e. tree depth). This is not a new concept: adding a cgroup to the rstat tree involves a parent loop, and so does charging. Testing results below show no significant regressions.
(c) The error margin in the stats for the system as a whole increases from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS. This is probably fine because we have a similar per-memcg error in charges coming from percpu stocks, and we have a periodic flusher that makes sure we always flush all the stats every 2s anyway.
This patch was tested to make sure no significant regressions are introduced on the update path as follows. The following benchmarks were run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
(1) Running 22 instances of netperf on a 44 cpu machine with hyperthreading disabled. All instances are run in a level 2 cgroup, as well as netserver:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Averaging 20 runs, the numbers are as follows:
  Base: 40198.0 mbps
  Patched: 38629.7 mbps (-3.9%)
The regression is minimal, especially for 22 instances in the same cgroup sharing all ancestors (so updating the same atomics).
(2) will-it-scale page_fault tests. These tests (specifically per_process_ops in the page_fault3 test) detected a 25.9% regression before for a change in the stats update path [1].
These are the numbers from 10 runs (+ is good) on a machine with 256 cpus:

             LABEL            |     MEAN    |    MEDIAN   |    STDDEV   |
------------------------------+-------------+-------------+-------------
  page_fault1_per_process_ops |             |             |             |
  (A) base                    |  270249.164 |  265437.000 |   13451.836 |
  (B) patched                 |  261368.709 |  255725.000 |   13394.767 |
                              |      -3.29% |      -3.66% |             |
  page_fault1_per_thread_ops  |             |             |             |
  (A) base                    |  242111.345 |  239737.000 |   10026.031 |
  (B) patched                 |  237057.109 |  235305.000 |    9769.687 |
                              |      -2.09% |      -1.85% |             |
  page_fault1_scalability     |             |             |             |
  (A) base                    |    0.034387 |    0.035168 |   0.0018283 |
  (B) patched                 |    0.033988 |    0.034573 |   0.0018056 |
                              |      -1.16% |      -1.69% |             |
  page_fault2_per_process_ops |             |             |             |
  (A) base                    |  203561.836 |  203301.000 |    2550.764 |
  (B) patched                 |  197195.945 |  197746.000 |    2264.263 |
                              |      -3.13% |      -2.73% |             |
  page_fault2_per_thread_ops  |             |             |             |
  (A) base                    |  171046.473 |  170776.000 |    1509.679 |
  (B) patched                 |  166626.327 |  166406.000 |     768.753 |
                              |      -2.58% |      -2.56% |             |
  page_fault2_scalability     |             |             |             |
  (A) base                    |    0.054026 |    0.053821 |  0.00062121 |
  (B) patched                 |    0.053329 |     0.05306 |  0.00048394 |
                              |      -1.29% |      -1.41% |             |
  page_fault3_per_process_ops |             |             |             |
  (A) base                    | 1295807.782 | 1297550.000 |    5907.585 |
  (B) patched                 | 1275579.873 | 1273359.000 |    8759.160 |
                              |      -1.56% |      -1.86% |             |
  page_fault3_per_thread_ops  |             |             |             |
  (A) base                    |  391234.164 |  390860.000 |    1760.720 |
  (B) patched                 |  377231.273 |  376369.000 |    1874.971 |
                              |      -3.58% |      -3.71% |             |
  page_fault3_scalability     |             |             |             |
  (A) base                    |     0.60369 |     0.60072 |   0.0083029 |
  (B) patched                 |     0.61733 |     0.61544 |    0.009855 |
                              |      +2.26% |      +2.45% |             |

All regressions seem to be minimal, and within the normal variance for the benchmark. (The fix for [1] assumes that 3% is noise -- and there were no further practical complaints), so hopefully this means that such variations in these microbenchmarks do not reflect on practical workloads.
(3) I also ran stress-ng in a nested cgroup and did not observe any obvious regressions.
[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
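A toy model of the per-memcg threshold scheme (not the kernel implementation; BATCH, NCPUS and the field names are invented, and the real code uses MEMCG_CHARGE_BATCH with per-CPU data and atomics): updates stay in cheap per-cpu counters until a batch is reached, the batch is then propagated to this memcg and all ancestors, and readers flush only when their subtree's aggregate crosses a threshold.

#include <stdio.h>
#include <stdlib.h>

#define BATCH 64	/* per-cpu batching threshold (toy value) */
#define NCPUS 4

struct memcg {
	const char *name;
	struct memcg *parent;
	long percpu_pending[NCPUS];	/* cheap, contention-free updates */
	long pending;			/* per-memcg aggregate ("atomic") */
};

/* Update path: stay per-cpu until the local delta is big enough, then
 * propagate it to this memcg and every ancestor. */
static void memcg_stat_updated(struct memcg *m, int cpu, long delta)
{
	m->percpu_pending[cpu] += labs(delta);
	if (m->percpu_pending[cpu] < BATCH)
		return;
	for (struct memcg *p = m; p; p = p->parent)
		p->pending += m->percpu_pending[cpu];
	m->percpu_pending[cpu] = 0;
}

/* Read path: only bother flushing when this subtree is dirty enough. */
static int memcg_needs_flush(const struct memcg *m)
{
	return m->pending >= NCPUS * BATCH;
}

int main(void)
{
	struct memcg root = { "root", NULL, {0}, 0 };
	struct memcg a = { "a", &root, {0}, 0 };

	for (int i = 0; i < 100; i++)
		memcg_stat_updated(&a, i % NCPUS, 3);

	printf("a: pending=%ld, needs_flush=%d\n", a.pending, memcg_needs_flush(&a));
	printf("root: pending=%ld, needs_flush=%d\n", root.pending, memcg_needs_flush(&root));
	return 0;
}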
2025-11-24mm: memcg: move vmstats structs definition above flushing codeYosry Ahmed1-74/+74
[ Upstream commit e0bf1dc859fdd08ef738824710770a30a8069433 ] The following patch will make use of those structs in the flushing code, so move their definitions (and a few other dependencies) a little bit up to reduce the diff noise in the following patch. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: change flush_next_time to flush_last_timeYosry Ahmed1-3/+4
[ Upstream commit 508bed884767a8eb394640bae9edcdf082816c43 ] Patch series "mm: memcg: subtree stats flushing and thresholds", v4. This series attempts to address shortcomings in today's approach for memcg stats flushing, namely occasionally stale or expensive stat reads. The series does so by changing the threshold that we use to decide whether to trigger a flush to be per memcg instead of global (patch 3), and then changing flushing to be per memcg (i.e. subtree flushes) instead of global (patch 5). This patch (of 5): flush_next_time is an inaccurate name. It's not the next time that periodic flushing will happen, it's rather the next time that ratelimited flushing can happen if the periodic flusher is late. Simplify its semantics by just storing the timestamp of the last flush instead, flush_last_time. Move the 2*FLUSH_TIME addition to mem_cgroup_flush_stats_ratelimited(), and add a comment explaining it. This way, all the ratelimiting semantics live in one place. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20231129032154.3710765-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
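The flush_last_time semantics can be sketched like this (a userspace illustration with invented names and a toy FLUSH_TIME_SEC; the kernel side uses jiffies and atomic operations):

#include <stdio.h>
#include <time.h>

#define FLUSH_TIME_SEC 2	/* stands in for the 2s periodic flush interval */

static time_t flush_last_time;

static void do_flush(void)
{
	flush_last_time = time(NULL);
	puts("flushed");
}

/* Ratelimited path: flush only if the periodic flusher appears to be late,
 * i.e. more than 2*FLUSH_TIME has passed since the last flush. */
static void flush_stats_ratelimited(void)
{
	if (time(NULL) > flush_last_time + 2 * FLUSH_TIME_SEC)
		do_flush();
	else
		puts("skipped: periodic flusher is on schedule");
}

int main(void)
{
	do_flush();			/* pretend the periodic worker just ran */
	flush_stats_ratelimited();	/* skipped */
	flush_last_time -= 5;		/* simulate a late periodic flusher */
	flush_stats_ratelimited();	/* flushes */
	return 0;
}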
2025-11-24mm: memcg: add per-memcg zswap writeback statDomenico Cerasuolo3-0/+6
[ Upstream commit 7108cc3f765cafd48a6a35f8add140beaecfa75b ] Since zswap now writes back pages from memcg-specific LRUs, we now need a new stat to show writebacks count for each memcg. [nphamcs@gmail.com: rename ZSWP_WB to ZSWPWB] Link: https://lkml.kernel.org/r/20231205193307.2432803-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231130194023.4102148-5-nphamcs@gmail.com Suggested-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm: memcg: add THP swap out info for anonymous reclaimXin Hao3-4/+7
[ Upstream commit 811244a501b967b00fecb1ae906d5dc6329c91e0 ] At present, we support a per-memcg reclaim strategy; however, we do not know the number of transparent huge pages being reclaimed. Transparent huge pages need to be split before they are reclaimed, and that can become a performance bottleneck. For example, when two memcgs (A & B) are reclaiming anonymous pages at the same time, and memcg 'A' is reclaiming a large number of transparent huge pages, we can better determine that the performance bottleneck is caused by memcg 'A'. Therefore, in order to better analyze such problems, add THP swap out info per-memcg. [akpm@linux-foundation.org: fix swap_writepage_fs(), per Johannes] Link: https://lkml.kernel.org/r/20230913213343.GB48476@cmpxchg.org Link: https://lkml.kernel.org/r/20230913164938.16918-1-vernhao@tencent.com Signed-off-by: Xin Hao <vernhao@tencent.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm, percpu: do not consider sleepable allocations atomicMichal Hocko1-1/+7
[ Upstream commit 9a5b183941b52f84c0f9e5f27ce44e99318c9e0f ] 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context") has fixed a reclaim recursion for scoped GFP_NOFS context. It has done that by avoiding taking pcpu_alloc_mutex. This is a correct solution as the worker context, which has full GFP_KERNEL allocation/reclaim power and uses the same lock, cannot block the NOFS pcpu_alloc caller. On the other hand this is a very conservative approach that could lead to failures because the pcpu_alloc lockless implementation is quite limited. We have a bug report about premature failures when a scsi array of 193 devices is scanned. Sometimes (not consistently) the scanning aborts because the iscsid daemon fails to create the queue for a random scsi device during the scan. iscsid itself is running with PR_SET_IO_FLUSHER set, so all allocations from this process context are GFP_NOIO. This in turn makes any pcpu_alloc lockless (without pcpu_alloc_mutex), which leads to premature failures. It has turned out that iscsid has worked around this by dropping PR_SET_IO_FLUSHER (https://github.com/open-iscsi/open-iscsi/pull/382) when scanning the host. But we can do better in this case on the kernel side and use pcpu_alloc_mutex for NOIO and NOFS constrained allocation scopes too. We just need the WQ worker to never trigger IO/FS reclaim. Achieve that by enforcing scoped GFP_NOIO for the whole execution of pcpu_balance_workfn (this will imply the NOFS constraint as well). This will remove the dependency chain and preserve the full allocation power of the pcpu_alloc call. While at it, make is_atomic really test for blockable allocations. Link: https://lkml.kernel.org/r/20250206122633.167896-1-mhocko@kernel.org Fixes: 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dennis Zhou <dennis@kernel.org> Cc: Filipe David Manana <fdmanana@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: chenxin <chenxinxin@xiaomi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
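The shape of the fix, sketched with userspace stand-ins (memalloc_noio_save()/memalloc_noio_restore() exist in the kernel, but the versions below are stubs over a fake task flag; pcpu_balance_workfn() here only mimics the structure, not the real worker):

#include <stdio.h>

/* Fake task flag; the kernel helpers operate on current->flags. */
static unsigned int task_flags;
#define PF_MEMALLOC_NOIO 0x1u

static unsigned int memalloc_noio_save(void)
{
	unsigned int old = task_flags;

	task_flags |= PF_MEMALLOC_NOIO;
	return old;
}

static void memalloc_noio_restore(unsigned int old)
{
	task_flags = old;
}

static void alloc_pages_for_chunks(void)
{
	/* Any reclaim triggered here now sees the NOIO constraint, so it
	 * cannot recurse into IO/FS paths that wait on pcpu_alloc(). */
	printf("allocating with%s NOIO scope\n",
	       (task_flags & PF_MEMALLOC_NOIO) ? "" : "out");
}

/* Shape of the fix: the whole balance work runs under a NOIO scope, so
 * NOIO/NOFS pcpu_alloc() callers can safely share pcpu_alloc_mutex with it. */
static void pcpu_balance_workfn(void)
{
	unsigned int flags = memalloc_noio_save();

	alloc_pages_for_chunks();
	memalloc_noio_restore(flags);
}

int main(void)
{
	pcpu_balance_workfn();
	return 0;
}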
2025-11-24mm/mm_init: fix hash table order logging in alloc_large_system_hash()Isaac J. Manjarres1-1/+1
commit 0d6c356dd6547adac2b06b461528e3573f52d953 upstream. When emitting the order of the allocation for a hash table, alloc_large_system_hash() unconditionally subtracts PAGE_SHIFT from log base 2 of the allocation size. This is not correct if the allocation size is smaller than a page, and yields a negative value for the order as seen below:
TCP established hash table entries: 32 (order: -4, 256 bytes, linear)
TCP bind hash table entries: 32 (order: -2, 1024 bytes, linear)
Use get_order() to compute the order when emitting the hash table information to correctly handle cases where the allocation size is smaller than a page:
TCP established hash table entries: 32 (order: 0, 256 bytes, linear)
TCP bind hash table entries: 32 (order: 0, 1024 bytes, linear)
Link: https://lkml.kernel.org/r/20251028191020.413002-1-isaacmanjarres@google.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
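The difference between the two computations can be reproduced in a few lines of userspace C (get_order_demo() and ilog2_ul() are local helpers written for this example, mirroring the behaviour of the kernel's get_order() and ilog2() for the sizes shown):

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Integer log base 2, like the kernel's ilog2() for power-of-two inputs. */
static int ilog2_ul(unsigned long x)
{
	int n = -1;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Smallest order such that (PAGE_SIZE << order) >= size, clamped at 0. */
static int get_order_demo(unsigned long size)
{
	int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

int main(void)
{
	unsigned long sizes[] = { 256, 1024, 8192 };

	for (int i = 0; i < 3; i++) {
		unsigned long sz = sizes[i];

		/* ilog2(size) - PAGE_SHIFT goes negative below one page. */
		printf("%5lu bytes: buggy order=%d, get_order=%d\n",
		       sz, ilog2_ul(sz) - PAGE_SHIFT, get_order_demo(sz));
	}
	return 0;
}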
2025-11-24base/node / acpi: Change 'node_hmem_attrs' to 'access_coordinates'Dave Jiang1-6/+6
[ Upstream commit 6a954e94d038f41d79c4e04348c95774d1c9337d ] Dan Williams suggested changing the struct 'node_hmem_attrs' to 'access_coordinates' [1]. The struct is a container of r/w-latency and r/w-bandwidth numbers. Moving forward, this container will also be used by CXL to store the performance characteristics of each link hop in the PCIE/CXL topology. So, where node_hmem_attrs is just the access parameters of a memory-node, access_coordinates applies more broadly to hardware topology characteristics. The observation is that it seemed like an exercise in having the application identify "where" it falls on a spectrum of bandwidth and latency needs. For the tuple of read/write-latency and read/write-bandwidth, "coordinates" is not a perfect fit. Sometimes it is just conveying values in isolation and not a "location" relative to other performance points, but in the end this data is used to identify the performance operation point of a given memory-node. [2] Link: http://lore.kernel.org/r/64471313421f7_1b66294d5@dwillia2-xfh.jf.intel.com.notmuch/ Link: https://lore.kernel.org/linux-cxl/645e6215ee0de_1e6f2945e@dwillia2-xfh.jf.intel.com.notmuch/ Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/170319615734.2212653.15319394025985499185.stgit@djiang5-mobl3 Signed-off-by: Dan Williams <dan.j.williams@intel.com> Stable-dep-of: 214291cbaace ("acpi/hmat: Fix lockdep warning for hmem_register_resource()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-24acpi, hmat: calculate abstract distance with HMATHuang Ying1-1/+102
[ Upstream commit 3718c02dbd4c88d47b5af003acdb3d1112604ea3 ] A memory tiering abstract distance calculation algorithm based on ACPI HMAT is implemented. The basic idea is as follows. The performance attributes of system default DRAM nodes are recorded as the baseline, whose abstract distance is MEMTIER_ADISTANCE_DRAM. Then, the ratio of the abstract distance of a memory node (target) to MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance attributes of the node to those of the default DRAM nodes. The functions to record the read/write latency/bandwidth of the default DRAM nodes and to calculate abstract distance according to the read/write latency/bandwidth ratio will be used by CXL CDAT (Coherent Device Attribute Table) and other memory device drivers. So, they are put in memory-tiers.c. Link: https://lkml.kernel.org/r/20230926060628.265989-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Bharata B Rao <bharata@amd.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 214291cbaace ("acpi/hmat: Fix lockdep warning for hmem_register_resource()") Signed-off-by: Sasha Levin <sashal@kernel.org>
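A simplified model of the ratio-based scaling (a sketch only, not the kernel implementation; the struct, function name, baseline value and sample numbers are made up): the DRAM baseline distance is scaled up by the target's latency ratio and down by its bandwidth ratio relative to default DRAM.

#include <stdio.h>

#define ADISTANCE_DRAM 576	/* stand-in for MEMTIER_ADISTANCE_DRAM */

struct node_perf {
	unsigned int read_latency, write_latency;	/* ns */
	unsigned int read_bandwidth, write_bandwidth;	/* MB/s */
};

/* Higher latency -> larger distance; lower bandwidth -> larger distance. */
static unsigned long perf_to_adistance(const struct node_perf *dram,
				       const struct node_perf *target)
{
	unsigned long adist = ADISTANCE_DRAM;

	adist = adist * (target->read_latency + target->write_latency) /
			(dram->read_latency + dram->write_latency);
	adist = adist * (dram->read_bandwidth + dram->write_bandwidth) /
			(target->read_bandwidth + target->write_bandwidth);
	return adist;
}

int main(void)
{
	struct node_perf dram = { 80, 80, 100000, 100000 };	/* baseline */
	struct node_perf slow = { 250, 250, 30000, 30000 };	/* e.g. CXL */

	printf("DRAM adist=%d, target adist=%lu\n",
	       ADISTANCE_DRAM, perf_to_adistance(&dram, &slow));
	return 0;
}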
2025-11-24memory tiering: add abstract distance calculation algorithms managementHuang Ying1-0/+59
[ Upstream commit 07a8bdd4120ced3490ef9adf51b8086af0aaa8e7 ] Patch series "memory tiering: calculate abstract distance based on ACPI HMAT", v4. We have the explicit memory tiers framework to manage systems with multiple types of memory, e.g., DRAM in DIMM slots and CXL memory devices. There, memory devices of the same kind are grouped into memory types, then put into memory tiers. To describe the performance of a memory type, abstract distance is defined, which is in direct proportion to the memory latency and inversely proportional to the memory bandwidth. To keep the code as simple as possible, a fixed abstract distance is used in dax/kmem to describe slow memory such as Optane DCPMM. To support more memory types, in this series, we added the abstract distance calculation algorithm management mechanism, provided an algorithm implementation based on ACPI HMAT, and used the general abstract distance calculation interface in the dax/kmem driver. So, dax/kmem can support HBM (high bandwidth memory) in addition to the original Optane DCPMM. This patch (of 4): The abstract distance may be calculated by various drivers, such as ACPI HMAT, CXL CDAT, etc., while it may be used by various code which hot-adds memory nodes, such as dax/kmem, etc. To decouple the algorithm users and the providers, the abstract distance calculation algorithms management mechanism is implemented in this patch. It provides an interface for the providers to register their implementations, and an interface for the users. Multiple algorithm implementations can cooperate via calculating abstract distance for different memory nodes. The preference of algorithm implementations can be specified via priority (notifier_block.priority). Link: https://lkml.kernel.org/r/20230926060628.265989-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20230926060628.265989-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Bharata B Rao <bharata@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 214291cbaace ("acpi/hmat: Fix lockdep warning for hmem_register_resource()") Signed-off-by: Sasha Levin <sashal@kernel.org>
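The provider/user decoupling can be modelled in userspace roughly as below (the kernel side uses a notifier chain ordered by notifier_block.priority, as the commit message says; adist_algo, register_algo() and calc_adistance() here are invented stand-ins): providers register callbacks with a priority, and a user query walks them in priority order, each one refining the abstract distance.

#include <stdio.h>

struct adist_algo {
	const char *name;
	int priority;
	void (*calc)(int node, int *adist);
	struct adist_algo *next;
};

static struct adist_algo *algos;

/* Keep the list sorted so higher-priority algorithms run first. */
static void register_algo(struct adist_algo *a)
{
	struct adist_algo **p = &algos;

	while (*p && (*p)->priority >= a->priority)
		p = &(*p)->next;
	a->next = *p;
	*p = a;
}

/* User side: every registered provider gets a chance to refine the value. */
static int calc_adistance(int node)
{
	int adist = 576;	/* start from a DRAM-like default (invented) */

	for (struct adist_algo *a = algos; a; a = a->next)
		a->calc(node, &adist);
	return adist;
}

/* An HMAT-like provider that doubles the distance for a "slow" node. */
static void hmat_calc(int node, int *adist)
{
	if (node != 0)
		*adist *= 2;
}

int main(void)
{
	struct adist_algo hmat = { "hmat", 100, hmat_calc, NULL };

	register_algo(&hmat);
	printf("node 0: %d, node 1: %d\n", calc_adistance(0), calc_adistance(1));
	return 0;
}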
2025-10-23quota: remove unneeded return value of register_quota_formatKemeng Shi1-6/+1
[ Upstream commit a838e5dca63d1dc701e63b2b1176943c57485c45 ] register_quota_format() always returns 0, so simply remove the unneeded return value. Link: https://patch.msgid.link/20240715130534.2112678-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jan Kara <jack@suse.cz> Stable-dep-of: 72b7ceca857f ("fs: quota: create dedicated workqueue for quota_release_work") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19mm/damon/vaddr: do not repeat pte_offset_map_lock() until successSeongJae Park1-6/+2
commit b93af2cc8e036754c0d9970d9ddc47f43cc94b9f upstream. DAMON's virtual address space operation set implementation (vaddr) calls pte_offset_map_lock() inside the page table walk callback function. This is for reading and writing page table accessed bits. If pte_offset_map_lock() fails, it retries by returning from the page table walk callback function with ACTION_AGAIN. pte_offset_map_lock() can continuously fail if the target is a pmd migration entry, though. Hence it could cause an infinite page table walk if the migration cannot be done until the page table walk is finished. This indeed caused a soft lockup when CPU hotplugging and DAMON were running in parallel. Avoid the infinite loop by simply not retrying the page table walk. DAMON is promising only a best-effort accuracy, so missing access to such pages is no problem. Link: https://lkml.kernel.org/r/20250930004410.55228-1-sj@kernel.org Fixes: 7780d04046a2 ("mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Xinyu Zheng <zhengxinyu6@huawei.com> Closes: https://lore.kernel.org/20250918030029.2652607-1-zhengxinyu6@huawei.com Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
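The behavioural difference can be illustrated with a small userspace sketch (walk_cb_retry() and walk_cb_skip() are invented stand-ins, not the DAMON callbacks; the lock stub below always fails, modelling a pmd migration entry that never resolves while the walk is running):

#include <stdbool.h>
#include <stdio.h>

/* Models a PTE lock that can never be taken during the walk. */
static bool pte_lock_trylock(void)
{
	return false;
}

/* Old behaviour (sketch): keep retrying the same address; this never
 * terminates if the lock keeps failing, so cap it for the demo. */
static int walk_cb_retry(int *retries)
{
	while (!pte_lock_trylock()) {
		if (++(*retries) > 1000)
			return -1;	/* would effectively loop forever */
	}
	return 0;
}

/* New behaviour: give up on this address; DAMON is best-effort anyway. */
static int walk_cb_skip(void)
{
	if (!pte_lock_trylock())
		return 0;	/* skip instead of requesting another retry */
	/* ... read/clear the accessed bit here ... */
	return 0;
}

int main(void)
{
	int retries = 0;
	int ret = walk_cb_retry(&retries);

	printf("retry model: %d (retries=%d)\n", ret, retries);
	printf("skip model: %d\n", walk_cb_skip());
	return 0;
}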
2025-10-19mm/hugetlb: early exit from hugetlb_pages_alloc_boot() when max_huge_pages=0Li RongQing1-0/+3
commit b322e88b3d553e85b4e15779491c70022783faa4 upstream. Optimize hugetlb_pages_alloc_boot() to return immediately when max_huge_pages is 0, avoiding unnecessary CPU cycles and the below log message when hugepages aren't configured in the kernel command line. [ 3.702280] HugeTLB: allocation took 0ms with hugepage_allocation_threads=32 Link: https://lkml.kernel.org/r/20250814102333.4428-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Tested-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19mm/page_alloc: only set ALLOC_HIGHATOMIC for __GFP_HIGH allocationsThadeu Lima de Souza Cascardo1-1/+1
commit 6a204d4b14c99232e05d35305c27ebce1c009840 upstream. Commit 524c48072e56 ("mm/page_alloc: rename ALLOC_HIGH to ALLOC_MIN_RESERVE") is the start of a series that explains how __GFP_HIGH, which implies ALLOC_MIN_RESERVE, is going to be used instead of __GFP_ATOMIC for high atomic reserves. Commit eb2e2b425c69 ("mm/page_alloc: explicitly record high-order atomic allocations in alloc_flags") introduced ALLOC_HIGHATOMIC for such allocations of order higher than 0. It still used __GFP_ATOMIC, though. Then, commit 1ebbb21811b7 ("mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves") just turned that into a check for !__GFP_DIRECT_RECLAIM, ignoring that high atomic reserves were expected to test for __GFP_HIGH. This leads to high atomic reserves being added for high-order GFP_NOWAIT allocations and others that clear __GFP_DIRECT_RECLAIM, which is unexpected. Later, those reserves lead to 0-order allocations going to the slow path and starting reclaim.
From /proc/pagetypeinfo, without the patch:
Node 0, zone      DMA, type HighAtomic  0  0  0  0  0  0  0  0  0  0  0
Node 0, zone    DMA32, type HighAtomic  1  8 10  9  7  3  0  0  0  0  0
Node 0, zone   Normal, type HighAtomic 64 20 12  5  0  0  0  0  0  0  0
With the patch:
Node 0, zone      DMA, type HighAtomic  0  0  0  0  0  0  0  0  0  0  0
Node 0, zone    DMA32, type HighAtomic  0  0  0  0  0  0  0  0  0  0  0
Node 0, zone   Normal, type HighAtomic  0  0  0  0  0  0  0  0  0  0  0
Link: https://lkml.kernel.org/r/20250814172245.1259625-1-cascardo@igalia.com Fixes: 1ebbb21811b7 ("mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Tested-by: Helen Koike <koike@igalia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
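The predicate change can be sketched with toy flag bits (these are not the real gfp/alloc flag values, and alloc_flags_old()/alloc_flags_new() are invented helpers that only model the check described above):

#include <stdio.h>

/* Toy flag bits, just to show the predicate change. */
#define __GFP_HIGH            0x1u
#define __GFP_DIRECT_RECLAIM  0x2u
#define ALLOC_HIGHATOMIC      0x4u

/* Before the fix (sketch): any non-blocking high-order allocation was
 * treated as high-atomic, so GFP_NOWAIT-style requests grew the reserve. */
static unsigned int alloc_flags_old(unsigned int gfp, int order)
{
	if (order > 0 && !(gfp & __GFP_DIRECT_RECLAIM))
		return ALLOC_HIGHATOMIC;
	return 0;
}

/* After the fix: only __GFP_HIGH callers may use/grow highatomic reserves. */
static unsigned int alloc_flags_new(unsigned int gfp, int order)
{
	if (order > 0 && (gfp & __GFP_HIGH) && !(gfp & __GFP_DIRECT_RECLAIM))
		return ALLOC_HIGHATOMIC;
	return 0;
}

int main(void)
{
	unsigned int gfp_nowait = 0;	/* neither __GFP_HIGH nor direct reclaim */

	printf("GFP_NOWAIT order-3: old=%#x new=%#x\n",
	       alloc_flags_old(gfp_nowait, 3), alloc_flags_new(gfp_nowait, 3));
	printf("__GFP_HIGH order-3: old=%#x new=%#x\n",
	       alloc_flags_old(__GFP_HIGH, 3), alloc_flags_new(__GFP_HIGH, 3));
	return 0;
}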