author Linus Torvalds <torvalds@linux-foundation.org> 2026-06-19 20:14:34 +0300
committer Linus Torvalds <torvalds@linux-foundation.org> 2026-06-19 20:14:34 +0300
commit a552c81ff4a16738ca5a44a177d552eb38d552ce
tree 82800368fc5bc70e728875edb52777521f082ca8 /mm/vmalloc.c
parent c98d767b34574be82b74d77d02264a830ae1cadd
parent e3d8707358ea76b78bdec9928937bb9a797f2c8f
Merge tag 'mm-stable-2026-06-18-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "selftests/mm: clean up build output and verbosity" (Li Wang)
   Remove some noise from the MM selftests build

 - "mm: Free contiguous order-0 pages efficiently" (Ryan Roberts)
   Speed up the freeing of a batch of 0-order pages by first scanning them for coalescing opportunities. This is applicable to vfree() and to the releasing of frozen pages

 - "mm/damon: introduce DAMOS failed region quota charge ratio" (SeongJae Park)
   Address a DAMOS usability issue: The DAMOS quota often exhausts prematurely because it charges for all memory attempted, causing slow and inconsistent performance when actions fail on unreclaimable memory. To fix this, a new feature lets users set a smaller, flexible quota charge ratio (via a numerator and denominator) for failed regions. Since failed actions cause less overhead, reducing their quota cost ensures more predictable and efficient DAMOS processing

 - "selftests/cgroup: improve zswap tests robustness and support large page sizes" (Li Wang)
   Fix various spurious failures and improve the overall robustness of the cgroup zswap selftests

 - "fix MAP_DROPPABLE not supported errno" (Anthony Yznaga)
   Fix an issue in the mlock selftests on arm32

 - "mm: huge_memory: clean up defrag sysfs with shared" (Breno Leitao)
   Some maintenance work in the huge_memory code

 - "treewide: fixup gfp_t printks" (Brendan Jackman)
   Use the special vprintf() gfp_t conversion in various places

 - "mm: Fix vmemmap optimization accounting and initialization" (Muchun Song)
   Fix several bugs in the vmemmap optimization, mainly around incorrect page accounting and memmap initialization in the DAX and memory hotplug paths. It also fixes pageblock migratetype initialization and struct page initialization for ZONE_DEVICE compound pages

 - "mm/damon: repost non-hotfix reviewed patches in damon/next tree"
   A sprinkle of unrelated minor bugfixes for DAMON

 - "mm: remove page_mapped()" (David Hildenbrand)
   Remove this function from the tree, replacing it with folio_mapped()

 - "mm/damon: let DAMON be paused and resumed" (SeongJae Park)
   Allow DAMON to be paused and resumed without losing its current state

 - "kasan: hw_tags: Disable tagging for stack and page-tables" (Muhammad Usama Anjum)
   Simplify and speed up kasan by removing its ineffective tagging of stacks and page tables

 - "mm/damon/reclaim,lru_sort: monitor all system rams by default" (SeongJae Park)
   Simplify deployment on diverse hardware like NUMA systems by updating DAMON_RECLAIM and DAMON_LRU_SORT to automatically monitor the physical address range covering all System RAM areas by default, replacing the overly restrictive behavior that only targeted the single largest memory block to save on negligible overhead

 - "mm/damon/sysfs: document filters/ directory as deprecated" (SeongJae Park)
   Update some DAMON docs

 - "mm: use spinlock guards for zone lock" (Dmitry Ilvokhin)
   Switch zone->lock handling over to using the guard() mechanisms

 - "mm/filemap: tighten mmap_miss hit accounting" (fujunjie)
   Fix a flaw where the mmap_miss counter over-credited page cache hits during fault-arounds and page-fault retries. This results in a significant reduction of redundant synchronous mmap readahead I/O, drastically cutting down execution time and gigabytes read for sparse random or strided memory access workloads

 - "selftests/cgroup: Fix false positive failures in test_percpu_basic" (Li Wang)
   Fix a couple of false-positives in the cgroup kmem selftests

 - "mm/damon/reclaim: support monitoring intervals auto-tuning" (SeongJae Park)
   Add a new parameter to DAMON permitting DAMON_RECLAIM to automatically tune DAMON's sampling and aggregation intervals

 - "mm/damon/stat: add kdamond_pid parameter" (SeongJae Park)
   Change DAMON_STAT to provide the pid of its kdamond

 - "mm/kmemleak: dedupe verbose scan output" (Breno Leitao)
   Remove large amounts of duplicated backtraces from the verbose-mode kmemleak output

 - "mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE (Part 1)" (David Hildenbrand)
   Reduce our use of CONFIG_HAVE_BOOTMEM_INFO_NODE, with a view to removing it entirely in a later series

 - "mm/damon: validate min_region_size to be power of 2" (Liew Rui Yan)
   Prevent users from passing a non-power-of-2 value of `addr_unit', as this later results in undesirable behavior

 - "mm: document read_pages and simplify usage" (Frederick Mayle)

 - "tools/mm/page-types: Fix misc bugs" (Ye Liu)
   Fix three issues in tools/mm/page-types.c

 - "mm: misc cleanups from __GFP_UNMAPPED series" (Brendan Jackman)
   Implement several cleanups in the page allocator and related code

 - "mm, swap: swap table phase IV: unify allocation" (Kairui Song)
   Unify the allocation and charging of anon and shmem swap in folios, provide better synchronization, consolidate the metadata management (hence dropping the static array and map), and improve performance

 - "mm/damon: introduce data attributes monitoring" (SeongJae Park)
   Extend DAMON to monitor general data attributes other than accesses

 - "mm/vmalloc: free unused pages on vrealloc() shrink" (Shivam Kalra)
   Implement the TODO in vrealloc() to unmap and free unused pages when shrinking across a page boundary

 - "mm/damon: documentation and comment fixes" (niecheng)

 - "remove mmap_action success, error hooks" (Lorenzo Stoakes)
   Eliminate custom hooks from mmap_action by removing the problematic success_hook which allowed drivers to improperly access uninitialized VMAs. It replaces the error_hook with a simple error-code field and updates the memory char driver accordingly

 - "mm/damon: minor improvements for code readability and tests" (SeongJae Park)

 - "mm/damon: fix macro arguments and clarify quota goals doc" (Maksym Shcherba)

 - "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c" (Mike Rapoport)

 - "mm/mglru: improve reclaim loop and dirty folio" (Kairui Song and others)
   Clean up and slightly improve MGLRU's reclaim loop and dirty writeback handling. Large performance improvements are measured

 - "use vma locks for proc/pid/{smaps|numa_maps} reads" (Suren Baghdasaryan)
   Use per-vma locks when reading /proc/pid/smaps and numa_maps to reduce contention on the central mmap_lock

 - "refactors thpsize_shmem_enabled_store() and thpsize_shmem_enabled_show()" (Ran Xiaokai)
   Some cleanup work in the THP code

 - "selftests/memfd: fix compilation warnings" (Konstantin Khorenko)
   Fix a few build glitches in the memfd selftest code.

 - "memcg: shrink obj_stock_pcp and cache multiple objcgs" (Shakeel Butt)
   Resolve a 68% performance regression caused by NUMA-node cache thrashing around struct obj_stock_pcp by shrinking its existing fields and expanding it into a multi-slot array that caches up to five obj_cgroup pointers per CPU, allowing per-node variants of the same memcg to coexist within a single 64-byte cache line.

 - "zram: writeback fixes" (Sergey Senozhatsky)
   Address a couple of unrelated zram writeback issues

 - "mm: switch THP shrinker to list_lru" (Johannes Weiner)
   Resolve NUMA-awareness issues and streamline callsite interaction by refactoring and extending the list_lru API to completely replace the complex, open-coded deferred split queue for Transparent Huge Pages

 - "mm: improve large folio readahead for exec memory" (Usama Arif)
   Improve large-folio readahead on systems like 64K-page arm64 by preventing the mmap_miss check from permanently disabling target-oriented VM_EXEC readahead, and by generalizing the force_thp_readahead gate to support mappings with any usefully large maximum folio order under the cache cap.

 - "userfaultfd/pagemap: pre-existing fixes" (Kiryl Shutsemau)
   Fix a bunch of minor issues in the userfaultfd/pagemap, all of which were flagged by Sashiko review of proposed new material

 - "mm/sparse-vmemmap: Provide generic vmemmap_set_pmd() and vmemmap_check_pmd()" (Muchun Song)
   Provide generic versions of these two functions so the four arch-specific implementations can be removed.

 - "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device" (Youngjun Park)
   Address a uswsusp-vs-swapoff race and reduce the swap device reference taking/releasing frequency.

 - "mm/hmm: A fix and a selftest" (Dev Jain)

* tag 'mm-stable-2026-06-18-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  selftests/mm/hmm-tests: test pagemap reads of PMD device-private entries
  fs/proc/task_mmu: do not warn on seeing non-migration pmd entry
  lib/test_hmm: check alloc_page_vma() return value and handle OOM
  mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  mm/swap: remove redundant swap device reference in alloc/free
  mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device
  mm/filemap: use folio_next_index() for start
  vmalloc: fix NULL pointer dereference in is_vm_area_hugepages()
  sparc/mm: drop vmemmap_check_pmd helper and use generic code
  loongarch/mm: drop vmemmap_check_pmd helper and use generic code
  riscv/mm: drop vmemmap_pmd helpers and use generic code
  arm64/mm: drop vmemmap_pmd helpers and use generic code
  mm/sparse-vmemmap: provide generic vmemmap_set_pmd() and vmemmap_check_pmd()
  rust: page: mark Page::nid as inline
  userfaultfd: build __VMA_UFFD_FLAGS from config-gated masks
  userfaultfd: gate must_wait writability check on pte_present()
  mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade
  fs/proc/task_mmu: fix hugetlb self-deadlock in pagemap_scan_pte_hole()
  fs/proc/task_mmu: use huge_page_size() in pagemap_scan_hugetlb_entry()
  fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race
  ...
Diffstat (limited to 'mm/vmalloc.c')
-rw-r--r-- mm/vmalloc.c 130
1 files changed, 107 insertions, 23 deletions
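The vrealloc() shrink hunk in the diff below decides whether tail pages can be released by comparing page counts, not byte sizes. The following is a minimal user-space sketch of that arithmetic only, not the kernel code: every `sketch_` name and the fixed 4 KiB page size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page geometry; the kernel takes PAGE_SHIFT from the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Round a byte count up to whole pages, like PAGE_ALIGN(size) >> PAGE_SHIFT. */
static size_t sketch_pages_for(size_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Number of tail pages a shrink could release: zero unless the new size
 * needs fewer pages than are currently backed.
 */
static size_t sketch_tail_pages_to_free(size_t cur_nr_pages, size_t new_size)
{
	size_t new_nr_pages = sketch_pages_for(new_size);

	return new_nr_pages < cur_nr_pages ? cur_nr_pages - new_nr_pages : 0;
}

/*
 * Byte offsets [*start, *end) of the tail relative to the area base,
 * mirroring the bounds the hunk passes to vunmap_range().
 */
static void sketch_tail_range(size_t new_nr_pages, size_t old_nr_pages,
			      size_t *start, size_t *end)
{
	assert(new_nr_pages <= old_nr_pages);
	*start = new_nr_pages << SKETCH_PAGE_SHIFT;
	*end = old_nr_pages << SKETCH_PAGE_SHIFT;
}
```

Under these assumptions, shrinking a four-page area to 8192 bytes leaves two tail pages to release, while shrinking it to 12289 bytes releases nothing because four pages are still needed. The real hunk additionally skips the release for huge-page areas, for VM_FLUSH_RESET_PERMS and VM_USERMAP areas, and when the GFP flags lack I/O or FS permission.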
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index bb6ae08d18f5..1afca3568b9b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3416,6 +3416,32 @@ void vfree_atomic(const void *addr)
schedule_work(&p->wq);
}
+/*
+ * vm_area_free_pages - free a range of pages from a vmalloc allocation
+ * @vm: the vm_struct containing the pages
+ * @start_idx: first page index to free (inclusive)
+ * @end_idx: last page index to free (exclusive)
+ *
+ * Free pages [start_idx, end_idx) updating NR_VMALLOC stat accounting.
+ * Freed vm->pages[] entries are set to NULL.
+ * Caller is responsible for unmapping (vunmap_range) and KASAN
+ * poisoning before calling this.
+ */
+static void vm_area_free_pages(struct vm_struct *vm, unsigned int start_idx,
+ unsigned int end_idx)
+{
+ unsigned int i;
+
+ if (!(vm->flags & VM_MAP_PUT_PAGES)) {
+ for (i = start_idx; i < end_idx; i++)
+ mod_lruvec_page_state(vm->pages[i], NR_VMALLOC, -1);
+ }
+ free_pages_bulk(vm->pages + start_idx, end_idx - start_idx);
+
+ for (i = start_idx; i < end_idx; i++)
+ vm->pages[i] = NULL;
+}
+
/**
* vfree - Release memory allocated by vmalloc()
* @addr: Memory base address
@@ -3436,7 +3462,6 @@ void vfree_atomic(const void *addr)
void vfree(const void *addr)
{
struct vm_struct *vm;
- int i;
if (unlikely(in_interrupt())) {
vfree_atomic(addr);
@@ -3459,19 +3484,8 @@ void vfree(const void *addr)
if (unlikely(vm->flags & VM_FLUSH_RESET_PERMS))
vm_reset_perms(vm);
- for (i = 0; i < vm->nr_pages; i++) {
- struct page *page = vm->pages[i];
- BUG_ON(!page);
- /*
- * High-order allocs for huge vmallocs are split, so
- * can be freed as an array of order-0 allocations
- */
- if (!(vm->flags & VM_MAP_PUT_PAGES))
- mod_lruvec_page_state(page, NR_VMALLOC, -1);
- __free_page(page);
- cond_resched();
- }
+ vm_area_free_pages(vm, 0, vm->nr_pages);
kvfree(vm->pages);
kfree(vm);
}
@@ -3939,7 +3953,7 @@ fail:
__GFP_NOFAIL | __GFP_ZERO |\
__GFP_NORETRY | __GFP_RETRY_MAYFAIL |\
GFP_NOFS | GFP_NOIO | GFP_KERNEL_ACCOUNT |\
- GFP_USER | __GFP_NOLOCKDEP)
+ GFP_USER | __GFP_NOLOCKDEP | __GFP_SKIP_KASAN)
static gfp_t vmalloc_fix_flags(gfp_t flags)
{
@@ -3980,6 +3994,9 @@ static gfp_t vmalloc_fix_flags(gfp_t flags)
*
* %__GFP_NOWARN can be used to suppress failure messages.
*
+ * %__GFP_SKIP_KASAN can be used to skip unpoisoning of mapped pages
+ * (when prot=%PAGE_KERNEL).
+ *
* Can not be called from interrupt nor NMI contexts.
* Return: the address of the area or %NULL on failure
*/
@@ -3993,6 +4010,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
kasan_vmalloc_flags_t kasan_flags = KASAN_VMALLOC_NONE;
unsigned long original_align = align;
unsigned int shift = PAGE_SHIFT;
+ bool skip_vmalloc_kasan = kasan_hw_tags_enabled() && (gfp_mask & __GFP_SKIP_KASAN);
if (WARN_ON_ONCE(!size))
return NULL;
@@ -4023,7 +4041,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
again:
area = __get_vm_area_node(size, align, shift, VM_ALLOC |
VM_UNINITIALIZED | vm_flags, start, end, node,
- gfp_mask, caller);
+ gfp_mask & ~__GFP_SKIP_KASAN, caller);
if (!area) {
bool nofail = gfp_mask & __GFP_NOFAIL;
warn_alloc(gfp_mask, NULL,
@@ -4041,7 +4059,7 @@ again:
* kasan_unpoison_vmalloc().
*/
if (pgprot_val(prot) == pgprot_val(PAGE_KERNEL)) {
- if (kasan_hw_tags_enabled()) {
+ if (kasan_hw_tags_enabled() && !skip_vmalloc_kasan) {
/*
* Modify protection bits to allow tagging.
* This must be done before mapping.
@@ -4078,7 +4096,8 @@ again:
(gfp_mask & __GFP_SKIP_ZERO))
kasan_flags |= KASAN_VMALLOC_INIT;
/* KASAN_VMALLOC_PROT_NORMAL already set if required. */
- area->addr = kasan_unpoison_vmalloc(area->addr, size, kasan_flags);
+ if (!skip_vmalloc_kasan)
+ area->addr = kasan_unpoison_vmalloc(area->addr, size, kasan_flags);
/*
* In this function, newly allocated vm_struct has VM_UNINITIALIZED
@@ -4324,16 +4343,70 @@ void *vrealloc_node_align_noprof(const void *p, size_t size, unsigned long align
if (unlikely(flags & __GFP_THISNODE) && nid != NUMA_NO_NODE &&
nid != page_to_nid(vmalloc_to_page(p)))
goto need_realloc;
+ } else {
+ /*
+ * If p is NULL, vrealloc behaves exactly like vmalloc.
+ * Skip the shrink and in-place grow paths.
+ */
+ goto need_realloc;
}
- /*
- * TODO: Shrink the vm_area, i.e. unmap and free unused pages. What
- * would be a good heuristic for when to shrink the vm_area?
- */
if (size <= old_size) {
+ unsigned int new_nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
/* Zero out "freed" memory, potentially for future realloc. */
if (want_init_on_free() || want_init_on_alloc(flags))
memset((void *)p + size, 0, old_size - size);
+
+ /*
+ * Free tail pages when shrink crosses a page boundary.
+ *
+ * Skip huge page allocations (page_order > 0) as partial
+ * freeing would require splitting.
+ *
+ * Skip VM_FLUSH_RESET_PERMS, as direct-map permissions must
+ * be reset before pages are returned to the allocator.
+ *
+ * Skip VM_USERMAP, as remap_vmalloc_range_partial() validates
+ * mapping requests against the unchanged vm->size; freeing
+ * tail pages would cause vmalloc_to_page() to return NULL for
+ * the unmapped range.
+ *
+ * Skip if either GFP_NOFS or GFP_NOIO are used.
+ * kmemleak_free_part() internally allocates with
+ * GFP_KERNEL, which could trigger a recursive deadlock
+ * if we are under filesystem or I/O reclaim.
+ */
+ if (new_nr_pages < vm->nr_pages && !vm_area_page_order(vm) &&
+ !(vm->flags & (VM_FLUSH_RESET_PERMS | VM_USERMAP)) &&
+ gfp_has_io_fs(flags)) {
+ unsigned long addr = (unsigned long)kasan_reset_tag(p);
+ unsigned int old_nr_pages = vm->nr_pages;
+
+ /*
+ * Use the node lock to synchronize with concurrent
+ * readers (vmalloc_info_show).
+ */
+ struct vmap_node *vn = addr_to_node(addr);
+
+ spin_lock(&vn->busy.lock);
+ vm->nr_pages = new_nr_pages;
+ spin_unlock(&vn->busy.lock);
+
+ /* Notify kmemleak of the reduced allocation size before unmapping. */
+ kmemleak_free_part(
+ (void *)addr + ((unsigned long)new_nr_pages
+ << PAGE_SHIFT),
+ (unsigned long)(old_nr_pages - new_nr_pages)
+ << PAGE_SHIFT);
+
+ vunmap_range(addr + ((unsigned long)new_nr_pages
+ << PAGE_SHIFT),
+ addr + ((unsigned long)old_nr_pages
+ << PAGE_SHIFT));
+
+ vm_area_free_pages(vm, new_nr_pages, old_nr_pages);
+ }
vm->requested_size = size;
kasan_vrealloc(p, old_size, size);
return (void *)p;
@@ -4342,7 +4415,7 @@ void *vrealloc_node_align_noprof(const void *p, size_t size, unsigned long align
/*
* We already have the bytes available in the allocation; use them.
*/
- if (size <= alloced_size) {
+ if (size <= vm->nr_pages << PAGE_SHIFT) {
/*
* No need to zero memory here, as unused memory will have
* already been zeroed at initial allocation time or during
@@ -4641,7 +4714,18 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
smp_rmb();
vaddr = (char *) va->va_start;
- size = vm ? get_vm_area_size(vm) : va_size(va);
+ if (vm)
+ /*
+ * For VM_ALLOC areas, use nr_pages rather than
+ * get_vm_area_size() because vrealloc() may shrink
+ * the mapping without updating area->size. Other
+ * mapping types (vmap, ioremap) don't set nr_pages.
+ */
+ size = (vm->flags & VM_ALLOC && vm->nr_pages) ?
+ (vm->nr_pages << PAGE_SHIFT) :
+ get_vm_area_size(vm);
+ else
+ size = va_size(va);
if (addr >= vaddr + size)
goto next_va;