diff options
| author | fujunjie <fujunjie1@qq.com> | 2026-04-28 04:59:43 +0300 |
|---|---|---|
| committer | Andrew Morton <akpm@linux-foundation.org> | 2026-05-29 07:05:06 +0300 |
| commit | 0b9c0aeba938aad9964f855df00bf929b83a484d (patch) | |
| tree | 9bb3ce69ee8fde5849ea4319a41e1abc0a12f9ee /include/linux | |
| parent | 95b8e432265f61bd9ecdce07d76be6182289ac2a (diff) | |
| download | linux-0b9c0aeba938aad9964f855df00bf929b83a484d.tar.xz | |
mm/filemap: count only the faulting address as a mmap hit
Patch series "mm/filemap: tighten mmap_miss hit accounting", v3.
mmap_miss is increased when synchronous mmap readahead is needed, and
decreased when filemap_map_pages() maps folios that are already in the
page cache. The decrease side can over-credit hits in two cases:
- fault-around installs nearby PTEs even though the fault only proves
that the faulting address was accessed;
- after synchronous mmap readahead returns VM_FAULT_RETRY, the retry
can find the folio brought in by the same miss and immediately
cancel that miss.
Current evidence comes from a local KVM/data-disk microbenchmark using
mmap_miss_probe, with an 8 GiB guest, 2 vCPUs, 8192 KiB read_ahead_kb,
cold page cache before each run, 1% of the file accessed, and medians of 3
runs.
mmap_miss_probe mmap()s a prepared file with MADV_NORMAL and then touches
one byte at selected base-page offsets. The access order is random,
sequential, or a fixed page stride. The harness drops caches before each
run and samples /proc/vmstat around that access loop.
The 20 GiB case below is a larger-than-memory file case in an 8 GiB guest.
No separate memory hog was used. The 4 GiB case uses the same 8 GiB
guest but keeps the file fit-in-memory.
Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.
Each result is "pgpgin GiB / elapsed seconds". "pgpgin GiB" is the delta
of the guest /proc/vmstat pgpgin counter, converted from KiB to GiB; it is
used here as an approximate block input counter, not as resident memory or
exact application IO. "Elapsed seconds" is the wall-clock runtime of the
whole mmap_miss_probe access pass, not per-access latency.
For the 20 GiB larger-than-memory case:
workload before after
random 223.377 GiB/101.293s 1.010 GiB/4.790s
stride1021 204.214 GiB/97.557s 204.208 GiB/108.086s
stride2053 409.584 GiB/193.700s 0.970 GiB/3.685s
stride4099 406.452 GiB/134.241s 0.975 GiB/3.499s
sequential 0.212 GiB/0.050s 0.212 GiB/0.057s
For the 4 GiB fit-in-memory case:
workload before after
random 3.987 GiB/1.960s 0.980 GiB/1.221s
stride1021 4.002 GiB/1.838s 4.002 GiB/1.851s
stride2053 3.991 GiB/1.835s 0.811 GiB/0.985s
stride4099 4.001 GiB/1.836s 0.819 GiB/1.037s
sequential 0.056 GiB/0.013s 0.056 GiB/0.018s
The 20 GiB setup also has an ablation. P1 is only the faulting-address
hit accounting change. P2-only is only the FAULT_FLAG_TRIED retry
filter. P1+P2 is the combined accounting change:
workload variant result
random baseline 223.377 GiB/101.293s
random P1 223.268 GiB/98.481s
random P2-only 223.257 GiB/100.091s
random P1+P2 1.010 GiB/4.790s
stride2053 baseline 409.584 GiB/193.700s
stride2053 P1 409.584 GiB/197.645s
stride2053 P2-only 15.722 GiB/5.485s
stride2053 P1+P2 0.970 GiB/3.685s
sequential baseline 0.212 GiB/0.050s
sequential P1 0.212 GiB/0.046s
sequential P2-only 0.212 GiB/0.050s
sequential P1+P2 0.212 GiB/0.057s
After the v2 implementation refactor, only the final P1+P2 shape was rerun
in the same setup. The numbers stayed in line with the v1 P1+P2 rows
above:
workload larger-than-memory case fit-in-memory case
20 GiB file, 1% access 4 GiB file, 1% access
random 1.010 GiB/4.383s 0.980 GiB/1.088s
stride1021 204.216 GiB/105.601s 4.001 GiB/1.783s
stride2053 0.970 GiB/3.760s 0.810 GiB/0.908s
stride4099 0.975 GiB/3.410s 0.818 GiB/0.870s
sequential 0.212 GiB/0.060s 0.056 GiB/0.016s
This does not claim to solve every sparse pattern. The stride1021 rows
are intentionally shown as a boundary: with 8192 KiB read_ahead_kb,
file->f_ra.ra_pages is 2048 base pages, and synchronous mmap read-around
uses a 2048-page window centered around the fault, roughly [index - 1024,
index + 1023]. stride1021 is 1021 * 4 KiB = 4084 KiB, so the next access
lands inside the previous read-around window. About every other access
can be a real faulting-address page-cache hit, and the other half can each
read about 8 MiB. For about 52k accesses in the 20 GiB/1% run, half of
them times 8 MiB is about 205 GiB, matching the observed 204 GiB.
This patch (of 2):
filemap_map_pages() reduces file->f_ra.mmap_miss when fault-around maps
folios that are already present in the page cache. That hit accounting is
too generous because fault-around can install PTEs around the faulting
address even though the fault only proves that the faulting address was
accessed.
Move the mmap_miss update back into filemap_map_pages(), drop the
mmap_miss argument from the helper functions, and decrement mmap_miss only
when the helper return value shows that the faulting address was mapped.
Keep the existing workingset-folio behavior unchanged.
Link: https://lore.kernel.org/tencent_AA501E9A238337BD167E5C2ACF948A1AF308@qq.com
Link: https://lore.kernel.org/tencent_756F151FE66F3D80479A6F982C0AB8569F09@qq.com
Signed-off-by: fujunjie <fujunjie1@qq.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Vishal Moola <vishal.moola@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'include/linux')
0 files changed, 0 insertions, 0 deletions
