<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/mm/filemap.c, branch v6.18.21</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.18.21'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-19T15:08:49+00:00</updated>
<entry>
<title>mm: Fix a hmm_range_fault() livelock / starvation problem</title>
<updated>2026-03-19T15:08:49+00:00</updated>
<author>
<name>Thomas Hellström</name>
<email>thomas.hellstrom@linux.intel.com</email>
</author>
<published>2026-02-10T11:56:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=94b6d0ba4b640ba23bb6c708a59316e74e5ede63'/>
<id>urn:sha1:94b6d0ba4b640ba23bb6c708a59316e74e5ede63</id>
<content type='text'>
commit b570f37a2ce480be26c665345c5514686a8a0274 upstream.

If hmm_range_fault() fails a folio_trylock() in do_swap_page() while
trying to acquire the lock of a device-private folio for migration to
RAM, the function will spin until it succeeds in grabbing the lock.

However, if the process holding the lock depends on the completion of a
work item that is scheduled on the same CPU as the spinning
hmm_range_fault(), that work item might be starved and we end up in a
livelock / starvation situation that is never resolved.

This can happen, for example, if the process holding the
device-private folio lock is stuck in
   migrate_device_unmap()-&gt;lru_add_drain_all()
since lru_add_drain_all() requires a short work item
to be run on all online CPUs to complete.
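
For reference, a simplified sketch of what lru_add_drain_all() does in
mainline (the real code filters CPUs through a cpumask and serializes
on a mutex; only the part relevant to the starvation is shown):

  /* Queue a short work item on every online CPU... */
  for_each_online_cpu(cpu) {
          struct work_struct *work = &amp;per_cpu(lru_add_drain_work, cpu);

          INIT_WORK(work, lru_add_drain_per_cpu);
          queue_work_on(cpu, mm_percpu_wq, work);
  }

  /* ...and wait for all of them.  If one of those CPUs is busy
   * spinning in hmm_range_fault(), this flush never completes. */
  for_each_online_cpu(cpu)
          flush_work(&amp;per_cpu(lru_add_drain_work, cpu));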

A prerequisite for this to happen is:
a) Both zone device and system memory folios are considered in
   migrate_device_unmap(), so that there is a reason to call
   lru_add_drain_all() for a system memory folio while a
   folio lock is held on a zone device folio.
b) The zone device folio has an initial mapcount &gt; 1 which causes
   at least one migration PTE entry insertion to be deferred to
   try_to_migrate(), which can happen after the call to
   lru_add_drain_all().
c) No preemption, or voluntary preemption only.

This all seems pretty unlikely to happen, but it is indeed hit by
the "xe_exec_system_allocator" igt test.

Resolve this by waiting for the folio to be unlocked if the
folio_trylock() fails in do_swap_page().

Rename migration_entry_wait_on_locked() to
softleaf_entry_wait_on_locked() and update its documentation to
indicate the new use-case.
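
A minimal sketch of the resulting do_swap_page() flow; the control flow
and argument details here are illustrative, not the exact upstream diff:

  if (!folio_trylock(folio)) {
          /*
           * Instead of failing the fault and letting
           * hmm_range_fault() retry in a tight loop, sleep until
           * the holder unlocks the folio, then retry the fault.
           */
          softleaf_entry_wait_on_locked(entry, vmf-&gt;ptl);
          return 0;
  }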

Future code improvements might consider moving the
lru_add_drain_all() call in migrate_device_unmap() so that it is
called *after* all pages have had migration entries inserted.
That would also eliminate b) above.

v2:
- Instead of a cond_resched() in hmm_range_fault(),
  eliminate the problem by waiting for the folio to be unlocked
  in do_swap_page() (Alistair Popple, Andrew Morton)
v3:
- Add a stub migration_entry_wait_on_locked() for the
  !CONFIG_MIGRATION case. (Kernel Test Robot)
v4:
- Rename migration_entry_wait_on_locked() to
  softleaf_entry_wait_on_locked() and update docs (Alistair Popple)
v5:
- Add a WARN_ON_ONCE() for the !CONFIG_MIGRATION
  version of softleaf_entry_wait_on_locked().
- Modify wording around function names in the commit message
  (Andrew Morton)

Suggested-by: Alistair Popple &lt;apopple@nvidia.com&gt;
Fixes: 1afaeb8293c9 ("mm/migrate: Trylock device page in do_swap_page")
Cc: Ralph Campbell &lt;rcampbell@nvidia.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
Cc: Jason Gunthorpe &lt;jgg@ziepe.ca&gt;
Cc: Leon Romanovsky &lt;leon@kernel.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Matthew Brost &lt;matthew.brost@intel.com&gt;
Cc: John Hubbard &lt;jhubbard@nvidia.com&gt;
Cc: Alistair Popple &lt;apopple@nvidia.com&gt;
Cc: linux-mm@kvack.org
Cc: &lt;dri-devel@lists.freedesktop.org&gt;
Signed-off-by: Thomas Hellström &lt;thomas.hellstrom@linux.intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt; # v6.15+
Reviewed-by: John Hubbard &lt;jhubbard@nvidia.com&gt; #v3
Reviewed-by: Alistair Popple &lt;apopple@nvidia.com&gt;
Link: https://patch.msgid.link/20260210115653.92413-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit a69d1ab971a624c6f112cea61536569d579c3215)
Signed-off-by: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
Signed-off-by: Thomas Hellström &lt;thomas.hellstrom@linux.intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>mm/filemap: fix logic around SIGBUS in filemap_map_pages()</title>
<updated>2025-11-24T22:25:18+00:00</updated>
<author>
<name>Kiryl Shutsemau</name>
<email>kas@kernel.org</email>
</author>
<published>2025-11-20T16:14:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=7c9580f44f90f7a4c11fc7831efe323ebe446091'/>
<id>urn:sha1:7c9580f44f90f7a4c11fc7831efe323ebe446091</id>
<content type='text'>
Chris noticed that filemap_map_pages() calculates can_map_large only once,
for the first page in the fault-around range.  The value is not valid for
the following pages in the range and must be recalculated.

Instead of recalculating can_map_large on each iteration, pass down
file_end to filemap_map_folio_range() and let it make the decision on what
can be mapped.
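
The shape of the fix, sketched (illustrative naming, not the exact diff):

  /* filemap_map_pages(): compute the last mappable page once... */
  pgoff_t file_end = DIV_ROUND_UP(i_size_read(mapping-&gt;host),
                                  PAGE_SIZE) - 1;

  /* ...and let filemap_map_folio_range() decide per folio, so the
   * answer is valid for every folio in the fault-around range, not
   * just the first one: */
  bool can_map_large = shmem_mapping(mapping) ||
                       file_end &gt;= folio_next_index(folio) - 1;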

Link: https://lkml.kernel.org/r/20251120161411.859078-1-kirill@shutemov.name
Fixes: 74207de2ba10 ("mm/memory: do not populate page table entries beyond i_size")
Signed-off-by: Kiryl Shutsemau &lt;kas@kernel.org&gt;
Reported-by: Chris Mason &lt;clm@meta.com&gt;
Reviewed-by: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Chris Mason &lt;clm@meta.com&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: "Darrick J. Wong" &lt;djwong@kernel.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mike Rapoport &lt;rppt@kernel.org&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory: do not populate page table entries beyond i_size</title>
<updated>2025-11-10T05:19:43+00:00</updated>
<author>
<name>Kiryl Shutsemau</name>
<email>kas@kernel.org</email>
</author>
<published>2025-10-27T11:56:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=74207de2ba10c2973334906822dc94d2e859ffc5'/>
<id>urn:sha1:74207de2ba10c2973334906822dc94d2e859ffc5</id>
<content type='text'>
Patch series "Fix SIGBUS semantics with large folios", v3.

Accessing memory within a VMA, but beyond i_size rounded up to the next
page size, is supposed to generate SIGBUS.

Darrick reported[1] an xfstests regression in v6.18-rc1.  generic/749
failed due to missing SIGBUS.  This was caused by my recent changes that
try to fault in the whole folio where possible:

        19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
        357b92761d94 ("mm/filemap: map entire large folio faultaround")

These changes did not consider i_size when setting up PTEs, leading to
xfstest breakage.

However, the problem has been present in the kernel for a long time -
since huge tmpfs was introduced in 2016.  The kernel happily maps
PMD-sized folios with PMDs without checking i_size.  And huge=always
tmpfs allocates PMD-size folios on any write.

I considered this corner case when I implemented a large tmpfs, and my
conclusion was that no one in their right mind should rely on receiving a
SIGBUS signal when accessing beyond i_size.  I cannot imagine how it could
be useful for any workload.

But apparently filesystem folks care a lot about preserving strict SIGBUS
semantics.

Generic/749 was introduced last year with reference to POSIX, but no real
workloads were mentioned.  It also acknowledged the tmpfs deviation from
the test case.

POSIX indeed says[3]:

        References within the address range starting at pa and
        continuing for len bytes to whole pages following the end of an
        object shall result in delivery of a SIGBUS signal.

The patchset fixes the regression introduced by recent changes as well as
more subtle SIGBUS breakage due to split failure on truncation.


This patch (of 2):

Accesses within a VMA, but beyond i_size rounded up to PAGE_SIZE, are
supposed to generate SIGBUS.

Recent changes attempted to fault in the full folio where possible.  They
did not respect i_size, which led to populating PTEs beyond i_size and
breaking SIGBUS semantics.

Darrick reported generic/749 breakage because of this.

However, the problem existed before the recent changes.  With huge=always
tmpfs, any write to a file leads to a PMD-size allocation.  The following
fault-in of the folio will install a PMD mapping regardless of i_size.

Fix filemap_map_pages() and finish_fault() to not install:
  - PTEs beyond i_size;
  - PMD mappings across i_size;

Make an exception for shmem/tmpfs, which has long been intentionally
mapped with PMDs across i_size.
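
A minimal sketch of the resulting check, with illustrative naming:

  struct address_space *mapping = folio-&gt;mapping;
  pgoff_t file_end = DIV_ROUND_UP(i_size_read(mapping-&gt;host), PAGE_SIZE);

  /* shmem/tmpfs keeps its historical behaviour of mapping PMDs
   * across i_size; everything else must not install a PMD covering
   * pages beyond i_size, and the PTE fallback must stop at
   * file_end. */
  if (!shmem_mapping(mapping) &amp;&amp; file_end &lt; folio_next_index(folio))
          return false;    /* cannot map the folio with a PMD */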

Link: https://lkml.kernel.org/r/20251027115636.82382-1-kirill@shutemov.name
Link: https://lkml.kernel.org/r/20251027115636.82382-2-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau &lt;kas@kernel.org&gt;
Fixes: 6795801366da ("xfs: Support large folios")
Reported-by: "Darrick J. Wong" &lt;djwong@kernel.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Liam Howlett &lt;liam.howlett@oracle.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mike Rapoport &lt;rppt@kernel.org&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs</title>
<updated>2025-10-03T21:20:40+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-10-03T21:20:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=070a542f08acb7e8cf197287f5c44658c715d2d1'/>
<id>urn:sha1:070a542f08acb7e8cf197287f5c44658c715d2d1</id>
<content type='text'>
Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Add a Kconfig option to redirect dfprintk() to the trace buffer
   - Enable use of the RWF_DONTCACHE flag on the NFS client
   - Add striped layout handling to pNFS flexfiles
   - Add proper localio handling for READ and WRITE O_DIRECT

  Bugfixes:
   - Handle NFS4ERR_GRACE errors during delegation recall
   - Fix NFSv4.1 backchannel max_resp_sz verification check
   - Fix mount hang after CREATE_SESSION failure
   - Fix d_parent-&gt;d_inode locking in nfs4_setup_readdir()

  Other Cleanups and Improvements:
   - Improvements to write handling tracepoints
   - Fix a few trivial spelling mistakes
   - Cleanups to the rpcbind cleanup call sites
   - Convert the SUNRPC xdr_buf to use a scratch folio instead of
     scratch page
   - Remove unused NFS_WBACK_BUSY() macro
   - Remove __GFP_NOWARN flags
   - Unexport rpc_malloc() and rpc_free()"

* tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (46 commits)
  NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  nfs/localio: add tracepoints for misaligned DIO READ and WRITE support
  nfs/localio: add proper O_DIRECT support for READ and WRITE
  nfs/localio: refactor iocb initialization
  nfs/localio: refactor iocb and iov_iter_bvec initialization
  nfs/localio: avoid issuing misaligned IO using O_DIRECT
  nfs/localio: make trace_nfs_local_open_fh more useful
  NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
  sunrpc: unexport rpc_malloc() and rpc_free()
  NFSv4/flexfiles: Add support for striped layouts
  NFSv4/flexfiles: Update layout stats &amp; error paths for striped layouts
  NFSv4/flexfiles: Write path updates for striped layouts
  NFSv4/flexfiles: Commit path updates for striped layouts
  NFSv4/flexfiles: Read path updates for striped layouts
  NFSv4/flexfiles: Update low level helper functions to be DS stripe aware.
  NFSv4/flexfiles: Add data structure support for striped layouts
  NFSv4/flexfiles: Use ds_commit_idx when marking a write commit
  NFSv4/flexfiles: Remove cred local variable dependency
  nfs4_setup_readdir(): insufficient locking for -&gt;d_parent-&gt;d_inode dereferencing
  NFS: Enable use of the RWF_DONTCACHE flag on the NFS client
  ...
</content>
</entry>
<entry>
<title>mm/filemap: map entire large folio faultaround</title>
<updated>2025-09-28T18:51:30+00:00</updated>
<author>
<name>Kiryl Shutsemau</name>
<email>kas@kernel.org</email>
</author>
<published>2025-09-23T11:07:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=357b92761d942432c90aeeb965f9eb0c94466921'/>
<id>urn:sha1:357b92761d942432c90aeeb965f9eb0c94466921</id>
<content type='text'>
Currently, the kernel only maps the part of a large folio that fits into
the start_pgoff/end_pgoff range.

Map the entire folio where possible.  This matches the finish_fault()
behaviour that the user hits on a cold page cache.

Mapping a large folio at once will allow the rmap code to mlock it on add,
as it will recognize that the folio is fully mapped and that mlocking is
safe.
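
Sketched, the change widens the mapped range from the faultaround window
to the folio bounds where the VMA covers them (illustrative only):

  /* Previously the range was clamped to [start_pgoff, end_pgoff];
   * now map the whole folio, mirroring finish_fault(): */
  nr_pages = folio_nr_pages(folio);
  addr = vma-&gt;vm_start +
         ((folio-&gt;index - vma-&gt;vm_pgoff) &lt;&lt; PAGE_SHIFT);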

Link: https://lkml.kernel.org/r/20250923110711.690639-6-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau &lt;kas@kernel.org&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>filemap: Add a version of folio_end_writeback that ignores dropbehind</title>
<updated>2025-09-23T17:29:50+00:00</updated>
<author>
<name>Trond Myklebust</name>
<email>trond.myklebust@hammerspace.com</email>
</author>
<published>2025-09-06T16:48:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=010054a530aa266ee1711dfbe23fc06b6eb0fa48'/>
<id>urn:sha1:010054a530aa266ee1711dfbe23fc06b6eb0fa48</id>
<content type='text'>
Filesystems such as NFS may need to defer dropbehind until after their
two-stage writes are done.  This adds a helper,
folio_end_writeback_no_dropbehind(), that allows them to release the
writeback flag without immediately dropping the folio.
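
A sketch of the intended two-stage call pattern on the NFS side; the
second helper name is hypothetical:

  /* Stage 1: the unstable WRITE completes.  End writeback so that
   * waiters can make progress, but keep the dropbehind folio cached
   * until the data has been committed. */
  folio_end_writeback_no_dropbehind(folio);

  /* Stage 2: the COMMIT completes; only now try to free the folio,
   * e.g. via the helper added in the companion patch below
   * (hypothetical caller name): */
  nfs_end_dropbehind(folio);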

Signed-off-by: Trond Myklebust &lt;trond.myklebust@hammerspace.com&gt;
Signed-off-by: Anna Schumaker &lt;anna.schumaker@oracle.com&gt;
</content>
</entry>
<entry>
<title>filemap: Add a helper for filesystems implementing dropbehind</title>
<updated>2025-09-23T17:29:50+00:00</updated>
<author>
<name>Trond Myklebust</name>
<email>trond.myklebust@hammerspace.com</email>
</author>
<published>2025-09-06T16:48:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=24bbd533f596a4544e17579e9f622918680e7bff'/>
<id>urn:sha1:24bbd533f596a4544e17579e9f622918680e7bff</id>
<content type='text'>
Add a helper to allow filesystems to attempt to free the 'dropbehind'
folio.

Signed-off-by: Trond Myklebust &lt;trond.myklebust@hammerspace.com&gt;
Link: https://lore.kernel.org/all/5588a06f6d5a2cf6746828e2d36e7ada668b1739.1745381692.git.trond.myklebust@hammerspace.com/
Reviewed-by: Mike Snitzer &lt;snitzer@kernel.org&gt;
Signed-off-by: Anna Schumaker &lt;anna.schumaker@oracle.com&gt;
</content>
</entry>
<entry>
<title>mm, swap: cleanup swap cache API and add kerneldoc</title>
<updated>2025-09-21T21:22:23+00:00</updated>
<author>
<name>Kairui Song</name>
<email>kasong@tencent.com</email>
</author>
<published>2025-09-16T16:00:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fd8d4f862f8c278fd1f5b61cef20056e88d8dfa5'/>
<id>urn:sha1:fd8d4f862f8c278fd1f5b61cef20056e88d8dfa5</id>
<content type='text'>
In preparation for replacing the swap cache backend with the swap table,
clean up and add proper kernel doc for all swap cache APIs.  Now all swap
cache APIs are well-defined with consistent names.

No feature change, only renaming and documenting.

Link: https://lkml.kernel.org/r/20250916160100.31545-9-ryncsn@gmail.com
Signed-off-by: Kairui Song &lt;kasong@tencent.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Reviewed-by: Barry Song &lt;baohua@kernel.org&gt;
Reviewed-by: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Suggested-by: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@linux.alibaba.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kemeng Shi &lt;shikemeng@huaweicloud.com&gt;
Cc: kernel test robot &lt;oliver.sang@intel.com&gt;
Cc: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Cc: SeongJae Park &lt;sj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>filemap: optimize folio refount update in filemap_map_pages</title>
<updated>2025-09-21T21:22:20+00:00</updated>
<author>
<name>Jinjiang Tu</name>
<email>tujinjiang@huawei.com</email>
</author>
<published>2025-09-04T13:27:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0faa77afe72b0705cdba8a59a0969e20300f6548'/>
<id>urn:sha1:0faa77afe72b0705cdba8a59a0969e20300f6548</id>
<content type='text'>
There are two meaningless folio refcount updates for order-0 folios in
filemap_map_pages().  First, filemap_map_order0_folio() adds a folio
refcount after the folio is mapped to the PTE.  Then, filemap_map_pages()
drops the refcount grabbed by next_uptodate_folio().  We can leave the
refcount unchanged in this case.

As Matthew mentioned in [1], it is safe to call folio_unlock() before
calling folio_put() here, because the folio is in the page cache with a
refcount held, and truncation will wait for the unlock.

Optimize filemap_map_folio_range() with the same method too.
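
Sketched, the resulting order-0 path looks like this (illustrative):

  /* The reference grabbed by next_uptodate_folio() is handed over
   * to the new PTE mapping, so no folio_ref_inc()/folio_put() pair
   * is needed. */
  set_pte_at(vma-&gt;vm_mm, addr, vmf-&gt;pte, entry);

  /* Per [1], unlocking before dropping our reference is safe: the
   * folio is in the page cache with a refcount held, and truncation
   * waits for the unlock. */
  folio_unlock(folio);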

With this patch, we get an 8% performance gain for the lmbench testcase
'lat_pagefault -P 1 file' in the order-0 folio case, with a 512M file.


Link: https://lkml.kernel.org/r/20250904132737.1250368-1-tujinjiang@huawei.com
Link: https://lore.kernel.org/all/aKcU-fzxeW3xT5Wv@casper.infradead.org/ [1]
Signed-off-by: Jinjiang Tu &lt;tujinjiang@huawei.com&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/filemap: align last_index to folio size</title>
<updated>2025-09-21T21:22:15+00:00</updated>
<author>
<name>Youling Tang</name>
<email>tangyouling@kylinos.cn</email>
</author>
<published>2025-07-11T05:55:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9fd53c8122271d9fe8b687f50a9bdf5588d41d0b'/>
<id>urn:sha1:9fd53c8122271d9fe8b687f50a9bdf5588d41d0b</id>
<content type='text'>
On XFS systems with pagesize=4K, blocksize=16K, and
CONFIG_TRANSPARENT_HUGEPAGE enabled, we observed the following readahead
behaviors:

 # echo 3 &gt; /proc/sys/vm/drop_caches
 # dd if=test of=/dev/null bs=64k count=1
 # ./tools/mm/page-types -r -L -f  /mnt/xfs/test
 foffset	offset	flags
 0	136d4c	__RU_l_________H______t_________________F_1
 1	136d4d	__RU_l__________T_____t_________________F_1
 2	136d4e	__RU_l__________T_____t_________________F_1
 3	136d4f	__RU_l__________T_____t_________________F_1
 ...
 c	136bb8	__RU_l_________H______t_________________F_1
 d	136bb9	__RU_l__________T_____t_________________F_1
 e	136bba	__RU_l__________T_____t_________________F_1
 f	136bbb	__RU_l__________T_____t_________________F_1   &lt;-- first read
 10	13c2cc	___U_l_________H______t______________I__F_1   &lt;-- readahead flag
 11	13c2cd	___U_l__________T_____t______________I__F_1
 12	13c2ce	___U_l__________T_____t______________I__F_1
 13	13c2cf	___U_l__________T_____t______________I__F_1
 ...
 1c	1405d4	___U_l_________H______t_________________F_1
 1d	1405d5	___U_l__________T_____t_________________F_1
 1e	1405d6	___U_l__________T_____t_________________F_1
 1f	1405d7	___U_l__________T_____t_________________F_1
 [ra_size = 32, req_count = 16, async_size = 16]

 # echo 3 &gt; /proc/sys/vm/drop_caches
 # dd if=test of=/dev/null bs=60k count=1
 # ./page-types -r -L -f  /mnt/xfs/test
 foffset	offset	flags
 0	136048	__RU_l_________H______t_________________F_1
 ...
 c	110a40	__RU_l_________H______t_________________F_1
 d	110a41	__RU_l__________T_____t_________________F_1
 e	110a42	__RU_l__________T_____t_________________F_1   &lt;-- first read
 f	110a43	__RU_l__________T_____t_________________F_1   &lt;-- first readahead flag
 10	13e7a8	___U_l_________H______t_________________F_1
 ...
 20	137a00	___U_l_________H______t_______P______I__F_1   &lt;-- second readahead flag (20 - 2f)
 21	137a01	___U_l__________T_____t_______P______I__F_1
 ...
 3f	10d4af	___U_l__________T_____t_______P_________F_1
 [first readahead: ra_size = 32, req_count = 15, async_size = 17]

When reading 64k of data (the same holds for the 61-63k range, where
last_index is page-aligned in filemap_get_pages()), 128k readahead is
triggered via page_cache_sync_ra() and the PG_readahead flag is set on
the next folio (the one containing the 0x10 page).

When reading 60k of data, 128k readahead is also triggered via
page_cache_sync_ra().  However, in this case the readahead flag is set on
the 0xf page.  Although the requested read size (req_count) is 60k, the
actual read is aligned to the folio size (64k), which trips the readahead
flag and initiates asynchronous readahead via page_cache_async_ra().
This results in two readahead operations totaling 256k.

The root cause is that when the requested size is smaller than the actual
read size (due to folio alignment), asynchronous readahead is triggered.
By changing the last_index alignment from page size to folio size, we
ensure the requested size matches the actual read size, preventing a
single read operation from triggering two readahead operations.
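
A sketch of the resulting alignment in filemap_get_pages() (illustrative;
the actual patch also deals with 32-bit overflow and constified helpers):

  unsigned int order = mapping_min_folio_order(mapping);

  /* Align the end of the request to the minimum folio size instead
   * of PAGE_SIZE, so req_count matches what will actually be read
   * and the sync readahead is not immediately followed by an async
   * one. */
  last_index = round_up(iocb-&gt;ki_pos + count,
                        PAGE_SIZE &lt;&lt; order) &gt;&gt; PAGE_SHIFT;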

After applying the patch:
 # echo 3 &gt; /proc/sys/vm/drop_caches
 # dd if=test of=/dev/null bs=60k count=1
 # ./page-types -r -L -f  /mnt/xfs/test
 foffset	offset	flags
 0	136d4c	__RU_l_________H______t_________________F_1
 1	136d4d	__RU_l__________T_____t_________________F_1
 2	136d4e	__RU_l__________T_____t_________________F_1
 3	136d4f	__RU_l__________T_____t_________________F_1
 ...
 c	136bb8	__RU_l_________H______t_________________F_1
 d	136bb9	__RU_l__________T_____t_________________F_1
 e	136bba	__RU_l__________T_____t_________________F_1   &lt;-- first read
 f	136bbb	__RU_l__________T_____t_________________F_1
 10	13c2cc	___U_l_________H______t______________I__F_1   &lt;-- readahead flag
 11	13c2cd	___U_l__________T_____t______________I__F_1
 12	13c2ce	___U_l__________T_____t______________I__F_1
 13	13c2cf	___U_l__________T_____t______________I__F_1
 ...
 1c	1405d4	___U_l_________H______t_________________F_1
 1d	1405d5	___U_l__________T_____t_________________F_1
 1e	1405d6	___U_l__________T_____t_________________F_1
 1f	1405d7	___U_l__________T_____t_________________F_1
 [ra_size = 32, req_count = 16, async_size = 16]

The same phenomenon will occur when reading from 49k to 64k: the
readahead flag is set on the next folio.

Because the minimum folio order in an address_space corresponds to the
block size (at least in xfs and bcachefs, which already support bs &gt; ps),
aligning req_count to the block size will not cause overread.

[klarasmodin@gmail.com: fix overflow on 32-bit]
  Link: https://lkml.kernel.org/r/yru7qf5gvyzccq5ohhpylvxug5lr5tf54omspbjh4sm6pcdb2r@fpjgj2pxw7va
[akpm@linux-foundation.org: update it for Max's constification efforts]
Link: https://lkml.kernel.org/r/20250711055509.91587-1-youling.tang@linux.dev
Co-developed-by: Chi Zhiling &lt;chizhiling@kylinos.cn&gt;
Signed-off-by: Chi Zhiling &lt;chizhiling@kylinos.cn&gt;
Signed-off-by: Youling Tang &lt;tangyouling@kylinos.cn&gt;
Signed-off-by: Klara Modin &lt;klarasmodin@gmail.com&gt;
Reviewed-by: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Youling Tang &lt;youling.tang@linux.dev&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Klara Modin &lt;klarasmodin@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
