summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2022-01-06mm: Convert __ksize() to struct slabMatthew Wilcox (Oracle)2-11/+9
In SLUB, use folios, and struct slab to access slab_cache field. In SLOB, use folios to properly resolve pointers beyond PAGE_SIZE offset of the object. [ vbabka@suse.cz: use folios, and only convert folio_test_slab() == true folios to struct slab ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com>
2022-01-06mm: Convert virt_to_cache() to use struct slabMatthew Wilcox (Oracle)1-4/+4
This function is entirely self-contained, so can be converted from page to slab. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
2022-01-06mm: Convert [un]account_slab_page() to struct slabMatthew Wilcox (Oracle)3-13/+12
Convert the parameter of these functions to struct slab instead of struct page and drop _page from the names. For now their callers just convert page to slab. [ vbabka@suse.cz: replace existing functions instead of calling them ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com>
2022-01-06mm: Split slab into its own typeMatthew Wilcox (Oracle)2-4/+171
Make struct slab independent of struct page. It still uses the underlying memory in struct page for storing slab-specific data, but slab and slub can now be weaned off using struct page directly. Some of the wrapper functions (slab_address() and slab_order()) still need to cast to struct folio, but this is a significant disentanglement. [ vbabka@suse.cz: Rebase on folios, use folio instead of page where possible. Do not duplicate flags field in struct slab, instead make the related accessors go through slab_folio(). For testing pfmemalloc use the folio_*_active flag accessors directly so the PageSlabPfmemalloc wrappers can be removed later. Make folio_slab() expect only folio_test_slab() == true folios and virt_to_slab() return NULL when folio_test_slab() == false. Move struct slab to mm/slab.h. Don't represent with struct slab pages that are not true slab pages, but just a compound page obtained directly rom page allocator (with large kmalloc() for SLUB and SLOB). ] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com>
2022-01-06mm/slub: Make object_err() staticVlastimil Babka1-15/+15
There are no callers outside of mm/slub.c anymore. Move freelist_corrupted() that calls object_err() to avoid a need for forward declaration. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com>
2022-01-06mm/slab: Dissolve slab_map_pages() in its callerVlastimil Babka1-13/+2
The function no longer does what its name and comment suggests, and just sets two struct page fields, which can be done directly in its sole caller. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
2022-01-04truncate: Add truncate_cleanup_folio()Matthew Wilcox (Oracle)1-10/+11
Convert both callers of truncate_cleanup_page() to use truncate_cleanup_folio() instead. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Add filemap_release_folio()Matthew Wilcox (Oracle)2-20/+25
Reimplement try_to_release_page() as a wrapper around filemap_release_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Use a folio in filemap_page_mkwriteMatthew Wilcox (Oracle)1-8/+8
This fixes a bug for tail pages. They always have a NULL mapping, so the check would fail and we would never mark the folio as dirty. Ends up growing the kernel by 19 bytes although there will be fewer calls to compound_head() dynamically. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Use a folio in filemap_map_pagesMatthew Wilcox (Oracle)1-13/+14
Saves 61 bytes due to fewer calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Use folios in next_uptodate_pageMatthew Wilcox (Oracle)1-15/+15
This saves 105 bytes of text. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert page_cache_delete_batch to foliosMatthew Wilcox (Oracle)1-10/+10
Saves one call to compound_head() and reduces text size by 15 bytes. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert filemap_get_pages to use foliosMatthew Wilcox (Oracle)1-12/+12
This saves a few calls to compound_head(), including one in filemap_update_page(). Shrinks the kernel by 78 bytes. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Drop the refcount while waiting for page lockMatthew Wilcox (Oracle)1-38/+5
Commit bd8a1f3655a7 ("mm/filemap: support readpage splitting a page") changed the read_iter path to drop the refcount while waiting for the page lock. However, it missed the same pattern in read_mapping_page() and friends. Use the same pattern in do_read_cache_folio() that is used in filemap_update_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Add read_cache_folio and read_mapping_folioMatthew Wilcox (Oracle)1-46/+51
Reimplement read_cache_page() as a wrapper around read_cache_folio(). Saves over 400 bytes of text from do_read_cache_folio() which more than makes up for the extra 100 bytes of text added to the various wrapper functions. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert filemap_fault to folioMatthew Wilcox (Oracle)1-40/+37
Instead of converting back-and-forth between the actual page and the head page, just convert once at the end of the function where we set the vmf->page. Saves 241 bytes of text, or 15% of the size of filemap_fault(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert do_async_mmap_readahead to take a folioMatthew Wilcox (Oracle)1-7/+8
Call page_cache_async_ra() directly instead of indirecting through page_cache_async_readahead(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04readahead: Convert page_cache_ra_unbounded to foliosMatthew Wilcox (Oracle)1-9/+9
This saves 99 bytes of kernel text. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04readahead: Convert page_cache_async_ra() to take a folioMatthew Wilcox (Oracle)1-3/+3
Using the folio here avoids checking whether it's a tail page. This patch mostly just enables some of the following patches. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert filemap_range_uptodate to foliosMatthew Wilcox (Oracle)1-8/+8
The only caller was already passing a head page, so this simply avoids a call to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert filemap_create_page to folioMatthew Wilcox (Oracle)1-19/+20
This is all internal to filemap and saves 100 bytes of text. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert filemap_read_page to take a folioMatthew Wilcox (Oracle)1-9/+9
One of the callers already had a folio; the other two grow by a few bytes, but filemap_read_page() shrinks by 50 bytes for a net reduction of 27 bytes. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert find_get_pages_contig to foliosMatthew Wilcox (Oracle)1-9/+8
None of the callers of find_get_pages_contig() want tail pages. They all use order-0 pages today, but if they were converted, they'd want folios. So just remove the call to find_subpage() instead of replacing it with folio_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert filemap_get_read_batch to use foliosMatthew Wilcox (Oracle)1-12/+11
The page cache only stores folios, never tail pages. Saves 29 bytes due to removing calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert find_get_entry to return a folioMatthew Wilcox (Oracle)1-74/+72
Convert callers to cope. Saves 580 bytes of kernel text; all five callers are reduced in size. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Add filemap_remove_folio and __filemap_remove_folioMatthew Wilcox (Oracle)2-23/+25
Reimplement __delete_from_page_cache() as a wrapper around __filemap_remove_folio() and delete_from_page_cache() as a wrapper around filemap_remove_folio(). Remove the EXPORT_SYMBOL as delete_from_page_cache() was not used by any in-tree modules. Convert page_cache_free_page() into filemap_free_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert tracing of page cache operations to folioMatthew Wilcox (Oracle)1-4/+5
Pass the folio instead of a page. The page was already implicitly a folio as it accessed page->mapping directly. Add the order of the folio to the tracepoint, as this is important information. Also drop printing the address of the struct page as the pfn provides better information than the struct page address. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Add filemap_unaccount_folio()Matthew Wilcox (Oracle)1-35/+35
Replace unaccount_page_cache_page() with filemap_unaccount_folio(). The bug handling path could be a bit more robust (eg taking into account the mapcounts of tail pages), but it's really never supposed to happen. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Convert page_cache_delete to take a folioMatthew Wilcox (Oracle)1-11/+11
It was already assuming a head page, so this is a straightforward conversion. Convert the one caller to call page_folio(), even though it must currently be passing in a head page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04filemap: Add folio_put_wait_locked()Matthew Wilcox (Oracle)2-23/+25
Convert all three callers of put_and_wait_on_page_locked() to folio_put_wait_locked(). This shrinks the kernel overall by 19 bytes. filemap_update_page() shrinks by 19 bytes while __migration_entry_wait() is unchanged. folio_put_wait_locked() is 14 bytes smaller than put_and_wait_on_page_locked(), but pmd_migration_entry_wait() grows by 14 bytes. It removes the assumption from pmd_migration_entry_wait() that pages cannot be larger than a PMD (which is true today, but may be interesting to explore in the future). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-03mm/writeback: Improve __folio_mark_dirty() commentMatthew Wilcox (Oracle)1-1/+5
Add some notes about how this function needs to be called. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-03filemap: Remove PageHWPoison check from next_uptodate_page()Matthew Wilcox (Oracle)1-2/+0
Pages are individually marked as suffering from hardware poisoning. Checking that the head page is not hardware poisoned doesn't make sense; we might be after a subpage. We check each page individually before we use it, so this was an optimisation gone wrong. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-01mm: vmscan: reduce throttling due to a failure to make progress -fixMel Gorman1-1/+2
Hugh Dickins reported the following My tmpfs swapping load (tweaked to use huge pages more heavily than in real life) is far from being a realistic load: but it was notably slowed down by your throttling mods in 5.16-rc, and this patch makes it well again - thanks. But: it very quickly hit NULL pointer until I changed that last line to if (first_pgdat) consider_reclaim_throttle(first_pgdat, sc); The likely issue is that huge pages are a major component of the test workload. When this is the case, first_pgdat may never get set if compaction is ready to continue due to this check if (IS_ENABLED(CONFIG_COMPACTION) && sc->order > PAGE_ALLOC_COSTLY_ORDER && compaction_ready(zone, sc)) { sc->compaction_ready = true; continue; } If this was true for every zone in the zonelist, first_pgdat would never get set resulting in a NULL pointer exception. Link: https://lkml.kernel.org/r/20211209095453.GM3366@techsingularity.net Fixes: 1b4e3f26f9f75 ("mm: vmscan: Reduce throttling due to a failure to make progress") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-31mm: vmscan: Reduce throttling due to a failure to make progressMel Gorman1-9/+55
Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar problems due to reclaim throttling for excessive lengths of time. In Alexey's case, a memory hog that should go OOM quickly stalls for several minutes before stalling. In Mike and Darrick's cases, a small memcg environment stalled excessively even though the system had enough memory overall. Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made") introduced the problem although commit a19594ca4a8b ("mm/vmscan: increase the timeout if page reclaim is not making progress") made it worse. Systems at or near an OOM state that cannot be recovered must reach OOM quickly and memcg should kill tasks if a memcg is near OOM. To address this, only stall for the first zone in the zonelist, reduce the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if the scan control nr_reclaimed is 0, kswapd is still active and there were excessive pages pending for writeback. If kswapd has stopped reclaiming due to excessive failures, do not stall at all so that OOM triggers relatively quickly. Similarly, if an LRU is simply congested, only lightly throttle similar to NOPROGRESS. Alexey's original case was the most straight forward for i in {1..3}; do tail /dev/zero; done On vanilla 5.16-rc1, this test stalled heavily, after the patch the test completes in a few seconds similar to 5.15. Alexey's second test case added watching a youtube video while tail runs 10 times. On 5.15, playback only jitters slightly, 5.16-rc1 stalls a lot with lots of frames missing and numerous audio glitches. With this patch applies, the video plays similarly to 5.15. [lkp@intel.com: Fix W=1 build warning] Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net Link: https://lore.kernel.org/r/20211202150614.22440-1-mgorman@techsingularity.net Link: https://linux-regtracking.leemhuis.info/regzbot/regression/20211124011954.7cab9bb4@mail.inbox.lv/ Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv> Reported-and-tested-by: Mike Galbraith <efault@gmx.de> Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Hugh Dickins <hughd@google.com> Tracked-by: Thorsten Leemhuis <regressions@leemhuis.info> Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-31mm/damon/dbgfs: fix 'struct pid' leaks in 'dbgfs_target_ids_write()'SeongJae Park1-2/+7
DAMON debugfs interface increases the reference counts of 'struct pid's for targets from the 'target_ids' file write callback ('dbgfs_target_ids_write()'), but decreases the counts only in DAMON monitoring termination callback ('dbgfs_before_terminate()'). Therefore, when 'target_ids' file is repeatedly written without DAMON monitoring start/termination, the reference count is not decreased and therefore memory for the 'struct pid' cannot be freed. This commit fixes this issue by decreasing the reference counts when 'target_ids' is written. Link: https://lkml.kernel.org/r/20211229124029.23348-1-sj@kernel.org Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-25mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page()Liu Shixin1-0/+1
Hulk Robot reported a panic in put_page_testzero() when testing madvise() with MADV_SOFT_OFFLINE. The BUG() is triggered when retrying get_any_page(). This is because we keep MF_COUNT_INCREASED flag in second try but the refcnt is not increased. page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:737! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 5 PID: 2135 Comm: sshd Tainted: G B 5.16.0-rc6-dirty #373 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: release_pages+0x53f/0x840 Call Trace: free_pages_and_swap_cache+0x64/0x80 tlb_flush_mmu+0x6f/0x220 unmap_page_range+0xe6c/0x12c0 unmap_single_vma+0x90/0x170 unmap_vmas+0xc4/0x180 exit_mmap+0xde/0x3a0 mmput+0xa3/0x250 do_exit+0x564/0x1470 do_group_exit+0x3b/0x100 __do_sys_exit_group+0x13/0x20 __x64_sys_exit_group+0x16/0x20 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Modules linked in: ---[ end trace e99579b570fe0649 ]--- RIP: 0010:release_pages+0x53f/0x840 Link: https://lkml.kernel.org/r/20211221074908.3910286-1-liushixin2@huawei.com Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-25mm/damon/dbgfs: protect targets destructions with kdamond_lockSeongJae Park1-0/+2
DAMON debugfs interface iterates current monitoring targets in 'dbgfs_target_ids_read()' while holding the corresponding 'kdamond_lock'. However, it also destructs the monitoring targets in 'dbgfs_before_terminate()' without holding the lock. This can result in a use_after_free bug. This commit avoids the race by protecting the destruction with the corresponding 'kdamond_lock'. Link: https://lkml.kernel.org/r/20211221094447.2241-1-sj@kernel.org Reported-by: Sangwoo Bae <sangwoob@amazon.com> Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.15.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-25mm, hwpoison: fix condition in free hugetlb page pathNaoya Horiguchi1-9/+4
When a memory error hits a tail page of a free hugepage, __page_handle_poison() is expected to be called to isolate the error in 4kB unit, but it's not called due to the outdated if-condition in memory_failure_hugetlb(). This loses the chance to isolate the error in the finer unit, so it's not optimal. Drop the condition. This "(p != head && TestSetPageHWPoison(head)" condition is based on the old semantics of PageHWPoison on hugepage (where PG_hwpoison flag was set on the subpage), so it's not necessray any more. By getting to set PG_hwpoison on head page for hugepages, concurrent error events on different subpages in a single hugepage can be prevented by TestSetPageHWPoison(head) at the beginning of memory_failure_hugetlb(). So dropping the condition should not reopen the race window originally mentioned in commit b985194c8c0a ("hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free hugepage") [naoya.horiguchi@linux.dev: fix "HardwareCorrupted" counter] Link: https://lkml.kernel.org/r/20211220084851.GA1460264@u2004 Link: https://lkml.kernel.org/r/20211210110208.879740-1-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Fei Luo <luofei@unicloud.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> [5.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-25mm: mempolicy: fix THP allocations escaping mempolicy restrictionsAndrey Ryabinin1-2/+1
alloc_pages_vma() may try to allocate THP page on the local NUMA node first: page = __alloc_pages_node(hpage_node, gfp | __GFP_THISNODE | __GFP_NORETRY, order); And if the allocation fails it retries allowing remote memory: if (!page && (gfp & __GFP_DIRECT_RECLAIM)) page = __alloc_pages_node(hpage_node, gfp, order); However, this retry allocation completely ignores memory policy nodemask allowing allocation to escape restrictions. The first appearance of this bug seems to be the commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"). The bug disappeared later in the commit 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") and reappeared again in slightly different form in the commit 76e654cc91bb ("mm, page_alloc: allow hugepage fallback to remote nodes when madvised") Fix this by passing correct nodemask to the __alloc_pages() call. The demonstration/reproducer of the problem: $ mount -oremount,size=4G,huge=always /dev/shm/ $ echo always > /sys/kernel/mm/transparent_hugepage/defrag $ cat mbind_thp.c #include <unistd.h> #include <sys/mman.h> #include <sys/stat.h> #include <fcntl.h> #include <assert.h> #include <stdlib.h> #include <stdio.h> #include <numaif.h> #define SIZE 2ULL << 30 int main(int argc, char **argv) { int fd; unsigned long long i; char *addr; pid_t pid; char buf[100]; unsigned long nodemask = 1; fd = open("/dev/shm/test", O_RDWR|O_CREAT); assert(fd > 0); assert(ftruncate(fd, SIZE) == 0); addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); assert(mbind(addr, SIZE, MPOL_BIND, &nodemask, 2, MPOL_MF_STRICT|MPOL_MF_MOVE)==0); for (i = 0; i < SIZE; i+=4096) { addr[i] = 1; } pid = getpid(); snprintf(buf, sizeof(buf), "grep shm /proc/%d/numa_maps", pid); system(buf); sleep(10000); return 0; } $ gcc mbind_thp.c -o mbind_thp -lnuma $ numactl -H available: 2 nodes (0-1) node 0 cpus: 0 2 node 0 size: 1918 MB node 0 free: 1595 MB node 1 cpus: 1 3 node 1 size: 2014 MB node 1 free: 1731 MB node distances: node 0 1 0: 10 20 1: 20 10 $ rm -f /dev/shm/test; taskset -c 0 ./mbind_thp 7fd970a00000 bind:0 file=/dev/shm/test dirty=524288 active=0 N0=396800 N1=127488 kernelpagesize_kB=4 Link: https://lkml.kernel.org/r/20211208165343.22349-1-arbn@yandex-team.com Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-25kfence: fix memory leak when cat kfence objectsBaokun Li1-0/+1
Hulk robot reported a kmemleak problem: unreferenced object 0xffff93d1d8cc02e8 (size 248): comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s) hex dump (first 32 bytes): 00 40 85 19 d4 93 ff ff 00 10 00 00 00 00 00 00 .@.............. 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: seq_open+0x2a/0x80 full_proxy_open+0x167/0x1e0 do_dentry_open+0x1e1/0x3a0 path_openat+0x961/0xa20 do_filp_open+0xae/0x120 do_sys_openat2+0x216/0x2f0 do_sys_open+0x57/0x80 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff93d419854000 (size 4096): comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s) hex dump (first 32 bytes): 6b 66 65 6e 63 65 2d 23 32 35 30 3a 20 30 78 30 kfence-#250: 0x0 30 30 30 30 30 30 30 37 35 34 62 64 61 31 32 2d 0000000754bda12- backtrace: seq_read_iter+0x313/0x440 seq_read+0x14b/0x1a0 full_proxy_read+0x56/0x80 vfs_read+0xa5/0x1b0 ksys_read+0xa0/0xf0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 I find that we can easily reproduce this problem with the following commands: cat /sys/kernel/debug/kfence/objects echo scan > /sys/kernel/debug/kmemleak cat /sys/kernel/debug/kmemleak The leaked memory is allocated in the stack below: do_syscall_64 do_sys_open do_dentry_open full_proxy_open seq_open ---> alloc seq_file vfs_read full_proxy_read seq_read seq_read_iter traverse ---> alloc seq_buf And it should have been released in the following process: do_syscall_64 syscall_exit_to_user_mode exit_to_user_mode_prepare task_work_run ____fput __fput full_proxy_release ---> free here However, the release function corresponding to file_operations is not implemented in kfence. As a result, a memory leak occurs. Therefore, the solution to this problem is to implement the corresponding release function. Link: https://lkml.kernel.org/r/20211206133628.2822545-1-libaokun1@huawei.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-12Merge branch 'for-5.16-fixes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu Pull percpu fixes from Dennis Zhou: "This contains a fix for SMP && !MMU archs for percpu which has been tested by arm and sh. It seems in the past they have gotten away with it due to mapping of vm functions to km functions, but this fell apart a few releases ago and was just reported recently. The other is just a minor dependency clean up. I think queued up right now by Andrew is a fix in percpu that papers of what seems to be a bug in hotplug for a special situation with memoryless nodes. Michal Hocko is digging into it further" * 'for-5.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu_ref: Replace kernel.h with the necessary inclusions percpu: km: ensure it is used with NOMMU (either UP or SMP)
2021-12-11mm: bdi: initialize bdi_min_ratio when bdi is unregisteredManjong Lee1-0/+7
Initialize min_ratio if it is set during bdi unregistration. This can prevent problems that may occur a when bdi is removed without resetting min_ratio. For example. 1) insert external sdcard 2) set external sdcard's min_ratio 70 3) remove external sdcard without setting min_ratio 0 4) insert external sdcard 5) set external sdcard's min_ratio 70 << error occur(can't set) Because when an sdcard is removed, the present bdi_min_ratio value will remain. Currently, the only way to reset bdi_min_ratio is to reboot. [akpm@linux-foundation.org: tweak comment and coding style] Link: https://lkml.kernel.org/r/20211021161942.5983-1-mj0123.lee@samsung.com Signed-off-by: Manjong Lee <mj0123.lee@samsung.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Changheun Lee <nanich.lee@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: <seunghwan.hyun@samsung.com> Cc: <sookwan7.kim@samsung.com> Cc: <yt0928.kim@samsung.com> Cc: <junho89.kim@samsung.com> Cc: <jisoo2146.oh@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-11hugetlbfs: fix issue of preallocation of gigantic pages can't workZhenguo Yao1-1/+1
Preallocation of gigantic pages can't work bacause of commit b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation"). When nid is NUMA_NO_NODE(-1), alloc_bootmem_huge_page will always return without doing allocation. Fix this by adding more check. Link: https://lkml.kernel.org/r/20211129133803.15653-1-yaozhenguo1@gmail.com Fixes: b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-11mm/memcg: relocate mod_objcg_mlstate(), get_obj_stock() and put_obj_stock()Waiman Long1-53/+53
All the calls to mod_objcg_mlstate(), get_obj_stock() and put_obj_stock() are done by functions defined within the same "#ifdef CONFIG_MEMCG_KMEM" compilation block. When CONFIG_MEMCG_KMEM isn't defined, the following compilation warnings will be issued [1] and [2]. mm/memcontrol.c:785:20: warning: unused function 'mod_objcg_mlstate' mm/memcontrol.c:2113:33: warning: unused function 'get_obj_stock' Fix these warning by moving those functions to under the same CONFIG_MEMCG_KMEM compilation block. There is no functional change. [1] https://lore.kernel.org/lkml/202111272014.WOYNLUV6-lkp@intel.com/ [2] https://lore.kernel.org/lkml/202111280551.LXsWYt1T-lkp@intel.com/ Link: https://lkml.kernel.org/r/20211129161140.306488-1-longman@redhat.com Fixes: 559271146efc ("mm/memcg: optimize user context object stock access") Fixes: 68ac5b3c8db2 ("mm/memcg: cache vmstat data in percpu memcg_stock_pcp") Signed-off-by: Waiman Long <longman@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-11mm/slub: fix endianness bug for alloc/free_traces attributesGerald Schaefer1-6/+9
On big-endian s390, the alloc/free_traces attributes produce endless output, because of always 0 idx in slab_debugfs_show(). idx is de-referenced from *v, which points to a loff_t value, with unsigned int idx = *(unsigned int *)v; This will only give the upper 32 bits on big-endian, which remain 0. Instead of only fixing this de-reference, during discussion it seemed more appropriate to change the seq_ops so that they use an explicit iterator in private loc_track struct. This patch adds idx to loc_track, which will also fix the endianness bug. Link: https://lore.kernel.org/r/20211117193932.4049412-1-gerald.schaefer@linux.ibm.com Link: https://lkml.kernel.org/r/20211126171848.17534-1-gerald.schaefer@linux.ibm.com Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs") Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reported-by: Steffen Maier <maier@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Faiyaz Mohammed <faiyazm@codeaurora.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-11mm/damon/vaddr-test: remove unnecessary variablesSeongJae Park1-8/+0
A couple of test functions in DAMON virtual address space monitoring primitives implementation has unnecessary damon_ctx variables. This commit removes those. Link: https://lkml.kernel.org/r/20211201150440.1088-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-11mm/damon/vaddr-test: split a test function having >1024 bytes frame sizeSeongJae Park1-37/+40
On some configuration[1], 'damon_test_split_evenly()' kunit test function has >1024 bytes frame size, so below build warning is triggered: CC mm/damon/vaddr.o In file included from mm/damon/vaddr.c:672: mm/damon/vaddr-test.h: In function 'damon_test_split_evenly': mm/damon/vaddr-test.h:309:1: warning: the frame size of 1064 bytes is larger than 1024 bytes [-Wframe-larger-than=] 309 | } | ^ This commit fixes the warning by separating the common logic in the function. [1] https://lore.kernel.org/linux-mm/202111182146.OV3C4uGr-lkp@intel.com/ Link: https://lkml.kernel.org/r/20211201150440.1088-6-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-11mm/damon/vaddr: remove an unnecessary warning messageSeongJae Park1-1/+0
The DAMON virtual address space monitoring primitive prints a warning message for wrong DAMOS action. However, it is not essential as the code returns appropriate failure in the case. This commit removes the message to make the log clean. Link: https://lkml.kernel.org/r/20211201150440.1088-5-sj@kernel.org Fixes: 6dea8add4d28 ("mm/damon/vaddr: support DAMON-based Operation Schemes") Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-11mm/damon/core: remove unnecessary error messagesSeongJae Park1-9/+2
DAMON core prints error messages when damon_target object creation is failed or wrong monitoring attributes are given. Because appropriate error code is returned for each case, the messages are not essential. Also, because the code path can be triggered with user-specified input, this could result in kernel log mistakenly being messy. To avoid the case, this commit removes the messages. Link: https://lkml.kernel.org/r/20211201150440.1088-4-sj@kernel.org Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface") Fixes: b9a6ac4e4ede ("mm/damon: adaptively adjust regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-11mm/damon/dbgfs: remove an unnecessary error messageSeongJae Park1-3/+1
When wrong scheme action is requested via the debugfs interface, DAMON prints an error message. Because the function returns error code, this is not really needed. Because the code path is triggered by the user specified input, this can result in kernel log mistakenly being messy. To avoid the case, this commit removes the message. Link: https://lkml.kernel.org/r/20211201150440.1088-3-sj@kernel.org Fixes: af122dd8f3c0 ("mm/damon/dbgfs: support DAMON-based Operation Schemes") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>