path: root/mm
Age  |  Commit message  |  Author  |  Files  |  Lines
2021-04-30  mm: provide filemap_range_needs_writeback() helper  [Jens Axboe, 1 file, -0/+43]
Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.

An internal workload complained because it was using too much CPU, and when I took a look, we had a lot of io_uring workers going to town. For an async buffered read like workload, I am normally expecting _zero_ offloads to a worker thread, but this one had tons of them. I'd drop caches and things would look good again, but then a minute later we'd regress back to using workers. Turns out that every minute something was reading parts of the device, which would add page cache for that inode. I put patches like these in for our kernel, and the problem was solved.

Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache entries for the given range. This causes unnecessary work from the caller's side, when the IO could have been issued totally fine without blocking on writeback when there is none.

This patch (of 3):

For O_DIRECT reads/writes, we check if we need to issue a call to filemap_write_and_wait_range() to issue and/or wait for writeback for any page in the given range. The existing mechanism just checks for a page in the range, which is suboptimal for IOCB_NOWAIT as we'll fall back to the slow path (and need a retry) if there's just a clean page cache page in the range.

Provide filemap_range_needs_writeback() which tries a little harder to check if we actually need to issue and/or wait for writeback in the range.

Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
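As a rough illustration, a minimal sketch of how a direct-I/O read path might use the new helper for IOCB_NOWAIT (the exact call site in mm/filemap.c may differ; iocb, mapping, count and error are locals of the surrounding read path):

    /* Only fail an IOCB_NOWAIT read if writeback would actually be needed. */
    if (iocb->ki_flags & IOCB_NOWAIT) {
            if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
                                              iocb->ki_pos + count - 1))
                    return -EAGAIN;
    } else {
            error = filemap_write_and_wait_range(mapping, iocb->ki_pos,
                                                 iocb->ki_pos + count - 1);
            if (error)
                    return error;
    }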
2021-04-30  mm: page_poison: print page info when corruption is caught  [Sergei Trofimovich, 1 file, -2/+4]
When page_poison detects page corruption it's useful to see who freed a page recently, to have a guess where write-after-free corruption happens. After this change the corruption report has extra page data.

Example report from a real corruption (includes only the page_owner part):

    pagealloc: memory corruption
    e00000014cd61d10: 11 00 00 00 00 00 00 00 30 1d d2 ff ff 0f 00 60  ........0......`
    e00000014cd61d20: b0 1d d2 ff ff 0f 00 60 90 fe 1c 00 08 00 00 20  .......`.......
    ...
    CPU: 1 PID: 220402 Comm: cc1plus Not tainted 5.12.0-rc5-00107-g9720c6f59ecf #245
    Hardware name: hp server rx3600, BIOS 04.03 04/08/2008
    ...
    Call Trace:
     [<a000000100015210>] show_stack+0x90/0xc0
     [<a000000101163390>] dump_stack+0x150/0x1c0
     [<a0000001003f1e90>] __kernel_unpoison_pages+0x410/0x440
     [<a0000001003c2460>] get_page_from_freelist+0x1460/0x2ca0
     [<a0000001003c6be0>] __alloc_pages_nodemask+0x3c0/0x660
     [<a0000001003ed690>] alloc_pages_vma+0xb0/0x500
     [<a00000010037deb0>] __handle_mm_fault+0x1230/0x1fe0
     [<a00000010037ef70>] handle_mm_fault+0x310/0x4e0
     [<a00000010005dc70>] ia64_do_page_fault+0x1f0/0xb80
     [<a00000010000ca00>] ia64_leave_kernel+0x0/0x270

    page_owner tracks the page as freed
    page allocated via order 0, migratetype Movable, gfp_mask 0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), pid 37, ts 8173444098740
     __reset_page_owner+0x40/0x200
     free_pcp_prepare+0x4d0/0x600
     free_unref_page+0x20/0x1c0
     __put_page+0x110/0x1a0
     migrate_pages+0x16d0/0x1dc0
     compact_zone+0xfc0/0x1aa0
     proactive_compact_node+0xd0/0x1e0
     kcompactd+0x550/0x600
     kthread+0x2c0/0x2e0
     call_payload+0x50/0x80

Here we can see that the page was freed by page migration, but something managed to write to it afterwards.

[slyfox@gentoo.org: s/dump_page_owner/dump_page/, per Vlastimil]

Link: https://lkml.kernel.org/r/20210407230800.1086854-1-slyfox@gentoo.org
Link: https://lkml.kernel.org/r/20210404141735.2152984-1-slyfox@gentoo.org
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30  mm: page_owner: detect page_owner recursion via task_struct  [Sergei Trofimovich, 1 file, -22/+10]
Before the change, page_owner recursion was detected by fetching a backtrace and inspecting it for the current instruction pointer. That has a few problems:

 - it is slightly slow, as it requires an extra backtrace and a linear stack scan of the result
 - it is too late to check if the backtrace fetching itself required a memory allocation (ia64's unwinder requires it).

To simplify recursion tracking, let's use a page_owner recursion flag in 'struct task_struct'.

The change makes page_owner=on work on ia64 by avoiding infinite recursion in:

    kmalloc()
     -> __set_page_owner()
      -> save_stack()
       -> unwind() [ia64-specific]
        -> build_script()
         -> kmalloc()
          -> __set_page_owner()   [we short-circuit here]
           -> save_stack()
            -> unwind()           [recursion]

Link: https://lkml.kernel.org/r/20210402115342.1463781-1-slyfox@gentoo.org
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
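A simplified sketch of the new guard (field and helper names follow the patch description and may be abbreviated here):

    /* include/linux/sched.h, under CONFIG_PAGE_OWNER: */
    unsigned                        in_page_owner:1;

    /* mm/page_owner.c, save_stack(), roughly: */
    if (current->in_page_owner)
            return dummy_handle;            /* recursive call, short-circuit */
    current->in_page_owner = 1;
    nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2);
    handle = stack_depot_save(entries, nr_entries, flags);
    current->in_page_owner = 0;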
2021-04-30  mm: page_owner: use kstrtobool() to parse bool option  [Sergei Trofimovich, 1 file, -7/+1]
I tried to use page_owner=1 for a while and noticed too late that it had no effect, as opposed to the similar init_on_alloc=1 (those work). Let's make them consistent.

The change decreases the binary size slightly:

    text    data    bss     dec     hex     filename
    12408   321     17      12746   31ca    mm/page_owner.o.before
    12320   321     17      12658   3172    mm/page_owner.o.after

Link: https://lkml.kernel.org/r/20210401210909.3532086-1-slyfox@gentoo.org
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
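After the change, the boot parameter parser is roughly a one-liner around kstrtobool(), so "page_owner=1", "page_owner=on" and "page_owner=y" are all accepted (a sketch; the variable name is assumed):

    static int __init early_page_owner_param(char *buf)
    {
            return kstrtobool(buf, &page_owner_enabled);
    }
    early_param("page_owner", early_page_owner_param);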
2021-04-30  mm: page_owner: fetch backtrace only for tracked pages  [Sergei Trofimovich, 1 file, -3/+3]
Very minor optimization. Link: https://lkml.kernel.org/r/20210401212445.3534721-1-slyfox@gentoo.org Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30  mm, page_owner: remove unused parameter in __set_page_owner_handle  [zhongjiang-ali, 1 file, -5/+5]
Since commit 5556cfe8d994 ("mm, page_owner: fix off-by-one error in __set_page_owner_handle()") was introduced, the parameter 'page' is no longer used, hence it needs to be removed.

Link: https://lkml.kernel.org/r/1616602022-43545-1-git-send-email-zhongjiang-ali@linux.alibaba.com
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30  mm/page_owner: record the timestamp of all pages during free  [Georgi Djakov, 1 file, -4/+8]
Collect the time when each allocation is freed, to help with memory analysis with kdump/ramdump. Add the timestamp also in the page_owner debugfs file and print it in dump_page().

Having another timestamp when we free the page helps for debugging page migration issues. For example, both alloc and free timestamps being the same can give hints that there is an issue with migrating memory, as opposed to a page just being dropped during migration.

Link: https://lkml.kernel.org/r/20210203175905.12267-1-georgi.djakov@linaro.org
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30  mm/kmemleak.c: fix a typo  [Bhaskar Chowdhury, 1 file, -1/+1]
s/interruptable/interruptible/ Link: https://lkml.kernel.org/r/20210319214140.23304-1-unixbhaskar@gmail.com Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30  mm/slub.c: trivial typo fixes  [Bhaskar Chowdhury, 1 file, -4/+4]
s/operatios/operations/ s/Mininum/Minimum/ s/mininum/minimum/ ......two different places. Link: https://lkml.kernel.org/r/20210325044940.14516-1-unixbhaskar@gmail.com Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30  mm, slub: enable slub_debug static key when creating cache with explicit debug flags  [Vlastimil Babka, 1 file, -0/+9]
Commit ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()") introduced a static key to optimize the case where no debugging is enabled for any cache. The static key is enabled when the slub_debug boot parameter is passed, or when CONFIG_SLUB_DEBUG_ON is enabled.

However, some caches might be created with one or more debugging flags explicitly passed to kmem_cache_create(), and the commit missed this. Thus the debugging functionality would not actually be performed for these caches unless the static key gets enabled by boot param or config.

This patch fixes it by checking for debugging flags passed to kmem_cache_create() and enabling the static key accordingly.

Note such explicit debugging flags should not be used outside of debugging and testing, as they will now enable the static key globally. btrfs_init_cachep() creates a cache with SLAB_RED_ZONE but that's a mistake that's being corrected [1]. rcu_torture_stats() creates a cache with SLAB_STORE_USER, but that is a testing module so it's OK and will start working as intended after this patch.

Also note that in case of backports to kernels before v5.12 that don't have 59450bbc12be ("mm, slab, slub: stop taking cpu hotplug lock"), static_branch_enable_cpuslocked() should be used.

[1] https://lore.kernel.org/linux-btrfs/20210315141824.26099-1-dsterba@suse.com/

Link: https://lkml.kernel.org/r/20210315153415.24404-1-vbabka@suse.cz
Fixes: ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Oliver Glitta <glittao@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
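The fix amounts to a check of this kind on the cache creation path (a sketch; the exact location is the SLUB cache setup code):

    /* If a cache is created with any explicit debug flags, flip the key too. */
    if (s->flags & SLAB_DEBUG_FLAGS)
            static_branch_enable(&slub_debug_enabled);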
2021-04-30  mm/slab_common: provide "slab_merge" option for !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT) builds  [Rafael Aquini, 1 file, -0/+8]
This is a minor addition to the allocator setup options to provide a simple way to enable back cache merging on demand, for builds that by default run with CONFIG_SLAB_MERGE_DEFAULT not set.

Link: https://lkml.kernel.org/r/20210319194506.200159-1-aquini@redhat.com
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
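The new option is the mirror image of the existing "slab_nomerge" parameter; a sketch of what the setup handler looks like:

    static int __init setup_slab_merge(char *str)
    {
            slab_nomerge = false;
            return 1;
    }
    __setup("slab_merge", setup_slab_merge);

Booting with "slab_merge" on the kernel command line then re-enables cache merging on kernels built with CONFIG_SLAB_MERGE_DEFAULT=n.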
2021-04-29  Merge tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs  [Linus Torvalds, 1 file, -0/+3]
Pull fsnotify updates from Jan Kara:

 - support for limited fanotify functionality for unprivileged users

 - faster merging of fanotify events

 - a few smaller fsnotify improvements

* tag 'fsnotify_for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  shmem: allow reporting fanotify events with file handles on tmpfs
  fs: introduce a wrapper uuid_to_fsid()
  fanotify_user: use upper_32_bits() to verify mask
  fanotify: support limited functionality for unprivileged users
  fanotify: configurable limits via sysfs
  fanotify: limit number of event merge attempts
  fsnotify: use hash table for faster events merge
  fanotify: mix event info and pid into merge key hash
  fanotify: reduce event objectid to 29-bit hash
  fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
2021-04-29  Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block  [Linus Torvalds, 1 file, -5/+4]
Pull block updates from Jens Axboe: "Pretty quiet round this time, which is nice. In detail: - Series revamping bounce buffer support (Christoph) - Dead code removal (Christoph, Bart) - Partition iteration revamp, now using xarray (Christoph) - Passthrough request scheduler improvements (Lin) - Series of BFQ improvements (Paolo) - Fix ioprio task iteration (Peter) - Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max, Nikolay)" * tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits) blk-iocost: don't ignore vrate_min on QD contention blk-mq: Fix spurious debugfs directory creation during initialization bfq/mq-deadline: remove redundant check for passthrough request blk-mq: bypass IO scheduler's limit_depth for passthrough request block: Remove an obsolete comment from sg_io() block: move bio_list_copy_data to pktcdvd block: remove zero_fill_bio_iter block: add queue_to_disk() to get gendisk from request_queue block: remove an incorrect check from blk_rq_append_bio block: initialize ret in bdev_disk_changed block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration block: remove disk_part_iter block: simplify diskstats_show block: simplify show_partition block: simplify printk_all_partitions block: simplify partition_overlaps block: simplify partition removal block: take bd_mutex around delete_partitions in del_gendisk block: refactor blk_drop_partitions block: move more syncing and invalidation to delete_partition ...
2021-04-28  Merge tag 'core-rcu-2021-04-28' of ↵  [Linus Torvalds, 7 files, -0/+17]
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - Support for "N" as alias for last bit in bitmap parsing library (eg using syntax like "nohz_full=2-N") - kvfree_rcu updates - mm_dump_obj() updates. (One of these is to mm, but was suggested by Andrew Morton.) - RCU callback offloading update - Polling RCU grace-period interfaces - Realtime-related RCU updates - Tasks-RCU updates - Torture-test updates - Torture-test scripting updates - Miscellaneous fixes * tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits) rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu() rcu: Provide polling interfaces for Tiny RCU grace periods torture: Fix kvm.sh --datestamp regex check torture: Consolidate qemu-cmd duration editing into kvm-transform.sh torture: Print proper vmlinux path for kvm-again.sh runs torture: Make TORTURE_TRUST_MAKE available in kvm-again.sh environment torture: Make kvm-transform.sh update jitter commands torture: Add --duration argument to kvm-again.sh torture: Add kvm-again.sh to rerun a previous torture-test torture: Create a "batches" file for build reuse torture: De-capitalize TORTURE_SUITE torture: Make upper-case-only no-dot no-slash scenario names official torture: Rename SRCU-t and SRCU-u to avoid lowercase characters torture: Remove no-mpstat error message torture: Record kvm-test-1-run.sh and kvm-test-1-run-qemu.sh PIDs torture: Record jitter start/stop commands torture: Extract kvm-test-1-run-qemu.sh from kvm-test-1-run.sh torture: Record TORTURE_KCONFIG_GDB_ARG in qemu-cmd torture: Abstract jitter.sh start/stop into scripts rcu: Provide polling interfaces for Tree RCU grace periods ...
2021-04-28  Merge tag 'printk-for-5.13' of ↵  [Linus Torvalds, 1 file, -6/+7]
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Stop synchronizing kernel log buffer readers by logbuf_lock. As a result, the access to the buffer is fully lockless now. Note that printk() itself still uses locks because it tries to flush the messages to the console immediately. Also the per-CPU temporary buffers are still there because they prevent infinite recursion and serialize backtraces from NMI. All this is going to change in the future. - kmsg_dump API rework and cleanup as a side effect of the logbuf_lock removal. - Make bstr_printf() aware that %pf and %pF formats could deference the given pointer. - Show also page flags by %pGp format. - Clarify the documentation for plain pointer printing. - Do not show no_hash_pointers warning multiple times. - Update Senozhatsky email address. - Some clean up. * tag 'printk-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (24 commits) lib/vsprintf.c: remove leftover 'f' and 'F' cases from bstr_printf() printk: clarify the documentation for plain pointer printing kernel/printk.c: Fixed mundane typos printk: rename vprintk_func to vprintk vsprintf: dump full information of page flags in pGp mm, slub: don't combine pr_err with INFO mm, slub: use pGp to print page flags MAINTAINERS: update Senozhatsky email address lib/vsprintf: do not show no_hash_pointers message multiple times printk: console: remove unnecessary safe buffer usage printk: kmsg_dump: remove _nolock() variants printk: remove logbuf_lock printk: introduce a kmsg_dump iterator printk: kmsg_dumper: remove @active field printk: add syslog_lock printk: use atomic64_t for devkmsg_user.seq printk: use seqcount_latch for clear_seq printk: introduce CONSOLE_LOG_MAX printk: consolidate kmsg_dump_get_buffer/syslog_print_all code printk: refactor kmsg_dump_get_buffer() ...
2021-04-27  Merge tag 'netfs-lib-20210426' of ↵  [Linus Torvalds, 3 files, -19/+154]
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull network filesystem helper library updates from David Howells: "Here's a set of patches for 5.13 to begin the process of overhauling the local caching API for network filesystems. This set consists of two parts: (1) Add a helper library to handle the new VM readahead interface. This is intended to be used unconditionally by the filesystem (whether or not caching is enabled) and provides a common framework for doing caching, transparent huge pages and, in the future, possibly fscrypt and read bandwidth maximisation. It also allows the netfs and the cache to align, expand and slice up a read request from the VM in various ways; the netfs need only provide a function to read a stretch of data to the pagecache and the helper takes care of the rest. (2) Add an alternative fscache/cachfiles I/O API that uses the kiocb facility to do async DIO to transfer data to/from the netfs's pages, rather than using readpage with wait queue snooping on one side and vfs_write() on the other. It also uses less memory, since it doesn't do buffered I/O on the backing file. Note that this uses SEEK_HOLE/SEEK_DATA to locate the data available to be read from the cache. Whilst this is an improvement from the bmap interface, it still has a problem with regard to a modern extent-based filesystem inserting or removing bridging blocks of zeros. Fixing that requires a much greater overhaul. This is a step towards overhauling the fscache API. The change is opt-in on the part of the network filesystem. A netfs should not try to mix the old and the new API because of conflicting ways of handling pages and the PG_fscache page flag and because it would be mixing DIO with buffered I/O. Further, the helper library can't be used with the old API. This does not change any of the fscache cookie handling APIs or the way invalidation is done at this time. In the near term, I intend to deprecate and remove the old I/O API (fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(), fscache_write_page() and fscache_uncache_page()) and eventually replace most of fscache/cachefiles with something simpler and easier to follow. This patchset contains the following parts: - Some helper patches, including provision of an ITER_XARRAY iov iterator and a function to do readahead expansion. - Patches to add the netfs helper library. - A patch to add the fscache/cachefiles kiocb API. - A pair of patches to fix some review issues in the ITER_XARRAY and read helpers as spotted by Al and Willy. Jeff Layton has patches to add support in Ceph for this that he intends for this merge window. I have a set of patches to support AFS that I will post a separate pull request for. With this, AFS without a cache passes all expected xfstests; with a cache, there's an extra failure, but that's also there before these patches. Fixing that probably requires a greater overhaul. Ceph also passes the expected tests. 
I also have patches in a separate branch to tidy up the handling of PG_fscache/PG_private_2 and their contribution to page refcounting in the core kernel here, but I haven't included them in this set and will route them separately" Link: https://lore.kernel.org/lkml/3779937.1619478404@warthog.procyon.org.uk/ * tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: netfs: Miscellaneous fixes iov_iter: Four fixes for ITER_XARRAY fscache, cachefiles: Add alternate API to use kiocb for read/write to cache netfs: Add a tracepoint to log failures that would be otherwise unseen netfs: Define an interface to talk to a cache netfs: Add write_begin helper netfs: Gather stats netfs: Add tracepoints netfs: Provide readahead and readpage netfs helpers netfs, mm: Add set/end/wait_on_page_fscache() aliases netfs, mm: Move PG_fscache helper funcs to linux/netfs.h netfs: Documentation for helper library netfs: Make a netfs helper module mm: Implement readahead_control pageset expansion mm/readahead: Handle ractl nr_pages being modified fs: Document file_ra_state mm/filemap: Pass the file_ra_state in the ractl mm: Add set/end/wait functions for PG_private_2 iov_iter: Add ITER_XARRAY
2021-04-26  Merge tag 'arm64-upstream' of ↵  [Linus Torvalds, 4 files, -11/+123]
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - MTE asynchronous support for KASan. Previously only synchronous (slower) mode was supported. Asynchronous is faster but does not allow precise identification of the illegal access. - Run kernel mode SIMD with softirqs disabled. This allows using NEON in softirq context for crypto performance improvements. The conditional yield support is modified to take softirqs into account and reduce the latency. - Preparatory patches for Apple M1: handle CPUs that only have the VHE mode available (host kernel running at EL2), add FIQ support. - arm64 perf updates: support for HiSilicon PA and SLLC PMU drivers, new functions for the HiSilicon HHA and L3C PMU, cleanups. - Re-introduce support for execute-only user permissions but only when the EPAN (Enhanced Privileged Access Never) architecture feature is available. - Disable fine-grained traps at boot and improve the documented boot requirements. - Support CONFIG_KASAN_VMALLOC on arm64 (only with KASAN_GENERIC). - Add hierarchical eXecute Never permissions for all page tables. - Add arm64 prctl(PR_PAC_{SET,GET}_ENABLED_KEYS) allowing user programs to control which PAC keys are enabled in a particular task. - arm64 kselftests for BTI and some improvements to the MTE tests. - Minor improvements to the compat vdso and sigpage. - Miscellaneous cleanups. * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (86 commits) arm64/sve: Add compile time checks for SVE hooks in generic functions arm64/kernel/probes: Use BUG_ON instead of if condition followed by BUG. arm64: pac: Optimize kernel entry/exit key installation code paths arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS) arm64: mte: make the per-task SCTLR_EL1 field usable elsewhere arm64/sve: Remove redundant system_supports_sve() tests arm64: fpsimd: run kernel mode NEON with softirqs disabled arm64: assembler: introduce wxN aliases for wN registers arm64: assembler: remove conditional NEON yield macros kasan, arm64: tests supports for HW_TAGS async mode arm64: mte: Report async tag faults before suspend arm64: mte: Enable async tag check fault arm64: mte: Conditionally compile mte_enable_kernel_*() arm64: mte: Enable TCO in functions that can read beyond buffer limits kasan: Add report for async mode arm64: mte: Drop arch_enable_tagging() kasan: Add KASAN mode kernel parameter arm64: mte: Add asynchronous mode support arm64: Get rid of CONFIG_ARM64_VHE arm64: Cope with CPUs stuck in VHE mode ...
2021-04-26  Merge tag 'x86-entry-2021-04-26' of ↵  [Linus Torvalds, 2 files, -4/+6]
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull entry code update from Thomas Gleixner: "Provide support for randomized stack offsets per syscall to make stack-based attacks harder which rely on the deterministic stack layout. The feature is based on the original idea of PaX's RANDSTACK feature, but uses a significantly different implementation. The offset does not affect the pt_regs location on the task stack as this was agreed on to be of dubious value. The offset is applied before the actual syscall is invoked. The offset is stored per cpu and the randomization happens at the end of the syscall which is less predictable than on syscall entry. The mechanism to apply the offset is via alloca(), i.e. abusing the dispised VLAs. This comes with the drawback that stack-clash-protection has to be disabled for the affected compilation units and there is also a negative interaction with stack-protector. Those downsides are traded with the advantage that this approach does not require any intrusive changes to the low level assembly entry code, does not affect the unwinder and the correct stack alignment is handled automatically by the compiler. The feature is guarded with a static branch which avoids the overhead when disabled. Currently this is supported for X86 and ARM64" * tag 'x86-entry-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arm64: entry: Enable random_kstack_offset support lkdtm: Add REPORT_STACK for checking stack offsets x86/entry: Enable random_kstack_offset support stack: Optionally randomize kernel stack offset each syscall init_on_alloc: Optimize static branches jump_label: Provide CONFIG-driven build state defaults
2021-04-24  mm/filemap: fix mapping_seek_hole_data on THP & 32-bit  [Hugh Dickins, 1 file, -10/+11]
No problem on 64-bit, or without huge pages, but xfstests generic/285 and other SEEK_HOLE/SEEK_DATA tests have regressed on huge tmpfs, and on 32-bit architectures, with the new mapping_seek_hole_data(). Several different bugs turned out to need fixing. u64 cast to stop losing bits when converting unsigned long to loff_t (and let's use shifts throughout, rather than mixed with * and /). Use round_up() when advancing pos, to stop assuming that pos was already THP-aligned when advancing it by THP-size. (This use of round_up() assumes that any THP has THP-aligned index: true at present and true going forward, but could be recoded to avoid the assumption.) Use xas_set() when iterating away from a THP, so that xa_index stays in synch with start, instead of drifting away to return bogus offset. Check start against end to avoid wrapping 32-bit xa_index to 0 (and to handle these additional cases, seek_data or not, it's easier to break the loop than goto: so rearrange exit from the function). [hughd@google.com: remove unneeded u64 casts, per Matthew] Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104221347240.1170@eggly.anvils Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104211737410.3299@eggly.anvils Fixes: 41139aa4c3a3 ("mm/filemap: add mapping_seek_hole_data") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-24  mm/filemap: fix find_lock_entries hang on 32-bit THP  [Hugh Dickins, 1 file, -2/+8]
No problem on 64-bit, or without huge pages, but xfstests generic/308 hung uninterruptibly on 32-bit huge tmpfs. Since commit 0cc3b0ec23ce ("Clarify (and fix) in 4.13 MAX_LFS_FILESIZE macros"), MAX_LFS_FILESIZE is only a PAGE_SIZE away from wrapping 32-bit xa_index to 0, so the new find_lock_entries() has to be extra careful when handling a THP. Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2104211735430.3299@eggly.anvils Fixes: 5c211ba29deb ("mm: add and use find_lock_entries") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-23  mm: Implement readahead_control pageset expansion  [David Howells, 1 file, -0/+75]
Provide a function, readahead_expand(), that expands the set of pages specified by a readahead_control object to encompass a revised area with a proposed size and length. The proposed area must include all of the old area and may be expanded yet more by this function so that the edges align on (transparent huge) page boundaries as allocated. The expansion will be cut short if a page already exists in either of the areas being expanded into. Note that any expansion made in such a case is not rolled back. This will be used by fscache so that reads can be expanded to cache granule boundaries, thereby allowing whole granules to be stored in the cache, but there are other potential users also. Changes: v6: - Fold in a patch from Matthew Wilcox to tell the ondemand readahead algorithm about the expansion so that the next readahead starts at the right place[2]. v4: - Moved the declaration of readahead_expand() to a better place[1]. Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christoph Hellwig <hch@lst.de> cc: Mike Marshall <hubcap@omnibond.com> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20210217161358.GM2858050@casper.infradead.org/ [1] Link: https://lore.kernel.org/r/20210407201857.3582797-4-willy@infradead.org/ [2] Link: https://lore.kernel.org/r/159974633888.2094769.8326206446358128373.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588479816.3465195.553952688795241765.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118131787.1232039.4863969952441067985.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161028670.2537118.13831420617039766044.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340389201.1303470.14353807284546854878.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539530488.286939.18085961677838089157.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653789422.2770958.2108046612147345000.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789069829.6155.4295672417565512161.stgit@warthog.procyon.org.uk/ # v6
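For illustration, a hypothetical expansion hook of the kind fscache could install, rounding the proposed readahead out to cache-granule boundaries (my_cache_expand_readahead and CACHE_GRANULE_SIZE are made-up names for this sketch):

    #define CACHE_GRANULE_SIZE      (256 * 1024)

    static void my_cache_expand_readahead(struct readahead_control *ractl)
    {
            loff_t start = readahead_pos(ractl);
            size_t len = readahead_length(ractl);

            /* Round the proposed area out to whole cache granules. */
            len += start & (CACHE_GRANULE_SIZE - 1);
            start = round_down(start, CACHE_GRANULE_SIZE);
            len = round_up(len, CACHE_GRANULE_SIZE);

            readahead_expand(ractl, start, len);
    }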
2021-04-23  mm/readahead: Handle ractl nr_pages being modified  [Matthew Wilcox (Oracle), 1 file, -2/+2]
Filesystems are not currently permitted to modify the number of pages in the ractl. An upcoming patch to add readahead_expand() changes that rule, so remove the check and resync the loop counter after every call to the filesystem. Tested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20210420200116.3715790-1-willy@infradead.org/ Link: https://lore.kernel.org/r/20210421170923.4005574-1-willy@infradead.org/ # v2
2021-04-23  mm/filemap: Pass the file_ra_state in the ractl  [Matthew Wilcox (Oracle), 3 files, -17/+16]
For readahead_expand(), we need to modify the file ra_state, so pass it down by adding it to the ractl. We have to do this because it's not always the same as f_ra in the struct file that is already being passed. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> Link: https://lore.kernel.org/r/20210407201857.3582797-2-willy@infradead.org/ Link: https://lore.kernel.org/r/161789067431.6155.8063840447229665720.stgit@warthog.procyon.org.uk/ # v6
2021-04-23  mm: Add set/end/wait functions for PG_private_2  [David Howells, 1 file, -0/+61]
Add three functions to manipulate PG_private_2: (*) set_page_private_2() - Set the flag and take an appropriate reference on the flagged page. (*) end_page_private_2() - Clear the flag, drop the reference and wake up any waiters, somewhat analogously with end_page_writeback(). (*) wait_on_page_private_2() - Wait for the flag to be cleared. Wrappers will need to be placed in the netfs lib header in the patch that adds that. [This implements a suggestion by Linus[1] to not mix the terminology of PG_private_2 and PG_fscache in the mm core function] Changes: v7: - Use compound_head() in all the functions to make them THP safe[6]. v5: - Add set and end functions, calling the end function end rather than unlock[3]. - Keep a ref on the page when PG_private_2 is set[4][5]. v4: - Remove extern from the declaration[2]. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christoph Hellwig <hch@lst.de> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/1330473.1612974547@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/CAHk-=wjgA-74ddehziVk=XAEMTKswPu1Yw4uaro1R3ibs27ztw@mail.gmail.com/ [1] Link: https://lore.kernel.org/r/20210216102659.GA27714@lst.de/ [2] Link: https://lore.kernel.org/r/161340387944.1303470.7944159520278177652.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539528910.286939.1252328699383291173.stgit@warthog.procyon.org.uk # v4 Link: https://lore.kernel.org/r/20210321105309.GG3420@casper.infradead.org [3] Link: https://lore.kernel.org/r/CAHk-=wh+2gbF7XEjYc=HV9w_2uVzVf7vs60BPz0gFA=+pUm3ww@mail.gmail.com/ [4] Link: https://lore.kernel.org/r/CAHk-=wjSGsRj7xwhSMQ6dAQiz53xA39pOG+XA_WeTgwBBu4uqg@mail.gmail.com/ [5] Link: https://lore.kernel.org/r/20210408145057.GN2531743@casper.infradead.org/ [6] Link: https://lore.kernel.org/r/161653788200.2770958.9517755716374927208.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789066013.6155.9816857201817288382.stgit@warthog.procyon.org.uk/ # v6
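A sketch of the intended usage pattern in a netfs-style caching path (an assumed flow, not taken from a particular filesystem):

    /* Before starting an asynchronous write of the page to the cache: */
    set_page_private_2(page);       /* sets the flag and takes a page ref */

    /* From the cache I/O completion handler: */
    end_page_private_2(page);       /* clears the flag, drops the ref, wakes waiters */

    /* Anywhere that must not proceed while the cache write is in flight: */
    wait_on_page_private_2(page);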
2021-04-21  percpu: implement partial chunk depopulation  [Roman Gushchin, 5 files, -20/+211]
From Roman ("percpu: partial chunk depopulation"):

In our [Facebook] production experience the percpu memory allocator is sometimes struggling with returning the memory to the system. A typical example is a creation of several thousands of memory cgroups (each has several chunks of the percpu data used for vmstats, vmevents, ref counters etc). Deletion and complete releasing of these cgroups doesn't always lead to a shrinkage of the percpu memory, so that sometimes there are several GB's of memory wasted.

The underlying problem is the fragmentation: to release an underlying chunk all percpu allocations should be released first. The percpu allocator tends to top up chunks to improve the utilization. It means new small-ish allocations (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, effectively pinning them in memory.

This patchset solves this problem by implementing a partial depopulation of percpu chunks: chunks with many empty pages are being asynchronously depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:

    --
    cd /sys/fs/cgroup

    mkdir percpu_test
    echo "+memory" > percpu_test/cgroup.subtree_control

    cat /proc/meminfo | grep Percpu

    for i in `seq 1 1000`; do
        mkdir percpu_test/cg_"${i}"
        for j in `seq 1 10`; do
            mkdir percpu_test/cg_"${i}"_"${j}"
        done
    done

    cat /proc/meminfo | grep Percpu

    for i in `seq 1 1000`; do
        for j in `seq 1 10`; do
            rmdir percpu_test/cg_"${i}"_"${j}"
        done
    done

    sleep 10

    cat /proc/meminfo | grep Percpu

    for i in `seq 1 1000`; do
        rmdir percpu_test/cg_"${i}"
    done

    rmdir percpu_test
    --

It creates 11000 memory cgroups and removes every 10 out of 11. It prints the initial size of the percpu memory, the size after creating all cgroups and the size after deleting most of them.

Results:

    vanilla:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481152 kB
    Percpu:           481152 kB

    with this patchset applied:
    ./percpu_test.sh
    Percpu:             7488 kB
    Percpu:           481408 kB
    Percpu:           135552 kB

The total size of the percpu memory was reduced by more than 3.5 times.

This patch:

This patch implements partial depopulation of percpu chunks. As of now, a chunk can be depopulated only as a part of the final destruction, if there are no more outstanding allocations. However, to minimize memory waste it might be useful to depopulate a partially filled chunk, if a small number of outstanding allocations prevents the chunk from being fully reclaimed.

This patch implements the following depopulation process: it scans over the chunk pages, looks for a range of empty and populated pages and performs the depopulation. To avoid races with new allocations, the chunk is previously isolated. After the depopulation the chunk is sidelined to a special list or freed. New allocations prefer using active chunks to sidelined chunks. If a sidelined chunk is used, it is reintegrated to the active lists.

The depopulation is scheduled on the free path if the chunk is all of the following:
  1) has more than 1/4 of total pages free and populated
  2) the system has enough free percpu pages aside of this chunk
  3) isn't the reserved chunk
  4) isn't the first chunk

If it's already depopulated but got free populated pages, it's a good target too. The chunk is moved to a special slot, pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work item is scheduled. On isolation, these pages are removed from the pcpu_nr_empty_pop_pages. It is constantly replaced to the to_depopulate_slot when it meets these qualifications.

pcpu_reclaim_populated() iterates over the to_depopulate_slot until it becomes empty. The depopulation is performed in the reverse direction to keep populated pages close to the beginning. Depopulated chunks are sidelined to preferentially avoid them for new allocations. When no active chunk can suffice a new allocation, sidelined chunks are first checked before creating a new chunk.

Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-21  percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1  [Dennis Zhou, 1 file, -6/+8]
This prepares for adding a to_depopulate list and sidelined list after the free slot in the set of lists in pcpu_slot. Signed-off-by: Dennis Zhou <dennis@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-21  percpu: factor out pcpu_check_block_hint()  [Roman Gushchin, 1 file, -7/+23]
Factor out the pcpu_check_block_hint() helper, which will be useful in the future. The new function checks if the allocation can likely fit within the contig hint. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-19  shmem: allow reporting fanotify events with file handles on tmpfs  [Amir Goldstein, 1 file, -0/+3]
Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file handle and fsid in events. fanotify_mark(2) fails with -ENODEV when trying to set a mark on filesystems that report a null f_fsid in statfs(2).

Use the digest of the uuid as f_fsid for tmpfs, to uniquely identify tmpfs objects as best as possible, and allow setting an fanotify mark that reports events with file handles on tmpfs.

Link: https://lore.kernel.org/r/20210322173944.449469-3-amir73il@gmail.com
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
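A small userspace sketch of what becomes possible on tmpfs with this change (before it, the fanotify_mark() call below failed with ENODEV on tmpfs; "/tmp" is assumed to be a tmpfs mount and the program needs CAP_SYS_ADMIN):

    #include <sys/fanotify.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
            /* File-handle-reporting group, required for create/delete events. */
            int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, 0);

            if (fd < 0) {
                    perror("fanotify_init");
                    return 1;
            }
            if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                              FAN_CREATE | FAN_DELETE | FAN_ONDIR,
                              AT_FDCWD, "/tmp") < 0) {
                    perror("fanotify_mark");
                    return 1;
            }
            return 0;
    }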
2021-04-17  mm: ptdump: fix build failure  [Christophe Leroy, 1 file, -1/+1]
READ_ONCE() cannot be used for reading PTEs. Use ptep_get() instead, to avoid the following errors:

      CC      mm/ptdump.o
    In file included from <command-line>:
    mm/ptdump.c: In function 'ptdump_pte_entry':
    include/linux/compiler_types.h:320:38: error: call to '__compiletime_assert_207' declared with attribute error: Unsupported access size for {READ,WRITE}_ONCE().
      320 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
          |                                      ^
    include/linux/compiler_types.h:301:4: note: in definition of macro '__compiletime_assert'
      301 |    prefix ## suffix();    \
          |    ^~~~~~
    include/linux/compiler_types.h:320:2: note: in expansion of macro '_compiletime_assert'
      320 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
          |  ^~~~~~~~~~~~~~~~~~~
    include/asm-generic/rwonce.h:36:2: note: in expansion of macro 'compiletime_assert'
       36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
          |  ^~~~~~~~~~~~~~~~~~
    include/asm-generic/rwonce.h:49:2: note: in expansion of macro 'compiletime_assert_rwonce_type'
       49 |  compiletime_assert_rwonce_type(x);    \
          |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/ptdump.c:114:14: note: in expansion of macro 'READ_ONCE'
      114 |  pte_t val = READ_ONCE(*pte);
          |              ^~~~~~~~~
    make[2]: *** [mm/ptdump.o] Error 1

See commit 481e980a7c19 ("mm: Allow arches to provide ptep_get()") and commit c0e1c8c22beb ("powerpc/8xx: Provide ptep_get() with 16k pages") for details.

Link: https://lkml.kernel.org/r/912b349e2bcaa88939904815ca0af945740c6bd4.1618478922.git.christophe.leroy@csgroup.eu
Fixes: 30d621f6723b ("mm: add generic ptdump")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
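The fix itself is a one-line substitution in ptdump_pte_entry() (shown as a before/after sketch):

    -       pte_t val = READ_ONCE(*pte);
    +       pte_t val = ptep_get(pte);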
2021-04-17  mm/mapping_dirty_helpers: guard hugepage pud's usage  [Zack Rusin, 1 file, -0/+2]
Mapping dirty helpers have, so far, been only used on X86, but a port of vmwgfx to ARM64 exposed a problem which results in a compilation error on ARM64 systems:

    mm/mapping_dirty_helpers.c: In function `wp_clean_pud_entry':
    mm/mapping_dirty_helpers.c:172:32: error: implicit declaration of function `pud_dirty'; did you mean `pmd_dirty'? [-Werror=implicit-function-declaration]

This is due to the fact that mapping_dirty_helpers code assumes that pud_dirty is always defined, which is not the case for architectures that don't define CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD.

ARM64 arch is a little inconsistent when it comes to PUD hugepage helpers, e.g. it defines pud_young but not pud_dirty, but regardless of that the core kernel code shouldn't assume that any of the PUD hugepage helpers are available unless CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD is defined. This prevents compilation errors whenever one of the drivers is ported to new architectures.

Link: https://lkml.kernel.org/r/20210409165151.694574-1-zackr@vmware.com
Signed-off-by: Zack Rusin <zackr@vmware.com>
Reviewed-by: Thomas Hellström (Intel) <thomas_os@shipmail.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-17  kasan: remove redundant config option  [Walter Wu, 3 files, -3/+3]
CONFIG_KASAN_STACK and CONFIG_KASAN_STACK_ENABLE both enable KASAN stack instrumentation, but we should only need one config, so that we remove CONFIG_KASAN_STACK_ENABLE and make CONFIG_KASAN_STACK workable. see [1]. When enable KASAN stack instrumentation, then for gcc we could do no prompt and default value y, and for clang prompt and default value n. This patch fixes the following compilation warning: include/linux/kasan.h:333:30: warning: 'CONFIG_KASAN_STACK' is not defined, evaluates to 0 [-Wundef] [akpm@linux-foundation.org: fix merge snafu] Link: https://bugzilla.kernel.org/show_bug.cgi?id=210221 [1] Link: https://lkml.kernel.org/r/20210226012531.29231-1-walter-zh.wu@mediatek.com Fixes: d9b571c885a8 ("kasan: fix KASAN_STACK dependency for HW_TAGS") Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-17  mm: eliminate "expecting prototype" kernel-doc warnings  [Randy Dunlap, 3 files, -13/+22]
Fix stray kernel-doc warnings in mm/ due to mis-typed or missing function names. Quietens these kernel-doc warnings: mm/mmu_gather.c:264: warning: expecting prototype for tlb_gather_mmu(). Prototype was for __tlb_gather_mmu() instead mm/oom_kill.c:180: warning: expecting prototype for Check whether unreclaimable slab amount is greater than(). Prototype was for should_dump_unreclaim_slab() instead mm/shuffle.c:155: warning: expecting prototype for shuffle_free_memory(). Prototype was for __shuffle_free_memory() instead Link: https://lkml.kernel.org/r/20210411210642.11362-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-16  percpu: split __pcpu_balance_workfn()  [Roman Gushchin, 1 file, -17/+29]
__pcpu_balance_workfn() became fairly big and hard to follow, but in fact it consists of two fully independent parts, responsible for the destruction of excessive free chunks and the population of the necessary amount of free pages.

In order to simplify the code and prepare for adding new functionality, split it in two functions:

  1) pcpu_balance_free,
  2) pcpu_balance_populated.

Move the taking/releasing of the pcpu_alloc_mutex to an upper level to keep the current synchronization in place.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-16  percpu: fix a comment about the chunks ordering  [Roman Gushchin, 1 file, -1/+4]
Since the commit 3e54097beb22 ("percpu: manage chunks based on contig_bits instead of free_bytes") chunks are sorted based on the size of the biggest continuous free area instead of the total number of free bytes. Update the corresponding comment to reflect this. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-15  Merge branch 'for-next/mte-async-kernel-mode' into for-next/core  [Catalin Marinas, 3 files, -11/+117]
* for-next/mte-async-kernel-mode:
  : Add MTE asynchronous kernel mode support
  kasan, arm64: tests supports for HW_TAGS async mode
  arm64: mte: Report async tag faults before suspend
  arm64: mte: Enable async tag check fault
  arm64: mte: Conditionally compile mte_enable_kernel_*()
  arm64: mte: Enable TCO in functions that can read beyond buffer limits
  kasan: Add report for async mode
  arm64: mte: Drop arch_enable_tagging()
  kasan: Add KASAN mode kernel parameter
  arm64: mte: Add asynchronous mode support
2021-04-11  Merge branch 'for-mingo-rcu' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu  [Ingo Molnar, 7 files, -0/+17]
Pull RCU changes from Paul E. McKenney:

 - Bitmap support for "N" as alias for last bit

 - kvfree_rcu updates

 - mm_dump_obj() updates. (One of these is to mm, but was suggested by Andrew Morton.)

 - RCU callback offloading update

 - Polling RCU grace-period interfaces

 - Realtime-related RCU updates

 - Tasks-RCU updates

 - Torture-test updates

 - Torture-test scripting updates

 - Miscellaneous fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-11  kasan, arm64: tests supports for HW_TAGS async mode  [Andrey Konovalov, 3 files, -0/+17]
This change adds KASAN-KUnit tests support for the async HW_TAGS mode.

In async mode, tag faults aren't generated synchronously when a bad access happens, but are instead explicitly checked for by the kernel. As each KASAN-KUnit test expects a fault to happen before the test is over, check for faults as a part of the test handler.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20210315132019.33202-10-vincenzo.frascino@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-04-11  kasan: Add report for async mode  [Vincenzo Frascino, 2 files, -1/+32]
KASAN provides an asynchronous mode of execution. Add reporting functionality for this mode. Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Link: https://lore.kernel.org/r/20210315132019.33202-5-vincenzo.frascino@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-04-11  kasan: Add KASAN mode kernel parameter  [Vincenzo Frascino, 2 files, -10/+68]
Architectures supported by KASAN_HW_TAGS can provide a sync or async mode of execution. On an MTE enabled arm64 hw for example this can be identified with the synchronous or asynchronous tagging mode of execution. In synchronous mode, an exception is triggered if a tag check fault occurs. In asynchronous mode, if a tag check fault occurs, the TFSR_EL1 register is updated asynchronously. The kernel checks the corresponding bits periodically. KASAN requires a specific kernel command line parameter to make use of this hw features. Add KASAN HW execution mode kernel command line parameter. Note: This patch adds the kasan.mode kernel parameter and the sync/async kernel command line options to enable the described features. [ Add a new var instead of exposing kasan_arg_mode to be consistent with flags for other command line arguments. ] Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Link: https://lore.kernel.org/r/20210315132019.33202-3-vincenzo.frascino@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-04-10  Merge branch 'for-5.12-fixes' of ↵  [Linus Torvalds, 3 files, -10/+15]
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu Pull percpu fix from Dennis Zhou: "This contains a fix for sporadically failing atomic percpu allocations. I only caught it recently while I was reviewing a new series [1] and simultaneously saw reports by btrfs in xfstests [2] and [3]. In v5.9, memcg accounting was extended to percpu done by adding a second type of chunk. I missed an interaction with the free page float count used to ensure we can support atomic allocations. If one type of chunk has no free pages, but the other has enough to satisfy the free page float requirement, we will not repopulate the free pages for the former type of chunk. This led to the sporadically failing atomic allocations" Link: https://lore.kernel.org/linux-mm/20210324190626.564297-1-guro@fb.com/ [1] Link: https://lore.kernel.org/linux-mm/20210401185158.3275.409509F4@e16-tech.com/ [2] Link: https://lore.kernel.org/linux-mm/CAL3q7H5RNBjCi708GH7jnczAOe0BLnacT9C+OBgA-Dx9jhB6SQ@mail.gmail.com/ [3] * 'for-5.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: make pcpu_nr_empty_pop_pages per chunk type
2021-04-10  kasan: fix conflict with page poisoning  [Andrey Konovalov, 1 file, -1/+3]
When page poisoning is enabled, it accesses memory that is marked as poisoned by KASAN, which leads to false-positive KASAN reports. Suppress the reports by adding KASAN annotations to unpoison_page() (poison_page() already has them).

Link: https://lkml.kernel.org/r/2dc799014d31ac13fd97bd906bad33e16376fc67.1617118501.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-10  mm/gup: check page posion status for coredump.  [Aili Yao, 2 files, -0/+24]
When we do a coredump for a user process signal, this may be a SIGBUS signal with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means this signal resulted from an ECC memory failure like SRAR or SRAO. We expect the memory recovery work to have finished correctly, so that get_dump_page() will not return the error page, as its process pte is set invalid by memory_failure().

But memory_failure() may fail, and the process's related pte may not be correctly set invalid. With the current code, we will return the poison page, get it dumped, and then cause a system panic as it's in kernel code.

So check the poison status in get_dump_page(), and if TRUE, return NULL.

There may be other scenarios where it is also better to check the poison status and not to panic, so make a wrapper for this check. Thanks to David's suggestion (<david@redhat.com>).

[akpm@linux-foundation.org: s/0/false/]
[yaoaili@kingsoft.com: is_page_poisoned() arg cannot be null, per Matthew]

Link: https://lkml.kernel.org/r/20210322115233.05e4e82a@alex-virtual-machine
Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-09  percpu: make pcpu_nr_empty_pop_pages per chunk type  [Roman Gushchin, 3 files, -10/+15]
nr_empty_pop_pages is used to guarantee that there are some free populated pages to satisfy atomic allocations. Accounted and non-accounted allocations are using separate sets of chunks, so both need to have a surplus of empty pages. This commit makes pcpu_nr_empty_pop_pages and the corresponding logic per chunk type. [Dennis] This issue came up as I was reviewing [1] and realized I missed this. Simultaneously, it was reported btrfs was seeing failed atomic allocations in fsstress tests [2] and [3]. [1] https://lore.kernel.org/linux-mm/20210324190626.564297-1-guro@fb.com/ [2] https://lore.kernel.org/linux-mm/20210401185158.3275.409509F4@e16-tech.com/ [3] https://lore.kernel.org/linux-mm/CAL3q7H5RNBjCi708GH7jnczAOe0BLnacT9C+OBgA-Dx9jhB6SQ@mail.gmail.com/ Fixes: 3c7be18ac9a0 ("mm: memcg/percpu: account percpu memory to memory cgroups") Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Roman Gushchin <guro@fb.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08init_on_alloc: Optimize static branchesKees Cook2-4/+6
The state of CONFIG_INIT_ON_ALLOC_DEFAULT_ON (and ...ON_FREE...) did not
change the assembly ordering of the static branches: they were always out
of line. Use the new jump_label macros to check the CONFIG settings to
default to the "expected" state, which slightly optimizes the resulting
assembly code.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20210401232347.2791257-3-keescook@chromium.org
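Roughly, the new macros let the CONFIG symbol pick which side of the
branch is generated inline; a sketch of want_init_on_alloc() using them
(the ON_FREE side is analogous, and the upstream helper may differ in
detail):

  DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);

  static inline bool want_init_on_alloc(gfp_t flags)
  {
          /* With DEFAULT_ON=y the key defaults to true and the enabled path
           * is the inline (fast) one; with DEFAULT_ON=n it stays out of line. */
          if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, &init_on_alloc))
                  return true;
          return flags & __GFP_ZERO;
  }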
2021-04-06block: remove BLK_BOUNCE_ISA supportChristoph Hellwig1-5/+4
Remove the BLK_BOUNCE_ISA support now that all users are gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210331073001.46776-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-30mm: fix race by making init_zero_pfn() early_initcallIlya Lipnitskiy1-1/+1
There are code paths that rely on zero_pfn to be fully initialized before
core_initcall. For example, wq_sysfs_init() is a core_initcall function
that eventually results in a call to kernel_execve, which causes a page
fault with a subsequent mmput. If zero_pfn is not initialized by then it
may not get cleaned up properly and result in an error:

  BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1

Here is an analysis of the race as seen on a MIPS device. On this
particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
initialized, at which point it becomes PFN 5120:

1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:

     kobject_uevent_env+0x7e4/0x7ec
     kset_register+0x68/0x88
     bus_register+0xdc/0x34c
     subsys_virtual_register+0x34/0x78
     wq_sysfs_init+0x1c/0x4c
     do_one_initcall+0x50/0x1a8
     kernel_init_freeable+0x230/0x2c8
     kernel_init+0x10/0x100
     ret_from_kernel_thread+0x14/0x1c

2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
   kernel_execve asynchronously.

3. Memory allocations in kernel_execve cause a page fault, bumping the
   MM reference counter:

     add_mm_counter_fast+0xb4/0xc0
     handle_mm_fault+0x6e4/0xea0
     __get_user_pages.part.78+0x190/0x37c
     __get_user_pages_remote+0x128/0x360
     get_arg_page+0x34/0xa0
     copy_string_kernel+0x194/0x2a4
     kernel_execve+0x11c/0x298
     call_usermodehelper_exec_async+0x114/0x194

4. In case zero_pfn has not been initialized yet, zap_pte_range does not
   decrement the MM_ANONPAGES RSS counter and the BUG message is triggered
   shortly afterwards when __mmdrop checks the ref counters:

     __mmdrop+0x98/0x1d0
     free_bprm+0x44/0x118
     kernel_execve+0x160/0x1d8
     call_usermodehelper_exec_async+0x114/0x194
     ret_from_kernel_thread+0x14/0x1c

To avoid races such as described above, initialize init_zero_pfn at
early_initcall level. Depending on the architecture, ZERO_PAGE is either
constant or gets initialized even earlier, at paging_init, so there is no
issue with initializing zero_pfn earlier.

Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: stable@vger.kernel.org
Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
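The fix itself is a one-line change of the initcall level in mm/memory.c,
roughly:

  static int __init init_zero_pfn(void)
  {
          zero_pfn = page_to_pfn(ZERO_PAGE(0));
          return 0;
  }
  early_initcall(init_zero_pfn);   /* was: core_initcall(init_zero_pfn) */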
2021-03-26arm64: Support execute-only permissions with Enhanced PANVladimir Murzin1-0/+6
Enhanced Privileged Access Never (EPAN) allows Privileged Access Never to
be used with Execute-only mappings. Absence of such support was a reason
for 24cecc377463 ("arm64: Revert support for execute-only user mappings").
Thus now it can be revisited and re-enabled.

Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210312173811.58284-2-vladimir.murzin@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
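For context, an execute-only user mapping is one that can be branched to
but not read; from userspace the request looks roughly like the sketch
below (illustration only, not part of the kernel change; make_exec_only()
is a made-up helper name):

  #include <stddef.h>
  #include <string.h>
  #include <sys/mman.h>

  static void *make_exec_only(const void *code, size_t len)
  {
          /* Stage the code through a writable mapping first. */
          void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED)
                  return NULL;
          memcpy(buf, code, len);
          /* Drop to execute-only: on an EPAN-capable arm64 kernel this really
           * means "no read"; without the feature PROT_EXEC also implied read. */
          if (mprotect(buf, len, PROT_EXEC))
                  return NULL;
          return buf;
  }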
2021-03-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds7-10/+96
Merge misc fixes from Andrew Morton:
 "14 patches.

  Subsystems affected by this patch series: mm (hugetlb, kasan, gup,
  selftests, z3fold, kfence, memblock, and highmem), squashfs, ia64,
  gcov, and mailmap"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mailmap: update Andrey Konovalov's email address
  mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
  mm: memblock: fix section mismatch warning again
  kfence: make compatible with kmemleak
  gcov: fix clang-11+ support
  ia64: fix format strings for err_inject
  ia64: mca: allocate early mca with GFP_ATOMIC
  squashfs: fix xattr id and id lookup sanity checks
  squashfs: fix inode lookup sanity checks
  z3fold: prevent reclaim/free race for headless pages
  selftests/vm: fix out-of-tree build
  mm/mmu_notifiers: ensure range_end() is paired with range_start()
  kasan: fix per-page tags for non-page_alloc pages
  hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
2021-03-25mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAPIra Weiny1-2/+2
The kernel test robot found that __kmap_local_sched_out() was not
correctly skipping the guard pages when DEBUG_KMAP_LOCAL_FORCE_MAP was
set. [1] This was due to the DEBUG_HIGHMEM check being used. Change the
configuration check to be correct.

[1] https://lore.kernel.org/lkml/20210304083825.GB17830@xsang-OptiPlex-9020/

Link: https://lkml.kernel.org/r/20210318230657.1497881-1-ira.weiny@intel.com
Fixes: 0e91a0c6984c ("mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Cc: David Sterba <dsterba@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
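A sketch of the corrected guard-slot skip, assuming the guard slots are
created by the kmap_local debugging machinery (which
DEBUG_KMAP_LOCAL_FORCE_MAP builds on) and that the skip must test that
option rather than DEBUG_HIGHMEM; the exact symbol and variable names in
the final patch may differ:

  /* With kmap_local debugging every even slot is unmapped and acts as a
   * guard; skip those instead of treating them as stale mappings. */
  if (IS_ENABLED(CONFIG_DEBUG_KMAP_LOCAL) && !(i & 0x01)) {
          WARN_ON_ONCE(!pte_none(pteval));
          continue;
  }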
2021-03-25kfence: make compatible with kmemleakMarco Elver2-1/+11
Because memblock allocations are registered with kmemleak, the KFENCE
pool was seen by kmemleak as one large object. Later allocations through
kfence_alloc() that were registered with kmemleak via
slab_post_alloc_hook() would then overlap and trigger a warning.
Therefore, once the pool is initialized, we can remove (free) it from
kmemleak again, since it should be treated as allocator-internal and be
seen as "free memory".

The second problem is that kmemleak is passed the rounded size, and not
the originally requested size, which is also the size of KFENCE objects.
To avoid kmemleak scanning past the end of an object and triggering a
KFENCE out-of-bounds error, fix the size if it is a KFENCE object. For
simplicity, to avoid a call to kfence_ksize() in slab_post_alloc_hook()
(and avoid a new IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) guard), just call
kfence_ksize() in mm/kmemleak.c:create_object().

Link: https://lkml.kernel.org/r/20210317084740.3099921-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Luis Henriques <lhenriques@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
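In sketch form, the two pieces described above (the exact call sites
upstream may differ slightly):

  /* mm/kfence/core.c (sketch): once the pool is set up, tell kmemleak the
   * region is allocator-internal so it is treated as free memory again. */
  kmemleak_free(__kfence_pool);

  /* mm/kmemleak.c:create_object() (sketch): for KFENCE-backed allocations,
   * record the real object size so scanning never walks past its end. */
  object->size = kfence_ksize((void *)ptr) ?: size;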