path: root/kernel
2020-08-25  rcu: Move rcu_cpu_started per-CPU variable to rcu_data  (Paul E. McKenney; 2 files, -7/+5)
When the rcu_cpu_started per-CPU variable was added by commit f64c6013a202 ("rcu/x86: Provide early rcu_cpu_starting() callback"), there were multiple sets of per-CPU rcu_data structures. Therefore, the rcu_cpu_started flag was added as a separate per-CPU variable. But now there is only one set of per-CPU rcu_data structures, so this commit moves rcu_cpu_started to a new ->cpu_started field in that structure. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
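[ Editorial illustration, not from the commit: a minimal sketch of the before/after shape of this change; field placement and access sites are approximate. ]

    /* Before: a standalone per-CPU flag, dating from the days of
     * multiple per-CPU rcu_data sets. */
    static DEFINE_PER_CPU(bool, rcu_cpu_started);

    /* After: one more field in the single per-CPU rcu_data structure. */
    struct rcu_data {
            /* ... existing fields ... */
            bool cpu_started;
    };
    static DEFINE_PER_CPU(struct rcu_data, rcu_data);

    /* Access changes from per_cpu(rcu_cpu_started, cpu) to:
     *         per_cpu_ptr(&rcu_data, cpu)->cpu_started          */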
2020-08-25  rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_cpu_stall_ftrace_dump  (Paul E. McKenney; 1 file, -2/+2)
Given that sysfs can change the value of rcu_cpu_stall_ftrace_dump at any time, this commit adds a READ_ONCE() to the accesses to that variable. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_kick_kthreads  (Paul E. McKenney; 1 file, -2/+2)
Given that sysfs can change the value of rcu_kick_kthreads at any time, this commit adds a READ_ONCE() to the sole access to that variable. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_resched_ns  (Paul E. McKenney; 1 file, -2/+6)
Given that sysfs can change the value of rcu_resched_ns at any time, this commit adds a READ_ONCE() to the sole access to that variable. While in the area, this commit also adds bounds checking, clamping the value to at least a millisecond, but no longer than a second. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor  (Paul E. McKenney; 1 file, -1/+4)
Given that sysfs can change the value of rcu_divisor at any time, this commit adds a READ_ONCE() to the sole access to that variable. While in the area, this commit also adds bounds checking, clamping the value to a shift that makes sense for a signed long. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
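[ Editorial illustration, not from the commits: a sketch of the pattern the four READ_ONCE() commits above apply inside rcu_do_batch(); the actual code uses ternaries rather than clamp_t(), and exact bounds are approximate. ]

    /* Read each sysfs-writable value once, then clamp it locally. */
    long div = READ_ONCE(rcu_divisor);
    long rrn = READ_ONCE(rcu_resched_ns);

    /* Keep the shift sensible for a signed long. */
    div = clamp_t(long, div, 0, sizeof(long) * 8 - 2);

    /* Clamp the resched timeout to [1 ms, 1 s]. */
    rrn = clamp_t(long, rrn, NSEC_PER_MSEC, NSEC_PER_SEC);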
2020-08-25  nocb: Remove show_rcu_nocb_state() false positive printout  (Paul E. McKenney; 1 file, -3/+2)
The rcu_data structure's ->nocb_timer field is used to defer wakeups of the corresponding no-CBs CPU's grace-period kthread ("rcuog*"), and that structure's ->nocb_defer_wakeup field is used to track such deferral. This means that it is erroneous for show_rcu_nocb_state() to print an error when those fields are set for a CPU that does not correspond to a no-CBs grace-period kthread. This commit therefore switches the check from ->nocb_timer to ->nocb_bypass_timer and removes the check of ->nocb_defer_wakeup. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu/tree: Remove CONFIG_PREEMPT_RCU check in force_qs_rnp()  (Neeraj Upadhyay; 1 file, -2/+1)
Originally, the call to rcu_preempt_blocked_readers_cgp() from force_qs_rnp() had to be conditioned on CONFIG_PREEMPT_RCU=y, as in commit a77da14ce9af ("rcu: Yet another fix for preemption and CPU hotplug"). However, there is now a CONFIG_PREEMPT_RCU=n definition of rcu_preempt_blocked_readers_cgp() that unconditionally returns zero, so invoking it is now safe. In addition, the CONFIG_PREEMPT_RCU=n definition of rcu_initiate_boost() simply releases the rcu_node structure's ->lock, which is what happens when the "if" condition evaluates to false. This commit therefore drops the IS_ENABLED(CONFIG_PREEMPT_RCU) check, so that rcu_initiate_boost() is called only in CONFIG_PREEMPT_RCU=y kernels when there are readers blocking the current grace period. This does not change the behavior, but reduces code-reader confusion by eliminating non-CONFIG_PREEMPT_RCU=y calls to rcu_initiate_boost(). Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu/tree: Force quiescent state on callback overload  (Neeraj Upadhyay; 1 file, -1/+1)
On callback overload, it is necessary to quickly detect idle CPUs, and rcu_gp_fqs_check_wake() checks for this condition. Unfortunately, the code following the call to this function does not repeat this check, which means that in reality no actual quiescent-state forcing happens; instead there are only a couple of quick and pointless wakeups at the beginning of the grace period. This commit therefore adds a check for the RCU_GP_FLAG_OVLD flag in the post-wakeup "if" statement in rcu_gp_fqs_loop(). Fixes: 1fca4d12f4637 ("rcu: Expedite first two FQS scans under callback-overload conditions") Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  nocb: Clarify RCU nocb CPU error message  (Paul E. McKenney; 1 file, -1/+1)
A message of the form "rcu: !!! lDTs ." can be tracked down, but doing so is not trivial. This commit therefore eases this process by adding text so that this error message now reads as follows: "rcu: nocb GP activity on CB-only CPU!!! lDTs ." Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu/trace: Use gp_seq_req in acceleration's rcu_grace_period tracepoint  (Joel Fernandes (Google); 1 file, -2/+3)
During callback acceleration, the rsp's gp_seq is snapshotted via rcu_seq_snap(). This is the value used for acceleration: it is the value of gp_seq at which it is safe to execute all callbacks in the callback list. The rdp's gp_seq is not very useful for this scenario. Make rcu_grace_period report the gp_seq_req instead, as it allows one to reason about how the acceleration works. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu: Initialize at declaration time in rcu_exp_handler()  (Paul E. McKenney; 1 file, -4/+2)
This commit moves the initialization of the CONFIG_PREEMPT=n version of the rcu_exp_handler() function's rdp and rnp local variables into their respective declarations to save a couple lines of code. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  srcu: Remove KCSAN stubs  (Paul E. McKenney; 1 file, -13/+0)
KCSAN is now in mainline, so this commit removes the stubs for the data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS() macros. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu: Remove KCSAN stubs from update.c  (Paul E. McKenney; 1 file, -13/+0)
KCSAN is now in mainline, so this commit removes the stubs for the data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS() macros. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  rcu: Remove KCSAN stubs  (Paul E. McKenney; 1 file, -13/+0)
KCSAN is now in mainline, so this commit removes the stubs for the data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS() macros. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  kcsan: Optimize debugfs stats counters  (Marco Elver; 4 files, -34/+23)
Remove the kcsan_counter_inc/dec() functions, as they perform no other logic and are no longer needed. This avoids several calls in kcsan_setup_watchpoint() and kcsan_found_watchpoint(), and lets the compiler warn us about potential out-of-bounds accesses, as the array's size is known at all usage sites at compile time. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  kcsan: Use pr_fmt for consistency  (Marco Elver; 2 files, -6/+10)
Use the same pr_fmt throughout for consistency. [ The only exception is report.c, where the format must be kept precisely as-is. ] Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
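[ Editorial illustration, not from the commit: the standard pr_fmt pattern; the "kcsan: " prefix matches the subsystem, the rest is a generic sketch. ]

    /* Must be defined before any #include so that every pr_*() call
     * in this file picks up the prefix. */
    #define pr_fmt(fmt) "kcsan: " fmt

    #include <linux/printk.h>

    static void announce(void)
    {
            pr_info("enabled early\n");  /* logs "kcsan: enabled early" */
    }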
2020-08-25  kcsan: Show message if enabled early  (Marco Elver; 1 file, -2/+6)
Show a message in the kernel log if KCSAN was enabled early. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  kcsan: Remove debugfs test command  (Marco Elver; 1 file, -66/+0)
Remove the debugfs test command, as it is no longer needed now that we have the KUnit+Torture based kcsan-test module. This is to avoid confusion around how KCSAN should be tested, as only the kcsan-test module is maintained. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  kcsan: Simplify constant string handling  (Marco Elver; 2 files, -6/+6)
Simplify checking prefixes and length calculation of constant strings. For the former, the kernel provides str_has_prefix(), and for the latter we should just use strlen("..."), because GCC and Clang optimize such calls on string constants into constants. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
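[ Editorial illustration, not from the commit: a sketch of the two simplifications; the buffer and prefix names are made up. ]

    #include <linux/string.h>

    static const char *skip_prefix(const char *buf)
    {
            /* str_has_prefix() returns the prefix length on a match
             * (0 otherwise), so no hand-counted magic lengths. */
            size_t len = str_has_prefix(buf, "blacklist=");

            if (!len)
                    return NULL;
            /* strlen("blacklist=") would likewise fold to a constant
             * under GCC and Clang. */
            return buf + len;
    }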
2020-08-25  kcsan: Simplify debugfs counter to name mapping  (Marco Elver; 1 file, -20/+13)
Simplify counter ID to name mapping by using an array with designated inits. This way, we can turn a run-time BUG() into a compile-time static assertion failure if a counter name is missing. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
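[ Editorial illustration, not from the commit: designated initializers plus a static assertion; the enum and name values are approximate. ]

    #include <linux/build_bug.h>
    #include <linux/kernel.h>

    enum kcsan_counter_id {
            KCSAN_COUNTER_USED_WATCHPOINTS,
            KCSAN_COUNTER_SETUP_WATCHPOINTS,
            KCSAN_COUNTER_COUNT,
    };

    static const char *const counter_names[] = {
            [KCSAN_COUNTER_USED_WATCHPOINTS]  = "used_watchpoints",
            [KCSAN_COUNTER_SETUP_WATCHPOINTS] = "setup_watchpoints",
    };

    /* A counter added without a name now breaks the build instead of
     * hitting a run-time BUG(). */
    static_assert(ARRAY_SIZE(counter_names) == KCSAN_COUNTER_COUNT);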
2020-08-25  kcsan: Test support for compound instrumentation  (Marco Elver; 1 file, -14/+51)
Change the kcsan-test module to support checking reports that include compound instrumentation. Since we should not fail the test if this support is unavailable, we have to add a config variable that the test can use to decide what to check for. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  kcsan: Add missing CONFIG_KCSAN_IGNORE_ATOMICS checks  (Marco Elver; 1 file, -8/+22)
Add missing CONFIG_KCSAN_IGNORE_ATOMICS checks for the builtin atomics instrumentation. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  kcsan: Skew delay to be longer for certain access types  (Marco Elver; 1 file, -3/+7)
For compound instrumentation and assert accesses, skew the watchpoint delay to be longer if randomized. This is useful to improve race detection for such accesses. For compound accesses we should increase the delay, as we have aggregated both read and write instrumentation: by giving up one call into the runtime, we are less likely to set up a watchpoint and thus less likely to detect a race. We can balance this by increasing the watchpoint delay. For assert accesses, we know these are of increased interest, and we wish to increase our chances of detecting races for such checks. Note that kcsan_udelay_{task,interrupt} define the upper-bound delays. When randomized, delays are uniformly distributed in [0, delay]. Skewing the delay does not break this promise as long as the defined upper bounds are still adhered to. The current skew results in delays uniformly distributed in [delay/2, delay]. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
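[ Editorial illustration, not from the commit: one way to skew a randomized delay into [delay/2, delay] while respecting the documented upper bound; the function name is made up. ]

    #include <linux/prandom.h>

    static u32 pick_delay(u32 delay, bool skew)
    {
            if (!skew)
                    return prandom_u32_max(delay + 1);  /* uniform in [0, delay] */

            /* Subtracting a value in [0, delay/2] keeps the result
             * uniform in [delay/2, delay]. */
            return delay - prandom_u32_max(delay / 2 + 1);
    }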
2020-08-25  kcsan: Support compounded read-write instrumentation  (Marco Elver; 2 files, -5/+22)
Add support for compounded read-write instrumentation if supported by the compiler. Adds the necessary instrumentation functions, and a new type which is used to generate a more descriptive report. Furthermore, such compounded memory access instrumentation is excluded from the "assume aligned writes up to word size are atomic" rule, because we cannot assume that the compiler emits code that is atomic for compound ops.

LLVM/Clang added support for the feature in: https://github.com/llvm/llvm-project/commit/785d41a261d136b64ab6c15c5d35f2adc5ad53e3

The new instrumentation is emitted for sets of memory accesses in the same basic block to the same address with at least one read appearing before a write. These typically result from compound operations such as ++, --, +=, -=, |=, &=, etc. but also equivalent forms such as "var = var + 1". Where the compiler determines that it is equivalent to emit a call to a single __tsan_read_write instead of separate __tsan_read and __tsan_write, we can then benefit from improved performance and better reporting for such access patterns.

The new reports now show that the ops are both reads and writes, for example:

    read-write to 0xffffffff90548a38 of 8 bytes by task 143 on cpu 3:
      test_kernel_rmw_array+0x45/0xa0
      access_thread+0x71/0xb0
      kthread+0x21e/0x240
      ret_from_fork+0x22/0x30

    read-write to 0xffffffff90548a38 of 8 bytes by task 144 on cpu 2:
      test_kernel_rmw_array+0x45/0xa0
      access_thread+0x71/0xb0
      kthread+0x21e/0x240
      ret_from_fork+0x22/0x30

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
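[ Editorial illustration, not from the commit: a sketch of the kind of runtime hook this adds; the flag and helper names follow KCSAN style but are approximate. ]

    /* A compound access is flagged as both a read and a write, and is
     * not assumed atomic even if aligned and word-sized. */
    void __tsan_read_write8(void *ptr)
    {
            check_access(ptr, 8, KCSAN_ACCESS_COMPOUND | KCSAN_ACCESS_WRITE);
    }

    /* Source patterns that typically emit the call above: */
    static long var;
    static void rmw_examples(void)
    {
            var++;          /* read and write of the same address */
            var = var + 1;  /* equivalent spelled-out form */
    }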
2020-08-25  kcsan: Add atomic builtin test case  (Marco Elver; 1 file, -0/+63)
Add a test case to the kcsan-test module to verify that atomic builtin instrumentation works. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-25  kcsan: Add support for atomic builtins  (Marco Elver; 1 file, -0/+110)
Some architectures (currently e.g. s390 partially) implement atomics using the compiler's atomic builtins (__atomic_*, __sync_*). To support enabling KCSAN on such architectures in future, or support experimental use of these builtins, implement support for them.

We should also avoid breaking KCSAN kernels due to use (accidental or otherwise) of atomic builtins in drivers, as has happened in the past: https://lkml.kernel.org/r/5231d2c0-41d9-6721-e15f-a7eedf3ce69e@infradead.org

The instrumentation is subtly different from regular reads/writes: TSAN instrumentation replaces the use of atomic builtins with a call into the runtime, and the runtime's job is to also execute the desired atomic operation. We rely on the __atomic_* compiler builtins, available with all KCSAN-supported compilers, to implement each TSAN atomic instrumentation function.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
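[ Editorial illustration, not from the commit: a sketch of one such TSAN atomic hook; the exact signatures in kernel/kcsan/core.c may differ. ]

    #include <linux/types.h>

    /* Unlike the plain read/write hooks, the runtime must also perform
     * the atomic operation itself, which it does via the compiler's
     * __atomic builtins. */
    u32 __tsan_atomic32_fetch_add(u32 *ptr, u32 v, int memorder)
    {
            check_access(ptr, sizeof(*ptr),
                         KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ATOMIC);
            return __atomic_fetch_add(ptr, v, memorder);
    }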
2020-08-24  treewide: Use fallthrough pseudo-keyword  (Gustavo A. R. Silva; 26 files, -41/+41)
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through markings where they are unnecessary. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
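[ Editorial illustration, not from the commit: a before/after sketch of the conversion; the function is made up. ]

    #include <linux/compiler_attributes.h>

    static int handle(int mode)
    {
            int ret = 0;

            switch (mode) {
            case 0:
                    ret = 1;
                    fallthrough;    /* replaces the old "fall through" comment */
            case 1:
                    ret += 2;
                    break;
            }
            return ret;
    }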
2020-08-23  Merge tag 'core-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds; 1 file, -1/+2)
Pull entry fix from Thomas Gleixner: "A single bug fix for the common entry code. The transcription of the x86 version messed up the reload of the syscall number from pt_regs after ptrace and seccomp which breaks syscall number rewriting"

* tag 'core-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  core/entry: Respect syscall number rewrites
2020-08-23  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds; 2 files, -3/+18)
Pull networking fixes from David Miller: "Nothing earth shattering here, lots of small fixes (f.e. missing RCU protection, bad ref counting, missing memset(), etc.) all over the place:

  1) Use get_file_rcu() in task_file iterator, from Yonghong Song.
  2) There are two ways to set remote source MAC addresses in macvlan driver, but only one of which validates things properly. Fix this. From Alvin Šipraga.
  3) Missing of_node_put() in gianfar probing, from Sumera Priyadarsini.
  4) Preserve device wanted feature bits across multiple netlink ethtool requests, from Maxim Mikityanskiy.
  5) Fix rcu_sched stall in task and task_file bpf iterators, from Yonghong Song.
  6) Avoid reset after device destroy in ena driver, from Shay Agroskin.
  7) Missing memset() in netlink policy export reallocation path, from Johannes Berg.
  8) Fix info leak in __smc_diag_dump(), from Peilin Ye.
  9) Decapsulate ECN properly for ipv6 in ipv4 tunnels, from Mark Tomlinson.
  10) Fix number of data stream negotiation in SCTP, from David Laight.
  11) Fix double free in connection tracker action module, from Alaa Hleihel.
  12) Don't allow empty NHA_GROUP attributes, from Nikolay Aleksandrov"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits)
  net: nexthop: don't allow empty NHA_GROUP
  bpf: Fix two typos in uapi/linux/bpf.h
  net: dsa: b53: check for timeout
  tipc: call rcu_read_lock() in tipc_aead_encrypt_done()
  net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow
  net: sctp: Fix negotiation of the number of data streams.
  dt-bindings: net: renesas, ether: Improve schema validation
  gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY
  hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit()
  hv_netvsc: Remove "unlikely" from netvsc_select_queue
  bpf: selftests: global_funcs: Check err_str before strstr
  bpf: xdp: Fix XDP mode when no mode flags specified
  selftests/bpf: Remove test_align leftovers
  tools/resolve_btfids: Fix sections with wrong alignment
  net/smc: Prevent kernel-infoleak in __smc_diag_dump()
  sfc: fix build warnings on 32-bit
  net: phy: mscc: Fix a couple of spelling mistakes "spcified" -> "specified"
  libbpf: Fix map index used in error message
  net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe()
  net: atlantic: Use readx_poll_timeout() for large timeout
  ...
2020-08-23  timekeeping: Provide multi-timestamp accessor to NMI safe timekeeper  (Thomas Gleixner; 1 file, -11/+65)
printk wants to store various timestamps (MONOTONIC, REALTIME, BOOTTIME) to make correlation of dmesg from several systems easier. Provide an interface to retrieve all three timestamps in one go.

There are some caveats:

1) Boot time and late sleep time injection

Boot time is a racy access on 32bit systems if the sleep time injection happens late during resume and not in timekeeping_resume(). That could be avoided by expanding struct tk_read_base with boot offset for 32bit and adding more overhead to the update. As this is a hard to observe once per resume event which can be filtered with reasonable effort using the accurate mono/real timestamps, it's probably not worth the trouble.

Aside of that it might be possible on 32 and 64 bit to observe the following when the sleep time injection happens late:

    CPU 0                           CPU 1
    timekeeping_resume()
                                    ktime_get_fast_timestamps()
                                      mono, real = __ktime_get_real_fast()
    inject_sleep_time()
      update boot offset
                                      boot = mono + bootoffset;

That means that boot time already has the sleep time adjustment, but real time does not. On the next readout both are in sync again. Preventing this for 64bit is not really feasible without destroying the careful cache layout of the timekeeper because the sequence count and struct tk_read_base would then need two cache lines instead of one.

2) Suspend/resume timestamps

Access to the timekeeper clock source is disabled across the innermost steps of suspend/resume. The accessors still work, but the timestamps are frozen until time keeping is resumed, which happens very early. For regular suspend/resume there is no observable difference vs. sched clock, but it might affect some of the nasty low level debug printks. OTOH, access to sched clock is not guaranteed across suspend/resume on all systems either, so it depends on the hardware in use. If that turns out to be a real problem then this could be mitigated by using sched clock in a similar way as during early boot. But it's not as trivial as on early boot because it needs some careful protection against the clock monotonic timestamp jumping backwards on resume.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200814115512.159981360@linutronix.de
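[ Editorial illustration, not from the commit: a hedged usage sketch of the new accessor; the struct layout matches what the change appears to add but should be checked against include/linux/timekeeping.h. ]

    #include <linux/timekeeping.h>

    static void snapshot_for_printk(void)
    {
            struct ktime_timestamps snap;

            /* NMI safe: one call yields coherent MONOTONIC, BOOTTIME
             * and REALTIME stamps. */
            ktime_get_fast_timestamps(&snap);
            pr_info("mono=%llu boot=%llu real=%llu\n",
                    snap.mono, snap.boot, snap.real);
    }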
2020-08-23  timekeeping: Utilize local_clock() for NMI safe timekeeper during early boot  (Thomas Gleixner; 1 file, -8/+25)
During early boot the NMI safe timekeeper returns 0 until the first clocksource becomes available. This prevents it from being used for printk or other facilities which today use sched clock. sched clock can be available way before timekeeping is initialized.

The obvious workaround for this is to utilize the early sched clock in the default dummy clock read function until a clocksource becomes available.

After switching to the clocksource, clock MONOTONIC and BOOTTIME will not jump because timekeeping_init() bases clock MONOTONIC on sched clock and the offset between clock MONOTONIC and BOOTTIME is zero during boot. Clock REALTIME cannot provide useful timestamps during early boot up to the point where a persistent clock becomes available, which is either in timekeeping_init() or later when the RTC driver, which might depend on I2C or other subsystems, is initialized.

There is a minor difference to sched_clock() vs. suspend/resume. As the timekeeper clock source might not be accessible during suspend, after timekeeping_suspend() timestamps freeze up to the point where timekeeping_resume() is invoked. OTOH this is true for some sched clock implementations as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200814115512.041422402@linutronix.de
2020-08-22  bpf: sockmap: Allow update from BPF  (Lorenz Bauer; 1 file, -2/+36)
Allow calling bpf_map_update_elem on sockmap and sockhash from a BPF context. The synchronization required for this is a bit fiddly: we need to prevent the socket from changing its state while we add it to the sockmap, since we rely on getting a callback via sk_prot->unhash. However, we can't just lock_sock like in sock_map_sk_acquire because that might sleep. So instead we disable softirq processing and use bh_lock_sock to prevent further modification.

Yet, this is still not enough. BPF can be called in contexts where the current CPU might have locked a socket. If the BPF program can get a hold of such a socket, inserting it into a sockmap would lead to a deadlock. One straightforward example are sock_ops programs that have ctx->sk, but the same problem exists for kprobes, etc. We deal with this by allowing sockmap updates only from known safe contexts. Improper usage is rejected by the verifier.

I've audited the enabled contexts to make sure they can't run in a locked context. It's possible that CGROUP_SKB and others are safe as well, but the auditing here is much more difficult. In any case, we can extend the safe contexts when the need arises.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-6-lmb@cloudflare.com
2020-08-22  bpf: Override the meaning of ARG_PTR_TO_MAP_VALUE for sockmap and sockhash  (Lorenz Bauer; 1 file, -0/+35)
The verifier assumes that map values are simple blobs of memory, and therefore treats ARG_PTR_TO_MAP_VALUE, etc. as such. However, there are map types where this isn't true. For example, sockmap and sockhash store sockets. In general this isn't a big problem: we can just write helpers that explicitly request PTR_TO_SOCKET instead of ARG_PTR_TO_MAP_VALUE.

The one exception is the standard map helpers like map_update_elem, map_lookup_elem, etc. Here it would be nice if we could overload the function prototype for different kinds of maps. Unfortunately, this isn't entirely straightforward: we only know the type of the map once we have resolved meta->map_ptr in check_func_arg. This means we can't swap out the prototype in check_helper_call until we're half way through the function.

Instead, modify check_func_arg to treat ARG_PTR_TO_MAP_VALUE as meaning "the native type for the map" instead of "pointer to memory" for sockmap and sockhash. This means we don't have to modify the function prototype at all.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-5-lmb@cloudflare.com
2020-08-22  bpf: sockmap: Call sock_map_update_elem directly  (Lorenz Bauer; 1 file, -2/+3)
Don't go via map->ops to call sock_map_update_elem, since we know what function to call in bpf_map_update_value. Since we currently don't allow calling map_update_elem from BPF context, we can remove ops->map_update_elem and rename the function to sock_map_update_elem_sys. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200821102948.21918-4-lmb@cloudflare.com
2020-08-22  Merge branch 'akpm' (patches from Andrew)  (Linus Torvalds; 2 files, -1/+2)
Merge misc fixes from Andrew Morton: "11 patches. Subsystems affected by this: misc, mm/hugetlb, mm/vmalloc, mm/misc, romfs, relay, uprobes, squashfs, mm/cma, mm/pagealloc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, page_alloc: fix core hung in free_pcppages_bulk()
  mm: include CMA pages in lowmem_reserve at boot
  squashfs: avoid bio_alloc() failure with 1Mbyte blocks
  uprobes: __replace_page() avoid BUG in munlock_vma_page()
  kernel/relay.c: fix memleak on destroy relay channel
  romfs: fix uninitialized memory leak in romfs_dev_read()
  mm/rodata_test.c: fix missing function declaration
  mm/vunmap: add cond_resched() in vunmap_pmd_range
  khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()
  hugetlb_cgroup: convert comma to semicolon
  mailmap: add Andi Kleen
2020-08-22  bpf: Implement link_query callbacks in map element iterators  (Yonghong Song; 1 file, -0/+15)
For bpf_map_elem and bpf_sk_local_storage bpf iterators, additional map_id should be shown for fdinfo and userspace query. For example, the following is for a bpf_map_elem iterator.

    $ cat /proc/1753/fdinfo/9
    pos:         0
    flags:       02000000
    mnt_id:      14
    link_type:   iter
    link_id:     34
    prog_tag:    104be6d3fe45e6aa
    prog_id:     173
    target_name: bpf_map_elem
    map_id:      127

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200821184419.574240-1-yhs@fb.com
2020-08-22  bpf: Implement link_query for bpf iterators  (Yonghong Song; 1 file, -0/+58)
This patch implemented the bpf_link callback functions show_fdinfo and fill_link_info to support the link_query interface. The general interface for show_fdinfo and fill_link_info will print/fill the target_name. Each target can register show_fdinfo and fill_link_info callbacks to print/fill more target-specific information. For example, the below is a fdinfo result for a bpf task iterator.

    $ cat /proc/1749/fdinfo/7
    pos:         0
    flags:       02000000
    mnt_id:      14
    link_type:   iter
    link_id:     11
    prog_tag:    990e1f8152f7e54f
    prog_id:     59
    target_name: task

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200821184418.574122-1-yhs@fb.com
2020-08-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller; 2 files, -3/+18)
Alexei Starovoitov says:

====================
pull-request: bpf 2020-08-21

The following pull-request contains BPF updates for your *net* tree.

We've added 11 non-merge commits during the last 5 day(s) which contain a total of 12 files changed, 78 insertions(+), 24 deletions(-).

The main changes are:

1) three fixes in BPF task iterator logic, from Yonghong.
2) fix for compressed dwarf sections in vmlinux, from Jiri.
3) fix xdp attach regression, from Andrii.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-21  uprobes: __replace_page() avoid BUG in munlock_vma_page()  (Hugh Dickins; 1 file, -1/+1)
syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), when called from uprobes __replace_page(). Which of many ways to fix it? Settled on not calling when PageCompound (since Head and Tail are equals in this context, PageCompound the usual check in uprobes.c, and the prior use of FOLL_SPLIT_PMD will have cleared PageMlocked already). Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008161338360.20413@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-21  kernel/relay.c: fix memleak on destroy relay channel  (Wei Yongjun; 1 file, -0/+1)
kmemleak reports a memory leak as follows:

    unreferenced object 0x607ee4e5f948 (size 8):
      comm "syz-executor.1", pid 2098, jiffies 4295031601 (age 288.468s)
      hex dump (first 8 bytes):
        00 00 00 00 00 00 00 00  ........
      backtrace:
        relay_open kernel/relay.c:583 [inline]
        relay_open+0xb6/0x970 kernel/relay.c:563
        do_blk_trace_setup+0x4a8/0xb20 kernel/trace/blktrace.c:557
        __blk_trace_setup+0xb6/0x150 kernel/trace/blktrace.c:597
        blk_trace_ioctl+0x146/0x280 kernel/trace/blktrace.c:738
        blkdev_ioctl+0xb2/0x6a0 block/ioctl.c:613
        block_ioctl+0xe5/0x120 fs/block_dev.c:1871
        vfs_ioctl fs/ioctl.c:48 [inline]
        __do_sys_ioctl fs/ioctl.c:753 [inline]
        __se_sys_ioctl fs/ioctl.c:739 [inline]
        __x64_sys_ioctl+0x170/0x1ce fs/ioctl.c:739
        do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
        entry_SYSCALL_64_after_hwframe+0x44/0xa9

'chan->buf' is allocated in relay_open() by alloc_percpu() but never freed when the relay channel is destroyed. Fix it by adding free_percpu() before returning from relay_destroy_channel().

Fixes: 017c59c042d0 ("relay: Use per CPU constructs for the relay channel buffer pointers")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David Rientjes <rientjes@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Akash Goel <akash.goel@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200817122826.48518-1-weiyongjun1@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
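[ Editorial illustration, not from the commit: a sketch of where the missing free_percpu() likely lands; the surrounding function body is approximate. ]

    static void relay_destroy_channel(struct kref *kref)
    {
            struct rchan *chan = container_of(kref, struct rchan, kref);

            /* The fix: release the per-CPU buffer pointers that
             * relay_open() obtained via alloc_percpu(). */
            free_percpu(chan->buf);
            kfree(chan);
    }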
2020-08-21  core/entry: Respect syscall number rewrites  (Thomas Gleixner; 1 file, -1/+2)
The transcription of the x86 entry code to the generic version failed to reload the syscall number from pt_regs after ptrace and seccomp have run, both of which can modify the syscall number in pt_regs. It returns the original syscall number instead, which is obviously not the right thing to do. Reload the syscall number to fix that. Fixes: 142781e108b1 ("entry: Provide generic syscall entry functionality") Reported-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Kyle Huey <me@kylehuey.com> Tested-by: Kees Cook <keescook@chromium.org> Acked-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/87blj6ifo8.fsf@nanos.tec.linutronix.de
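[ Editorial illustration, not from the commit: a sketch of the fix inside the generic syscall_trace_enter(); the surrounding code is approximate. ]

    static long syscall_trace_enter(struct pt_regs *regs, long syscall,
                                    unsigned long ti_work)
    {
            /* ... ptrace and seccomp hooks run here and may rewrite
             * the syscall number stored in regs ... */

            /* The fix: return the (possibly rewritten) number from
             * pt_regs, not the stale 'syscall' argument. */
            return syscall_get_nr(current, regs);
    }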
2020-08-20  Merge tag 'dma-mapping-5.9-1' of git://git.infradead.org/users/hch/dma-mapping  (Linus Torvalds; 2 files, -71/+87)
Pull dma-mapping fixes from Christoph Hellwig: "Fix more fallout from the dma-pool changes (Nicolas Saenz Julienne, me)" * tag 'dma-mapping-5.9-1' of git://git.infradead.org/users/hch/dma-mapping: dma-pool: Only allocate from CMA when in same memory zone dma-pool: fix coherent pool allocations for IOMMU mappings
2020-08-20  bpf: Add kernel module with user mode driver that populates bpffs  (Alexei Starovoitov; 10 files, -4/+382)
Add kernel module with user mode driver that populates bpffs with BPF iterators.

    $ mount bpffs /my/bpffs/ -t bpf
    $ ls -la /my/bpffs/
    total 4
    drwxrwxrwt  2 root root    0 Jul  2 00:27 .
    drwxr-xr-x 19 root root 4096 Jul  2 00:09 ..
    -rw-------  1 root root    0 Jul  2 00:27 maps.debug
    -rw-------  1 root root    0 Jul  2 00:27 progs.debug

The user mode driver will load BPF Type Formats, create BPF maps, populate BPF maps, load two BPF programs, attach them to BPF iterators, and finally send two bpf_link IDs back to the kernel. The kernel will pin the two bpf_links into the newly mounted bpffs instance under the names "progs.debug" and "maps.debug". These two files become human readable.

    $ cat /my/bpffs/progs.debug
      id name                attached
      11 dump_bpf_map        bpf_iter_bpf_map
      12 dump_bpf_prog       bpf_iter_bpf_prog
      27 test_pkt_access
      32 test_main           test_pkt_access test_pkt_access
      33 test_subprog1       test_pkt_access_subprog1 test_pkt_access
      34 test_subprog2       test_pkt_access_subprog2 test_pkt_access
      35 test_subprog3       test_pkt_access_subprog3 test_pkt_access
      36 new_get_skb_len     get_skb_len test_pkt_access
      37 new_get_skb_ifindex get_skb_ifindex test_pkt_access
      38 new_get_constant    get_constant test_pkt_access

The BPF program dump_bpf_prog() in iterators.bpf.c is printing this data about all BPF programs currently loaded in the system. This information is unstable and will change from kernel to kernel, as the ".debug" suffix conveys.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200819042759.51280-4-alexei.starovoitov@gmail.com
2020-08-20  bpf: Add BPF program and map iterators as built-in BPF programs  (Alexei Starovoitov; 5 files, -0/+587)
The program and map iterators work similarly to seq_file-s. Once the program is pinned in bpffs it can be read with the "cat" tool to print human readable output, in this case about BPF programs and maps. For example:

    $ cat /sys/fs/bpf/progs.debug
      id name          attached
       5 dump_bpf_map  bpf_iter_bpf_map
       6 dump_bpf_prog bpf_iter_bpf_prog

    $ cat /sys/fs/bpf/maps.debug
      id name            max_entries
       3 iterator.rodata 1

To avoid a kernel build dependency on clang 10, separate the bpf skeleton generation into a manual "make" step and instead check the generated .skel.h into git.

Unlike 'bpftool prog show', the in-kernel BTF name is used (when available) to print the full name of a BPF program instead of the 16-byte truncated name.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200819042759.51280-3-alexei.starovoitov@gmail.com
2020-08-20  bpf: Factor out bpf_link_by_id() helper  (Alexei Starovoitov; 1 file, -18/+28)
Refactor the code a bit to extract bpf_link_by_id() helper. It's similar to existing bpf_prog_by_id(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200819042759.51280-2-alexei.starovoitov@gmail.com
2020-08-20  fork: introduce kernel_clone()  (Christian Brauner; 1 file, -8/+8)
The old _do_fork() helper doesn't follow the naming conventions of in-kernel helpers for syscalls. The process creation cleanup in [1] didn't change the name to something more reasonable, mainly because _do_fork() was used in quite a few places. So sending this as a separate series seemed the better strategy.

This commit does two things:

1. Renames _do_fork() to kernel_clone() but keeps _do_fork() as a simple static inline wrapper around kernel_clone().

2. Changes the return type from long to pid_t. This aligns kernel_thread() and kernel_clone(). Also, the return value from kernel_clone() that is surfaced in fork(), vfork(), clone(), and clone3() is taken from pid_vnr(), which returns a pid_t too.

Follow-up patches will switch each caller of _do_fork() and each place where it is referenced over to kernel_clone(). After all these changes are done, we can remove _do_fork() completely and will only be left with kernel_clone().

[1]: 9ba27414f2ec ("Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux")

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200819104655.436656-2-christian.brauner@ubuntu.com
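[ Editorial illustration, not from the commit: a sketch of the transitional shim described above; the argument type matches the modern clone path. ]

    pid_t kernel_clone(struct kernel_clone_args *args);

    /* Kept only until all callers are converted, then removed. */
    static inline long _do_fork(struct kernel_clone_args *args)
    {
            return kernel_clone(args);
    }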
2020-08-19  sched/topology: Mark SD_PREFER_SIBLING as SDF_NEEDS_GROUPS  (Valentin Schneider; 1 file, -1/+1)
SD_PREFER_SIBLING is currently considered in sd_parent_degenerate() but not in sd_degenerate(). It too hinges on load balancing, and thus won't have any effect when set on a domain with a single group. Add it to SD_DEGENERATE_GROUPS_MASK. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20200817113003.20802-12-valentin.schneider@arm.com
2020-08-19  sched/topology: Propagate SD_ASYM_CPUCAPACITY upwards  (Valentin Schneider; 1 file, -2/+1)
We currently set this flag *only* on domains whose topology level exactly match the level where we detect asymmetry (as returned by asym_cpu_capacity_level()). This is rather problematic.

Say there are two clusters in the system, one with a lone big CPU and the other with a mix of big and LITTLE CPUs (as is allowed by DynamIQ):

    DIE [                  ]
    MC  [              ][  ]
         0   1   2   3   4
         L   L   B   B   B

asym_cpu_capacity_level() will figure out that the MC level is the one where all CPUs can see a CPU of max capacity, and we will thus set SD_ASYM_CPUCAPACITY at MC level for all CPUs. That lone big CPU will degenerate its MC domain, since it would be alone in there, and will end up with just a DIE domain. Since the flag was only set at MC, this CPU ends up not seeing any SD with the flag set, which is broken.

Rather than clearing dflags at every topology level, clear it before entering the topology level loop. This will properly propagate upwards flags that are set starting from a certain level.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200817113003.20802-11-valentin.schneider@arm.com
2020-08-19  sched/topology: Remove SD_SERIALIZE degeneration special case  (Valentin Schneider; 1 file, -4/+2)
If there is only a single NUMA node in the system, the only NUMA topology level that will be generated will be NODE (identity distance), which doesn't have SD_SERIALIZE. This means we don't need this special case in sd_parent_degenerate(), as having the NODE level "naturally" covers it. Thus, remove it. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20200817113003.20802-10-valentin.schneider@arm.com
2020-08-19  sched/topology: Use prebuilt SD flag degeneration mask  (Valentin Schneider; 1 file, -16/+4)
Leverage SD_DEGENERATE_GROUPS_MASK in sd_degenerate() and sd_parent_degenerate(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20200817113003.20802-9-valentin.schneider@arm.com
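[ Editorial illustration, not from the commit: a sketch of how a prebuilt mask collapses the per-flag checks; compare with the actual sd_degenerate() in kernel/sched/topology.c. ]

    static int sd_degenerate(struct sched_domain *sd)
    {
            if (cpumask_weight(sched_domain_span(sd)) == 1)
                    return 1;

            /* All flags needing more than one group, tested in one go. */
            if ((sd->flags & SD_DEGENERATE_GROUPS_MASK) &&
                (sd->groups != sd->groups->next))
                    return 0;

            /* Flags that don't use groups at all. */
            if (sd->flags & (SD_WAKE_AFFINE))
                    return 0;

            return 1;
    }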