summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-03-15x86/livepatch: Validate __fentry__ locationPeter Zijlstra1-17/+2
Currently livepatch assumes __fentry__ lives at func+0, which is most likely untrue with IBT on. Instead make it use ftrace_location() by default which both validates and finds the actual ip if there is any in the same symbol. Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220308154318.285971256@infradead.org
2022-03-15x86/ibt,ftrace: Search for __fentry__ locationPeter Zijlstra3-30/+46
Currently a lot of ftrace code assumes __fentry__ is at sym+0. However with Intel IBT enabled the first instruction of a function will most likely be ENDBR. Change ftrace_location() to not only return the __fentry__ location when called for the __fentry__ location, but also when called for the sym+0 location. Then audit/update all callsites of this function to consistently use these new semantics. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220308154318.227581603@infradead.org
2022-03-15Merge tag 'v5.17-rc8' into sched/core, to pick up fixesIngo Molnar25-107/+269
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15Merge branch 'sched/fast-headers' into sched/coreIngo Molnar33-163/+369
Merge the scheduler build speedup of the fast-headers tree. Cumulative scheduler (kernel/sched/) build time speedup on a Linux distribution's config, which enables all scheduler features, compared to the vanilla kernel: _____________________________________________________________________________ | | Vanilla kernel (v5.13-rc7): |_____________________________________________________________________________ | | Performance counter stats for 'make -j96 kernel/sched/' (3 runs): | | 126,975,564,374 instructions # 1.45 insn per cycle ( +- 0.00% ) | 87,637,847,671 cycles # 3.959 GHz ( +- 0.30% ) | 22,136.96 msec cpu-clock # 7.499 CPUs utilized ( +- 0.29% ) | | 2.9520 +- 0.0169 seconds time elapsed ( +- 0.57% ) |_____________________________________________________________________________ | | Patched kernel: |_____________________________________________________________________________ | | Performance counter stats for 'make -j96 kernel/sched/' (3 runs): | | 50,420,496,914 instructions # 1.47 insn per cycle ( +- 0.00% ) | 34,234,322,038 cycles # 3.946 GHz ( +- 0.31% ) | 8,675.81 msec cpu-clock # 3.053 CPUs utilized ( +- 0.45% ) | | 2.8420 +- 0.0181 seconds time elapsed ( +- 0.64% ) |_____________________________________________________________________________ Summary: - CPU time used to build the scheduler dropped by -60.9%, a reduction from 22.1 clock-seconds to 8.7 clock-seconds. - Wall-clock time to build the scheduler dropped by -3.9%, a reduction from 2.95 seconds to 2.84 seconds. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-14Merge tag 'v5.17-rc8' into irq/core, to fix conflictsIngo Molnar25-107/+269
Conflicts: drivers/pinctrl/pinctrl-starfive.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-14Merge tag 'irqchip-5.18' of ↵Thomas Gleixner3-17/+29
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip updates from Marc Zyngier: - Add support for the STM32MP13 variant - Move parent device away from struct irq_chip - Remove all instances of non-const strings assigned to struct irq_chip::name, enabling a nice cleanup for VIC and GIC) - Simplify the Qualcomm PDC driver - A bunch of SiFive PLIC cleanups - Add support for a new variant of the Meson GPIO block - Add support for the irqchip side of the Apple M1 PMU - Add support for the Apple M1 Pro/Max AICv2 irqchip - Add support for the Qualcomm MPM wakeup gadget - Move the Xilinx driver over to the generic irqdomain handling - Tiny speedup for IPIs on GICv3 systems - The usual odd cleanups Link: https://lore.kernel.org/all/20220313105142.704579-1-maz@kernel.org
2022-03-14cgroup: cleanup commentsTom Rix2-6/+6
for spdx, add a space before // replacements judgement to judgment transofrmed to transformed partitition to partition histrical to historical migratecd to migrated Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-03-12tracing/user_events: Use alloc_pages instead of kzalloc() for register pagesSteven Rostedt (Google)1-6/+8
kzalloc virtual addresses do not work with SetPageReserved, use the actual page virtual addresses instead via alloc_pages. The issue is reported when booting with user_events and DEBUG_VM_PGFLAGS=y. Also make the number of events based on the ORDER. Link: https://lore.kernel.org/all/CADYN=9+xY5Vku3Ws5E9S60SM5dCFfeGeRBkmDFbcxX0ZMoFing@mail.gmail.com/ Link: https://lore.kernel.org/all/20220311223028.1865-1-beaub@linux.microsoft.com/ Cc: Beau Belgrave <beaub@linux.microsoft.com> Reported-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11Merge branch 'davidh' (fixes from David Howells)Linus Torvalds1-11/+11
Merge misc fixes from David Howells: "A set of patches for watch_queue filter issues noted by Jann. I've added in a cleanup patch from Christophe Jaillet to convert to using formal bitmap specifiers for the note allocation bitmap. Also two filesystem fixes (afs and cachefiles)" * emailed patches from David Howells <dhowells@redhat.com>: cachefiles: Fix volume coherency attribute afs: Fix potential thrashing in afs writeback watch_queue: Make comment about setting ->defunct more accurate watch_queue: Fix lack of barrier/sync/lock between post and read watch_queue: Free the alloc bitmap when the watch_queue is torn down watch_queue: Fix the alloc bitmap size to reflect notes allocated watch_queue: Use the bitmap API when applicable watch_queue: Fix to always request a pow-of-2 pipe ring size watch_queue: Fix to release page in ->release() watch_queue, pipe: Free watchqueue state after clearing pipe ring watch_queue: Fix filter limit check
2022-03-11watch_queue: Make comment about setting ->defunct more accurateDavid Howells1-1/+1
watch_queue_clear() has a comment stating that setting ->defunct to true preventing new additions as well as preventing notifications. Whilst the latter is true, the first bit is superfluous since at the time this function is called, the pipe cannot be accessed to add new event sources. Remove the "new additions" bit from the comment. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11watch_queue: Fix lack of barrier/sync/lock between post and readDavid Howells1-1/+1
There's nothing to synchronise post_one_notification() versus pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the reader only takes pipe->mutex which cannot bar notification posting as that may need to be made from contexts that cannot sleep. Fix this by setting pipe->head with a barrier in post_one_notification() and reading pipe->head with a barrier in pipe_read(). If that's not sufficient, the rd_wait.lock will need to be taken, possibly in a ->confirm() op so that it only applies to notifications. The lock would, however, have to be dropped before copy_page_to_iter() is invoked. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11watch_queue: Free the alloc bitmap when the watch_queue is torn downDavid Howells1-0/+1
Free the watch_queue note allocation bitmap when the watch_queue is destroyed. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11watch_queue: Fix the alloc bitmap size to reflect notes allocatedDavid Howells1-1/+2
Currently, watch_queue_set_size() sets the number of notes available in wqueue->nr_notes according to the number of notes allocated, but sets the size of the bitmap to the unrounded number of notes originally asked for. Fix this by setting the bitmap size to the number of notes we're actually going to make available (ie. the number allocated). Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11watch_queue: Use the bitmap API when applicableChristophe JAILLET1-5/+2
Use bitmap_alloc() to simplify code, improve the semantic and reduce some open-coded arithmetic in allocator arguments. Also change a memset(0xff) into an equivalent bitmap_fill() to keep consistency. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11watch_queue: Fix to always request a pow-of-2 pipe ring sizeDavid Howells1-1/+1
The pipe ring size must always be a power of 2 as the head and tail pointers are masked off by AND'ing with the size of the ring - 1. watch_queue_set_size(), however, lets you specify any number of notes between 1 and 511. This number is passed through to pipe_resize_ring() without checking/forcing its alignment. Fix this by rounding the number of slots required up to the nearest power of two. The request is meant to guarantee that at least that many notifications can be generated before the queue is full, so rounding down isn't an option, but, alternatively, it may be better to give an error if we aren't allowed to allocate that much ring space. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11watch_queue: Fix to release page in ->release()David Howells1-0/+1
When a pipe ring descriptor points to a notification message, the refcount on the backing page is incremented by the generic get function, but the release function, which marks the bitmap, doesn't drop the page ref. Fix this by calling generic_pipe_buf_release() at the end of watch_queue_pipe_buf_release(). Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11watch_queue: Fix filter limit checkDavid Howells1-2/+2
In watch_queue_set_filter(), there are a couple of places where we check that the filter type value does not exceed what the type_filter bitmap can hold. One place calculates the number of bits by: if (tf[i].type >= sizeof(wfilter->type_filter) * 8) which is fine, but the second does: if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG) which is not. This can lead to a couple of out-of-bounds writes due to a too-large type: (1) __set_bit() on wfilter->type_filter (2) Writing more elements in wfilter->filters[] than we allocated. Fix this by just using the proper WATCH_TYPE__NR instead, which is the number of types we actually know about. The bug may cause an oops looking something like: BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740 Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611 ... Call Trace: <TASK> dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x150 ... kasan_report.cold+0x7f/0x11b ... watch_queue_set_filter+0x659/0x740 ... __x64_sys_ioctl+0x127/0x190 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Allocated by task 611: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 watch_queue_set_filter+0x23a/0x740 __x64_sys_ioctl+0x127/0x190 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88800d2c66a0 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 28 bytes inside of 32-byte region [ffff88800d2c66a0, ffff88800d2c66c0) Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11tracing: Add snapshot at end of kernel boot upSteven Rostedt (Google)2-0/+20
Add ftrace_boot_snapshot kernel parameter that will take a snapshot at the end of boot up just before switching over to user space (it happens during the kernel freeing of init memory). This is useful when there's interesting data that can be collected from kernel start up, but gets overridden by user space start up code. With this option, the ring buffer content from the boot up traces gets saved in the snapshot at the end of boot up. This trace can be read from: /sys/kernel/tracing/snapshot Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11tracing: Have TRACE_DEFINE_ENUM affect trace event types as wellSteven Rostedt (Google)1-0/+28
The macro TRACE_DEFINE_ENUM is used to convert enums in the kernel to their actual value when they are exported to user space via the trace event format file. Currently only the enums in the "print fmt" (TP_printk in the TRACE_EVENT macro) have the enums converted. But the enums can be used to denote array size: field:unsigned int fc_ineligible_rc[EXT4_FC_REASON_MAX]; offset:12; size:36; signed:0; The EXT4_FC_REASON_MAX has no meaning to userspace but it needs to know that information to know how to parse the array. Have the array indexes also be parsed as well. Link: https://lore.kernel.org/all/cover.1646922487.git.riteshh@linux.ibm.com/ Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11tracing: Fix strncpy warning in trace_events_synth.cTom Zanussi1-4/+1
0-day reported the strncpy error below: ../kernel/trace/trace_events_synth.c: In function 'last_cmd_set': ../kernel/trace/trace_events_synth.c:65:9: warning: 'strncpy' specified bound depends on the length o\ f the source argument [-Wstringop-truncation] 65 | strncpy(last_cmd, str, strlen(str) + 1); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../kernel/trace/trace_events_synth.c:65:32: note: length computed here 65 | strncpy(last_cmd, str, strlen(str) + 1); | ^~~~~~~~~~~ There's no reason to use strncpy here, in fact there's no reason to do anything but a simple kstrdup() (note we don't even need to check for failure since last_cmod is expected to be either the last cmd string or NULL, and the containing function is a void return). Link: https://lkml.kernel.org/r/77deca8cbfd226981b3f1eab203967381e9b5bd9.camel@kernel.org Fixes: 27c888da9867 ("tracing: Remove size restriction on synthetic event cmd error logging") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11user_events: Prevent dyn_event delete racing with ioctl add/deleteBeau Belgrave1-6/+40
Find user_events always while under the event_mutex and before leaving the lock, add a ref count to the user_event. This ensures that all paths under the event_mutex that check the ref counts will be synchronized. The ioctl add/delete paths are protected by the reg_mutex. However, dyn_event is only protected by the event_mutex. The dyn_event delete path cannot acquire reg_mutex, since that could cause a deadlock between the ioctl delete case acquiring event_mutex after acquiring the reg_mutex. Link: https://lkml.kernel.org/r/20220310001141.1660-1-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11bpf-lsm: Make bpf_lsm_kernel_read_file() as sleepableRoberto Sassu1-0/+1
Make bpf_lsm_kernel_read_file() as sleepable, so that bpf_ima_inode_hash() or bpf_ima_file_hash() can be called inside the implementation of this hook. Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220302111404.193900-8-roberto.sassu@huawei.com
2022-03-11bpf-lsm: Introduce new helper bpf_ima_file_hash()Roberto Sassu1-0/+20
ima_file_hash() has been modified to calculate the measurement of a file on demand, if it has not been already performed by IMA or the measurement is not fresh. For compatibility reasons, ima_inode_hash() remains unchanged. Keep the same approach in eBPF and introduce the new helper bpf_ima_file_hash() to take advantage of the modified behavior of ima_file_hash(). Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220302111404.193900-4-roberto.sassu@huawei.com
2022-03-11Merge tag 'trace-v5.17-rc6' of ↵Linus Torvalds2-2/+33
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Minor tracing fixes: - Fix unregistering the same event twice. A user could disable the same event that osnoise will disable on unregistering. - Inform RCU of a quiescent state in the osnoise testing thread. - Fix some kerneldoc comments" * tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix some W=1 warnings in kernel doc comments tracing/osnoise: Force quiescent states while tracing tracing/osnoise: Do not unregister events twice
2022-03-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-32/+61
net/dsa/dsa2.c commit afb3cc1a397d ("net: dsa: unlock the rtnl_mutex when dsa_master_setup() fails") commit e83d56537859 ("net: dsa: replay master state events in dsa_tree_{setup,teardown}_master") https://lore.kernel.org/all/20220307101436.7ae87da0@canb.auug.org.au/ drivers/net/ethernet/intel/ice/ice.h commit 97b0129146b1 ("ice: Fix error with handling of bonding MTU") commit 43113ff73453 ("ice: add TTY for GNSS module for E810T device") https://lore.kernel.org/all/20220310112843.3233bcf1@canb.auug.org.au/ drivers/staging/gdm724x/gdm_lte.c commit fc7f750dc9d1 ("staging: gdm724x: fix use after free in gdm_lte_rx()") commit 4bcc4249b4cf ("staging: Use netif_rx().") https://lore.kernel.org/all/20220308111043.1018a59d@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-11tracehook: Remove tracehook.hEric W. Biederman4-4/+3
Now that all of the definitions have moved out of tracehook.h into ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in tracehook.h so remove it. Update the few files that were depending upon tracehook.h to bring in definitions to use the headers they need directly. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-11resume_user_mode: Move to resume_user_mode.hEric W. Biederman3-4/+4
Move set_notify_resume and tracehook_notify_resume into resume_user_mode.h. While doing that rename tracehook_notify_resume to resume_user_mode_work. Update all of the places that included tracehook.h for these functions to include resume_user_mode.h instead. Update all of the callers of tracehook_notify_resume to call resume_user_mode_work. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-12-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-11task_work: Decouple TIF_NOTIFY_SIGNAL and task_workEric W. Biederman1-2/+5
There are a small handful of reasons besides pending signals that the kernel might want to break out of interruptible sleeps. The flag TIF_NOTIFY_SIGNAL and the helpers that set and clear TIF_NOTIFY_SIGNAL provide that the infrastructure for breaking out of interruptible sleeps and entering the return to user space slow path for those cases. Expand tracehook_notify_signal inline in it's callers and remove it, which makes clear that TIF_NOTIFY_SIGNAL and task_work are separate concepts. Update the comment on set_notify_signal to more accurately describe it's purpose. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-9-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-11task_work: Call tracehook_notify_signal from get_signal on all architecturesEric W. Biederman3-22/+6
Always handle TIF_NOTIFY_SIGNAL in get_signal. With commit 35d0b389f3b2 ("task_work: unconditionally run task_work from get_signal()") always calling task_work_run all of the work of tracehook_notify_signal is already happening except clearing TIF_NOTIFY_SIGNAL. Factor clear_notify_signal out of tracehook_notify_signal and use it in get_signal so that get_signal only needs one call of task_work_run. To keep the semantics in sync update xfer_to_guest_mode_work (which does not call get_signal) to call tracehook_notify_signal if either _TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-8-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-11tracing: Allow custom events to be added to the tracefs directorySteven Rostedt (Google)1-0/+2
Allow custom events to be added to the events directory in the tracefs file system. For example, a module could be installed that attaches to an event and wants to be enabled and disabled via the tracefs file system. It would use trace_add_event_call() to add the event to the tracefs directory, and trace_remove_event_call() to remove it. Make both those functions EXPORT_SYMBOL_GPL(). Link: https://lkml.kernel.org/r/20220303220625.186988045@goodmis.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11tracing: Fix last_cmd_set() string management in histogram codeSteven Rostedt (Google)1-2/+4
Using strnlen(dest, str, n) is confusing, as the size of dest must be strlen(dest) + n + 1. Even more confusing, using sizeof(string constant) gives you strlen(string constant) + 1 and not just strlen(). These two together made using strncat() with a constant string a bit off in the calculations as we have: len = sizeof(HIST_PREFIX) + strlen(str) + 1; kfree(last_cmd); last_cmd = kzalloc(len, GFP_KERNEL); strcpy(last_cmd, HIST_PREFIX); len -= sizeof(HIST_PREFIX) + 1; strncat(last_cmd, str, len); The above works if we s/sizeof/strlen/ with HIST_PREFIX (which is defined as "hist:", but because sizeof(HIST_PREFIX) is equal to strlen(HIST_PREFIX) + 1, we can drop the +1 in the code. But at least comment that we are doing so. Link: https://lore.kernel.org/all/202203082112.Iu7tvFl4-lkp@intel.com/ Fixes: 9f8e5aee93ed2 ("tracing: Fix allocation of last_cmd in last_cmd_set()") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11user_events: Fix potential uninitialized pointer while parsing fieldBeau Belgrave1-1/+3
Ensure name is initialized by default to NULL to prevent possible edge cases that could lead to it being left uninitialized. Add an explicit check for NULL name to ensure edge boundaries. Link: https://lore.kernel.org/bpf/20220224105334.GA2248@kili/ Link: https://lore.kernel.org/linux-trace-devel/20220224181637.2129-1-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-11bpf: Use offsetofend() to simplify macro definitionYuntao Wang1-2/+1
Use offsetofend() instead of offsetof() + sizeof() to simplify MIN_BPF_LINEINFO_SIZE macro definition. Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/bpf/20220310161518.534544-1-ytcoode@gmail.com
2022-03-10task_work: Introduce task_work_pendingEric W. Biederman2-3/+3
Wrap the test of task->task_works in a helper function to make it clear what is being tested. All of the other readers of task->task_work use READ_ONCE and this is even necessary on current as other processes can update task->task_work. So for consistency I have added READ_ONCE into task_work_pending. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-7-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10task_work: Remove unnecessary include from posix_timers.hEric W. Biederman1-0/+1
Break a header file circular dependency by removing the unnecessary include of task_work.h from posix_timers.h. sched.h -> posix-timers.h posix-timers.h -> task_work.h task_work.h -> sched.h Add missing includes of task_work.h to: arch/x86/mm/tlb.c kernel/time/posix-cpu-timers.c Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-6-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10ptrace: Remove tracehook_signal_handlerEric W. Biederman1-1/+2
The two line function tracehook_signal_handler is only called from signal_delivered. Expand it inline in signal_delivered and remove it. Just to make it easier to understand what is going on. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-5-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10ptrace: Remove arch_syscall_{enter,exit}_tracehookEric W. Biederman1-2/+2
These functions are alwasy one-to-one wrappers around ptrace_report_syscall_entry and ptrace_report_syscall_exit. So directly call the functions they are wrapping instead. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-4-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.hEric W. Biederman1-0/+1
Rename tracehook_report_syscall_{entry,exit} to ptrace_report_syscall_{entry,exit} and place them in ptrace.h There is no longer any generic tracehook infractructure so make these ptrace specific functions ptrace specific. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-3-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-10dma-mapping: benchmark: extract a common header file for map_benchmark ↵Tian Tao1-24/+1
definition kernel/dma/map_benchmark.c and selftests/dma/dma_map_benchmark.c have duplicate map_benchmark definitions, which tends to lead to inconsistent changes to map_benchmark on both sides, extract a common header file to avoid this problem. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Acked-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-03-10bpf: Add "live packet" mode for XDP in BPF_PROG_RUNToke Høiland-Jørgensen2-1/+2
This adds support for running XDP programs through BPF_PROG_RUN in a mode that enables live packet processing of the resulting frames. Previous uses of BPF_PROG_RUN for XDP returned the XDP program return code and the modified packet data to userspace, which is useful for unit testing of XDP programs. The existing BPF_PROG_RUN for XDP allows userspace to set the ingress ifindex and RXQ number as part of the context object being passed to the kernel. This patch reuses that code, but adds a new mode with different semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES flag. When running BPF_PROG_RUN in this mode, the XDP program return codes will be honoured: returning XDP_PASS will result in the frame being injected into the networking stack as if it came from the selected networking interface, while returning XDP_TX and XDP_REDIRECT will result in the frame being transmitted out that interface. XDP_TX is translated into an XDP_REDIRECT operation to the same interface, since the real XDP_TX action is only possible from within the network drivers themselves, not from the process context where BPF_PROG_RUN is executed. Internally, this new mode of operation creates a page pool instance while setting up the test run, and feeds pages from that into the XDP program. The setup cost of this is amortised over the number of repetitions specified by userspace. To support the performance testing use case, we further optimise the setup step so that all pages in the pool are pre-initialised with the packet data, and pre-computed context and xdp_frame objects stored at the start of each page. This makes it possible to entirely avoid touching the page content on each XDP program invocation, and enables sending up to 9 Mpps/core on my test box. Because the data pages are recycled by the page pool, and the test runner doesn't re-initialise them for each run, subsequent invocations of the XDP program will see the packet data in the state it was after the last time it ran on that particular page. This means that an XDP program that modifies the packet before redirecting it has to be careful about which assumptions it makes about the packet content, but that is only an issue for the most naively written programs. Enabling the new flag is only allowed when not setting ctx_out and data_out in the test specification, since using it means frames will be redirected somewhere else, so they can't be returned. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220309105346.100053-2-toke@redhat.com
2022-03-09xfs: don't generate selinux audit messages for capability testingDarrick J. Wong1-0/+1
There are a few places where we test the current process' capability set to decide if we're going to be more or less generous with resource acquisition for a system call. If the process doesn't have the capability, we can continue the call, albeit in a degraded mode. These are /not/ the actual security decisions, so it's not proper to use capable(), which (in certain selinux setups) causes audit messages to get logged. Switch them to has_capability_noaudit. Fixes: 7317a03df703f ("xfs: refactor inode ownership change transaction/inode/quota allocation idiom") Fixes: ea9a46e1c4925 ("xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2022-03-09ftrace: Fix some W=1 warnings in kernel doc commentsJiapeng Chong1-2/+2
Clean up the following clang-w1 warning: kernel/trace/ftrace.c:7827: warning: Function parameter or member 'ops' not described in 'unregister_ftrace_function'. kernel/trace/ftrace.c:7805: warning: Function parameter or member 'ops' not described in 'register_ftrace_function'. Link: https://lkml.kernel.org/r/20220307004303.26399-1-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-09tracing/osnoise: Force quiescent states while tracingNicolas Saenz Julienne1-0/+20
At the moment running osnoise on a nohz_full CPU or uncontested FIFO priority and a PREEMPT_RCU kernel might have the side effect of extending grace periods too much. This will entice RCU to force a context switch on the wayward CPU to end the grace period, all while introducing unwarranted noise into the tracer. This behaviour is unavoidable as overly extending grace periods might exhaust the system's memory. This same exact problem is what extended quiescent states (EQS) were created for, conversely, rcu_momentary_dyntick_idle() emulates them by performing a zero duration EQS. So let's make use of it. In the common case rcu_momentary_dyntick_idle() is fairly inexpensive: atomically incrementing a local per-CPU counter and doing a store. So it shouldn't affect osnoise's measurements (which has a 1us granularity), so we'll call it unanimously. The uncommon case involve calling rcu_momentary_dyntick_idle() after having the osnoise process: - Receive an expedited quiescent state IPI with preemption disabled or during an RCU critical section. (activates rdp->cpu_no_qs.b.exp code-path). - Being preempted within in an RCU critical section and having the subsequent outermost rcu_read_unlock() called with interrupts disabled. (t->rcu_read_unlock_special.b.blocked code-path). Neither of those are possible at the moment, and are unlikely to be in the future given the osnoise's loop design. On top of this, the noise generated by the situations described above is unavoidable, and if not exposed by rcu_momentary_dyntick_idle() will be eventually seen in subsequent rcu_read_unlock() calls or schedule operations. Link: https://lkml.kernel.org/r/20220307180740.577607-1-nsaenzju@redhat.com Cc: stable@vger.kernel.org Fixes: bce29ac9ce0b ("trace: Add osnoise tracer") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-09tracing/osnoise: Do not unregister events twiceDaniel Bristot de Oliveira1-0/+11
Nicolas reported that using: # trace-cmd record -e all -M 10 -p osnoise --poll Resulted in the following kernel warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1217 at kernel/tracepoint.c:404 tracepoint_probe_unregister+0x280/0x370 [...] CPU: 0 PID: 1217 Comm: trace-cmd Not tainted 5.17.0-rc6-next-20220307-nico+ #19 RIP: 0010:tracepoint_probe_unregister+0x280/0x370 [...] CR2: 00007ff919b29497 CR3: 0000000109da4005 CR4: 0000000000170ef0 Call Trace: <TASK> osnoise_workload_stop+0x36/0x90 tracing_set_tracer+0x108/0x260 tracing_set_trace_write+0x94/0xd0 ? __check_object_size.part.0+0x10a/0x150 ? selinux_file_permission+0x104/0x150 vfs_write+0xb5/0x290 ksys_write+0x5f/0xe0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7ff919a18127 [...] ---[ end trace 0000000000000000 ]--- The warning complains about an attempt to unregister an unregistered tracepoint. This happens on trace-cmd because it first stops tracing, and then switches the tracer to nop. Which is equivalent to: # cd /sys/kernel/tracing/ # echo osnoise > current_tracer # echo 0 > tracing_on # echo nop > current_tracer The osnoise tracer stops the workload when no trace instance is actually collecting data. This can be caused both by disabling tracing or disabling the tracer itself. To avoid unregistering events twice, use the existing trace_osnoise_callback_enabled variable to check if the events (and the workload) are actually active before trying to deactivate them. Link: https://lore.kernel.org/all/c898d1911f7f9303b7e14726e7cc9678fbfb4a0e.camel@redhat.com/ Link: https://lkml.kernel.org/r/938765e17d5a781c2df429a98f0b2e7cc317b022.1646823913.git.bristot@kernel.org Cc: stable@vger.kernel.org Cc: Marcelo Tosatti <mtosatti@redhat.com> Fixes: 2fac8d6486d5 ("tracing/osnoise: Allow multiple instances of the same tracer") Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-03-08prlimit: do not grab the tasklist_lockBarret Rhoden2-14/+23
Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck for workloads that also must grab the tasklist_lock for waiting, killing, and cloning. The tasklist_lock was grabbed to protect tsk->sighand from disappearing (becoming NULL). tsk->signal was already protected by holding a reference to tsk. update_rlimit_cpu() assumed tsk->sighand != NULL. With this commit, it attempts to lock_task_sighand(). However, this means that update_rlimit_cpu() can fail. This only happens when a task is exiting. Note that during exec, sighand may *change*, but it will not be NULL. Prior to this commit, the do_prlimit() ensured that update_rlimit_cpu() would not fail by read locking the tasklist_lock and checking tsk->sighand != NULL. If update_rlimit_cpu() fails, there may be other tasks that are not exiting that share tsk->signal. However, the group_leader is the last task to be released, so if we cannot update_rlimit_cpu(group_leader), then the entire process is exiting. The only other caller of update_rlimit_cpu() is selinux_bprm_committing_creds(). It has tsk == current, so update_rlimit_cpu() cannot fail (current->sighand cannot disappear until current exits). This change resulted in a 14% speedup on a microbenchmark where parents kill and wait on their children, and children getpriority, setpriority, and getrlimit. Signed-off-by: Barret Rhoden <brho@google.com> Link: https://lkml.kernel.org/r/20220106172041.522167-4-brho@google.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-03-08prlimit: make do_prlimit() staticBarret Rhoden1-57/+59
There are no other callers in the kernel. Fixed up a comment format and whitespace issue when moving do_prlimit() higher in sys.c. Signed-off-by: Barret Rhoden <brho@google.com> Link: https://lkml.kernel.org/r/20220106172041.522167-3-brho@google.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2022-03-08sched/topology: Remove redundant variable and fix incorrect type in ↵K Prateek Nayak1-5/+3
build_sched_domains While investigating the sparse warning reported by the LKP bot [1], observed that we have a redundant variable "top" in the function build_sched_domains that was introduced in the recent commit e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs") The existing variable "sd" suffices which allows us to remove the redundant variable "top" while annotating the other variable "top_p" with the "__rcu" annotation to silence the sparse warning. [1] https://lore.kernel.org/lkml/202202170853.9vofgC3O-lkp@intel.com/ Fixes: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20220218162743.1134-1-kprateek.nayak@amd.com
2022-03-08sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()Dietmar Eggemann2-6/+4
The `struct rq *rq` parameter isn't used. Remove it. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-7-dietmar.eggemann@arm.com
2022-03-08sched/deadline,rt: Remove unused functions for !CONFIG_SMPDietmar Eggemann2-20/+0
The need_pull_[rt|dl]_task() and pull_[rt|dl]_task() functions are not used on a !CONFIG_SMP system. Remove them. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-6-dietmar.eggemann@arm.com
2022-03-08sched/deadline: Use __node_2_[pdl|dle]() and rb_first_cached() consistentlyDietmar Eggemann1-12/+12
Deploy __node_2_pdl(node), __node_2_dle(node) and rb_first_cached() consistently throughout the sched class source file which makes the code at least easier to read. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20220302183433.333029-5-dietmar.eggemann@arm.com