summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2021-11-18PM: EM: Fix inefficient states detectionVincent Donnefort1-15/+8
[ Upstream commit aa1a43262ad5df010768f69530fa179ff81651d3 ] Currently, a debug message is printed if an inefficient state is detected in the Energy Model. Unfortunately, it won't detect if the first state is inefficient or if two successive states are. Fix this behavior. Fixes: 27871f7a8a34 (PM: Introduce an Energy Model management framework) Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Matthias Kaehlcke <mka@chromium.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18kprobes: Do not use local variable when creating debugfs filePunit Agrawal1-2/+1
[ Upstream commit 8f7262cd66699a4b02eb7549b35c81b2116aad95 ] debugfs_create_file() takes a pointer argument that can be used during file operation callbacks (accessible via i_private in the inode structure). An obvious requirement is for the pointer to refer to valid memory when used. When creating the debugfs file to dynamically enable / disable kprobes, a pointer to local variable is passed to debugfs_create_file(); which will go out of scope when the init function returns. The reason this hasn't triggered random memory corruption is because the pointer is not accessed during the debugfs file callbacks. Since the enabled state is managed by the kprobes_all_disabled global variable, the local variable is not needed. Fix the incorrect (and unnecessary) usage of local variable during debugfs_file_create() by passing NULL instead. Link: https://lkml.kernel.org/r/163163031686.489837.4476867635937014973.stgit@devnote2 Fixes: bf8f6e5b3e51 ("Kprobes: The ON/OFF knob thru debugfs") Signed-off-by: Punit Agrawal <punitagrawal@gmail.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18scs: Release kasan vmalloc poison in scs_free processYee Lee1-0/+1
[ Upstream commit 528a4ab45300fa6283556d9b48e26b45a8aa15c4 ] Since scs allocation is moved to vmalloc region, the shadow stack is protected by kasan_posion_vmalloc. However, the vfree_atomic operation needs to access its context for scs_free process and causes kasan error as the dump info below. This patch Adds kasan_unpoison_vmalloc() before vfree_atomic, which aligns to the prior flow as using kmem_cache. The vmalloc region will go back posioned in the following vumap() operations. ================================================================== BUG: KASAN: vmalloc-out-of-bounds in llist_add_batch+0x60/0xd4 Write of size 8 at addr ffff8000100b9000 by task kthreadd/2 CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.15.0-rc2-11681-g92477dd1faa6-dirty #1 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x43c show_stack+0x1c/0x2c dump_stack_lvl+0x68/0x84 print_address_description+0x80/0x394 kasan_report+0x180/0x1dc __asan_report_store8_noabort+0x48/0x58 llist_add_batch+0x60/0xd4 vfree_atomic+0x60/0xe0 scs_free+0x1dc/0x1fc scs_release+0xa4/0xd4 free_task+0x30/0xe4 __put_task_struct+0x1ec/0x2e0 delayed_put_task_struct+0x5c/0xa0 rcu_do_batch+0x62c/0x8a0 rcu_core+0x60c/0xc14 rcu_core_si+0x14/0x24 __do_softirq+0x19c/0x68c irq_exit+0x118/0x2dc handle_domain_irq+0xcc/0x134 gic_handle_irq+0x7c/0x1bc call_on_irq_stack+0x40/0x70 do_interrupt_handler+0x78/0x9c el1_interrupt+0x34/0x60 el1h_64_irq_handler+0x1c/0x2c el1h_64_irq+0x78/0x7c _raw_spin_unlock_irqrestore+0x40/0xcc sched_fork+0x4f0/0xb00 copy_process+0xacc/0x3648 kernel_clone+0x168/0x534 kernel_thread+0x13c/0x1b0 kthreadd+0x2bc/0x400 ret_from_fork+0x10/0x20 Memory state around the buggy address: ffff8000100b8f00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffff8000100b8f80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 >ffff8000100b9000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ^ ffff8000100b9080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffff8000100b9100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ================================================================== Suggested-by: Kuan-Ying Lee <kuan-ying.lee@mediatek.com> Acked-by: Will Deacon <will@kernel.org> Tested-by: Will Deacon <will@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Yee Lee <yee.lee@mediatek.com> Fixes: a2abe7cbd8fe ("scs: switch to vmapped shadow stacks") Link: https://lore.kernel.org/r/20210930081619.30091-1-yee.lee@mediatek.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18cgroup: Make rebind_subsystems() disable v2 controllers all at onceWaiman Long1-4/+27
[ Upstream commit 7ee285395b211cad474b2b989db52666e0430daf ] It was found that the following warning was displayed when remounting controllers from cgroup v2 to v1: [ 8042.997778] WARNING: CPU: 88 PID: 80682 at kernel/cgroup/cgroup.c:3130 cgroup_apply_control_disable+0x158/0x190 : [ 8043.091109] RIP: 0010:cgroup_apply_control_disable+0x158/0x190 [ 8043.096946] Code: ff f6 45 54 01 74 39 48 8d 7d 10 48 c7 c6 e0 46 5a a4 e8 7b 67 33 00 e9 41 ff ff ff 49 8b 84 24 e8 01 00 00 0f b7 40 08 eb 95 <0f> 0b e9 5f ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 [ 8043.115692] RSP: 0018:ffffba8a47c23d28 EFLAGS: 00010202 [ 8043.120916] RAX: 0000000000000036 RBX: ffffffffa624ce40 RCX: 000000000000181a [ 8043.128047] RDX: ffffffffa63c43e0 RSI: ffffffffa63c43e0 RDI: ffff9d7284ee1000 [ 8043.135180] RBP: ffff9d72874c5800 R08: ffffffffa624b090 R09: 0000000000000004 [ 8043.142314] R10: ffffffffa624b080 R11: 0000000000002000 R12: ffff9d7284ee1000 [ 8043.149447] R13: ffff9d7284ee1000 R14: ffffffffa624ce70 R15: ffffffffa6269e20 [ 8043.156576] FS: 00007f7747cff740(0000) GS:ffff9d7a5fc00000(0000) knlGS:0000000000000000 [ 8043.164663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8043.170409] CR2: 00007f7747e96680 CR3: 0000000887d60001 CR4: 00000000007706e0 [ 8043.177539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8043.184673] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8043.191804] PKRU: 55555554 [ 8043.194517] Call Trace: [ 8043.196970] rebind_subsystems+0x18c/0x470 [ 8043.201070] cgroup_setup_root+0x16c/0x2f0 [ 8043.205177] cgroup1_root_to_use+0x204/0x2a0 [ 8043.209456] cgroup1_get_tree+0x3e/0x120 [ 8043.213384] vfs_get_tree+0x22/0xb0 [ 8043.216883] do_new_mount+0x176/0x2d0 [ 8043.220550] __x64_sys_mount+0x103/0x140 [ 8043.224474] do_syscall_64+0x38/0x90 [ 8043.228063] entry_SYSCALL_64_after_hwframe+0x44/0xae It was caused by the fact that rebind_subsystem() disables controllers to be rebound one by one. If more than one disabled controllers are originally from the default hierarchy, it means that cgroup_apply_control_disable() will be called multiple times for the same default hierarchy. A controller may be killed by css_kill() in the first round. In the second round, the killed controller may not be completely dead yet leading to the warning. To avoid this problem, we collect all the ssid's of controllers that needed to be disabled from the default hierarchy and then disable them in one go instead of one by one. Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18lockdep: Let lock_is_held_type() detect recursive read as readSebastian Andrzej Siewior1-1/+1
[ Upstream commit 2507003a1d10917c9158077bf6030719d02c941e ] lock_is_held_type(, 1) detects acquired read locks. It only recognized locks acquired with lock_acquire_shared(). Read locks acquired with lock_acquire_shared_recursive() are not recognized because a `2' is stored as the read value. Rework the check to additionally recognise lock's read value one and two as a read held lock. Fixes: e918188611f07 ("locking: More accurate annotations for read_lock()") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Boqun Feng <boqun.feng@gmail.com> Acked-by: Waiman Long <longman@redhat.com> Link: https://lkml.kernel.org/r/20210903084001.lblecrvz4esl4mrr@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18rcu: Fix existing exp request check in sync_sched_exp_online_cleanup()Neeraj Upadhyay1-1/+1
[ Upstream commit f0b2b2df5423fb369ac762c77900bc7765496d58 ] The sync_sched_exp_online_cleanup() checks to see if RCU needs an expedited quiescent state from the incoming CPU, sending it an IPI if so. Before sending IPI, it checks whether expedited qs need has been already requested for the incoming CPU, by checking rcu_data.cpu_no_qs.b.exp for the current cpu, on which sync_sched_exp_online_cleanup() is running. This works for the case where incoming CPU is same as self. However, for the case where incoming CPU is different from self, expedited request won't get marked, which can potentially delay reporting of expedited quiescent state for the incoming CPU. Fixes: e015a3411220 ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18ftrace: do CPU checking after preemption disabled王贇1-3/+3
[ Upstream commit d33cc657372366a8959f099c619a208b4c5dc664 ] With CONFIG_DEBUG_PREEMPT we observed reports like: BUG: using smp_processor_id() in preemptible caller is perf_ftrace_function_call+0x6f/0x2e0 CPU: 1 PID: 680 Comm: a.out Not tainted Call Trace: <TASK> dump_stack_lvl+0x8d/0xcf check_preemption_disabled+0x104/0x110 ? optimize_nops.isra.7+0x230/0x230 ? text_poke_bp_batch+0x9f/0x310 perf_ftrace_function_call+0x6f/0x2e0 ... __text_poke+0x5/0x620 text_poke_bp_batch+0x9f/0x310 This telling us the CPU could be changed after task is preempted, and the checking on CPU before preemption will be invalid. Since now ftrace_test_recursion_trylock() will help to disable the preemption, this patch just do the checking after trylock() to address the issue. Link: https://lkml.kernel.org/r/54880691-5fe2-33e7-d12f-1fa6136f5183@linux.alibaba.com CC: Steven Rostedt <rostedt@goodmis.org> Cc: Guo Ren <guoren@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Jisheng Zhang <jszhang@kernel.org> Reported-by: Abaci <abaci@linux.alibaba.com> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18PM: hibernate: Get block device exclusively in swsusp_check()Ye Bin1-2/+3
[ Upstream commit 39fbef4b0f77f9c89c8f014749ca533643a37c9f ] The following kernel crash can be triggered: [ 89.266592] ------------[ cut here ]------------ [ 89.267427] kernel BUG at fs/buffer.c:3020! [ 89.268264] invalid opcode: 0000 [#1] SMP KASAN PTI [ 89.269116] CPU: 7 PID: 1750 Comm: kmmpd-loop0 Not tainted 5.10.0-862.14.0.6.x86_64-08610-gc932cda3cef4-dirty #20 [ 89.273169] RIP: 0010:submit_bh_wbc.isra.0+0x538/0x6d0 [ 89.277157] RSP: 0018:ffff888105ddfd08 EFLAGS: 00010246 [ 89.278093] RAX: 0000000000000005 RBX: ffff888124231498 RCX: ffffffffb2772612 [ 89.279332] RDX: 1ffff11024846293 RSI: 0000000000000008 RDI: ffff888124231498 [ 89.280591] RBP: ffff8881248cc000 R08: 0000000000000001 R09: ffffed1024846294 [ 89.281851] R10: ffff88812423149f R11: ffffed1024846293 R12: 0000000000003800 [ 89.283095] R13: 0000000000000001 R14: 0000000000000000 R15: ffff8881161f7000 [ 89.284342] FS: 0000000000000000(0000) GS:ffff88839b5c0000(0000) knlGS:0000000000000000 [ 89.285711] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 89.286701] CR2: 00007f166ebc01a0 CR3: 0000000435c0e000 CR4: 00000000000006e0 [ 89.287919] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 89.289138] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 89.290368] Call Trace: [ 89.290842] write_mmp_block+0x2ca/0x510 [ 89.292218] kmmpd+0x433/0x9a0 [ 89.294902] kthread+0x2dd/0x3e0 [ 89.296268] ret_from_fork+0x22/0x30 [ 89.296906] Modules linked in: by running the following commands: 1. mkfs.ext4 -O mmp /dev/sda -b 1024 2. mount /dev/sda /home/test 3. echo "/dev/sda" > /sys/power/resume That happens because swsusp_check() calls set_blocksize() on the target partition which confuses the file system: Thread1 Thread2 mount /dev/sda /home/test get s_mmp_bh --> has mapped flag start kmmpd thread echo "/dev/sda" > /sys/power/resume resume_store software_resume swsusp_check set_blocksize truncate_inode_pages_range truncate_cleanup_page block_invalidatepage discard_buffer --> clean mapped flag write_mmp_block submit_bh submit_bh_wbc BUG_ON(!buffer_mapped(bh)) To address this issue, modify swsusp_check() to open the target block device with exclusive access. Signed-off-by: Ye Bin <yebin10@huawei.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18tracing/cfi: Fix cmp_entries_* functions signature mismatchKalesh Singh1-17/+23
[ Upstream commit 7ce1bb83a14019f8c396d57ec704d19478747716 ] If CONFIG_CFI_CLANG=y, attempting to read an event histogram will cause the kernel to panic due to failed CFI check. 1. echo 'hist:keys=common_pid' >> events/sched/sched_switch/trigger 2. cat events/sched/sched_switch/hist 3. kernel panics on attempting to read hist This happens because the sort() function expects a generic int (*)(const void *, const void *) pointer for the compare function. To prevent this CFI failure, change tracing map cmp_entries_* function signatures to match this. Also, fix the build error reported by the kernel test robot [1]. [1] https://lore.kernel.org/r/202110141140.zzi4dRh4-lkp@intel.com/ Link: https://lkml.kernel.org/r/20211014045217.3265162-1-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18workqueue: make sysfs of unbound kworker cpumask more cleverMenglong Dong1-4/+11
[ Upstream commit d25302e46592c97d29f70ccb1be558df31a9a360 ] Some unfriendly component, such as dpdk, write the same mask to unbound kworker cpumask again and again. Every time it write to this interface some work is queue to cpu, even though the mask is same with the original mask. So, fix it by return success and do nothing if the cpumask is equal with the old one. Signed-off-by: Mengen Sun <mengensun@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18locking/rwsem: Disable preemption for spinning regionYanfei Xu1-23/+30
[ Upstream commit 7cdacc5f52d68a9370f182c844b5b3e6cc975cc1 ] The spinning region rwsem_spin_on_owner() should not be preempted, however the rwsem_down_write_slowpath() invokes it and don't disable preemption. Fix it by adding a pair of preempt_disable/enable(). Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> [peterz: Fix CONFIG_RWSEM_SPIN_ON_OWNER=n build] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20211013134154.1085649-3-yanfei.xu@windriver.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18tracing: Disable "other" permission bits in the tracefs filesSteven Rostedt (VMware)15-96/+103
[ Upstream commit 21ccc9cd72116289469e5519b6159c675a2fa58f ] When building the files in the tracefs file system, do not by default set any permissions for OTH (other). This will make it easier for admins who want to define a group for accessing tracefs and not having to first disable all the permission bits for "other" in the file system. As tracing can leak sensitive information, it should never by default allowing all users access. An admin can still set the permission bits for others to have access, which may be useful for creating a honeypot and seeing who takes advantage of it and roots the machine. Link: https://lkml.kernel.org/r/20210818153038.864149276@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loopPaul E. McKenney1-2/+1
[ Upstream commit 0db7c32ad3160ae06f497d48a74bd46a2a35e6bf ] Early in debugging, it made some sense to differentiate the first iteration from subsequent iterations, but now this just causes confusion. This commit therefore moves the "set_tasks_gp_state(rtp, RTGS_WAIT_CBS)" statement to the beginning of the "for" loop in rcu_tasks_kthread(). Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18locking/lockdep: Avoid RCU-induced noinstr failPeter Zijlstra1-1/+1
[ Upstream commit ce0b9c805dd66d5e49fd53ec5415ae398f4c56e6 ] vmlinux.o: warning: objtool: look_up_lock_class()+0xc7: call to rcu_read_lock_any_held() leaves .noinstr.text section Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210624095148.311980536@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18rcutorture: Avoid problematic critical section nesting on PREEMPT_RTScott Wood1-12/+36
[ Upstream commit 71921a9606ddbcc1d98c00eca7ae82c373d1fecd ] rcutorture is generating some nesting scenarios that are not compatible on PREEMPT_RT. For example: preempt_disable(); rcu_read_lock_bh(); preempt_enable(); rcu_read_unlock_bh(); The problem here is that on PREEMPT_RT the bottom halves have to be disabled and enabled in preemptible context. Reorder locking: start with BH locking and continue with then with disabling preemption or interrupts. In the unlocking do it reverse by first enabling interrupts and preemption and BH at the very end. Ensure that on PREEMPT_RT BH locking remains unchanged if in non-preemptible context. Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1 Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: Drop ATOM_BH, make it only about changing BH in atomic context. Allow enabling RCU in IRQ-off section. Reword commit message.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18ring-buffer: Protect ring_buffer_reset() from reentrancySteven Rostedt (VMware)1-0/+5
commit 51d157946666382e779f94c39891e8e9a020da78 upstream. The resetting of the entire ring buffer use to simply go through and reset each individual CPU buffer that had its own protection and synchronization. But this was very slow, due to performing a synchronization for each CPU. The code was reshuffled to do one disabling of all CPU buffers, followed by a single RCU synchronization, and then the resetting of each of the CPU buffers. But unfortunately, the mutex that prevented multiple occurrences of resetting the buffer was not moved to the upper function, and there is nothing to protect from it. Take the ring buffer mutex around the global reset. Cc: stable@vger.kernel.org Fixes: b23d7a5f4a07a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU") Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18signal: Add SA_IMMUTABLE to ensure forced siganls do not get changedEric W. Biederman1-1/+7
commit 00b06da29cf9dc633cdba87acd3f57f4df3fd5c7 upstream. As Andy pointed out that there are races between force_sig_info_to_task and sigaction[1] when force_sig_info_task. As Kees discovered[2] ptrace is also able to change these signals. In the case of seeccomp killing a process with a signal it is a security violation to allow the signal to be caught or manipulated. Solve this problem by introducing a new flag SA_IMMUTABLE that prevents sigaction and ptrace from modifying these forced signals. This flag is carefully made kernel internal so that no new ABI is introduced. Longer term I think this can be solved by guaranteeing short circuit delivery of signals in this case. Unfortunately reliable and guaranteed short circuit delivery of these signals is still a ways off from being implemented, tested, and merged. So I have implemented a much simpler alternative for now. [1] https://lkml.kernel.org/r/b5d52d25-7bde-4030-a7b1-7c6f8ab90660@www.fastmail.com [2] https://lkml.kernel.org/r/202110281136.5CE65399A7@keescook Cc: stable@vger.kernel.org Fixes: 307d522f5eb8 ("signal/seccomp: Refactor seccomp signal and coredump generation") Tested-by: Andrea Righi <andrea.righi@canonical.com> Tested-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18signal: Remove the bogus sigkill_pending in ptrace_stopEric W. Biederman1-14/+4
commit 7d613f9f72ec8f90ddefcae038fdae5adb8404b3 upstream. The existence of sigkill_pending is a little silly as it is functionally a duplicate of fatal_signal_pending that is used in exactly one place. Checking for pending fatal signals and returning early in ptrace_stop is actively harmful. It casues the ptrace_stop called by ptrace_signal to return early before setting current->exit_code. Later when ptrace_signal reads the signal number from current->exit_code is undefined, making it unpredictable what will happen. Instead rely on the fact that schedule will not sleep if there is a pending signal that can awaken a task. Removing the explict sigkill_pending test fixes fixes ptrace_signal when ptrace_stop does not stop because current->exit_code is always set to to signr. Cc: stable@vger.kernel.org Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path") Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop") Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-29Merge tag 'trace-v5.15-rc6-3' of ↵Linus Torvalds2-9/+11
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing comment fixes from Steven Rostedt: - Some bots have informed me that some of the ftrace functions kernel-doc has formatting issues. - Also, fix my snake instinct. * tag 'trace-v5.15-rc6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix misspelling of "missing" ftrace: Fix kernel-doc formatting issues
2021-10-29tracing: Fix misspelling of "missing"Steven Rostedt (VMware)1-1/+1
My snake instinct was on and I wrote "misssing" instead of "missing". Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-29ftrace: Fix kernel-doc formatting issuesSteven Rostedt (VMware)1-8/+10
Some functions had kernel-doc that used a comma instead of a hash to separate the function name from the one line description. Also, the "ftrace_is_dead()" had an incomplete description. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-28Merge tag 'net-5.15-rc8' of ↵Linus Torvalds4-13/+27
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from WiFi (mac80211), and BPF. Current release - regressions: - skb_expand_head: adjust skb->truesize to fix socket memory accounting - mptcp: fix corrupt receiver key in MPC + data + checksum Previous releases - regressions: - multicast: calculate csum of looped-back and forwarded packets - cgroup: fix memory leak caused by missing cgroup_bpf_offline - cfg80211: fix management registrations locking, prevent list corruption - cfg80211: correct false positive in bridge/4addr mode check - tcp_bpf: fix race in the tcp_bpf_send_verdict resulting in reusing previous verdict Previous releases - always broken: - sctp: enhancements for the verification tag, prevent attackers from killing SCTP sessions - tipc: fix size validations for the MSG_CRYPTO type - mac80211: mesh: fix HE operation element length check, prevent out of bound access - tls: fix sign of socket errors, prevent positive error codes being reported from read()/write() - cfg80211: scan: extend RCU protection in cfg80211_add_nontrans_list() - implement ->sock_is_readable() for UDP and AF_UNIX, fix poll() for sockets in a BPF sockmap - bpf: fix potential race in tail call compatibility check resulting in two operations which would make the map incompatible succeeding - bpf: prevent increasing bpf_jit_limit above max - bpf: fix error usage of map_fd and fdget() in generic batch update - phy: ethtool: lock the phy for consistency of results - prevent infinite while loop in skb_tx_hash() when Tx races with driver reconfiguring the queue <> traffic class mapping - usbnet: fixes for bad HW conjured by syzbot - xen: stop tx queues during live migration, prevent UAF - net-sysfs: initialize uid and gid before calling net_ns_get_ownership - mlxsw: prevent Rx stalls under memory pressure" * tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits) Revert "net: hns3: fix pause config problem after autoneg disabled" mptcp: fix corrupt receiver key in MPC + data + checksum riscv, bpf: Fix potential NULL dereference octeontx2-af: Fix possible null pointer dereference. octeontx2-af: Display all enabled PF VF rsrc_alloc entries. octeontx2-af: Check whether ipolicers exists net: ethernet: microchip: lan743x: Fix skb allocation failure net/tls: Fix flipped sign in async_wait.err assignment net/tls: Fix flipped sign in tls_err_abort() calls net/smc: Correct spelling mistake to TCPF_SYN_RECV net/smc: Fix smc_link->llc_testlink_time overflow nfp: bpf: relax prog rejection for mtu check through max_pkt_offset vmxnet3: do not stop tx queues after netif_device_detach() r8169: Add device 10ec:8162 to driver r8169 ptp: Document the PTP_CLK_MAGIC ioctl number usbnet: fix error return code in usbnet_probe() net: hns3: adjust string spaces of some parameters of tx bd info in debugfs net: hns3: expand buffer len for some debugfs command net: hns3: add more string spaces for dumping packets number of queue info in debugfs net: hns3: fix data endian problem of some functions of debugfs ...
2021-10-28Merge tag 'trace-v5.15-rc6-2' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Do not WARN when attaching event probe to non-existent event If the user tries to attach an event probe (eprobe) to an event that does not exist, it will trigger a warning. There's an error check that only expects memory issues otherwise it is considered a bug. But changes in the code to move around the locking made it that it can error out if the user attempts to attach to an event that does not exist, returning an -ENODEV. As this path can be caused by user space putting in a bad value, do not trigger a WARN" * tag 'trace-v5.15-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Do not warn when connecting eprobe to non existing event
2021-10-28tracing: Do not warn when connecting eprobe to non existing eventSteven Rostedt (VMware)1-2/+2
When the syscall trace points are not configured in, the kselftests for ftrace will try to attach an event probe (eprobe) to one of the system call trace points. This triggered a WARNING, because the failure only expects to see memory issues. But this is not the only failure. The user may attempt to attach to a non existent event, and the kernel must not warn about it. Link: https://lkml.kernel.org/r/20211027120854.0680aa0f@gandalf.local.home Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-27Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski4-13/+27
Daniel Borkmann says: ==================== pull-request: bpf 2021-10-26 We've added 12 non-merge commits during the last 7 day(s) which contain a total of 23 files changed, 118 insertions(+), 98 deletions(-). The main changes are: 1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen. 2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang. 3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai. 4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer. 5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun. 6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian. 7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix potential race in tail call compatibility check bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET selftests/bpf: Use recv_timeout() instead of retries net: Implement ->sock_is_readable() for UDP and AF_UNIX skmsg: Extract and reuse sk_msg_is_readable() net: Rename ->stream_memory_read to ->sock_is_readable tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function cgroup: Fix memory leak caused by missing cgroup_bpf_offline bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch() bpf: Prevent increasing bpf_jit_limit above max bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT bpf: Define bpf_jit_alloc_exec_limit for riscv JIT ==================== Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26bpf: Fix potential race in tail call compatibility checkToke Høiland-Jørgensen3-9/+18
Lorenzo noticed that the code testing for program type compatibility of tail call maps is potentially racy in that two threads could encounter a map with an unset type simultaneously and both return true even though they are inserting incompatible programs. The race window is quite small, but artificially enlarging it by adding a usleep_range() inside the check in bpf_prog_array_compatible() makes it trivial to trigger from userspace with a program that does, essentially: map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0); pid = fork(); if (pid) { key = 0; value = xdp_fd; } else { key = 1; value = tc_fd; } err = bpf_map_update_elem(map_fd, &key, &value, 0); While the race window is small, it has potentially serious ramifications in that triggering it would allow a BPF program to tail call to a program of a different type. So let's get rid of it by protecting the update with a spinlock. The commit in the Fixes tag is the last commit that touches the code in question. v2: - Use a spinlock instead of an atomic variable and cmpxchg() (Alexei) v3: - Put lock and the members it protects into an embedded 'owner' struct (Daniel) Fixes: 3324b584b6f6 ("ebpf: misc core cleanup") Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com
2021-10-24Merge tag 'sched_urgent_for_v5.15_rc7' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "Reset clang's Shadow Call Stack on hotplug to prevent it from overflowing" * tag 'sched_urgent_for_v5.15_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/scs: Reset the shadow stack when idle_task_exit
2021-10-23cgroup: Fix memory leak caused by missing cgroup_bpf_offlineQuanyang Wang1-1/+3
When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running the command as below: $mount -t cgroup -o none,name=foo cgroup cgroup/ $umount cgroup/ unreferenced object 0xc3585c40 (size 64): comm "mount", pid 425, jiffies 4294959825 (age 31.990s) hex dump (first 32 bytes): 01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00 ......(......... 00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00 ........lC...... backtrace: [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c [<1f03679c>] cgroup_setup_root+0x174/0x37c [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0 [<f85b12fd>] vfs_get_tree+0x24/0x108 [<f55aec5c>] path_mount+0x384/0x988 [<e2d5e9cd>] do_mount+0x64/0x9c [<208c9cfe>] sys_mount+0xfc/0x1f4 [<06dd06e0>] ret_fast_syscall+0x0/0x48 [<a8308cb3>] 0xbeb4daa8 This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data is allocated by the function percpu_ref_init in cgroup_bpf_inherit which is called by cgroup_setup_root when mounting, but not freed along with root_cgrp when umounting. Adding cgroup_bpf_offline which calls percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in umount path. This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a cleanup that frees the resources which are allocated by cgroup_bpf_inherit in cgroup_setup_root. And inside cgroup_bpf_offline, cgroup_get() is at the beginning and cgroup_put is at the end of cgroup_bpf_release which is called by cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of cgroup's refcount. Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211018075623.26884-1-quanyang.wang@windriver.com
2021-10-23bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()Xu Kuohai1-2/+3
1. The ufd in generic_map_update_batch() should be read from batch.map_fd; 2. A call to fdget() should be followed by a symmetric call to fdput(). Fixes: aa2e93b8e58e ("bpf: Add generic support for update and delete batch ops") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211019032934.1210517-1-xukuohai@huawei.com
2021-10-23bpf: Prevent increasing bpf_jit_limit above maxLorenz Bauer1-1/+3
Restrict bpf_jit_limit to the maximum supported by the arch's JIT. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211014142554.53120-4-lmb@cloudflare.com
2021-10-22Merge branch 'ucount-fixes-for-v5.15' of ↵Linus Torvalds3-24/+59
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucounts fixes from Eric Biederman: "There has been one very hard to track down bug in the ucount code that we have been tracking since roughly v5.14 was released. Alex managed to find a reliable reproducer a few days ago and then I was able to instrument the code and figure out what the issue was. It turns out the sigqueue_alloc single atomic operation optimization did not play nicely with ucounts multiple level rlimits. It turned out that either sigqueue_alloc or sigqueue_free could be operating on multiple levels and trigger the conditions for the optimization on more than one level at the same time. To deal with that situation I have introduced inc_rlimit_get_ucounts and dec_rlimit_put_ucounts that just focuses on the optimization and the rlimit and ucount changes. While looking into the big bug I found I couple of other little issues so I am including those fixes here as well. When I have time I would very much like to dig into process ownership of the shared signal queue and see if we could pick a single owner for the entire queue so that all of the rlimits can count to that owner. That should entirely remove the need to call get_ucounts and put_ucounts in sigqueue_alloc and sigqueue_free. It is difficult because Linux unlike POSIX supports setuid that works on a single thread" * 'ucount-fixes-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring ucounts: Proper error handling in set_cred_ucounts ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_creds ucounts: Fix signal ucount refcounting
2021-10-20Merge tag 'dma-mapping-5.15-2' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds3-36/+48
Pull dma-mapping fixes from Christoph Hellwig: - fix more dma-debug fallout (Gerald Schaefer, Hamza Mahfooz) - fix a kerneldoc warning (Logan Gunthorpe) * tag 'dma-mapping-5.15-2' of git://git.infradead.org/users/hch/dma-mapping: dma-debug: teach add_dma_entry() about DMA_ATTR_SKIP_CPU_SYNC dma-debug: fix sg checks in debug_dma_map_sg() dma-mapping: fix the kerneldoc for dma_map_sgtable()
2021-10-20Merge tag 'audit-pr-20211019' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "One small audit patch to add a pointer NULL check" * tag 'audit-pr-20211019' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: fix possible null-pointer dereference in audit_filter_rules
2021-10-20Merge tag 'trace-v5.15-rc5' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Recursion fix for tracing. While cleaning up some of the tracing recursion protection logic, I discovered a scenario that the current design would miss, and would allow an infinite recursion. Removing an optimization trick that opened the hole fixes the issue and cleans up the code as well" * tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Have all levels of checks prevent recursion
2021-10-20ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyringEric W. Biederman1-2/+0
Setting cred->ucounts in cred_alloc_blank does not make sense. The uid and user_ns are deliberately not set in cred_alloc_blank but instead the setting is delayed until key_change_session_keyring. So move dealing with ucounts into key_change_session_keyring as well. Unfortunately that movement of get_ucounts adds a new failure mode to key_change_session_keyring. I do not see anything stopping the parent process from calling setuid and changing the relevant part of it's cred while keyctl_session_to_parent is running making it fundamentally necessary to call get_ucounts in key_change_session_keyring. Which means that the new failure mode cannot be avoided. A failure of key_change_session_keyring results in a single threaded parent keeping it's existing credentials. Which results in the parent process not being able to access the session keyring and whichever keys are in the new keyring. Further get_ucounts is only expected to fail if the number of bits in the refernece count for the structure is too few. Since the code has no other way to report the failure of get_ucounts and because such failures are not expected to be common add a WARN_ONCE to report this problem to userspace. Between the WARN_ONCE and the parent process not having access to the keys in the new session keyring I expect any failure of get_ucounts will be noticed and reported and we can find another way to handle this condition. (Possibly by just making ucounts->count an atomic_long_t). Cc: stable@vger.kernel.org Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Link: https://lkml.kernel.org/r/7k0ias0uf.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19ucounts: Proper error handling in set_cred_ucountsEric W. Biederman1-2/+3
Instead of leaking the ucounts in new if alloc_ucounts fails, store the result of alloc_ucounts into a temporary variable, which is later assigned to new->ucounts. Cc: stable@vger.kernel.org Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Link: https://lkml.kernel.org/r/87pms2s0v8.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_credsEric W. Biederman1-1/+1
The purpose of inc_rlimit_ucounts and dec_rlimit_ucounts in commit_creds is to change which rlimit counter is used to track a process when the credentials changes. Use the same test for both to guarantee the tracking is correct. Cc: stable@vger.kernel.org Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Link: https://lkml.kernel.org/r/87v91us0w4.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19sched/scs: Reset the shadow stack when idle_task_exitWoody Lin1-0/+1
Commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") removed the init_idle() call from idle_thread_get(). This was the sole call-path on hotplug that resets the Shadow Call Stack (scs) Stack Pointer (sp). Not resetting the scs-sp leads to scs overflow after enough hotplug cycles. Therefore add an explicit scs_task_reset() to the hotplug code to make sure the scs-sp does get reset on hotplug. Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") Signed-off-by: Woody Lin <woodylin@google.com> [peterz: Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com
2021-10-19audit: fix possible null-pointer dereference in audit_filter_rulesGaosheng Cui1-1/+1
Fix possible null-pointer dereference in audit_filter_rules. audit_filter_rules() error: we previously assumed 'ctx' could be null Cc: stable@vger.kernel.org Fixes: bf361231c295 ("audit: add saddr_fam filter field") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-19tracing: Have all levels of checks prevent recursionSteven Rostedt (VMware)1-2/+2
While writing an email explaining the "bit = 0" logic for a discussion on making ftrace_test_recursion_trylock() disable preemption, I discovered a path that makes the "not do the logic if bit is zero" unsafe. The recursion logic is done in hot paths like the function tracer. Thus, any code executed causes noticeable overhead. Thus, tricks are done to try to limit the amount of code executed. This included the recursion testing logic. Having recursion testing is important, as there are many paths that can end up in an infinite recursion cycle when tracing every function in the kernel. Thus protection is needed to prevent that from happening. Because it is OK to recurse due to different running context levels (e.g. an interrupt preempts a trace, and then a trace occurs in the interrupt handler), a set of bits are used to know which context one is in (normal, softirq, irq and NMI). If a recursion occurs in the same level, it is prevented*. Then there are infrastructure levels of recursion as well. When more than one callback is attached to the same function to trace, it calls a loop function to iterate over all the callbacks. Both the callbacks and the loop function have recursion protection. The callbacks use the "ftrace_test_recursion_trylock()" which has a "function" set of context bits to test, and the loop function calls the internal trace_test_and_set_recursion() directly, with an "internal" set of bits. If an architecture does not implement all the features supported by ftrace then the callbacks are never called directly, and the loop function is called instead, which will implement the features of ftrace. Since both the loop function and the callbacks do recursion protection, it was seemed unnecessary to do it in both locations. Thus, a trick was made to have the internal set of recursion bits at a more significant bit location than the function bits. Then, if any of the higher bits were set, the logic of the function bits could be skipped, as any new recursion would first have to go through the loop function. This is true for architectures that do not support all the ftrace features, because all functions being traced must first go through the loop function before going to the callbacks. But this is not true for architectures that support all the ftrace features. That's because the loop function could be called due to two callbacks attached to the same function, but then a recursion function inside the callback could be called that does not share any other callback, and it will be called directly. i.e. traced_function_1: [ more than one callback tracing it ] call loop_func loop_func: trace_recursion set internal bit call callback callback: trace_recursion [ skipped because internal bit is set, return 0 ] call traced_function_2 traced_function_2: [ only traced by above callback ] call callback callback: trace_recursion [ skipped because internal bit is set, return 0 ] call traced_function_2 [ wash, rinse, repeat, BOOM! out of shampoo! ] Thus, the "bit == 0 skip" trick is not safe, unless the loop function is call for all functions. Since we want to encourage architectures to implement all ftrace features, having them slow down due to this extra logic may encourage the maintainers to update to the latest ftrace features. And because this logic is only safe for them, remove it completely. [*] There is on layer of recursion that is allowed, and that is to allow for the transition between interrupt context (normal -> softirq -> irq -> NMI), because a trace may occur before the context update is visible to the trace recursion logic. Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/ Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Jisheng Zhang <jszhang@kernel.org> Cc: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com> Cc: Guo Ren <guoren@kernel.org> Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-19ucounts: Fix signal ucount refcountingEric W. Biederman2-19/+55
In commit fda31c50292a ("signal: avoid double atomic counter increments for user accounting") Linus made a clever optimization to how rlimits and the struct user_struct. Unfortunately that optimization does not work in the obvious way when moved to nested rlimits. The problem is that the last decrement of the per user namespace per user sigpending counter might also be the last decrement of the sigpending counter in the parent user namespace as well. Which means that simply freeing the leaf ucount in __free_sigqueue is not enough. Maintain the optimization and handle the tricky cases by introducing inc_rlimit_get_ucounts and dec_rlimit_put_ucounts. By moving the entire optimization into functions that perform all of the work it becomes possible to ensure that every level is handled properly. The new function inc_rlimit_get_ucounts returns 0 on failure to increment the ucount. This is different than inc_rlimit_ucounts which increments the ucounts and returns LONG_MAX if the ucount counter has exceeded it's maximum or it wrapped (to indicate the counter needs to decremented). I wish we had a single user to account all pending signals to across all of the threads of a process so this complexity was not necessary Cc: stable@vger.kernel.org Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133 Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133 Reviewed-by: Alexey Gladkov <legion@kernel.org> Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk> Tested-by: Yu Zhao <yuzhao@google.com> Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-18dma-debug: teach add_dma_entry() about DMA_ATTR_SKIP_CPU_SYNCHamza Mahfooz3-24/+36
Mapping something twice should be possible as long as, DMA_ATTR_SKIP_CPU_SYNC is passed to the strictly speaking second relevant mapping operation (that attempts to map the same thing). So, don't issue a warning if the specified condition is met in add_dma_entry(). Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-16Merge tag 'trace-v5.15-rc3' of ↵Linus Torvalds3-11/+63
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Tracing fixes for 5.15: - Fix defined but not use warning/error for osnoise function - Fix memory leak in event probe - Fix memblock leak in bootconfig - Fix the API of event probes to be like kprobes - Added test to check removal of event probe API - Fix recordmcount.pl for nds32 failed build * tag 'trace-v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: nds32/ftrace: Fix Error: invalid operands (*UND* and *UND* sections) for `^' selftests/ftrace: Update test for more eprobe removal process tracing: Fix event probe removal from dynamic events tracing: Fix missing * in comment block bootconfig: init: Fix memblock leak in xbc_make_cmdline() tracing: Fix memory leak in eprobe_register() tracing: Fix missing osnoise tracer on max_latency
2021-10-14tracing: Fix event probe removal from dynamic eventsSteven Rostedt (VMware)1-3/+51
When an event probe is to be removed via the API that created it via the dynamic events, an -ENOENT error is returned. This is because the removal of the event probe does not expect to see the event system and name that the event probe is attached to, even though that's part of the API to create it. As the removal of probes is to use the same API as they are created. In fact, the removal is not consistent with the kprobes and uprobes removal. Fix that by allowing various ways to remove the eprobe. The eprobe is created with: e:[GROUP/]NAME SYSTEM/EVENT [OPTIONS] Have it get removed by echoing in the following into dynamic_events: # Remove all eprobes with NAME echo '-:NAME' >> dynamic_events # Remove a specific eprobe echo '-:GROUP/NAME' >> dynamic_events echo '-:GROUP/NAME SYSTEM/EVENT' >> dynamic_events echo '-:NAME SYSTEM/EVENT' >> dynamic_events echo '-:GROUP/NAME SYSTEM/EVENT OPTIONS' >> dynamic_events echo '-:NAME SYSTEM/EVENT OPTIONS' >> dynamic_events Link: https://lkml.kernel.org/r/20211012081925.0e19cc4f@gandalf.local.home Link: https://lkml.kernel.org/r/20211013205533.630722129@goodmis.org Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-13Merge tag 'modules-for-v5.15-rc6' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules fix from Jessica Yu: - Build fix for cfi_init() when CONFIG_MODULE_UNLOAD=n * tag 'modules-for-v5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: fix clang CFI with MODULE_UNLOAD=n
2021-10-12Merge branch 'for-5.15-fixes' of ↵Linus Torvalds1-27/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "All documentation / comment updates" * 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroupv2, docs: fix misinformation in "device controller" section cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem docs/cgroup: remove some duplicate words
2021-10-12Merge branch 'for-5.15-fixes' of ↵Linus Torvalds1-2/+16
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "One patch to add a missing __printf annotation and the other to enable deferred printing for debug dumps to avoid deadlocks when triggered from some contexts (e.g. console drivers)" * 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix state-dump console deadlock workqueue: annotate alloc_workqueue() as printf
2021-10-11workqueue: fix state-dump console deadlockJohan Hovold1-2/+16
Console drivers often queue work while holding locks also taken in their console write paths, something which can lead to deadlocks on SMP when dumping workqueue state (e.g. sysrq-t or on suspend failures). For serial console drivers this could look like: CPU0 CPU1 ---- ---- show_workqueue_state(); lock(&pool->lock); <IRQ> lock(&port->lock); schedule_work(); lock(&pool->lock); printk(); lock(console_owner); lock(&port->lock); where workqueues are, for example, used to push data to the line discipline, process break signals and handle modem-status changes. Line disciplines and serdev drivers can also queue work on write-wakeup notifications, etc. Reworking every console driver to avoid queuing work while holding locks also taken in their write paths would complicate drivers and is neither desirable or feasible. Instead use the deferred-printk mechanism to avoid printing while holding pool locks when dumping workqueue state. Note that there are a few WARN_ON() assertions in the workqueue code which could potentially also trigger a deadlock. Hopefully the ongoing printk rework will provide a general solution for this eventually. This was originally reported after a lockdep splat when executing sysrq-t with the imx serial driver. Fixes: 3494fc30846d ("workqueue: dump workqueues on sysrq-t") Cc: stable@vger.kernel.org # 4.0 Reported-by: Fabio Estevam <festevam@denx.de> Tested-by: Fabio Estevam <festevam@denx.de> Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-11dma-debug: fix sg checks in debug_dma_map_sg()Gerald Schaefer1-6/+6
The following warning occurred sporadically on s390: DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=0000000048cc5e2f] [len=131072] WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138 It is a false-positive warning, due to broken logic in debug_dma_map_sg(). check_for_illegal_area() checks for overlay of sg elements with kernel text or rodata. It is called with sg_dma_len(s) instead of s->length as parameter. After the call to ->map_sg(), sg_dma_len() will contain the length of possibly combined sg elements in the DMA address space, and not the individual sg element length, which would be s->length. The check will then use the physical start address of an sg element, and add the DMA length for the overlap check, which could result in the false warning, because the DMA length can be larger than the actual single sg element length. In addition, the call to check_for_illegal_area() happens in the iteration over mapped_ents, which will not include all individual sg elements if any of them were combined in ->map_sg(). Fix this by using s->length instead of sg_dma_len(s). Also put the call to check_for_illegal_area() in a separate loop, iterating over all the individual sg elements ("nents" instead of "mapped_ents"). While at it, as suggested by Robin Murphy, also move check_for_stack() inside the new loop, as it is similarly concerned with validating the individual sg elements. Link: https://lore.kernel.org/lkml/20210705185252.4074653-1-gerald.schaefer@linux.ibm.com Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor") Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-11dma-mapping: fix the kerneldoc for dma_map_sgtable()Logan Gunthorpe1-6/+6
htmldocs began producing the following warnings: kernel/dma/mapping.c:256: WARNING: Definition list ends without a blank line; unexpected unindent. kernel/dma/mapping.c:257: WARNING: Bullet list ends without a blank line; unexpected unindent. Reformatting the list without hyphens fixes the warnings and produces both a readable text and HTML output. Fixes: fffe3cc8c219 ("dma-mapping: allow map_sg() ops to return negative error code") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de>