summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2024-05-02serial: core: fix kernel-doc for uart_port_unlock_irqrestore()Randy Dunlap1-1/+1
commit 29bff582b74ed0bdb7e6986482ad9e6799ea4d2f upstream. Fix the function name to avoid a kernel-doc warning: include/linux/serial_core.h:666: warning: expecting prototype for uart_port_lock_irqrestore(). Prototype was for uart_port_unlock_irqrestore() instead Fixes: b0af4bcb4946 ("serial: core: Provide port lock wrappers") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: linux-serial@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/20230927044128.4748-1-rdunlap@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-05-02ethernet: Add helper for assigning packet type when dest address does not ↵Rahul Rameshbabu1-0/+25
match device address commit 6e159fd653d7ebf6290358e0330a0cb8a75cf73b upstream. Enable reuse of logic in eth_type_trans for determining packet type. Suggested-by: Sabrina Dubroca <sd@queasysnail.net> Cc: stable@vger.kernel.org Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240423181319.115860-3-rrameshbabu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-05-02tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker togetherRobin H. Johnson1-1/+1
commit e531e90b5ab0f7ce5ff298e165214c1aec6ed187 upstream. Running endpoint security solutions like Sentinel1 that use perf-based tracing heavily lead to this repeated dump complaining about dockerd. The default value of 2048 is nowhere near not large enough. Using the prior patch "tracing: show size of requested buffer", we get "perf buffer not large enough, wanted 6644, have 6144", after repeated up-sizing (I did 2/4/6/8K). With 8K, the problem doesn't occur at all, so below is the trace for 6K. I'm wondering if this value should be selectable at boot time, but this is a good starting point. ``` ------------[ cut here ]------------ perf buffer not large enough, wanted 6644, have 6144 WARNING: CPU: 1 PID: 4997 at kernel/trace/trace_event_perf.c:402 perf_trace_buf_alloc+0x8c/0xa0 Modules linked in: [..] CPU: 1 PID: 4997 Comm: sh Tainted: G T 5.13.13-x86_64-00039-gb3959163488e #63 Hardware name: LENOVO 20KH002JUS/20KH002JUS, BIOS N23ET66W (1.41 ) 09/02/2019 RIP: 0010:perf_trace_buf_alloc+0x8c/0xa0 Code: 80 3d 43 97 d0 01 00 74 07 31 c0 5b 5d 41 5c c3 ba 00 18 00 00 89 ee 48 c7 c7 00 82 7d 91 c6 05 25 97 d0 01 01 e8 22 ee bc 00 <0f> 0b 31 c0 eb db 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 89 RSP: 0018:ffffb922026b7d58 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff9da5ee012000 RCX: 0000000000000027 RDX: ffff9da881657828 RSI: 0000000000000001 RDI: ffff9da881657820 RBP: 00000000000019f4 R08: 0000000000000000 R09: ffffb922026b7b80 R10: ffffb922026b7b78 R11: ffffffff91dda688 R12: 000000000000000f R13: ffff9da5ee012108 R14: ffff9da8816570a0 R15: ffffb922026b7e30 FS: 00007f420db1a080(0000) GS:ffff9da881640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000060 CR3: 00000002504a8006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: kprobe_perf_func+0x11e/0x270 ? do_execveat_common.isra.0+0x1/0x1c0 ? do_execveat_common.isra.0+0x5/0x1c0 kprobe_ftrace_handler+0x10e/0x1d0 0xffffffffc03aa0c8 ? do_execveat_common.isra.0+0x1/0x1c0 do_execveat_common.isra.0+0x5/0x1c0 __x64_sys_execve+0x33/0x40 do_syscall_64+0x6b/0xc0 ? do_syscall_64+0x11/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f420dc1db37 Code: ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc 0f 1f 84 00 00 00 00 00 b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 01 43 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd4e8b4e38 EFLAGS: 00000246 ORIG_RAX: 000000000000003b RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f420dc1db37 RDX: 0000564338d1e740 RSI: 0000564338d32d50 RDI: 0000564338d28f00 RBP: 0000564338d28f00 R08: 0000564338d32d50 R09: 0000000000000020 R10: 00000000000001b6 R11: 0000000000000246 R12: 0000564338d28f00 R13: 0000564338d32d50 R14: 0000564338d1e740 R15: 0000564338d28c60 ---[ end trace 83ab3e8e16275e49 ]--- ``` Link: https://lkml.kernel.org/r/20210831043723.13481-2-robbat2@gentoo.org Signed-off-by: Robin H. Johnson <robbat2@gentoo.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-05-02serial: core: Provide port lock wrappersThomas Gleixner1-0/+79
[ Upstream commit b0af4bcb49464c221ad5f95d40f2b1b252ceedcc ] When a serial port is used for kernel console output, then all modifications to the UART registers which are done from other contexts, e.g. getty, termios, are interference points for the kernel console. So far this has been ignored and the printk output is based on the principle of hope. The rework of the console infrastructure which aims to support threaded and atomic consoles, requires to mark sections which modify the UART registers as unsafe. This allows the atomic write function to make informed decisions and eventually to restore operational state. It also allows to prevent the regular UART code from modifying UART registers while printk output is in progress. All modifications of UART registers are guarded by the UART port lock, which provides an obvious synchronization point with the console infrastructure. Provide wrapper functions for spin_[un]lock*(port->lock) invocations so that the console mechanics can be applied later on at a single place and does not require to copy the same logic all over the drivers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/20230914183831.587273-2-john.ogness@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 54c4ec5f8c47 ("serial: mxs-auart: add spinlock around changing cts state") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-02irqflags: Explicitly ignore lockdep_hrtimer_exit() argumentArnd Bergmann1-1/+1
commit c1d11fc2c8320871b40730991071dd0a0b405bc8 upstream. When building with 'make W=1' but CONFIG_TRACE_IRQFLAGS=n, the unused argument to lockdep_hrtimer_exit() causes a warning: kernel/time/hrtimer.c:1655:14: error: variable 'expires_in_hardirq' set but not used [-Werror=unused-but-set-variable] This is intentional behavior, so add a cast to void to shut up the warning. Fixes: 73d20564e0dc ("hrtimer: Don't dereference the hrtimer pointer after the callback") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240408074609.3170807-1-arnd@kernel.org Closes: https://lore.kernel.org/oe-kbuild-all/202311191229.55QXHVc6-lkp@intel.com/ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-05-02u64_stats: fix u64_stats_init() for lockdep when used repeatedly in one filePetr Tesarik1-1/+5
[ Upstream commit 38a15d0a50e0a43778561a5861403851f0b0194c ] Fix bogus lockdep warnings if multiple u64_stats_sync variables are initialized in the same file. With CONFIG_LOCKDEP, seqcount_init() is a macro which declares: static struct lock_class_key __key; Since u64_stats_init() is a function (albeit an inline one), all calls within the same file end up using the same instance, effectively treating them all as a single lock-class. Fixes: 9464ca650008 ("net: make u64_stats_init() a function") Closes: https://lore.kernel.org/netdev/ea1567d9-ce66-45e6-8168-ac40a47d1821@roeck-us.net/ Signed-off-by: Petr Tesarik <petr@tesarici.cz> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240404075740.30682-1-petr@tesarici.cz Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13SUNRPC: increase size of rpc_wait_queue.qlen from unsigned short to unsigned intDai Ngo1-1/+1
[ Upstream commit 2c35f43b5a4b9cdfaa6fdd946f5a212615dac8eb ] When the NFS client is under extreme load the rpc_wait_queue.qlen counter can be overflowed. Here is an instant of the backlog queue overflow in a real world environment shown by drgn helper: rpc_task_stats(rpc_clnt): ------------------------- rpc_clnt: 0xffff92b65d2bae00 rpc_xprt: 0xffff9275db64f000 Queue: sending[64887] pending[524] backlog[30441] binding[0] XMIT task: 0xffff925c6b1d8e98 WRITE: 750654 __dta_call_status_580: 65463 __dta_call_transmit_status_579: 1 call_reserveresult: 685189 nfs_client_init_is_complete: 1 COMMIT: 584 call_reserveresult: 573 __dta_call_status_580: 11 ACCESS: 1 __dta_call_status_580: 1 GETATTR: 10 __dta_call_status_580: 4 call_reserveresult: 6 751249 tasks for server 111.222.333.444 Total tasks: 751249 count_rpc_wait_queues(xprt): ---------------------------- **** rpc_xprt: 0xffff9275db64f000 num_reqs: 65511 wait_queue: xprt_binding[0] cnt: 0 wait_queue: xprt_binding[1] cnt: 0 wait_queue: xprt_binding[2] cnt: 0 wait_queue: xprt_binding[3] cnt: 0 rpc_wait_queue[xprt_binding].qlen: 0 maxpriority: 0 wait_queue: xprt_sending[0] cnt: 0 wait_queue: xprt_sending[1] cnt: 64887 wait_queue: xprt_sending[2] cnt: 0 wait_queue: xprt_sending[3] cnt: 0 rpc_wait_queue[xprt_sending].qlen: 64887 maxpriority: 3 wait_queue: xprt_pending[0] cnt: 524 wait_queue: xprt_pending[1] cnt: 0 wait_queue: xprt_pending[2] cnt: 0 wait_queue: xprt_pending[3] cnt: 0 rpc_wait_queue[xprt_pending].qlen: 524 maxpriority: 0 wait_queue: xprt_backlog[0] cnt: 0 wait_queue: xprt_backlog[1] cnt: 685801 wait_queue: xprt_backlog[2] cnt: 0 wait_queue: xprt_backlog[3] cnt: 0 rpc_wait_queue[xprt_backlog].qlen: 30441 maxpriority: 3 [task cnt mismatch] There is no effect on operations when this overflow occurs. However it causes confusion when trying to diagnose the performance problem. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13objtool: Add asm version of STACK_FRAME_NON_STANDARDJosh Poimboeuf1-0/+8
commit 081df94301e317e84c3413686043987da2c3e39d upstream. To be used for adding asm functions to the ignore list. The "aw" is needed to help the ELF section metadata match GCC-created sections. Otherwise the linker creates duplicate sections instead of combining them. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/8faa476f9a5ac89af27944ec184c89f95f3c6c49.1611263462.git.jpoimboe@redhat.com Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13driver core: Introduce device_link_wait_removal()Herve Codina1-0/+1
commit 0462c56c290a99a7f03e817ae5b843116dfb575c upstream. The commit 80dd33cf72d1 ("drivers: base: Fix device link removal") introduces a workqueue to release the consumer and supplier devices used in the devlink. In the job queued, devices are release and in turn, when all the references to these devices are dropped, the release function of the device itself is called. Nothing is present to provide some synchronisation with this workqueue in order to ensure that all ongoing releasing operations are done and so, some other operations can be started safely. For instance, in the following sequence: 1) of_platform_depopulate() 2) of_overlay_remove() During the step 1, devices are released and related devlinks are removed (jobs pushed in the workqueue). During the step 2, OF nodes are destroyed but, without any synchronisation with devlink removal jobs, of_overlay_remove() can raise warnings related to missing of_node_put(): ERROR: memory leak, expected refcount 1 instead of 2 Indeed, the missing of_node_put() call is going to be done, too late, from the workqueue job execution. Introduce device_link_wait_removal() to offer a way to synchronize operations waiting for the end of devlink removals (i.e. end of workqueue jobs). Also, as a flushing operation is done on the workqueue, the workqueue used is moved from a system-wide workqueue to a local one. Cc: stable@vger.kernel.org Signed-off-by: Herve Codina <herve.codina@bootlin.com> Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Reviewed-by: Nuno Sa <nuno.sa@analog.com> Reviewed-by: Saravana Kannan <saravanak@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20240325152140.198219-2-herve.codina@bootlin.com Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13udp: do not accept non-tunnel GSO skbs landing in a tunnelAntoine Tenart1-0/+28
[ Upstream commit 3d010c8031e39f5fa1e8b13ada77e0321091011f ] When rx-udp-gro-forwarding is enabled UDP packets might be GROed when being forwarded. If such packets might land in a tunnel this can cause various issues and udp_gro_receive makes sure this isn't the case by looking for a matching socket. This is performed in udp4/6_gro_lookup_skb but only in the current netns. This is an issue with tunneled packets when the endpoint is in another netns. In such cases the packets will be GROed at the UDP level, which leads to various issues later on. The same thing can happen with rx-gro-list. We saw this with geneve packets being GROed at the UDP level. In such case gso_size is set; later the packet goes through the geneve rx path, the geneve header is pulled, the offset are adjusted and frag_list skbs are not adjusted with regard to geneve. When those skbs hit skb_fragment, it will misbehave. Different outcomes are possible depending on what the GROed skbs look like; from corrupted packets to kernel crashes. One example is a BUG_ON[1] triggered in skb_segment while processing the frag_list. Because gso_size is wrong (geneve header was pulled) skb_segment thinks there is "geneve header size" of data in frag_list, although it's in fact the next packet. The BUG_ON itself has nothing to do with the issue. This is only one of the potential issues. Looking up for a matching socket in udp_gro_receive is fragile: the lookup could be extended to all netns (not speaking about performances) but nothing prevents those packets from being modified in between and we could still not find a matching socket. It's OK to keep the current logic there as it should cover most cases but we also need to make sure we handle tunnel packets being GROed too early. This is done by extending the checks in udp_unexpected_gso: GSO packets lacking the SKB_GSO_UDP_TUNNEL/_CSUM bits and landing in a tunnel must be segmented. [1] kernel BUG at net/core/skbuff.c:4408! RIP: 0010:skb_segment+0xd2a/0xf70 __udp_gso_segment+0xaa/0x560 Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") Fixes: 36707061d6ba ("udp: allow forwarding of plain (non-fraglisted) UDP GRO packets") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL ↵Vlastimil Babka1-0/+9
allocations commit 803de9000f334b771afacb6ff3e78622916668b0 upstream. Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination can happen in a suspend/resume context where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask. Quoting Sven: 1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set. 2. page alloc's __alloc_pages_slowpath tries to get a page from the freelist. This fails because there is nothing free of that costly order. 3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out because a zone is ready to be compacted; it pretends to have made a single page of progress. 4. page alloc tries to compact, but this always bails out early because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway). 5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic thinks it should try again, because: a) reclaim is needed because of the early bail-out in item 4 b) a zonelist is suitable for compaction 6. goto 2. indefinite stall. (end quote) The immediate root cause is confusing the COMPACT_SKIPPED returned from __alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be indicating a lack of order-0 pages, and in step 5 evaluating that in should_compact_retry() as a reason to retry, before incrementing and limiting the number of retries. There are however other places that wrongly assume that compaction can happen while we lack __GFP_IO. To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO evaluation and switch the open-coded test in try_to_compact_pages() to use it. Also use the new helper in: - compaction_ready(), which will make reclaim not bail out in step 3, so there's at least one attempt to actually reclaim, even if chances are small for a costly order - in_reclaim_compaction() which will make should_continue_reclaim() return false and we don't over-reclaim unnecessarily - in __alloc_pages_slowpath() to set a local variable can_compact, which is then used to avoid retrying reclaim/compaction for costly allocations (step 5) if we can't compact and also to skip the early compaction attempt that we do in some cases Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sven van Ashbrook <svenva@chromium.org> Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/ Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Curtis Malainey <cujomalainey@chromium.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Takashi Iwai <tiwai@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13vfio: Introduce interface to flush virqfd inject workqueueAlex Williamson1-0/+2
[ Upstream commit b620ecbd17a03cacd06f014a5d3f3a11285ce053 ] In order to synchronize changes that can affect the thread callback, introduce an interface to force a flush of the inject workqueue. The irqfd pointer is only valid under spinlock, but the workqueue cannot be flushed under spinlock. Therefore the flush work for the irqfd is queued under spinlock. The vfio_irqfd_cleanup_wq workqueue is re-used for queuing this work such that flushing the workqueue is also ordered relative to shutdown. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-4-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13x86/rfds: Mitigate Register File Data Sampling (RFDS)Pawan Gupta1-0/+2
commit 8076fcde016c9c0e0660543e67bff86cb48a7c9c upstream. RFDS is a CPU vulnerability that may allow userspace to infer kernel stale data previously used in floating point registers, vector registers and integer registers. RFDS only affects certain Intel Atom processors. Intel released a microcode update that uses VERW instruction to clear the affected CPU buffers. Unlike MDS, none of the affected cores support SMT. Add RFDS bug infrastructure and enable the VERW based mitigation by default, that clears the affected buffers just before exiting to userspace. Also add sysfs reporting and cmdline parameter "reg_file_data_sampling" to control the mitigation. For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst [ pawan: - Resolved conflicts in sysfs reporting. - s/ATOM_GRACEMONT/ALDERLAKE_N/ATOM_GRACEMONT is called ALDERLAKE_N in 6.6. ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13Drivers: hv: vmbus: Calculate ring buffer size for more efficient use of memoryMichael Kelley1-1/+21
commit b8209544296edbd1af186e2ea9c648642c37b18c upstream. The VMBUS_RING_SIZE macro adds space for a ring buffer header to the requested ring buffer size. The header size is always 1 page, and so its size varies based on the PAGE_SIZE for which the kernel is built. If the requested ring buffer size is a large power-of-2 size and the header size is small, the resulting size is inefficient in its use of memory. For example, a 512 Kbyte ring buffer with a 4 Kbyte page size results in a 516 Kbyte allocation, which is rounded to up 1 Mbyte by the memory allocator, and wastes 508 Kbytes of memory. In such situations, the exact size of the ring buffer isn't that important, and it's OK to allocate the 4 Kbyte header at the beginning of the 512 Kbytes, leaving the ring buffer itself with just 508 Kbytes. The memory allocation can be 512 Kbytes instead of 1 Mbyte and nothing is wasted. Update VMBUS_RING_SIZE to implement this approach for "large" ring buffer sizes. "Large" is somewhat arbitrarily defined as 8 times the size of the ring buffer header (which is of size PAGE_SIZE). For example, for 4 Kbyte PAGE_SIZE, ring buffers of 32 Kbytes and larger use the first 4 Kbytes as the ring buffer header. For 64 Kbyte PAGE_SIZE, ring buffers of 512 Kbytes and larger use the first 64 Kbytes as the ring buffer header. In both cases, smaller sizes add space for the header so the ring size isn't reduced too much by using part of the space for the header. For example, with a 64 Kbyte page size, we don't want a 128 Kbyte ring buffer to be reduced to 64 Kbytes by allocating half of the space for the header. In such a case, the memory allocation is less efficient, but it's the best that can be done. While the new algorithm slightly changes the amount of space allocated for ring buffers by drivers that use VMBUS_RING_SIZE, the devices aren't known to be sensitive to small changes in ring buffer size, so there shouldn't be any effect. Fixes: c1135c7fd0e9 ("Drivers: hv: vmbus: Introduce types of GPADL") Fixes: 6941f67ad37d ("hv_netvsc: Calculate correct ring size when PAGE_SIZE is not 4 Kbytes") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218502 Cc: stable@vger.kernel.org Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Tested-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Link: https://lore.kernel.org/r/20240229004533.313662-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240229004533.313662-1-mhklinux@outlook.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13nfs: fix UAF in direct writesJosef Bacik1-0/+1
[ Upstream commit 17f46b803d4f23c66cacce81db35fef3adb8f2af ] In production we have been hitting the following warning consistently ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0 Workqueue: nfsiod nfs_direct_write_schedule_work [nfs] RIP: 0010:refcount_warn_saturate+0x9c/0xe0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x9f/0x130 ? refcount_warn_saturate+0x9c/0xe0 ? report_bug+0xcc/0x150 ? handle_bug+0x3d/0x70 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0x9c/0xe0 nfs_direct_write_schedule_work+0x237/0x250 [nfs] process_one_work+0x12f/0x4a0 worker_thread+0x14e/0x3b0 ? ZSTD_getCParams_internal+0x220/0x220 kthread+0xdc/0x120 ? __btf_name_valid+0xa0/0xa0 ret_from_fork+0x1f/0x30 This is because we're completing the nfs_direct_request twice in a row. The source of this is when we have our commit requests to submit, we process them and send them off, and then in the completion path for the commit requests we have if (nfs_commit_end(cinfo.mds)) nfs_direct_write_complete(dreq); However since we're submitting asynchronous requests we sometimes have one that completes before we submit the next one, so we end up calling complete on the nfs_direct_request twice. The only other place we use nfs_generic_commit_list() is in __nfs_commit_inode, which wraps this call in a nfs_commit_begin(); nfs_commit_end(); Which is a common pattern for this style of completion handling, one that is also repeated in the direct code with get_dreq()/put_dreq() calls around where we process events as well as in the completion paths. Fix this by using the same pattern for the commit requests. Before with my 200 node rocksdb stress running this warning would pop every 10ish minutes. With my patch the stress test has been running for several hours without popping. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13phy: tegra: xusb: Add API to retrieve the port number of phyWayne Chang1-0/+2
[ Upstream commit d843f031d9e90462253015bc0bd9e3852d206bf2 ] This patch introduces a new API, tegra_xusb_padctl_get_port_number, to the Tegra XUSB Pad Controller driver. This API is used to identify the USB port that is associated with a given PHY. The function takes a PHY pointer for either a USB2 PHY or USB3 PHY as input and returns the corresponding port number. If the PHY pointer is invalid, it returns -ENODEV. Cc: stable@vger.kernel.org Signed-off-by: Wayne Chang <waynec@nvidia.com> Reviewed-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Link: https://lore.kernel.org/r/20240307030328.1487748-2-waynec@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13PCI: Work around Intel I210 ROM BAR overlap defectBjorn Helgaas1-0/+1
[ Upstream commit 500b55b05d0a21c4adddf4c3b29ee6f32b502046 ] Per PCIe r5, sec 7.5.1.2.4, a device must not claim accesses to its Expansion ROM unless both the Memory Space Enable and the Expansion ROM Enable bit are set. But apparently some Intel I210 NICs don't work correctly if the ROM BAR overlaps another BAR, even if the Expansion ROM is disabled. Michael reported that on a Kontron SMARC-sAL28 ARM64 system with U-Boot v2021.01-rc3, the ROM BAR overlaps BAR 3, and networking doesn't work at all: BAR 0: 0x40000000 (32-bit, non-prefetchable) [size=1M] BAR 3: 0x40200000 (32-bit, non-prefetchable) [size=16K] ROM: 0x40200000 (disabled) [size=1M] NETDEV WATCHDOG: enP2p1s0 (igb): transmit queue 0 timed out Hardware name: Kontron SMARC-sAL28 (Single PHY) on SMARC Eval 2.0 carrier (DT) igb 0002:01:00.0 enP2p1s0: Reset adapter Previously, pci_std_update_resource() wrote the assigned ROM address to the BAR only when the ROM was enabled. This meant that the I210 ROM BAR could be left with an address assigned by firmware, which might overlap with other BARs. Quirk these I210 devices so pci_std_update_resource() always writes the assigned address to the ROM BAR, whether or not the ROM is enabled. Link: https://lore.kernel.org/r/20211223163754.GA1267351@bhelgaas Link: https://lore.kernel.org/r/20201230185317.30915-1-michael@walle.cc Link: https://bugzilla.kernel.org/show_bug.cgi?id=211105 Reported-by: Michael Walle <michael@walle.cc> Tested-by: Michael Walle <michael@walle.cc> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Stable-dep-of: 627c6db20703 ("PCI/DPC: Quirk PIO log size for Intel Raptor Lake Root Ports") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13PCI: Cache PCIe Device Capabilities registerAmey Narkhede1-0/+1
[ Upstream commit 69139244806537f9d51364f37fe146bb2ee88a05 ] Add a new member called devcap in struct pci_dev for caching the PCIe Device Capabilities register to avoid reading PCI_EXP_DEVCAP multiple times. Refactor pcie_has_flr() to use cached device capabilities. Link: https://lore.kernel.org/r/20210817180500.1253-2-ameynarkhede03@gmail.com Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> Stable-dep-of: 627c6db20703 ("PCI/DPC: Quirk PIO log size for Intel Raptor Lake Root Ports") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13PCI/ERR: Cache RCEC EA Capability offset in pci_init_capabilities()Sean V Kelley1-0/+4
[ Upstream commit 90655631988f8f501529e6de5f13614389717ead ] Extend support for Root Complex Event Collectors by decoding and caching the RCEC Endpoint Association Extended Capabilities when enumerating. Use that cached information for later error source reporting. See PCIe r5.0, sec 7.9.10. Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20201121001036.8560-4-sean.v.kelley@intel.com Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> # non-native/no RCEC Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Stable-dep-of: 627c6db20703 ("PCI/DPC: Quirk PIO log size for Intel Raptor Lake Root Ports") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13block: introduce zone_write_granularity limitDamien Le Moal1-0/+15
[ Upstream commit a805a4fa4fa376bbc145762bb8b09caa2fa8af48 ] Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that all writes into sequential write required zones be aligned to the device physical block size. However, NVMe ZNS does not have this constraint and allows write operations into sequential zones to be aligned to the device logical block size. This inconsistency does not help with software portability across device types. To solve this, introduce the zone_write_granularity queue limit to indicate the alignment constraint, in bytes, of write operations into zones of a zoned block device. This new limit is exported as a read-only sysfs queue attribute and the helper blk_queue_zone_write_granularity() introduced for drivers to set this limit. The function blk_queue_set_zoned() is modified to set this new limit to the device logical block size by default. NVMe ZNS devices as well as zoned nullb devices use this default value as is. The scsi disk driver is modified to execute the blk_queue_zone_write_granularity() helper to set the zone write granularity of host-managed SMR disks to the disk physical block size. The accessor functions queue_zone_write_granularity() and bdev_zone_write_granularity() are also introduced. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: c8f6f88d2592 ("block: Clear zone limits for a non-zoned stacked queue") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13timers: Rename del_timer_sync() to timer_delete_sync()Thomas Gleixner1-1/+14
[ Upstream commit 9b13df3fb64ee95e2397585404e442afee2c7d4f ] The timer related functions do not have a strict timer_ prefixed namespace which is really annoying. Rename del_timer_sync() to timer_delete_sync() and provide del_timer_sync() as a wrapper. Document that del_timer_sync() is not for new code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/20221123201624.954785441@linutronix.de Stable-dep-of: 0f7352557a35 ("wifi: brcmfmac: Fix use-after-free bug in brcmf_cfg80211_detach") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13timers: Use del_timer_sync() even on UPThomas Gleixner1-6/+1
[ Upstream commit 168f6b6ffbeec0b9333f3582e4cf637300858db5 ] del_timer_sync() is assumed to be pointless on uniprocessor systems and can be mapped to del_timer() because in theory del_timer() can never be invoked while the timer callback function is executed. This is not entirely true because del_timer() can be invoked from interrupt context and therefore hit in the middle of a running timer callback. Contrary to that del_timer_sync() is not allowed to be invoked from interrupt context unless the affected timer is marked with TIMER_IRQSAFE. del_timer_sync() has proper checks in place to detect such a situation. Give up on the UP optimization and make del_timer_sync() unconditionally available. Co-developed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/all/20220407161745.7d6754b3@gandalf.local.home Link: https://lore.kernel.org/all/20221110064101.429013735@goodmis.org Link: https://lore.kernel.org/r/20221123201624.888306160@linutronix.de Stable-dep-of: 0f7352557a35 ("wifi: brcmfmac: Fix use-after-free bug in brcmf_cfg80211_detach") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27rcu: add a helper to report consolidated flavor QSYan Zhai1-0/+31
[ Upstream commit 1a77557d48cff187a169c2aec01c0dd78a5e7e50 ] When under heavy load, network processing can run CPU-bound for many tens of seconds. Even in preemptible kernels (non-RT kernel), this can block RCU Tasks grace periods, which can cause trace-event removal to take more than a minute, which is unacceptably long. This commit therefore creates a new helper function that passes through both RCU and RCU-Tasks quiescent states every 100 milliseconds. This hard-coded value suffices for current workloads. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/90431d46ee112d2b0af04dbfe936faaca11810a5.1710877680.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 00bf63122459 ("bpf: report RCU QS in cpumap kthread") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27remoteproc: Add new get_loaded_rsc_table() to rproc_opsMathieu Poirier1-1/+5
[ Upstream commit 1a631382be1d22ddab0582dae3498b3d28e2e44a ] Add a new get_loaded_rsc_table() operation in order to support scenarios where the remoteproc core has booted a remote processor and detaches from it. When re-attaching to the remote processor, the core needs to know where the resource table has been placed in memory. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Arnaud Pouliquen <arnaud.pouliquen@st.com> Link: https://lore.kernel.org/r/20210312162453.1234145-6-mathieu.poirier@linaro.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Stable-dep-of: 32381bbccba4 ("remoteproc: stm32: Fix incorrect type in assignment for va") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27bpf: net: Change do_ip_getsockopt() to take the sockptr_t argumentMartin KaFai Lau2-5/+5
[ Upstream commit 728f064cd7ebea8c182e99e6f152c8b4a0a6b071 ] Similar to the earlier patch that changes sk_getsockopt() to take the sockptr_t argument. This patch also changes do_ip_getsockopt() to take the sockptr_t argument such that a latter patch can make bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt(). Note on the change in ip_mc_gsfget(). This function is to return an array of sockaddr_storage in optval. This function is shared between ip_get_mcast_msfilter() and compat_ip_get_mcast_msfilter(). However, the sockaddr_storage is stored at different offset of the optval because of the difference between group_filter and compat_group_filter. Thus, a new 'ss_offset' argument is added to ip_mc_gsfget(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002828.2890585-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 5c3be3e0eb44 ("ipmr: fix incorrect parameter validation in the ip_mroute_getsockopt() function") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27PCI: Make pci_dev_is_disconnected() helper public for other driversEthan Zhao1-0/+5
[ Upstream commit 39714fd73c6b60a8d27bcc5b431afb0828bf4434 ] Make pci_dev_is_disconnected() public so that it can be called from Intel VT-d driver to quickly fix/workaround the surprise removal unplug hang issue for those ATS capable devices on PCIe switch downstream hotplug capable ports. Beside pci_device_is_present() function, this one has no config space space access, so is light enough to optimize the normal pure surprise removal and safe removal flow. Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Tested-by: Haorong Ye <yehaorong@bytedance.com> Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com> Link: https://lore.kernel.org/r/20240301080727.3529832-2-haifeng.zhao@linux.intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Stable-dep-of: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27bpf: Mark bpf_spin_{lock,unlock}() helpers with notrace correctlyYonghong Song1-9/+12
[ Upstream commit 178c54666f9c4d2f49f2ea661d0c11b52f0ed190 ] Currently tracing is supposed not to allow for bpf_spin_{lock,unlock}() helper calls. This is to prevent deadlock for the following cases: - there is a prog (prog-A) calling bpf_spin_{lock,unlock}(). - there is a tracing program (prog-B), e.g., fentry, attached to bpf_spin_lock() and/or bpf_spin_unlock(). - prog-B calls bpf_spin_{lock,unlock}(). For such a case, when prog-A calls bpf_spin_{lock,unlock}(), a deadlock will happen. The related source codes are below in kernel/bpf/helpers.c: notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock) notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock) notrace is supposed to prevent fentry prog from attaching to bpf_spin_{lock,unlock}(). But actually this is not the case and fentry prog can successfully attached to bpf_spin_lock(). Siddharth Chintamaneni reported the issue in [1]. The following is the macro definition for above BPF_CALL_1: #define BPF_CALL_x(x, name, ...) \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \ { \ return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\ } \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)) #define BPF_CALL_1(name, ...) BPF_CALL_x(1, name, __VA_ARGS__) The notrace attribute is actually applied to the static always_inline function ____bpf_spin_{lock,unlock}(). The actual callback function bpf_spin_{lock,unlock}() is not marked with notrace, hence allowing fentry prog to attach to two helpers, and this may cause the above mentioned deadlock. Siddharth Chintamaneni actually has a reproducer in [2]. To fix the issue, a new macro NOTRACE_BPF_CALL_1 is introduced which will add notrace attribute to the original function instead of the hidden always_inline function and this fixed the problem. [1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@mail.gmail.com/ [2] https://lore.kernel.org/bpf/CAE5sdEg6yUc_Jz50AnUXEEUh6O73yQ1Z6NV2srJnef0ZrQkZew@mail.gmail.com/ Fixes: d83525ca62cf ("bpf: introduce bpf_spin_lock") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240207070102.335167-1-yonghong.song@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27block: add a new set_read_only methodChristoph Hellwig1-0/+1
[ Upstream commit e00adcadf3af7a8335026d71ab9f0e0a922191ac ] Add a new method to allow for driver-specific processing when setting or clearing the block device read-only state. This allows to replace the cumbersome and error-prone override of the whole ioctl implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 9674f54e41ff ("md: Don't clear MD_CLOSING when the raid is about to stop") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27fs/select: rework stack allocation hack for clangArnd Bergmann1-4/+0
[ Upstream commit ddb9fd7a544088ed70eccbb9f85e9cc9952131c1 ] A while ago, we changed the way that select() and poll() preallocate a temporary buffer just under the size of the static warning limit of 1024 bytes, as clang was frequently going slightly above that limit. The warnings have recently returned and I took another look. As it turns out, clang is not actually inherently worse at reserving stack space, it just happens to inline do_select() into core_sys_select(), while gcc never inlines it. Annotate do_select() to never be inlined and in turn remove the special case for the allocation size. This should give the same behavior for both clang and gcc all the time and once more avoids those warnings. Fixes: ad312f95d41c ("fs/select: avoid clang stack usage warning") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20240216202352.2492798-1-arnd@kernel.org Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27RDMA/mlx5: Fix fortify source warning while accessing Eth segmentLeon Romanovsky1-1/+4
[ Upstream commit 4d5e86a56615cc387d21c629f9af8fb0e958d350 ] ------------[ cut here ]------------ memcpy: detected field-spanning write (size 56) of single field "eseg->inline_hdr.start" at /var/lib/dkms/mlnx-ofed-kernel/5.8/build/drivers/infiniband/hw/mlx5/wr.c:131 (size 2) WARNING: CPU: 0 PID: 293779 at /var/lib/dkms/mlnx-ofed-kernel/5.8/build/drivers/infiniband/hw/mlx5/wr.c:131 mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] Modules linked in: 8021q garp mrp stp llc rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mlx5_core(OE) pci_hyperv_intf mlxdevm(OE) mlx_compat(OE) tls mlxfw(OE) psample nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink mst_pciconf(OE) knem(OE) vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd irqbypass cuse nfsv3 nfs fscache netfs xfrm_user xfrm_algo ipmi_devintf ipmi_msghandler binfmt_misc crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 snd_pcsp aesni_intel crypto_simd cryptd snd_pcm snd_timer joydev snd soundcore input_leds serio_raw evbug nfsd auth_rpcgss nfs_acl lockd grace sch_fq_codel sunrpc drm efi_pstore ip_tables x_tables autofs4 psmouse virtio_net net_failover failover floppy [last unloaded: mlx_compat(OE)] CPU: 0 PID: 293779 Comm: ssh Tainted: G OE 6.2.0-32-generic #32~22.04.1-Ubuntu Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] Code: 0c 01 00 a8 01 75 25 48 8b 75 a0 b9 02 00 00 00 48 c7 c2 10 5b fd c0 48 c7 c7 80 5b fd c0 c6 05 57 0c 03 00 01 e8 95 4d 93 da <0f> 0b 44 8b 4d b0 4c 8b 45 c8 48 8b 4d c0 e9 49 fb ff ff 41 0f b7 RSP: 0018:ffffb5b48478b570 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffb5b48478b628 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffb5b48478b5e8 R13: ffff963a3c609b5e R14: ffff9639c3fbd800 R15: ffffb5b480475a80 FS: 00007fc03b444c80(0000) GS:ffff963a3dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000556f46bdf000 CR3: 0000000006ac6003 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? show_regs+0x72/0x90 ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] ? __warn+0x8d/0x160 ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] ? report_bug+0x1bb/0x1d0 ? handle_bug+0x46/0x90 ? exc_invalid_op+0x19/0x80 ? asm_exc_invalid_op+0x1b/0x20 ? mlx5_ib_post_send+0x191b/0x1a60 [mlx5_ib] mlx5_ib_post_send_nodrain+0xb/0x20 [mlx5_ib] ipoib_send+0x2ec/0x770 [ib_ipoib] ipoib_start_xmit+0x5a0/0x770 [ib_ipoib] dev_hard_start_xmit+0x8e/0x1e0 ? validate_xmit_skb_list+0x4d/0x80 sch_direct_xmit+0x116/0x3a0 __dev_xmit_skb+0x1fd/0x580 __dev_queue_xmit+0x284/0x6b0 ? _raw_spin_unlock_irq+0xe/0x50 ? __flush_work.isra.0+0x20d/0x370 ? push_pseudo_header+0x17/0x40 [ib_ipoib] neigh_connected_output+0xcd/0x110 ip_finish_output2+0x179/0x480 ? __smp_call_single_queue+0x61/0xa0 __ip_finish_output+0xc3/0x190 ip_finish_output+0x2e/0xf0 ip_output+0x78/0x110 ? __pfx_ip_finish_output+0x10/0x10 ip_local_out+0x64/0x70 __ip_queue_xmit+0x18a/0x460 ip_queue_xmit+0x15/0x30 __tcp_transmit_skb+0x914/0x9c0 tcp_write_xmit+0x334/0x8d0 tcp_push_one+0x3c/0x60 tcp_sendmsg_locked+0x2e1/0xac0 tcp_sendmsg+0x2d/0x50 inet_sendmsg+0x43/0x90 sock_sendmsg+0x68/0x80 sock_write_iter+0x93/0x100 vfs_write+0x326/0x3c0 ksys_write+0xbd/0xf0 ? do_syscall_64+0x69/0x90 __x64_sys_write+0x19/0x30 do_syscall_64+0x59/0x90 ? do_user_addr_fault+0x1d0/0x640 ? exit_to_user_mode_prepare+0x3b/0xd0 ? irqentry_exit_to_user_mode+0x9/0x20 ? irqentry_exit+0x43/0x50 ? exc_page_fault+0x92/0x1b0 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fc03ad14a37 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007ffdf8697fe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000008024 RCX: 00007fc03ad14a37 RDX: 0000000000008024 RSI: 0000556f46bd8270 RDI: 0000000000000003 RBP: 0000556f46bb1800 R08: 0000000000007fe3 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 0000556f46bc66b0 R14: 000000000000000a R15: 0000556f46bb2f50 </TASK> ---[ end trace 0000000000000000 ]--- Link: https://lore.kernel.org/r/8228ad34bd1a25047586270f7b1fb4ddcd046282.1706433934.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27bpf: Defer the free of inner map when necessaryHou Tao1-1/+6
[ Upstream commit 876673364161da50eed6b472d746ef88242b2368 ] When updating or deleting an inner map in map array or map htab, the map may still be accessed by non-sleepable program or sleepable program. However bpf_map_fd_put_ptr() decreases the ref-counter of the inner map directly through bpf_map_put(), if the ref-counter is the last one (which is true for most cases), the inner map will be freed by ops->map_free() in a kworker. But for now, most .map_free() callbacks don't use synchronize_rcu() or its variants to wait for the elapse of a RCU grace period, so after the invocation of ops->map_free completes, the bpf program which is accessing the inner map may incur use-after-free problem. Fix the free of inner map by invoking bpf_map_free_deferred() after both one RCU grace period and one tasks trace RCU grace period if the inner map has been removed from the outer map before. The deferment is accomplished by using call_rcu() or call_rcu_tasks_trace() when releasing the last ref-counter of bpf map. The newly-added rcu_head field in bpf_map shares the same storage space with work field to reduce the size of bpf_map. Fixes: bba1dc0b55ac ("bpf: Remove redundant synchronize_rcu.") Fixes: 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 62fca83303d608ad4fec3f7428c8685680bb01b0) Signed-off-by: Robert Kolchmeyer <rkolchmeyer@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27rcu-tasks: Provide rcu_trace_implies_rcu_gp()Paul E. McKenney1-0/+12
[ Upstream commit e6c86c513f440bec5f1046539c7e3c6c653842da ] As an accident of implementation, an RCU Tasks Trace grace period also acts as an RCU grace period. However, this could change at any time. This commit therefore creates an rcu_trace_implies_rcu_gp() that currently returns true to codify this accident. Code relying on this accident must call this function to verify that this accident is still happening. Reported-by: Hou Tao <houtao@huaweicloud.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Link: https://lore.kernel.org/r/20221014113946.965131-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 876673364161 ("bpf: Defer the free of inner map when necessary") Signed-off-by: Sasha Levin <sashal@kernel.org> (cherry picked from commit 10108826191ab30388e8ae9d54505a628f78a7ec) Signed-off-by: Robert Kolchmeyer <rkolchmeyer@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-27io_uring/unix: drop usage of io_uring socketJens Axboe1-5/+5
Commit a4104821ad651d8a0b374f0b2474c345bbb42f82 upstream. Since we no longer allow sending io_uring fds over SCM_RIGHTS, move to using io_is_uring_fops() to detect whether this is a io_uring fd or not. With that done, kill off io_uring_get_socket() as nobody calls it anymore. This is in preparation to yanking out the rest of the core related to unix gc with io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15regmap: Add bulk read/write callbacks into regmap_configMarek Vasut1-0/+12
[ Upstream commit d77e745613680c54708470402e2b623dcd769681 ] Currently the regmap_config structure only allows the user to implement single element register read/write using .reg_read/.reg_write callbacks. The regmap_bus already implements bulk counterparts of both, and is being misused as a workaround for the missing bulk read/write callbacks in regmap_config by a couple of drivers. To stop this misuse, add the bulk read/write callbacks to regmap_config and call them from the regmap core code. Signed-off-by: Marek Vasut <marex@denx.de> Cc: Jagan Teki <jagan@amarulasolutions.com> Cc: Mark Brown <broonie@kernel.org> Cc: Maxime Ripard <maxime@cerno.tech> Cc: Robert Foss <robert.foss@linaro.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> To: dri-devel@lists.freedesktop.org Link: https://lore.kernel.org/r/20220430025145.640305-1-marex@denx.de Signed-off-by: Mark Brown <broonie@kernel.org> Stable-dep-of: 3f42b142ea11 ("serial: max310x: fix IO data corruption in batched operations") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15regmap: allow to define reg_update_bits for no bus configurationAnsuel Smith1-0/+7
[ Upstream commit 02d6fdecb9c38de19065f6bed8d5214556fd061d ] Some device requires a special handling for reg_update_bits and can't use the normal regmap read write logic. An example is when locking is handled by the device and rmw operations requires to do atomic operations. Allow to declare a dedicated function in regmap_config for reg_update_bits in no bus configuration. Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Link: https://lore.kernel.org/r/20211104150040.1260-1-ansuelsmth@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org> Stable-dep-of: 3f42b142ea11 ("serial: max310x: fix IO data corruption in batched operations") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15mm/hugetlb: change hugetlb_reserve_pages() to type boolMike Kravetz1-1/+1
[ Upstream commit 33b8f84a4ee78491a8f4f9e4c5520c9da4a10983 ] While reviewing a bug in hugetlb_reserve_pages, it was noticed that all callers ignore the return value. Any failure is considered an ENOMEM error by the callers. Change the function to be of type bool. The function will return true if the reservation was successful, false otherwise. Callers currently assume a zero return code indicates success. Change the callers to look for true to indicate success. No functional change, only code cleanup. Link: https://lkml.kernel.org/r/20201221192542.15732-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Stable-dep-of: e656c7a9e596 ("mm: hugetlb pages should not be reserved by shmat() if SHM_NORESERVE") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardeningAndres Beltran1-0/+1
[ Upstream commit 4d18fcc95f50950a99bd940d4e61a983f91d267a ] Currently, pointers to guest memory are passed to Hyper-V as transaction IDs in netvsc. In the face of errors or malicious behavior in Hyper-V, netvsc should not expose or trust the transaction IDs returned by Hyper-V to be valid guest memory addresses. Instead, use small integers generated by vmbus_requestor as requests (transaction) IDs. Signed-off-by: Andres Beltran <lkmlabelt@gmail.com> Co-developed-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Wei Liu <wei.liu@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Link: https://lore.kernel.org/r/20201109100402.8946-4-parri.andrea@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Stable-dep-of: 9cae43da9867 ("hv_netvsc: Register VF in netvsc_probe if NET_DEVICE_REGISTER missed") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardeningAndres Beltran1-0/+22
[ Upstream commit e8b7db38449ac5b950a3f00519171c4be3e226ff ] Currently, VMbus drivers use pointers into guest memory as request IDs for interactions with Hyper-V. To be more robust in the face of errors or malicious behavior from a compromised Hyper-V, avoid exposing guest memory addresses to Hyper-V. Also avoid Hyper-V giving back a bad request ID that is then treated as the address of a guest data structure with no validation. Instead, encapsulate these memory addresses and provide small integers as request IDs. Signed-off-by: Andres Beltran <lkmlabelt@gmail.com> Co-developed-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20201109100402.8946-2-parri.andrea@gmail.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Stable-dep-of: 9cae43da9867 ("hv_netvsc: Register VF in netvsc_probe if NET_DEVICE_REGISTER missed") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15lsm: fix default return value of the socket_getpeersec_*() hooksOndrej Mosnacek1-2/+2
[ Upstream commit 5a287d3d2b9de2b3e747132c615599907ba5c3c1 ] For these hooks the true "neutral" value is -EOPNOTSUPP, which is currently what is returned when no LSM provides this hook and what LSMs return when there is no security context set on the socket. Correct the value in <linux/lsm_hooks.h> and adjust the dispatch functions in security/security.c to avoid issues when the BPF LSM is enabled. Cc: stable@vger.kernel.org Fixes: 98e828a0650f ("security: Refactor declaration of LSM hooks") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> [PM: subject line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15lsm: make security_socket_getpeersec_stream() sockptr_t safePaul Moore3-7/+10
[ Upstream commit b10b9c342f7571f287fd422be5d5c0beb26ba974 ] Commit 4ff09db1b79b ("bpf: net: Change sk_getsockopt() to take the sockptr_t argument") made it possible to call sk_getsockopt() with both user and kernel address space buffers through the use of the sockptr_t type. Unfortunately at the time of conversion the security_socket_getpeersec_stream() LSM hook was written to only accept userspace buffers, and in a desire to avoid having to change the LSM hook the commit author simply passed the sockptr_t's userspace buffer pointer. Since the only sk_getsockopt() callers at the time of conversion which used kernel sockptr_t buffers did not allow SO_PEERSEC, and hence the security_socket_getpeersec_stream() hook, this was acceptable but also very fragile as future changes presented the possibility of silently passing kernel space pointers to the LSM hook. There are several ways to protect against this, including careful code review of future commits, but since relying on code review to catch bugs is a recipe for disaster and the upstream eBPF maintainer is "strongly against defensive programming", this patch updates the LSM hook, and all of the implementations to support sockptr_t and safely handle both user and kernel space buffers. Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Stable-dep-of: 5a287d3d2b9d ("lsm: fix default return value of the socket_getpeersec_*() hooks") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-15bpf: net: Change sk_getsockopt() to take the sockptr_t argumentMartin KaFai Lau2-2/+6
[ Upstream commit 4ff09db1b79b98b4a2a7511571c640b76cab3beb ] This patch changes sk_getsockopt() to take the sockptr_t argument such that it can be used by bpf_getsockopt(SOL_SOCKET) in a latter patch. security_socket_getpeersec_stream() is not changed. It stays with the __user ptr (optval.user and optlen.user) to avoid changes to other security hooks. bpf_getsockopt(SOL_SOCKET) also does not support SO_PEERSEC. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002802.2888419-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: 5a287d3d2b9d ("lsm: fix default return value of the socket_getpeersec_*() hooks") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaioBart Van Assche1-0/+2
commit b820de741ae48ccf50dd95e297889c286ff4f760 upstream. If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the following kernel warning appears: WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8 Call trace: kiocb_set_cancel_fn+0x9c/0xa8 ffs_epfile_read_iter+0x144/0x1d0 io_read+0x19c/0x498 io_issue_sqe+0x118/0x27c io_submit_sqes+0x25c/0x5fc __arm64_sys_io_uring_enter+0x104/0xab0 invoke_syscall+0x58/0x11c el0_svc_common+0xb4/0xf4 do_el0_svc+0x2c/0xb0 el0_svc+0x2c/0xa4 el0t_64_sync_handler+0x68/0xb4 el0t_64_sync+0x1a4/0x1a8 Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is submitted by libaio. Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01drm/syncobj: make lockdep complain on WAIT_FOR_SUBMIT v3Christian König1-0/+5
[ Upstream commit 7621350c6bb20fb6ab7eb988833ab96eac3dcbef ] DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT can't be used when we hold locks since we are basically waiting for userspace to do something. Holding a lock while doing so can trivial deadlock with page faults etc... So make lockdep complain when a driver tries to do this. v2: Add lockdep_assert_none_held() macro. v3: Add might_sleep() and also use lockdep_assert_none_held() in the IOCTL path. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patchwork.freedesktop.org/patch/414944/ Stable-dep-of: 3c43177ffb54 ("drm/syncobj: call drm_syncobj_fence_add_wait when WAIT_AVAILABLE flag is set") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01net: dev: Convert sa_data to flexible array in struct sockaddrKees Cook1-1/+4
[ Upstream commit b5f0de6df6dce8d641ef58ef7012f3304dffb9a1 ] One of the worst offenders of "fake flexible arrays" is struct sockaddr, as it is the classic example of why GCC and Clang have been traditionally forced to treat all trailing arrays as fake flexible arrays: in the distant misty past, sa_data became too small, and code started just treating it as a flexible array, even though it was fixed-size. The special case by the compiler is specifically that sizeof(sa->sa_data) and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1)) do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat it as a flexible array. However, the coming -fstrict-flex-arrays compiler flag will remove these special cases so that FORTIFY_SOURCE can gain coverage over all the trailing arrays in the kernel that are _not_ supposed to be treated as a flexible array. To deal with this change, convert sa_data to a true flexible array. To keep the structure size the same, move sa_data into a union with a newly introduced sa_data_min with the original size. The result is that FORTIFY_SOURCE can continue to have no idea how large sa_data may actually be, but anything using sizeof(sa->sa_data) must switch to sizeof(sa->sa_data_min). Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: Dylan Yudaken <dylany@fb.com> Cc: Yajun Deng <yajun.deng@linux.dev> Cc: Petr Machata <petrm@nvidia.com> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: syzbot <syzkaller@googlegroups.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221018095503.never.671-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: a7d6027790ac ("arp: Prevent overflow in arp_req_get().") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01task_stack, x86/cea: Force-inline stack helpersBorislav Petkov1-1/+1
[ Upstream commit e87f4152e542610d0b4c6c8548964a68a59d2040 ] Force-inline two stack helpers to fix the following objtool warnings: vmlinux.o: warning: objtool: in_task_stack()+0xc: call to task_stack_page() leaves .noinstr.text section vmlinux.o: warning: objtool: in_entry_stack()+0x10: call to cpu_entry_stack() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220324183607.31717-2-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23PM: runtime: Have devm_pm_runtime_enable() handle ↵Douglas Anderson1-0/+4
pm_runtime_dont_use_autosuspend() [ Upstream commit b4060db9251f919506e4d672737c6b8ab9a84701 ] The PM Runtime docs say: Drivers in ->remove() callback should undo the runtime PM changes done in ->probe(). Usually this means calling pm_runtime_disable(), pm_runtime_dont_use_autosuspend() etc. >From grepping code, it's clear that many people aren't aware of the need to call pm_runtime_dont_use_autosuspend(). When brainstorming solutions, one idea that came up was to leverage the new-ish devm_pm_runtime_enable() function. The idea here is that: * When the devm action is called we know that the driver is being removed. It's the perfect time to undo the use_autosuspend. * The code of pm_runtime_dont_use_autosuspend() already handles the case of being called when autosuspend wasn't enabled. Suggested-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 3d07a411b4fa ("drm/msm/dsi: Use pm_runtime_resume_and_get to prevent refcnt leaks") Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23PM: runtime: add devm_pm_runtime_enable helperDmitry Baryshkov1-0/+4
[ Upstream commit b3636a3a2c51715736d3ec45f635ed03191962ce ] A typical code pattern for pm_runtime_enable() call is to call it in the _probe function and to call pm_runtime_disable() both from _probe error path and from _remove function. For some drivers the whole remove function would consist of the call to pm_remove_disable(). Add helper function to replace this bolierplate piece of code. Calling devm_pm_runtime_enable() removes the need for calling pm_runtime_disable() both in the probe()'s error path and in the remove() function. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://lore.kernel.org/r/20210731195034.979084-2-dmitry.baryshkov@linaro.org Acked-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org> Stable-dep-of: 3d07a411b4fa ("drm/msm/dsi: Use pm_runtime_resume_and_get to prevent refcnt leaks") Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23netfilter: ipset: fix performance regression in swap operationJozsef Kadlecsik1-0/+4
commit 97f7cf1cd80eeed3b7c808b7c12463295c751001 upstream. The patch "netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test", commit 28628fa9 fixes a race condition. But the synchronize_rcu() added to the swap function unnecessarily slows it down: it can safely be moved to destroy and use call_rcu() instead. Eric Dumazet pointed out that simply calling the destroy functions as rcu callback does not work: sets with timeout use garbage collectors which need cancelling at destroy which can wait. Therefore the destroy functions are split into two: cancelling garbage collectors safely at executing the command received by netlink and moving the remaining part only into the rcu callback. Link: https://lore.kernel.org/lkml/C0829B10-EAA6-4809-874E-E1E9C05A8D84@automattic.com/ Fixes: 28628fa952fe ("netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test") Reported-by: Ale Crismani <ale.crismani@automattic.com> Reported-by: David Wang <00107082@163.com> Tested-by: David Wang <00107082@163.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23serial: Add rs485_supported to uart_portIlpo Järvinen1-0/+1
[ Upstream commit 8925c31c1ac2f1e05da988581f2a70a2a8c4d638 ] Preparing to move serial_rs485 struct sanitization into serial core, each driver has to provide what fields/flags it supports. This information is pointed into by rs485_supported. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220606100433.13793-4-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 0c2a5f471ce5 ("serial: 8250_exar: Set missing rs485_supported flag") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23hrtimer: Report offline hrtimer enqueueFrederic Weisbecker1-1/+3
commit dad6a09f3148257ac1773cd90934d721d68ab595 upstream. The hrtimers migration on CPU-down hotplug process has been moved earlier, before the CPU actually goes to die. This leaves a small window of opportunity to queue an hrtimer in a blind spot, leaving it ignored. For example a practical case has been reported with RCU waking up a SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that way a sched/rt timer to the local offline CPU. Make sure such situations never go unnoticed and warn when that happens. Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>