summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
42 hoursxen/privcmd: add boot control for restricted usage in domUJuergen Gross1-0/+1
commit 1613462be621ad5103ec338a7b0ca0746ec4e5f1 upstream. When running in an unprivileged domU under Xen, the privcmd driver is restricted to allow only hypercalls against a target domain, for which the current domU is acting as a device model. Add a boot parameter "unrestricted" to allow all hypercalls (the hypervisor will still refuse destructive hypercalls affecting other guests). Make this new parameter effective only in case the domU wasn't started using secure boot, as otherwise hypercalls targeting the domU itself might result in violating the secure boot functionality. This is achieved by adding another lockdown reason, which can be tested to not being set when applying the "unrestricted" option. This is part of XSA-482 Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursnetfilter: nft_set_pipapo: split gc into unlink and reclaim phaseFlorian Westphal1-0/+5
commit 9df95785d3d8302f7c066050117b04cd3c2048c2 upstream. Yiming Qian reports Use-after-free in the pipapo set type: Under a large number of expired elements, commit-time GC can run for a very long time in a non-preemptible context, triggering soft lockup warnings and RCU stall reports (local denial of service). We must split GC in an unlink and a reclaim phase. We cannot queue elements for freeing until pointers have been swapped. Expired elements are still exposed to both the packet path and userspace dumpers via the live copy of the data structure. call_rcu() does not protect us: dump operations or element lookups starting after call_rcu has fired can still observe the free'd element, unless the commit phase has made enough progress to swap the clone and live pointers before any new reader has picked up the old version. This a similar approach as done recently for the rbtree backend in commit 35f83a75529a ("netfilter: nft_set_rbtree: don't gc elements on insert"). Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursudp_tunnel: fix NULL deref caused by udp_sock_create6 when CONFIG_IPV6=nXiang Mei1-1/+1
[ Upstream commit b3a6df291fecf5f8a308953b65ca72b7fc9e015d ] When CONFIG_IPV6 is disabled, the udp_sock_create6() function returns 0 (success) without actually creating a socket. Callers such as fou_create() then proceed to dereference the uninitialized socket pointer, resulting in a NULL pointer dereference. The captured NULL deref crash: BUG: kernel NULL pointer dereference, address: 0000000000000018 RIP: 0010:fou_nl_add_doit (net/ipv4/fou_core.c:590 net/ipv4/fou_core.c:764) [...] Call Trace: <TASK> genl_family_rcv_msg_doit.constprop.0 (net/netlink/genetlink.c:1114) genl_rcv_msg (net/netlink/genetlink.c:1194 net/netlink/genetlink.c:1209) [...] netlink_rcv_skb (net/netlink/af_netlink.c:2550) genl_rcv (net/netlink/genetlink.c:1219) netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344) netlink_sendmsg (net/netlink/af_netlink.c:1894) __sock_sendmsg (net/socket.c:727 (discriminator 1) net/socket.c:742 (discriminator 1)) __sys_sendto (./include/linux/file.h:62 (discriminator 1) ./include/linux/file.h:83 (discriminator 1) net/socket.c:2183 (discriminator 1)) __x64_sys_sendto (net/socket.c:2213 (discriminator 1) net/socket.c:2209 (discriminator 1) net/socket.c:2209 (discriminator 1)) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (net/arch/x86/entry/entry_64.S:130) This patch makes udp_sock_create6 return -EPFNOSUPPORT instead, so callers correctly take their error paths. There is only one caller of the vulnerable function and only privileged users can trigger it. Fixes: fd384412e199b ("udp_tunnel: Seperate ipv6 functions into its own file.") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260317010241.1893893-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursclsact: Fix use-after-free in init/destroy rollback asymmetryDaniel Borkmann1-0/+5
[ Upstream commit a0671125d4f55e1e98d9bde8a0b671941987e208 ] Fix a use-after-free in the clsact qdisc upon init/destroy rollback asymmetry. The latter is achieved by first fully initializing a clsact instance, and then in a second step having a replacement failure for the new clsact qdisc instance. clsact_init() initializes ingress first and then takes care of the egress part. This can fail midway, for example, via tcf_block_get_ext(). Upon failure, the kernel will trigger the clsact_destroy() callback. Commit 1cb6f0bae504 ("bpf: Fix too early release of tcx_entry") details the way how the transition is happening. If tcf_block_get_ext on the q->ingress_block ends up failing, we took the tcx_miniq_inc reference count on the ingress side, but not yet on the egress side. clsact_destroy() tests whether the {ingress,egress}_entry was non-NULL. However, even in midway failure on the replacement, both are in fact non-NULL with a valid egress_entry from the previous clsact instance. What we really need to test for is whether the qdisc instance-specific ingress or egress side previously got initialized. This adds a small helper for checking the miniq initialization called mini_qdisc_pair_inited, and utilizes that upon clsact_destroy() in order to fix the use-after-free scenario. Convert the ingress_destroy() side as well so both are consistent to each other. Fixes: 1cb6f0bae504 ("bpf: Fix too early release of tcx_entry") Reported-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260313065531.98639-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursnet/sched: teql: Fix double-free in teql_master_xmitJamal Hadi Salim1-0/+28
[ Upstream commit 66360460cab63c248ca5b1070a01c0c29133b960 ] Whenever a TEQL devices has a lockless Qdisc as root, qdisc_reset should be called using the seq_lock to avoid racing with the datapath. Failure to do so may cause crashes like the following: [ 238.028993][ T318] BUG: KASAN: double-free in skb_release_data (net/core/skbuff.c:1139) [ 238.029328][ T318] Free of addr ffff88810c67ec00 by task poc_teql_uaf_ke/318 [ 238.029749][ T318] [ 238.029900][ T318] CPU: 3 UID: 0 PID: 318 Comm: poc_teql_ke Not tainted 7.0.0-rc3-00149-ge5b31d988a41 #704 PREEMPT(full) [ 238.029906][ T318] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 238.029910][ T318] Call Trace: [ 238.029913][ T318] <TASK> [ 238.029916][ T318] dump_stack_lvl (lib/dump_stack.c:122) [ 238.029928][ T318] print_report (mm/kasan/report.c:379 mm/kasan/report.c:482) [ 238.029940][ T318] ? skb_release_data (net/core/skbuff.c:1139) [ 238.029944][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) ... [ 238.029957][ T318] ? skb_release_data (net/core/skbuff.c:1139) [ 238.029969][ T318] kasan_report_invalid_free (mm/kasan/report.c:221 mm/kasan/report.c:563) [ 238.029979][ T318] ? skb_release_data (net/core/skbuff.c:1139) [ 238.029989][ T318] check_slab_allocation (mm/kasan/common.c:231) [ 238.029995][ T318] kmem_cache_free (mm/slub.c:2637 (discriminator 1) mm/slub.c:6168 (discriminator 1) mm/slub.c:6298 (discriminator 1)) [ 238.030004][ T318] skb_release_data (net/core/skbuff.c:1139) ... [ 238.030025][ T318] sk_skb_reason_drop (net/core/skbuff.c:1256) [ 238.030032][ T318] pfifo_fast_reset (./include/linux/ptr_ring.h:171 ./include/linux/ptr_ring.h:309 ./include/linux/skb_array.h:98 net/sched/sch_generic.c:827) [ 238.030039][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) ... [ 238.030054][ T318] qdisc_reset (net/sched/sch_generic.c:1034) [ 238.030062][ T318] teql_destroy (./include/linux/spinlock.h:395 net/sched/sch_teql.c:157) [ 238.030071][ T318] __qdisc_destroy (./include/net/pkt_sched.h:328 net/sched/sch_generic.c:1077) [ 238.030077][ T318] qdisc_graft (net/sched/sch_api.c:1062 net/sched/sch_api.c:1053 net/sched/sch_api.c:1159) [ 238.030089][ T318] ? __pfx_qdisc_graft (net/sched/sch_api.c:1091) [ 238.030095][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 238.030102][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 238.030106][ T318] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 238.030114][ T318] tc_get_qdisc (net/sched/sch_api.c:1529 net/sched/sch_api.c:1556) ... [ 238.072958][ T318] Allocated by task 303 on cpu 5 at 238.026275s: [ 238.073392][ T318] kasan_save_stack (mm/kasan/common.c:58) [ 238.073884][ T318] kasan_save_track (mm/kasan/common.c:64 (discriminator 5) mm/kasan/common.c:79 (discriminator 5)) [ 238.074230][ T318] __kasan_slab_alloc (mm/kasan/common.c:369) [ 238.074578][ T318] kmem_cache_alloc_node_noprof (./include/linux/kasan.h:253 mm/slub.c:4542 mm/slub.c:4869 mm/slub.c:4921) [ 238.076091][ T318] kmalloc_reserve (net/core/skbuff.c:616 (discriminator 107)) [ 238.076450][ T318] __alloc_skb (net/core/skbuff.c:713) [ 238.076834][ T318] alloc_skb_with_frags (./include/linux/skbuff.h:1383 net/core/skbuff.c:6763) [ 238.077178][ T318] sock_alloc_send_pskb (net/core/sock.c:2997) [ 238.077520][ T318] packet_sendmsg (net/packet/af_packet.c:2926 net/packet/af_packet.c:3019 net/packet/af_packet.c:3108) [ 238.081469][ T318] [ 238.081870][ T318] Freed by task 299 on cpu 1 at 238.028496s: [ 238.082761][ T318] kasan_save_stack (mm/kasan/common.c:58) [ 238.083481][ T318] kasan_save_track (mm/kasan/common.c:64 (discriminator 5) mm/kasan/common.c:79 (discriminator 5)) [ 238.085348][ T318] kasan_save_free_info (mm/kasan/generic.c:587 (discriminator 1)) [ 238.085900][ T318] __kasan_slab_free (mm/kasan/common.c:287) [ 238.086439][ T318] kmem_cache_free (mm/slub.c:6168 (discriminator 3) mm/slub.c:6298 (discriminator 3)) [ 238.087007][ T318] skb_release_data (net/core/skbuff.c:1139) [ 238.087491][ T318] consume_skb (net/core/skbuff.c:1451) [ 238.087757][ T318] teql_master_xmit (net/sched/sch_teql.c:358) [ 238.088116][ T318] dev_hard_start_xmit (./include/linux/netdevice.h:5324 ./include/linux/netdevice.h:5333 net/core/dev.c:3871 net/core/dev.c:3887) [ 238.088468][ T318] sch_direct_xmit (net/sched/sch_generic.c:347) [ 238.088820][ T318] __qdisc_run (net/sched/sch_generic.c:420 (discriminator 1)) [ 238.089166][ T318] __dev_queue_xmit (./include/net/sch_generic.h:229 ./include/net/pkt_sched.h:121 ./include/net/pkt_sched.h:117 net/core/dev.c:4196 net/core/dev.c:4802) Workflow to reproduce: 1. Initialize a TEQL topology (dummy0 and ifb0 as slaves, teql0 up). 2. Start multiple sender workers continuously transmitting packets through teql0 to drive teql_master_xmit(). 3. In parallel, repeatedly delete and re-add the root qdisc on dummy0 and ifb0 via RTNETLINK, forcing frequent teardown and reset activity (teql_destroy() / qdisc_reset()). 4. After running both workloads concurrently for several iterations, KASAN reports slab-use-after-free or double-free in the skb free path. Fix this by moving dev_reset_queue to sch_generic.h and calling it, instead of qdisc_reset, in teql_destroy since it handles both the lock and lockless cases correctly for root qdiscs. Fixes: 96009c7d500e ("sched: replace __QDISC_STATE_RUNNING bit with a spin lock") Reported-by: Xianrui Dong <keenanat2000@gmail.com> Tested-by: Xianrui Dong <keenanat2000@gmail.com> Co-developed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260315155422.147256-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursnet: stmmac: remove support for lpi_intr_oRussell King (Oracle)1-1/+0
commit 14eb64db8ff07b58a35b98375f446d9e20765674 upstream. The dwmac databook for v3.74a states that lpi_intr_o is a sideband signal which should be used to ungate the application clock, and this signal is synchronous to the receive clock. The receive clock can run at 2.5, 25 or 125MHz depending on the media speed, and can stop under the control of the link partner. This means that the time it takes to clear is dependent on the negotiated media speed, and thus can be 8, 40, or 400ns after reading the LPI control and status register. It has been observed with some aggressive link partners, this clock can stop while lpi_intr_o is still asserted, meaning that the signal remains asserted for an indefinite period that the local system has no direct control over. The LPI interrupts will still be signalled through the main interrupt path in any case, and this path is not dependent on the receive clock. This, since we do not gate the application clock, and the chances of adding clock gating in the future are slim due to the clocks being ill-defined, lpi_intr_o serves no useful purpose. Remove the code which requests the interrupt, and all associated code. Reported-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Tested-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> # Renesas RZ/V2H board Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vnJbt-00000007YYN-28nm@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursx86/uprobes: Fix XOL allocation failure for 32-bit tasksOleg Nesterov1-0/+1
[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ] This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursnet: dsa: properly keep track of conduit referenceVladimir Oltean1-0/+1
commit 06e219f6a706c367c93051f408ac61417643d2f9 upstream. Problem description ------------------- DSA has a mumbo-jumbo of reference handling of the conduit net device and its kobject which, sadly, is just wrong and doesn't make sense. There are two distinct problems. 1. The OF path, which uses of_find_net_device_by_node(), never releases the elevated refcount on the conduit's kobject. Nominally, the OF and non-OF paths should result in objects having identical reference counts taken, and it is already suspicious that dsa_dev_to_net_device() has a put_device() call which is missing in dsa_port_parse_of(), but we can actually even verify that an issue exists. With CONFIG_DEBUG_KOBJECT_RELEASE=y, if we run this command "before" and "after" applying this patch: (unbind the conduit driver for net device eno2) echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind we see these lines in the output diff which appear only with the patch applied: kobject: 'eno2' (ffff002009a3a6b8): kobject_release, parent 0000000000000000 (delayed 1000) kobject: '109' (ffff0020099d59a0): kobject_release, parent 0000000000000000 (delayed 1000) 2. After we find the conduit interface one way (OF) or another (non-OF), it can get unregistered at any time, and DSA remains with a long-lived, but in this case stale, cpu_dp->conduit pointer. Holding the net device's underlying kobject isn't actually of much help, it just prevents it from being freed (but we never need that kobject directly). What helps us to prevent the net device from being unregistered is the parallel netdev reference mechanism (dev_hold() and dev_put()). Actually we actually use that netdev tracker mechanism implicitly on user ports since commit 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings"), via netdev_upper_dev_link(). But time still passes at DSA switch probe time between the initial of_find_net_device_by_node() code and the user port creation time, time during which the conduit could unregister itself and DSA wouldn't know about it. So we have to run of_find_net_device_by_node() under rtnl_lock() to prevent that from happening, and release the lock only with the netdev tracker having acquired the reference. Do we need to keep the reference until dsa_unregister_switch() / dsa_switch_shutdown()? 1: Maybe yes. A switch device will still be registered even if all user ports failed to probe, see commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal"), and the cpu_dp->conduit pointers remain valid. I haven't audited all call paths to see whether they will actually use the conduit in lack of any user port, but if they do, it seems safer to not rely on user ports for that reference. 2. Definitely yes. We support changing the conduit which a user port is associated to, and we can get into a situation where we've moved all user ports away from a conduit, thus no longer hold any reference to it via the net device tracker. But we shouldn't let it go nonetheless - see the next change in relation to dsa_tree_find_first_conduit() and LAG conduits which disappear. We have to be prepared to return to the physical conduit, so the CPU port must explicitly keep another reference to it. This is also to say: the user ports and their CPU ports may not always keep a reference to the same conduit net device, and both are needed. As for the conduit's kobject for the /sys/class/net/ entry, we don't care about it, we can release it as soon as we hold the net device object itself. History and blame attribution ----------------------------- The code has been refactored so many times, it is very difficult to follow and properly attribute a blame, but I'll try to make a short history which I hope to be correct. We have two distinct probing paths: - one for OF, introduced in 2016 in commit 83c0afaec7b7 ("net: dsa: Add new binding implementation") - one for non-OF, introduced in 2017 in commit 71e0bbde0d88 ("net: dsa: Add support for platform data") These are both complete rewrites of the original probing paths (which used struct dsa_switch_driver and other weird stuff, instead of regular devices on their respective buses for register access, like MDIO, SPI, I2C etc): - one for OF, introduced in 2013 in commit 5e95329b701c ("dsa: add device tree bindings to register DSA switches") - one for non-OF, introduced in 2008 in commit 91da11f870f0 ("net: Distributed Switch Architecture protocol support") except for tiny bits and pieces like dsa_dev_to_net_device() which were seemingly carried over since the original commit, and used to this day. The point is that the original probing paths received a fix in 2015 in the form of commit 679fb46c5785 ("net: dsa: Add missing master netdev dev_put() calls"), but the fix never made it into the "new" (dsa2) probing paths that can still be traced to today, and the fixed probing path was later deleted in 2019 in commit 93e86b3bc842 ("net: dsa: Remove legacy probing support"). That is to say, the new probing paths were never quite correct in this area. The existence of the legacy probing support which was deleted in 2019 explains why dsa_dev_to_net_device() returns a conduit with elevated refcount (because it was supposed to be released during dsa_remove_dst()). After the removal of the legacy code, the only user of dsa_dev_to_net_device() calls dev_put(conduit) immediately after this function returns. This pattern makes no sense today, and can only be interpreted historically to understand why dev_hold() was there in the first place. Change details -------------- Today we have a better netdev tracking infrastructure which we should use. Logically netdev_hold() belongs in common code (dsa_port_parse_cpu(), where dp->conduit is assigned), but there is a tradeoff to be made with the rtnl_lock() section which would become a bit too long if we did that - dsa_port_parse_cpu() also calls request_module(). So we duplicate a bit of logic in order for the callers of dsa_port_parse_cpu() to be the ones responsible of holding the conduit reference and releasing it on error. This shortens the rtnl_lock() section significantly. In the dsa_switch_probe() error path, dsa_switch_release_ports() will be called in a number of situations, one being where dsa_port_parse_cpu() maybe didn't get the chance to run at all (a different port failed earlier, etc). So we have to test for the conduit being NULL prior to calling netdev_put(). There have still been so many transformations to the code since the blamed commits (rename master -> conduit, commit 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown")), that it only makes sense to fix the code using the best methods available today and see how it can be backported to stable later. I suspect the fix cannot even be backported to kernels which lack dsa_switch_shutdown(), and I suspect this is also maybe why the long-lived conduit reference didn't make it into the new DSA probing paths at the time (problems during shutdown). Because dsa_dev_to_net_device() has a single call site and has to be changed anyway, the logic was just absorbed into the non-OF dsa_port_parse(). Tested on the ocelot/felix switch and on dsa_loop, both on the NXP LS1028A with CONFIG_DEBUG_KOBJECT_RELEASE=y. Reported-by: Ma Ke <make24@iscas.ac.cn> Closes: https://lore.kernel.org/netdev/20251214131204.4684-1-make24@iscas.ac.cn/ Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Fixes: 71e0bbde0d88 ("net: dsa: Add support for platform data") Reviewed-by: Jonas Gorski <jonas.gorski@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251215150236.3931670-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> [ backport: "conduit" -> "master" in code, kept original commit message ] Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hourstracing: Add recursion protection in kernel stack trace recordingSteven Rostedt1-0/+9
[ Upstream commit 5f1ef0dfcb5b7f4a91a9b0e0ba533efd9f7e2cdb ] A bug was reported about an infinite recursion caused by tracing the rcu events with the kernel stack trace trigger enabled. The stack trace code called back into RCU which then called the stack trace again. Expand the ftrace recursion protection to add a set of bits to protect events from recursion. Each bit represents the context that the event is in (normal, softirq, interrupt and NMI). Have the stack trace code use the interrupt context to protect against recursion. Note, the bug showed an issue in both the RCU code as well as the tracing stacktrace code. This only handles the tracing stack trace side of the bug. The RCU fix will be handled separately. Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home Reported-by: Yao Kai <yaokai34@huawei.com> Tested-by: Yao Kai <yaokai34@huawei.com> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Leon Chen <leonchen.oss@139.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursrxrpc: Fix recvmsg() unconditional requeueDavid Howells1-0/+4
[ Upstream commit 2c28769a51deb6022d7fbd499987e237a01dd63a ] If rxrpc_recvmsg() fails because MSG_DONTWAIT was specified but the call at the front of the recvmsg queue already has its mutex locked, it requeues the call - whether or not the call is already queued. The call may be on the queue because MSG_PEEK was also passed and so the call was not dequeued or because the I/O thread requeued it. The unconditional requeue may then corrupt the recvmsg queue, leading to things like UAFs or refcount underruns. Fix this by only requeuing the call if it isn't already on the queue - and moving it to the front if it is already queued. If we don't queue it, we have to put the ref we obtained by dequeuing it. Also, MSG_PEEK doesn't dequeue the call so shouldn't call rxrpc_notify_socket() for the call if we didn't use up all the data on the queue, so fix that also. Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg") Reported-by: Faith <faith@zellic.io> Reported-by: Pumpkin Chang <pumpkin@devco.re> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Dionne <marc.dionne@auristor.com> cc: Nir Ohfeld <niro@wiz.io> cc: Willy Tarreau <w@1wt.eu> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org cc: stable@kernel.org Link: https://patch.msgid.link/95163.1768428203@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org> [Use spin_unlock instead of spin_unlock_irq to maintain context consistency.] Signed-off-by: Robert Garcia <rob_garcia@163.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursNFS: Fix a deadlock involving nfs_release_folio()Trond Myklebust1-0/+1
[ Upstream commit cce0be6eb4971456b703aaeafd571650d314bcca ] Wang Zhaolong reports a deadlock involving NFSv4.1 state recovery waiting on kthreadd, which is attempting to reclaim memory by calling nfs_release_folio(). The latter cannot make progress due to state recovery being needed. It seems that the only safe thing to do here is to kick off a writeback of the folio, without waiting for completion, or else kicking off an asynchronous commit. Reported-by: Wang Zhaolong <wangzhaolong@huaweicloud.com> Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> [ Minor conflict resolved. ] Signed-off-by: Li hongliang <1468888505@139.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursbtrfs: fix NULL dereference on root when tracing inode evictionMiquel Sabaté Solà1-2/+2
[ Upstream commit f157dd661339fc6f5f2b574fe2429c43bd309534 ] When evicting an inode the first thing we do is to setup tracing for it, which implies fetching the root's id. But in btrfs_evict_inode() the root might be NULL, as implied in the next check that we do in btrfs_evict_inode(). Hence, we either should set the ->root_objectid to 0 in case the root is NULL, or we move tracing setup after checking that the root is not NULL. Setting the rootid to 0 at least gives us the possibility to trace this call even in the case when the root is NULL, so that's the solution taken here. Fixes: 1abe9b8a138c ("Btrfs: add initial tracepoint support for btrfs") Reported-by: syzbot+d991fea1b4b23b1f6bf8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d991fea1b4b23b1f6bf8 Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [ Adjust context ] Signed-off-by: Bin Lan <lanbincn@139.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursnet/sched: act_gate: snapshot parameters with RCU on replacePaul Moses1-7/+26
[ Upstream commit 62413a9c3cb183afb9bb6e94dd68caf4e4145f4c ] The gate action can be replaced while the hrtimer callback or dump path is walking the schedule list. Convert the parameters to an RCU-protected snapshot and swap updates under tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits the entry list, preserve the existing schedule so the effective state is unchanged. Fixes: a51c328df310 ("net: qos: introduce a gate control flow action") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> [ hrtimer_setup() => hrtimer_init() + keep is_tcf_gate() ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursirqchip/gic-v3-its: Limit number of per-device MSIs to the range the ITS ↵Marc Zyngier1-0/+1
supports commit ce9e40a9a5e5cff0b1b0d2fa582b3d71a8ce68e8 upstream. The ITS driver blindly assumes that EventIDs are in abundant supply, to the point where it never checks how many the hardware actually supports. It turns out that some pretty esoteric integrations make it so that only a few bits are available, all the way down to a single bit. Enforce the advertised limitation at the point of allocating the device structure, and hope that the endpoint driver can deal with such limitation. Fixes: 84a6a2e7fc18d ("irqchip: GICv3: ITS: device allocation and configuration") Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Zenghui Yu <zenghui.yu@linux.dev> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260206154816.3582887-1-maz@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursmmc: core: Avoid bitfield RMW for claim/retune flagsPenghe Geng1-4/+5
commit 901084c51a0a8fb42a3f37d2e9c62083c495f824 upstream. Move claimed and retune control flags out of the bitfield word to avoid unrelated RMW side effects in asynchronous contexts. The host->claimed bit shared a word with retune flags. Writes to claimed in __mmc_claim_host() or retune_now in mmc_mq_queue_rq() can overwrite other bits when concurrent updates happen in other contexts, triggering spurious WARN_ON(!host->claimed). Convert claimed, can_retune, retune_now and retune_paused to bool to remove shared-word coupling. Fixes: 6c0cedd1ef952 ("mmc: core: Introduce host claiming by context") Fixes: 1e8e55b67030c ("mmc: block: Add CQE support") Cc: stable@vger.kernel.org Suggested-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Penghe Geng <pgeng@nvidia.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursmm/tracing: rss_stat: ensure curr is false from kthread contextKalesh Singh1-1/+7
commit 079c24d5690262e83ee476e2a548e416f3237511 upstream. The rss_stat trace event allows userspace tools, like Perfetto [1], to inspect per-process RSS metric changes over time. The curr field was introduced to rss_stat in commit e4dcad204d3a ("rss_stat: add support to detect RSS updates of external mm"). Its intent is to indicate whether the RSS update is for the mm_struct of the current execution context; and is set to false when operating on a remote mm_struct (e.g., via kswapd or a direct reclaimer). However, an issue arises when a kernel thread temporarily adopts a user process's mm_struct. Kernel threads do not have their own mm_struct and normally have current->mm set to NULL. To operate on user memory, they can "borrow" a memory context using kthread_use_mm(), which sets current->mm to the user process's mm. This can be observed, for example, in the USB Function Filesystem (FFS) driver. The ffs_user_copy_worker() handles AIO completions and uses kthread_use_mm() to copy data to a user-space buffer. If a page fault occurs during this copy, the fault handler executes in the kthread's context. At this point, current is the kthread, but current->mm points to the user process's mm. Since the rss_stat event (from the page fault) is for that same mm, the condition current->mm == mm becomes true, causing curr to be incorrectly set to true when the trace event is emitted. This is misleading because it suggests the mm belongs to the kthread, confusing userspace tools that track per-process RSS changes and corrupting their mm_id-to-process association. Fix this by ensuring curr is always false when the trace event is emitted from a kthread context by checking for the PF_KTHREAD flag. Link: https://lkml.kernel.org/r/20260219233708.1971199-1-kaleshsingh@google.com Link: https://perfetto.dev/ [1] Fixes: e4dcad204d3a ("rss_stat: add support to detect RSS updates of external mm") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: "David Hildenbrand (Arm)" <david@kernel.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursUSB: core: Limit the length of unkillable synchronous timeoutsAlan Stern1-0/+3
commit 1015c27a5e1a63efae2b18a9901494474b4d1dc3 upstream. The usb_control_msg(), usb_bulk_msg(), and usb_interrupt_msg() APIs in usbcore allow unlimited timeout durations. And since they use uninterruptible waits, this leaves open the possibility of hanging a task for an indefinitely long time, with no way to kill it short of unplugging the target device. To prevent this sort of problem, enforce a maximum limit on the length of these unkillable timeouts. The limit chosen here, somewhat arbitrarily, is 60 seconds. On many systems (although not all) this is short enough to avoid triggering the kernel's hung-task detector. In addition, clear up the ambiguity of negative timeout values by treating them the same as 0, i.e., using the maximum allowed timeout. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Link: https://lore.kernel.org/linux-usb/3acfe838-6334-4f6d-be7c-4bb01704b33d@rowland.harvard.edu/ Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") CC: stable@vger.kernel.org Link: https://patch.msgid.link/15fc9773-a007-47b0-a703-df89a8cf83dd@rowland.harvard.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursUSB: usbcore: Introduce usb_bulk_msg_killable()Alan Stern1-2/+3
commit 416909962e7cdf29fd01ac523c953f37708df93d upstream. The synchronous message API in usbcore (usb_control_msg(), usb_bulk_msg(), and so on) uses uninterruptible waits. However, drivers may call these routines in the context of a user thread, which means it ought to be possible to at least kill them. For this reason, introduce a new usb_bulk_msg_killable() function which behaves the same as usb_bulk_msg() except for using wait_for_completion_killable_timeout() instead of wait_for_completion_timeout(). The same can be done later for usb_control_msg() later on, if it turns out to be needed. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Suggested-by: Oliver Neukum <oneukum@suse.com> Link: https://lore.kernel.org/linux-usb/3acfe838-6334-4f6d-be7c-4bb01704b33d@rowland.harvard.edu/ Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") CC: stable@vger.kernel.org Link: https://patch.msgid.link/248628b4-cc83-4e81-a620-3ce4e0376d41@rowland.harvard.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursnet/mlx5: IFC updates for disabled host PFDaniel Jurgens1-1/+3
[ Upstream commit cd1746cb6555a2238c4aae9f9d60b637a61bf177 ] The port 2 host PF can be disabled, this bit reflects that setting. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752064867-16874-3-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Stable-dep-of: aed763abf0e9 ("net/mlx5: Fix deadlock between devlink lock and esw->wq") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursnet/sched: Only allow act_ct to bind to clsact/ingress qdiscs and shared blocksVictor Nogueira1-0/+1
commit 11cb63b0d1a0685e0831ae3c77223e002ef18189 upstream. As Paolo said earlier [1]: "Since the blamed commit below, classify can return TC_ACT_CONSUMED while the current skb being held by the defragmentation engine. As reported by GangMin Kim, if such packet is that may cause a UaF when the defrag engine later on tries to tuch again such packet." act_ct was never meant to be used in the egress path, however some users are attaching it to egress today [2]. Attempting to reach a middle ground, we noticed that, while most qdiscs are not handling TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we address the issue by only allowing act_ct to bind to clsact/ingress qdiscs and shared blocks. That way it's still possible to attach act_ct to egress (albeit only with clsact). [1] https://lore.kernel.org/netdev/674b8cbfc385c6f37fb29a1de08d8fe5c2b0fbee.1771321118.git.pabeni@redhat.com/ [2] https://lore.kernel.org/netdev/cc6bfb4a-4a2b-42d8-b9ce-7ef6644fb22b@ovn.org/ Reported-by: GangMin Kim <km.kim1503@gmail.com> Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags") CC: stable@vger.kernel.org Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260225134349.1287037-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursxsk: introduce helper to determine rxq->frag_sizeLarysa Zaremba1-0/+10
[ Upstream commit 16394d80539937d348dd3b9ea32415c54e67a81b ] rxq->frag_size is basically a step between consecutive strictly aligned frames. In ZC mode, chunk size fits exactly, but if chunks are unaligned, there is no safe way to determine accessible space to grow tailroom. Report frag_size to be zero, if chunks are unaligned, chunk_size otherwise. Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-3-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursnet/sched: act_ife: Fix metalist update behaviorJamal Hadi Salim1-3/+1
[ Upstream commit e2cedd400c3ec0302ffca2490e8751772906ac23 ] Whenever an ife action replace changes the metalist, instead of replacing the old data on the metalist, the current ife code is appending the new metadata. Aside from being innapropriate behavior, this may lead to an unbounded addition of metadata to the metalist which might cause an out of bounds error when running the encode op: [ 138.423369][ C1] ================================================================== [ 138.424317][ C1] BUG: KASAN: slab-out-of-bounds in ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.424906][ C1] Write of size 4 at addr ffff8880077f4ffe by task ife_out_out_bou/255 [ 138.425778][ C1] CPU: 1 UID: 0 PID: 255 Comm: ife_out_out_bou Not tainted 7.0.0-rc1-00169-gfbdfa8da05b6 #624 PREEMPT(full) [ 138.425795][ C1] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 138.425800][ C1] Call Trace: [ 138.425804][ C1] <IRQ> [ 138.425808][ C1] dump_stack_lvl (lib/dump_stack.c:122) [ 138.425828][ C1] print_report (mm/kasan/report.c:379 mm/kasan/report.c:482) [ 138.425839][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425844][ C1] ? __virt_addr_valid (./arch/x86/include/asm/preempt.h:95 (discriminator 1) ./include/linux/rcupdate.h:975 (discriminator 1) ./include/linux/mmzone.h:2207 (discriminator 1) arch/x86/mm/physaddr.c:54 (discriminator 1)) [ 138.425853][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425859][ C1] kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:597) [ 138.425868][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425878][ C1] kasan_check_range (mm/kasan/generic.c:186 (discriminator 1) mm/kasan/generic.c:200 (discriminator 1)) [ 138.425884][ C1] __asan_memset (mm/kasan/shadow.c:84 (discriminator 2)) [ 138.425889][ C1] ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425893][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:171) [ 138.425898][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425903][ C1] ife_encode_meta_u16 (net/sched/act_ife.c:57) [ 138.425910][ C1] ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114) [ 138.425916][ C1] ? __asan_memcpy (mm/kasan/shadow.c:105 (discriminator 3)) [ 138.425921][ C1] ? __pfx_ife_encode_meta_u16 (net/sched/act_ife.c:45) [ 138.425927][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425931][ C1] tcf_ife_act (net/sched/act_ife.c:847 net/sched/act_ife.c:879) To solve this issue, fix the replace behavior by adding the metalist to the ife rcu data structure. Fixes: aa9fd9a325d51 ("sched: act: ife: update parameters via rcu handling") Reported-by: Ruitong Liu <cnitlrt@gmail.com> Tested-by: Ruitong Liu <cnitlrt@gmail.com> Co-developed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260304140603.76500-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursnet: sched: avoid qdisc_reset_all_tx_gt() vs dequeue race for lockless qdiscsKoichiro Den1-0/+10
[ Upstream commit 7f083faf59d14c04e01ec05a7507f036c965acf8 ] When shrinking the number of real tx queues, netif_set_real_num_tx_queues() calls qdisc_reset_all_tx_gt() to flush qdiscs for queues which will no longer be used. qdisc_reset_all_tx_gt() currently serializes qdisc_reset() with qdisc_lock(). However, for lockless qdiscs, the dequeue path is serialized by qdisc_run_begin/end() using qdisc->seqlock instead, so qdisc_reset() can run concurrently with __qdisc_run() and free skbs while they are still being dequeued, leading to UAF. This can easily be reproduced on e.g. virtio-net by imposing heavy traffic while frequently changing the number of queue pairs: iperf3 -ub0 -c $peer -t 0 & while :; do ethtool -L eth0 combined 1 ethtool -L eth0 combined 2 done With KASAN enabled, this leads to reports like: BUG: KASAN: slab-use-after-free in __qdisc_run+0x133f/0x1760 ... Call Trace: <TASK> ... __qdisc_run+0x133f/0x1760 __dev_queue_xmit+0x248f/0x3550 ip_finish_output2+0xa42/0x2110 ip_output+0x1a7/0x410 ip_send_skb+0x2e6/0x480 udp_send_skb+0xb0a/0x1590 udp_sendmsg+0x13c9/0x1fc0 ... </TASK> Allocated by task 1270 on cpu 5 at 44.558414s: ... alloc_skb_with_frags+0x84/0x7c0 sock_alloc_send_pskb+0x69a/0x830 __ip_append_data+0x1b86/0x48c0 ip_make_skb+0x1e8/0x2b0 udp_sendmsg+0x13a6/0x1fc0 ... Freed by task 1306 on cpu 3 at 44.558445s: ... kmem_cache_free+0x117/0x5e0 pfifo_fast_reset+0x14d/0x580 qdisc_reset+0x9e/0x5f0 netif_set_real_num_tx_queues+0x303/0x840 virtnet_set_channels+0x1bf/0x260 [virtio_net] ethnl_set_channels+0x684/0xae0 ethnl_default_set_doit+0x31a/0x890 ... Serialize qdisc_reset_all_tx_gt() against the lockless dequeue path by taking qdisc->seqlock for TCQ_F_NOLOCK qdiscs, matching the serialization model already used by dev_reset_queue(). Additionally clear QDISC_STATE_NON_EMPTY after reset so the qdisc state reflects an empty queue, avoiding needless re-scheduling. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Signed-off-by: Koichiro Den <den@valinux.co.jp> Link: https://patch.msgid.link/20260228145307.3955532-1-den@valinux.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursbpf: export bpf_link_inc_not_zero.Kui-Feng Lee1-0/+6
[ Upstream commit 67c3e8353f45c27800eecc46e00e8272f063f7d1 ] bpf_link_inc_not_zero() will be used by kernel modules. We will use it in bpf_testmod.c later. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240530065946.979330-5-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Stable-dep-of: 56145d237385 ("bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursindirect_call_wrapper: do not reevaluate function pointerEric Dumazet1-7/+11
[ Upstream commit 710f5c76580306cdb9ec51fac8fcf6a8faff7821 ] We have an increasing number of READ_ONCE(xxx->function) combined with INDIRECT_CALL_[1234]() helpers. Unfortunately this forces INDIRECT_CALL_[1234]() to read xxx->function many times, which is not what we wanted. Fix these macros so that xxx->function value is not reloaded. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux add/remove: 0/0 grow/shrink: 1/65 up/down: 122/-1084 (-962) Function old new delta ip_push_pending_frames 59 181 +122 ip6_finish_output 687 681 -6 __udp_enqueue_schedule_skb 1078 1072 -6 ioam6_output 2319 2312 -7 xfrm4_rcv_encap_finish2 64 56 -8 xfrm4_output 297 289 -8 vrf_ip_local_out 278 270 -8 vrf_ip6_local_out 278 270 -8 seg6_input_finish 64 56 -8 rpl_output 700 692 -8 ipmr_forward_finish 124 116 -8 ip_forward_finish 143 135 -8 ip6mr_forward2_finish 100 92 -8 ip6_forward_finish 73 65 -8 input_action_end_bpf 1091 1083 -8 dst_input 52 44 -8 __xfrm6_output 801 793 -8 __xfrm4_output 83 75 -8 bpf_input 500 491 -9 __tcp_check_space 530 521 -9 input_action_end_dt6 291 280 -11 vti6_tnl_xmit 1634 1622 -12 bpf_xmit 1203 1191 -12 rpl_input 497 483 -14 rawv6_send_hdrinc 1355 1341 -14 ndisc_send_skb 1030 1016 -14 ipv6_srh_rcv 1377 1363 -14 ip_send_unicast_reply 1253 1239 -14 ip_rcv_finish 226 212 -14 ip6_rcv_finish 300 286 -14 input_action_end_x_core 205 191 -14 input_action_end_x 355 341 -14 input_action_end_t 205 191 -14 input_action_end_dx6_finish 127 113 -14 input_action_end_dx4_finish 373 359 -14 input_action_end_dt4 426 412 -14 input_action_end_core 186 172 -14 input_action_end_b6_encap 292 278 -14 input_action_end_b6 198 184 -14 igmp6_send 1332 1318 -14 ip_sublist_rcv 864 848 -16 ip6_sublist_rcv 1091 1075 -16 ipv6_rpl_srh_rcv 1937 1920 -17 xfrm_policy_queue_process 1246 1228 -18 seg6_output_core 903 885 -18 mld_sendpack 856 836 -20 NF_HOOK 756 736 -20 vti_tunnel_xmit 1447 1426 -21 input_action_end_dx6 664 642 -22 input_action_end 1502 1480 -22 sock_sendmsg_nosec 134 111 -23 ip6mr_forward2 388 364 -24 sock_recvmsg_nosec 134 109 -25 seg6_input_core 836 810 -26 ip_send_skb 172 146 -26 ip_local_out 140 114 -26 ip6_local_out 140 114 -26 __sock_sendmsg 162 136 -26 __ip_queue_xmit 1196 1170 -26 __ip_finish_output 405 379 -26 ipmr_queue_fwd_xmit 373 346 -27 sock_recvmsg 173 145 -28 ip6_xmit 1635 1607 -28 xfrm_output_resume 1418 1389 -29 ip_build_and_send_pkt 625 591 -34 dst_output 504 432 -72 Total: Before=25217686, After=25216724, chg -0.00% Fixes: 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260227172603.1700433-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursbpf/bonding: reject vlan+srcmac xmit_hash_policy change when XDP is loadedJiayuan Chen1-0/+1
[ Upstream commit 479d589b40b836442bbdadc3fdb37f001bb67f26 ] bond_option_mode_set() already rejects mode changes that would make a loaded XDP program incompatible via bond_xdp_check(). However, bond_option_xmit_hash_policy_set() has no such guard. For 802.3ad and balance-xor modes, bond_xdp_check() returns false when xmit_hash_policy is vlan+srcmac, because the 802.1q payload is usually absent due to hardware offload. This means a user can: 1. Attach a native XDP program to a bond in 802.3ad/balance-xor mode with a compatible xmit_hash_policy (e.g. layer2+3). 2. Change xmit_hash_policy to vlan+srcmac while XDP remains loaded. This leaves bond->xdp_prog set but bond_xdp_check() now returning false for the same device. When the bond is later destroyed, dev_xdp_uninstall() calls bond_xdp_set(dev, NULL, NULL) to remove the program, which hits the bond_xdp_check() guard and returns -EOPNOTSUPP, triggering: WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL)) Fix this by rejecting xmit_hash_policy changes to vlan+srcmac when an XDP program is loaded on a bond in 802.3ad or balance-xor mode. commit 39a0876d595b ("net, bonding: Disallow vlan+srcmac with XDP") introduced bond_xdp_check() which returns false for 802.3ad/balance-xor modes when xmit_hash_policy is vlan+srcmac. The check was wired into bond_xdp_set() to reject XDP attachment with an incompatible policy, but the symmetric path -- preventing xmit_hash_policy from being changed to an incompatible value after XDP is already loaded -- was left unguarded in bond_option_xmit_hash_policy_set(). Note: commit 094ee6017ea0 ("bonding: check xdp prog when set bond mode") later added a similar guard to bond_option_mode_set(), but bond_option_xmit_hash_policy_set() remained unprotected. Reported-by: syzbot+5a287bcdc08104bc3132@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6995aff6.050a0220.2eeac1.014e.GAE@google.com/T/ Fixes: 39a0876d595b ("net, bonding: Disallow vlan+srcmac with XDP") Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Link: https://patch.msgid.link/20260226080306.98766-2-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursxsk: Fix fragment node deletion to prevent buffer leakNikhil P. Rao1-3/+3
[ Upstream commit 60abb0ac11dccd6b98fd9182bc5f85b621688861 ] After commit b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"), the list_node field is reused for both the xskb pool list and the buffer free list, this causes a buffer leak as described below. xp_free() checks if a buffer is already on the free list using list_empty(&xskb->list_node). When list_del() is used to remove a node from the xskb pool list, it doesn't reinitialize the node pointers. This means list_empty() will return false even after the node has been removed, causing xp_free() to incorrectly skip adding the buffer to the free list. Fix this by using list_del_init() instead of list_del() in all fragment handling paths, this ensures the list node is reinitialized after removal, allowing the list_empty() to work correctly. Fixes: b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com> Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: f7387d6579d6 ("xsk: Fix zero-copy AF_XDP fragment drop") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursxsk: s/free_list_node/list_node/Maciej Fijalkowski2-8/+8
[ Upstream commit 30ec2c1baaead43903ad63ff8e3083949059083c ] Now that free_list_node's purpose is two-folded, make it just a 'list_node'. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-3-maciej.fijalkowski@intel.com Stable-dep-of: f7387d6579d6 ("xsk: Fix zero-copy AF_XDP fragment drop") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursxsk: Get rid of xdp_buff_xsk::xskb_list_nodeMaciej Fijalkowski2-8/+7
[ Upstream commit b692bf9a7543af7ad11a59d182a3757578f0ba53 ] Let's bring xdp_buff_xsk back to occupying 2 cachelines by removing xskb_list_node - for the purpose of gathering the xskb frags free_list_node can be used, head of the list (xsk_buff_pool::xskb_list) stays as-is, just reuse the node ptr. It is safe to do as a single xdp_buff_xsk can never reside in two pool's lists simultaneously. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20241007122458.282590-2-maciej.fijalkowski@intel.com Stable-dep-of: f7387d6579d6 ("xsk: Fix zero-copy AF_XDP fragment drop") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursmailbox: Allow controller specific mapping using fwnodeAnup Patel1-0/+3
[ Upstream commit ba879dfc0574878f3e08f217b2b4fdf845c426c0 ] Introduce optional fw_node() callback which allows a mailbox controller driver to provide controller specific mapping using fwnode. The Linux OF framework already implements fwnode operations for the Linux DD framework so the fw_xlate() callback works fine with device tree as well. Acked-by: Jassi Brar <jassisinghbrar@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20250818040920.272664-6-apatel@ventanamicro.com Signed-off-by: Paul Walmsley <pjw@kernel.org> Stable-dep-of: fcd7f96c7836 ("mailbox: Prevent out-of-bounds access in fw_mbox_index_xlate()") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursmailbox: sort headers alphabeticallyTudor Ambarus2-4/+4
[ Upstream commit db824c1119fc16556a84cb7a771ca6553b3c3a45 ] Sorting headers alphabetically helps locating duplicates, and makes it easier to figure out where to insert new headers. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> Stable-dep-of: fcd7f96c7836 ("mailbox: Prevent out-of-bounds access in fw_mbox_index_xlate()") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursresource: Add resource set range and size helpersIlpo Järvinen1-0/+32
[ Upstream commit 9fb6fef0fb49124291837af1da5028f79d53f98e ] Setting the end address for a resource with a given size lacks a helper and is therefore coded manually unlike the getter side which has a helper for resource size calculation. Also, almost all callsites that calculate the end address for a resource also set the start address right before it like this: res->start = start_addr; res->end = res->start + size - 1; Add resource_set_range(res, start_addr, size) that sets the start address and calculates the end address to simplify this often repeated fragment. Also add resource_set_size() for the cases where setting the start address of the resource is not necessary but mention in its kerneldoc that resource_set_range() is preferred when setting both addresses. Link: https://lore.kernel.org/r/20240614100606.15830-2-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Stable-dep-of: 11721c45a826 ("PCI: Use resource_set_range() that correctly sets ->end") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursPCI: Correct PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 valueBjorn Helgaas1-1/+1
[ Upstream commit 39195990e4c093c9eecf88f29811c6de29265214 ] fb82437fdd8c ("PCI: Change capability register offsets to hex") incorrectly converted the PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value from decimal 52 to hex 0x32: -#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 52 /* v2 endpoints with link end here */ +#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 0x32 /* end of v2 EPs w/ link */ This broke PCI capabilities in a VMM because subsequent ones weren't DWORD-aligned. Change PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 to the correct value of 0x34. fb82437fdd8c was from Baruch Siach <baruch@tkos.co.il>, but this was not Baruch's fault; it's a mistake I made when applying the patch. Fixes: fb82437fdd8c ("PCI: Change capability register offsets to hex") Reported-by: David Woodhouse <dwmw2@infradead.org> Closes: https://lore.kernel.org/all/3ae392a0158e9d9ab09a1d42150429dd8ca42791.camel@infradead.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursbtrfs: read raid stripe tree from diskJohannes Thumshirn1-0/+1
[ Upstream commit 515020900d447796bc2f0f57064663617a11b65d ] If we find the raid-stripe-tree on mount, read it from disk. This is a backward incompatible feature. The rescue=ignorebadroots mount option will skip this tree. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 511dc8912ae3 ("btrfs: fix incorrect key offset in error message in check_dev_extent_item()") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursbtrfs: add raid stripe tree definitionsJohannes Thumshirn1-0/+29
[ Upstream commit ee1293308e01d359688243d665138f35a6f1f9b8 ] Add definitions for the raid stripe tree. This tree will hold information about the on-disk layout of the stripes in a RAID set. Each stripe extent has a 1:1 relationship with an on-disk extent item and is doing the logical to per-drive physical address translation for the extent item in question. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 511dc8912ae3 ("btrfs: fix incorrect key offset in error message in check_dev_extent_item()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04tracing: Wake up poll waiters for hist files when removing an eventPetr Pavlu1-0/+5
[ Upstream commit 9678e53179aa7e907360f5b5b275769008a69b80 ] The event_hist_poll() function attempts to verify whether an event file is being removed, but this check may not occur or could be unnecessarily delayed. This happens because hist_poll_wakeup() is currently invoked only from event_hist_trigger() when a hist command is triggered. If the event file is being removed, no associated hist command will be triggered and a waiter will be woken up only after an unrelated hist command is triggered. Fix the issue by adding a call to hist_poll_wakeup() in remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This ensures that a task polling on a hist file is woken up and receives EPOLLERR. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data()Qanux1-0/+2
[ Upstream commit 6db8b56eed62baacaf37486e83378a72635c04cc ] On the receive path, __ioam6_fill_trace_data() uses trace->nodelen to decide how much data to write for each node. It trusts this field as-is from the incoming packet, with no consistency check against trace->type (the 24-bit field that tells which data items are present). A crafted packet can set nodelen=0 while setting type bits 0-21, causing the function to write ~100 bytes past the allocated region (into skb_shared_info), which corrupts adjacent heap memory and leads to a kernel panic. Add a shared helper ioam6_trace_compute_nodelen() in ioam6.c to derive the expected nodelen from the type field, and use it: - in ioam6_iptunnel.c (send path, existing validation) to replace the open-coded computation; - in exthdrs.c (receive path, ipv6_hop_ioam) to drop packets whose nodelen is inconsistent with the type field, before any data is written. Per RFC 9197, bits 12-21 are each short (4-octet) fields, so they are included in IOAM6_MASK_SHORT_FIELDS (changed from 0xff100000 to 0xff1ffc00). Fixes: 9ee11f0fff20 ("ipv6: ioam: Data plane support for Pre-allocated Trace") Cc: stable@vger.kernel.org Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260211040412.86195-1-qjx1298677004@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04drm: of: drm_of_panel_bridge_remove(): fix device_node leakLuca Ceresoli1-0/+3
[ Upstream commit a4b4385d0523e39a7c058cb5a6c8269e513126ca ] drm_of_panel_bridge_remove() uses of_graph_get_remote_node() to get a device_node but does not put the node reference. Fixes: c70087e8f16f ("drm/drm_of: add drm_of_panel_bridge_remove function") Cc: stable@vger.kernel.org # v4.15 Acked-by: Maxime Ripard <mripard@kernel.org> Link: https://patch.msgid.link/20260109-drm-bridge-alloc-getput-drm_of_find_bridge-2-v2-1-8bad3ef90b9f@bootlin.com Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQLuiz Augusto von Dentz1-0/+2
[ Upstream commit 7accb1c4321acb617faf934af59d928b0b047e2b ] This fixes responding with an invalid result caused by checking the wrong size of CID which should have been (cmd_len - sizeof(*req)) and on top of it the wrong result was use L2CAP_CR_LE_INVALID_PARAMS which is invalid/reserved for reconf when running test like L2CAP/ECFC/BI-03-C: > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 64 MPS: 64 Source CID: 64 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reserved (0x000c) Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003) Fiix L2CAP/ECFC/BI-04-C which expects L2CAP_RECONF_INVALID_MPS (0x0002) when more than one channel gets its MPS reduced: > ACL Data RX: Handle 64 flags 0x02 dlen 16 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 8 MTU: 264 MPS: 99 Source CID: 64 ! Source CID: 65 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Fix L2CAP/ECFC/BI-05-C when SCID is invalid (85 unconnected): > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 65 MPS: 64 ! Source CID: 85 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003) Fix L2CAP/ECFC/BI-06-C when MPS < L2CAP_ECRED_MIN_MPS (64): > ACL Data RX: Handle 64 flags 0x02 dlen 14 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6 MTU: 672 ! MPS: 63 Source CID: 64 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Result: Reconfiguration failed - other unacceptable parameters (0x0004) Fix L2CAP/ECFC/BI-07-C when MPS reduced for more than one channel: > ACL Data RX: Handle 64 flags 0x02 dlen 16 LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 3 len 8 MTU: 84 ! MPS: 71 Source CID: 64 ! Source CID: 65 < ACL Data TX: Handle 64 flags 0x00 dlen 10 LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2 ! Result: Reconfiguration successful (0x0000) Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002) Link: https://github.com/bluez/bluez/issues/1865 Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04include: uapi: netfilter_bridge.h: Cover for musl libcPhil Sutter1-0/+4
[ Upstream commit 4edd4ba71ce0df015303dba75ea9d20d1a217546 ] Musl defines its own struct ethhdr and thus defines __UAPI_DEF_ETHHDR to zero. To avoid struct redefinition errors, user space is therefore supposed to include netinet/if_ether.h before (or instead of) linux/if_ether.h. To relieve them from this burden, include the libc header here if not building for kernel space. Reported-by: Alyssa Ross <hi@alyssa.is> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ipv4: igmp: annotate data-races around idev->mr_maxdelayEric Dumazet1-1/+1
[ Upstream commit e4faaf65a75f650ac4366ddff5dabb826029ca5a ] idev->mr_maxdelay is read and written locklessly, add READ_ONCE()/WRITE_ONCE() annotations. While we are at it, make this field an u32. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260122172247.2429403-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ipv6: annotate data-races in ip6_multipath_hash_{policy,fields}()Eric Dumazet1-2/+2
[ Upstream commit 03e9d91dd64e2f5ea632df5d59568d91757efc4d ] Add missing READ_ONCE() when reading sysctl values. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260115094141.3124990-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04virt: vbox: uapi: Mark inner unions in packed structs as packedThomas Weißschuh1-2/+2
[ Upstream commit c25d01e1c4f2d43f47af87c00e223f5ca7c71792 ] The unpacked unions within a packed struct generates alignment warnings on clang for 32-bit ARM: ./usr/include/linux/vbox_vmmdev_types.h:239:4: error: field u within 'struct vmmdev_hgcm_function_parameter32' is less aligned than 'union (unnamed union at ./usr/include/linux/vbox_vmmdev_types.h:223:2)' and is usually due to 'struct vmmdev_hgcm_function_parameter32' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access] 239 | } u; | ^ ./usr/include/linux/vbox_vmmdev_types.h:254:6: error: field u within 'struct vmmdev_hgcm_function_parameter64::(anonymous union)::(unnamed at ./usr/include/linux/vbox_vmmdev_types.h:249:3)' is less aligned than 'union (unnamed union at ./usr/include/linux/vbox_vmmdev_types.h:251:4)' and is usually due to 'struct vmmdev_hgcm_function_parameter64::(anonymous union)::(unnamed at ./usr/include/linux/vbox_vmmdev_types.h:249:3)' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access] With the recent changes to compile-test the UAPI headers in more cases, these warning in combination with CONFIG_WERROR breaks the build. Fix the warnings. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512140314.DzDxpIVn-lkp@intel.com/ Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/linux-kbuild/20260110-uapi-test-disable-headers-arm-clang-unaligned-access-v1-1-b7b0fa541daa@kernel.org/ Suggested-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/linux-kbuild/29b2e736-d462-45b7-a0a9-85f8d8a3de56@app.fastmail.com/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Tested-by: Nicolas Schier <nsc@kernel.org> Reviewed-by: Nicolas Schier <nsc@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/20260115-kbuild-alignment-vbox-v1-2-076aed1623ff@linutronix.de Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04hyper-v: Mark inner union in hv_kvp_exchg_msg_value as packedThomas Weißschuh1-1/+1
[ Upstream commit 1e5271393d777f6159d896943b4c44c4f3ecff52 ] The unpacked union within a packed struct generates alignment warnings on clang for 32-bit ARM: ./usr/include/linux/hyperv.h:361:2: error: field within 'struct hv_kvp_exchg_msg_value' is less aligned than 'union hv_kvp_exchg_msg_value::(anonymous at ./usr/include/linux/hyperv.h:361:2)' and is usually due to 'struct hv_kvp_exchg_msg_value' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access] 361 | union { | ^ With the recent changes to compile-test the UAPI headers in more cases, this warning in combination with CONFIG_WERROR breaks the build. Fix the warning. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512140314.DzDxpIVn-lkp@intel.com/ Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/linux-kbuild/20260110-uapi-test-disable-headers-arm-clang-unaligned-access-v1-1-b7b0fa541daa@kernel.org/ Suggested-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/linux-kbuild/29b2e736-d462-45b7-a0a9-85f8d8a3de56@app.fastmail.com/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Wei Liu (Microsoft) <wei.liu@kernel.org> Tested-by: Nicolas Schier <nsc@kernel.org> Reviewed-by: Nicolas Schier <nsc@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/20260115-kbuild-alignment-vbox-v1-1-076aed1623ff@linutronix.de Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04media: dvb-core: dmxdevfilter must always flush bufsHans Verkuil1-2/+4
[ Upstream commit c4e620eccbef76aa5564ebb295e23d6540e27215 ] Currently the buffers are being filled until full, which works fine for the transport stream, but not when reading sections, those have to be returned to userspace immediately, otherwise dvbv5-scan will just wait forever. Add a 'flush' argument to dvb_vb2_fill_buffer to indicate whether the buffer must be flushed or wait until it is full. Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04EFI/CPER: don't go past the ARM processor CPER record bufferMauro Carvalho Chehab1-1/+2
[ Upstream commit eae21beecb95a3b69ee5c38a659f774e171d730e ] There's a logic inside GHES/CPER to detect if the section_length is too small, but it doesn't detect if it is too big. Currently, if the firmware receives an ARM processor CPER record stating that a section length is big, kernel will blindly trust section_length, producing a very long dump. For instance, a 67 bytes record with ERR_INFO_NUM set 46198 and section length set to 854918320 would dump a lot of data going a way past the firmware memory-mapped area. Fix it by adding a logic to prevent it to go past the buffer if ERR_INFO_NUM is too big, making it report instead: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1 [Hardware Error]: event severity: recoverable [Hardware Error]: Error 0, type: recoverable [Hardware Error]: section_type: ARM processor error [Hardware Error]: MIDR: 0xff304b2f8476870a [Hardware Error]: section length: 854918320, CPER size: 67 [Hardware Error]: section length is too big [Hardware Error]: firmware-generated error record is incorrect [Hardware Error]: ERR_INFO_NUM is 46198 Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> [ rjw: Subject and changelog tweaks ] Link: https://patch.msgid.link/41cd9f6b3ace3cdff7a5e864890849e4b1c58b63.1767871950.git.mchehab+huawei@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04APEI/GHES: ensure that won't go past CPER allocated recordMauro Carvalho Chehab1-0/+1
[ Upstream commit fa2408a24f8f0db14d9cfc613ef162dc267d7ad4 ] The logic at ghes_new() prevents allocating too large records, by checking if they're bigger than GHES_ESTATUS_MAX_SIZE (currently, 64KB). Yet, the allocation is done with the actual number of pages from the CPER bios table location, which can be smaller. Yet, a bad firmware could send data with a different size, which might be bigger than the allocated memory, causing an OOPS: Unable to handle kernel paging request at virtual address fff00000f9b40000 Mem abort info: ESR = 0x0000000096000007 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x07: level 3 translation fault Data abort info: ISV = 0, ISS = 0x00000007, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 swapper pgtable: 4k pages, 52-bit VAs, pgdp=000000008ba16000 [fff00000f9b40000] pgd=180000013ffff403, p4d=180000013fffe403, pud=180000013f85b403, pmd=180000013f68d403, pte=0000000000000000 Internal error: Oops: 0000000096000007 [#1] SMP Modules linked in: CPU: 0 UID: 0 PID: 303 Comm: kworker/0:1 Not tainted 6.19.0-rc1-00002-gda407d200220 #34 PREEMPT Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 02/02/2022 Workqueue: kacpi_notify acpi_os_execute_deferred pstate: 214020c5 (nzCv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : hex_dump_to_buffer+0x30c/0x4a0 lr : hex_dump_to_buffer+0x328/0x4a0 sp : ffff800080e13880 x29: ffff800080e13880 x28: ffffac9aba86f6a8 x27: 0000000000000083 x26: fff00000f9b3fffc x25: 0000000000000004 x24: 0000000000000004 x23: ffff800080e13905 x22: 0000000000000010 x21: 0000000000000083 x20: 0000000000000001 x19: 0000000000000008 x18: 0000000000000010 x17: 0000000000000001 x16: 00000007c7f20fec x15: 0000000000000020 x14: 0000000000000008 x13: 0000000000081020 x12: 0000000000000008 x11: ffff800080e13905 x10: ffff800080e13988 x9 : 0000000000000000 x8 : 0000000000000000 x7 : 0000000000000001 x6 : 0000000000000020 x5 : 0000000000000030 x4 : 00000000fffffffe x3 : 0000000000000000 x2 : ffffac9aba78c1c8 x1 : ffffac9aba76d0a8 x0 : 0000000000000008 Call trace: hex_dump_to_buffer+0x30c/0x4a0 (P) print_hex_dump+0xac/0x170 cper_estatus_print_section+0x90c/0x968 cper_estatus_print+0xf0/0x158 __ghes_print_estatus+0xa0/0x148 ghes_proc+0x1bc/0x220 ghes_notify_hed+0x5c/0xb8 notifier_call_chain+0x78/0x148 blocking_notifier_call_chain+0x4c/0x80 acpi_hed_notify+0x28/0x40 acpi_ev_notify_dispatch+0x50/0x80 acpi_os_execute_deferred+0x24/0x48 process_one_work+0x15c/0x3b0 worker_thread+0x2d0/0x400 kthread+0x148/0x228 ret_from_fork+0x10/0x20 Code: 6b14033f 540001ad a94707e2 f100029f (b8747b44) ---[ end trace 0000000000000000 ]--- Prevent that by taking the actual allocated are into account when checking for CPER length. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> [ rjw: Subject tweaks ] Link: https://patch.msgid.link/4e70310a816577fabf37d94ed36cde4ad62b1e0a.1767871950.git.mchehab+huawei@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04audit: add missing syscalls to read classJeffrey Bencteux1-0/+6
[ Upstream commit bcb90a2834c7393c26df9609b889a3097b7700cd ] The "at" variant of getxattr() and listxattr() are missing from the audit read class. Calling getxattrat() or listxattrat() on a file to read its extended attributes will bypass audit rules such as: -w /tmp/test -p rwa -k test_rwa The current patch adds missing syscalls to the audit read class. Signed-off-by: Jeffrey Bencteux <jeff@bencteux.fr> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04audit: add fchmodat2() to change attributes classJeffrey Bencteux1-0/+3
[ Upstream commit 4f493a6079b588cf1f04ce5ed6cdad45ab0d53dc ] fchmodat2(), introduced in version 6.6 is currently not in the change attribute class of audit. Calling fchmodat2() to change a file attribute in the same fashion than chmod() or fchmodat() will bypass audit rules such as: -w /tmp/test -p rwa -k test_rwa The current patch adds fchmodat2() to the change attributes class. Signed-off-by: Jeffrey Bencteux <jeff@bencteux.fr> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04net/mlx5: Fix multiport device check over light SFsShay Drory1-2/+2
[ Upstream commit 47bf2e813817159f4d195be83a9b5a640ee6baec ] Driver is using num_vhca_ports capability to distinguish between multiport master device and multiport slave device. num_vhca_ports is a capability the driver sets according to the MAX num_vhca_ports capability reported by FW. On the other hand, light SFs doesn't set the above capbility. This leads to wrong results whenever light SFs is checking whether he is a multiport master or slave. Therefore, use the MAX capability to distinguish between master and slave devices. Fixes: e71383fb9cd1 ("net/mlx5: Light probe local SFs") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <Jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260218072904.1764634-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>