summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
28 hoursmailbox: Fix NULL message support in mbox_send_message()Jassi Brar1-0/+3
commit c58e9456e30c7098cbcd9f04571992be8a2e4e63 upstream. The active_req field serves double duty as both the "is a TX in flight" flag (NULL means idle) and the storage for the in-flight message pointer. When a client sends NULL via mbox_send_message(), active_req is set to NULL, which the framework misinterprets as "no active request". This breaks the TX state machine by: - tx_tick() short-circuits on (!mssg), skipping the tx_done callback and the tx_complete completion - txdone_hrtimer() skips the channel entirely since active_req is NULL, so poll-based TX-done detection never fires. Fix this by introducing a MBOX_NO_MSG sentinel value that means "no active request," freeing NULL to be valid message data. The sentinel is defined in the subsystem-internal mailbox.h so that controller drivers within drivers/mailbox/ can reference it, but it is not exposed to clients outside the subsystem. Fifteen in-tree callers send NULL (doorbell-style IPCs on Qualcomm, Tegra, TI, Xilinx, i.MX, SCMI, and PCC platforms). All were audited for regression: - Most already work around the bug via knows_txdone=true with a manual mbox_client_txdone() call, making the framework's tracking irrelevant. These are unaffected. - Poll-based callers (Xilinx zynqmp/r5) are strictly better off: the poll timer now correctly detects NULL-active channels instead of silently skipping them. - irq-qcom-mpm.c was a pre-existing bug -- the only Qualcomm caller that omitted the knows_txdone + mbox_client_txdone() pattern. Fixed in a companion commit ("irqchip/qcom-mpm: Fix missing mailbox TX done acknowledgment"). - No caller sets both a tx_done callback and sends NULL, nor combines tx_block=true with NULL sends, so the newly reachable callback/completion paths are never exercised. Also update tegra-hsp's flush callback, which directly inspects active_req to wait for the channel to drain: the old "!= NULL" check becomes "!= MBOX_NO_MSG", otherwise flush spins until timeout since the sentinel is non-NULL. The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message by clients. Reported-by: Joonwon Kang <joonwonkang@google.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> Signed-off-by: Joonwon Kang <joonwonkang@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursplatform/x86/intel/vsec: Make driver_data info constDavid E. Box1-2/+2
[ Upstream commit 9577c74c96f88d807d1ba005adbf5952e7127e55 ] Treat PCI id->driver_data (intel_vsec_platform_info) as read-only by making vsec_priv->info a const pointer and updating all function signatures to accept const intel_vsec_platform_info *. This improves const-correctness and clarifies that the platform info data from the driver_data table is not meant to be modified at runtime. No functional changes intended. Signed-off-by: David E. Box <david.e.box@linux.intel.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Link: https://patch.msgid.link/20260313015202.3660072-3-david.e.box@linux.intel.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Stable-dep-of: 348ccc754d89 ("platform/x86/intel/vsec: Fix enable_cnt imbalance on PCIe error recovery") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursRevert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"Lorenzo Stoakes2-18/+2
commit 83f9efcce93f8574be2279090ee2aec58b86cda7 upstream. This reverts commit ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare") with conflict resolution to account for changes in commit ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare"). The patch incorrectly handled hugetlb VMA lock allocation at the mmap_prepare stage, where a failed allocation occurring after mmap_prepare is called might result in the lock leaking. There is no risk of a merge causing a similar issues, as VMA_DONTEXPAND_BIT is set for hugetlb mappings. As a first step in addressing this issue, simply revert the change so we can rework how we do this having corrected the underlying issues. We maintain the VMA flags changes as best we can, accounting for the fact that we were working with a VMA descriptor previously and propagating like-for-like changes for this. Note that we invoke vma_set_flags() and do not call vma_start_write() as vm_flags_set() does. This is OK as it's being done in an .mmap hook where the VMA is not yet linked into the tree so nobody else can be accessing it. Link: https://lore.kernel.org/20260512160643.266960-1-ljs@kernel.org Fixes: ea52cb24cd3f ("mm/hugetlbfs: update hugetlbfs to use mmap_prepare") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Reported-by: Mingyu Wang <25181214217@stu.xidian.edu.cn> Closes: https://lore.kernel.org/linux-mm/20260425070700.562229-1-25181214217@stu.xidian.edu.cn/ Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursserial: core: introduce guard(uart_port_lock_check_sysrq_irqsave)Jacques Nilo1-0/+12
commit c3cce2e67bb22a223f5b8ef05db0fcde70994068 upstream. uart_handle_break() and uart_prepare_sysrq_char() (in include/linux/serial_core.h) capture a SysRq character into port->sysrq_ch while the port lock is held and rely on the unlock helper -- uart_unlock_and_check_sysrq_irqrestore() -- to dispatch the captured character to handle_sysrq() on scope exit. The existing guard(uart_port_lock_irqsave) cannot be used by IRQ handlers that process RX, because its destructor calls plain uart_port_unlock_irqrestore() and silently drops port->sysrq_ch. Add a dedicated guard(uart_port_lock_check_sysrq_irqsave) variant whose destructor is the sysrq-aware unlock helper. The lock side is identical to uart_port_lock_irqsave -- only the unlock-time behaviour differs. Callers that may capture SysRq characters must use guard(uart_port_lock_check_sysrq_irqsave); the existing guard(uart_port_lock_irqsave) keeps its current plain-unlock semantics for the many callers that do not process RX. The new macro is placed after the CONFIG_MAGIC_SYSRQ_SERIAL block so both definitions of uart_unlock_and_check_sysrq_irqrestore() (sysrq enabled and disabled) are visible at expansion time. When CONFIG_MAGIC_SYSRQ_SERIAL=n the destructor degenerates to plain uart_port_unlock_irqrestore(), so there is no overhead. No functional change on its own; users are converted in the following patches. Cc: stable@vger.kernel.org Signed-off-by: Jacques Nilo <jnilo@free.fr> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://patch.msgid.link/3849af4bc55d5d2a424fa850844e94d641b2f8a6.1778675349.git.jnilo@free.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursxfrm: route MIGRATE notifications to caller's netnsMaoyi Xie1-1/+2
commit 7e2a4f7ca0952820731ef7bdadfc9a9e9d3571b4 upstream. xfrm_send_migrate() in net/xfrm/xfrm_user.c and pfkey_send_migrate() in net/key/af_key.c both hardcode &init_net for the multicast that announces a successful XFRM_MSG_MIGRATE / SADB_X_MIGRATE. XFRM_MSG_MIGRATE arrives on a per-netns NETLINK_XFRM socket, and the rest of the xfrm/af_key netlink path was made netns-aware in 2008. The other 14 multicast paths in xfrm_user.c route their event using xs_net(x), xp_net(xp) or sock_net(skb->sk); only the migrate path was missed. Two consequences of the init_net hardcoding: 1. The notification (selector, old/new endpoint addresses, and the km_address) is delivered to listeners on init_net's XFRMNLGRP_MIGRATE / pfkey BROADCAST_ALL groups rather than on the issuing netns. An IKE daemon running in init_net therefore receives migration notifications originating from any other netns on the host. 2. An IKE daemon running inside a non-init netns and subscribed to its own XFRMNLGRP_MIGRATE / pfkey groups never receives the notification of its own migration. IKEv2 MOBIKE / address-update handling inside a netns is silently broken. Thread struct net through km_migrate() and the xfrm_mgr.migrate function pointer, drop the &init_net override in xfrm_send_migrate() and pfkey_send_migrate(), and pass the caller's net (already in scope in xfrm_migrate() via sock_net(skb->sk)) all the way down. struct xfrm_mgr is in-tree only and not exported as a stable API, so the function-pointer signature change is internal. pfkey_broadcast() is already netns-aware via net_generic(net, pfkey_net_id) since the pernet conversion. The five other pfkey_broadcast() callers in af_key.c already pass xs_net(x), sock_net(sk) or a per-netns net, so this only removes the &init_net outlier. Fixes: 5c79de6e79cd ("[XFRM]: User interface for handling XFRM_MSG_MIGRATE") Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursiommu, debugobjects: avoid gcc-16.1 section mismatch warningsArnd Bergmann1-0/+11
commit 4c9ad387aa2d6785299722e54224d34764edaeb3 upstream. gcc-16 has gained some more advanced inter-procedual optimization techniques that enable it to inline the dummy_tlb_add_page() and dummy_tlb_flush() function pointers into a specialized version of __arm_v7s_unmap: WARNING: modpost: vmlinux: section mismatch in reference: __arm_v7s_unmap+0x2cc (section: .text) -> dummy_tlb_add_page (section: .init.text) ERROR: modpost: Section mismatches detected. >From what I can tell, the transformation is correct, as this is only called when __arm_v7s_unmap() is called from arm_v7s_do_selftests(), which is also __init. Since __arm_v7s_unmap() however is not __init, gcc cannot inline the inner function calls directly. In debug_objects_selftest(), the same thing happens. Both the caller and the leaf function are __init, but the IPA pulls it into a non-init one: WARNING: modpost: vmlinux: section mismatch in reference: lookup_object_or_alloc+0x7c (section: .text.lookup_object_or_alloc) -> is_static_object (section: .init.text) Marking the affected functions as not "__init" would reliably avoid this issue but is not a good solution because it removes an otherwise correct annotation. I tried marking the functions as 'noinline', but that ended up not covering all the affected configurations. With some more experimenting, I found that marking these functions as __attribute__((noipa)) is both logical and reliable. In order to keep the syntax readable, add a custom macro for this in include/linux/compiler_attributes.h next to other related macros and use it to annotate both files. Link: https://lore.kernel.org/all/abRB6g-48ZX6Yl2r@willie-the-truck/ Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: linux-kbuild@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Will Deacon <will@kernel.org> Acked-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursDisable -Wattribute-alias for clang-23 and newerNathan Chancellor4-0/+18
commit 175db11786bde9061db526bf1ac5107d915f5163 upstream. Clang recently added support for -Wattribute-alias [1], which results in the same warnings that necessitated commit bee20031772a ("disable -Wattribute-alias warning for SYSCALL_DEFINEx()") for GCC. kernel/time/itimer.c:325:1: error: alias and aliasee have different types 'long (unsigned int)' and 'long (typeof (__builtin_choose_expr((__builtin_types_compatible_p(typeof ((unsigned int)0), typeof (0LL)) || __builtin_types_compatible_p(typeof ((unsigned int)0), typeof (0ULL))), 0LL, 0L)))' (aka 'long (long)') [-Werror,-Wattribute-alias] 325 | SYSCALL_DEFINE1(alarm, unsigned int, seconds) | ^ include/linux/syscalls.h:225:36: note: expanded from macro 'SYSCALL_DEFINE1' 225 | #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__) | ^ include/linux/syscalls.h:236:2: note: expanded from macro 'SYSCALL_DEFINEx' 236 | __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) | ^ include/linux/syscalls.h:251:18: note: expanded from macro '__SYSCALL_DEFINEx' 251 | __attribute__((alias(__stringify(__se_sys##name)))); \ | ^ kernel/time/itimer.c:325:1: note: aliasee is declared here include/linux/syscalls.h:225:36: note: expanded from macro 'SYSCALL_DEFINE1' 225 | #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__) | ^ include/linux/syscalls.h:236:2: note: expanded from macro 'SYSCALL_DEFINEx' 236 | __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) | ^ include/linux/syscalls.h:255:18: note: expanded from macro '__SYSCALL_DEFINEx' 255 | asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \ | ^ <scratch space>:16:1: note: expanded from here 16 | __se_sys_alarm | ^ Disable the warnings in the same way for clang-23 and newer. Disable the warning about unknown warning options to avoid breaking the build for versions of clang-23 that do not have -Wattribute-alias, such as ones deployed by vendors like Android or CI systems or when bisecting LLVM between llvmorg-23-init and release/23.x. Cc: stable@vger.kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/2163 Link: https://github.com/llvm/llvm-project/commit/40da6920a0d71d49dfa2392b09153600b0759f5e [1] Link: https://patch.msgid.link/20260515-syscall-disable-attribute-alias-for-clang-v1-1-9a9d95d41df6@kernel.org Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursparport: Fix race between port and client registrationBen Hutchings1-0/+1
commit ef15ccbb3e8640a723c42ad90eaf81d66ae02017 upstream. The parport subsystem registers port devices before they are fully initialised, resulting in a race condition where client drivers such as lp can attach to ports that are not completely initialised or even being torn down. When the port and client drivers are built as modules and loaded around the same time during boot, this occasionally results in a crash. I was able to make this happen reliably in a VM with a PC-style parallel port by patching parport_pc to fail probing: > --- a/drivers/parport/parport_pc.c > +++ b/drivers/parport/parport_pc.c > @@ -2069,7 +2069,7 @@ static struct parport *__parport_pc_probe_port(unsigned long int base, > if (!p) > goto out3; > > - base_res = request_region(base, 3, p->name); > + base_res = NULL; > if (!base_res) > goto out4; > and then running: while true; do modprobe lp & modprobe parport_pc wait rmmod lp parport_pc done for a few seconds. In the long term I think port registration should be changed to put the call to device_add() inside parport_announce_port(), but since the latter currently cannot fail this will require changing all port drivers. For now, add a flag to indicate whether a port has been "announced" and only try to attach client drivers to ports when the flag is set. Fixes: 6fa45a226897 ("parport: add device-model to parport subsystem") Closes: https://bugs.debian.org/1130365 Closes: https://lore.kernel.org/all/6ba903ad-9897-42bb-8c2d-337385cc3746@molgen.mpg.de/ Cc: stable <stable@kernel.org> Signed-off-by: Ben Hutchings <benh@debian.org> Acked-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Link: https://patch.msgid.link/afo6uBv68GDevbMD@decadent.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
28 hoursdpll: export __dpll_device_change_ntf() for use under dpll_lockIvan Vecera1-0/+1
[ Upstream commit 20040b2a3cb992f84d3db4c086b909eb9b906b31 ] Export __dpll_device_change_ntf() so that drivers can send device change notifications from within device callbacks, which are already called under dpll_lock. Using dpll_device_change_ntf() in that context would deadlock. Add lockdep_assert_held() to catch misuse without the lock held. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20260526074525.1451008-2-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: d733f519f644 ("dpll: zl3073x: use __dpll_device_change_ntf() and remove change_work") Signed-off-by: Sasha Levin <sashal@kernel.org>
28 hoursnet: Introduce skb tc depth field to track packet loopsJamal Hadi Salim1-0/+2
[ Upstream commit 98b34f3e8c3492cfc89ff943c9d92b4d52863d1d ] Add a 2-bit per-skb tc depth field to track packet loops across the stack. The previous per-CPU loop counters like MIRRED_NEST_LIMIT assume a single call stack and lose state in two cases: 1) When a packet is queued and reprocessed later (e.g., egress->ingress via backlog), the per-cpu state is gone by the time it is dequeued. 2) With XPS/RPS a packet may arrive on one CPU and be processed on another. A per-skb field solves both by travelling with the packet itself. The field fits in existing padding, using 2 bits that were previously a hole: pahole before(-) and after (+) diff looks like: __u8 slow_gro:1; /* 132: 3 1 */ __u8 csum_not_inet:1; /* 132: 4 1 */ __u8 unreadable:1; /* 132: 5 1 */ + __u8 tc_depth:2; /* 132: 6 1 */ - /* XXX 2 bits hole, try to pack */ /* XXX 1 byte hole, try to pack */ __u16 tc_index; /* 134 2 */ There used to be a ttl field which was removed as part of tc_verd in commit aec745e2c520 ("net-tc: remove unused tc_verd fields"). It was already unused by that time, due to remove earlier in commit c19ae86a510c ("tc: remove unused redirect ttl"). The first user of this field is netem, which increments tc_depth on duplicated packets before re-enqueueing them at the root qdisc. On re-entry, netem skips duplication for any skb with tc_depth already set, bounding recursion to a single level regardless of tree topology. The other user is mirred which increments it on each pass and limits to depth to MIRRED_DEFER_LIMIT (3). The new field was called ttl in earlier versions of this patch but renamed to tc_depth to avoid confusion with IP ttl. Note (looking at you Sashiko! Dont ignore me and continue bringing this up): 1. Since both mirred and netem utilize the same 2-bit tc_depth field it is possible when netem and mirred are used together that netem qdisc to skip the duplication step. This is a known trade-off, as a 2-bit field cannot independently track both features' recursion depths and it is not considered sane to have a setup that addresses both features on at the same time. 2. skb_scrub_packet does not clear tc_depth. This means a packet's loop history is preserved even across namespaces. While this might be restrictive for some topologies, it is also design intent to provide robustness against loops across namespaces. Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-2-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: db875221ab08 ("net/sched: Fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop") Signed-off-by: Sasha Levin <sashal@kernel.org>
28 hoursnetfilter: nf_tables: fix dst corruption in same register operationFernando Fernandez Mancera1-0/+7
[ Upstream commit 18014147d3ee7831dce53fe65d7fc8d428b02552 ] For lshift and rshift, the shift operations are performed in a loop over 32-bit words. The loop calculates the shifted value and write it to dst, and then immediately reads from src to calculate the carry for the next iteration. Because src and dst could point to the same memory location, the carry is incorrectly calculated using the newly modified dst value instead of the original src value. Adding a temporary local variable to cache the original value before writing to dst and using it for the carry calculation solves the problem. In addition, partial overlap is rejected from control plane for all kind of operations including byteorder. This was tested with the following bytecode: table test_table ip flags 0 use 1 handle 1 ip test_table test_chain use 3 type filter hook input prio 0 policy accept packets 0 bytes 0 flags 1 ip test_table test_chain 2 [ immediate reg 1 0x44332211 0x88776655 ] [ bitwise reg 1 = ( reg 1 << 0x08000000 ) ] [ cmp eq reg 1 0x66443322 0x00887766 ] [ counter pkts 0 bytes 0 ] ip test_table test_chain 4 3 [ immediate reg 1 0x44332211 0x88776655 ] [ bitwise reg 1 = ( reg 1 << 0x08000000 ) ] [ cmp eq reg 1 0x55443322 0x00887766 ] [ counter pkts 21794 bytes 1917798 ] Fixes: 567d746b55bc ("netfilter: bitwise: add support for shifts.") Acked-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
28 hourskunit: fix use-after-free in debugfs when using kunit.filterFlorian Schmaus1-0/+1
[ Upstream commit fb6988b83b4cafe8db63999c1ddff1b7c66d2ff5 ] When the kernel is booted with a kunit filter (e.g., kunit.filter="speed!=slow"), the kunit executor dynamically allocates copies of the filtered test suites using kmalloc/kmemdup. During the initial boot execution, kunit_debugfs_create_suite() creates debugfs files (such as /sys/kernel/debug/kunit/<suite>/run) and permanently stores a pointer to the dynamically allocated suite in the inode's i_private field. Previously, the executor freed this dynamically allocated suite_set immediately after executing the boot-time tests. Because the debugfs nodes were not destroyed, any subsequent interaction with the debugfs `run` file from userspace triggered a use-after-free (UAF). On systems with architectural capabilities, like CHERI RISC-V, this resulted in an immediate fatal hardware exception due to the invalidation of the capability tags on the reclaimed memory. On other architectures, it resulted in silent memory corruption. Fix this UAF by properly coupling the lifetime of the filtered suite memory allocation to the lifetime of the kunit subsystem and its associated VFS nodes. Ownership of the boot-time suite_set is now transferred to a global tracker ('kunit_boot_suites'), and the memory is cleanly released in kunit_exit() during module teardown. Link: https://lore.kernel.org/r/20260507084854.233984-1-florian.schmaus@codasip.com Fixes: e2219db280e3 ("kunit: add debugfs /sys/kernel/debug/kunit/<suite>/results display") Signed-off-by: Florian Schmaus <florian.schmaus@codasip.com> Reviewed-by: Martin Kaiser <martin@kaiser.cx> Reviewed-by: David Gow <david@davidgow.net> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
28 hoursHID: remove duplicate hid_warn_ratelimited definitionLiu Kai1-2/+0
[ Upstream commit dd2147375a8fe7c5bc3f1f1b1d3a9567c26faefa ] The hid_warn_ratelimited macro is defined twice in include/linux/hid.h: - first one added by commit 4051ead99888 ("HID: rate-limit hid_warn to prevent log flooding") - second one added by commit 1d64624243af ("HID: core: Add printk_ratelimited variants to hid_warn() etc")). The second definition is correctly grouped with other ratelimited macros. Remove the duplicate definition. Fixes: 1d64624243af ("HID: core: Add printk_ratelimited variants to hid_warn() etc") Signed-off-by: Liu Kai <lukace97@outlook.com> [bentiss: edited commit message] Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
28 hoursACPI: button: Enable wakeup GPEs for ACPI buttons at probe timeRafael J. Wysocki1-0/+5
[ Upstream commit a004b8f0d3bc5d82d3f2c91ff93f4b4b7ccb8f76 ] Prior to commit 57c31e6d620f ("ACPI: scan: Use acpi_setup_gpe_for_wake() for buttons"), ACPI button wakeup GPEs having handler methods remained enabled after acpi_wakeup_gpe_init(), but currently they are not enabled because acpi_setup_gpe_for_wake() disables them. That causes function keys to stop working on some systems [1] and there may be other related issues elsewhere. To address that, make the ACPI button driver enable wakeup GPEs for ACPI buttons so long as they have handler methods. While this does not restore the old behavior exactly (the ACPI button driver needs to be bound to the button devices for the GPEs to be enabled), it should be sufficient to restore the missing functionality. For this purpose, introduce acpi_enable_gpe_cond() that enables a GPE if its dispatch type matches the supplied one and modify acpi_button_probe() to use that function for enabling the GPEs in question. Fixes: 57c31e6d620f ("ACPI: scan: Use acpi_setup_gpe_for_wake() for buttons") Reported-by: Nick <nick@kousu.ca> Closes: https://lore.kernel.org/linux-acpi/E2OXET.4X5GTP37VTNC3@kousu.ca/ [1] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Nick <nick@kousu.ca> Cc: 7.0+ <stable@vger.kernel.org> # 7.0+ Link: https://patch.msgid.link/9629117.CDJkKcVGEf@rafael.j.wysocki Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daystcp: fix stale per-CPU tcp_tw_isn leak enabling ISN predictionEric Dumazet1-3/+4
[ Upstream commit 1bbf0ced1d9db73ac7893c2187f3459288603e0d ] Blamed commit moved the TIME_WAIT-derived ISN from the skb control block to a per-CPU variable, assuming the value would always be consumed by tcp_conn_request() for the same packet that wrote it. That assumption is violated by multiple drop paths between the producer (__this_cpu_write(tcp_tw_isn, isn) in tcp_v{4,6}_rcv()) and the consumer (tcp_conn_request()): - min_ttl / min_hopcount check - xfrm policy check - tcp_inbound_hash() MD5/AO mismatch - tcp_filter() eBPF/SO_ATTACH_FILTER drop - th->syn && th->fin discard in tcp_rcv_state_process() TCP_LISTEN - psp_sk_rx_policy_check() in tcp_v{4,6}_do_rcv() - tcp_checksum_complete() in tcp_v{4,6}_do_rcv() - tcp_v{4,6}_cookie_check() returning NULL When a packet is dropped on any of these paths, tcp_tw_isn is left set. The next SYN processed on the same CPU then consumes the non zero value in tcp_conn_request(), receiving a potentially predictable ISN. This patch moves back tcp_tw_isn to skb->cb[], getting rid of the per-cpu variable. Note that tcp_v{4,6}_fill_cb() do not set it. Very litle impact on overall code size/complexity: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 2/1 up/down: 8/-15 (-7) Function old new delta tcp_v6_rcv 3038 3042 +4 tcp_v4_rcv 3035 3039 +4 tcp_conn_request 2938 2923 -15 Total: Before=24436060, After=24436053, chg -0.00% Fixes: 41eecbd712b7 ("tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260519084611.2485277-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayscrypto/krb5, rxrpc: Fix lack of pre-decrypt/pre-verify length checksDavid Howells2-3/+7
[ Upstream commit 2b50aceafe6606ea52ed42aadd1b4d44a188aade ] Change the krb5 crypto library to provide facilities to precheck the length of the message about to be decrypted or verified. Fix AF_RXRPC to make use of this to validate DATA packets secured with RxGK. Fixes: 9d1d2b59341f ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)") Closes: https://sashiko.dev/#/patchset/20260511160753.607296-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Herbert Xu <herbert@gondor.apana.org.au> cc: Simon Horman <horms@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: linux-afs@lists.infradead.org Reviewed-by: Jeffrey Altman <jaltman@auristor.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> Link: https://patch.msgid.link/20260515230516.2718212-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnet: shaper: rework the VALID marking (again)Jakub Kicinski1-0/+1
[ Upstream commit b8d7519352ba8c6df83259295d4a3bad093cae90 ] Recent commit changed the semantics from NOT_VALID to VALID. I didn't realize that the flags are not stored atomically with the entry in XArray. There's still a race of reader observing a VALID mark for a slot, getting interrupted, writer replacing the entry with a different one, reader continuing, fetching the entry which is now a different pointer than the pointer for which VALID was meant. The biggest consequence of this is that we may see a UAF since net_shaper_rollback() assumed that entries without VALID can be freed without observing RCU. Looks like the XArray marks are buying us nothing at this point. Let's convert the code to an explicit valid field. The smp_load_acquire() / smp_store_release() barriers are marginally cleaner. Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 93954b40f6a4 ("net-shapers: implement NL set and delete operations") Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260515221325.1685455-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnet: airoha: Fix NPU RX DMA descriptor bitsChristian Marangi1-3/+3
[ Upstream commit 0cb5a74faa3bdcfa3b18735d554e12c0f615e35d ] In an internal review from Airoha, it was notice that the RX DMA descriptor bits and mask are wrong. These values probably refer to an old NPU firmware never published. The previous value works correctly but it was reported that in some specific condition in mixed scenario with both Ethernet and WiFi offload it's possible that RX DMA descriptor signal wrong value with the problem to the RX ring or packets getting dropped. To handle these specific scenario, apply the new suggested bits mask from Airoha. Correct functionality of both AN7581 NPU and MT7996 variant were verified and confirmed working. Fixes: a7fc8c641cab ("net: airoha: Fix npu rx DMA definitions") Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260518134530.3683-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayscgroup/rstat: validate cpu before css_rstat_cpu() accessQing Ming1-0/+1
[ Upstream commit 8817005efbdfdf5d4e4814cb5dc52b53d12917d7 ] css_rstat_updated() is exposed as a BPF kfunc and accepts a caller-provided cpu argument. The function uses cpu for per-cpu rstat lookups without checking whether it refers to a valid possible CPU. A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu == 0x7fffffff triggers: UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9 index 2147483647 is out of range for type 'long unsigned int [64]' Call Trace: css_rstat_updated bpf_iter_run_prog cgroup_iter_seq_show bpf_seq_read Add cpu validation to the BPF-facing css_rstat_updated() kfunc and move the common implementation to __css_rstat_updated() for in-kernel callers. Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat") Signed-off-by: Qing Ming <a0yami@mailbox.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysdrm/gem: Make the GEM LRU lock part of drm_deviceBoris Brezillon2-11/+16
[ Upstream commit 379e8f1ca5e919b130b40d8115d92a536e5f8d7a ] Recently, a few races have been discovered in the GEM LRU logic, all of them caused by the fact the LRU lock is accessed through gem->lru->lock, and that very same lock also protects changes to gem->lru, leading to situations where gem->lru needs to first be accessed without the lock held, to then get the lru to access the lock through and finally take the lock and do the expected operation. Currently, the only driver making use of this API (MSM) declares a device-wide lock, and the user we're about to add (panthor) will do the same. There's no evidence that we will ever have a driver that wants different pools of LRUs protected by different locks under the same drm_device. So we're better off moving this lock to drm_device and always locking it through obj->dev->gem_lru_mutex, or directly through dev->gem_lru_mutex. If anyone ever needs more fine-grained locking, this can be revisited to pass some drm_gem_lru_pool object representing the pool of LRUs under a specific lock, but for now, the per-device lock seems to be enough. Fixes: e7c2af13f811 ("drm/gem: Add LRU/shrinker helper") Reported-by: Chia-I Wu <olvaffe@gmail.com> Closes: https://gitlab.freedesktop.org/panfrost/linux/-/work_items/86 Reviewed-by: Rob Clark <rob.clark@oss.qualcomm.com> Reviewed-by: Liviu Dudau <liviu.dudau@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Chia-I Wu <olvaffe@gmail.com> Link: https://patch.msgid.link/20260518-panthor-shrinker-fixes-v4-1-1920234470d5@collabora.com Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysiommupt: Avoid rewalking during mapJason Gunthorpe1-6/+28
[ Upstream commit d6c65b0fd6218bd21ed0be7a8d3218e8f6dc91de ] Currently the core code provides a simplified interface to drivers where it fragments a requested multi-page map into single page size steps after doing all the calculations to figure out what page size is appropriate. Each step rewalks the page tables from the start. Since iommupt has a single implementation of the mapping algorithm it can internally compute each step as it goes while retaining its current position in the walk. Add a new function pt_pgsz_count() which computes the same page size fragement of a large mapping operations. Compute the next fragment when all the leaf entries of the current fragement have been written, then continue walking from the current point. The function pointer is run through pt_iommu_ops instead of iommu_domain_ops to discourage using it outside iommupt. All drivers with their own page tables should continue to use the simplified map_pages() style interfaces. Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Stable-dep-of: 0735c54804c7 ("iommu: Handle unmap error when iommu_debug is enabled") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysiommupt: Directly call iommupt's unmap_range()Jason Gunthorpe2-6/+32
[ Upstream commit 99fb8afa16add85ed016baee9735231bca0c32b4 ] The common algorithm in iommupt does not require the iommu_pgsize() calculations, it can directly unmap any arbitrary range. Add a new function pointer to directly call an iommupt unmap_range op and make __iommu_unmap() call it directly. Gives about a 5% gain on single page unmappings. The function pointer is run through pt_iommu_ops instead of iommu_domain_ops to discourage using it outside iommupt. All drivers with their own page tables should continue to use the simplified map/unmap_pages() style interfaces. Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Stable-dep-of: 0735c54804c7 ("iommu: Handle unmap error when iommu_debug is enabled") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysblock: allow submitting all zone writes from a single contextDamien Le Moal1-0/+8
[ Upstream commit 1365b6904fd050bf22ab9f3df375a396de5837a1 ] In order to maintain sequential write patterns per zone with zoned block devices, zone write plugging issues only a single write BIO per zone at any time. This works well but has the side effect that when large sequential write streams are issued by the user and these streams cross zone boundaries, the device ends up receiving a discontiguous set of write commands for different zones. The same also happens when a user writes simultaneously at high queue depth multiple zones: the device does not see all sequential writes per zone and receives discontiguous writes to different zones. While this does not affect the performance of solid state zoned block devices, when using an SMR HDD, this pattern change from sequential writes to discontiguous writes to different zones significantly increases head seek which results in degraded write throughput. In order to reduce this seek overhead for rotational media devices, introduce a per disk zone write plugs kernel thread to issue all write BIOs to zones. This single zone write issuing context is enabled for any zoned block device that has a request queue flagged with the new QUEUE_ZONED_QD1_WRITES flag. The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute zoned_qd1_writes for zoned devices. For regular block devices, this attribute is not visible. For zoned block devices, a user can override the default value set to force the global write maximum queue depth of 1 for a zoned block device, or clear this attribute to fallback to the default behavior of zone write plugging which limits writes to QD=1 per sequential zone. Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is implemented using a list of zone write plugs that have a non-empty BIO list. Listed zone write plugs are processed by the disk zone write plugs worker kthread in FIFO order, and all BIOs of a zone write plug are all processed before switching to the next listed zone write plug. A newly submitted BIO for a non-FULL zone write plug that is not yet listed causes the addition of the zone write plug at the end of the disk list of zone write plugs. Since the write BIOs queued in a zone write plug BIO list are necessarilly sequential, for rotational media, using the single zone write plugs kthread to issue all BIOs maintains a sequential write pattern and thus reduces seek overhead and improves write throughput. This processing essentially result in always writing to HDDs at QD=1, which is not an issue for HDDs operating with write caching enabled. Performance with write cache disabled is also not degraded thanks to the efficient write handling of modern SMR HDDs. A disk list of zone write plugs is defined using the new struct gendisk zone_wplugs_list, and accesses to this list is protected using the zone_wplugs_list_lock spinlock. The per disk kthread (zone_wplugs_worker) code is implemented by the function disk_zone_wplugs_worker(). A reference on listed zone write plugs is always held until all BIOs of the zone write plug are processed by the worker kthread. BIO issuing at QD=1 is driven using a completion structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait(). With this change, performance when sequentially writing the zones of a 30 TB SMR SATA HDD connected to an AHCI adapter changes as follows (1MiB direct I/Os, results in MB/s unit): +--------------------+ | Write BW (MB/s) | +------------------+----------+---------+ | Sequential write | Baseline | Patched | | Queue Depth | 6.19-rc8 | | +------------------+----------+---------+ | 1 | 244 | 245 | | 2 | 244 | 245 | | 4 | 245 | 245 | | 8 | 242 | 245 | | 16 | 222 | 246 | | 32 | 211 | 245 | | 64 | 193 | 244 | | 128 | 112 | 246 | +------------------+----------+---------+ With the current code (baseline), as the sequential write stream crosses a zone boundary, higher queue depth creates a gap between the last IO to the previous zone and the first IOs to the following zones, causing head seeks and degrading performance. Using the disk zone write plugs worker thread, this pattern disappears and the maximum throughput of the drive is maintained, leading to over 100% improvements in throughput for high queue depth write. Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1 MiB direct IOs, write throughput also increases significantly. +--------------------+ | Write BW (MB/s) | +------------------+----------+---------+ | Random write | Baseline | Patched | | Number of zones | 6.19-rc7 | | +------------------+----------+---------+ | 1 | 191 | 192 | | 2 | 101 | 128 | | 4 | 115 | 123 | | 8 | 90 | 120 | | 16 | 64 | 115 | | 32 | 58 | 105 | | 64 | 56 | 101 | | 128 | 55 | 99 | +------------------+----------+---------+ Tests using XFS shows that buffered write speed with 8 jobs writing files increases by 12% to 35% depending on the workload. +--------------------+ | Write BW (MB/s) | +------------------+----------+---------+ | Workload | Baseline | Patched | | | 6.19-rc7 | | +------------------+----------+---------+ | 256MiB file size | 212 | 238 | +------------------+----------+---------+ | 4MiB .. 128 MiB | 213 | 243 | | random file size | | | +------------------+----------+---------+ | 2MiB .. 8 MiB | 179 | 242 | | random file size | | | +------------------+----------+---------+ Performance gains are even more significant when using an HBA that limits the maximum size of commands to a small value, e.g. HBAs controlled with the mpi3mr driver limit commands to a maximum of 1 MiB. In such case, the write throughput gains are over 40%. +--------------------+ | Write BW (MB/s) | +------------------+----------+---------+ | Workload | Baseline | Patched | | | 6.19-rc7 | | +------------------+----------+---------+ | 256MiB file size | 175 | 245 | +------------------+----------+---------+ | 4MiB .. 128 MiB | 174 | 244 | | random file size | | | +------------------+----------+---------+ | 2MiB .. 8 MiB | 171 | 243 | | random file size | | | +------------------+----------+---------+ Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 836efd35c472 ("block: fix handling of dead zone write plugs") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysblock: rename struct gendisk zone_wplugs_lock fieldDamien Le Moal1-1/+1
[ Upstream commit b7cbc30e93e3a64ea058230f6d0c764d6d80276f ] Rename struct gendisk zone_wplugs_lock field to zone_wplugs_hash_lock to clearly indicates that this is the spinlock used for manipulating the hash table of zone write plugs. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 836efd35c472 ("block: fix handling of dead zone write plugs") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfs: Fix folio->private handling in netfs_perform_write()David Howells1-0/+1
[ Upstream commit ccde2ac757c713535b224233a296de40efe5212d ] Under some circumstances, netfs_perform_write() doesn't correctly manipulate folio->private between NULL, NETFS_FOLIO_COPY_TO_CACHE, pointing to a group and pointing to a netfs_folio struct, leading to potential multiple attachments of private data with associated folio ref leaks and also leaks of netfs_folio structs or netfs_group refs. Fix this by consolidating the place at which a folio is marked uptodate in one place and having that look at what's attached to folio->private and decide how to clean it up and then set the new group. Also, the content shouldn't be flushed if group is NULL, even if a group is specified in the netfs_group parameter, as that would be the case for a new folio. A filesystem should always specify netfs_group or never specify netfs_group. The Sashiko auto-review tool noted that it was theoretically possible that the fpos >= ctx->zero_point section might leak if it modified a streaming write folio. This is unlikely, but with a network filesystem, third party changes can happen. It also pointed out that __netfs_set_group() would leak if called multiple times on the same folio from the "whole folio modify section". Fixes: 8f52de0077ba ("netfs: Reduce number of conditional branches in netfs_perform_write()") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-22-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfs: Fix potential UAF in netfs_unlock_abandoned_read_pages()David Howells1-1/+1
[ Upstream commit dbe556972100fabb8e5a1b3d2163831ff07b1e8e ] netfs_unlock_abandoned_read_pages(rreq) accesses the index of the folios it is wanting to unlock and compares that to rreq->no_unlock_folio so that it doesn't unlock a folio being read for netfs_perform_write() or netfs_write_begin(). However, given that netfs_unlock_abandoned_read_pages() is called _after_ NETFS_RREQ_IN_PROGRESS is cleared, the one folio that it's not allowed to dereference is the one specified by ->no_unlock_folio as ownership immediately reverts to the caller. Fix this by storing the folio pointer instead and using that rather than the index. Also fix netfs_unlock_read_folio() where the same applies. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-20-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfs: Fix streaming write being overwrittenDavid Howells1-0/+3
[ Upstream commit 7b4dcf1b9455a6e52ac7478b4057dbe10359576d ] In order to avoid reading whilst writing, netfslib will allow "streaming writes" in which dirty data is stored directly into folios without reading them first. Such folios are marked dirty but may not be marked uptodate. If a folio is entirely written by a streaming write, uptodate will be set, otherwise it will have a netfs_folio struct attached to ->private recording the dirty region. In the event that a partially written streaming write page is to be overwritten entirely by a single write(), netfs_perform_write() will try to copy over it, but doesn't discard the netfs_folio if it succeeds; further, it doesn't correctly handle a partial copy that overwrites some of the dirty data. Fix this by the following: (1) If the folio is successfully overwritten, free the netfs_folio struct before marking the page uptodate. (2) If the copy to the folio partially fails, but short of the dirty data, just ignore the copy. (3) If the copy partially fails and overwrites some of the dirty data, accept the copy, update the netfs_folio struct to record the new data. If the folio is now filled, free the netfs_folio and set uptodate, otherwise return a partial write. Found with: fsx -q -N 1000000 -p 10000 -o 128000 -l 600000 \ /xfstest.test/junk --replay-ops=junk.fsxops using the following as junk.fsxops: truncate 0x0 0 0x927c0 write 0x63fb8 0x53c8 0 copy_range 0xb704 0x19b9 0x24429 0x79380 write 0x2402b 0x144a2 0x90660 * write 0x204d5 0x140a0 0x927c0 * copy_range 0x1f72c 0x137d0 0x7a906 0x927c0 * read 0x00000 0x20000 0x9157c read 0x20000 0x20000 0x9157c read 0x40000 0x20000 0x9157c read 0x60000 0x20000 0x9157c read 0x7e1a0 0xcfb9 0x9157c on cifs with the default cache option. It shows folio 0x24 misbehaving if the FMODE_READ check is commented out in netfs_perform_write(): if (//(file->f_mode & FMODE_READ) || netfs_is_cache_enabled(ctx)) { and no fscache. This was initially found with the generic/522 xfstest. Fixes: 8f52de0077ba ("netfs: Reduce number of conditional branches in netfs_perform_write()") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-14-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfs: Fix netfs_invalidate_folio() to clear dirty bit if all changes goneDavid Howells1-0/+4
[ Upstream commit 156ac2ec2ee77c44c4eb7439d6d165247ba12247 ] If a streaming write is made, this will leave the relevant modified folio in a not-uptodate, but dirty state with a netfs_folio struct hung off of folio->private indicating the dirty range. Subsequently truncating the file such that the dirty data in the folio is removed, but the first part of the folio theoretically remains will cause the netfs_folio struct to be discarded... but will leave the dirty flag set. If the folio is then read via mmap(), netfs_read_folio() will see that the page is dirty and jump to netfs_read_gaps() to fill in the missing bits. netfs_read_gaps(), however, expects there to be a netfs_folio struct present and can oops because truncate removed it. Fix this by calling folio_cancel_dirty() in netfs_invalidate_folio() in the event that all the dirty data in the folio is erased (as nfs does). Also add some tracepoints to log modifications to a dirty page. This can be reproduced with something like: dd if=/dev/zero of=/xfstest.test/foo bs=1M count=1 umount /xfstest.test mount /xfstest.test xfs_io -c "w 0xbbbf 0xf96c" \ -c "truncate 0xbbbf" \ -c "mmap -r 0xb000 0x11000" \ -c "mr 0xb000 0x11000" \ /xfstest.test/foo with fscaching disabled (otherwise streaming writes are suppressed) and a change to netfs_perform_write() to disallow streaming writes if the fd is open O_RDWR: if (//(file->f_mode & FMODE_READ) || <--- comment this out netfs_is_cache_enabled(ctx)) { It should be reproducible even without this change, but if prevents the above trivial xfs_io command from reproducing it. Note that the initial dd is important: the file must start out sufficiently large that the zero-point logic doesn't just clear the gaps because it knows there's nothing in the file to read yet. Unmounting and mounting is needed to clear the pagecache (there are other ways to do that that may also work). This was initially reproduced with the generic/522 xfstest on some patches that remove the FMODE_READ restriction. Fixes: 9ebff83e6481 ("netfs: Prep to use folio->private for write grouping and streaming write") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-12-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfs: Fix potential for tearing in ->remote_i_size and ->zero_pointDavid Howells1-9/+284
[ Upstream commit 2c8f4742bb76117d735f92a3932d85239b16c494 ] Fix potential tearing in using ->remote_i_size and ->zero_point by copying i_size_read() and i_size_write() and using the same seqcount as for i_size. We need to make sure that netfslib and the filesystems that use it always hold i_lock whilst updating any of the sizes to prevent i_size_seqcount from getting corrupted. Fixes: 4058f742105e ("netfs: Keep track of the actual remote file size") Fixes: 100ccd18bb41 ("netfs: Optimise away reads above the point at which there can be no data") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-6-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfs: Fix missing barriers when accessing stream->subrequests locklesslyDavid Howells1-0/+37
[ Upstream commit b5782e2d462c028096f922abca46318cec890670 ] The list of subrequests attached to stream->subrequests is accessed without locks by netfs_collect_read_results() and netfs_collect_write_results(), and then they access subreq->flags without taking a barrier after getting the subreq pointer from the list. Relatedly, the functions that build the list don't use any sort of write barrier when constructing the list to make sure that the NETFS_SREQ_IN_PROGRESS flag is perceived to be set first if no lock is taken. Fix this by: (1) Add a new list_add_tail_release() function that uses a release barrier to set the pointer to the new member of the list. (2) Add a new list_first_entry_or_null_acquire() function that uses an acquire barrier to read the pointer to the first member in a list (or return NULL). (3) Use list_add_tail_release() when adding a subreq to ->subrequests. (4) Use list_first_entry_or_null_acquire() when initially accessing the front of the list (when an item is removed, the pointer to the new front iterm is obtained under the same lock). Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Link: https://sashiko.dev/#/patchset/20260326104544.509518-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-4-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysfprobe: Fix unregister_fprobe() to wait for RCU grace periodMasami Hiramatsu (Google)1-0/+5
[ Upstream commit 657b594b2084b39a4bc6d8493aa2140cb00cea49 ] Commit 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") changed fprobe to register struct fprobe to an rcu-hlist, but it forgot to wait for RCU GP. Thus there can be use-after-free if the fprobe is released right after unregistering. This can be happened on fprobe event and sample module code. To fix this issue, add synchronize_rcu() in unregister_fprobe(). Note that BPF is OK because fprobe is used as a part of bpf_kprobe_multi_link. This unregisters its fprobe in bpf_kprobe_multi_link_release() and it is deallocated via bpf_kprobe_multi_link_dealloc(), which is invoked from bpf_link_defer_dealloc_rcu_gp() RCU callback. For BPF, this also introduced unregister_fprobe_async() which does NOT wait for RCU grace priod. Link: https://lore.kernel.org/all/177813998919.256460.2809243930741138224.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayskprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()Jianpeng Chang1-1/+1
[ Upstream commit 307abfac04a254c09c5705d816b33354acee97a0 ] When kprobe_add_area_blacklist() iterates through a section like .kprobes.text, the start address may not correspond to a named symbol. On ARM64 with CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS=y (introduced by commit baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")), the compiler flag -fpatchable-function-entry=4,2 inserts 2 NOPs before each function entry point for ftrace call_ops. These pre-function NOPs sit at the section base address, before the first named function symbol. The compiler emits a $x mapping symbol at offset 0x00 to mark the start of code, but find_kallsyms_symbol() ignores mapping symbols. Without CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS (e.g. defconfig), no pre-function NOPs are inserted, the first function starts at offset 0x00, and the bug does not trigger. This only affects modules that have a .kprobes.text section (i.e. those using the __kprobes annotation). Modules using NOKPROBE_SYMBOL() instead (like kretprobe_example.ko) blacklist exact function addresses via the _kprobe_blacklist section and are not affected. For kprobe_example.ko on ARM64 with -fpatchable-function-entry=4,2, the .kprobes.text section layout is: offset 0x00: $x + 2 NOPs (mapping symbol + ftrace preamble) offset 0x08: handler_post (64 bytes) offset 0x50: handler_pre (68 bytes) kprobe_add_area_blacklist() starts iterating from the section base address (offset 0x00), which only has the $x mapping symbol. kprobe_add_ksym_blacklist() then calls kallsyms_lookup_size_offset() for this address, which goes through: kallsyms_lookup_size_offset() -> module_address_lookup() -> find_kallsyms_symbol() find_kallsyms_symbol() scans all module symbols to find the closest preceding symbol. Since no named text symbol exists at offset 0x00, find_kallsyms_symbol() picks __UNIQUE_ID_vermagic (a .modinfo symbol whose address is in the temporary image) as the "best" match. The computed "size" = next_text_symbol - modinfo_symbol spans across these two unrelated memory regions, creating a blacklist entry with a bogus range of tens of terabytes. Whether this causes a visible failure depends on address randomization, here is what happens on Raspberry Pi 4/5: - On RPi5, the bogus size was ~35 TB. start + size stayed within 64-bit range, so the blacklist entry covered the entire kernel text. register_kprobe() in the module's own init function failed with -EINVAL. - On RPi4, the bogus size was ~75 TB. start + size overflowed 64 bits and wrapped to a small address near zero. The range check (addr >= start && addr < end) then failed because end wrapped around, so the bogus entry was accidentally harmless and kprobes worked by luck. The same bug exists on both machines, but randomization determines whether the integer overflow masks it or not. Fix this by adding notrace to the __kprobes macro. Functions in .kprobes.text are kprobe infrastructure handlers that should never be traced by ftrace. With notrace, the compiler stops inserting them and the non-symbol gap at the section start disappears entirely. Link: https://lore.kernel.org/all/20260506012706.2785785-1-jianpeng.chang.cn@windriver.com/ Fixes: baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS") Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfilter: nf_conntrack_expect: restore helper propagation via expectationPablo Neira Ayuso1-1/+4
[ Upstream commit dcb0f9aefdd604d36710fda53c25bd7cf4a3e37a ] A recent series to fix expectations broke helper propagation via expectation, this mechanism is used by the sip and h323 helper. This also propagates the conntrack helper to expected connections. I changed semantics of exp->helper which now tells us the actual helper that created the expectation. Add an explicit assign_helper field to expectations for this purpose and update helpers to use it. Restore this feature for userspace conntrack helper via ctnetlink nfqueue integration so it is again possible to attach a helper to an expectation, where it makes sense. This is not restored via ctnetlink expectation creation as there is no client for such feature. Use the expectation layer 4 protocol number for the helper lookup for consistency. Make sure the expectation using this helper propagation mechanism also go away when the helper is unregistered. Fixes: 9c42bc9db90a ("netfilter: nf_conntrack_expect: honor expectation helper field") Fixes: 917b61fa2042 ("netfilter: ctnetlink: ignore explicit helper on new expectations") Reported-by: Ilya Maximets <i.maximets@ovn.org> Tested-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfilter: x_tables: add and use xtables_unregister_table_exitFlorian Westphal1-1/+1
[ Upstream commit b4597d5fd7d2f8cebfffd40dffb5e003cc78964c ] Previous change added xtables_unregister_table_pre_exit to detach the table from the packetpath and to unlink it from the active table list. In case of rmmod, userspace that is doing set/getsockopt for this table will not be able to re-instantiate the table: 1. The larval table has been removed already 2. existing instantiated table is no longer on the xt pernet table list. This adds the second stage helper: unlink the table from the dying list, free the hook ops (if any) and do the audit notification. It replaces xt_unregister_table(). Fixes: fdacd57c79b7 ("netfilter: x_tables: never register tables by default") Reported-by: Tristan Madani <tristan@talencesecurity.com> Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Closes: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfilter: x_tables: add and use xt_unregister_table_pre_exitFlorian Westphal4-3/+1
[ Upstream commit 527d6931473b75d90e38942aae6537d1a527f1fd ] Remove the copypasted variants of _pre_exit and add one single function in the xtables core. ebtables is not compatible with x_tables and therefore unchanged. This is a preparation patch to reduce noise in the followup bug fixes. Reviewed-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Stable-dep-of: b4597d5fd7d2 ("netfilter: x_tables: add and use xtables_unregister_table_exit") Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysnetfilter: x_tables: allocate hook ops while under mutexFlorian Westphal1-0/+1
[ Upstream commit b62eb8dcf2c47d4d676a434efbd57c4f776f7829 ] arp/ip(6)t_register_table() add the table to the per-netns list via xt_register_table() before allocating the per-netns hook ops copy via kmemdup_array(). This leaves a window where the table is visible in the list with ops=NULL. If the pernet exit happens runs concurrently the pre_exit callback finds the table via xt_find_table() and passes the NULL ops pointer to nf_unregister_net_hooks(), causing a NULL dereference: general protection fault in nf_unregister_net_hooks+0xbc/0x150 RIP: nf_unregister_net_hooks (net/netfilter/core.c:613) Call Trace: ipt_unregister_table_pre_exit iptable_mangle_net_pre_exit ops_pre_exit_list cleanup_net Fix by moving the ops allocation into the xtables core so the table is never in the list without valid ops. Also ensure the table is no longer processing packets before its torn down on error unwind. nf_register_net_hooks might have published at least one hook; call synchronize_rcu() if there was an error. audit log register message gets deferred until all operations have passed, this avoids need to emit another ureg message in case of error unwinding. Based on earlier patch by Tristan Madani. Fixes: f9006acc8dfe5 ("netfilter: arp_tables: pass table pointer via nf_hook_ops") Fixes: ee177a54413a ("netfilter: ip6_tables: pass table pointer via nf_hook_ops") Fixes: ae689334225f ("netfilter: ip_tables: pass table pointer via nf_hook_ops") Link: https://lore.kernel.org/netfilter-devel/20260429175613.1459342-1-tristmd@gmail.com/ Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysbtrfs: tracepoints: fix sleep while in atomic context in btrfs_sync_file()Filipe Manana1-3/+1
[ Upstream commit c73370c677646e86fc4b1780fb07027bdf847375 ] The trace event btrfs_sync_file() is called in an atomic context (all trace events are) and its call to dput(), which is needed due to the call to dget_parent(), can sleep, triggering a kernel splat. This can be reproduced by enabling the trace event and running btrfs/056 from fstests for example. The splat shown in dmesg is the following: [53.919] BUG: sleeping function called from invalid context at fs/dcache.c:970 [53.947] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32773, name: xfs_io [53.988] preempt_count: 2, expected: 0 [53.967] RCU nest depth: 0, expected: 0 [53.943] Preemption disabled at: [53.944] [<0000000000000000>] 0x0 [54.078] CPU: 0 UID: 0 PID: 32773 Comm: xfs_io Tainted: G W 7.1.0-rc1-btrfs-next-232+ #1 PREEMPT(full) [54.070] Tainted: [W]=WARN [54.071] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [54.072] Call Trace: [54.074] <TASK> [54.076] dump_stack_lvl+0x56/0x80 [54.079] __might_resched.cold+0xd6/0x10f [54.072] dput.part.0+0x24/0x110 [54.078] trace_event_raw_event_btrfs_sync_file+0x75/0x140 [btrfs] [54.089] btrfs_sync_file+0x1ed/0x530 [btrfs] [54.087] ? __handle_mm_fault+0x8ae/0xed0 [54.089] btrfs_do_write_iter+0x172/0x210 [btrfs] [54.091] vfs_write+0x21f/0x450 [54.094] __x64_sys_pwrite64+0x8d/0xc0 [54.096] ? do_user_addr_fault+0x20c/0x670 [54.099] do_syscall_64+0x60/0xf20 [54.092] ? clear_bhb_loop+0x60/0xb0 [54.094] entry_SYSCALL_64_after_hwframe+0x76/0x7e So stop using dget_parent() and dput() and access the parent dentry directly as dentry->d_parent. This is also what ext4 is doing in its equivalent trace event ext4_sync_file_enter(). Fixes: a85b46db143f ("btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file()") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysdevice property: set fwnode->secondary to NULL in fwnode_init()Bartosz Golaszewski1-0/+1
commit 215c90ee656114f5e8c32408228d97082f8e0eef upstream. If a firmware node is allocated on the stack (for instance: temporary software node whose life-time we control) or on the heap - but using a non-zeroing allocation function - and initialized using fwnode_init(), its secondary pointer will contain uninitalized memory which likely will be neither NULL nor IS_ERR() and so may end up being dereferenced (for example: in dev_to_swnode()). Set fwnode->secondary to NULL on initialization. Cc: stable <stable@kernel.org> Fixes: 01bb86b380a3 ("driver core: Add fwnode_init()") Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Link: https://patch.msgid.link/20260506115701.23035-1-bartosz.golaszewski@oss.qualcomm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysring-buffer: Flush and stop persistent ring buffer on panicMasami Hiramatsu (Google)1-0/+13
commit a494d3c8d5392bcdff83c2a593df0c160ff9f322 upstream. On real hardware, panic and machine reboot may not flush hardware cache to memory. This means the persistent ring buffer, which relies on a coherent state of memory, may not have its events written to the buffer and they may be lost. Moreover, there may be inconsistency with the counters which are used for validation of the integrity of the persistent ring buffer which may cause all data to be discarded. To avoid this issue, stop recording of the ring buffer on panic and flush the cache of the ring buffer's memory. Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance") Cc: stable@vger.kernel.org Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysnetfilter: nf_queue: hold bridge skb->dev while queuedHaoze Xie1-0/+1
commit e196115ec330a18de415bdb9f5071aa9f08e53ce upstream. br_pass_frame_up() rewrites skb->dev from the ingress port to the bridge master before queueing bridge LOCAL_IN packets. NFQUEUE only holds references on state.in/out and bridge physdevs, so a queued bridge packet can retain a freed bridge master in skb->dev until reinjection. When the verdict is reinjected later, br_netif_receive_skb() re-enters the receive path with skb->dev still pointing at the freed bridge master, triggering a use-after-free. Store skb->dev in the queue entry, hold a reference on it for the queue lifetime, and use the saved device when dropping queued packets during NETDEV_DOWN handling. Fixes: ac2863445686 ("netfilter: bridge: add nf_afinfo to enable queuing to userspace") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Haoze Xie <royenheart@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysBluetooth: serialize accept_q accessJiexun Wang1-0/+1
commit e83f5e24da741fa9405aeeff00b08c5ee7c37b88 upstream. bt_sock_poll() walks the accept queue without synchronization, while child teardown can unlink the same socket and drop its last reference. The unsynchronized accept queue walk has existed since the initial Bluetooth import. Protect accept_q with a dedicated lock for queue updates and polling. Also rework bt_accept_dequeue() to take temporary child references under the queue lock before dropping it and locking the child socket. Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Reported-by: Jann Horn <jannh@google.com> Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysmm/damon: fix damos_stat tracepoint format for sz_appliedSeongJae Park1-1/+1
commit 620072fd783290ad92c2d445a47b0a61b161f352 upstream. The print format is wrongly marking sz_applied as sz_tried. Fix it. Link: https://lore.kernel.org/20260426193119.88095-1-sj@kernel.org Fixes: 804c26b961da ("mm/damon/core: add trace point for damos stat per apply interval") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> # 7.0.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysmm/page_alloc: fix initialization of tags of the huge zero folio with ↵David Hildenbrand (Arm)2-8/+9
init_on_free commit 6a288a4ddb4a994490505ab5f41c445f8e6b6467 upstream. __GFP_ZEROTAGS semantics are currently a bit weird, but effectively this flag is only ever set alongside __GFP_ZERO and __GFP_SKIP_KASAN. If we run with init_on_free, we will zero out pages during __free_pages_prepare(), to skip zeroing on the allocation path. However, when allocating with __GFP_ZEROTAG set, post_alloc_hook() will consequently not only skip clearing page content, but also skip clearing tag memory. Not clearing tags through __GFP_ZEROTAGS is irrelevant for most pages that will get mapped to user space through set_pte_at() later: set_pte_at() and friends will detect that the tags have not been initialized yet (PG_mte_tagged not set), and initialize them. However, for the huge zero folio, which will be mapped through a PMD marked as special, this initialization will not be performed, ending up exposing whatever tags were still set for the pages. The docs (Documentation/arch/arm64/memory-tagging-extension.rst) state that allocation tags are set to 0 when a page is first mapped to user space. That no longer holds with the huge zero folio when init_on_free is enabled. Fix it by decoupling __GFP_ZEROTAGS from __GFP_ZERO, passing to tag_clear_highpages() whether we want to also clear page content. Invert the meaning of the tag_clear_highpages() return value to have clearer semantics. Reproduced with the huge zero folio by modifying the check_buffer_fill arm64/mte selftest to use a 2 MiB area, after making sure that pages have a non-0 tag set when freeing (note that, during boot, we will not actually initialize tags, but only set KASAN_TAG_KERNEL in the page flags). $ ./check_buffer_fill 1..20 ... not ok 17 Check initial tags with private mapping, sync error mode and mmap memory not ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory ... This code needs more cleanups; we'll tackle that next, like decoupling __GFP_ZEROTAGS from __GFP_SKIP_KASAN. [akpm@linux-foundation.org: s/__GPF_ZERO/__GFP_ZERO/, per David] Link: https://lore.kernel.org/20260421-zerotags-v2-1-05cb1035482e@kernel.org Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Lance Yang <lance.yang@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysata: libata-scsi: do not needlessly defer commands when using PMP with FBSNiklas Cassel1-3/+3
commit 759e8756da00aa115d504a18155b1d1ee1cc12e8 upstream. The ACS specification does not allow a non-NCQ command to be issued while an NCQ command is outstanding. Commit 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation") introduced a feature where a deferred non-NCQ command gets issued from a workqueue. The design stores a single non-NCQ command per port. However, when using Port Multipliers (PMPs), specifically PMPs that support FIS-Based Switching (FBS), non-NCQ and NCQ commands can be mixed on the same port, just not for the same link, see e.g. ata_std_qc_defer() which is, and always has operated on a per-link basis. Therefore, move the deferred_qc from struct ata_port to struct ata_link. This way, when using a PMP with FBS, we will not needlessly defer commands to all other links, just because one link issued a non-NCQ command while having an NCQ command outstanding. Only commands for that specific link will be deferred. This is in line with how PMPs with FBS worked before commit 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation"). Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation") Tested-by: Tommy Kelly <linux@tkel.ly> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysata: libata-scsi: do not use the deferred QC feature on PMPs with CBSNiklas Cassel1-0/+1
commit f233124fb36cd57ef09f96d517a38ab4b902e15e upstream. When using Port Multipliers (PMPs) with Command-Based Switching (CBS), you can only issue commands to one link at a time. For PMPs with CBS, there is already code to handle commands being sent to different links in sata_pmp_qc_defer_cmd_switch() using ap->excl_link. sata_sil24 also makes use of ap->excl_link. A user on the list reported that commit 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation") broke PMPs with CBS. The commit introduced code that stores a deferred qc in ap->deferred_qc, to later be issued via a workqueue. It turns out that this change is incompatible with the existing ap->excl_link handling used by PMPs with CBS. Thus, modify sata_pmp_qc_defer_cmd_switch() and sil24_qc_defer() to return ATA_DEFER_LINK_EXCL, and make sure that the deferred QC handling via workqueue is not used for this return value. This way, PMPs with CBS will work once again. Note that the starvation referenced in commit 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation") can only happen on libsas ports, and libsas does not support Port Multipliers, thus there is no harm of reverting back to the previous way of deferring commands for PMPs with CBS. Non-libsas ports connected to anything but a PMP with CBS (e.g. a normal drive or a PMP with FBS) will continue using the deferred workqueue, since it does result in lower completion latencies for non-NCQ commands, even though the workqueue is not strictly needed to avoid starvation for non-libsas ports. If we want to modify the scope of the workqueue issuing to also handle PMPs with CBS, then we should ensure that we can save both NCQ and non-NCQ commands in ap->deferred_qc, while also removing the existing PMP CBS handling using ap->excl_link, such that we don't duplicate features. While at it, also add a comment explaining how the ap->excl_link mechanism works. Fixes: 0ea84089dbf6 ("ata: libata-scsi: avoid Non-NCQ command starvation") Tested-by: Tommy Kelly <linux@tkel.ly> Reported-by: Tommy Kelly <linux@tkel.ly> Closes: https://lore.kernel.org/linux-ide/ce09cc21-a8e9-4845-b205-35411e22fba9@tkel.ly/ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23irqchip/gic-v5: Move LPI allocation into the LPI domainSascha Bischoff1-3/+0
commit dec85d2fbd20de3711a71e65397dfdb40c3fa953 upstream. The IPI and ITS MSI domains currently allocate and release LPIs directly, then pass the selected LPI ID to the parent LPI domain. This leaks the LPI domain's allocation policy into its child domains and forces each child to duplicate part of the parent domain's teardown. Make the LPI domain allocate LPIs in its .alloc() callback and release them in a matching .free() callback. Child domains can then request a parent interrupt without passing an implementation-specific LPI ID, and the LPI lifetime is tied to the domain that owns the LPI namespace. Remove the gicv5_alloc_lpi() and gicv5_free_lpi() wrappers now that no external caller needs to manage LPIs directly. This is a preparatory change for an actual leakage problem in the allocation code and therefore tagged with the same Fixes tag. Fixes: 0f0101325876 ("irqchip/gic-v5: Add GICv5 LPI/IPI support") Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260506093634.382062-2-sascha.bischoff@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23drm/ttm: Fix ttm_bo_swapout() infinite LRU walk on swapout failureThomas Hellström1-0/+2
commit b2ed01e7ad3de80333e9b962a44024b094bc0b2b upstream. When ttm_tt_swapout() fails, the current code calls ttm_resource_add_bulk_move() followed by ttm_resource_move_to_lru_tail() to restore the resource's bulk_move membership. However, ttm_resource_move_to_lru_tail() places the resource at the tail of the LRU list which, relative to the walk cursor's hitch node (placed immediately after the resource when it was yielded), puts the resource *in front of the* the hitch. The next list_for_each_entry_continue() from the hitch finds the same resource again, causing an infinite loop. Fix by deferring del_bulk_move to the success path only. On the success path, TTM_TT_FLAG_SWAPPED has just been set by ttm_tt_swapout() but the resource is still tracked in the bulk_move range, so ttm_resource_del_bulk_move()'s !ttm_resource_unevictable() guard would incorrectly skip the removal. Introduce ttm_resource_del_bulk_move_unevictable() which bypasses that guard. Reported-by: Jatin Kataria <jkataria@netflix.com> Fixes: fc5d96670eb2 ("drm/ttm: Move swapped objects off the manager's LRU list") Cc: Christian König <christian.koenig@amd.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <dri-devel@lists.freedesktop.org> Cc: <stable@vger.kernel.org> # v6.13+ Assisted-by: GitHub_Copilot:claude-sonnet-4.6 Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Tested-by: Boqun Feng <boqun@kernel.org> Link: https://patch.msgid.link/20260428094442.16985-1-thomas.hellstrom@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23rseq: Implement read only ABI enforcement for optimized RSEQ V2 modeThomas Gleixner2-54/+33
commit 82f572449cfe75f12ea985986da60e11f308f77d upstream. The optimized RSEQ V2 mode requires that user space adheres to the ABI specification and does not modify the read-only fields cpu_id_start, cpu_id, node_id and mm_cid behind the kernel's back. While the kernel does not rely on these fields, the adherence to this is a fundamental prerequisite to allow multiple entities, e.g. libraries, in an application to utilize the full potential of RSEQ without stepping on each other toes. Validate this adherence on every update of these fields. If the kernel detects that user space modified the fields, the application is force terminated. Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.845230956%40kernel.org Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23rseq: Revert to historical performance killing behaviourThomas Gleixner3-22/+61
commit b9eac6a9d93c952c4b7775a24d5c7a1bbf4c3c00 upstream. The recent RSEQ optimization work broke the TCMalloc abuse of the RSEQ ABI as it not longer unconditionally updates the CPU, node, mm_cid fields, which are documented as read only for user space. Due to the observed behavior of the kernel it was possible for TCMalloc to overwrite the cpu_id_start field for their own purposes and rely on the kernel to update it unconditionally after each context switch and before signal delivery. The RSEQ ABI only guarantees that these fields are updated when the data changes, i.e. the task is migrated or the MMCID of the task changes due to switching from or to per CPU ownership mode. The optimization work eliminated the unconditional updates and reduced them to the documented ABI guarantees, which results in a massive performance win for syscall, scheduling heavy work loads, which in turn breaks the TCMalloc expectations. There have been several options discussed to restore the TCMalloc functionality while preserving the optimization benefits. They all end up in a series of hard to maintain workarounds, which in the worst case introduce overhead for everyone, e.g. in the scheduler. The requirements of TCMalloc and the optimization work are diametral and the required work arounds are a maintainence burden. They end up as fragile constructs, which are blocking further optimization work and are pretty much guaranteed to cause more subtle issues down the road. The optimization work heavily depends on the generic entry code, which is not used by all architectures yet. So the rework preserved the original mechanism moslty unmodified to keep the support for architectures, which handle rseq in their own exit to user space loop. That code is currently optimized out by the compiler on architectures which use the generic entry code. This allows to revert back to the original behaviour by replacing the compile time constant conditions with a runtime condition where required, which disables the optimization and the dependend time slice extension feature until the run-time condition can be enabled in the RSEQ registration code on a per task basis again. The following changes are required to restore the original behavior, which makes TCMalloc work again: 1) Replace the compile time constant conditionals with runtime conditionals where appropriate to prevent the compiler from optimizing the legacy mode out 2) Enforce unconditional update of IDs on context switch for the non-optimized v1 mode 3) Enforce update of IDs in the pre signal delivery path for the non-optimized v1 mode 4) Enforce update of IDs in the membarrier(RSEQ) IPI for the non-optimized v1 mode 5) Make time slice and future extensions depend on optimized v2 mode This brings back the full performance problems, but preserves the v2 optimization code and for generic entry code using architectures also the TIF_RSEQ optimization which avoids a full evaluation of the exit to user mode loop in many cases. Fixes: 566d8015f7ee ("rseq: Avoid CPU/MM CID updates when no event pending") Reported-by: Mathias Stearn <mathias@mongodb.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Closes: https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com Link: https://patch.msgid.link/20260428224427.517051752%40kernel.org Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23HID: core: introduce hid_safe_input_report()Benjamin Tissoires1-0/+2
[ Upstream commit 206342541fc887ae919774a43942dc883161fece ] hid_input_report() is used in too many places to have a commit that doesn't cross subsystem borders. Instead of changing the API, introduce a new one when things matters in the transport layers: - usbhid - i2chid This effectively revert to the old behavior for those two transport layers. Fixes: 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing bogus memset()") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>