summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
5 daysnet: skbuff: propagate shared-frag marker through frag-transfer helpersHyunwoo Kim3-1/+13
commit 48f6a5356a33dd78e7144ae1faef95ffc990aae0 upstream. Two frag-transfer helpers (__pskb_copy_fclone() and skb_shift()) fail to propagate the SKBFL_SHARED_FRAG bit in skb_shinfo()->flags when moving frags from source to destination. __pskb_copy_fclone() defers the rest of the shinfo metadata to skb_copy_header() after copying frag descriptors, but that helper only carries over gso_{size,segs, type} and never touches skb_shinfo()->flags; skb_shift() moves frag descriptors directly and leaves flags untouched. As a result, the destination skb keeps a reference to the same externally-owned or page-cache-backed pages while reporting skb_has_shared_frag() as false. The mismatch is harmful in any in-place writer that uses skb_has_shared_frag() to decide whether shared pages must be detoured through skb_cow_data(). ESP input is one such writer (esp4.c, esp6.c), and a single nft 'dup to <local>' rule -- or any other nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d skb in esp_input() with the marker stripped, letting an unprivileged user write into the page cache of a root-owned read-only file via authencesn-ESN stray writes. Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors were actually moved from the source. skb_copy() and skb_copy_expand() share skb_copy_header() too but linearize all paged data into freshly allocated head storage and emerge with nr_frags == 0, so skb_has_shared_frag() returns false on its own; they need no change. The same omission exists in skb_gro_receive() and skb_gro_receive_list(). The former moves the incoming skb's frag descriptors into the accumulator's last sub-skb via two paths (a direct frag-move loop and the head_frag + memcpy path); the latter chains the incoming skb whole onto p's frag_list. Downstream skb_segment() reads only skb_shinfo(p)->flags, and skb_segment_list() reuses each sub-skb's shinfo as the nskb -- both p and lp must carry the marker. The same omission also exists in tcp_clone_payload(), which builds an MTU probe skb by moving frag descriptors from skbs on sk_write_queue into a freshly allocated nskb. The helper falls into the same family and warrants the same fix for consistency; no TCP TX-side in-place writer is currently known to reach a user page through this gap, but a future consumer depending on the marker would regress silently. The same omission exists in skb_segment(): the per-iteration flag merge takes only head_skb's flag, and the inner switch that rebinds frag_skb to list_skb on head_skb-frags exhaustion does not fold the new frag_skb's flag into nskb. Fold frag_skb's flag at both sites so segments drawing frags from frag_list members carry the marker. Fixes: cef401de7be8 ("net: fix possible wrong checksum generation") Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags") Suggested-by: Sabrina Dubroca <sd@queasysnail.net> Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com> Suggested-by: Ben Hutchings <ben@decadent.org.uk> Suggested-by: Lin Ma <malin89@huawei.com> Suggested-by: Jingguo Tan <tanjingguo@huawei.com> Suggested-by: Aaron Esau <aaron1esau@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Tested-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com> Link: https://patch.msgid.link/ageeJfJHwgzmKXbh@v4bel Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysnet: skbuff: preserve shared-frag marker during coalescingWilliam Bowling1-0/+2
commit f84eca5817390257cef78013d0112481c503b4a3 upstream. skb_try_coalesce() can attach paged frags from @from to @to. If @from has SKBFL_SHARED_FRAG set, the resulting @to skb can contain the same externally-owned or page-cache-backed frags, but the shared-frag marker is currently lost. That breaks the invariant relied on by later in-place writers. In particular, ESP input checks skb_has_shared_frag() before deciding whether an uncloned nonlinear skb can skip skb_cow_data(). If TCP receive coalescing has moved shared frags into an unmarked skb, ESP can see skb_has_shared_frag() as false and decrypt in place over page-cache backed frags. Propagate SKBFL_SHARED_FRAG when skb_try_coalesce() transfers paged frags. The tailroom copy path does not need the marker because it copies bytes into @to's linear data rather than transferring frag descriptors. Fixes: cef401de7be8 ("net: fix possible wrong checksum generation") Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags") Signed-off-by: William Bowling <vakzz@zellic.io> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260513041635.1289541-1-vakzz@zellic.io Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysnet/rds: reset op_nents when zerocopy page pin failsAllison Henderson1-0/+1
commit e174929793195e0cd6a4adb0cad731b39f9019b4 upstream. When iov_iter_get_pages2() fails in rds_message_zcopy_from_user(), the pinned pages are released with put_page(), and rm->data.op_mmp_znotifier is cleared. But we fail to properly clear rm->data.op_nents. Later when rds_message_purge() is called from rds_sendmsg() the cleanup loop iterates over the incorrectly non zero number of op_nents and frees them again. Fix this by properly resetting op_nents when it should be in rds_message_zcopy_from_user(). Fixes: 0cebaccef3ac ("rds: zerocopy Tx support.") Signed-off-by: Allison Henderson <achender@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260505234336.2132721-1-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmptcp: pm: ADD_ADDR rtx: resched blocked ADD_ADDR quickerMatthieu Baerts (NGI0)1-1/+1
[ Upstream commit 3cf12492891c4b5ff54dda404a2de4ec54c9e1b5 ] When an ADD_ADDR needs to be retransmitted and another one has already been prepared -- e.g. multiple ADD_ADDRs have been sent in a row and need to be retransmitted later -- this additional retransmission will need to wait. In this case, the timer was reset to TCP_RTO_MAX / 8, which is ~15 seconds. This delay is unnecessary long: it should just be rescheduled at the next opportunity, e.g. after the retransmission timeout. Without this modification, some issues can be seen from time to time in the selftests when multiple ADD_ADDRs are sent, and the host takes time to process them, e.g. the "signal addresses, ADD_ADDR timeout" MPTCP Join selftest, especially with a debug kernel config. Note that on older kernels, 'timeout' is not available. It should be enough to replace it by one second (HZ). Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260505-net-mptcp-pm-fixes-7-1-rc3-v1-6-fca8091060a4@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> [ replaced `TCP_RTO_MAX / 8` with `HZ` ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmptcp: pm: ADD_ADDR rtx: fix potential data-raceMatthieu Baerts (NGI0)1-0/+8
[ Upstream commit 5cd6e0ad79d2615264f63929f8b457ad97ae550d ] This mptcp_pm_add_timer() helper is executed as a timer callback in softirq context. To avoid any data races, the socket lock needs to be held with bh_lock_sock(). If the socket is in use, retry again soon after, similar to what is done with the keepalive timer. Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260505-net-mptcp-pm-fixes-7-1-rc3-v1-3-fca8091060a4@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> [ applied hunk to `net/mptcp/pm_netlink.c` instead of `net/mptcp/pm.c` ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmptcp: pm: kernel: correctly retransmit ADD_ADDR ID 0Matthieu Baerts (NGI0)1-5/+8
[ Upstream commit b12014d2d36eaed4e4bec5f1ac7e91110eeb100d ] When adding the ADD_ADDR to the list, the address including the IP, port and ID are copied. On the other hand, when the endpoint corresponds to the one from the initial subflow, the ID is set to 0, as specified by the MPTCP protocol. The issue is that the ID was reset after having copied the ID in the ADD_ADDR entry. So the retransmission was done, but using a different ID than the initial one. Fixes: 8b8ed1b429f8 ("mptcp: pm: reuse ID 0 after delete and re-add") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260505-net-mptcp-pm-fixes-7-1-rc3-v1-1-fca8091060a4@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> [ applied to net/mptcp/pm_netlink.c instead of upstream's pm_kernel.c ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmptcp: fix rx timestamp corruption on fastopenPaolo Abeni1-1/+3
[ Upstream commit 6254a16d6f0c672e3809ca5d7c9a28a55d71f764 ] The skb cb offset containing the timestamp presence flag is cleared before loading such information. Cache such value before MPTCP CB initialization. Fixes: 36b122baf6a8 ("mptcp: add subflow_v(4,6)_send_synack()") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260501-net-mptcp-misc-fixes-7-1-rc3-v1-3-b70118df778e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmptcp: drop __mptcp_fastopen_gen_msk_ackseq()Paolo Abeni4-29/+7
[ Upstream commit f03afb3aeb9d81f6c5ab728a61a040012923e3b3 ] When we will move the whole RX path under the msk socket lock, updating the already queued skb for passive fastopen socket at 3rd ack time will be extremely painful and race prone The map_seq for already enqueued skbs is used only to allow correct coalescing with later data; preventing collapsing to the first skb of a fastopen connect we can completely remove the __mptcp_fastopen_gen_msk_ackseq() helper. Before dropping this helper, a new item had to be added to the mptcp_skb_cb structure. Because this item will be frequently tested in the fast path -- almost on every packet -- and because there is free space there, a single byte is used instead of a bitfield. This micro optimisation slightly reduces the number of CPU operations to do the associated check. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-2-4a47d90d7998@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 6254a16d6f0c ("mptcp: fix rx timestamp corruption on fastopen") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmptcp: pm: prio: skip closed subflowsMatthieu Baerts (NGI0)1-0/+3
[ Upstream commit 166b78344031bf7ac9f55cb5282776cfd85f220e ] When sending an MP_PRIO, closed subflows need to be skipped. This fixes the case where the initial subflow got closed, re-opened later, then an MP_PRIO is needed for the same local address. Note that explicit MP_PRIO cannot be sent during the 3WHS, so it is fine to use __mptcp_subflow_active(). Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support") Cc: stable@vger.kernel.org Fixes: b29fcfb54cd7 ("mptcp: full disconnect implementation") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260505-net-mptcp-pm-fixes-7-1-rc3-v1-9-fca8091060a4@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> [ applied to renamed function `mptcp_pm_nl_mp_prio_send_ack()` in `pm_netlink.c` ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 dayslibceph: handle rbtree insertion error in decode_choose_args()Raphael Zimmer1-1/+4
commit d289478cfc0bcf81c7914200d6abdcb78bd04ded upstream. A message of type CEPH_MSG_OSD_MAP contains an OSD map that itself contains a CRUSH map. The received CRUSH map may optionally contain choose_args that get decoded in decode_choose_args(). In this function, num_choose_arg_maps is read from the message, and a corresponding number of crush_choose_arg_maps gets decoded afterwards. Each crush_choose_arg_map has a choose_args_index, which serves as the key when inserting it into the choose_args rbtree of the decoded crush_map. If a (potentially corrupted) message contains two crush_choose_arg_maps with the same index, the assertion in insert_choose_arg_map() triggers a kernel BUG when trying to insert the second crush_choose_arg_map. This patch fixes the issue by switching to the non-asserting rbtree insertion function and rejecting the message if the insertion fails. [ idryomov: changelog ] Cc: stable@vger.kernel.org Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 dayslibceph: Fix potential out-of-bounds access in crush_decode()Raphael Zimmer2-5/+5
commit 4c79fc2d598694bda845b46229c9d48b65042970 upstream. A message of type CEPH_MSG_OSD_MAP containing a crush map with at least one bucket has two fields holding the bucket algorithm. If the values in these two fields differ, an out-of-bounds access can occur. This is the case because the first algorithm field (alg) is used to allocate the correct amount of memory for a bucket of this type, while the second algorithm field inside the bucket (b->alg) is used in the subsequent processing. This patch fixes the issue by adding a check that compares alg and b->alg and aborts the processing in case they differ. Furthermore, b->alg is set to 0 in this case, because the destruction of the crush map also uses this field to determine the bucket type, which can again result in an out-of-bounds access when trying to free the memory pointed to by the fields of the bucket. To correctly free the memory allocated for the bucket in such a case, the corresponding call to kfree is moved from the algorithm-specific crush_destroy_bucket functions to the generic crush_destroy_bucket(). Cc: stable@vger.kernel.org Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 dayslibceph: Fix potential null-ptr-deref in decode_choose_args()Raphael Zimmer1-1/+2
commit 28b0a2ab8c82d0bbdeb8013029c67c978ce6e4bf upstream. A message of type CEPH_MSG_OSD_MAP contains an OSD map that itself contains a CRUSH map. When decoding this CRUSH map in crush_decode(), an array of max_buckets CRUSH buckets is decoded, where some indices may not refer to actual buckets and are therefore set to NULL. The received CRUSH map may optionally contain choose_args that get decoded in decode_choose_args(). When decoding a crush_choose_arg_map, a series of choose_args for different buckets is decoded, with the bucket_index being read from the incoming message. It is only checked that the bucket index does not exceed max_buckets, but not that it doesn't point to an index with a NULL bucket. If a (potentially corrupted) message contains a crush_choose_arg_map including such a bucket_index, a null pointer dereference may occur in the subsequent processing when attempting to access the bucket with the given index. This patch fixes the issue by extending the affected check. Now, it is only attempted to access the bucket if it is not NULL. Cc: stable@vger.kernel.org Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 dayslibceph: Fix potential out-of-bounds access in osdmap_decode()Raphael Zimmer1-1/+1
commit 35d0ed82d03e5ee77ea4f31f20e29562a7721649 upstream. When decoding osd_state and osd_weight from an incoming osdmap in osdmap_decode(), both are decoded for each osd, i.e., map->max_osd times. The ceph_decode_need() check only accounts for sizeof(*map->osd_weight) once. This can potentially result in an out-of-bounds memory access if the incoming message is corrupted such that the max_osd value exceeds the actual content of the osdmap message. This patch fixes the issue by changing the corresponding part in the ceph_decode_need() check to account for map->max_osd*sizeof(*map->osd_weight). Cc: stable@vger.kernel.org Fixes: dcbc919a5dc8 ("libceph: switch osdmap decoding to use ceph_decode_entity_addr") Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysnetfilter: nf_tables: unconditionally bump set->nelems before insertionPablo Neira Ayuso1-14/+16
[ Upstream commit def602e498a4f951da95c95b1b8ce8ae68aa733a ] In case that the set is full, a new element gets published then removed without waiting for the RCU grace period, while RCU reader can be walking over it already. To address this issue, add the element transaction even if set is full, but toggle the set_full flag to report -ENFILE so the abort path safely unwinds the set to its previous state. As for element updates, decrement set->nelems to restore it. A simpler fix is to call synchronize_rcu() in the error path. However, with a large batch adding elements to already maxed-out set, this could cause noticeable slowdown of such batches. Fixes: 35d0ac9070ef ("netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL") Reported-by: Inseo An <y0un9sa@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> [ Minor conflict resolved. ] Signed-off-by: Li hongliang <1468888505@139.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetfilter: nft_ct: fix missing expect put in obj evalLi Xiasong1-0/+2
commit 19f94b6fee75b3ef7fbc06f3745b9a771a8a19a4 upstream. nft_ct_expect_obj_eval() allocates an expectation and may call nf_ct_expect_related(), but never drops its local reference. Add nf_ct_expect_put(exp) before return to balance allocation. Fixes: 857b46027d6f ("netfilter: nft_ct: add ct expectations support") Cc: stable@vger.kernel.org Signed-off-by: Li Xiasong <lixiasong1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysnetfilter: nf_conntrack_sip: get helper before allocating expectationLi Xiasong1-4/+4
commit eb6317739b1ea3ab28791e1f91b24781905fa815 upstream. process_register_request() allocates an expectation and then checks whether a conntrack helper is available. If helper lookup fails, the function returns early and the allocated expectation is left behind. Reorder the code to fetch and validate helper before calling nf_ct_expect_alloc(). This keeps the logic simpler and removes the leak path while preserving existing behavior. Fixes: e14575fa7529 ("netfilter: nf_conntrack: use rcu accessors where needed") Cc: stable@vger.kernel.org Signed-off-by: Li Xiasong <lixiasong1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 dayspage_pool: fix incorrect mp_ops error handlingMina Almasry1-2/+4
[ Upstream commit abadf0ff63be488dc502ecfc9f622929a21b7117 ] Minor fix to the memory provider error handling, we should be jumping to free_ptr_ring in this error case rather than returning directly. Found by code-inspection. Cc: skhawaja@google.com Fixes: b400f4b87430 ("page_pool: Set `dma_sync` to false for devmem memory provider") Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20250821030349.705244-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetpoll: pass buffer size to egress_dev() to avoid MAC truncationBreno Leitao1-10/+13
[ Upstream commit 76b93a8107574006b25495664304ea9237494d70 ] egress_dev() formats np->dev_mac via snprintf() but receives buf as a bare char *, so it cannot derive the buffer size from the pointer. The size argument was hardcoded to MAC_ADDR_STR_LEN (3 * ETH_ALEN - 1 = 17), which is silly wrong in two ways: 1) misleading kernel log output on the MAC-selected target path (np->dev_name[0] == '\0'); for example "aa:bb:cc:dd:ee:ff doesn't exist, aborting" was logged as "aa:bb:cc:dd:ee:f doesn't exist, aborting". 2) the second argument of snprintf is the size of the buffer, not the size of what you want to write. Add a bufsz parameter to egress_dev() and pass sizeof(buf) from each caller, matching the standard snprintf() idiom and removing the hardcoded size from the helper. Every caller already declares "char buf[MAC_ADDR_STR_LEN + 1]" so the formatted MAC continues to fit. Tested by booting with netconsole=6665@/aa:bb:cc:dd:ee:ff,6666@10.0.0.1/00:11:22:33:44:55 on a kernel without a matching device. Pre-fix dmesg shows "aa:bb:cc:dd:ee:f doesn't exist, aborting"; post-fix shows the full "aa:bb:cc:dd:ee:ff doesn't exist, aborting". Fixes: f8a10bed32f5 ("netconsole: allow selection of egress interface via MAC address") Cc: stable@vger.kernel.org Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260501-netpoll_snprintf_fix-v1-1-84b0566e6597@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetpoll: Extract IPv6 address retrieval functionBreno Leitao1-32/+44
[ Upstream commit 6ad7969a361cbec5822285fb39203678ff462b64 ] Extract the IPv6 address retrieval logic from netpoll_setup() into a dedicated helper function netpoll_take_ipv6() to improve code organization and readability. The function handles obtaining the local IPv6 address from the network device, including proper address type matching between local and remote addresses (link-local vs global), and includes appropriate error handling when IPv6 is not supported or no suitable address is available. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-3-c2ac00fe558f@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 76b93a810757 ("netpoll: pass buffer size to egress_dev() to avoid MAC truncation") Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: sch_pie: annotate more data-races in pie_dump_stats()Eric Dumazet1-8/+6
[ Upstream commit 6d4106e8df94c0c52cf3ca6a6a0d01567fb3844e ] My prior patch missed few READ_ONCE()/WRITE_ONCE() annotations. Fixes: 5154561d9b11 ("net/sched: sch_pie: annotate data-races in pie_dump_stats()") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260430080056.35104-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: cls_flower: revert unintended changesPaolo Abeni1-3/+1
[ Upstream commit 1e01abec856593e02cd69fd95b784c10dd46880c ] While applying the blamed commit 4ca07b9239bd ("net: mctp i2c: check length before marking flow active"), I unintentionally included unrelated and unacceptable changes. Revert them. Fixes: 4ca07b9239bd ("net: mctp i2c: check length before marking flow active") Reported-by: Jeremy Kerr <jk@codeconstruct.com.au> Closes: https://lore.kernel.org/netdev/bd8704fe0bd53e278add5cde4873256656623e2e.camel@codeconstruct.com.au/ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/043026a53ff84da88b17648c4b0d17f0331749cb.1777447863.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet: tls: fix strparser anchor skb leak on offload RX setup failureJakub Kicinski3-0/+11
[ Upstream commit 58689498ca3384851145a754dbb1d8ed1cf9fb54 ] When tls_set_device_offload_rx() fails at tls_dev_add(), the error path calls tls_sw_free_resources_rx() to clean up the SW context that was initialized by tls_set_sw_offload(). This function calls tls_sw_release_resources_rx() (which stops the strparser via tls_strp_stop()) and tls_sw_free_ctx_rx() (which kfrees the context), but never frees the anchor skb that was allocated by alloc_skb(0) in tls_strp_init(). Note that tls_sw_free_resources_rx() is exclusively used for this "failed to start offload" code path, there's no other caller. The leak did not exist before commit 84c61fe1a75b ("tls: rx: do not use the standard strparser"), because the standard strparser doesn't try to pre-allocate an skb. The normal close path in tls_sk_proto_close() handles cleanup by calling tls_sw_strparser_done() (which calls tls_strp_done()) after dropping the socket lock, because tls_strp_done() does cancel_work_sync() and the strparser work handler takes the socket lock. Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260428231559.1358502-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 dayspage_pool: fix memory-provider leak in page_pool_create_percpu() error pathHasan Basbunar1-5/+5
[ Upstream commit 5ef343614db766acdc01c56d66e780a1b43c6ac6 ] When page_pool_create_percpu() fails on page_pool_list(), it falls through to its err_uninit: label, which calls page_pool_uninit(). At that point page_pool_init() has already taken two references when the user requested PP_FLAG_ALLOW_UNREADABLE_NETMEM: pool->mp_ops->init(pool) static_branch_inc(&page_pool_mem_providers); Neither is undone by page_pool_uninit(); both are only undone by __page_pool_destroy() (success-side teardown). The error path therefore leaks the per-provider reference taken by mp_ops->init (io_zcrx_ifq->refs in the io_uring zcrx provider, the dmabuf binding refcount in the devmem provider) plus one increment of the page_pool_mem_providers static branch on every failure of xa_alloc_cyclic() inside page_pool_list(). The leaked io_zcrx_ifq->refs in turn pins everything io_zcrx_ifq_free() would release on cleanup: ifq->user (uid), ifq->mm_account (mmdrop), ifq->dev (device refcount), ifq->netdev_tracker (netdev refcount), and the rbuf region. The leaked static branch increment forces all subsequent page_pool_alloc_netmems() and page_pool_return_page() callers to take the slow mp_ops branch for the lifetime of the kernel. Reachable via the io_uring zcrx path: io_uring_register(IORING_REGISTER_ZCRX_IFQ) /* CAP_NET_ADMIN */ -> __io_uring_register -> io_register_zcrx -> zcrx_register_netdev -> netif_mp_open_rxq -> driver ndo_queue_mem_alloc -> page_pool_create_percpu -> page_pool_init succeeds (mp_ops->init runs, branch++) -> page_pool_list fails (xa_alloc_cyclic -ENOMEM) -> goto err_uninit <-- leak The same shape applies to the devmem dmabuf provider via mp_dmabuf_devmem_init()/mp_dmabuf_devmem_destroy(). Restore the cleanup symmetry by moving the mp_ops->destroy() and static_branch_dec() calls out of __page_pool_destroy() and into page_pool_uninit(), so page_pool_uninit() is again the strict inverse of page_pool_init(). page_pool_uninit() has only two callers (the err_uninit: path and __page_pool_destroy()), so this preserves the single-call invariant on the success path while fixing the err path. The error path of page_pool_init() itself still skips the mp_ops cleanup correctly: mp_ops->init is the last action that takes a reference before page_pool_init() returns 0, so when it returns an error neither the refcount nor the static branch has been touched. Triggering the bug requires xa_alloc_cyclic() to fail with -ENOMEM, which under normal GFP_KERNEL retry behaviour is rare. It is deterministic under CONFIG_FAULT_INJECTION with fail_page_alloc / xa fault injection, or under sustained memory pressure. The leak is silent: there is no warning, and the released kernel build continues running with a permanently-incremented static branch. Fixes: 0f9214046893 ("memory-provider: dmabuf devmem memory provider") Signed-off-by: Hasan Basbunar <basbunarhasan@gmail.com> Link: https://patch.msgid.link/20260428170739.34881-1-basbunarhasan@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet: page_pool: create hooks for custom memory providersPavel Begunkov2-9/+29
[ Upstream commit 57afb483015768903029c8336ee287f4b03c1235 ] A spin off from the original page pool memory providers patch by Jakub, which allows extending page pools with custom allocators. One of such providers is devmem TCP, and the other is io_uring zerocopy added in following patches. Link: https://lore.kernel.org/netdev/20230707183935.997267-7-kuba@kernel.org/ Co-developed-by: Jakub Kicinski <kuba@kernel.org> # initial mp proposal Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250204215622.695511-5-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 5ef343614db7 ("page_pool: fix memory-provider leak in page_pool_create_percpu() error path") Signed-off-by: Sasha Levin <sashal@kernel.org>
5 dayspage_pool: Set `dma_sync` to false for devmem memory providerSamiullah Khawaja2-5/+7
[ Upstream commit b400f4b87430c105d92550cee5a72aea01fdf3d6 ] Move the `dma_map` and `dma_sync` checks to `page_pool_init` to make them generic. Set dma_sync to false for devmem memory provider because the dma_sync APIs should not be used for dma_buf backed devmem memory provider. Cc: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20241211212033.1684197-4-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 5ef343614db7 ("page_pool: fix memory-provider leak in page_pool_create_percpu() error path") Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: sch_cake: annotate data-races in cake_dump_stats() (V)Eric Dumazet1-6/+7
[ Upstream commit a6c95b833dc17e84d16a8ac0f40fd0931616a52d ] cake_dump_stats() runs without qdisc spinlock being held. In this final patch, I add READ_ONCE()/WRITE_ONCE() annotations for cparams.target and cparams.interval. Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: "Toke Høiland-Jørgensen" <toke@toke.dk> Link: https://patch.msgid.link/20260427083606.459355-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 dayssctp: discard stale INIT after handshake completionXin Long1-0/+6
[ Upstream commit 8a92cb475ca90d84db769e4d4383e631ace0d6e5 ] After an association reaches ESTABLISHED, the peer’s init_tag is already known from the handshake. Any subsequent INIT with the same init_tag is not a valid restart, but a delayed or duplicate INIT. Drop such INIT chunks in sctp_sf_do_unexpected_init() instead of processing them as new association attempts. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://patch.msgid.link/5788c76c1ee122a3ed00189e88dcf9df1fba226c.1777214801.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetfilter: skip recording stale or retransmitted INITXin Long1-3/+7
[ Upstream commit 576a5d2bad4814c881a829576b1261b9b8159d2b ] An INIT whose init_tag matches the peer's vtag does not provide new state information. It indicates either: - a stale INIT (after INIT-ACK has already been seen on the same side), or - a retransmitted INIT (after INIT has already been recorded on the same side). In both cases, the INIT must not update ct->proto.sctp.init[] state, since it does not advance the handshake tracking and may otherwise corrupt INIT/INIT-ACK validation logic. Allow INIT processing only when the conntrack entry is newly created (SCTP_CONNTRACK_NONE), or when the init_tag differs from the stored peer vtag. Note it skips the check for the ct with old_state SCTP_CONNTRACK_NONE in nf_conntrack_sctp_packet(), as it is just created in sctp_new() where it set ct->proto.sctp.vtag[IP_CT_DIR_REPLY] = ih->init_tag. Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/ee56c3e416452b2a40589a2a85245ac2ad5e9f4b.1777214801.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet: mctp i2c: check length before marking flow activeWilliam A. Kennington III1-1/+3
[ Upstream commit 4ca07b9239bd0478ae586632a2ed72be37ed8407 ] Currently, mctp_i2c_get_tx_flow_state() is called before the packet length sanity check. This function marks a new flow as active in the MCTP core. If the sanity check fails, mctp_i2c_xmit() returns early without calling mctp_i2c_lock_nest(). This results in a mismatched locking state: the flow is active, but the I2C bus lock was never acquired for it. When the flow is later released, mctp_i2c_release_flow() will see the active state and queue an unlock marker. The TX thread will then decrement midev->i2c_lock_count from 0, causing it to underflow to -1. This underflow permanently breaks the driver's locking logic, allowing future transmissions to occur without holding the I2C bus lock, leading to bus collisions and potential hardware hangs. Move the mctp_i2c_get_tx_flow_state() call to after the length sanity check to ensure we only transition the flow state if we are actually going to proceed with the transmission and locking. Fixes: f5b8abf9fc3d ("mctp i2c: MCTP I2C binding driver") Signed-off-by: William A. Kennington III <william@wkennington.com> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20260423074741.201460-1-william@wkennington.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetpoll: fix IPv6 local-address corruptionBreno Leitao1-1/+18
[ Upstream commit 3bc179bc7146c26c9dff75d2943d10528274e301 ] netpoll_setup() decides whether to auto-populate the local source address by testing np->local_ip.ip, which only inspects the first 4 bytes of the union inet_addr storage. For an IPv6 netpoll whose caller-supplied local address has a zero high-32 bits (::1, ::<suffix>, IPv4-mapped ::ffff:a.b.c.d, etc.), this misdetects the address as unset (which they are not, but the first 4 bytes are empty), calls netpoll_take_ipv6() and overwrites it with whatever matching link-local/global address the device happens to expose first. Introduce a helper netpoll_local_ip_unset() that picks the correct family-aware test (ipv6_addr_any() for IPv6, !.ip for IPv4) and use it from netpoll_setup(). Reproducer is something like: echo "::2" > local_ip echo 1 > enabled cat local_ip # before this fix: 2001:db8::1 (caller-supplied ::2 was clobbered) # after this fix: ::2 Fixes: b7394d2429c1 ("netpoll: prepare for ipv6") Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260424-netpoll_fix-v1-1-3a55348c625f@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetpoll: extract IPv4 address retrieval into helper functionBreno Leitao1-17/+31
[ Upstream commit 3699f992e8c22d3ce54d2c1a5774e2c49028f99c ] Move the IPv4 address retrieval logic from netpoll_setup() into a separate netpoll_take_ipv4() function to improve code organization and readability. This change consolidates the IPv4-specific logic and error handling into a dedicated function while maintaining the same functionality. No functional changes. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-2-c2ac00fe558f@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 3bc179bc7146 ("netpoll: fix IPv6 local-address corruption") Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetpoll: Extract carrier wait functionBreno Leitao1-12/+16
[ Upstream commit 76d30b51e818064e02917ce6328fb2c8adce5c87 ] Extract the carrier waiting logic into a dedicated helper function netpoll_wait_carrier() to improve code readability and reduce duplication in netpoll_setup(). Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250618-netpoll_ip_ref-v1-1-c2ac00fe558f@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 3bc179bc7146 ("netpoll: fix IPv6 local-address corruption") Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetconsole: allow selection of egress interface via MAC addressUday Shankar1-12/+39
[ Upstream commit f8a10bed32f5fbede13a5f22fdc4ab8740ea213a ] Currently, netconsole has two methods of configuration - module parameter and configfs. The former interface allows for netconsole activation earlier during boot (by specifying the module parameter on the kernel command line), so it is preferred for debugging issues which arise before userspace is up/the configfs interface can be used. The module parameter syntax requires specifying the egress interface name. This requirement makes it hard to use for a couple reasons: - The egress interface name can be hard or impossible to predict. For example, installing a new network card in a system can change the interface names assigned by the kernel. - When constructing the module parameter, one may have trouble determining the original (kernel-assigned) name of the interface (which is the name that should be given to netconsole) if some stable interface naming scheme is in effect. A human can usually look at kernel logs to determine the original name, but this is very painful if automation is constructing the parameter. For these reasons, allow selection of the egress interface via MAC address when configuring netconsole using the module parameter. Update the netconsole documentation with an example of the new syntax. Selection of egress interface by MAC address via configfs is far less interesting (since when this interface can be used, one should be able to easily convert between MAC address and interface name), so it is left unimplemented. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Breno Leitao <leitao@debian.org> Tested-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250312-netconsole-v6-2-3437933e79b8@purestorage.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: 3bc179bc7146 ("netpoll: fix IPv6 local-address corruption") Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet, treewide: define and use MAC_ADDR_STR_LENUday Shankar1-3/+4
[ Upstream commit 6d6c1ba7824022528dbe3e283fafbd0775424128 ] There are a few places in the tree which compute the length of the string representation of a MAC address as 3 * ETH_ALEN - 1. Define a constant for this and use it where relevant. No functionality changes are expected. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@verge.net.au> Link: https://patch.msgid.link/20250312-netconsole-v6-1-3437933e79b8@purestorage.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: 3bc179bc7146 ("netpoll: fix IPv6 local-address corruption") Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daystcp: make probe0 timer handle expired user timeoutAltan Hacigumus1-2/+3
[ Upstream commit 2b9f6f7065d4cfb65ba19126e0b35ac4544c3f3a ] tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies using subtraction with an unsigned lvalue. If elapsed probing time exceeds the configured TCP_USER_TIMEOUT, the underflow yields a large value. This ends up re-arming the probe timer for a full backoff interval instead of expiring immediately, delaying connection teardown beyond the configured timeout. Fix this by preventing underflow so user-set timeout expiration is handled correctly without extending the probe timer. Fixes: 344db93ae3ee ("tcp: make TCP_USER_TIMEOUT accurate for zero window probes") Link: https://lore.kernel.org/r/20260414013634.43997-1-ahacigu.linux@gmail.com Signed-off-by: Altan Hacigumus <ahacigu.linux@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260424014639.54110-1-ahacigu.linux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysneigh: let neigh_xmit take skb ownershipFlorian Westphal1-5/+5
[ Upstream commit 4438113be604ee67a7bf4f81da6e1cca41332ce4 ] neigh_xmit always releases the skb, except when no neighbour table is found. But even the first added user of neigh_xmit (mpls) relied on neigh_xmit to release the skb (or queue it for tx). sashiko reported: If neigh_xmit() is called with an uninitialized neighbor table (for example, NEIGH_ND_TABLE when IPv6 is disabled), it returns -EAFNOSUPPORT and bypasses its internal out_kfree_skb error path. Because the return value of neigh_xmit() is ignored here, does this leak the SKB? Assume full ownership and remove the last code path that doesn't xmit or free skb. Fixes: 4fd3d7d9e868 ("neigh: Add helper function neigh_xmit") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260424145843.74055-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: taprio: fix NULL pointer dereference in class dumpWeiming Shi1-5/+8
[ Upstream commit 3d07ca5c0fae311226f737963984bd94bb159a87 ] When a TAPRIO child qdisc is deleted via RTM_DELQDISC, taprio_graft() is called with new == NULL and stores NULL into q->qdiscs[cl - 1]. Subsequent RTM_GETTCLASS dump operations walk all classes via taprio_walk() and call taprio_dump_class(), which calls taprio_leaf() returning the NULL pointer, then dereferences it to read child->handle, causing a kernel NULL pointer dereference. The bug is reachable with namespace-scoped CAP_NET_ADMIN on any kernel with CONFIG_NET_SCH_TAPRIO enabled. On systems with unprivileged user namespaces enabled, an unprivileged local user can trigger a kernel panic by creating a taprio qdisc inside a new network namespace, grafting an explicit child qdisc, deleting it, and requesting a class dump. The RTM_GETTCLASS dump itself requires no capability. Oops: general protection fault, probably for non-canonical address 0xdffffc0000000007: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f] RIP: 0010:taprio_dump_class (net/sched/sch_taprio.c:2478) Call Trace: <TASK> tc_fill_tclass (net/sched/sch_api.c:1966) qdisc_class_dump (net/sched/sch_api.c:2326) taprio_walk (net/sched/sch_taprio.c:2514) tc_dump_tclass_qdisc (net/sched/sch_api.c:2352) tc_dump_tclass_root (net/sched/sch_api.c:2370) tc_dump_tclass (net/sched/sch_api.c:2431) rtnl_dumpit (net/core/rtnetlink.c:6864) netlink_dump (net/netlink/af_netlink.c:2325) rtnetlink_rcv_msg (net/core/rtnetlink.c:6959) netlink_rcv_skb (net/netlink/af_netlink.c:2550) </TASK> Fix this by substituting &noop_qdisc when new is NULL in taprio_graft(), a common pattern used by other qdiscs (e.g., multiq_graft()) to ensure the q->qdiscs[] slots are never NULL. This makes control-plane dump paths safe without requiring individual NULL checks. Since the data-plane paths (taprio_enqueue and taprio_dequeue_from_txq) previously had explicit NULL guards that would drop/skip the packet cleanly, update those checks to test for &noop_qdisc instead. Without this, packets would reach taprio_enqueue_one() which increments the root qdisc's qlen and backlog before calling the child's enqueue; noop_qdisc drops the packet but those counters are never rolled back, permanently inflating the root qdisc's statistics. After this change *old can be a valid qdisc, NULL, or &noop_qdisc. Only call qdisc_put(*old) in the first case to avoid decreasing noop_qdisc's refcount, which was never increased. Fixes: 665338b2a7a0 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/20260422161958.2517539-3-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: sch_fq_pie: annotate data-races in fq_pie_dump_stats()Eric Dumazet1-9/+10
[ Upstream commit 59b145771c7982cfe9020d4e9e22da92d6b5ae31 ] fq_codel_dump_stats() acquires the qdisc spinlock a bit too late. Move this acquisition before we fill tc_fq_pie_xstats with live data. Alternative would be to add READ_ONCE() and WRITE_ONCE() annotations, but the spinlock is needed anyway to scan q->new_flows and q->old_flows. Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260423063527.2568262-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: sch_choke: annotate data-races in choke_dump_stats()Eric Dumazet1-10/+16
[ Upstream commit d3aeb889dcbd78e95f500d383799a23d949796e0 ] choke_dump_stats() only runs with RTNL held. It reads fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260423062839.2524324-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: netem: check for negative latency and jitterStephen Hemminger1-0/+22
[ Upstream commit 90be9fedb218ee95a1cf59050d1306fbfb0e8b87 ] Reject requests with negative latency or jitter. A negative value added to current timestamp (u64) wraps to an enormous time_to_send, disabling dequeue. The original UAPI used u32 for these values; the conversion to 64-bit time values via TCA_NETEM_LATENCY64 and TCA_NETEM_JITTER64 allowed signed values to reach the kernel without validation. Jitter is already silently clamped by an abs() in netem_change(); that abs() can be removed in a follow-up once this rejection is in place. Fixes: 99803171ef04 ("netem: add uapi to express delay and jitter in nanoseconds") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-7-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: netem: fix slot delay calculation overflowStephen Hemminger1-3/+2
[ Upstream commit 51e94e1e2fef351c74d69eb53666df808d26af95 ] get_slot_next() computes a random delay between min_delay and max_delay using: get_random_u32() * (max_delay - min_delay) >> 32 This overflows signed 64-bit arithmetic when the delay range exceeds approximately 2.1 seconds (2^31 nanoseconds), producing a negative result that effectively disables slot-based pacing. This is a realistic configuration for WAN emulation (e.g., slot 1s 5s). Use mul_u64_u32_shr() which handles the widening multiply without overflow. Fixes: 0a9fe5c375b5 ("netem: slotting with non-uniform distribution") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-6-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: netem: validate slot configurationStephen Hemminger1-0/+29
[ Upstream commit 01801c359a74737b9b1aa28568b60374d857241a ] Reject slot configurations that have no defensible meaning: - negative min_delay or max_delay - min_delay greater than max_delay - negative dist_delay or dist_jitter - negative max_packets or max_bytes Negative or out-of-order delays underflow in get_slot_next(), producing garbage intervals. Negative limits trip the per-slot accounting (packets_left/bytes_left <= 0) on the first packet of every slot, defeating the rate-limiting half of the slot feature. Note that dist_jitter has been silently coerced to its absolute value by get_slot() since the feature was introduced; rejecting negatives here converts that silent coercion into -EINVAL. The abs() can be removed in a follow-up. Fixes: 836af83b54e3 ("netem: support delivering packets in delayed time slots") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-5-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: netem: only reseed PRNG when seed is explicitly providedStephen Hemminger1-4/+6
[ Upstream commit 986afaf809940577224a99c3a08d97a15eb37e93 ] netem_change() unconditionally reseeds the PRNG on every tc change command. If TCA_NETEM_PRNG_SEED is not specified, a new random seed is generated, destroying reproducibility for users who set a deterministic seed on a previous change. Move the initial random seed generation to netem_init() and only reseed in netem_change() when TCA_NETEM_PRNG_SEED is explicitly provided by the user. Fixes: 4072d97ddc44 ("netem: add prng attribute to netem_sched_data") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-4-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: netem: fix queue limit check to include reordered packetsStephen Hemminger1-1/+1
[ Upstream commit 4185701fcce6b426b6c3630b25330dddd9c47b0d ] The queue limit check in netem_enqueue() uses q->t_len which only counts packets in the internal tfifo. Packets placed in sch->q by the reorder path (__qdisc_enqueue_head) are not counted, allowing the total queue occupancy to exceed sch->limit under reordering. Include sch->q.qlen in the limit check. Fixes: f8d4bc455047 ("net/sched: netem: account for backlog updates from child qdisc") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-3-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: netem: fix probability gaps in 4-state loss modelStephen Hemminger1-4/+4
[ Upstream commit 732b463449fd0ef90acd13cda68eab1c91adb00c ] The 4-state Markov chain in loss_4state() has gaps at the boundaries between transition probability ranges. The comparisons use: if (rnd < a4) else if (a4 < rnd && rnd < a1 + a4) When rnd equals a boundary value exactly, neither branch matches and no state transition occurs. The redundant lower-bound check (a4 < rnd) is already implied by being in the else branch. Remove the unnecessary lower-bound comparisons so the ranges are contiguous and every random value produces a transition, matching the GI (General and Intuitive) loss model specification. This bug goes back to original implementation of this model. Fixes: 661b79725fea ("netem: revised correlated loss generator") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260418032027.900913-2-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetfilter: nf_conntrack_sip: don't use simple_strtoulFlorian Westphal2-34/+119
[ Upstream commit 8cf6809cddcbe301aedfc6b51bcd4944d45795f6 ] Replace unsafe port parsing in epaddr_len(), ct_sip_parse_header_uri(), and ct_sip_parse_request() with a new sip_parse_port() helper that validates each digit against the buffer limit, eliminating the use of simple_strtoul() which assumes NUL-terminated strings. The previous code dereferenced pointers without bounds checks after sip_parse_addr() and relied on simple_strtoul() on non-NUL-terminated skb data. A port that reaches the buffer limit without a trailing character is also rejected as malformed. Also get rid of all simple_strtoul() usage in conntrack, prefer a stricter version instead. There are intentional changes: - Bail out if number is > UINT_MAX and indicate a failure, same for too long sequences. While we do accept 05535 as port 5535, we will not accept e.g. 'sip:10.0.0.1:005060'. While its syntactically valid under RFC 3261, we should restrict this to not waste cycles when presented with malformed packets with 64k '0' characters. - Force base 10 in ct_sip_parse_numerical_param(). This is used to fetch 'expire=' and 'rports='; both are expected to use base-10. - In nf_nat_sip.c, only accept the parsed value if its within the 1k-64k range. - epaddr_len now returns 0 if the port is invalid, as it already does for invalid ip addresses. This is intentional. nf_conntrack_sip performs lots of guesswork to find the right parts of the message to parse. Being stricter could break existing setups. Connection tracking helpers are designed to allow traffic to pass, not to block it. Based on an earlier patch from Jenny Guanni Qu <qguanni@gmail.com>. Fixes: 05e3ced297fe ("[NETFILTER]: nf_conntrack_sip: introduce SIP-URI parsing helper") Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com> Reported-by: Dawid Moczadło <dawid@vidocsecurity.com> Reported-by: Jenny Guanni Qu <qguanni@gmail.com>. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetfilter: xt_policy: fix strict mode inbound policy matchingJiexun Wang1-1/+1
[ Upstream commit 4b2b4d7d4e203c92db8966b163edfacb1f0e1e29 ] match_policy_in() walks sec_path entries from the last transform to the first one, but strict policy matching needs to consume info->pol[] in the same forward order as the rule layout. Derive the strict-match policy position from the number of transforms already consumed so that multi-element inbound rules are matched consistently. Fixes: c4b885139203 ("[NETFILTER]: x_tables: replace IPv4/IPv6 policy match by address family independant version") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnetfilter: arp_tables: fix IEEE1394 ARP payload parsingPablo Neira Ayuso2-3/+23
[ Upstream commit 1e8e3f449b1e73b73a843257635b9c50f0cc0f0a ] Weiming Shi says: "arp_packet_match() unconditionally parses the ARP payload assuming two hardware addresses are present (source and target). However, IPv4-over-IEEE1394 ARP (RFC 2734) omits the target hardware address field, and arp_hdr_len() already accounts for this by returning a shorter length for ARPHRD_IEEE1394 devices. As a result, on IEEE1394 interfaces arp_packet_match() advances past a nonexistent target hardware address and reads the wrong bytes for both the target device address comparison and the target IP address. This causes arptables rules to match against garbage data, leading to incorrect filtering decisions: packets that should be accepted may be dropped and vice versa. The ARP stack in net/ipv4/arp.c (arp_create and arp_process) already handles this correctly by skipping the target hardware address for ARPHRD_IEEE1394. Apply the same pattern to arp_packet_match()." Mangle the original patch to always return 0 (no match) in case user matches on the target hardware address which is never present in IEEE1394. Note that this returns 0 (no match) for either normal and inverse match because matching in the target hardware address in ARPHRD_IEEE1394 has never been supported by arptables. This is intentional, matching on the target hardware address should never evaluate true for ARPHRD_IEEE1394. Moreover, adjust arpt_mangle to drop the packet too as AI suggests: In arpt_mangle, the logic assumes a standard ARP layout. Because IEEE1394 (FireWire) omits the target hardware address, the linear pointer arithmetic miscalculates the offset for the target IP address. This causes mangling operations to write to the wrong location, leading to packet corruption. To ensure safety, this patch drops packets (NF_DROP) when mangling is requested for these fields on IEEE1394 devices, as the current implementation cannot correctly map the FireWire ARP payload. This omits both mangling target hardware and IP address. Even if IP address mangling should be possible in IEEE1394, this would require to adjust arpt_mangle offset calculation, which has never been supported. Based on patch from Weiming Shi <bestswngs@gmail.com>. Fixes: 6752c8db8e0c ("firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection.") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daystipc: fix double-free in tipc_buf_append()Lee Jones1-1/+13
[ Upstream commit d293ca716e7d5dffdaecaf6b9b2f857a33dc3d3a ] tipc_msg_validate() can potentially reallocate the skb it is validating, freeing the old one. In tipc_buf_append(), it was being called with a pointer to a local variable which was a copy of the caller's skb pointer. If the skb was reallocated and validation subsequently failed, the error handling path would free the original skb pointer, which had already been freed, leading to double-free. Fix this by checking if head now points to a newly allocated reassembled skb. If it does, reassign *headbuf for later freeing operations. Fixes: d618d09a68e4 ("tipc: enforce valid ratio between skb truesize and contents") Suggested-by: Tung Nguyen <tung.quang.nguyen@est.tech> Signed-off-by: Lee Jones <lee@kernel.org> Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
5 daysnet/sched: sch_sfb: annotate data-races in sfb_dump_stats()Eric Dumazet1-22/+32
[ Upstream commit 1ada03fdef82d3d7d2edb9dcd3acc91917675e48 ] sfb_dump_stats() only runs with RTNL held, reading fields that can be changed in qdisc fast path. Add READ_ONCE()/WRITE_ONCE() annotations. Alternative would be to acquire the qdisc spinlock, but our long-term goal is to make qdisc dump operations lockless as much as we can. tc_sfb_xstats fields don't need to be latched atomically, otherwise this bug would have been caught earlier. Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260421141655.3953721-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>