kernel/linux.git/net/ipv4/netfilter, branch v6.18.33

netfilter: arp_tables: fix IEEE1394 ARP payload parsing

2026-05-23T11:07:07+00:00

[ Upstream commit 1e8e3f449b1e73b73a843257635b9c50f0cc0f0a ] Weiming Shi says: "arp_packet_match() unconditionally parses the ARP payload assuming two hardware addresses are present (source and target). However, IPv4-over-IEEE1394 ARP (RFC 2734) omits the target hardware address field, and arp_hdr_len() already accounts for this by returning a shorter length for ARPHRD_IEEE1394 devices. As a result, on IEEE1394 interfaces arp_packet_match() advances past a nonexistent target hardware address and reads the wrong bytes for both the target device address comparison and the target IP address. This causes arptables rules to match against garbage data, leading to incorrect filtering decisions: packets that should be accepted may be dropped and vice versa. The ARP stack in net/ipv4/arp.c (arp_create and arp_process) already handles this correctly by skipping the target hardware address for ARPHRD_IEEE1394. Apply the same pattern to arp_packet_match()." Mangle the original patch to always return 0 (no match) in case user matches on the target hardware address which is never present in IEEE1394. Note that this returns 0 (no match) for either normal and inverse match because matching in the target hardware address in ARPHRD_IEEE1394 has never been supported by arptables. This is intentional, matching on the target hardware address should never evaluate true for ARPHRD_IEEE1394. Moreover, adjust arpt_mangle to drop the packet too as AI suggests: In arpt_mangle, the logic assumes a standard ARP layout. Because IEEE1394 (FireWire) omits the target hardware address, the linear pointer arithmetic miscalculates the offset for the target IP address. This causes mangling operations to write to the wrong location, leading to packet corruption. To ensure safety, this patch drops packets (NF_DROP) when mangling is requested for these fields on IEEE1394 devices, as the current implementation cannot correctly map the FireWire ARP payload. This omits both mangling target hardware and IP address. Even if IP address mangling should be possible in IEEE1394, this would require to adjust arpt_mangle offset calculation, which has never been supported. Based on patch from Weiming Shi . Fixes: 6752c8db8e0c ("firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection.") Reported-by: Xiang Mei Signed-off-by: Pablo Neira Ayuso Signed-off-by: Sasha Levin

netfilter: nat: use kfree_rcu to release ops

2026-05-23T11:07:02+00:00

[ Upstream commit 6eda0d771f94267f73f57c94630aa47e90957915 ] Florian Westphal says: "Historically this is not an issue, even for normal base hooks: the data path doesn't use the original nf_hook_ops that are used to register the callbacks. However, in v5.14 I added the ability to dump the active netfilter hooks from userspace. This code will peek back into the nf_hook_ops that are available at the tail of the pointer-array blob used by the datapath. The nat hooks are special, because they are called indirectly from the central nat dispatcher hook. They are currently invisible to the nfnl hook dump subsystem though. But once that changes the nat ops structures have to be deferred too." Update nf_nat_register_fn() to deal with partial exposition of the hooks from error path which can be also an issue for nfnetlink_hook. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso Signed-off-by: Sasha Levin

netfilter: nf_reject: don't reply to icmp error messages

2025-09-11T13:40:55+00:00

tcp reject code won't reply to a tcp reset. But the icmp reject 'netdev' family versions will reply to icmp dst-unreach errors, unlike icmp_send() and icmp6_send() which are used by the inet family implementation (and internally by the REJECT target). Check for the icmp(6) type and do not respond if its an unreachable error. Without this, something like 'ip protocol icmp reject', when used in a netdev chain attached to 'lo', cause a packet loop. Same for two hosts that both use such a rule: each error packet will be replied to. Such situation persist until the (bogus) rule is amended to ratelimit or checks the icmp type before the reject statement. As the inet versions don't do this make the netdev ones follow along. Signed-off-by: Florian Westphal

netfilter: nf_reject: remove unneeded exports

2025-09-02T13:28:17+00:00

These functions have no external callers and can be static. Signed-off-by: Florian Westphal

ipv4: Convert ->flowi4_tos to dscp_t.

2025-08-27T00:34:31+00:00

Convert the ->flowic_tos field of struct flowi_common from __u8 to dscp_t, rename it ->flowic_dscp and propagate these changes to struct flowi and struct flowi4. We've had several bugs in the past where ECN bits could interfere with IPv4 routing, because these bits were not properly cleared when setting ->flowi4_tos. These bugs should be fixed now and the dscp_t type has been introduced to ensure that variables carrying DSCP values don't accidentally have any ECN bits set. Several variables and structure fields have been converted to dscp_t already, but the main IPv4 routing structure, struct flowi4, is still using a __u8. To avoid any future regression, this patch converts it to dscp_t. There are many users to convert at once. Fortunately, around half of ->flowi4_tos users already have a dscp_t value at hand, which they currently convert to __u8 using inet_dscp_to_dsfield(). For all of these users, we just need to drop that conversion. But, although we try to do the __u8 <-> dscp_t conversions at the boundaries of the network or of user space, some places still store TOS/DSCP variables as __u8 in core networking code. Those can hardly be converted either because the data structure is part of UAPI or because the same variable or field is also used for handling ECN in other parts of the code. In all of these cases where we don't have a dscp_t variable at hand, we need to use inet_dsfield_to_dscp() when interacting with ->flowi4_dscp. Changes since v1: * Fix space alignment in __bpf_redirect_neigh_v4() (Ido). Signed-off-by: Guillaume Nault Reviewed-by: Ido Schimmel Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com Signed-off-by: Jakub Kicinski

tcp: Don't pass hashinfo to socket lookup helpers.

2025-08-26T00:53:35+00:00

These socket lookup functions required struct inet_hashinfo because they are shared by TCP and DCCP. * __inet_lookup_established() * __inet_lookup_listener() * __inet6_lookup_established() * inet6_lookup_listener() DCCP has gone, and we don't need to pass hashinfo down to them. Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the above 4 functions. Signed-off-by: Kuniyuki Iwashima Reviewed-by: Eric Dumazet Link: https://patch.msgid.link/20250822190803.540788-5-kuniyu@google.com Signed-off-by: Jakub Kicinski

netfilter: nf_reject: don't leak dst refcount for loopback packets

2025-08-21T17:02:00+00:00

recent patches to add a WARN() when replacing skb dst entry found an old bug: WARNING: include/linux/skbuff.h:1165 skb_dst_check_unset include/linux/skbuff.h:1164 [inline] WARNING: include/linux/skbuff.h:1165 skb_dst_set include/linux/skbuff.h:1210 [inline] WARNING: include/linux/skbuff.h:1165 nf_reject_fill_skb_dst+0x2a4/0x330 net/ipv4/netfilter/nf_reject_ipv4.c:234 [..] Call Trace: nf_send_unreach+0x17b/0x6e0 net/ipv4/netfilter/nf_reject_ipv4.c:325 nft_reject_inet_eval+0x4bc/0x690 net/netfilter/nft_reject_inet.c:27 expr_call_ops_eval net/netfilter/nf_tables_core.c:237 [inline] .. This is because blamed commit forgot about loopback packets. Such packets already have a dst_entry attached, even at PRE_ROUTING stage. Instead of checking hook just check if the skb already has a route attached to it. Fixes: f53b9b0bdc59 ("netfilter: introduce support for reject at prerouting stage") Signed-off-by: Florian Westphal Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de Signed-off-by: Jakub Kicinski

netfilter: add back NETFILTER_XTABLES dependencies

2025-08-07T11:19:25+00:00

Some Kconfig symbols were changed to depend on the 'bool' symbol NETFILTER_XTABLES_LEGACY, which means they can now be set to built-in when the xtables code itself is in a loadable module: x86_64-linux-ld: vmlinux.o: in function `arpt_unregister_table_pre_exit': (.text+0x1831987): undefined reference to `xt_find_table' x86_64-linux-ld: vmlinux.o: in function `get_info.constprop.0': arp_tables.c:(.text+0x1831aab): undefined reference to `xt_request_find_table_lock' x86_64-linux-ld: arp_tables.c:(.text+0x1831bea): undefined reference to `xt_table_unlock' x86_64-linux-ld: vmlinux.o: in function `do_arpt_get_ctl': arp_tables.c:(.text+0x183205d): undefined reference to `xt_find_table_lock' x86_64-linux-ld: arp_tables.c:(.text+0x18320c1): undefined reference to `xt_table_unlock' x86_64-linux-ld: arp_tables.c:(.text+0x183219a): undefined reference to `xt_recseq' Change these to depend on both NETFILTER_XTABLES and NETFILTER_XTABLES_LEGACY. Fixes: 9fce66583f06 ("netfilter: Exclude LEGACY TABLES on PREEMPT_RT.") Signed-off-by: Arnd Bergmann Acked-by: Florian Westphal Tested-by: Breno Leitao Signed-off-by: Pablo Neira Ayuso

netfilter: Exclude LEGACY TABLES on PREEMPT_RT.

2025-07-25T16:38:50+00:00

The seqcount xt_recseq is used to synchronize the replacement of xt_table::private in xt_replace_table() against all readers such as ipt_do_table() To ensure that there is only one writer, the writing side disables bottom halves. The sequence counter can be acquired recursively. Only the first invocation modifies the sequence counter (signaling that a writer is in progress) while the following (recursive) writer does not modify the counter. The lack of a proper locking mechanism for the sequence counter can lead to live lock on PREEMPT_RT if the high prior reader preempts the writer. Additionally if the per-CPU lock on PREEMPT_RT is removed from local_bh_disable() then there is no synchronisation for the per-CPU sequence counter. The affected code is "just" the legacy netfilter code which is replaced by "netfilter tables". That code can be disabled without sacrificing functionality because everything is provided by the newer implementation. This will only requires the usage of the "-nft" tools instead of the "-legacy" ones. The long term plan is to remove the legacy code so lets accelerate the progress. Relax dependencies on iptables legacy, replace select with depends on, this should cause no harm to existing kernel configs and users can still toggle IP{6}_NF_IPTABLES_LEGACY in any case. Make EBTABLES_LEGACY, IPTABLES_LEGACY and ARPTABLES depend on NETFILTER_XTABLES_LEGACY. Hide xt_recseq and its users, xt_register_table() and xt_percpu_counter_alloc() behind NETFILTER_XTABLES_LEGACY. Let NETFILTER_XTABLES_LEGACY depend on !PREEMPT_RT. This will break selftest expecing the legacy options enabled and will be addressed in a following patch. Co-developed-by: Florian Westphal Co-developed-by: Sebastian Andrzej Siewior Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Pablo Neira Ayuso

netfilter: nf_dup{4, 6}: Move duplication check to task_struct

2025-05-23T11:57:12+00:00

nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Due to the recursion involved, the simplest change is to make it a per-task variable. Move the per-CPU variable nf_skb_duplicated to task_struct and name it in_nf_duplicate. Add it to the existing bitfield so it doesn't use additional memory. Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Valentin Schneider Acked-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Pablo Neira Ayuso