Age | Commit message (Collapse) | Author | Files | Lines |
|
Haowei Yan <g1042620637@gmail.com> found that ets_class_from_arg() can
index an Out-Of-Bound class in ets_class_from_arg() when passed clid of
0. The overflow may cause local privilege escalation.
[ 18.852298] ------------[ cut here ]------------
[ 18.853271] UBSAN: array-index-out-of-bounds in net/sched/sch_ets.c:93:20
[ 18.853743] index 18446744073709551615 is out of range for type 'ets_class [16]'
[ 18.854254] CPU: 0 UID: 0 PID: 1275 Comm: poc Not tainted 6.12.6-dirty #17
[ 18.854821] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 18.856532] Call Trace:
[ 18.857441] <TASK>
[ 18.858227] dump_stack_lvl+0xc2/0xf0
[ 18.859607] dump_stack+0x10/0x20
[ 18.860908] __ubsan_handle_out_of_bounds+0xa7/0xf0
[ 18.864022] ets_class_change+0x3d6/0x3f0
[ 18.864322] tc_ctl_tclass+0x251/0x910
[ 18.864587] ? lock_acquire+0x5e/0x140
[ 18.865113] ? __mutex_lock+0x9c/0xe70
[ 18.866009] ? __mutex_lock+0xa34/0xe70
[ 18.866401] rtnetlink_rcv_msg+0x170/0x6f0
[ 18.866806] ? __lock_acquire+0x578/0xc10
[ 18.867184] ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[ 18.867503] netlink_rcv_skb+0x59/0x110
[ 18.867776] rtnetlink_rcv+0x15/0x30
[ 18.868159] netlink_unicast+0x1c3/0x2b0
[ 18.868440] netlink_sendmsg+0x239/0x4b0
[ 18.868721] ____sys_sendmsg+0x3e2/0x410
[ 18.869012] ___sys_sendmsg+0x88/0xe0
[ 18.869276] ? rseq_ip_fixup+0x198/0x260
[ 18.869563] ? rseq_update_cpu_node_id+0x10a/0x190
[ 18.869900] ? trace_hardirqs_off+0x5a/0xd0
[ 18.870196] ? syscall_exit_to_user_mode+0xcc/0x220
[ 18.870547] ? do_syscall_64+0x93/0x150
[ 18.870821] ? __memcg_slab_free_hook+0x69/0x290
[ 18.871157] __sys_sendmsg+0x69/0xd0
[ 18.871416] __x64_sys_sendmsg+0x1d/0x30
[ 18.871699] x64_sys_call+0x9e2/0x2670
[ 18.871979] do_syscall_64+0x87/0x150
[ 18.873280] ? do_syscall_64+0x93/0x150
[ 18.874742] ? lock_release+0x7b/0x160
[ 18.876157] ? do_user_addr_fault+0x5ce/0x8f0
[ 18.877833] ? irqentry_exit_to_user_mode+0xc2/0x210
[ 18.879608] ? irqentry_exit+0x77/0xb0
[ 18.879808] ? clear_bhb_loop+0x15/0x70
[ 18.880023] ? clear_bhb_loop+0x15/0x70
[ 18.880223] ? clear_bhb_loop+0x15/0x70
[ 18.880426] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 18.880683] RIP: 0033:0x44a957
[ 18.880851] Code: ff ff e8 fc 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 8974 24 10
[ 18.881766] RSP: 002b:00007ffcdd00fad8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 18.882149] RAX: ffffffffffffffda RBX: 00007ffcdd010db8 RCX: 000000000044a957
[ 18.882507] RDX: 0000000000000000 RSI: 00007ffcdd00fb70 RDI: 0000000000000003
[ 18.885037] RBP: 00007ffcdd010bc0 R08: 000000000703c770 R09: 000000000703c7c0
[ 18.887203] R10: 0000000000000080 R11: 0000000000000246 R12: 0000000000000001
[ 18.888026] R13: 00007ffcdd010da8 R14: 00000000004ca7d0 R15: 0000000000000001
[ 18.888395] </TASK>
[ 18.888610] ---[ end trace ]---
Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc")
Reported-by: Haowei Yan <g1042620637@gmail.com>
Suggested-by: Haowei Yan <g1042620637@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250111145740.74755-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a user TGT ticket expired, gssd returns EKEYEXPIRED to the RPC
layer for the upcall to create the security context. The RPC layer
then retries the upcall twice before returning the EKEYEXPIRED to
the NFS layer.
This results in three separate TCP connections to the NFS server being
created by gssd for each RPC request. These connections are not used
and left in TIME_WAIT state.
Note that for RPC call that uses machine credential, gssd automatically
renews the ticket. But for a regular user the ticket needs to be
renewed by the user before access to the krb5 share is allowed.
This patch removes the retries by RPC on EKEYEXPIRED so that these
unused TCP connections are not created.
Reproducer:
$ kinit -l 1m
$ sleep 65
$ cd /mnt/krb5share
$ netstat -na |grep TIME_WAIT
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
The output format should provide a value that matches the one in
the /proc/<pid>/ns/net symlink. This makes it simpler to match the
rpc_xprt and rpc_clnt to a particular container.
Also, when the xprt defines the get_srcaddr operation, use that to
display the source address as well.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"This is slightly smaller than usual, with the most interesting work
being still around RTNL scope reduction.
Core:
- More core refactoring to reduce the RTNL lock contention, including
preparatory work for the per-network namespace RTNL lock, replacing
RTNL lock with a per device-one to protect NAPI-related net device
data and moving synchronize_net() calls outside such lock.
- Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge
and more specific TCP coverage.
- Reduce network namespace tear-down time by removing per-subsystems
synchronize_net() in tipc and sched.
- Add flow label selector support for fib rules, allowing traffic
redirection based on such header field.
Netfilter:
- Do not remove netdev basechain when last device is gone, allowing
netdev basechains without devices.
- Revisit the flowtable teardown strategy, dealing better with fin,
reset and re-open events.
- Scale-up IP-vs connection dumping by avoiding linear search on each
restart.
Protocols:
- A significant XDP socket refactor, consolidating and optimizing
several helpers into the core
- Better scaling of ICMP rate-limiting, by removing false-sharing in
inet peers handling.
- Introduces netlink notifications for multicast IPv4 and IPv6
address changes.
- Add ipsec support for IP-TFS/AggFrag encapsulation, allowing
aggregation and fragmentation of the inner IP.
- Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to
avoid local port exhaustion issues when the average connection
lifetime is very short.
- Support updating keys (re-keying) for connections using kernel TLS
(for TLS 1.3 only).
- Support ipv4-mapped ipv6 address clients in smc-r v2.
- Add support for jumbo data packet transmission in RxRPC sockets,
gluing multiple data packets in a single UDP packet.
- Support RxRPC RACK-TLP to manage packet loss and retransmission in
conjunction with the congestion control algorithm.
Driver API:
- Introduce a unified and structured interface for reporting PHY
statistics, exposing consistent data across different H/W via
ethtool.
- Make timestamping selectable, allow the user to select the desired
hwtstamp provider (PHY or MAC) administratively.
- Add support for configuring a header-data-split threshold (HDS)
value via ethtool, to deal with partial or buggy H/W
implementation.
- Consolidate DSA drivers Energy Efficiency Ethernet support.
- Add EEE management to phylink, making use of the phylib
implementation.
- Add phylib support for in-band capabilities negotiation.
- Simplify how phylib-enabled mac drivers expose the supported
interfaces.
Tests and tooling:
- Make the YNL tool package-friendly to make it easier to deploy it
separately from the kernel.
- Increase TCP selftest coverage importing several packetdrill
test-cases.
- Regenerate the ethtool uapi header from the YNL spec, to ease
maintenance and future development.
- Add YNL support for decoding the link types used in net self-tests,
allowing a single build to run both net and drivers/net.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- add cross E-Switch QoS support
- add SW Steering support for ConnectX-8
- implement support for HW-Managed Flow Steering, improving the
rule deletion/insertion rate
- support for multi-host LAG
- Intel (ixgbe, ice, igb):
- ice: add support for devlink health events
- ixgbe: add initial support for E610 chipset variant
- igb: add support for AF_XDP zero-copy
- Meta:
- add support for basic RSS config
- allow changing the number of channels
- add hardware monitoring support
- Broadcom (bnxt):
- implement TCP data split and HDS threshold ethtool support,
enabling Device Memory TCP.
- Marvell Octeon:
- implement egress ipsec offload support for the cn10k family
- Hisilicon (HIBMC):
- implement unicast MAC filtering
- Ethernet NICs embedded and virtual:
- Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding
contented atomic operations for drop counters
- Freescale:
- quicc: phylink conversion
- enetc: support Tx and Rx checksum offload and improve TSO
performances
- MediaTek:
- airoha: introduce support for ETS and HTB Qdisc offload
- Microchip:
- lan78XX USB: preparation work for phylink conversion
- Synopsys (stmmac):
- support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45
- refactor EEE support to leverage the new driver API
- optimize DMA and cache access to increase raw RX performances
by 40%
- TI:
- icssg-prueth: add multicast filtering support for VLAN
interface
- netkit:
- add ability to configure head/tailroom
- VXLAN:
- accepts packets with user-defined reserved bit
- Ethernet switches:
- Microchip:
- lan969x: add RGMII support
- lan969x: improve TX and RX performance using the FDMA engine
- nVidia/Mellanox:
- move Tx header handling to PCI driver, to ease XDP support
- Ethernet PHYs:
- Texas Instruments DP83822:
- add support for GPIO2 clock output
- Realtek:
- 8169: add support for RTL8125D rev.b
- rtl822x: add hwmon support for the temperature sensor
- Microchip:
- add support for RDS PTP hardware
- consolidate periodic output signal generation
- CAN:
- several DT-bindings to DT schema conversions
- tcan4x5x:
- add HW standby support
- support nWKRQ voltage selection
- kvaser:
- allowing Bus Error Reporting runtime configuration
- WiFi:
- the on-going Multi-Link Operation (MLO) effort continues,
affecting both the stack and in drivers
- mac80211/cfg80211:
- Emergency Preparedness Communication Services (EPCS) station
mode support
- support for adding and removing station links for MLO
- add support for WiFi 7/EHT mesh over 320 MHz channels
- report Tx power info for each link
- RealTek (rtw88):
- enable USB Rx aggregation and USB 3 to improve performance
- LED support
- RealTek (rtw89):
- refactor power save to support Multi-Link Operations
- add support for RTL8922AE-VS variant
- MediaTek (mt76):
- single wiphy multiband support (preparation for MLO)
- p2p device support
- add TP-Link TXE50UH USB adapter support
- Qualcomm (ath10k):
- support for the QCA6698AQ IP core
- Qualcomm (ath12k):
- enable MLO for QCN9274
- Bluetooth:
- Allow sysfs to trigger hdev reset, to allow recovering devices
not responsive from user-space
- MediaTek: add support for MT7922, MT7925, MT7921e devices
- Realtek: add support for RTL8851BE devices
- Qualcomm: add support for WCN785x devices
- ISO: allow BIG re-sync"
* tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits)
net/rose: prevent integer overflows in rose_setsockopt()
net: phylink: fix regression when binding a PHY
net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup
net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup
net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path
ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL.
ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL.
ipv6: Move lifetime validation to inet6_rtm_newaddr().
ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().
ipv6: Pass dev to inet6_addr_add().
ipv6: Convert inet6_ioctl() to per-netns RTNL.
ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().
ipv6: Hold rtnl_net_lock() in addrconf_dad_work().
ipv6: Hold rtnl_net_lock() in addrconf_verify_work().
ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.
ipv6: Add __in6_dev_get_rtnl_net().
net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags
net: mii: Fix the Speed display when the network cable is not connected
sysctl net: Remove macro checks for CONFIG_SYSCTL
eth: bnxt: update header sizing defaults
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Improved handling of LSM "secctx" strings through lsm_context struct
The LSM secctx string interface is from an older time when only one
LSM was supported, migrate over to the lsm_context struct to better
support the different LSMs we now have and make it easier to support
new LSMs in the future.
These changes explain the Rust, VFS, and networking changes in the
diffstat.
- Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are
enabled
Small tweak to be a bit smarter about when we build the LSM's common
audit helpers.
- Check for absurdly large policies from userspace in SafeSetID
SafeSetID policies rules are fairly small, basically just "UID:UID",
it easy to impose a limit of KMALLOC_MAX_SIZE on policy writes which
helps quiet a number of syzbot related issues. While work is being
done to address the syzbot issues through other mechanisms, this is a
trivial and relatively safe fix that we can do now.
- Various minor improvements and cleanups
A collection of improvements to the kernel selftests, constification
of some function parameters, removing redundant assignments, and
local variable renames to improve readability.
* tag 'lsm-pr-20250121' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lockdown: initialize local array before use to quiet static analysis
safesetid: check size of policy writes
net: corrections for security_secid_to_secctx returns
lsm: rename variable to avoid shadowing
lsm: constify function parameters
security: remove redundant assignment to return variable
lsm: Only build lsm_audit.c if CONFIG_SECURITY and CONFIG_AUDIT are set
selftests: refactor the lsm `flags_overset_lsm_set_self_attr` test
binder: initialize lsm_context structure
rust: replace lsm context+len with lsm_context
lsm: secctx provider check on release
lsm: lsm_context in security_dentry_init_security
lsm: use lsm_context in security_inode_getsecctx
lsm: replace context+len with lsm_context
lsm: ensure the correct LSM context releaser
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
"Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never
execute relevant code on any other CPU. This is currently handled
by smpboot code which takes care of CPU-hotplug operations.
Affinity here is a correctness constraint.
2) Some kthreads _have_ to be affine to a specific set of CPUs and
can't run anywhere else. The affinity is set through
kthread_bind_mask() and the subsystem takes care by itself to
handle CPU-hotplug operations. Affinity here is assumed to be a
correctness constraint.
3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
This is not a correctness constraint but merely a preference in
terms of memory locality. kswapd and kcompactd both fall into this
category. The affinity is set manually like for any other task and
CPU-hotplug is supposed to be handled by the relevant subsystem so
that the task is properly reaffined whenever a given CPU from the
node comes up. Also care should be taken so that the node affinity
doesn't cross isolated (nohz_full) cpumask boundaries.
4) Similar to the previous point except kthreads have a _preferred_
affinity different than a node. Both RCU boost kthreads and RCU
exp kworkers fall into this category as they refer to "RCU nodes"
from a distinctly distributed tree.
Currently the preferred affinity patterns (3 and 4) have at least 4
identified users, with more or less success when it comes to handle
CPU-hotplug operations and CPU isolation. Each of which do it in its
own ad-hoc way.
This is an infrastructure proposal to handle this with the following
API changes:
- kthread_create_on_node() automatically affines the created kthread
to its target node unless it has been set as per-cpu or bound with
kthread_bind[_mask]() before the first wake-up.
- kthread_affine_preferred() is a new function that can be called
right after kthread_create_on_node() to specify a preferred
affinity different than the specified node.
When the preferred affinity can't be applied because the possible
targets are offline or isolated (nohz_full), the kthread is affine to
the housekeeping CPUs (which means to all online CPUs most of the time
or only the non-nohz_full CPUs when nohz_full= is set).
kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
converted, along with a few old drivers.
Summary of the changes:
- Consolidate a bunch of ad-hoc implementations of
kthread_run_on_cpu()
- Introduce task_cpu_fallback_mask() that defines the default last
resort affinity of a task to become nohz_full aware
- Add some correctness check to ensure kthread_bind() is always
called before the first kthread wake up.
- Default affine kthread to its preferred node.
- Convert kswapd / kcompactd and remove their halfway working ad-hoc
affinity implementation
- Implement kthreads preferred affinity
- Unify kthread worker and kthread API's style
- Convert RCU kthreads to the new API and remove the ad-hoc affinity
implementation"
* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
kthread: modify kernel-doc function name to match code
rcu: Use kthread preferred affinity for RCU exp kworkers
treewide: Introduce kthread_run_worker[_on_cpu]()
kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
rcu: Use kthread preferred affinity for RCU boost
kthread: Implement preferred affinity
mm: Create/affine kswapd to its preferred node
mm: Create/affine kcompactd to its preferred node
kthread: Default affine kthread to its preferred NUMA node
kthread: Make sure kthread hasn't started while binding it
sched,arm64: Handle CPU isolation on last resort fallback rq selection
arm64: Exclude nohz_full CPUs from 32bits el0 support
lib: test_objpool: Use kthread_run_on_cpu()
kallsyms: Use kthread_run_on_cpu()
soc/qman: test: Use kthread_run_on_cpu()
arm/bL_switcher: Use kthread_run_on_cpu()
|
|
Commit ec596aaf9b48 ("SUNRPC: Remove code behind
CONFIG_RPCSEC_GSS_KRB5_SIMPLIFIED") was the last user of the
gss_decrypt_xdr_buf() and gss_encrypt_xdr_buf() functions.
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Commit ec596aaf9b48 ("SUNRPC: Remove code behind
CONFIG_RPCSEC_GSS_KRB5_SIMPLIFIED") was the last user of the routines
in gss_generic_token.c.
Remove the routines and associated header.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
xprt_iter_get_xprt() was added by
commit 80b14d5e61ca ("SUNRPC: Add a structure to track multiple
transports") but is unused.
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
I noticed that a handful of NFSv3 fstests were taking an
unexpectedly long time to run. Troubleshooting showed that the
server's TCP window closed and never re-opened, which caused the
client to trigger an RPC retransmit timeout after 180 seconds.
The client's recovery action was to establish a fresh connection
and retransmit the timed-out requests. This worked, but it adds a
long delay.
I tracked the problem to the commit that attempted to reduce the
rate at which the network layer delivers TCP socket data_ready
callbacks. Under most circumstances this change worked as expected,
but for NFSv3, which has no session or other type of throttling, it
can overwhelm the receiver on occasion.
I'm sure I could tweak the lowat settings, but the small benefit
doesn't seem worth the bother. Just revert it.
Fixes: 2b877fc53e97 ("SUNRPC: Reduce thread wake-up rate when receiving large RPC messages")
Cc: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
No conflicts and no adjacent changes.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In case of possible unpredictably large arguments passed to
rose_setsockopt() and multiplied by extra values on top of that,
integer overflows may occur.
Do the safest minimum and fix these issues by checking the
contents of 'opt' and returning -EINVAL if they are too large. Also,
switch to unsigned int and remove useless check for negative 'opt'
in ROSE_IDLE case.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru>
Link: https://patch.msgid.link/20250115164220.19954-1-n.zhandarovich@fintech.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Let's register inet6_rtm_deladdr() with RTNL_FLAG_DOIT_PERNET and
hold rtnl_net_lock() before inet6_addr_del().
Now that inet6_addr_del() is always called under per-netns RTNL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-12-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Let's register inet6_rtm_newaddr() with RTNL_FLAG_DOIT_PERNET
and hold rtnl_net_lock() before __dev_get_by_index().
Now that inet6_addr_add() and inet6_addr_modify() are always
called under per-netns RTNL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
inet6_addr_add() and inet6_addr_modify() have the same code to validate
IPv6 lifetime that is done under RTNL.
Let's factorise it out to inet6_rtm_newaddr() so that we can validate
the lifetime without RTNL later.
Note that inet6_addr_add() is called from addrconf_add_ifaddr(), but the
lifetime is INFINITY_LIFE_TIME in the path, so expires and flags are 0.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will convert inet6_rtm_newaddr() to per-netns RTNL.
Except for IFA_F_OPTIMISTIC, cfg.ifa_flags can be set before
__dev_get_by_index().
Let's move ifa_flags setup before __dev_get_by_index() so that
we can set ifa_flags without RTNL.
Also, now it's moved before tb[IFA_CACHEINFO] in preparing for
the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
inet6_addr_add() is called from inet6_rtm_newaddr() and
addrconf_add_ifaddr().
inet6_addr_add() looks up dev by __dev_get_by_index(), but
it's already done in inet6_rtm_newaddr().
Let's move the 2nd lookup to addrconf_add_ifaddr() and pass
dev to inet6_addr_add().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
These functions are called from inet6_ioctl() with a socket's netns
and hold RTNL.
* SIOCSIFADDR : addrconf_add_ifaddr()
* SIOCDIFADDR : addrconf_del_ifaddr()
* SIOCSIFDSTADDR : addrconf_set_dstaddr()
Let's use rtnl_net_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
addrconf_init() holds RTNL for blackhole_netdev, which is the global
device in init_net.
addrconf_cleanup() holds RTNL to clean up devices in init_net too.
Let's use rtnl_net_lock(&init_net) there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
addrconf_dad_work() is per-address work and holds RTNL internally.
We can fetch netns as dev_net(ifp->idev->dev).
Let's use rtnl_net_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
addrconf_verify_work() is per-netns work to call addrconf_verify_rtnl()
under RTNL.
Let's use rtnl_net_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net.ipv6.conf.${DEV}.XXX sysctl are changed under RTNL:
* forwarding
* ignore_routes_with_linkdown
* disable_ipv6
* proxy_ndp
* addr_gen_mode
* stable_secret
* disable_policy
Let's use rtnl_net_lock() there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115080608.28127-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since dccp and llc makefiles already check sysctl code
compilation with xxx-$(CONFIG_SYSCTL)
we can drop the checks
Signed-off-by: Denis Kirjanov <kirjanov@gmail.com>
Link: https://patch.msgid.link/20250119134254.19250-1-kirjanov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
1) Unbreak set size settings for rbtree set backend, intervals in
rbtree are represented as two elements, this detailed is leaked
to userspace leading to bogus ENOSPC from control plane.
2) Remove dead code in br_netfilter's br_nf_pre_routing_finish()
due to never matching error when looking up for route,
from Antoine Tenart.
3) Simplify check for device already in use in flowtable,
from Phil Sutter.
4) Three patches to restore interface name field in struct nft_hook
and use it, this is to prepare for wildcard interface support.
From Phil Sutter.
5) Do not remove netdev basechain when last device is gone, this is
for consistency with the flowtable behaviour. This allows for netdev
basechains without devices. Another patch to simplify netdev event
notifier after this update. Also from Phil.
6) Two patches to add missing spinlock when flowtable updates TCP
state flags, from Florian Westphal.
7) Simplify __nf_ct_refresh_acct() by removing skbuff parameter,
also from Florian.
8) Flowtable gc now extends ct timeout for offloaded flow. This
is to address a possible race that leads to handing over flow
to classic path with long ct timeouts.
9) Tear down flow if cached rt_mtu is stale, before this patch,
packet is handed over to classic path but flow entry still remained
in place.
10) Revisit the flowtable teardown strategy, which was originally
designed to release flowtable hardware entries early. Add a new
CLOSING flag that still allows hardware to release entries when
fin/rst is seen, but keeps the flow entry in place when the
TCP connection is closed. Release flow after timeout or when a new
syn packet is seen for TCP reopen scenario.
* tag 'nf-next-25-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: flowtable: add CLOSING state
netfilter: flowtable: teardown flow if cached mtu is stale
netfilter: conntrack: rework offload nf_conn timeout extension logic
netfilter: conntrack: remove skb argument from nf_ct_refresh
netfilter: nft_flow_offload: update tcp state flags under lock
netfilter: nft_flow_offload: clear tcp MAXACK flag before moving to slowpath
netfilter: nf_tables: Simplify chain netdev notifier
netfilter: nf_tables: Tolerate chains with no remaining hooks
netfilter: nf_tables: Compare netdev hooks based on stored name
netfilter: nf_tables: Use stored ifname in netdev hook dumps
netfilter: nf_tables: Store user-defined hook ifname
netfilter: nf_tables: Flowtable hook's pf value never varies
netfilter: br_netfilter: remove unused conditional and dead code
netfilter: nf_tables: fix set size with rbtree backend
====================
Link: https://patch.msgid.link/20250119172051.8261-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The core has the current HDS config, it can pre-populate the values
for the drivers. While at it, remove the zero-setting in netdevsim.
Zero are the default values since the config is zalloc'ed.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Record the pending configuration in net_device struct.
ethtool core duplicates the current config and the specific
handlers (for now just ringparam) can modify it.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For ease of review of the next patch store the dev pointer
on the stack, instead of referring to req_info.dev every time.
No functional changes.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Separate the HDS config from the ethtool state struct.
The HDS config contains just simple parameters, not state.
Having it as a separate struct will make it easier to clone / copy
and also long term potentially make it per-queue.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250119020518.1962249-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is based on Donald Hunter's patch.
These functions could fail for various reasons, sometimes
triggering kfree_skb().
* unix_stream_connect() : connect()
* unix_stream_sendmsg() : sendmsg()
* queue_oob() : sendmsg(MSG_OOB)
* unix_dgram_sendmsg() : sendmsg()
Such kfree_skb() is tied to the errno of connect() and
sendmsg(), and we need not define skb drop reasons.
Let's use consume_skb() not to churn kfree_skb() events.
Link: https://lore.kernel.org/netdev/eb30b164-7f86-46bf-a5d3-0f8bda5e9398@redhat.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a follow-up of commit d460b04bc452 ("af_unix: Clean up
error paths in unix_stream_sendmsg().").
If we initialise skb with NULL in unix_stream_sendmsg(), we can
reuse the existing out_pipe label for the SEND_SHUTDOWN check.
Let's rename it and adjust the existing label as out_pipe_lock.
While at it, size and data_len are moved to the while loop scope.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_dgram_disconnected() is called from two places:
1. when a connect()ed socket dis-connect()s or re-connect()s to
another socket
2. when sendmsg() fails because the peer socket that the client
has connect()ed to has been close()d
Then, the client's recv queue is purged to remove all messages from
the old peer socket.
Let's define a new drop reason for that case.
# echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable
# python3
>>> from socket import *
>>>
>>> # s1 has a message from s2
>>> s1, s2 = socketpair(AF_UNIX, SOCK_DGRAM)
>>> s2.send(b'hello world')
>>>
>>> # re-connect() drops the message from s2
>>> s3 = socket(AF_UNIX, SOCK_DGRAM)
>>> s3.bind('')
>>> s1.connect(s3.getsockname())
# cat /sys/kernel/tracing/trace_pipe
python3-250 ... kfree_skb: ... location=skb_queue_purge_reason+0xdc/0x110 reason: UNIX_DISCONNECT
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_stream_read_skb() is called when BPF SOCKMAP reads some data
from a socket in the map.
SOCKMAP does not support MSG_OOB, and reading OOB results in a drop.
Let's set drop reasons respectively.
* SOCKET_CLOSE : the socket in SOCKMAP was close()d
* UNIX_SKIP_OOB : OOB was read from the socket in SOCKMAP
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
AF_UNIX SOCK_STREAM socket supports MSG_OOB.
When OOB data is sent to a socket, recv() will break at that point.
If the next recv() does not have MSG_OOB, the normal data following
the OOB data is returned.
Then, the OOB skb is dropped.
Let's define a new drop reason for that case in manage_oob().
# echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable
# python3
>>> from socket import *
>>> s1, s2 = socketpair(AF_UNIX)
>>> s1.send(b'a', MSG_OOB)
>>> s1.send(b'b')
>>> s2.recv(2)
b'b'
# cat /sys/kernel/tracing/trace_pipe
...
python3-223 ... kfree_skb: ... location=unix_stream_read_generic+0x59e/0xc20 reason: UNIX_SKIP_OOB
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Inflight file descriptors by SCM_RIGHTS hold references to the
struct file.
AF_UNIX sockets could hold references to each other, forming
reference cycles.
Once such sockets are close()d without the fd recv()ed, they
will be unaccessible from userspace but remain in kernel.
__unix_gc() garbage-collects skb with the dead file descriptors
and frees them by __skb_queue_purge().
Let's set SKB_DROP_REASON_SOCKET_CLOSE there.
# echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable
# python3
>>> from socket import *
>>> from array import array
>>>
>>> # Create a reference cycle
>>> s1 = socket(AF_UNIX, SOCK_DGRAM)
>>> s1.bind('')
>>> s1.sendmsg([b"nop"], [(SOL_SOCKET, SCM_RIGHTS, array("i", [s1.fileno()]))], 0, s1.getsockname())
>>> s1.close()
>>>
>>> # Trigger GC
>>> s2 = socket(AF_UNIX)
>>> s2.close()
# cat /sys/kernel/tracing/trace_pipe
...
kworker/u16:2-42 ... kfree_skb: ... location=__unix_gc+0x4ad/0x580 reason: SOCKET_CLOSE
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_sock_destructor() is called as sk->sk_destruct() just before
the socket is actually freed.
Let's use SKB_DROP_REASON_SOCKET_CLOSE for skb_queue_purge().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
unix_release_sock() is called when the last refcnt of struct file
is released.
Let's define a new drop reason SKB_DROP_REASON_SOCKET_CLOSE and
set it for kfree_skb() in unix_release_sock().
# echo 1 > /sys/kernel/tracing/events/skb/kfree_skb/enable
# python3
>>> from socket import *
>>> s1, s2 = socketpair(AF_UNIX)
>>> s1.send(b'hello world')
>>> s2.close()
# cat /sys/kernel/tracing/trace_pipe
...
python3-280 ... kfree_skb: ... protocol=0 location=unix_release_sock+0x260/0x420 reason: SOCKET_CLOSE
To be precise, unix_release_sock() is also called for a new child
socket in unix_stream_connect() when something fails, but the new
sk does not have skb in the recv queue then and no event is logged.
Note that only tcp_inbound_ao_hash() uses a similar drop reason,
SKB_DROP_REASON_TCP_CLOSE, and this can be generalised later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250116053441.5758-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I noticed that HyStart incorrectly marks the start of rounds,
leading to inaccurate measurements of ACK train lengths and
resetting the `ca->sample_cnt` variable. This inaccuracy can impact
HyStart's functionality in terminating exponential cwnd growth during
Slow-Start, potentially degrading TCP performance.
The issue arises because the changes introduced in commit 4e1fddc98d25
("tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows")
moved the caller of the `bictcp_hystart_reset` function inside the `hystart_update` function.
This modification added an additional condition for triggering the caller,
requiring that (tcp_snd_cwnd(tp) >= hystart_low_window) must also
be satisfied before invoking `bictcp_hystart_reset`.
This fix ensures that `bictcp_hystart_reset` is correctly called
at the start of a new round, regardless of the congestion window size.
This is achieved by moving the condition
(tcp_snd_cwnd(tp) >= hystart_low_window)
from before calling `bictcp_hystart_reset` to after it.
I tested with a client and a server connected through two Linux software routers.
In this setup, the minimum RTT was 150 ms, the bottleneck bandwidth was 50 Mbps,
and the bottleneck buffer size was 1 BDP, calculated as (50M / 1514 / 8) * 0.150 = 619 packets.
I conducted the test twice, transferring data from the server to the client for 1.5 seconds.
Before the patch was applied, HYSTART-DELAY stopped the exponential growth of cwnd when
cwnd = 516, and the bottleneck link was not yet saturated (516 < 619).
After the patch was applied, HYSTART-ACK-TRAIN stopped the exponential growth of cwnd when
cwnd = 632, and the bottleneck link was saturated (632 > 619).
In this test, applying the patch resulted in 300 KB more data delivered.
Fixes: 4e1fddc98d25 ("tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows")
Signed-off-by: Mahdi Arghavani <ma.arghavani@yahoo.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Haibo Zhang <haibo.zhang@otago.ac.nz>
Cc: David Eyers <david.eyers@otago.ac.nz>
Cc: Abbas Arghavani <abbas.arghavani@mdu.se>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
On a 32bit system the "keylen + sizeof(struct tipc_aead_key)" math could
have an integer wrapping issue. It doesn't matter because the "keylen"
is checked on the next line, but just to make life easier for static
analysis tools, let's re-order these conditions and avoid the integer
overflow.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
aarp_send_probe_phase1() used to work by calling ndo_do_ioctl of
appletalk drivers ltpc or cops, but these two drivers have been removed
since the following commits:
commit 03dcb90dbf62 ("net: appletalk: remove Apple/Farallon LocalTalk PC
support")
commit 00f3696f7555 ("net: appletalk: remove cops support")
Thus aarp_send_probe_phase1() no longer works, so drop it. (found by
code inspection)
Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch addresses issues with filter counting in block (tcf_block),
particularly for software bypass scenarios, by introducing a more
accurate mechanism using useswcnt.
Previously, filtercnt and skipswcnt were introduced by:
Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and
Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter")
filtercnt tracked all tp (tcf_proto) objects added to a block, and
skipswcnt counted tp objects with the skipsw attribute set.
The problem is: a single tp can contain multiple filters, some with skipsw
and others without. The current implementation fails in the case:
When the first filter in a tp has skipsw, both skipswcnt and filtercnt
are incremented, then adding a second filter without skipsw to the same
tp does not modify these counters because tp->counted is already set.
This results in bypass software behavior based solely on skipswcnt
equaling filtercnt, even when the block includes filters without
skipsw. Consequently, filters without skipsw are inadvertently bypassed.
To address this, the patch introduces useswcnt in block to explicitly count
tp objects containing at least one filter without skipsw. Key changes
include:
Whenever a filter without skipsw is added, its tp is marked with usesw
and counted in useswcnt. tc_run() now uses useswcnt to determine software
bypass, eliminating reliance on filtercnt and skipswcnt.
This refined approach prevents software bypass for blocks containing
mixed filters, ensuring correct behavior in tc_run().
Additionally, as atomic operations on useswcnt ensure thread safety and
tp->lock guards access to tp->usesw and tp->counted, the broader lock
down_write(&block->cb_lock) is no longer required in tc_new_tfilter(),
and this resolves a performance regression caused by the filter counting
mechanism during parallel filter insertions.
The improvement can be demonstrated using the following script:
# cat insert_tc_rules.sh
tc qdisc add dev ens1f0np0 ingress
for i in $(seq 16); do
taskset -c $i tc -b rules_$i.txt &
done
wait
Each of rules_$i.txt files above includes 100000 tc filter rules to a
mlx5 driver NIC ens1f0np0.
Without this patch:
# time sh insert_tc_rules.sh
real 0m50.780s
user 0m23.556s
sys 4m13.032s
With this patch:
# time sh insert_tc_rules.sh
real 0m17.718s
user 0m7.807s
sys 3m45.050s
Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
xfrm assumed to always have a full socket at skb->sk.
This is not always true, so fix it by converting to a
full socket before it is used.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
|
|
tcp rst/fin packet triggers an immediate teardown of the flow which
results in sending flows back to the classic forwarding path.
This behaviour was introduced by:
da5984e51063 ("netfilter: nf_flow_table: add support for sending flows back to the slow path")
b6f27d322a0a ("netfilter: nf_flow_table: tear down TCP flows if RST or FIN was seen")
whose goal is to expedite removal of flow entries from the hardware
table. Before these patches, the flow was released after the flow entry
timed out.
However, this approach leads to packet races when restoring the
conntrack state as well as late flow re-offload situations when the TCP
connection is ending.
This patch adds a new CLOSING state that is is entered when tcp rst/fin
packet is seen. This allows for an early removal of the flow entry from
the hardware table. But the flow entry still remains in software, so tcp
packets to shut down the flow are not sent back to slow path.
If syn packet is seen from this new CLOSING state, then this flow enters
teardown state, ct state is set to TCP_CONNTRACK_CLOSE state and packet
is sent to slow path, so this TCP reopen scenario can be handled by
conntrack. TCP_CONNTRACK_CLOSE provides a small timeout that aims at
quickly releasing this stale entry from the conntrack table.
Moreover, skip hardware re-offload from flowtable software packet if the
flow is in CLOSING state.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Tear down the flow entry in the unlikely case that the interface mtu
changes, this gives the flow a chance to refresh the cached mtu,
otherwise such refresh does not occur until flow entry expires.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Offload nf_conn entries may not see traffic for a very long time.
To prevent incorrect 'ct is stale' checks during nf_conntrack table
lookup, the gc worker extends the timeout nf_conn entries marked for
offload to a large value.
The existing logic suffers from a few problems.
Garbage collection runs without locks, its unlikely but possible
that @ct is removed right after the 'offload' bit test.
In that case, the timeout of a new/reallocated nf_conn entry will
be increased.
Prevent this by obtaining a reference count on the ct object and
re-check of the confirmed and offload bits.
If those are not set, the ct is being removed, skip the timeout
extension in this case.
Parallel teardown is also problematic:
cpu1 cpu2
gc_worker
calls flow_offload_teardown()
tests OFFLOAD bit, set
clear OFFLOAD bit
ct->timeout is repaired (e.g. set to timeout[UDP_CT_REPLIED])
nf_ct_offload_timeout() called
expire value is fetched
<INTERRUPT>
-> NF_CT_DAY timeout for flow that isn't offloaded
(and might not see any further packets).
Use cmpxchg: if ct->timeout was repaired after the 2nd 'offload bit' test
passed, then ct->timeout will only be updated of ct->timeout was not
altered in between.
As we already have a gc worker for flowtable entries, ct->timeout repair
can be handled from the flowtable gc worker.
This avoids having flowtable specific logic in the conntrack core
and avoids checking entries that were never offloaded.
This allows to remove the nf_ct_offload_timeout helper.
Its safe to use in the add case, but not on teardown.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Its not used (and could be NULL), so remove it.
This allows to use nf_ct_refresh in places where we don't have
an skb without having to double-check that skb == NULL would be safe.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The conntrack entry is already public, there is a small chance that another
CPU is handling a packet in reply direction and racing with the tcp state
update.
Move this under ct spinlock.
This is done once, when ct is about to be offloaded, so this should
not result in a noticeable performance hit.
Fixes: 8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This state reset is racy, no locks are held here.
Since commit
8437a6209f76 ("netfilter: nft_flow_offload: set liberal tracking mode for tcp"),
the window checks are disabled for normal data packets, but MAXACK flag
is checked when validating TCP resets.
Clear the flag so tcp reset validation checks are ignored.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
With conditional chain deletion gone, callback code simplifies: Instead
of filling an nft_ctx object, just pass basechain to the per-chain
function. Also plain list_for_each_entry() is safe now.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Do not drop a netdev-family chain if the last interface it is registered
for vanishes. Users dumping and storing the ruleset upon shutdown to
restore it upon next boot may otherwise lose the chain and all contained
rules. They will still lose the list of devices, a later patch will fix
that. For now, this aligns the event handler's behaviour with that for
flowtables.
The controversal situation at netns exit should be no problem here:
event handler will unregister the hooks, core nftables cleanup code will
drop the chain itself.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The 1:1 relationship between nft_hook and nf_hook_ops is about to break,
so choose the stored ifname to uniquely identify hooks.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|