| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter and IPsec.
Current release - regressions:
- do not acquire dev->tx_global_lock in netdev_watchdog_up()
- ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
- fix deadlock in nested UP notifier events
Current release - new code bugs:
- eth:
- cn20k: fix subbank free list indexing for search order
- airoha: fix BQL underflow in shared QDMA TX ring
Previous releases - regressions:
- netfilter:
- flowtable: fix offloaded ct timeout never being extended
- nf_conncount: prevent connlimit drops for early confirmed ct
Previous releases - always broken:
- require CAP_NET_ADMIN in the originating netns when modifying
cross-netns devices
- report NAPI thread PID in the caller's pid namespace
- mac802154: fix dirty frag in in-place crypto for IOT radios
- sctp: hold socket lock when dumping endpoints in sctp_diag, avoid
an overflow
- eth: gve: fix header buffer corruption with header-split and HW-GRO
- af_key: initialize alg_key_len for IPComp states, prevent OOB read"
* tag 'net-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (213 commits)
selftests: bonding: add a test for VLAN propagation over a bonded real device
vlan: defer real device state propagation to netdev_work
net: add the driver-facing netdev_work scheduling API
net: turn the rx_mode work into a generic netdev_work facility
net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
rxrpc: Fix rxrpc_rotate_tx_rotate() to check there's something to rotate
rxrpc: Fix leak of released call in recvmsg(MSG_PEEK)
rxrpc: Fix socket notification race
rxrpc: Fix potential infinite loop in rxrpc_recvmsg()
rxrpc: Fix oob challenge leak in cleanup after notification failure
rxrpc: Fix the reception of a reply packet before data transmission
afs: Fix uncancelled rxrpc OOB message handler
afs: Fix further netns teardown to cancel the preallocation charger
rxrpc: Fix double unlock in rxrpc_recvmsg()
rxrpc: Fix leak of connection from OOB challenge
rxrpc: Fix ACKALL packet handling
net: hns3: differentiate autoneg default values between copper and fiber
net: hns3: fix permanent link down deadlock after reset
net: hns3: refactor MAC autoneg and speed configuration
net: hns3: unify copper port ksettings configuration path
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Add a workaround to avoid a possible crash if nf_nat and nft_chain_nat are
compiled built-in and nf_nat fails to register, allowing nft_chain_nat to
access the incorrect pernetns area. This is crash specific of all built-in
compilation. From Matias Krause.
2) Revisit conncount GC optimization for confirmed conntracks, skip GC round
if IPS_ASSURED is set on. This is addressing an issue for corner case
use case scenario involving locally generated traffic. No crash, just a
functionality fix. From Fernando F. Mancera.
3) Validate iph->ihl in flowtable IPIP tunnel support, from Lorenzo Bianconi.
This a sanity check to bounces back malformed IPIP packets to classic
forwarding path.
4) Kdoc fixes for x_tables.h, from Randy Dunlap.
5) Use info->options so nft_synproxy_tcp_options() stays on the same local
snapshot, otherwise eval path can observe inconsistent mix of mss and
timestamps. From Runyu Xiao.
6) Add conntrack_sctp_collision.sh to cover for SCTP INIT collisions.
From Yi Chen.
7) Do not allow NFPROTO_UNSPEC targets if family is NFPROTO_BRIDGE in
nft_compat. This allows to use non-sense targets such as xt_nat leading
to crash. From Florian Westphal.
8) Add a selftest queueing from bridge family. From Florian Westphal.
9) Do not allow to reset a conntrack helper via ctnetlink. This feature
antedates the creation of the conntrack-tools, and it is not used
I don't have a usecase for it, I prefer to remove than fixing it.
10) Add deprecation warning for IPv4 only conntrack helpers for PPTP
and IRC. From Florian Westphal.
11) Store the master tuple in the expectation object and use it,
otherwise SLAB_TYPESAFE_RCU rules allow to display incorrect
master tuple information through ctnetlink.
12) Run expectation eviction when inserting an expectation with no
helper, this is a fix for the nft_ct custom expectation support.
13) Fix nft_ct custom expectation timeouts, userspace provides a
timeout in milliseconds but kernel assumes this comes in seconds.
From Florian Westphal.
14) Cap maximum number of expectations per class to 255 expectations
per master conntrack at helper registration. This is a fix to
restrict the maximum number of expectations per master conntrack
which can be a issue for the new lazy GC expectation approach.
* tag 'nf-26-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration
netfilter: nft_ct: expectation timeouts are passed in milliseconds
netfilter: nf_conntrack_expect: run expectation eviction with no helper
netfilter: nf_conntrack_expect: store master_tuple in expectation
netfilter: conntrack: add deprecation warnings for irc and pptp trackers
netfilter: ctnetlink: do not allow to reset helper on existing conntrack
selftests: nft_queue.sh: add a bridge queue test
netfilter: nft_compat: ebtables emulation must reject non-bridge targets
selftests: netfilter: conntrack_sctp_collision.sh: Introduce SCTP INIT collision test
netfilter: nft_synproxy: stop bypassing the priv->info snapshot
netfilter: x_tables.h: fix all kernel-doc warnings
netfilter: flowtable: Validate iph->ihl in nf_flow_ip4_tunnel_proto()
netfilter: nf_conncount: prevent connlimit drops for early confirmed ct
netfilter: nf_nat: avoid invalid nat_net pointer use on failed nf_nat_init()
====================
Link: https://patch.msgid.link/20260623221548.701545-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2026-06-22
1) xfrm: use compat translator only for u64 alignment mismatch
Gate the XFRM_USER_COMPAT translator on COMPAT_FOR_U64_ALIGNMENT
so 32-bit compat tasks on arches whose 32-bit ABI already matches
the native 64-bit layout are no longer rejected with -EOPNOTSUPP.
From Sanman Pradhan.
2) net: af_key: initialize alg_key_len for IPComp states
Initialize the alg_key_len to 0 in the IPComp branch of
pfkey_msg2xfrm_state() so an uninitialized value cannot drive
xfrm_alg_len() into a slab-out-of-bounds kmemdup during
XFRM_MSG_MIGRATE. From Zijing Yin.
3) xfrm: Fix dev use-after-free in xfrm async resumption
Stash the original skb->dev and extend the RCU critical section
across xfrm_rcv_cb() and transport_finish() to prevent a
tunnel-device UAF and original-device refcount leak when a
callback replaces skb->dev. From Dong Chenchen.
4) xfrm: Fix xfrm state cache insertion race
Move the state-validity check inside xfrm_state_lock in the
input state cache insertion path so a state cannot be killed
between the check and the insert. From Herbert Xu.
5) xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
Add READ_ONCE()/WRITE_ONCE() annotations on xfrm_policy_count
and xfrm_policy_default to silence the KCSAN data race reported
on net->xfrm.policy_count. From Eric Dumazet.
6) espintcp: use sk_msg_free_partial to fix partial send
Replace the manual skmsg accounting in espintcp with
sk_msg_free_partial() so the skmsg stays consistent on every
iteration and the partial-send accounting bugs go away.
From Sabrina Dubroca.
7) xfrm: validate selector family and prefixlen during match
Reject mismatched address families in xfrm_selector_match() and
bound prefixlen in addr4_match()/addr_match() to prevent the
shift-out-of-bounds syzbot reported when an AF_UNSPEC selector
with a large prefixlen is matched against an IPv4 flow.
From Eric Dumazet.
* tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: validate selector family and prefixlen during match
espintcp: use sk_msg_free_partial to fix partial send
xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
xfrm: Fix xfrm state cache insertion race
xfrm: Fix dev use-after-free in xfrm async resumption
net: af_key: initialize alg_key_len for IPComp states
xfrm: use compat translator only for u64 alignment mismatch
====================
Link: https://patch.msgid.link/20260622075726.29685-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Store master conntrack tuple in the expectation since exp->master might
refer to a different conntrack when accessed from rcu read side lock
area due to typesafe rcu rules.
Fixes: 02a3231b6d82 ("netfilter: nf_conntrack_expect: store netns and zone in expectation")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
IRC Direct client-to-client requires plaintext. IRC over TLS should be
preferred, making this helper ineffective. Add a deprecation warning and
update the help text to better reflect that this is needed for the DCC
extension, not IRC itself.
PPTP is esoteric these days and it is the only helper that requires the
destroy callback in the conntrack helper API.
Removal would simplify the conntrack core.
Both helpers are IPv4 only.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
When CONFIG_IP_MULTIPLE_TABLES is enabled but no rule is added,
fib_lookup() performs route lookup directly on two tables.
Since the first lookup does not properly bail out, the result
of an error route in the merged local/main table could be
overwritten by another route in the default table:
# unshare -n
# ip link set lo up
# ip route add 192.168.0.0/24 dev lo table 253
# ip route add unreachable 192.168.0.0/24
# ip route get 192.168.0.1
192.168.0.1 dev lo table default uid 0
cache <local>
Once a random rule is added, the error route is respected:
# ip rule add table 0
# ip rule del table 0
# ip route get 192.168.0.1
RTNETLINK answers: No route to host
Let's fix the inconsistent behaviour.
Fixes: f4530fa574df ("ipv4: Avoid overhead when no custom FIB rules are installed.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260619212753.3367244-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net. This batches
fixes for real crashes with trivial/correctness fixes. There is too
a rework of the conntrack expectation timeout strategy to deal with
a possible race when removing an expectation.
1) Fix the incorrect flowtable timeout extension for entries in
hw offload, from Adrian Bente. This is correcting a defect in
the functionality, no crash.
2) Hold reference to device under the fake dst in br_netfilter,
from Haoze Xie. This is fixing a possible UaF if the device
is removed while packet is sitting in nfqueue.
3) Reject template conntrack in xt_cluster, otherwise access to
uninitialize conntrack fields are possible leading to WARN_ON
due to unset layer 3 protocol. From Wyatt Feng.
4) Make sure the IPv6 tunnel header is in the linear skb data
area before pulling. While at it remove incomplete NEXTHDR_DEST
support. From Lorenzo Bianconi. This possibly leading to crash
if IPv4 header is not in the linear area.
5) Use test_bit_acquire in ipset hash set to avoid reordering
of subsequent memory access. This is addressing a LLM related
report, no crash has been observed. From Jozsef Kadlecsik.
6) Use test_bit_acquire in ipset bitmap set too, for the same
reason as in the previous patch, from Jozsef Kadlecsik.
7) Call kfree_rcu() after rcu_assign_pointer() to address a
possible UaF if kfree_rcu() runs inmediately, which to my
understanding never happens. Never observed in practise,
reported by LLM. Also from Jozsef Kadlecsik.
8) Use disable_delayed_work_sync() instead cancel_delayed_work_sync()
to avoid that ipset GC handler re-queues work as reported by LLM.
From Jozsef Kadlecsik. This is for correctness.
9) Restore the check in nft_payload for exceeding payloda offset
over 2^16. From Florian Westphal. This fixes a silent truncation,
not a big deal, but better be assertive and reject it.
10) Validate NFT_META_BRI_IIFHWADDR can only run from bridge
prerouting. From Florian Westphal. Harmless but it could allow
to read bytes from skb->cb.
11) Zero out destination hardware address during the flowtable
path setup, also from Florian. This is a correctness fix, LLM
points that possible infoleak can happen but topology to achieve
it is not clear.
12) Skip IPv4 options if present when building the IPV4 reject reply.
Otherwise bytes in the IPv4 options header can be sent back to
origin where the ICMP header is being expected. Again from
Florian Westphal.
13) Replace timer API for expectation by GC worker approach. This
is implicitly fixing a race between nf_ct_remove_expectations()
which might fail to remove the expectation due to timer_del()
returning false because timer has expired and callback is
being run concurrently. This fix is addressing a crash that has
been already reported with a reproducer.
14) Check if br_vlan_get_pvid_rcu() fails, otherwise possible stack
infoleak of 4-bytes. From Florian Westphal.
* tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak
netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
netfilter: nf_reject: skip iphdr options when looking for icmp header
netfilter: nft_flow_offload: zero device address for non-ether case
netfilter: nft_meta_bridge: add validate callback for get operations
netfilter: nft_payload: reject offsets exceeding 65535 bytes
netfilter: ipset: make sure gc is properly stopped
netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer()
netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types
netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types
netfilter: flowtable: fix and simplify IP6IP6 tunnel handling
netfilter: xt_cluster: reject template conntracks in hash match
netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst
netfilter: flowtable: fix offloaded ct timeout never being extended
====================
Link: https://patch.msgid.link/20260620222738.112506-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
kmalloc_flex() in metadata_dst_alloc() sets __counted_by for the
structure to the options_len, which is then initialized to zero.
Later, we're initializing the structure by copying the tunnel info
together with the options, and this triggers a warning for a potential
memcpy overflow, since the compiler estimates that the options can't
fit into the structure, even though the memory for them is actually
allocated.
memcpy: detected buffer overflow: 104 byte write of buffer size 96
WARNING: CPU: X PID: Y at lib/string_helpers.c:1036 __fortify_report
skb_tunnel_info_unclone+0x179/0x190
geneve_xmit+0x7fe/0xe00
The issue is triggered when built with clang and source fortification.
Fix that by doing the copy in two stages: first - the main data with
the options_len, then the options. This way the correct length should
be known at the time of the copy.
It would be better if the options_len never changed after allocation,
but the allocation code is a little separate from the initialization
and it would be awkward and potentially dangerous to return a struct
with options_len set to a non-zero value from the metadata_dst_alloc().
Another option would be to use ip_tunnel_info_opts_set(), but it is
doing too many unnecessary operations for the use case here.
Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Reported-by: Johan Thomsen <write@ownrisk.dk>
Closes: https://lore.kernel.org/netdev/CAKv6aAM8_EWgXScnKmKYm_4SwGDVBK++dzfP+Y6msUXbp99QUw@mail.gmail.com/
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20260616100332.1308294-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull 9p updates from Dominique Martinet:
"Asides of the avalanche of LLM-driven fixes, there are a couple of big
changes this cycle:
- negative dentry and symlink cache
- a way out of the unkillable "io_wait_event_killable" (because it
looped around waiting for the request flush to come back from
server; this has been bugging syzcaller folks since forever): I'm
still not 100% sure about this patch, but I think it's as good as
we'll ever get, and will keep testing a bit further in the coming
weeks
The rest is more noisy than usual, but shouldn't cause any trouble"
* tag '9p-for-7.2-rc1' of https://github.com/martinetd/linux:
9p: Add missing read barrier in virtio zero-copy path
net/9p: Replace strlen() strcpy() pair with strscpy()
9p: skip nlink update in cacheless mode to fix WARN_ON
net/9p: fix race condition on rdma->state in trans_rdma.c
9p: v9fs_file_do_lock: replace WARN_ONCE with p9_debug
9p: Enable symlink caching in page cache
9p: Set default negative dentry retention time for cache=loose
9p: Add mount option for negative dentry cache retention
9p: Cache negative dentries for lookup performance
9p: avoid returning ERR_PTR(0) from mkdir operations
9p: avoid putting oldfid in p9_client_walk() error path
net/9p: fix infinite loop in p9_client_rpc on fatal signal
docs/filesystems/9p: fix broken external links
9p: invalidate readdir buffer on seek
9p: use kvzalloc for readdir buffer
net/9p/usbg: Constify struct configfs_item_operations
|
|
Not caching negative dentries can result in poor performance for
workloads that repeatedly look up non-existent paths. Each such
lookup triggers a full 9P transaction with the server, adding
unnecessary overhead.
A typical example is source compilation, where multiple cc1 processes
are spawned and repeatedly search for the same missing header files
over and over again.
This change enables caching of negative dentries, so that lookups for
known non-existent paths do not require a full 9P transaction. The
cached negative dentries are retained for a configurable duration
(expressed in milliseconds), as specified by the ndentry_timeout
field in struct v9fs_session_info. If set to -1, negative dentries
are cached indefinitely.
This optimization reduces lookup overhead and improves performance for
workloads involving frequent access to non-existent paths.
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <e542317dd03bbadb5249abd3ea6aecfdca692c19.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
This patch replaces the timer API by GC worker approach for
expectations, as it already happened in many other subsystems.
Use the existing conntrack GC worker to iterate over the local list of
expectations in the master conntrack to reap expired expectations.
Check IPS_HELPER_BIT to run GC for expectations, set it on for nft_ct
expectation which nevers sets it. Hold the expectation spinlock while
iterating over the master conntrack expectation list to synchronize with
nf_ct_remove_expectations(). This also performs runtime packet path
garbage collection through the expectation insertion and lookup
functions while walking over one of the chains of the global expectation
hashtables. Unconfirmed conntrack entries are skipped since ct->ext can
be reallocated and dying are skipped since those will be gone soon.
Set on IPS_HELPER_BIT if the helper ct extension is added, then the new
GC worker does not need to bump the ct refcount to check if the ct->ext
helper is available.
This removes the extra bump on the refcount for expectation timers, this
allows to remove several nf_ct_expect_put() calls after the unlink,
after this update only refcount remains at 1 while on the expectation
hashes.
This patch implicitly addresses a race with the existing timer API
allowing an expectation to access a stale exp->master pointer which has
been already released when expectation removal loses races with an
expiring timer, ie. timer_del() reporting false.
Add a new NF_CT_EXPECT_DEAD flag to reap this expectation via GC. This
is needed by nf_conntrack_unexpect_related() which is called in error
paths to invalidate newly created expectations that has been added into
the hashes. These expectactions cannot be inmediately released as GC or
nf_ct_remove_expectations() could race to make it. On expectation
insert, the runtime GC reaps stale expectations before checking the
expectation limit set by policy.
Set current timestamp in nf_ct_expect_alloc(), then add the expectation
policy timeout (or custom timeout specified added on top of this) to
specify the expectation lifetime.
Fixes: bffcaad9afdf ("netfilter: ctnetlink: ensure safe access to master conntrack")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Blamed commit added NFT_META_BRI_IIFHWADDR to the set validate callback,
yet this is a get operation.
Add a get validate callback and move the NFT_META_BRI_IIFHWADDR key
there.
AFAICS this is harmless, NFT_META_BRI_IIFHWADDR can deal with a NULL
input device and the set handler ignores a NFT_META_BRI_IIFHWADDR
operation, but it allows to read 4 bytes off bridge skb->cb[].
Fixes: cbd2257dc96e ("netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The br_netfilter fake rtable is embedded in struct net_bridge and is
attached to bridged packets with skb_dst_set_noref(). If such a packet is
queued to NFQUEUE, __nf_queue() upgrades that fake dst with
skb_dst_force().
At that point the queued skb can hold a real dst reference after bridge
teardown has started. The problem is not that every bridged packet needs
its own dst reference. The problem is that NFQUEUE can keep the bridge
private fake dst alive after unregister begins.
Fix this by keeping the bridge fake dst model unchanged and pinning the
bridge master device only while the packet sits in NFQUEUE. Record the
bridge device in nf_queue_entry when the queued skb carries a bridge fake
dst, take a device reference for the queue lifetime, and drop it when the
queue entry is freed.
Also make sure queued entries are reaped when that bridge device goes
down, and drop the redundant nf_bridge_info_exists() test from the fake
dst detection.
This keeps netdev_priv(br->dev) alive until verdict completion, so the
embedded fake rtable and its metrics backing storage cannot be freed out
from under dst_release(). It also avoids the constant refcount bump and
avoids using ipv4-specific dst helpers for IPv6 bridge traffic.
Fixes: 34666d467cbf ("netfilter: bridge: move br_netfilter out of the core")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Haoze Xie <royenheart@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
SCTP_DIAG endpoint dumping was traversing endpoint address lists without
holding lock_sock(), while those lists could change concurrently via
socket operations (e.g., bindx changes). This creates a race where
nla_reserve() counts addresses under RCU protection, but the subsequent
copy may see fewer entries, potentially leaking uninitialized memory to
userspace.
Fix this by:
- Taking a reference on each endpoint during hash traversal
- Moving socket operations (lock_sock()) outside read_lock_bh()
- Serializing address list access during dump
- Reworking sctp_for_each_endpoint() to support restart-based traversal
with (net, pos) tracking
Also:
- Add WARN_ON_ONCE() for inconsistent address counts
- Fix idiag_states filtering for LISTEN vs association cases
- Skip dumping endpoints being freed (ep->base.dead)
- Move dump position tracking into iterator, removing cb->args[4] and
its comment for sctp_ep_dump().,
- Update the comment for cb->args[4] and remove the comment for unused
cb->args[5] for sctp_sock_dump().
Note: traversal is restart-based and may re-scan buckets multiple times,
but this is acceptable due to small bucket sizes and required to support
sleeping-safe callbacks.
This issue was reported by Nico Yip (@_cyeaa_) working with TrendAI Zero
Day Initiative.
Reported-by: Zero Day Initiative <zdi-disclosures@trendmicro.com>
Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/4c1b49ab87e0f7d552ebd8172b364b1994e913c9.1781552190.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A tunnel changelink() operates on at most two netns, dev_net(dev) and
the tunnel link netns t->net. They differ once the device is created in
or moved to a netns other than the one the request runs in. The rtnl
changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a
caller privileged there but not in t->net can rewrite a tunnel that
lives in t->net.
Add rtnl_dev_link_net_capable() next to rtnl_get_net_ns_capable() in
net/core/rtnetlink.c. It requires CAP_NET_ADMIN in the link netns and is
skipped when the link netns is dev_net(dev), where the rtnl path already
checked it. The other patches in this series use the same helper.
Gate ipgre_changelink() and erspan_changelink() with it, at the top of
the op before any attribute is parsed, because the parsers update live
tunnel fields first. ipgre_netlink_parms() sets t->collect_md before
ip_tunnel_changelink() runs.
Commit 8b484efd5cb4 ("ip6: vti: Use ip6_tnl.net in
vti6_siocdevprivate().") added the same check on the ioctl path. This
adds it on RTM_NEWLINK.
Reported-by: Xiao Liang <shaw.leon@gmail.com>
Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/
Fixes: b57708add314 ("gre: add x-netns support")
Cc: stable@vger.kernel.org
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260612085941.3158249-2-maoyixie.tju@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported a shift-out-of-bounds in xfrm_selector_match()
due to AF_UNSPEC selector with large prefixlen (e.g. 128) matched
against IPv4 flow (when XFRM_STATE_AF_UNSPEC is set).
Fix this by:
- Rejecting mismatched families in xfrm_selector_match.
- Returning false in addr4_match if prefixlen > 32.
- Returning false in addr_match if prefixlen > 128 (prevents overflow).
Fixes: 3f0ab59e6537 ("xfrm: validate new SA's prefixlen using SA family when sel.family is unset")
Reported-by: syzbot+9383b1ff0df4b29ca5e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a2fbe35.be3f099c.2836ae.0018.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
KCSAN reported a data race involving net->xfrm.policy_count access.
Add missing READ_ONCE()/WRITE_ONCE() annotations on
xfrm_policy_count and xfrm_policy_default.
Fixes: 2518c7c2b3d7 ("[XFRM]: Hash policies when non-prefixed.")
Reported-by: syzbot+d85ba1c732720b9a4097@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a2b9e96.99669fcc.12a77b.0006.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Merge in late fixes in preparation for the net-next PR.
Conflicts:
net/tls/tls_sw.c
406e8a651a7b ("net: skmsg: preserve sg.copy across SG transforms")
79511603a65b ("tls: remove dead sockmap (psock) handling from the SW path")
drivers/net/ethernet/microsoft/mana/mana_en.c
f8fd56977eeea ("net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check")
d07efe5a6e641 ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size")
https://lore.kernel.org/ajAPXu-C_PuTgV-a@sirena.org.uk
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB,
and spurious-retransmission events, but the cached route is reused and
the new hash is not propagated into the ECMP path selection logic. Two
changes are needed to make rehash select a different local ECMP path:
1. Add __sk_dst_reset() alongside sk_rethink_txhash() in
tcp_write_timeout(), tcp_rcv_spurious_retrans(), and
tcp_plb_check_rehash() so the cached dst is invalidated and the
next transmit triggers a fresh route lookup.
2. Set fl6->mp_hash from sk_txhash (or tcp_rsk(req)->txhash for
SYN/ACK retransmits and syncookies) in tcp_v6_connect(),
inet6_sk_rebuild_header(), inet6_csk_route_req(),
inet6_csk_route_socket(), tcp_v6_send_response(), and
cookie_v6_check() so fib6_select_path() picks a path based on the
new hash.
The mp_hash override only applies to fib_multipath_hash_policy 0 (the
default L3 policy). Its hash includes the flow label, but that is 0 by
default -- np->flow_label is unset, and auto_flowlabels only computes
the on-wire label later, per packet -- so flows to the same peer share
one local path. Keying the hash on sk_txhash makes the local path
per-connection and lets a rehash re-select it. Policies 1-3 are left
unchanged.
The mp_hash assignment is factored into a small helper,
ip6_ecmp_set_mp_hash(), shared by inet6_csk_route_req(),
inet6_csk_route_socket(), tcp_v6_connect(), inet6_sk_rebuild_header(),
tcp_v6_send_response(), and cookie_v6_check(). It applies
(txhash >> 1) ?: 1 for policy 0 (the >> 1 keeps mp_hash in the 31-bit
range; ?: 1 keeps it non-zero, since 0 would fall back to
rt6_multipath_hash()). inet6_csk_route_socket() calls it only for
sk_protocol == IPPROTO_TCP so that non-TCP callers (e.g., L2TP via
inet6_csk_xmit) fall through to rt6_multipath_hash() and retain their
existing flow-key-based ECMP behavior.
tcp_v6_send_response() also sets mp_hash from the response txhash so
that a control packet (a RST from the full socket, or an ACK from a
time-wait socket) selects the same local ECMP nexthop as the
connection's txhash rather than falling back to the flow hash. The
time-wait socket's tw_txhash is copied from sk_txhash when the
connection enters TIME_WAIT, so it reflects any rehash that occurred.
Setting mp_hash explicitly is necessary because the default ECMP hash
derives from fl6->flowlabel via np->flow_label, which is not updated
from sk_txhash (REPFLOW is off by default). ip6_make_flowlabel()
cannot help either, as it runs after the route lookup.
As a consequence, for policy 0 the local ECMP path of an IPv6 TCP
flow follows sk_txhash even when fl6->flowlabel is non-zero, e.g. a
reflected (REPFLOW) or explicitly set (IPV6_FLOWLABEL_MGR) flow
label. This is intentional: only local path selection changes, so
rehash can recover from a failed path; the on-wire flow label is
unchanged.
sk_set_txhash() is moved before ip6_dst_lookup_flow() in
tcp_v6_connect() so the initial ECMP path is selected by the same
txhash that subsequent route rebuilds will use. This avoids
unintended path changes when the cached dst is naturally invalidated
(e.g., by PMTU discovery or route changes).
The rehash sites (tcp_write_timeout(), tcp_plb_check_rehash(), and
tcp_rcv_spurious_retrans()) call __sk_rethink_txhash_reset_dst(),
which re-rolls the txhash and, when it changed, drops the cached dst
so the next transmit re-runs route selection. The dst reset is
guarded by sk->sk_family == AF_INET6 since IPv4 ECMP does not
currently use sk_txhash for path selection. For IPv4-mapped IPv6
sockets this produces a redundant dst reset on a cold path
(RTO/PLB); the subsequent IPv4 route lookup returns the same result.
The helper is deliberately separate from sk_rethink_txhash() itself:
dst_negative_advice() calls sk_rethink_txhash() before its own dst op,
so resetting the dst inside sk_rethink_txhash() would skip that op
(e.g. rt6_remove_exception_rt()).
For syncookies, cookie_init_sequence() computes the cookie value
before route_req() and sets txhash so the SYN-ACK selects the same
ECMP path that cookie_v6_check() will use when the full socket is
created. cookie_tcp_reqsk_init() derives txhash from the cookie so
the full socket's ECMP path matches the SYN-ACK. Both the SYN-ACK
assignment in tcp_conn_request() and the full-socket assignment in
cookie_tcp_reqsk_init() set txhash from the cookie for IPv4 and IPv6
alike. On IPv6 this drives ECMP path selection; on IPv4, which does
not use sk_txhash for ECMP, it only affects TX-queue selection. That
selection scales the hash by its high bits (reciprocal_scale()), which
are uniform in the keyed secure_tcp_syn_cookie() output -- the MSS index
only perturbs the low bits -- so the queue distribution matches
net_tx_rndhash().
cookie_init_sequence() is split from the former version that also
called tcp_synq_overflow() and incremented SYNCOOKIESSENT; those
side effects are now in cookie_record_sent(), called after
route_req() succeeds so they are not bumped when route_req() fails.
cookie_record_sent() is guarded by CONFIG_SYN_COOKIES to
match the guard on tcp_synq_overflow(). route_req() receives 0 as
tw_isn for the syncookie path so that tcp_v6_init_req() still saves
ireq->pktopts for REPFLOW flowlabel reflection and IPv6 cmsg
options. The ecn_ok clear for syncookies without timestamps stays
after tcp_ecn_create_request() so it takes precedence.
Signed-off-by: Neil Spring <ntspring@meta.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260615042158.1600746-2-ntspring@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A comment in <net/sctp/sctp.h> incorrectly refers to
CONFIG_SCTP_DBG_OBJCOUNT instead of CONFIG_SCTP_DBG_OBJCNT. Correct it.
Discovered while searching for CONFIG_* symbols referenced in code but
not defined in any Kconfig file.
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Link: https://patch.msgid.link/20260613233725.162470-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next.
More specifically, this contains conncount rework to address AI related
reports, assorted Netfiter updates and two small incremental updates on
IPVS:
1) Replace old obsolete workqueues (system_wq, system_unbound_wq)
in IPVS, from Marco Crivellari.
2) Replace WARN_ON{_ONCE} by DEBUG_NET_WARN_ON_ONCE in nf_tables.
In the recent years, reporters say that the use of WARN_ON{_ONCE}
in conjunction with panic_on_warn=1 results in DoS. Let's replace
it by DEBUG_NET_WARN_ON_ONCE so this is only exercised by test
infrastructure and fuzzers, while also providing context to AI
agents. From Fernando F. Mancera.
Five patches from Florian Westphal to address AI reports in the conncount
infrastructures:
3) Fix missing rcu read lock section when calling
__ovs_ct_limit_get_zone_limit().
4) Add a dedicate lock per rbtree tree, this increases memory
usage but it should improve scalability.
5) Add a helper function to find the rbtree node, no functional
changes are intented.
6) Add sequence counter to detect concurrent tree modifications
and retry lookups.
7) Add locks to GC conncount walk and address other nitpicks.
Then, several assorted updates:
8) Defensive Tree-wide addition of NULL checks for ct extensions.
9) Bail out if flowtable bypass cannot be fully set up from the
flow offload expression, instead of lazy building a likely
incomplete one.
10) Fix documentation for the new conn_max sysctl toggle in IPVS.
11) Add nf_dev_xmit_recursion*() helpers and use them, to address
recent AI reports.
* tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_dup_netdev: add nf_dev_xmit_recursion*() helpers and use them
ipvs: fix doc syntax for conn_max sysctl
netfilter: flowtable: bail out if forward path cannot be discovered
netfilter: conntrack: check NULL when retrieving ct extension
netfilter: nf_conncount: gc and rcu fixes
netfilter: nf_conncount: add sequence counter to detect tree modifications
netfilter: nf_conncount: split count_tree_node rbtree walk into helper
netfilter: nf_conncount: use per nf_conncount_data spinlocks
netfilter: nf_conncount: callers must hold rcu read lock
netfilter: nf_tables: use DEBUG_NET_WARN_ON_ONCE in packet and control paths
ipvs: Replace use of system_unbound_wq with system_dfl_long_wq
====================
Link: https://patch.msgid.link/20260614114605.474783-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Update nft_dup and nft_fwd to use the nf_dev_xmit_recursion() helpers.
This patch also disables BH when transmitting the skb to address a
possible migration to different CPU leading to imbalanced decrementation
of the recursion counters.
This is modeled after Florian Westphal's dev_xmit_recursion*() API
available since commit 97cdcf37b57e ("net: place xmit recursion in
softnet data") according to its current state in the tree.
Fixes: 1d47b55b36d2 ("netfilter: nft_fwd_netdev: use recursion counter in neigh egress path")
Fixes: f37ad9127039 ("netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nf_ct_ext_find() might return NULL if ct extension is not found.
Add also the null checks to:
- nfct_help()
- nfct_help_data()
- nfct_seqadj()
- nfct_nat()
This is defensive, for safety reasons.
nf_ct_ext_find() used to return NULL if the extension is stale for
unconfirmed conntracks if the genid validation fails.
Skip NULL check in nf_nat_inet_fn() given this is valid to be NULL
for non-initialized ct nat extensions.
While at it, fetch ct helper area in nf_ct_expect_related_report() only
once and pass it on to other ancilliary functions. Replace WARN_ON()
by WARN_ON_ONCE() in nf_ct_unlink_expect_report().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Devlink param value attribute is not defined since devlink is handling
the value validating and parsing internally, this allows us to implement
multi attribute values without breaking any policies.
Devlink param multi-attribute values are considered to be dynamically
sized arrays of u64 values, by introducing a new devlink param type
DEVLINK_PARAM_TYPE_U64_ARRAY, driver and user space can set a variable
count of u64 values into the DEVLINK_ATTR_PARAM_VALUE_DATA attribute.
Implement get/set parsing and add to the internal value structure passed
to drivers.
This is useful for devices that need to configure a list of values for
a specific configuration.
example:
$ devlink dev param show pci/... name multi-value-param
name multi-value-param type driver-specific
values:
cmode permanent value: 0,1,2,3,4,5,6,7
$ devlink dev param set pci/... name multi-value-param \
value 4,5,6,7,0,1,2,3 cmode permanent
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260609040453.711932-5-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2026-06-12
1) Replace the open-coded manual cleanup in xfrm_add_policy() error
path with xfrm_policy_destroy() for consistency with
xfrm_policy_construct().
From Deepanshu Kartikey.
2) Limit XFRMA_TFCPAD to a sensible maximum (max IP length, 64k) since
u32 is excessive for traffic flow confidentiality padding.
From David Ahern.
3) Add a new netlink message XFRM_MSG_MIGRATE_STATE that
allows migrating individual IPsec SAs independently of
their policies. The existing XFRM_MSG_MIGRATE is tightly coupled
to policy+SA migration, lacks SPI for unique SA identification,
and cannot express reqid changes or migrate Transport mode
selectors. The new interface identifies the SA via SPI and mark,
supports reqid changes, address family changes, encap removal,
and uses an atomic create+install flow under x->lock to prevent
SN/IV reuse during AEAD SA migration.
From Antony Antony.
* tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE
xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration
xfrm: make xfrm_dev_state_add xuo parameter const
xfrm: extract address family and selector validation helpers
xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper
xfrm: move encap and xuo into struct xfrm_migrate
xfrm: add error messages to state migration
xfrm: add state synchronization after migration
xfrm: check family before comparing addresses in migrate
xfrm: split xfrm_state_migrate into create and install functions
xfrm: rename reqid in xfrm_migrate
xfrm: fix NAT-related field inheritance in SA migration
xfrm: allow migration from UDP encapsulated to non-encapsulated ESP
xfrm: add extack to xfrm_init_state
xfrm: remove redundant assignments
xfrm: Reject excessive values for XFRMA_TFCPAD
xfrm: cleanup error path in xfrm_add_policy()
====================
Link: https://patch.msgid.link/20260612074725.1760473-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add vsock_pending_to_accept() to move a socket directly from the
pending list to the accept queue in a single operation, avoiding
the sock_put/sock_hold dance and the sk_acceptq_removed()/
sk_acceptq_added() pair that would otherwise be needed when
calling vsock_remove_pending() followed by vsock_enqueue_accept().
Use it in vmci_transport_recv_connecting_server() where a completed
handshake transitions the socket from pending to accept queue.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Raf Dickson <rafdog35@gmail.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260612045216.105796-2-rafdog35@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The main purpose of this cmd is to be able to associate a
non-psp-capable device (e.g. veth or netkit) with a psp device.
One use case is if we create a pair of veth/netkit, and assign 1 end
inside a netns, while leaving the other end within the default netns,
with a real PSP device, e.g. netdevsim or a physical PSP-capable NIC.
With this command, we could associate the veth/netkit inside the netns
with PSP device, so the virtual device could act as PSP-capable device
to initiate PSP connections, and performs PSP encryption/decryption on
the real PSP device.
Signed-off-by: Wei Wang <weibunny@fb.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260608233118.2694144-3-weibunny.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The tls_toe feature and its single user (chelsio chtls) have been
unmaintained for multiple years. It also hooks into the core of the
TCP implementation, and bypasses most of the networking stack.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/1f30e73275c07bf879f547589872d0916025a52e.1781165969.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A child socket inherits the listener's bpf_sock_ops_cb_flags via
sk_clone_lock(). If its setup fails in tcp_v4_syn_recv_sock() /
tcp_v6_syn_recv_sock(), the child is freed through put_and_exit, where
inet_csk_prepare_forced_close() drops the socket lock and tcp_done() runs
without it.
If BPF_SOCK_OPS_STATE_CB_FLAG was inherited, tcp_done() -> tcp_set_state()
calls tcp_call_bpf(), which expects the lock and trips sock_owned_by_me():
WARNING: include/net/sock.h:1799 at tcp_set_state+0x433/0x550
RIP: 0010:tcp_set_state+0x433/0x550 include/net/sock.h:1799
Call Trace:
<IRQ>
tcp_done+0xba/0x250 net/ipv4/tcp.c:5095
tcp_v4_syn_recv_sock+0x850/0xa50 net/ipv4/tcp_ipv4.c:1787
tcp_check_req+0xf30/0x1360 net/ipv4/tcp_minisocks.c:926
tcp_v4_rcv+0x1047/0x1b50 net/ipv4/tcp_ipv4.c:2164
</IRQ>
The child is freed before it is ever established, so it should run no
sock_ops callback. Clear its cb flags in inet_csk_prepare_for_destroy_sock(),
the common point for the IPv4, IPv6 and chtls forced-close paths and for the
MPTCP ->syn_recv_sock() failure path (dispose_child), which reaches tcp_done()
on a child that was never established too.
Suggested-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Fixes: d44874910a26 ("bpf: Add BPF_SOCK_OPS_STATE_CB")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260611092923.1895982-1-rhkrqnwk98@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
core:
- hci_sync: Add support for HCI_LE_Set_Host_Feature [v2]
- SMP: Use AES-CMAC library API
- sockets: convert to getsockopt_iter
- Add SPDX id lines to some source files
drivers:
- btintel_pcie: Support Product level reset
- btintel_pcie: Add support for smart trigger dump
- btintel_pcie: Add 50 ms delay before MAC init on BlazarIW
- btintel_pcie: Separate coredump work from RX work
- btmtk: add event filter to filter specific event
- btrtl: fix RTL8761B/BU broken LE extended scan
- btusb: Add Realtek RTL8922AE VID/PID 0bda/d922
- btusb: Add Realtek RTL8922AE VID/PID 0bda/d923
- btusb: MT7922: Add VID/PID 0e8d/223c
- btusb: MT7925: Add VID/PID 0e8d/8c38
- btusb: Add support for TP-Link TL-UB250
- btusb: Add Mercusys MA530 for Realtek RTL8761BUV
- btusb: Add TP-Link UB600 for Realtek 8761BUV
- btusb: Add support for Intel Lizard Peak 2 (0x8087:0x0040)
- btusb: Add USB ID 2c4e:0128 for Mercusys MA60XNB
- btusb: MT7925: Add VID/PID 13d3/3609
* tag 'for-net-next-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (49 commits)
Bluetooth: btintel_pcie: Separate coredump work from RX work
Bluetooth: btmtksdio: fix infinite loop in btmtksdio_txrx_work()
Bluetooth: qca: Add BT FW build version to kernel log
Bluetooth: vhci: validate devcoredump state before side effects
Bluetooth: L2CAP: validate connectionless PSM length
Bluetooth: hci: validate codec capability element length
Bluetooth: L2CAP: Fix UAF in channel timeout by holding conn ref
Bluetooth: btintel_pcie: Load IOSF debug regs by controller variant
Bluetooth: btintel_pcie: Add 50 ms delay before MAC init on BlazarIW
Bluetooth: Add SPDX id lines to some source files
Bluetooth: btintel_pcie: Add support for smart trigger dump
Bluetooth: hci_h5: reset hci_uart::priv in the close() method
Bluetooth: btusb: clean up probe error handling
Bluetooth: btusb: fix wakeup irq devres lifetime
Bluetooth: btusb: fix wakeup source leak on probe failure
Bluetooth: btusb: fix use-after-free on marvell probe failure
Bluetooth: btusb: fix use-after-free on registration failure
Bluetooth: btmtk: fix URB leak in alloc_mtk_intr_urb error path
Bluetooth: hci_core: Fix UAF in hci_unregister_dev()
Bluetooth: hci_event: fix simultaneous discovery stuck in FINDING
...
====================
Link: https://patch.msgid.link/20260611183358.176776-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With TCP-timestamps (padded) taking 12 bytes and ADD_ADDR IPv6 + port
taking 30 bytes, the 40-byte limit for the TCP options is reached. In
this case, it is then not possible to send the address signal.
The idea is to let MPTCP dropping the TCP-timestamps option for some
specific packets, to be able to send some specific pure ACK carrying >28
bytes of MPTCP options, like with this specific ADD_ADDR. A new
parameter is passed from tcp_established_options to the MPTCP side to
indicate if the TCP TS option is used, and if it should be dropped. The
next commit implements the part on MPTCP side, but split into two
patches to help TCP maintainers to identify the modifications on TCP
side. This feature will be controlled by a new add_addr_v6_port_drop_ts
MPTCP sysctl knob.
It is important to keep in mind that dropping the TCP timestamps option
for one packet of the connection could eventually disrupt some
middleboxes: even if it should be unlikely, they could drop the packet
or even block the connection. That's why this new feature will be
controlled by a sysctl knob.
Note that it would be technically possible to squeeze both options into
the header if the ADD_ADDR is first written, and then the TCP timestamps
without the NOPs preceding it. But this means more modifications on TCP
side, plus some middleboxes could still be disrupted by that.
In this implementation, an unused bit is used in mptcp_out_options
structure to avoid passing an address to a local variable. Reading and
setting it needs CONFIG_MPTCP, so the whole block now has this #if
condition: mptcp_established_options() is then no longer used without
CONFIG_MPTCP.
About alternatives, instead of passing a new boolean (has_ts), another
option would be to pass the whole option structure (opts), but
'struct tcp_out_options' is currently defined in tcp_output.c, and it
would need to be exported. Plus that means the removal of the TCP TS
option would be done on the MPTCP side, and not here on the TCP side.
It feels clearer to remove other TCP options from the TCP side, than
hiding that from the MPTCP side.
Yet an other alternative would be to pass the size already taken by the
other TCP options, and have a way to drop them all when needed. But this
feels better to target only the timestamps option where dropping it
should be safe, even if it is currently the only option that would be
set before MPTCP, when MPTCP is used.
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260605-net-next-mptcp-add-addr6-port-ts-v2-5-758e7ca73f4d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rocker_router_fib_event() calls fib_rule_get() during RCU dump.
If the fib_rule is dying, refcount_inc() will complain about it.
Let's call refcount_inc_not_zero() in fib_rules_dump().
Fixes: 5d7bfd141924 ("ipv4: fib_rules: Dump FIB rules when registering FIB notifier")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260610061744.2030996-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported use-after-free in nsim_fib4_prepare_event(). [0]
The problem is that the following functions call fib_info_hold() /
refcount_inc() while dumping fib_info under RCU, which is unsafe.
* mlxsw_sp_router_fib4_event()
* rocker_router_fib_event()
* nsim_fib4_prepare_event()
refcount_inc_not_zero() must be used, but it would be too late
there.
Let's guarantee the lifetime of fib_info in fib_leaf_notify().
Note that IPv6 does not need the corresponding change since
fib6_table_dump() holds fib6_table.tb6_lock.
[0]:
refcount_t: addition on 0; use-after-free.
WARNING: lib/refcount.c:25 at refcount_warn_saturate+0x9f/0x110 lib/refcount.c:25, CPU#0: kworker/u8:15/3420
Modules linked in:
CPU: 0 UID: 0 PID: 3420 Comm: kworker/u8:15 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Workqueue: netns cleanup_net
RIP: 0010:refcount_warn_saturate+0x9f/0x110 lib/refcount.c:25
Code: eb 66 85 db 74 3e 83 fb 01 75 4c e8 1b f1 22 fd 48 8d 3d 84 cb f1 0a 67 48 0f b9 3a eb 4a e8 08 f1 22 fd 48 8d 3d 81 cb f1 0a <67> 48 0f b9 3a eb 37 e8 f5 f0 22 fd 48 8d 3d 7e cb f1 0a 67 48 0f
RSP: 0018:ffffc9000f2c7270 EFLAGS: 00010293
RAX: ffffffff84a18858 RBX: 0000000000000002 RCX: ffff888032ff9ec0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8f9353e0
RBP: 0000000000000000 R08: ffff888032ff9ec0 R09: 0000000000000005
R10: 0000000000000100 R11: 0000000000000004 R12: ffff8880570cc000
R13: dffffc0000000000 R14: ffff88802b40563c R15: ffff8880570cc000
FS: 0000000000000000(0000) GS:ffff888126173000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb1f4d5d000 CR3: 000000006072a000 CR4: 00000000003526f0
Call Trace:
<TASK>
__refcount_add include/linux/refcount.h:-1 [inline]
__refcount_inc include/linux/refcount.h:366 [inline]
refcount_inc include/linux/refcount.h:383 [inline]
fib_info_hold include/net/ip_fib.h:629 [inline]
nsim_fib4_prepare_event drivers/net/netdevsim/fib.c:930 [inline]
nsim_fib_event_schedule_work drivers/net/netdevsim/fib.c:1000 [inline]
nsim_fib_event_nb+0x1055/0x1240 drivers/net/netdevsim/fib.c:1043
call_fib_notifier+0x45/0x80 net/core/fib_notifier.c:25
call_fib_entry_notifier net/ipv4/fib_trie.c:90 [inline]
fib_leaf_notify net/ipv4/fib_trie.c:2176 [inline]
fib_table_notify net/ipv4/fib_trie.c:2194 [inline]
fib_notify+0x36b/0x5e0 net/ipv4/fib_trie.c:2217
fib_net_dump net/core/fib_notifier.c:70 [inline]
register_fib_notifier+0x184/0x360 net/core/fib_notifier.c:108
nsim_fib_create+0x85d/0x9f0 drivers/net/netdevsim/fib.c:1596
nsim_dev_reload_create drivers/net/netdevsim/dev.c:1604 [inline]
nsim_dev_reload_up+0x374/0x7c0 drivers/net/netdevsim/dev.c:1058
devlink_reload+0x501/0x8d0 net/devlink/dev.c:475
devlink_pernet_pre_exit+0x1ff/0x420 net/devlink/core.c:558
ops_pre_exit_list net/core/net_namespace.c:161 [inline]
ops_undo_list+0x187/0x940 net/core/net_namespace.c:234
cleanup_net+0x56e/0x800 net/core/net_namespace.c:702
process_one_work kernel/workqueue.c:3314 [inline]
process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3397
worker_thread+0xa53/0xfc0 kernel/workqueue.c:3478
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Fixes: 0ae3eb7b4611 ("netdevsim: fib: Perform the route programming in a non-atomic context")
Fixes: c3852ef7f2f8 ("ipv4: fib: Replay events when registering FIB notifier")
Reported-by: syzbot+cb2aa2390ac024e25f5c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a290011.39669fcc.33b062.00b1.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260610061744.2030996-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-7.1-rc8).
Conflicts:
drivers/net/ethernet/wangxun/txgbe/txgbe_aml.c
f67aead16e85 ("net: txgbe: rework service event handling")
57d39faed4c9 ("net: txgbe: improve functions of AML 40G devices")
net/rds/info.c
512db8267b73 ("rds: mark snapshot pages dirty in rds_info_getsockopt()")
6e94eeb2a2a6 ("rds: convert to getsockopt_iter")
Adjacent changes:
include/net/sock.h
1ee90b77b727 ("net: guard timestamp cmsgs to real error queue skbs")
f0de88303d5e ("net: make is_skb_wmem() available to modules")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
l2cap_chan_timeout() runs asynchronously and accesses chan->conn. If
the connection is torn down while the timer is running or pending,
chan->conn can be freed, leading to a use-after-free when the timer
worker attempts to lock conn->lock:
| BUG: KASAN: slab-use-after-free in instrument_atomic_read_write include/linux/instrumented.h:112 [inline]
| BUG: KASAN: slab-use-after-free in atomic_long_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:4456 [inline]
| BUG: KASAN: slab-use-after-free in __mutex_trylock_fast kernel/locking/mutex.c:161 [inline]
| BUG: KASAN: slab-use-after-free in mutex_lock+0x4f/0xa0 kernel/locking/mutex.c:318
| Write of size 8 at addr ffff8881298d9550 by task kworker/2:1/83
|
| CPU: 2 UID: 0 PID: 83 Comm: kworker/2:1 Not tainted 7.1.0-rc6-next-20260601-dirty #6 PREEMPT(full)
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
| Workqueue: events l2cap_chan_timeout
| Call Trace:
| <TASK>
| instrument_atomic_read_write include/linux/instrumented.h:112 [inline]
| atomic_long_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:4456 [inline]
| __mutex_trylock_fast kernel/locking/mutex.c:161 [inline]
| mutex_lock+0x4f/0xa0 kernel/locking/mutex.c:318
| l2cap_chan_timeout+0x5d/0x1b0 net/bluetooth/l2cap_core.c:422
| process_one_work kernel/workqueue.c:3326 [inline]
| process_scheduled_works+0x7c8/0xfb0 kernel/workqueue.c:3409
| worker_thread+0x8a9/0xcf0 kernel/workqueue.c:3490
| kthread+0x346/0x430 kernel/kthread.c:436
| ret_from_fork+0x1a3/0x470 arch/x86/kernel/process.c:158
| ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
| </TASK>
|
| Allocated by task 320:
| l2cap_conn_add+0xa7/0x820 net/bluetooth/l2cap_core.c:7075
| l2cap_connect_cfm+0xdb/0xd70 net/bluetooth/l2cap_core.c:7452
| hci_connect_cfm include/net/bluetooth/hci_core.h:2139 [inline]
| hci_remote_features_evt+0x52f/0x9f0 net/bluetooth/hci_event.c:3760
| hci_event_func net/bluetooth/hci_event.c:7796 [inline]
| hci_event_packet+0x561/0xa70 net/bluetooth/hci_event.c:7847
| hci_rx_work+0x370/0x890 net/bluetooth/hci_core.c:4040
| process_one_work kernel/workqueue.c:3326 [inline]
| process_scheduled_works+0x7c8/0xfb0 kernel/workqueue.c:3409
| worker_thread+0x8a9/0xcf0 kernel/workqueue.c:3490
| kthread+0x346/0x430 kernel/kthread.c:436
| ret_from_fork+0x1a3/0x470 arch/x86/kernel/process.c:158
| ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
|
| Freed by task 322:
| hci_disconn_cfm include/net/bluetooth/hci_core.h:2154 [inline]
| hci_conn_hash_flush+0x101/0x1f0 net/bluetooth/hci_conn.c:2736
| hci_dev_close_sync+0x889/0xde0 net/bluetooth/hci_sync.c:5405
| hci_dev_do_close net/bluetooth/hci_core.c:502 [inline]
| hci_unregister_dev+0x1f7/0x370 net/bluetooth/hci_core.c:2679
| vhci_release+0x12a/0x180 drivers/bluetooth/hci_vhci.c:690
| __fput+0x369/0x890 fs/file_table.c:510
| task_work_run+0x160/0x1d0 kernel/task_work.c:233
| get_signal+0xf5b/0x1120 kernel/signal.c:2810
| arch_do_signal_or_restart+0x4d/0x600 arch/x86/kernel/signal.c:337
| __exit_to_user_mode_loop kernel/entry/common.c:64 [inline]
| exit_to_user_mode_loop+0x85/0x510 kernel/entry/common.c:98
| do_syscall_64+0x263/0x3d0 arch/x86/entry/syscall_64.c:100
| entry_SYSCALL_64_after_hwframe+0x77/0x7f
|
| The buggy address belongs to the object at ffff8881298d9400
| which belongs to the cache kmalloc-512 of size 512
| The buggy address is located 336 bytes inside of
| freed 512-byte region [ffff8881298d9400, ffff8881298d9600)
Fix it by having chan->conn hold a reference to l2cap_conn (via
l2cap_conn_get) when the channel is added to the connection, and
releasing it in the channel destructor. This ensures the l2cap_conn
remains alive as long as the channel exists.
A new FLAG_DEL channel flag is introduced to indicate that the channel
has been deleted from its connection. l2cap_chan_del() atomically sets
this flag using test_and_set_bit() instead of setting chan->conn to
NULL. All asynchronous workers (l2cap_chan_timeout, l2cap_ack_timeout,
l2cap_monitor_timeout, l2cap_retrans_timeout) and l2cap_chan_send()
check FLAG_DEL to determine whether the channel has been torn down,
rather than testing chan->conn for NULL.
Fixes: 8c8e620467a7 ("Bluetooth: L2CAP: use chan timer to close channels in cleanup_listen()")
Cc: <stable@vger.kernel.org>
Cc: Siwei Zhang <oss@fourdim.xyz>
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Assisted-by: Gemini:gemini-3.1-pro-preview
Reported-by: https://sashiko.dev/#/patchset/20260521021249.3258069-1-oss%40fourdim.xyz
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Many bluetooth source files are missing SPDX-License-Identifier
lines. Add appropriate IDs to these files, and remove other
license lines from the headers.
Leave the warranty disclaimer in files where the license ID is
GPL-2.0 but the wording of the disclaimer is slightly different
from that of the GPL v2 disclaimer.
It is not different enough to cause licensing conflicts, but is
kept to honor the original contributors' legal intent.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This adds support for using HCI_LE_Set_Host_Feature [v2] instead of v1
if LL Extented Features is supported and the controller supports the
command.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Revalidate bridge ports, add missing NULL checks to fetch the bridge
device by the port. From Florian Westphal.
2) Fix netdevice refcount leak in the error path of nft_fwd hardware
offload function, also from Florian.
3) Unregister helper expectfn callback on conntrack helper module
removal, otherwise dangling pointer remains in place,
from Weiming Shi.
4) Fix possible pointer infoleak in getsockopt() IPT_SO_GET_ENTRIES,
From Kyle Zeng.
5) Validate that device MAC header is present before nf_syslog
accesses it. From Xiang Mei.
6-8) Three patches to address a possible infoleak of stale stack
data in three nf_tables expressions, due to mismatch in the
_init() and _eval() function which is possible since 14fb07130c7d.
From Davide Ornaghi and Florian Westphal.
netfilter pull request 26-06-10
* tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register
netfilter: nft_fib: fix stale stack leak via the OIFNAME register
netfilter: nft_exthdr: fix register tracking for F_PRESENT flag
netfilter: nf_log: validate MAC header was set before dumping it
netfilter: x_tables: avoid leaking percpu counter pointers
netfilter: nf_conntrack: destroy stale expectfn expectations on unregister
netfilter: nf_tables_offload: drop device refcount on error
netfilter: revalidate bridge ports
====================
Link: https://patch.msgid.link/20260610161629.214092-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
NAT helpers such as nf_nat_h323 store a raw pointer to module text in
exp->expectfn (e.g. ip_nat_q931_expect). nf_ct_helper_expectfn_unregister()
only unlinks the callback descriptor and never walks the expectation table,
so an expectation pending at module removal survives with a dangling
exp->expectfn into freed module text.
When the expected connection arrives, init_conntrack() invokes
exp->expectfn(), now a stale pointer into the unloaded module. Reproduced
on a KASAN build by loading the H.323 helpers, creating a Q.931
expectation, unloading nf_nat_h323, then connecting to the expected port:
Oops: int3: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:0xffffffffa06102d1
init_conntrack.isra.0 (net/netfilter/nf_conntrack_core.c:1862)
nf_conntrack_in (net/netfilter/nf_conntrack_core.c:2049)
ipv4_conntrack_local (net/netfilter/nf_conntrack_proto.c:223)
nf_hook_slow (net/netfilter/core.c:619)
__ip_local_out (net/ipv4/ip_output.c:120)
__tcp_transmit_skb (net/ipv4/tcp_output.c:1715)
tcp_connect (net/ipv4/tcp_output.c:4374)
tcp_v4_connect (net/ipv4/tcp_ipv4.c:345)
__sys_connect (net/socket.c:2167)
Modules linked in: nf_conntrack_h323 [last unloaded: nf_nat_h323]
Reaching the dangling state requires CAP_SYS_MODULE in the initial user
namespace to remove a NAT helper that still has live expectations, so this
is a robustness fix; leaving an expectation pointing at freed text is wrong
regardless.
Add nf_ct_helper_expectfn_destroy(), which walks the expectation table and
drops every expectation whose ->expectfn matches the descriptor being torn
down. Call it from each NAT helper's exit path after the existing RCU grace
period, so no expectation outlives the code it points at and no extra
synchronize_rcu() is introduced. With the fix, the same reproducer runs to
completion without the Oops.
Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port")
Reported-by: Xiang Mei <xmei5@asu.edu>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Quite a few last updates, notably:
- b43: new support for an 11n device
- mt76:
- mt792x broken usb transport detection
- mt7921 regd improvements
- mt7927 support
- iwlwifi:
- more kunit tests
- FW version updates
- ath12k: WDS support
- rtw89:
- RTL8922AU support
- USB 3 mode switch for performance
- better monitor radiotap support
- RTL8922DE preparations
- cfg80211/mac80211:
- update UHR to D1.4, UHR DBE support
- finally remove 5/10 MHz support
- S1G rate reporting
- multicast encapsulation offload
* tag 'wireless-next-2026-06-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (285 commits)
b43: add RF power offset for N-PHY r8 + radio 2057 r8
b43: add channel info table for N-PHY r8 + radio 2057 r8
b43: add IPA TX gain table for N-PHY r8 + radio 2057 r8
b43: support radio 2057 rev 8
b43: route d11 corerev 22 to 24-bit indirect radio access
b43: add d11 core revision 0x16 to id table
b43: add firmware mappings for rev22
rfkill: Replace strcpy() with memcpy()
wifi: brcmfmac: flowring: simplify flow allocation
wifi: brcm80211: change current_bss to value
wifi: ath12k: enable IEEE80211_VHT_EXT_NSS_BW_CAPABLE when NSS ratio is reported
wifi: ath12k: fix EAPOL TX failure caused by stale tcl_metadata bits
wifi: ath: Update copyright in testmode_i.h
wifi: ath10k: Update Qualcomm copyrights
wifi: ath11k: Update Qualcomm copyrights
wifi: ath12k: Update Qualcomm copyrights
wifi: mt76: Drop unneeded mt76_register_debugfs_fops() return checks
wifi: mt76: mt7921: assert sniffer on chanctx change
wifi: mt76: mt7996: fix potential tx_retries underflow
wifi: mt76: mt7925: fix potential tx_retries underflow
...
====================
Link: https://patch.msgid.link/20260610103637.179340-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When an 802.3ad (LACP) bonding interface has no slaves in the
collecting/distributing state, the bonding master still reports
carrier as up as long as at least 'min_links' slaves have carrier.
In this situation, only one slave is effectively used for TX/RX,
while traffic received on other slaves is dropped. Upper-layer
daemons therefore consider the interface operational, even though
traffic may be blackholed if the lack of LACP negotiation means
the partner is not ready to deal with traffic.
Introduce a configuration knob to control this behavior. It allows
the bonding master to assert carrier only when at least 'min_links'
slaves are in Collecting_Distributing state.
The default mode preserves the existing behavior. This patch only
introduces the knob; its behavior is implemented in the subsequent
commit.
Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad")
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Link: https://patch.msgid.link/20260603150331.1919611-4-louis.scalbert@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skb_is_err_queue() treats PACKET_OUTGOING as the sole marker for an skb
from sk_error_queue. That assumption is not true for AF_PACKET sockets:
outgoing packet taps are also delivered to packet sockets with
skb->pkt_type == PACKET_OUTGOING, but their skb->cb is owned by AF_PACKET
instead of struct sock_exterr_skb.
If such an skb is received with timestamping enabled, the generic
timestamp cmsg path can read AF_PACKET control-buffer state as
sock_exterr_skb::opt_stats. With SO_RXQ_OVFL enabled, the packet drop
counter overlaps opt_stats. An odd drop count makes the path emit
SCM_TIMESTAMPING_OPT_STATS with skb->len and skb->data. For non-linear
skbs this copies past the linear head and can trigger hardened usercopy or
disclose adjacent heap contents.
Keep skb_is_err_queue() local to net/socket.c, but make it verify that
the PACKET_OUTGOING marker is paired with the sock_rmem_free destructor
installed by sock_queue_err_skb(). AF_PACKET receive skbs use normal
receive ownership and no longer pass as error-queue skbs, while legitimate
sk_error_queue entries keep the PACKET_OUTGOING marker and sock_rmem_free
ownership.
Fixes: 8605330aac5a ("tcp: fix SCM_TIMESTAMPING_OPT_STATS for normal skbs")
Signed-off-by: Kyle Zeng <kylebot@openai.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260607021819.49698-1-kylebot@openai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mana_query_link_cfg() sends an HWC command to firmware on every call,
but the link speed and QoS values it returns only change when the
driver explicitly calls mana_set_bw_clamp(). This function is called
not only by userspace via ethtool get_link_ksettings, but also
periodically by hv_netvsc through netvsc_get_link_ksettings and by
the sysfs speed_show attribute via dev_attr_show, resulting in
unnecessary HWC traffic every few minutes.
Add a link_cfg_error field to mana_port_context to cache the query
result. The field uses three states: 1 (not yet queried, initial
value set during mana_probe_port), 0 (success, speed/max_speed are
valid), or a negative errno for permanent errors like -EOPNOTSUPP
when the hardware does not support the command. Transient errors and
qos_unconfigured responses are not cached so that subsequent calls
will retry.
MANA is ops-locked because it implements net_shaper_ops, so the core
already takes netdev_lock() around all ethtool_ops and net_shaper_ops
entry points. Reuse that lock to serialize mana_query_link_cfg() and
mana_set_bw_clamp(). This prevents a concurrent mana_set_bw_clamp()
from racing with an in-flight query and publishing stale pre-clamp
speed/max_speed.
Invalidate the cache inside mana_set_bw_clamp() on success, so all
current and future callers that change the link configuration
automatically trigger a fresh query on the next mana_query_link_cfg()
call. Also reset link_cfg_error during resume in mana_probe() under
netdev_lock(), so that any query already in flight cannot later
store 0 and silently overwrite the post-resume invalidation.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Link: https://patch.msgid.link/20260606133301.2180073-1-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Update the device id table to include the new device id 0x00C1.
This device's BAR layout is similar to VF's, update the function,
mana_gd_init_registers(), accordingly.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/20260605212302.2135499-1-haiyangz@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These
interrupt contexts may be shared with Ethernet EQs when MSI-X vectors
are limited.
The driver now supports allocating dedicated MSI-X for each EQ. Indicate
this capability through driver capability bits. The RDMA EQs pass
use_msi_bitmap=false to share MSI-X vectors with Ethernet, while the
capability flag advertises that the driver supports per-vPort EQ
separation when hardware has sufficient vectors.
Populate eq.irq on all RDMA EQs for consistency with the Ethernet path.
Also relocate the GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE define to its
numeric BIT(6) position among the other capability flags.
Signed-off-by: Long Li <longli@microsoft.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://patch.msgid.link/20260605005717.2059954-7-longli@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use GIC functions to create a dedicated interrupt context or acquire a
shared interrupt context for each EQ when setting up a vPort.
The caller now owns the GIC reference across the EQ create/destroy
lifecycle: mana_create_eq() calls mana_gd_get_gic() before creating
each EQ and mana_destroy_eq() calls mana_gd_put_gic() after destroying
it. The msix_index invalidation is moved from mana_gd_deregister_irq()
to the mana_gd_create_eq() error path so that mana_destroy_eq() can
read the index before teardown.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/20260605005717.2059954-6-longli@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA
EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with
reference counting. This allows the driver to create an interrupt context
on an assigned or unassigned MSI-X vector and share it across multiple
EQ consumers.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/20260605005717.2059954-4-longli@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When querying the device, adjust the max number of queues to allow
dedicated MSI-X vectors for each vPort. The per-vPort queue count is
clamped towards MANA_DEF_NUM_QUEUES but will not exceed the hardware
maximum reported by the device.
MSI-X sharing among vPorts is enabled when there are not enough MSI-X
vectors for dedicated allocation, or when the platform does not support
dynamic MSI-X allocation (in which case all vectors are pre-allocated
at probe time and sharing is always used). The msi_sharing flag is
reset at the top of mana_gd_query_max_resources() so it is recomputed
from current hardware state on each probe or resume cycle.
Clamp apc->max_queues to gc->max_num_queues_vport in mana_init_port()
so that on resume, if max_num_queues_vport has decreased due to fewer
MSI-X vectors, num_queues is reduced accordingly before EQ allocation.
A device reporting zero ports now results in a fatal probe error since
the per-vPort MSI-X math requires at least one port.
Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is
used at GDMA device probe time for querying device capabilities.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/20260605005717.2059954-3-longli@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.
Move the EQ definition from struct mana_context to struct mana_port_context
and update related support functions. Export mana_create_eq() and
mana_destroy_eq() for use by the MANA RDMA driver.
RSS QPs now take a vport reference via pd->vport_use_count to ensure
EQs outlive all QP consumers. The vport must already be configured by
a raw QP before an RSS QP can be created. EQs are only destroyed when
the last QP (raw or RSS) on the PD releases its reference.
Restrict each vport to a single RSS QP. The hardware only supports one
steering configuration (indirection table / hash key) per vport, and
mana_disable_vport_rx() on QP destroy disables RX globally for the
vport. Previously, creating a second RSS QP would silently overwrite
the first QP's steering config and destroy would blackhole all traffic.
This is now explicitly rejected with -EBUSY. Existing applications
(DPDK being the primary RDMA consumer) always create one RSS QP per
vport, so no real-world flows are affected.
Reject cross-port PD sharing for both raw and RSS QPs. Since EQs and
vport configuration are per-port, a PD is bound to the port used by
its first raw QP. Subsequent QPs on the same PD must use the same
port or the creation fails with -EINVAL. Previously this was silently
broken: with shared EQs it appeared to work, but with per-vPort EQs
a cross-port PD would cause wrong-port EQ teardown and corruption.
DPDK creates one PD per port so no existing flows are affected.
Serialize mana_set_channels() and the async per-port queue reset
handler against RDMA vport configuration to prevent RDMA from claiming
the vport during the detach/attach window. A channel_changing flag is
set under apc->vport_mutex before detach and checked by
mana_cfg_vport() when called from the RDMA path, blocking RDMA from
grabbing the vport during the entire window. When the port is down
and RDMA already holds the vport, the channel change is rejected with
-EBUSY.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/20260605005717.2059954-2-longli@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While NET_NCSI not enabled, ncsi_stop_dev() is not inline
and call with it, casue compile waring:
linux/include/net/ncsi.h:63:13: warning: 'ncsi_stop_dev'
defined but not used [-Wunused-function]
static void ncsi_stop_dev(struct ncsi_dev *nd)
Setting ncsi_stop_dev() to inline like other function to
remove compile warnings.
Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
Link: https://patch.msgid.link/20260605033607.37630-1-minda.chen@starfivetech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|