| Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 571dcbeb8e635182bb825ae758399831805693c2 ]
Since commit 9c328f54741b ("net: nfc: nci: Add parameter validation for
packet data") communication with nci nfc chips is not working any more.
The mentioned commit tries to fix access of uninitialized data, but
failed to understand that in some cases the data packet is of variable
length and can therefore not be compared to the maximum packet length
given by the sizeof(struct).
Fixes: 9c328f54741b ("net: nfc: nci: Add parameter validation for packet data")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Thalmeier <michael.thalmeier@hale.at>
Reported-by: syzbot+740e04c2a93467a0f8c8@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20260218083000.301354-1-michael.thalmeier@hale.at
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit be054cc66f739a9ba615dba9012a07fab8e7dd6f ]
Commit 38a6f0865796 ("net: sched: support hash selecting tx queue")
added SKBEDIT_F_TXQ_SKBHASH support. The inclusive range size is
computed as:
mapping_mod = queue_mapping_max - queue_mapping + 1;
The range size can be 65536 when the requested range covers all possible
u16 queue IDs (e.g. queue_mapping=0 and queue_mapping_max=U16_MAX).
That value cannot be represented in a u16 and previously wrapped to 0,
so tcf_skbedit_hash() could trigger a divide-by-zero:
queue_mapping += skb_get_hash(skb) % params->mapping_mod;
Compute mapping_mod in a wider type and reject ranges larger than U16_MAX
to prevent params->mapping_mod from becoming 0 and avoid the crash.
Fixes: 38a6f0865796 ("net: sched: support hash selecting tx queue")
Cc: stable@vger.kernel.org # 6.12+
Signed-off-by: Ruitong Liu <cnitlrt@gmail.com>
Link: https://patch.msgid.link/20260213175948.1505257-1-cnitlrt@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6db8b56eed62baacaf37486e83378a72635c04cc ]
On the receive path, __ioam6_fill_trace_data() uses trace->nodelen
to decide how much data to write for each node. It trusts this field
as-is from the incoming packet, with no consistency check against
trace->type (the 24-bit field that tells which data items are
present). A crafted packet can set nodelen=0 while setting type bits
0-21, causing the function to write ~100 bytes past the allocated
region (into skb_shared_info), which corrupts adjacent heap memory
and leads to a kernel panic.
Add a shared helper ioam6_trace_compute_nodelen() in ioam6.c to
derive the expected nodelen from the type field, and use it:
- in ioam6_iptunnel.c (send path, existing validation) to replace
the open-coded computation;
- in exthdrs.c (receive path, ipv6_hop_ioam) to drop packets whose
nodelen is inconsistent with the type field, before any data is
written.
Per RFC 9197, bits 12-21 are each short (4-octet) fields, so they
are included in IOAM6_MASK_SHORT_FIELDS (changed from 0xff100000 to
0xff1ffc00).
Fixes: 9ee11f0fff20 ("ipv6: ioam: Data plane support for Pre-allocated Trace")
Cc: stable@vger.kernel.org
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260211040412.86195-1-qjx1298677004@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6a65c0cb0ff20b3cbc5f1c87b37dd22cdde14a1c ]
tipc_aead_users_dec() calls rcu_dereference(aead) twice: once to store
in 'tmp' for the NULL check, and again inside the atomic_add_unless()
call.
Use the already-dereferenced 'tmp' pointer consistently, matching the
correct pattern used in tipc_aead_users_inc() and tipc_aead_users_set().
Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication")
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Hodges <hodgesd@meta.com>
Link: https://patch.msgid.link/20260203145621.17399-1-git@danielhodges.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit baed0d9ba91d4f390da12d5039128ee897253d60 ]
In decode_choice(), the boundary check before get_len() uses the
variable `len`, which is still 0 from its initialization at the top of
the function:
unsigned int type, ext, len = 0;
...
if (ext || (son->attr & OPEN)) {
BYTE_ALIGN(bs);
if (nf_h323_error_boundary(bs, len, 0)) /* len is 0 here */
return H323_ERROR_BOUND;
len = get_len(bs); /* OOB read */
When the bitstream is exactly consumed (bs->cur == bs->end), the check
nf_h323_error_boundary(bs, 0, 0) evaluates to (bs->cur + 0 > bs->end),
which is false. The subsequent get_len() call then dereferences
*bs->cur++, reading 1 byte past the end of the buffer. If that byte
has bit 7 set, get_len() reads a second byte as well.
This can be triggered remotely by sending a crafted Q.931 SETUP message
with a User-User Information Element containing exactly 2 bytes of
PER-encoded data ({0x08, 0x00}) to port 1720 through a firewall with
the nf_conntrack_h323 helper active. The decoder fully consumes the
PER buffer before reaching this code path, resulting in a 1-2 byte
heap-buffer-overflow read confirmed by AddressSanitizer.
Fix this by checking for 2 bytes (the maximum that get_len() may read)
instead of the uninitialized `len`. This matches the pattern used at
every other get_len() call site in the same file, where the caller
checks for 2 bytes of available data before calling get_len().
Fixes: ec8a8f3c31dd ("netfilter: nf_ct_h323: Extend nf_h323_error_boundary to work on bits as well")
Signed-off-by: Vahagn Vardanian <vahagn@redrays.io>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260225130619.1248-2-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7aa767d0d3d04e50ae94e770db7db8197f666970 ]
udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests
currently in NIPA. They fail in the same exact way, TCP GRO
test stalls occasionally and the test gets killed after 10min.
These tests use veth to simulate GRO. They attach a trivial
("return XDP_PASS;") XDP program to the veth to force TSO off
and NAPI on.
Digging into the failure mode we can see that the connection
is completely stuck after a burst of drops. The sender's snd_nxt
is at sequence number N [1], but the receiver claims to have
received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle
is that senders rtx queue is not empty (let's say the block in
the rtx queue is at sequence number N - 4 * MSS [3]).
In this state, sender sends a retransmission from the rtx queue
with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3].
Receiver sees it and responds with an ACK all the way up to
N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA
because it has no recollection of ever sending data that far out [1].
And we are stuck.
The root cause is the mess of the xmit return codes. veth returns
an error when it can't xmit a frame. We end up with a loss event
like this:
-------------------------------------------------
| GSO super frame 1 | GSO super frame 2 |
|-----------------------------------------------|
| seg | seg | seg | seg | seg | seg | seg | seg |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
-------------------------------------------------
x ok ok <ok>| ok ok ok <x>
\\
snd_nxt
"x" means packet lost by veth, and "ok" means it went thru.
Since veth has TSO disabled in this test it sees individual segments.
Segment 1 is on the retransmit queue and will be resent.
So why did the sender not advance snd_nxt even tho it clearly did
send up to seg 8? tcp_write_xmit() interprets the return code
from the core to mean that data has not been sent at all. Since
TCP deals with GSO super frames, not individual segment the crux
of the problem is that loss of a single segment can be interpreted
as loss of all. TCP only sees the last return code for the last
segment of the GSO frame (in <> brackets in the diagram above).
Of course for the problem to occur we need a setup or a device
without a Qdisc. Otherwise Qdisc layer disconnects the protocol
layer from the device errors completely.
We have multiple ways to fix this.
1) make veth not return an error when it lost a packet.
While this is what I think we did in the past, the issue keeps
reappearing and it's annoying to debug. The game of whack
a mole is not great.
2) fix the damn return codes
We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the
documentation, so maybe we should make the return code from
ndo_start_xmit() a boolean. I like that the most, but perhaps
some ancient, not-really-networking protocol would suffer.
3) make TCP ignore the errors
It is not entirely clear to me what benefit TCP gets from
interpreting the result of ip_queue_xmit()? Specifically once
the connection is established and we're pushing data - packet
loss is just packet loss?
4) this fix
Ignore the rc in the Qdisc-less+GSO case, since it's unreliable.
We already always return OK in the TCQ_F_CAN_BYPASS case.
In the Qdisc-less case let's be a bit more conservative and only
mask the GSO errors. This path is taken by non-IP-"networks"
like CAN, MCTP etc, so we could regress some ancient thing.
This is the simplest, but also maybe the hackiest fix?
Similar fix has been proposed by Eric in the past but never committed
because original reporter was working with an OOT driver and wasn't
providing feedback (see Link).
Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com
Fixes: 1f59533f9ca5 ("qdisc: validate frames going through the direct_xmit path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3aa677625c8fad39989496c51bcff3872c1f16f1 ]
TIPC uses named table to store TIPC services represented by type and
instance. Each time an application calls TIPC API bind() to bind a
type/instance to a socket, an entry is created and inserted into the
named table. It looks like this:
named table:
key1, entry1 (type, instance ...)
key2, entry2 (type, instance ...)
In the above table, each entry represents a route for sending data
from one socket to the other. For all publications originated from
the same node, the key is UNIQUE to identify each entry.
It is calculated by this formula:
key = socket portid + number of bindings + 1 (1)
where:
- socket portid: unique and calculated by using linux kernel function
get_random_u32_below(). So, the value is randomized.
- number of bindings: the number of times a type/instance pair is bound
to a socket. This number is linearly increased,
starting from 0.
While the socket portid is unique and randomized by linux kernel, the
linear increment of "number of bindings" in formula (1) makes "key" not
unique anymore. For example:
- Socket 1 is created with its associated port number 20062001. Type 1000,
instance 1 is bound to socket 1:
key1: 20062001 + 0 + 1 = 20062002
Then, bind() is called a second time on Socket 1 to by the same type 1000,
instance 1:
key2: 20062001 + 1 + 1 = 20062003
Named table:
key1 (20062002), entry1 (1000, 1 ...)
key2 (20062003), entry2 (1000, 1 ...)
- Socket 2 is created with its associated port number 20062002. Type 1000,
instance 1 is bound to socket 2:
key3: 20062002 + 0 + 1 = 20062003
TIPC looks up the named table and finds out that key2 with the same value
already exists and rejects the insertion into the named table.
This leads to failure of bind() call from application on Socket 2 with error
message EINVAL "Invalid argument".
This commit fixes this issue by adding more port id checking to make sure
that the key is unique to publications originated from the same port id
and node.
Fixes: 218527fe27ad ("tipc: replace name table service range array with rb tree")
Signed-off-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260220050541.237962-1-tung.quang.nguyen@est.tech
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 138d7eca445ef37a0333425d269ee59900ca1104 ]
This adds a check for encryption key size upon receiving
L2CAP_LE_CONN_REQ which is required by L2CAP/LE/CFC/BV-15-C which
expects L2CAP_CR_LE_BAD_KEY_SIZE.
Link: https://lore.kernel.org/linux-bluetooth/5782243.rdbgypaU67@n9w6sw14/
Fixes: 27e2d4c8d28b ("Bluetooth: Add basic LE L2CAP connect request receiving support")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
L2CAP_ECRED_CONN_REQ
[ Upstream commit a8d1d73c81d1e70d2aa49fdaf59d933bb783ffe5 ]
Upon receiving L2CAP_ECRED_CONN_REQ the given MTU shall be checked
against the suggested MTU of the listening socket as that is required
by the likes of PTS L2CAP/ECFC/BV-27-C test which expects
L2CAP_CR_LE_UNACCEPT_PARAMS if the MTU is lowers than socket omtu.
In order to be able to set chan->omtu the code now allows setting
setsockopt(BT_SNDMTU), but it is only allowed when connection has not
been stablished since there is no procedure to reconfigure the output
MTU.
Link: https://github.com/bluez/bluez/issues/1895
Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 05761c2c2b5bfec85c47f60c903c461e9b56cf87 ]
Similar to 03dba9cea72f ("Bluetooth: L2CAP: Fix not responding with
L2CAP_CR_LE_ENCRYPTION") the result code L2CAP_CR_LE_ENCRYPTION shall
be used when BT_SECURITY_MEDIUM is set since that means security mode 2
which mean it doesn't require authentication which results in
qualification test L2CAP/ECFC/BV-32-C failing.
Link: https://github.com/bluez/bluez/issues/1871
Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7accb1c4321acb617faf934af59d928b0b047e2b ]
This fixes responding with an invalid result caused by checking the
wrong size of CID which should have been (cmd_len - sizeof(*req)) and
on top of it the wrong result was use L2CAP_CR_LE_INVALID_PARAMS which
is invalid/reserved for reconf when running test like L2CAP/ECFC/BI-03-C:
> ACL Data RX: Handle 64 flags 0x02 dlen 14
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
MTU: 64
MPS: 64
Source CID: 64
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reserved (0x000c)
Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003)
Fiix L2CAP/ECFC/BI-04-C which expects L2CAP_RECONF_INVALID_MPS (0x0002)
when more than one channel gets its MPS reduced:
> ACL Data RX: Handle 64 flags 0x02 dlen 16
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 8
MTU: 264
MPS: 99
Source CID: 64
! Source CID: 65
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reconfiguration successful (0x0000)
Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)
Fix L2CAP/ECFC/BI-05-C when SCID is invalid (85 unconnected):
> ACL Data RX: Handle 64 flags 0x02 dlen 14
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
MTU: 65
MPS: 64
! Source CID: 85
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reconfiguration successful (0x0000)
Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003)
Fix L2CAP/ECFC/BI-06-C when MPS < L2CAP_ECRED_MIN_MPS (64):
> ACL Data RX: Handle 64 flags 0x02 dlen 14
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
MTU: 672
! MPS: 63
Source CID: 64
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)
Result: Reconfiguration failed - other unacceptable parameters (0x0004)
Fix L2CAP/ECFC/BI-07-C when MPS reduced for more than one channel:
> ACL Data RX: Handle 64 flags 0x02 dlen 16
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 3 len 8
MTU: 84
! MPS: 71
Source CID: 64
! Source CID: 65
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reconfiguration successful (0x0000)
Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)
Link: https://github.com/bluez/bluez/issues/1865
Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c8d7f21ead727485ebf965e2b4d42d4a4f0840f6 ]
The IGTK key ID must be 4 or 5, but the code checks against
key ID + 1, so must check against 5/6 rather than 4/5. Fix
that.
Reported-by: Jouni Malinen <j@w1.fi>
Fixes: 08645126dd24 ("cfg80211: implement wext key handling")
Link: https://patch.msgid.link/20260209181220.362205-2-johannes@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1799d8abeabc68ec05679292aaf6cba93b343c05 ]
xfrm6_get_saddr() does not check the return value of
ipv6_dev_get_saddr(). When ipv6_dev_get_saddr() fails to find a suitable
source address (returns -EADDRNOTAVAIL), saddr->in6 is left
uninitialized, but xfrm6_get_saddr() still returns 0 (success).
This causes the caller xfrm_tmpl_resolve_one() to use the uninitialized
address in xfrm_state_find(), triggering KMSAN warning:
=====================================================
BUG: KMSAN: uninit-value in xfrm_state_find+0x2424/0xa940
xfrm_state_find+0x2424/0xa940
xfrm_resolve_and_create_bundle+0x906/0x5a20
xfrm_lookup_with_ifid+0xcc0/0x3770
xfrm_lookup_route+0x63/0x2b0
ip_route_output_flow+0x1ce/0x270
udp_sendmsg+0x2ce1/0x3400
inet_sendmsg+0x1ef/0x2a0
__sock_sendmsg+0x278/0x3d0
__sys_sendto+0x593/0x720
__x64_sys_sendto+0x130/0x200
x64_sys_call+0x332b/0x3e70
do_syscall_64+0xd3/0xf80
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Local variable tmp.i.i created at:
xfrm_resolve_and_create_bundle+0x3e3/0x5a20
xfrm_lookup_with_ifid+0xcc0/0x3770
=====================================================
Fix by checking the return value of ipv6_dev_get_saddr() and propagating
the error.
Fixes: a1e59abf8249 ("[XFRM]: Fix wildcard as tunnel source")
Reported-by: syzbot+e136d86d34b42399a8b1@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68bf1024.a70a0220.7a912.02c2.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ac431d597a9bdfc2ba6b314813f29a6ef2b4a3bf ]
When decoding the key, verify that the key material would fit into
a fixed-size buffer in process_auth_done() and generally has a sane
length.
The new CEPH_MAX_KEY_LEN check replaces the existing check for a key
with no key material which is a) not universal since CEPH_CRYPTO_NONE
has to be excluded and b) doesn't provide much value since a smaller
than needed key is just as invalid as no key -- this has to be handled
elsewhere anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b89fc7c2523b2b0750d91840f4e52521270d70ed ]
When canceling the reconnect worker, care must be taken to reset the
reconnect-pending bit. If the reconnect worker has not yet been
scheduled before it is canceled, the reconnect-pending bit will stay
on forever.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-6-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e25dbf561e03c0c5e36228e3b8b784392819ce85 ]
The gcc-16.0.1 snapshot produces a false-positive warning that turns
into a build failure with CONFIG_WERROR:
In file included from arch/x86/include/asm/string.h:6,
from net/vmw_vsock/vmci_transport.c:10:
In function 'vmci_transport_packet_init',
inlined from '__vmci_transport_send_control_pkt.constprop' at net/vmw_vsock/vmci_transport.c:198:2:
arch/x86/include/asm/string_32.h:150:25: error: argument 2 null where non-null expected because argument 3 is nonzero [-Werror=nonnull]
150 | #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy'
164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
| ^~~~~~
arch/x86/include/asm/string_32.h:150:25: note: in a call to built-in function '__builtin_memcpy'
net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy'
164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
| ^~~~~~
This seems relatively harmless, and it so far the only instance of this
warning I have found. The __vmci_transport_send_control_pkt function
is called either with wait=NULL or with one of the type values that
pass 'wait' into memcpy() here, but not from the same caller.
Replacing the memcpy with a struct assignment is otherwise the same
but avoids the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Bryan Tan <bryan-bt.tan@broadcom.com>
Link: https://patch.msgid.link/20260203163406.2636463-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 49d0901e260739de2fcc90c0c29f9e31e39a2d9b ]
hci_conn_enter_active_mode() uses queue_delayed_work() with the
intention that the work will run after the given timeout. However,
queue_delayed_work() does nothing if the work is already queued, so
depending on the link policy we may end up putting the connection
into idle mode every hdev->idle_timeout ms.
Use mod_delayed_work() instead so the work is queued if not already
queued, and the timeout is updated otherwise.
Signed-off-by: Stefan Sørensen <ssorensen@roku.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6e84fc395e90465f1418f582a9f7d53c87ab010e ]
syzbot reported that struct fib_alias.fa_state can be
modified locklessly by RCU readers. [0]
Let's use READ_ONCE()/WRITE_ONCE() properly.
[0]:
BUG: KCSAN: data-race in fib_table_lookup / fib_table_lookup
write to 0xffff88811b06a7fa of 1 bytes by task 4167 on cpu 0:
fib_alias_accessed net/ipv4/fib_lookup.h:32 [inline]
fib_table_lookup+0x361/0xd60 net/ipv4/fib_trie.c:1565
fib_lookup include/net/ip_fib.h:390 [inline]
ip_route_output_key_hash_rcu+0x378/0x1380 net/ipv4/route.c:2814
ip_route_output_key_hash net/ipv4/route.c:2705 [inline]
__ip_route_output_key include/net/route.h:169 [inline]
ip_route_output_flow+0x65/0x110 net/ipv4/route.c:2932
udp_sendmsg+0x13c3/0x15d0 net/ipv4/udp.c:1450
inet_sendmsg+0xac/0xd0 net/ipv4/af_inet.c:859
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x53a/0x600 net/socket.c:2592
___sys_sendmsg+0x195/0x1e0 net/socket.c:2646
__sys_sendmmsg+0x185/0x320 net/socket.c:2735
__do_sys_sendmmsg net/socket.c:2762 [inline]
__se_sys_sendmmsg net/socket.c:2759 [inline]
__x64_sys_sendmmsg+0x57/0x70 net/socket.c:2759
x64_sys_call+0x1e28/0x3000 arch/x86/include/generated/asm/syscalls_64.h:308
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
read to 0xffff88811b06a7fa of 1 bytes by task 4168 on cpu 1:
fib_alias_accessed net/ipv4/fib_lookup.h:31 [inline]
fib_table_lookup+0x338/0xd60 net/ipv4/fib_trie.c:1565
fib_lookup include/net/ip_fib.h:390 [inline]
ip_route_output_key_hash_rcu+0x378/0x1380 net/ipv4/route.c:2814
ip_route_output_key_hash net/ipv4/route.c:2705 [inline]
__ip_route_output_key include/net/route.h:169 [inline]
ip_route_output_flow+0x65/0x110 net/ipv4/route.c:2932
udp_sendmsg+0x13c3/0x15d0 net/ipv4/udp.c:1450
inet_sendmsg+0xac/0xd0 net/ipv4/af_inet.c:859
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x53a/0x600 net/socket.c:2592
___sys_sendmsg+0x195/0x1e0 net/socket.c:2646
__sys_sendmmsg+0x185/0x320 net/socket.c:2735
__do_sys_sendmmsg net/socket.c:2762 [inline]
__se_sys_sendmmsg net/socket.c:2759 [inline]
__x64_sys_sendmmsg+0x57/0x70 net/socket.c:2759
x64_sys_call+0x1e28/0x3000 arch/x86/include/generated/asm/syscalls_64.h:308
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0x00 -> 0x01
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 4168 Comm: syz.4.206 Not tainted syzkaller #0 PREEMPT(voluntary)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
Reported-by: syzbot+d24f940f770afda885cf@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69783ead.050a0220.c9109.0013.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260127043528.514160-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cbe41362be2c27e0237a94a404ae413cec9c2ad9 ]
Replace the BUG_ON() which never fired with a DEBUG_NET_WARN_ON_ONCE()
$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/2 grow/shrink: 1/1 up/down: 370/-254 (116)
Function old new delta
gro_try_pull_from_frag0 - 196 +196
napi_gro_frags 771 929 +158
__pfx_gro_try_pull_from_frag0 - 16 +16
__pfx_gro_pull_from_frag0 16 - -16
dev_gro_receive 1514 1464 -50
gro_pull_from_frag0 188 - -188
Total: Before=22565899, After=22566015, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ad22d24be635c6beab6a1fdd3f8b1f3c478d15da ]
RDS connections carry a state "rds_conn_path::cp_state"
and transitions from one state to another and are conditional
upon an expected state: "rds_conn_path_transition."
There is one exception to this conditionality, which is
"RDS_CONN_ERROR" that can be enforced by "rds_conn_path_drop"
regardless of what state the condition is currently in.
But as soon as a connection enters state "RDS_CONN_ERROR",
the connection handling code expects it to go through the
shutdown-path.
The RDS/TCP multipath changes added a shortcut out of
"RDS_CONN_ERROR" straight back to "RDS_CONN_CONNECTING"
via "rds_tcp_accept_one_path" (e.g. after "rds_tcp_state_change").
A subsequent "rds_tcp_reset_callbacks" can then transition
the state to "RDS_CONN_RESETTING" with a shutdown-worker queued.
That'll trip up "rds_conn_init_shutdown", which was
never adjusted to handle "RDS_CONN_RESETTING" and subsequently
drops the connection with the dreaded "DR_INV_CONN_STATE",
which leaves "RDS_SHUTDOWN_WORK_QUEUED" on forever.
So we do two things here:
a) Don't shortcut "RDS_CONN_ERROR", but take the longer
path through the shutdown code.
b) Add "RDS_CONN_RESETTING" to the expected states in
"rds_conn_init_shutdown" so that we won't error out
and get stuck, if we ever hit weird state transitions
like this again."
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 735ee8582da3d239eb0c7a53adca61b79fb228b3 ]
Quoting reporter:
In net/netfilter/xt_tcpmss.c (lines 53-68), the TCP option parser reads
op[i+1] directly without validating the remaining option length.
If the last byte of the option field is not EOL/NOP (0/1), the code attempts
to index op[i+1]. In the case where i + 1 == optlen, this causes an
out-of-bounds read, accessing memory past the optlen boundary
(either reading beyond the stack buffer _opt or the
following payload).
Reported-by: sungzii <sungzii@pm.me>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8a49fc8d8a3e83dc51ec05bcd4007bdea3c56eec ]
The upstream commit, 71d8c47fc653711c41bc3282e5b0e605b3727956
("netfilter: conntrack: introduce clash resolution on insertion race"),
sets allow_clash=true in the UDP/UDPLITE protocol handler
but does not set it in the generic protocol handler.
As a result, packets composed of connectionless protocols at each layer,
such as UDP over IP-in-IP, still drop packets due to conflicts during conntrack insertion.
To resolve this, this patch sets allow_clash in the nf_conntrack_l4proto_generic.
Signed-off-by: Yuto Hamaguchi <Hamaguchi.Yuto@da.MitsubishiElectric.co.jp>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 978b67d28358b0b4eacfa94453d1ad4e09b123ad ]
Following four sysctls can change under us, add missing READ_ONCE().
- ipv6.sysctl.max_dst_opts_len
- ipv6.sysctl.max_dst_opts_cnt
- ipv6.sysctl.max_hbh_opts_len
- ipv6.sysctl.max_hbh_opts_cnt
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260115094141.3124990-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 034bbd806298e9ba4197dd1587b0348ee30996ea ]
Following expression can overflow
if sysctl_icmp_msgs_per_sec is big enough.
sysctl_icmp_msgs_per_sec * delta / HZ;
Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260216142832.3834174-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f17bf505ff89595df5147755e51441632a5dc563 ]
Previous patch made ICMP rate limits per netns, it makes sense
to allow each netns to change the associated sysctl.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240829144641.3880376-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 034bbd806298 ("icmp: prevent possible overflow in icmp_global_allow()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b056b4cd9178f7a1d5d57f7b48b073c29729ddaa ]
Host wide ICMP ratelimiter should be per netns, to provide better isolation.
Following patch in this series makes the sysctl per netns.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20240829144641.3880376-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 034bbd806298 ("icmp: prevent possible overflow in icmp_global_allow()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ad5dfde2a5733aaf652ea3e40c8c5e071e935901 ]
isk->inet_num, isk->inet_rcv_saddr and sk->sk_bound_dev_if
are read locklessly in ping_lookup().
Add READ_ONCE()/WRITE_ONCE() annotations.
The race on isk->inet_rcv_saddr is probably coming from IPv6 support,
but does not deserve a specific backport.
Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260216100149.3319315-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 59f26d86b2a16f1406f3b42025062b6d1fba5dd5 ]
We need to check socket netns before considering them in ping_get_port().
Otherwise, one malicious netns could 'consume' all ports.
Add corresponding check in ping_lookup().
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20250829153054.474201-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: ad5dfde2a573 ("ping: annotate data-races in ping_lookup()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f1b5dfe63f6a9eb17948cbaee4da4b66f51ac794 ]
Since introduced in commit c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP
socket kind"), ping socket does not use SLAB_TYPESAFE_BY_RCU nor check
nulls marker in loops.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: ad5dfde2a573 ("ping: annotate data-races in ping_lookup()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 71e99ee20fc3f662555118cf1159443250647533 ]
nf_tables_addchain() publishes the chain to table->chains via
list_add_tail_rcu() (in nft_chain_add()) before registering hooks.
If nf_tables_register_hook() then fails, the error path calls
nft_chain_del() (list_del_rcu()) followed by nf_tables_chain_destroy()
with no RCU grace period in between.
This creates two use-after-free conditions:
1) Control-plane: nf_tables_dump_chains() traverses table->chains
under rcu_read_lock(). A concurrent dump can still be walking
the chain when the error path frees it.
2) Packet path: for NFPROTO_INET, nf_register_net_hook() briefly
installs the IPv4 hook before IPv6 registration fails. Packets
entering nft_do_chain() via the transient IPv4 hook can still be
dereferencing chain->blob_gen_X when the error path frees the
chain.
Add synchronize_rcu() between nft_chain_del() and the chain destroy
so that all RCU readers -- both dump threads and in-flight packet
evaluation -- have finished before the chain is freed.
Fixes: 91c7b38dc9f0 ("netfilter: nf_tables: use new transaction infrastructure to handle chain")
Signed-off-by: Inseo An <y0un9sa@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 008e7a7c293b30bc43e4368dac6ea3808b75a572 ]
Although unlikely, recent support for IPIP tunnels increases chances of
reaching this WARN_ON_ONCE if userspace manages to build a sufficiently
long forward path.
Remove it.
Fixes: ddb94eafab8b ("net: resolve forwarding path from virtual netdevice and HW destination address")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a6d28eb8efe96b3e35c92efdf1bfacb0cccf541f ]
Mihail Milev reports: Error: UNINIT (CWE-457):
net/netfilter/nf_conntrack_h323_main.c:1189:2: var_decl:
Declaring variable "tuple" without initializer.
net/netfilter/nf_conntrack_h323_main.c:1197:2:
uninit_use_in_call: Using uninitialized value "tuple.src.l3num" when calling "__nf_ct_expect_find".
net/netfilter/nf_conntrack_expect.c:142:2:
read_value: Reading value "tuple->src.l3num" when calling "nf_ct_expect_dst_hash".
1195| tuple.dst.protonum = IPPROTO_TCP;
1196|
1197|-> exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple);
1198| if (exp && exp->master == ct)
1199| return exp;
Switch this to a C99 initialiser and set the l3num value.
Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit da29e453dcb3aa7cabead7915f5f945d0add3a52 ]
Commit 3db6e0d172c9 ("rds: use RCU to synchronize work-enqueue with
connection teardown") modifies rds_sendmsg to avoid enqueueing work
while a tear down is in progress. However, it also changed the return
value of rds_sendmsg to that of rds_send_xmit instead of the
payload_len. This means the user may incorrectly receive errno values
when it should have simply received a payload of 0 while the peer
attempts a reconnections. So this patch corrects the teardown handling
code to only use the out error path in that case, thus restoring the
original payload_len return value.
Fixes: 3db6e0d172c9 ("rds: use RCU to synchronize work-enqueue with connection teardown")
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260213035409.1963391-1-achender@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit dd2fdc3504592d85e549c523b054898a036a6afe upstream.
Commit 5940d1cf9f42 ("SUNRPC: Rebalance a kref in auth_gss.c") added
a kref_get(&gss_auth->kref) call to balance the gss_put_auth() done
in gss_release_msg(), but forgot to add a corresponding kref_put()
on the error path when kstrdup_const() fails.
If service_name is non-NULL and kstrdup_const() fails, the function
jumps to err_put_pipe_version which calls put_pipe_version() and
kfree(gss_msg), but never releases the gss_auth reference. This leads
to a kref leak where the gss_auth structure is never freed.
Add a forward declaration for gss_free_callback() and call kref_put()
in the err_put_pipe_version error path to properly release the
reference taken earlier.
Fixes: 5940d1cf9f42 ("SUNRPC: Rebalance a kref in auth_gss.c")
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Hodges <git@danielhodges.dev>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3e6397b056335cc56ef0e9da36c95946a19f5118 upstream.
The gssx_dec_ctx(), gssx_dec_status(), and gssx_dec_name()
functions allocate memory via gssx_dec_buffer(), which calls
kmemdup(). When a subsequent decode operation fails, these
functions return immediately without freeing previously
allocated buffers, causing memory leaks.
The leak in gssx_dec_ctx() is particularly relevant because
the caller (gssp_accept_sec_context_upcall) initializes several
buffer length fields to non-zero values, resulting in memory
allocation:
struct gssx_ctx rctxh = {
.exported_context_token.len = GSSX_max_output_handle_sz,
.mech.len = GSS_OID_MAX_LEN,
.src_name.display_name.len = GSSX_max_princ_sz,
.targ_name.display_name.len = GSSX_max_princ_sz
};
If, for example, gssx_dec_name() succeeds for src_name but
fails for targ_name, the memory allocated for
exported_context_token, mech, and src_name.display_name
remains unreferenced and cannot be reclaimed.
Add error handling with goto-based cleanup to free any
previously allocated buffers before returning an error.
Reported-by: Xingjing Deng <micro6947@gmail.com>
Closes: https://lore.kernel.org/linux-nfs/CAK+ZN9qttsFDu6h1FoqGadXjMx1QXqPMoYQ=6O9RY4SxVTvKng@mail.gmail.com/
Fixes: 1d658336b05f ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit afcae7d7b8a278a6c29e064f99e5bafd4ac1fb37 ]
svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the
number of rdma_rw contexts (ctxts). This value is used to allocate the
Send CQ and to initialize the sc_sq_avail credit pool.
However, when the device uses memory registration for RDMA operations,
rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three
per context to account for REG and INV work requests. The Send CQ and
credit pool remain sized for only one work request per context,
causing Send Queue exhaustion under heavy NFS WRITE workloads.
Introduce rdma_rw_max_sge() to compute the actual number of Send Queue
entries required for a given number of rdma_rw contexts. Upper layer
protocols call this helper before creating a Queue Pair so that their
Send CQs and credit accounting match the QP's true capacity.
Update svc_rdma_accept() to use rdma_rw_max_sge() when computing
sc_sq_depth, ensuring the credit pool reflects the work requests
that rdma_rw_init_qp() will reserve.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 59243315890578a040a2d50ae9e001a2ef2fcb62 ]
There is an upper bound on the number of rdma_rw contexts that can
be created per QP.
This invisible upper bound is because rdma_create_qp() adds one or
more additional SQEs for each ctxt that the ULP requests via
qp_attr.cap.max_rdma_ctxs. The QP's actual Send Queue length is on
the order of the sum of qp_attr.cap.max_send_wr and a factor times
qp_attr.cap.max_rdma_ctxs. The factor can be up to three, depending
on whether MR operations are required before RDMA Reads.
This limit is not visible to RDMA consumers via dev->attrs. When the
limit is surpassed, QP creation fails with -ENOMEM. For example:
svcrdma's estimate of the number of rdma_rw contexts it needs is
three times the number of pages in RPCSVC_MAXPAGES. When MAXPAGES
is about 260, the internally-computed SQ length should be:
64 credits + 10 backlog + 3 * (3 * 260) = 2414
Which is well below the advertised qp_max_wr of 32768.
If RPCSVC_MAXPAGES is increased to 4MB, that's 1040 pages:
64 credits + 10 backlog + 3 * (3 * 1040) = 9434
However, QP creation fails. Dynamic printk for mlx5 shows:
calc_sq_size:618:(pid 1514): send queue size (9326 * 256 / 64 -> 65536) exceeds limits(32768)
Although 9326 is still far below qp_max_wr, QP creation still
fails.
Because the total SQ length calculation is opaque to RDMA consumers,
there doesn't seem to be much that can be done about this except for
consumers to try to keep the requested rdma_rw ctxt count low.
Fixes: 2da0f610e733 ("svcrdma: Increase the per-transport rw_ctx count")
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Stable-dep-of: afcae7d7b8a2 ("RDMA/core: add rdma_rw_max_sge() helper for SQ sizing")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2da0f610e733606e06284ac3c1f188b9dec75d68 ]
rdma_rw_mr_factor() returns the smallest number of MRs needed to
move a particular number of pages. svcrdma currently asks for the
number of MRs needed to move RPCSVC_MAXPAGES (a little over one
megabyte), as that is the number of pages in the largest r/wsize
the server supports.
This call assumes that the client's NIC can bundle a full one
megabyte payload in a single rdma_segment. In fact, most NICs cannot
handle a full megabyte with a single rkey / rdma_segment. Clients
will typically split even a single Read chunk into many segments.
The server needs one MR to read each rdma_segment in a Read chunk,
and thus each one needs an rw_ctx.
svcrdma has been vastly underestimating the number of rw_ctxs needed
to handle 64 RPC requests with large Read chunks using small
rdma_segments.
Unfortunately there doesn't seem to be a good way to estimate this
number without knowing the client NIC's capabilities. Even then,
the client RPC/RDMA implementation is still free to split a chunk
into smaller segments (for example, it might be using physical
registration, which needs an rdma_segment per page).
The best we can do for now is choose a number that will guarantee
forward progress in the worst case (one page per segment).
At some later point, we could add some mechanisms to make this
much less of a problem:
- Add a core API to add more rw_ctxs to an already-established QP
- svcrdma could treat rw_ctx exhaustion as a temporary error and
try again
- Limit the number of Reads in flight
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Stable-dep-of: afcae7d7b8a2 ("RDMA/core: add rdma_rw_max_sge() helper for SQ sizing")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fc2e69db82c1ac506cd7f539a3ab66d51d3380dc ]
The comment that starts "Qualify ..." applies to only some of the
following code paragraph. Re-arrange the lines so the comment makes
more sense.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Stable-dep-of: afcae7d7b8a2 ("RDMA/core: add rdma_rw_max_sge() helper for SQ sizing")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b918bfcf370c92ea3b82fa9bb3d017702b5fa4cb ]
These won't have much diagnostic value for site administrators.
Since they can't be disabled, they become noise.
What's more, the subsequent rdma_create_qp() call adjusts the Send
Queue size (possibly downward) without warning, making the size
reported by these pr_warns inaccurate.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Stable-dep-of: afcae7d7b8a2 ("RDMA/core: add rdma_rw_max_sge() helper for SQ sizing")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 81b84de32bb27ae1ae2eb9acf0420e9d0d14bf00 ]
icmp_route_lookup() performs multiple route lookups to find a suitable
route for sending ICMP error messages, with special handling for XFRM
(IPsec) policies.
The lookup sequence is:
1. First, lookup output route for ICMP reply (dst = original src)
2. Pass through xfrm_lookup() for policy check
3. If blocked (-EPERM) or dst is not local, enter "reverse path"
4. In reverse path, call xfrm_decode_session_reverse() to get fl4_dec
which reverses the original packet's flow (saddr<->daddr swapped)
5. If fl4_dec.saddr is local (we are the original destination), use
__ip_route_output_key() for output route lookup
6. If fl4_dec.saddr is NOT local (we are a forwarding node), use
ip_route_input() to simulate the reverse packet's input path
7. Finally, pass rt2 through xfrm_lookup() with XFRM_LOOKUP_ICMP flag
The bug occurs in step 6: ip_route_input() is called with fl4_dec.daddr
(original packet's source) as destination. If this address becomes local
between the initial check and ip_route_input() call (e.g., due to
concurrent "ip addr add"), ip_route_input() returns a LOCAL route with
dst.output set to ip_rt_bug.
This route is then used for ICMP output, causing dst_output() to call
ip_rt_bug(), triggering a WARN_ON:
------------[ cut here ]------------
WARNING: net/ipv4/route.c:1275 at ip_rt_bug+0x21/0x30, CPU#1
Call Trace:
<TASK>
ip_push_pending_frames+0x202/0x240
icmp_push_reply+0x30d/0x430
__icmp_send+0x1149/0x24f0
ip_options_compile+0xa2/0xd0
ip_rcv_finish_core+0x829/0x1950
ip_rcv+0x2d7/0x420
__netif_receive_skb_one_core+0x185/0x1f0
netif_receive_skb+0x90/0x450
tun_get_user+0x3413/0x3fb0
tun_chr_write_iter+0xe4/0x220
...
Fix this by checking rt2->rt_type after ip_route_input(). If it's
RTN_LOCAL, the route cannot be used for output, so treat it as an error.
The reproducer requires kernel modification to widen the race window,
making it unsuitable as a selftest. It is available at:
https://gist.github.com/mrpre/eae853b72ac6a750f5d45d64ddac1e81
Reported-by: syzbot+e738404dcd14b620923c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000b1060905eada8881@google.com/T/
Closes: https://lore.kernel.org/r/20260128090523.356953-1-jiayuan.chen@linux.dev
Fixes: 8b7817f3a959 ("[IPSEC]: Add ICMP host relookup support")
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260206050220.59642-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e97e6a1830ddb5885ba312e56b6fa3aa39b5f47e ]
Going forward skb_dst_set will assert that skb dst_entry
is empty during skb_dst_set. skb_dstref_steal is added to reset
existing entry without doing refcnt. skb_dstref_restore should
be used to restore the previous entry. Convert icmp_route_lookup
and ip_options_rcv_srr to these helpers. Add extra call to
skb_dstref_reset to icmp_route_lookup to clear the ip_route_input
entry.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250818154032.3173645-5-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 81b84de32bb2 ("xfrm: fix ip_rt_bug race in icmp_route_lookup reverse path")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ae88a5d2f29b69819dc7b04086734439d074a643 ]
Reproducer available at [1].
The ATM send path (sendmsg -> vcc_sendmsg -> sigd_send) reads the vcc
pointer from msg->vcc and uses it directly without any validation. This
pointer comes from userspace via sendmsg() and can be arbitrarily forged:
int fd = socket(AF_ATMSVC, SOCK_DGRAM, 0);
ioctl(fd, ATMSIGD_CTRL); // become ATM signaling daemon
struct msghdr msg = { .msg_iov = &iov, ... };
*(unsigned long *)(buf + 4) = 0xdeadbeef; // fake vcc pointer
sendmsg(fd, &msg, 0); // kernel dereferences 0xdeadbeef
In normal operation, the kernel sends the vcc pointer to the signaling
daemon via sigd_enq() when processing operations like connect(), bind(),
or listen(). The daemon is expected to return the same pointer when
responding. However, a malicious daemon can send arbitrary pointer values.
Fix this by introducing find_get_vcc() which validates the pointer by
searching through vcc_hash (similar to how sigd_close() iterates over
all VCCs), and acquires a reference via sock_hold() if found.
Since struct atm_vcc embeds struct sock as its first member, they share
the same lifetime. Therefore using sock_hold/sock_put is sufficient to
keep the vcc alive while it is being used.
Note that there may be a race with sigd_close() which could mark the vcc
with various flags (e.g., ATM_VF_RELEASED) after find_get_vcc() returns.
However, sock_hold() guarantees the memory remains valid, so this race
only affects the logical state, not memory safety.
[1]: https://gist.github.com/mrpre/1ba5949c45529c511152e2f4c755b0f3
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+1f22cb1769f249df9fa0@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69039850.a70a0220.5b2ed.005d.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260205095501.131890-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4780ec142cbb24b794129d3080eee5cac2943ffc ]
Userspace provides an optimized representation in case intervals are
adjacent, where the end element is omitted.
The existing partial overlap detection logic skips anonymous set checks
on start elements for this reason.
However, it is possible to add intervals that overlap to this anonymous
where two start elements with the same, eg. A-B, A-C where C < B.
start end
A B
start end
A C
Restore the check on overlapping start elements to report an overlap.
Fixes: c9e6978e2725 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1e13f27e0675552161ab1778be9a23a636dde8a7 ]
nft_counter_reset() calls u64_stats_add() with a negative value to reset
the counter. This will work on 64bit archs, hence the negative value
added will wrap as a 64bit value which then can wrap the stat counter as
well.
On 32bit archs, the added negative value will wrap as a 32bit value and
_not_ wrapping the stat counter properly. In most cases, this would just
lead to a very large 32bit value being added to the stat counter.
Fix by introducing u64_stats_sub().
Fixes: 4a1d3acd6ea8 ("netfilter: nft_counter: Use u64_stats_t for statistic.")
Signed-off-by: Anders Grahn <anders.grahn@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2f635adbe2642d398a0be3ab245accd2987be0c3 ]
tests/shell/testcases/packetpath/set_match_nomatch_hash_fast
fails on big endian with:
Error: Could not process rule: No such file or directory
reset element ip test s { 244.147.90.126 }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Fatal: Cannot fetch element "244.147.90.126"
... because the wrong bucket is searched, jhash() and jhash1_word are
not interchangeable on big endian.
Fixes: 3b02b0adc242 ("netfilter: nft_set_hash: fix lookups with fixed size hash on big endian")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c9efde1e537baed7648a94022b43836a348a074f ]
llc_shdlc_deinit() purges SHDLC skb queues and frees the llc_shdlc
structure while its timers and state machine work may still be active.
Timer callbacks can schedule sm_work, and sm_work accesses SHDLC state
and the skb queues. If teardown happens in parallel with a queued/running
work item, it can lead to UAF and other shutdown races.
Stop all SHDLC timers and cancel sm_work synchronously before purging the
queues and freeing the context.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 4a61cd6687fc ("NFC: Add an shdlc llc module to llc core")
Signed-off-by: Votokina Victoria <Victoria.Votokina@kaspersky.com>
Link: https://patch.msgid.link/20260203113158.2008723-1-Victoria.Votokina@kaspersky.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 838eb9687691d29915797a885b861fd09353386e ]
tcp_tx_timestamp() is only called at the end of tcp_sendmsg_locked()
before the final tcp_push().
By the time it is called, it is possible all the copied data
has been sent already (transmit queue is empty).
If this is the case, use the last skb in the rtx queue.
Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260127123828.4098577-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit de8a70cefcb26cdceaafdc5ac144712681419c29 ]
Since commit be102eb6a0e7 ("netfilter: nf_conncount: rework API to use
sk_buff directly"), we skip the adding and trigger a GC when the ct is
confirmed. For connections originated from local to local it doesn't
work because the connection is confirmed on POSTROUTING, therefore
tracking on the INPUT hook is always skipped.
In order to fix this, we check whether skb input ifindex is set to
loopback ifindex. If it is then we fallback on a GC plus track operation
skipping the optimization. This fallback is necessary to avoid
duplicated tracking of a packet train e.g 10 UDP datagrams sent on a
burst when initiating the connection.
Tested with xt_connlimit/nft_connlimit and OVS limit and with a HTTP
server and iperf3 on UDP mode.
Fixes: be102eb6a0e7 ("netfilter: nf_conncount: rework API to use sk_buff directly")
Reported-by: Michal Slabihoudek <michal.slabihoudek@gooddata.com>
Closes: https://lore.kernel.org/netfilter/6989BD9F-8C24-4397-9AD7-4613B28BF0DB@gooddata.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cda26c645946b08f070f20c166d4736767e4a805 ]
As far as I can see nothing bad can happen when NFTA_TARGET/MATCH_NAME
are too large because this calls x_tables helpers which check for the
length, but it seems better to already reject it during netlink parsing.
Rest of the changes avoid silent u8/u16 truncations.
For _TYPE, its expected to be only 1 or 0. In x_tables world, this
variable is set by kernel, for IPT_SO_GET_REVISION_TARGET its 1, for
all others its set to 0.
As older versions of nf_tables permitted any value except 1 to mean 'match',
keep this as-is but sanitize the value for consistency.
Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables")
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|