summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-02-13i40e: Do not allow untrusted VF to remove administratively set MACIvan Vecera1-5/+33
Currently when PF administratively sets VF's MAC address and the VF is put down (VF tries to delete all MACs) then the MAC is removed from MAC filters and primary VF MAC is zeroed. Do not allow untrusted VF to remove primary MAC when it was set administratively by PF. Reproducer: 1) Create VF 2) Set VF interface up 3) Administratively set the VF's MAC 4) Put VF interface down [root@host ~]# echo 1 > /sys/class/net/enp2s0f0/device/sriov_numvfs [root@host ~]# ip link set enp2s0f0v0 up [root@host ~]# ip link set enp2s0f0 vf 0 mac fe:6c:b5:da:c7:7d [root@host ~]# ip link show enp2s0f0 23: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether 3c:ec:ef:b7:dd:04 brd ff:ff:ff:ff:ff:ff vf 0 link/ether fe:6c:b5:da:c7:7d brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off [root@host ~]# ip link set enp2s0f0v0 down [root@host ~]# ip link show enp2s0f0 23: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether 3c:ec:ef:b7:dd:04 brd ff:ff:ff:ff:ff:ff vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off Fixes: 700bbf6c1f9e ("i40e: allow VF to remove any MAC filter") Fixes: ceb29474bbbc ("i40e: Add support for VF to specify its primary MAC address") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20240208180335.1844996-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-12net: sysfs: Fix /sys/class/net/<iface> path for statisticsBreno Leitao1-24/+24
The Documentation/ABI/testing/sysfs-class-net-statistics documentation is pointing to the wrong path for the interface. Documentation is pointing to /sys/class/<iface>, instead of /sys/class/net/<iface>. Fix it by adding the `net/` directory before the interface. Fixes: 6044f9700645 ("net: sysfs: document /sys/class/net/statistics/*") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12Merge branch 'mptcp-misc-fixes'David S. Miller8-62/+95
Matthieu Baerts says: ==================== mptcp: locking cleanup & misc. fixes Patches 1-4 are fixes for issues found by Paolo while working on adding TCP_NOTSENT_LOWAT support. The latter will need to track more states under the msk data lock. Since the locking msk locking schema is already quite complex, do a long awaited clean-up step by moving several confusing lockless initialization under the relevant locks. Note that it is unlikely a real race could happen even prior to such patches as the MPTCP-level state machine implicitly ensures proper serialization of the write accesses, even lacking explicit lock. But still, simplification is welcome and this will help for the maintenance. This can be backported up to v5.6. Patch 5 is a fix for the userspace PM, not to add new local address entries if the address is already in the list. This behaviour can be seen since v5.19. Patch 6 fixes an issue when Fastopen is used. The issue can happen since v6.2. A previous fix has already been applied, but not taking care of all cases according to syzbot. Patch 7 updates Geliang's email address in the MAINTAINERS file. ==================== Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12MAINTAINERS: update Geliang's email addressGeliang Tang2-5/+6
Update my email-address in MAINTAINERS and .mailmap entries to my kernel.org account. Suggested-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: really cope with fastopen racePaolo Abeni1-1/+2
Fastopen and PM-trigger subflow shutdown can race, as reported by syzkaller. In my first attempt to close such race, I missed the fact that the subflow status can change again before the subflow_state_change callback is invoked. Address the issue additionally copying with all the states directly reachable from TCP_FIN_WAIT1. Fixes: 1e777f39b4d7 ("mptcp: add MSG_FASTOPEN sendmsg flag support") Fixes: 4fd19a307016 ("mptcp: fix inconsistent state on fastopen race") Cc: stable@vger.kernel.org Reported-by: syzbot+c53d4d3ddb327e80bc51@syzkaller.appspotmail.com Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/458 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: check addrs list in userspace_pm_get_local_idGeliang Tang1-1/+12
Before adding a new entry in mptcp_userspace_pm_get_local_id(), it's better to check whether this address is already in userspace pm local address list. If it's in the list, no need to add a new entry, just return it's address ID and use this address. Fixes: 8b20137012d9 ("mptcp: read attributes of addr entries managed by userspace PMs") Cc: stable@vger.kernel.org Signed-off-by: Geliang Tang <geliang.tang@linux.dev> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: corner case locking for rx path fields initializationPaolo Abeni5-39/+50
Most MPTCP-level related fields are under the mptcp data lock protection, but are written one-off without such lock at MPC complete time, both for the client and the server Leverage the mptcp_propagate_state() infrastructure to move such initialization under the proper lock client-wise. The server side critical init steps are done by mptcp_subflow_fully_established(): ensure the caller properly held the relevant lock, and avoid acquiring the same lock in the nested scopes. There are no real potential races, as write access to such fields is implicitly serialized by the MPTCP state machine; the primary goal is consistency. Fixes: d22f4988ffec ("mptcp: process MP_CAPABLE data option") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: fix more tx path fields initializationPaolo Abeni2-6/+13
The 'msk->write_seq' and 'msk->snd_nxt' are always updated under the msk socket lock, except at MPC handshake completiont time. Builds-up on the previous commit to move such init under the relevant lock. There are no known problems caused by the potential race, the primary goal is consistency. Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: fix rcv space initializationPaolo Abeni3-7/+10
mptcp_rcv_space_init() is supposed to happen under the msk socket lock, but active msk socket does that without such protection. Leverage the existing mptcp_propagate_state() helper to that extent. We need to ensure mptcp_rcv_space_init will happen before mptcp_rcv_space_adjust(), and the release_cb does not assure that: explicitly check for such condition. While at it, move the wnd_end initialization out of mptcp_rcv_space_init(), it never belonged there. Note that the race does not produce ill effect in practice, but change allows cleaning-up and defying better the locking model. Fixes: a6b118febbab ("mptcp: add receive buffer auto-tuning") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12mptcp: drop the push_pending fieldPaolo Abeni2-7/+6
Such field is there to avoid acquiring the data lock in a few spots, but it adds complexity to the already non trivial locking schema. All the relevant call sites (mptcp-level re-injection, set socket options), are slow-path, drop such field in favor of 'cb_flags', adding the relevant locking. This patch could be seen as an improvement, instead of a fix. But it simplifies the next patch. The 'Fixes' tag has been added to help having this series backported to stable. Fixes: e9d09baca676 ("mptcp: avoid atomic bit manipulation when possible") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12Merge branch 'net-misplaced-fields'David S. Miller6-14/+16
Eric Dumazet says: ==================== net: more three misplaced fields We recently reorganized some structures for better data locality in networking fast paths. This series moves three fields that were not correctly classified. There probably more to come. Reference : https://lwn.net/Articles/951321/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12net-device: move lstats in net_device_read_txrxEric Dumazet3-8/+9
dev->lstats is notably used from loopback ndo_start_xmit() and other virtual drivers. Per cpu stats updates are dirtying per-cpu data, but the pointer itself is read-only. Fixes: 43a71cd66b9c ("net-device: reorganize net_device fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Simon Horman <horms@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12tcp: move tp->tcp_usec_ts to tcp_sock_read_txrx groupEric Dumazet2-3/+3
tp->tcp_usec_ts is a read mostly field, used in rx and tx fast paths. Fixes: d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Wei Wang <weiwan@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-12tcp: move tp->scaling_ratio to tcp_sock_read_txrx groupEric Dumazet3-3/+4
tp->scaling_ratio is a read mostly field, used in rx and tx fast paths. Fixes: d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Wei Wang <weiwan@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-11Merge branch 'tls-fixes'David S. Miller3-82/+66
Jakub Kicinski says: ==================== net: tls: fix some issues with async encryption valis was reporting a race on socket close so I sat down to try to fix it. I used Sabrina's async crypto debug patch to test... and in the process run into some of the same issues, and created very similar fixes :( I didn't realize how many of those patches weren't applied. Once I found Sabrina's code [1] it turned out to be so similar in fact that I added her S-o-b's and Co-develop'eds in a semi-haphazard way. With this series in place all expected tests pass with async crypto. Sabrina had a few more fixes, but I'll leave those to her, things are not crashing anymore. [1] https://lore.kernel.org/netdev/cover.1694018970.git.sd@queasysnail.net/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-11net: tls: fix returned read length with async decryptJakub Kicinski1-1/+0
We double count async, non-zc rx data. The previous fix was lucky because if we fully zc async_copy_bytes is 0 so we add 0. Decrypted already has all the bytes we handled, in all cases. We don't have to adjust anything, delete the erroneous line. Fixes: 4d42cd6bc2ac ("tls: rx: fix return value for async crypto") Co-developed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-11selftests: tls: use exact comparison in recv_partialJakub Kicinski1-4/+4
This exact case was fail for async crypto and we weren't catching it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-11net: tls: fix use-after-free with partial reads and async decryptSabrina Dubroca1-2/+3
tls_decrypt_sg doesn't take a reference on the pages from clear_skb, so the put_page() in tls_decrypt_done releases them, and we trigger a use-after-free in process_rx_list when we try to read from the partially-read skb. Fixes: fd31f3996af2 ("tls: rx: decrypt into a fresh skb") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-11net: tls: handle backlogging of crypto requestsJakub Kicinski1-0/+22
Since we're setting the CRYPTO_TFM_REQ_MAY_BACKLOG flag on our requests to the crypto API, crypto_aead_{encrypt,decrypt} can return -EBUSY instead of -EINPROGRESS in valid situations. For example, when the cryptd queue for AESNI is full (easy to trigger with an artificially low cryptd.cryptd_max_cpu_qlen), requests will be enqueued to the backlog but still processed. In that case, the async callback will also be called twice: first with err == -EINPROGRESS, which it seems we can just ignore, then with err == 0. Compared to Sabrina's original patch this version uses the new tls_*crypt_async_wait() helpers and converts the EBUSY to EINPROGRESS to avoid having to modify all the error handling paths. The handling is identical. Fixes: a54667f6728c ("tls: Add support for encryption using async offload accelerator") Fixes: 94524d8fc965 ("net/tls: Add support for async decryption of tls records") Co-developed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/netdev/9681d1febfec295449a62300938ed2ae66983f28.1694018970.git.sd@queasysnail.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-11tls: fix race between tx work scheduling and socket closeJakub Kicinski1-10/+6
Similarly to previous commit, the submitting thread (recvmsg/sendmsg) may exit as soon as the async crypto handler calls complete(). Reorder scheduling the work before calling complete(). This seems more logical in the first place, as it's the inverse order of what the submitting thread will do. Reported-by: valis <sec@valis.email> Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-11tls: fix race between async notify and socket closeJakub Kicinski2-38/+10
The submitting thread (one which called recvmsg/sendmsg) may exit as soon as the async crypto handler calls complete() so any code past that point risks touching already freed data. Try to avoid the locking and extra flags altogether. Have the main thread hold an extra reference, this way we can depend solely on the atomic ref counter for synchronization. Don't futz with reiniting the completion, either, we are now tightly controlling when completion fires. Reported-by: valis <sec@valis.email> Fixes: 0cada33241d9 ("net/tls: fix race condition causing kernel panic") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-11net: tls: factor out tls_*crypt_async_wait()Jakub Kicinski1-51/+45
Factor out waiting for async encrypt and decrypt to finish. There are already multiple copies and a subsequent fix will need more. No functional changes. Note that crypto_wait_req() returns wait->err Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-10Merge branch 'net-fix-module_description-for-net-p5'Jakub Kicinski29-0/+29
Breno Leitao says: ==================== net: Fix MODULE_DESCRIPTION() for net (p5) There are hundreds of network modules that misses MODULE_DESCRIPTION(), causing a warning when compiling with W=1. Example: WARNING: modpost: missing MODULE_DESCRIPTION() in net/sched/em_cmp.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/sched/em_nbyte.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/sched/em_u32.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/sched/em_meta.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/sched/em_text.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/sched/em_canid.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/ip_tunnel.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/ipip.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/ip_gre.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/udp_tunnel.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/ip_vti.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/ah4.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/esp4.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/xfrm4_tunnel.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv4/tunnel4.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/xfrm/xfrm_algo.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/xfrm/xfrm_user.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv6/ah6.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv6/esp6.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv6/xfrm6_tunnel.o WARNING: modpost: missing MODULE_DESCRIPTION() in net/ipv6/tunnel6.o This part5 of the patchset focus on the missing net/ module, which are now warning free. v1: https://lore.kernel.org/all/20240205101400.1480521-1-leitao@debian.org/ v2: https://lore.kernel.org/all/20240207101929.484681-1-leitao@debian.org/ ==================== Link: https://lore.kernel.org/r/20240208164244.3818498-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for dsa_loop_bdinfoBreno Leitao1-0/+1
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the DSA loopback fixed PHY module. Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20240208164244.3818498-10-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for ipvtapBreno Leitao1-0/+1
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the IP-VLAN based tap driver. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240208164244.3818498-9-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for net/schedBreno Leitao6-0/+6
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the network schedulers. Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240208164244.3818498-8-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for ipv4 modulesBreno Leitao9-0/+9
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the IPv4 modules. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240208164244.3818498-7-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for ipv6 modulesBreno Leitao7-0/+7
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the IPv6 modules. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240208164244.3818498-6-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for 6LoWPANBreno Leitao1-0/+1
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to IPv6 over Low power Wireless Personal Area Network. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Alexander Aring <aahringo@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240208164244.3818498-5-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for af_keyBreno Leitao1-0/+1
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the PF_KEY socket helpers. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240208164244.3818498-4-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for mpoaBreno Leitao1-0/+1
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the Multi-Protocol Over ATM (MPOA) driver. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240208164244.3818498-3-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10net: fill in MODULE_DESCRIPTION()s for xfrmBreno Leitao2-0/+2
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the XFRM interface drivers. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240208164244.3818498-2-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-10lan966x: Fix crash when adding interface under a lagHoratiu Vultur1-2/+7
There is a crash when adding one of the lan966x interfaces under a lag interface. The issue can be reproduced like this: ip link add name bond0 type bond miimon 100 mode balance-xor ip link set dev eth0 master bond0 The reason is because when adding a interface under the lag it would go through all the ports and try to figure out which other ports are under that lag interface. And the issue is that lan966x can have ports that are NULL pointer as they are not probed. So then iterating over these ports it would just crash as they are NULL pointers. The fix consists in actually checking for NULL pointers before accessing something from the ports. Like we do in other places. Fixes: cabc9d49333d ("net: lan966x: Add lag support for lan966x") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240206123054.3052966-1-horatiu.vultur@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09net/sched: act_mirred: Don't zero blockid when net device is being deletedVictor Nogueira1-2/+0
While testing tdc with parallel tests for mirred to block we caught an intermittent bug. The blockid was being zeroed out when a net device was deleted and, thus, giving us an incorrect blockid value whenever we tried to dump the mirred action. Since we don't increment the block refcount in the control path (and only use the ID), we don't need to zero the blockid field whenever a net device is going down. Fixes: 42f39036cda8 ("net/sched: act_mirred: Allow mirred to block") Signed-off-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240207222902.1469398-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09Merge branch 'net-openvswitch-limit-the-recursions-from-action-sets'Jakub Kicinski3-31/+102
Aaron Conole says: ==================== net: openvswitch: limit the recursions from action sets Open vSwitch module accepts actions as a list from the netlink socket and then creates a copy which it uses in the action set processing. During processing of the action list on a packet, the module keeps a count of the execution depth and exits processing if the action depth goes too high. However, during netlink processing the recursion depth isn't checked anywhere, and the copy trusts that kernel has large enough stack to accommodate it. The OVS sample action was the original action which could perform this kinds of recursion, and it originally checked that it didn't exceed the sample depth limit. However, when sample became optimized to provide the clone() semantics, the recursion limit was dropped. This series adds a depth limit during the __ovs_nla_copy_actions() call that will ensure we don't exceed the max that the OVS userspace could generate for a clone(). Additionally, this series provides a selftest in 2/2 that can be used to determine if the OVS module is allowing unbounded access. It can be safely omitted where the ovs selftest framework isn't available. ==================== Link: https://lore.kernel.org/r/20240207132416.1488485-1-aconole@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09selftests: openvswitch: Add validation for the recursion testAaron Conole2-15/+69
Add a test case into the netlink checks that will show the number of nested action recursions won't exceed 16. Going to 17 on a small clone call isn't enough to exhaust the stack on (most) systems, so it should be safe to run even on systems that don't have the fix applied. Signed-off-by: Aaron Conole <aconole@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240207132416.1488485-3-aconole@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09net: openvswitch: limit the number of recursions from action setsAaron Conole1-16/+33
The ovs module allows for some actions to recursively contain an action list for complex scenarios, such as sampling, checking lengths, etc. When these actions are copied into the internal flow table, they are evaluated to validate that such actions make sense, and these calls happen recursively. The ovs-vswitchd userspace won't emit more than 16 recursion levels deep. However, the module has no such limit and will happily accept limits larger than 16 levels nested. Prevent this by tracking the number of recursions happening and manually limiting it to 16 levels nested. The initial implementation of the sample action would track this depth and prevent more than 3 levels of recursion, but this was removed to support the clone use case, rather than limited at the current userspace limit. Fixes: 798c166173ff ("openvswitch: Optimize sample action for the clone use cases") Signed-off-by: Aaron Conole <aconole@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240207132416.1488485-2-aconole@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09Merge branch 'selftests-forwarding-various-fixes'Jakub Kicinski3-9/+17
Ido Schimmel says: ==================== selftests: forwarding: Various fixes Fix various problems in the forwarding selftests so that they will pass in the netdev CI instead of being ignored. See commit messages for details. ==================== Link: https://lore.kernel.org/r/20240208155529.1199729-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09selftests: forwarding: Fix bridge locked port test flakinessIdo Schimmel1-2/+2
The redirection test case fails in the netdev CI on debug kernels because an FDB entry is learned despite the presence of a tc filter that redirects incoming traffic [1]. I am unable to reproduce the failure locally, but I can see how it can happen given that learning is first enabled and only then the ingress tc filter is configured. On debug kernels the time window between these two operations is longer compared to regular kernels, allowing random packets to be transmitted and trigger learning. Fix by reversing the order and configure the ingress tc filter before enabling learning. [1] [...] # TEST: Locked port MAB redirect [FAIL] # Locked entry created for redirected traffic Fixes: 38c43a1ce758 ("selftests: forwarding: Add test case for traffic redirection from a locked port") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20240208155529.1199729-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09selftests: forwarding: Suppress grep warningsIdo Schimmel1-3/+3
Suppress the following grep warnings: [...] INFO: # Port group entries configuration tests - (*, G) TEST: Common port group entries configuration tests (IPv4 (*, G)) [ OK ] TEST: Common port group entries configuration tests (IPv6 (*, G)) [ OK ] grep: warning: stray \ before / grep: warning: stray \ before / grep: warning: stray \ before / TEST: IPv4 (*, G) port group entries configuration tests [ OK ] grep: warning: stray \ before / grep: warning: stray \ before / grep: warning: stray \ before / TEST: IPv6 (*, G) port group entries configuration tests [ OK ] [...] They do not fail the test, but do clutter the output. Fixes: b6d00da08610 ("selftests: forwarding: Add bridge MDB test") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20240208155529.1199729-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09selftests: forwarding: Fix bridge MDB test flakinessIdo Schimmel1-2/+6
After enabling a multicast querier on the bridge (like the test is doing), the bridge will wait for the Max Response Delay before starting to forward according to its MDB in order to let Membership Reports enough time to be received and processed. Currently, the test is waiting for exactly the default Max Response Delay (10 seconds) which is racy and leads to failures [1]. Fix by reducing the Max Response Delay to 1 second. [1] [...] # TEST: IPv4 host entries forwarding tests [FAIL] # Packet locally received after flood Fixes: b6d00da08610 ("selftests: forwarding: Add bridge MDB test") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20240208155529.1199729-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09selftests: forwarding: Fix layer 2 miss test flakinessIdo Schimmel1-2/+6
After enabling a multicast querier on the bridge (like the test is doing), the bridge will wait for the Max Response Delay before starting to forward according to its MDB in order to let Membership Reports enough time to be received and processed. Currently, the test is waiting for exactly the default Max Response Delay (10 seconds) which is racy and leads to failures [1]. Fix by reducing the Max Response Delay to 1 second. [1] [...] # TEST: L2 miss - Multicast (IPv4) [FAIL] # Unregistered multicast filter was hit after adding MDB entry Fixes: 8c33266ae26a ("selftests: forwarding: Add layer 2 miss test cases") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20240208155529.1199729-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09selftests: net: Fix bridge backup port test flakinessIdo Schimmel1-0/+23
The test toggles the carrier of a bridge port in order to test the bridge backup port feature. Due to the linkwatch delayed work the carrier change is not always reflected fast enough to the bridge driver and packets are not forwarded as the test expects, resulting in failures [1]. Fix by busy waiting on the bridge port state until it changes to the desired state following the carrier change. [1] # Backup port # ----------- [...] # TEST: swp1 carrier off [ OK ] # TEST: No forwarding out of swp1 [FAIL] [ 641.995910] br0: port 1(swp1) entered disabled state # TEST: No forwarding out of vx0 [ OK ] Fixes: b408453053fb ("selftests: net: Add bridge backup port and backup nexthop ID test") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20240208123110.1063930-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09selftests: net: add more missing kernel configPaolo Abeni1-1/+5
The reuseport_addr_any.sh is currently skipping DCCP tests and pmtu.sh is skipping all the FOU/GUE related cases: add the missing options. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/38d3ca7f909736c1aef56e6244d67c82a9bba6ff.1707326987.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09devlink: Fix command annotation documentationParav Pandit1-1/+1
Command example string is not read as command. Fix command annotation. Fixes: a8ce7b26a51e ("devlink: Expose port function commands to control migratable") Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240206161717.466653-1-parav@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09bonding: do not report NETDEV_XDP_ACT_XSK_ZEROCOPYMagnus Karlsson1-3/+2
Do not report the XDP capability NETDEV_XDP_ACT_XSK_ZEROCOPY as the bonding driver does not support XDP and AF_XDP in zero-copy mode even if the real NIC drivers do. Note that the driver used to report everything as supported before a device was bonded. Instead of just masking out the zero-copy support from this, have the driver report that no XDP feature is supported until a real device is bonded. This seems to be more truthful as it is the real drivers that decide what XDP features are supported. Fixes: cb9e6e584d58 ("bonding: add xdp_features support") Reported-by: Prashant Batra <prbatra.mail@gmail.com> Link: https://lore.kernel.org/all/CAJ8uoz2ieZCopgqTvQ9ZY6xQgTbujmC6XkMTamhp68O-h_-rLg@mail.gmail.com/T/ Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20240207084737.20890-1-magnus.karlsson@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09net/handshake: Fix handshake_req_destroy_test1Chuck Lever1-1/+4
Recently, handshake_req_destroy_test1 started failing: Expected handshake_req_destroy_test == req, but handshake_req_destroy_test == 0000000000000000 req == 0000000060f99b40 not ok 11 req_destroy works This is because "sock_release(sock)" was replaced with "fput(filp)" to address a memory leak. Note that sock_release() is synchronous but fput() usually delays the final close and clean-up. The delay is not consequential in the other cases that were changed but handshake_req_destroy_test1 is testing that handshake_req_cancel() followed by closing the file actually does call the ->hp_destroy method. Thus the PTR_EQ test at the end has to be sure that the final close is complete before it checks the pointer. We cannot use a completion here because if ->hp_destroy is never called (ie, there is an API bug) then the test will hang. Reported by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/netdev/ZcKDd1to4MPANCrn@tissot.1015granger.net/T/#mac5c6299f86799f1c71776f3a07f9c566c7c3c40 Fixes: 4a0f07d71b04 ("net/handshake: Fix memory leak in __sock_create() and sock_alloc_file()") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/170724699027.91401.7839730697326806733.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09net/mlx5: DPLL, Fix possible use after free after delayed work timer triggersJiri Pirko1-1/+1
I managed to hit following use after free warning recently: [ 2169.711665] ================================================================== [ 2169.714009] BUG: KASAN: slab-use-after-free in __run_timers.part.0+0x179/0x4c0 [ 2169.716293] Write of size 8 at addr ffff88812b326a70 by task swapper/4/0 [ 2169.719022] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.8.0-rc2jiri+ #2 [ 2169.720974] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 2169.722457] Call Trace: [ 2169.722756] <IRQ> [ 2169.723024] dump_stack_lvl+0x58/0xb0 [ 2169.723417] print_report+0xc5/0x630 [ 2169.723807] ? __virt_addr_valid+0x126/0x2b0 [ 2169.724268] kasan_report+0xbe/0xf0 [ 2169.724667] ? __run_timers.part.0+0x179/0x4c0 [ 2169.725116] ? __run_timers.part.0+0x179/0x4c0 [ 2169.725570] __run_timers.part.0+0x179/0x4c0 [ 2169.726003] ? call_timer_fn+0x320/0x320 [ 2169.726404] ? lock_downgrade+0x3a0/0x3a0 [ 2169.726820] ? kvm_clock_get_cycles+0x14/0x20 [ 2169.727257] ? ktime_get+0x92/0x150 [ 2169.727630] ? lapic_next_deadline+0x35/0x60 [ 2169.728069] run_timer_softirq+0x40/0x80 [ 2169.728475] __do_softirq+0x1a1/0x509 [ 2169.728866] irq_exit_rcu+0x95/0xc0 [ 2169.729241] sysvec_apic_timer_interrupt+0x6b/0x80 [ 2169.729718] </IRQ> [ 2169.729993] <TASK> [ 2169.730259] asm_sysvec_apic_timer_interrupt+0x16/0x20 [ 2169.730755] RIP: 0010:default_idle+0x13/0x20 [ 2169.731190] Code: c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff ff ff cc cc cc cc 8b 05 9a 7f 1f 02 85 c0 7e 07 0f 00 2d cf 69 43 00 fb f4 <fa> c3 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 c0 93 04 00 [ 2169.732759] RSP: 0018:ffff888100dbfe10 EFLAGS: 00000242 [ 2169.733264] RAX: 0000000000000001 RBX: ffff888100d9c200 RCX: ffffffff8241bd62 [ 2169.733925] RDX: ffffed109a848b15 RSI: 0000000000000004 RDI: ffffffff8127ac55 [ 2169.734566] RBP: 0000000000000004 R08: 0000000000000000 R09: ffffed109a848b14 [ 2169.735200] R10: ffff8884d42458a3 R11: 000000000000ba7e R12: ffffffff83d7d3a0 [ 2169.735835] R13: 1ffff110201b7fc6 R14: 0000000000000000 R15: ffff888100d9c200 [ 2169.736478] ? ct_kernel_exit.constprop.0+0xa2/0xc0 [ 2169.736954] ? do_idle+0x285/0x290 [ 2169.737323] default_idle_call+0x63/0x90 [ 2169.737730] do_idle+0x285/0x290 [ 2169.738089] ? arch_cpu_idle_exit+0x30/0x30 [ 2169.738511] ? mark_held_locks+0x1a/0x80 [ 2169.738917] ? lockdep_hardirqs_on_prepare+0x12e/0x200 [ 2169.739417] cpu_startup_entry+0x30/0x40 [ 2169.739825] start_secondary+0x19a/0x1c0 [ 2169.740229] ? set_cpu_sibling_map+0xbd0/0xbd0 [ 2169.740673] secondary_startup_64_no_verify+0x15d/0x16b [ 2169.741179] </TASK> [ 2169.741686] Allocated by task 1098: [ 2169.742058] kasan_save_stack+0x1c/0x40 [ 2169.742456] kasan_save_track+0x10/0x30 [ 2169.742852] __kasan_kmalloc+0x83/0x90 [ 2169.743246] mlx5_dpll_probe+0xf5/0x3c0 [mlx5_dpll] [ 2169.743730] auxiliary_bus_probe+0x62/0xb0 [ 2169.744148] really_probe+0x127/0x590 [ 2169.744534] __driver_probe_device+0xd2/0x200 [ 2169.744973] device_driver_attach+0x6b/0xf0 [ 2169.745402] bind_store+0x90/0xe0 [ 2169.745761] kernfs_fop_write_iter+0x1df/0x2a0 [ 2169.746210] vfs_write+0x41f/0x790 [ 2169.746579] ksys_write+0xc7/0x160 [ 2169.746947] do_syscall_64+0x6f/0x140 [ 2169.747333] entry_SYSCALL_64_after_hwframe+0x46/0x4e [ 2169.748049] Freed by task 1220: [ 2169.748393] kasan_save_stack+0x1c/0x40 [ 2169.748789] kasan_save_track+0x10/0x30 [ 2169.749188] kasan_save_free_info+0x3b/0x50 [ 2169.749621] poison_slab_object+0x106/0x180 [ 2169.750044] __kasan_slab_free+0x14/0x50 [ 2169.750451] kfree+0x118/0x330 [ 2169.750792] mlx5_dpll_remove+0xf5/0x110 [mlx5_dpll] [ 2169.751271] auxiliary_bus_remove+0x2e/0x40 [ 2169.751694] device_release_driver_internal+0x24b/0x2e0 [ 2169.752191] unbind_store+0xa6/0xb0 [ 2169.752563] kernfs_fop_write_iter+0x1df/0x2a0 [ 2169.753004] vfs_write+0x41f/0x790 [ 2169.753381] ksys_write+0xc7/0x160 [ 2169.753750] do_syscall_64+0x6f/0x140 [ 2169.754132] entry_SYSCALL_64_after_hwframe+0x46/0x4e [ 2169.754847] Last potentially related work creation: [ 2169.755315] kasan_save_stack+0x1c/0x40 [ 2169.755709] __kasan_record_aux_stack+0x9b/0xf0 [ 2169.756165] __queue_work+0x382/0x8f0 [ 2169.756552] call_timer_fn+0x126/0x320 [ 2169.756941] __run_timers.part.0+0x2ea/0x4c0 [ 2169.757376] run_timer_softirq+0x40/0x80 [ 2169.757782] __do_softirq+0x1a1/0x509 [ 2169.758387] Second to last potentially related work creation: [ 2169.758924] kasan_save_stack+0x1c/0x40 [ 2169.759322] __kasan_record_aux_stack+0x9b/0xf0 [ 2169.759773] __queue_work+0x382/0x8f0 [ 2169.760156] call_timer_fn+0x126/0x320 [ 2169.760550] __run_timers.part.0+0x2ea/0x4c0 [ 2169.760978] run_timer_softirq+0x40/0x80 [ 2169.761381] __do_softirq+0x1a1/0x509 [ 2169.761998] The buggy address belongs to the object at ffff88812b326a00 which belongs to the cache kmalloc-256 of size 256 [ 2169.763061] The buggy address is located 112 bytes inside of freed 256-byte region [ffff88812b326a00, ffff88812b326b00) [ 2169.764346] The buggy address belongs to the physical page: [ 2169.764866] page:000000000f2b1e89 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12b324 [ 2169.765731] head:000000000f2b1e89 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 2169.766484] anon flags: 0x200000000000840(slab|head|node=0|zone=2) [ 2169.767048] page_type: 0xffffffff() [ 2169.767422] raw: 0200000000000840 ffff888100042b40 0000000000000000 dead000000000001 [ 2169.768183] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 [ 2169.768899] page dumped because: kasan: bad access detected [ 2169.769649] Memory state around the buggy address: [ 2169.770116] ffff88812b326900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 2169.770805] ffff88812b326980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 2169.771485] >ffff88812b326a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2169.772173] ^ [ 2169.772787] ffff88812b326a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 2169.773477] ffff88812b326b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 2169.774160] ================================================================== [ 2169.774845] ================================================================== I didn't manage to reproduce it. Though the issue seems to be obvious. There is a chance that the mlx5_dpll_remove() calls cancel_delayed_work() when the work runs and manages to re-arm itself. In that case, after delay timer triggers next attempt to queue it, it works with freed memory. Fix this by using cancel_delayed_work_sync() instead which makes sure that work is done when it returns. Fixes: 496fd0a26bbf ("mlx5: Implement SyncE support using DPLL infrastructure") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240206164328.360313-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09dpll: fix possible deadlock during netlink dump operationJiri Pirko4-24/+6
Recently, I've been hitting following deadlock warning during dpll pin dump: [52804.637962] ====================================================== [52804.638536] WARNING: possible circular locking dependency detected [52804.639111] 6.8.0-rc2jiri+ #1 Not tainted [52804.639529] ------------------------------------------------------ [52804.640104] python3/2984 is trying to acquire lock: [52804.640581] ffff88810e642678 (nlk_cb_mutex-GENERIC){+.+.}-{3:3}, at: netlink_dump+0xb3/0x780 [52804.641417] but task is already holding lock: [52804.642010] ffffffff83bde4c8 (dpll_lock){+.+.}-{3:3}, at: dpll_lock_dumpit+0x13/0x20 [52804.642747] which lock already depends on the new lock. [52804.643551] the existing dependency chain (in reverse order) is: [52804.644259] -> #1 (dpll_lock){+.+.}-{3:3}: [52804.644836] lock_acquire+0x174/0x3e0 [52804.645271] __mutex_lock+0x119/0x1150 [52804.645723] dpll_lock_dumpit+0x13/0x20 [52804.646169] genl_start+0x266/0x320 [52804.646578] __netlink_dump_start+0x321/0x450 [52804.647056] genl_family_rcv_msg_dumpit+0x155/0x1e0 [52804.647575] genl_rcv_msg+0x1ed/0x3b0 [52804.648001] netlink_rcv_skb+0xdc/0x210 [52804.648440] genl_rcv+0x24/0x40 [52804.648831] netlink_unicast+0x2f1/0x490 [52804.649290] netlink_sendmsg+0x36d/0x660 [52804.649742] __sock_sendmsg+0x73/0xc0 [52804.650165] __sys_sendto+0x184/0x210 [52804.650597] __x64_sys_sendto+0x72/0x80 [52804.651045] do_syscall_64+0x6f/0x140 [52804.651474] entry_SYSCALL_64_after_hwframe+0x46/0x4e [52804.652001] -> #0 (nlk_cb_mutex-GENERIC){+.+.}-{3:3}: [52804.652650] check_prev_add+0x1ae/0x1280 [52804.653107] __lock_acquire+0x1ed3/0x29a0 [52804.653559] lock_acquire+0x174/0x3e0 [52804.653984] __mutex_lock+0x119/0x1150 [52804.654423] netlink_dump+0xb3/0x780 [52804.654845] __netlink_dump_start+0x389/0x450 [52804.655321] genl_family_rcv_msg_dumpit+0x155/0x1e0 [52804.655842] genl_rcv_msg+0x1ed/0x3b0 [52804.656272] netlink_rcv_skb+0xdc/0x210 [52804.656721] genl_rcv+0x24/0x40 [52804.657119] netlink_unicast+0x2f1/0x490 [52804.657570] netlink_sendmsg+0x36d/0x660 [52804.658022] __sock_sendmsg+0x73/0xc0 [52804.658450] __sys_sendto+0x184/0x210 [52804.658877] __x64_sys_sendto+0x72/0x80 [52804.659322] do_syscall_64+0x6f/0x140 [52804.659752] entry_SYSCALL_64_after_hwframe+0x46/0x4e [52804.660281] other info that might help us debug this: [52804.661077] Possible unsafe locking scenario: [52804.661671] CPU0 CPU1 [52804.662129] ---- ---- [52804.662577] lock(dpll_lock); [52804.662924] lock(nlk_cb_mutex-GENERIC); [52804.663538] lock(dpll_lock); [52804.664073] lock(nlk_cb_mutex-GENERIC); [52804.664490] The issue as follows: __netlink_dump_start() calls control->start(cb) with nlk->cb_mutex held. In control->start(cb) the dpll_lock is taken. Then nlk->cb_mutex is released and taken again in netlink_dump(), while dpll_lock still being held. That leads to ABBA deadlock when another CPU races with the same operation. Fix this by moving dpll_lock taking into dumpit() callback which ensures correct lock taking order. Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Link: https://lore.kernel.org/r/20240207115902.371649-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09Merge tag 'net-6.8-rc4' of ↵Linus Torvalds130-481/+1104
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from WiFi and netfilter. Current release - regressions: - nic: intel: fix old compiler regressions - netfilter: ipset: missing gc cancellations fixed Current release - new code bugs: - netfilter: ctnetlink: fix filtering for zone 0 Previous releases - regressions: - core: fix from address in memcpy_to_iter_csum() - netfilter: nfnetlink_queue: un-break NF_REPEAT - af_unix: fix memory leak for dead unix_(sk)->oob_skb in GC. - devlink: avoid potential loop in devlink_rel_nested_in_notify_work() - iwlwifi: - mvm: fix a battery life regression - fix double-free bug - mac80211: fix waiting for beacons logic - nic: nfp: flower: prevent re-adding mac index for bonded port Previous releases - always broken: - rxrpc: fix generation of serial numbers to skip zero - tipc: check the bearer type before calling tipc_udp_nl_bearer_add() - tunnels: fix out of bounds access when building IPv6 PMTU error - nic: hv_netvsc: register VF in netvsc_probe if NET_DEVICE_REGISTER missed - nic: atlantic: fix DMA mapping for PTP hwts ring Misc: - selftests: more fixes to deal with very slow hosts" * tag 'net-6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (80 commits) netfilter: nft_set_pipapo: remove scratch_aligned pointer netfilter: nft_set_pipapo: add helper to release pcpu scratch area netfilter: nft_set_pipapo: store index in scratch maps netfilter: nft_set_rbtree: skip end interval element from gc netfilter: nfnetlink_queue: un-break NF_REPEAT netfilter: nf_tables: use timestamp to check for set element timeout netfilter: nft_ct: reject direction for ct id netfilter: ctnetlink: fix filtering for zone 0 s390/qeth: Fix potential loss of L3-IP@ in case of network issues netfilter: ipset: Missing gc cancellations fixed octeontx2-af: Initialize maps. net: ethernet: ti: cpsw: enable mac_managed_pm to fix mdio net: ethernet: ti: cpsw_new: enable mac_managed_pm to fix mdio netfilter: nft_set_pipapo: remove static in nft_pipapo_get() netfilter: nft_compat: restrict match/target protocol to u16 netfilter: nft_compat: reject unused compat flag netfilter: nft_compat: narrow down revision to unsigned 8-bits net: intel: fix old compiler regressions MAINTAINERS: Maintainer change for rds selftests: cmsg_ipv6: repeat the exact packet ...