Age | Commit message (Collapse) | Author | Files | Lines |
|
The napi_suspend_irqs routine bootstraps irq suspension by elongating
the defer timeout to irq_suspend_timeout.
The napi_resume_irqs routine effectively cancels irq suspension by
forcing the napi to be scheduled immediately.
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a per-NAPI IRQ suspension parameter, which can be get/set with
netdev-genl.
This patch doesn't change any behavior but prepares the code for other
changes in the following commits which use irq_suspend_timeout as a
timeout for IRQ suspension.
Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Tested-by: Joe Damato <jdamato@fastly.com>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Link: https://patch.msgid.link/20241109050245.191288-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Exit early if we're freeing more than 1024 frags, to prevent
looping too long.
Also minor code cleanups:
- Flip checks to reduce indentation.
- Use sizeof(*tokens) everywhere for consistentcy.
Cc: Yi Lai <yi1.lai@linux.intel.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20241107210331.3044434-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
RTNL_FLAG_DOIT_PERNET_WIP.
Currently, rtnl_setlink() and rtnl_dellink() cannot be fully converted
to per-netns RTNL due to a lack of handling peer/lower/upper devices in
different netns.
For example, when we change a device in rtnl_setlink() and need to
propagate that to its upper devices, we want to avoid acquiring all netns
locks, for which we do not know the upper limit.
The same situation happens when we remove a device.
rtnl_dellink() could be transformed to remove a single device in the
requested netns and delegate other devices to per-netns work, and
rtnl_setlink() might be ?
Until we come up with a better idea, let's use a new flag
RTNL_FLAG_DOIT_PERNET_WIP for rtnl_dellink() and rtnl_setlink().
This will unblock converting RTNL users where such devices are not related.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241108004823.29419-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now, we are ready to convert rtnl_newlink() to per-netns RTNL;
rtnl_link_ops is protected by SRCU and netns is prefetched in
rtnl_newlink().
Let's register rtnl_newlink() with RTNL_FLAG_DOIT_PERNET and
push RTNL down as rtnl_nets_lock().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In ops->newlink(), veth, vxcan, and netkit call rtnl_link_get_net() with
a net pointer, which is the first argument of ->newlink().
rtnl_link_get_net() could return another netns based on IFLA_NET_NS_PID
and IFLA_NET_NS_FD in the peer device's attributes.
We want to get it and fill rtnl_nets->nets[] in advance in rtnl_newlink()
for per-netns RTNL.
All of the three get the peer netns in the same way:
1. Call rtnl_nla_parse_ifinfomsg()
2. Call ops->validate() (vxcan doesn't have)
3. Call rtnl_link_get_net_tb()
Let's add a new field peer_type to struct rtnl_link_ops and prefetch
netns in the peer ifla to add it to rtnl_nets in rtnl_newlink().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rtnl_newlink() needs to hold 3 per-netns RTNL: 2 for a new device
and 1 for its peer.
We will add rtnl_nets_lock() later, which performs the nested locking
based on struct rtnl_nets, which has an array of struct net pointers.
rtnl_nets_add() adds a net pointer to the array and sorts it so that
rtnl_nets_lock() can simply acquire per-netns RTNL from array[0] to [2].
Before calling rtnl_nets_add(), get_net() must be called for the net,
and rtnl_nets_destroy() will call put_net() for each.
Let's apply the helpers to rtnl_newlink().
When CONFIG_DEBUG_NET_SMALL_RTNL is disabled, we do not call
rtnl_net_lock() thus do not care about the array order, so
rtnl_net_cmp_locks() returns -1 so that the loop in rtnl_nets_add()
can be optimised to NOP.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241108004823.29419-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
link_ops is protected by link_ops_mutex and no longer needs RTNL,
so we have no reason to have __rtnl_link_register() separately.
Let's remove it and call rtnl_link_register() from ifb.ko and
dummy.ko.
Note that both modules' init() work on init_net only, so we need
not export pernet_ops_rwsem and can use rtnl_net_lock() there.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rtnl_link_unregister() holds RTNL and calls synchronize_srcu(),
but rtnl_newlink() will acquire SRCU frist and then RTNL.
Then, we need to unlink ops and call synchronize_srcu() outside
of RTNL to avoid the deadlock.
rtnl_link_unregister() rtnl_newlink()
---- ----
lock(rtnl_mutex);
lock(&ops->srcu);
lock(rtnl_mutex);
sync(&ops->srcu);
Let's move as such and add a mutex to protect link_ops.
Now, link_ops is protected by its dedicated mutex and
rtnl_link_register() no longer needs to hold RTNL.
While at it, we move the initialisation of ops->dellink and
ops->srcu out of the mutex scope.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rtnl_link_unregister() holds RTNL and calls __rtnl_link_unregister(),
where we call synchronize_srcu() to wait inflight RTM_NEWLINK requests
for per-netns RTNL.
We put synchronize_srcu() in __rtnl_link_unregister() due to ifb.ko
and dummy.ko.
However, rtnl_newlink() will acquire SRCU before RTNL later in this
series. Then, lockdep will detect the deadlock:
rtnl_link_unregister() rtnl_newlink()
---- ----
lock(rtnl_mutex);
lock(&ops->srcu);
lock(rtnl_mutex);
sync(&ops->srcu);
To avoid the problem, we must call synchronize_srcu() before RTNL in
rtnl_link_unregister().
As a preparation, let's remove __rtnl_link_unregister().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241108004823.29419-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use appropriate frag_page API instead of caller accessing
'page_frag_cache' directly.
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20241028115343.3405838-5-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Most of the original conversion is from the spatch below,
but I edited some and left out other instances that were
either buggy after conversion (where default values don't
fit into the type) or just looked strange.
@@
expression attr, def;
expression val;
identifier fn =~ "^nla_get_.*";
fresh identifier dfn = fn ## "_default";
@@
(
-if (attr)
- val = fn(attr);
-else
- val = def;
+val = dfn(attr, def);
|
-if (!attr)
- val = def;
-else
- val = fn(attr);
+val = dfn(attr, def);
|
-if (!attr)
- return def;
-return fn(attr);
+return dfn(attr, def);
|
-attr ? fn(attr) : def
+dfn(attr, def)
|
-!attr ? def : fn(attr)
+dfn(attr, def)
)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Create a mapping between a netdev and its neighoburs,
allowing for much cheaper flushes.
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241107160444.2913124-7-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the now-unused neighbour::next pointer, leaving struct neighbour
solely with the hlist_node implementation.
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-6-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove all usage of the bare neighbour::next pointer,
replacing them with neighbour::hash and its for_each macro.
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-5-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert seq_file-related neighbour functionality to use neighbour::hash
and the related for_each macro.
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-4-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a doubly-linked node to neighbours, so that they
can be deleted without iterating the entire bucket they're in.
Signed-off-by: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241107160444.2913124-2-gnaaman@drivenets.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the bpf_out_neigh_v6 function, rcu_read_lock() is used to begin an RCU
read-side critical section. However, when unlocking, one branch
incorrectly uses a different RCU unlock flavour rcu_read_unlock_bh()
instead of rcu_read_unlock(). This mismatch in RCU locking flavours can
lead to unexpected behavior and potential concurrency issues.
This possible bug was identified using a static analysis tool developed
by myself, specifically designed to detect RCU-related issues.
This patch corrects the mismatched unlock flavour by replacing the
incorrect rcu_read_unlock_bh() with the appropriate rcu_read_unlock(),
ensuring that the RCU critical section is properly exited. This change
prevents potential synchronization issues and aligns with proper RCU
usage patterns.
Fixes: 09eed1192cec ("neighbour: switch to standard rcu, instead of rcu_bh")
Signed-off-by: Jiawei Ye <jiawei.ye@foxmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/tencent_CFD3D1C3D68B45EA9F52D8EC76D2C4134306@qq.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack()
to keep the naming convention consistent.
Convert the usage site over to it. The conversion was done with Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/c4b40b8fef250b6a325e1b8bd6057005fb3cb660.1730386209.git.namcao@linutronix.de
|
|
Found in the test_txmsg_pull in test_sockmap,
```
txmsg_cork = 512; // corking is importrant here
opt->iov_length = 3;
opt->iov_count = 1;
opt->rate = 512; // sendmsg will be invoked 512 times
```
The first sendmsg will send an sk_msg with size 3, and bpf_msg_pull_data
will be invoked the first time. sk_msg_reset_curr will reset the copybreak
from 3 to 0. In the second sendmsg, since we are in the stage of corking,
psock->cork will be reused in func sk_msg_alloc. msg->sg.copybreak is 0
now, the second msg will overwrite the first msg. As a result, we could
not pass the data integrity test.
The same problem happens in push and pop test. Thus, fix sk_msg_reset_curr
to restore the correct copybreak.
Fixes: bb9aefde5bba ("bpf: sockmap, updating the sg structure should also update curr")
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Link: https://lore.kernel.org/r/20241106222520.527076-9-zijianzhang@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Several fixes to bpf_msg_pop_data,
1. In sk_msg_shift_left, we should put_page
2. if (len == 0), return early is better
3. pop the entire sk_msg (last == msg->sg.size) should be supported
4. Fix for the value of variable "a"
5. In sk_msg_shift_left, after shifting, i has already pointed to the next
element. Addtional sk_msg_iter_var_next may result in BUG.
Fixes: 7246d8ed4dcc ("bpf: helper to pop data from messages")
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20241106222520.527076-8-zijianzhang@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Several fixes to bpf_msg_push_data,
1. test_sockmap has tests where bpf_msg_push_data is invoked to push some
data at the end of a message, but -EINVAL is returned. In this case, in
bpf_msg_push_data, after the first loop, i will be set to msg->sg.end, add
the logic to handle it.
2. In the code block of "if (start - offset)", it's possible that "i"
points to the last of sk_msg_elem. In this case, "sk_msg_iter_next(msg,
end)" might still be called twice, another invoking is in "if (!copy)"
code block, but actually only one is needed. Add the logic to handle it,
and reconstruct the code to make the logic more clear.
Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com>
Link: https://lore.kernel.org/r/20241106222520.527076-7-zijianzhang@bytedance.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-10-31
We've added 13 non-merge commits during the last 16 day(s) which contain
a total of 16 files changed, 710 insertions(+), 668 deletions(-).
The main changes are:
1) Optimize and homogenize bpf_csum_diff helper for all archs and also
add a batch of new BPF selftests for it, from Puranjay Mohan.
2) Rewrite and migrate the test_tcp_check_syncookie.sh BPF selftest
into test_progs so that it can be run in BPF CI, from Alexis Lothoré.
3) Two BPF sockmap selftest fixes, from Zijian Zhang.
4) Small XDP synproxy BPF selftest cleanup to remove IP_DF check,
from Vincent Li.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Add a selftest for bpf_csum_diff()
selftests/bpf: Don't mask result of bpf_csum_diff() in test_verifier
bpf: bpf_csum_diff: Optimize and homogenize for all archs
net: checksum: Move from32to16() to generic header
selftests/bpf: remove xdp_synproxy IP_DF check
selftests/bpf: remove test_tcp_check_syncookie
selftests/bpf: test MSS value returned with bpf_tcp_gen_syncookie
selftests/bpf: add ipv4 and dual ipv4/ipv6 support in btf_skc_cls_ingress
selftests/bpf: get rid of global vars in btf_skc_cls_ingress
selftests/bpf: add missing ns cleanups in btf_skc_cls_ingress
selftests/bpf: factorize conn and syncookies tests in a single runner
selftests/bpf: Fix txmsg_redir of test_txmsg_pull in test_sockmap
selftests/bpf: Fix msg_verify_data in test_sockmap
====================
Link: https://patch.msgid.link/20241031221543.108853-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fdget() is the first thing done in scope, all matching fdput() are
immediately followed by leaving the scope.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Cross-merge networking fixes after downstream PR (net-6.12-rc6).
Conflicts:
drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c
cbe84e9ad5e2 ("wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd")
188a1bf89432 ("wifi: mac80211: re-order assigning channel in activate links")
https://lore.kernel.org/all/20241028123621.7bbb131b@canb.auug.org.au/
net/mac80211/cfg.c
c4382d5ca1af ("wifi: mac80211: update the right link for tx power")
8dd0498983ee ("wifi: mac80211: Fix setting txpower with emulate_chanctx")
drivers/net/ethernet/intel/ice/ice_ptp_hw.h
6e58c3310622 ("ice: fix crash on probe for DPLL enabled E810 LOM")
e4291b64e118 ("ice: Align E810T GPIO to other products")
ebb2693f8fbd ("ice: Read SDP section from NVM for pin definitions")
ac532f4f4251 ("ice: Cleanup unused declarations")
https://lore.kernel.org/all/20241030120524.1ee1af18@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull bpf fixes from Daniel Borkmann:
- Fix BPF verifier to force a checkpoint when the program's jump
history becomes too long (Eduard Zingerman)
- Add several fixes to the BPF bits iterator addressing issues like
memory leaks and overflow problems (Hou Tao)
- Fix an out-of-bounds write in trie_get_next_key (Byeonguk Jeong)
- Fix BPF test infra's LIVE_FRAME frame update after a page has been
recycled (Toke Høiland-Jørgensen)
- Fix BPF verifier and undo the 40-bytes extra stack space for
bpf_fastcall patterns due to various bugs (Eduard Zingerman)
- Fix a BPF sockmap race condition which could trigger a NULL pointer
dereference in sock_map_link_update_prog (Cong Wang)
- Fix tcp_bpf_recvmsg_parser to retrieve seq_copied from tcp_sk under
the socket lock (Jiayuan Chen)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled
selftests/bpf: Add three test cases for bits_iter
bpf: Use __u64 to save the bits in bits iterator
bpf: Check the validity of nr_words in bpf_iter_bits_new()
bpf: Add bpf_mem_alloc_check_size() helper
bpf: Free dynamically allocated bits in bpf_iter_bits_destroy()
bpf: disallow 40-bytes extra stack for bpf_fastcall patterns
selftests/bpf: Add test for trie_get_next_key()
bpf: Fix out-of-bounds write in trie_get_next_key()
selftests/bpf: Test with a very short loop
bpf: Force checkpoint when jmp history is too long
bpf: fix filed access without lock
sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from WiFi, bluetooth and netfilter.
No known new regressions outstanding.
Current release - regressions:
- wifi: mt76: do not increase mcu skb refcount if retry is not
supported
Current release - new code bugs:
- wifi:
- rtw88: fix the RX aggregation in USB 3 mode
- mac80211: fix memory corruption bug in struct ieee80211_chanctx
Previous releases - regressions:
- sched:
- stop qdisc_tree_reduce_backlog on TC_H_ROOT
- sch_api: fix xa_insert() error path in tcf_block_get_ext()
- wifi:
- revert "wifi: iwlwifi: remove retry loops in start"
- cfg80211: clear wdev->cqm_config pointer on free
- netfilter: fix potential crash in nf_send_reset6()
- ip_tunnel: fix suspicious RCU usage warning in ip_tunnel_find()
- bluetooth: fix null-ptr-deref in hci_read_supported_codecs
- eth: mlxsw: add missing verification before pushing Tx header
- eth: hns3: fixed hclge_fetch_pf_reg accesses bar space out of
bounds issue
Previous releases - always broken:
- wifi: mac80211: do not pass a stopped vif to the driver in
.get_txpower
- netfilter: sanitize offset and length before calling skb_checksum()
- core:
- fix crash when config small gso_max_size/gso_ipv4_max_size
- skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
- mptcp: protect sched with rcu_read_lock
- eth: ice: fix crash on probe for DPLL enabled E810 LOM
- eth: macsec: fix use-after-free while sending the offloading packet
- eth: stmmac: fix unbalanced DMA map/unmap for non-paged SKB data
- eth: hns3: fix kernel crash when 1588 is sent on HIP08 devices
- eth: mtk_wed: fix path of MT7988 WO firmware"
* tag 'net-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
net: hns3: fix kernel crash when 1588 is sent on HIP08 devices
net: hns3: fixed hclge_fetch_pf_reg accesses bar space out of bounds issue
net: hns3: initialize reset_timer before hclgevf_misc_irq_init()
net: hns3: don't auto enable misc vector
net: hns3: Resolved the issue that the debugfs query result is inconsistent.
net: hns3: fix missing features due to dev->features configuration too early
net: hns3: fixed reset failure issues caused by the incorrect reset type
net: hns3: add sync command to sync io-pgtable
net: hns3: default enable tx bounce buffer when smmu enabled
netfilter: nft_payload: sanitize offset and length before calling skb_checksum()
net: ethernet: mtk_wed: fix path of MT7988 WO firmware
selftests: forwarding: Add IPv6 GRE remote change tests
mlxsw: spectrum_ipip: Fix memory leak when changing remote IPv6 address
mlxsw: pci: Sync Rx buffers for device
mlxsw: pci: Sync Rx buffers for CPU
mlxsw: spectrum_ptp: Add missing verification before pushing Tx header
net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
Bluetooth: hci: fix null-ptr-deref in hci_read_supported_codecs
netfilter: nf_reject_ipv6: fix potential crash in nf_send_reset6()
netfilter: Fix use-after-free in get_info()
...
|
|
When some code has been moved in the commit in Fixes, some "return err;"
have correctly been changed in goto <some_where_in_the_error_handling_path>
but this one was missed.
Should "ops->maxtype > RTNL_MAX_TYPE" happen, then some resources would
leak.
Go through the error handling path to fix these leaks.
Fixes: 0d3008d1a9ae ("rtnetlink: Move ops->validate to rtnl_newlink().")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/eca90eeb4d9e9a0545772b68aeaab883d9fe2279.1729952228.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As documented in skbuff.h, devices with NETIF_F_IPV6_CSUM capability
can only checksum TCP and UDP over IPv6 if the IP header does not
contains extension.
This is enforced for UDP packets emitted from user-space to an IPv6
address as they go through ip6_make_skb(), which calls
__ip6_append_data() where a check is done on the header size before
setting CHECKSUM_PARTIAL.
But the introduction of UDP encapsulation with fou6 added a code-path
where it is possible to get an skb with a partial UDP checksum and an
IPv6 header with extension:
* fou6 adds a UDP header with a partial checksum if the inner packet
does not contains a valid checksum.
* ip6_tunnel adds an IPv6 header with a destination option extension
header if encap_limit is non-zero (the default value is 4).
The thread linked below describes in more details how to reproduce the
problem with GRE-in-UDP tunnel.
Add a check on the network header size in skb_csum_hwoffload_help() to
make sure no IPv6 packet with extension header is handed to a network
device with NETIF_F_IPV6_CSUM capability.
Link: https://lore.kernel.org/netdev/26548921.1r3eYUQgxm@benoit.monin/T/#u
Fixes: aa3463d65e7b ("fou: Add encap ops for IPv6 tunnels")
Signed-off-by: Benoît Monin <benoit.monin@gmx.fr>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/5fbeecfc311ea182aa1d1c771725ab8b4cac515e.1729778144.git.benoit.monin@gmx.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
1. Optimization
------------
The current implementation copies the 'from' and 'to' buffers to a
scratchpad and it takes the bitwise NOT of 'from' buffer while copying.
In the next step csum_partial() is called with this scratchpad.
so, mathematically, the current implementation is doing:
result = csum(to - from)
Here, 'to' and '~ from' are copied in to the scratchpad buffer, we need
it in the scratchpad buffer because csum_partial() takes a single
contiguous buffer and not two disjoint buffers like 'to' and 'from'.
We can re write this equation to:
result = csum(to) - csum(from)
using the distributive property of csum().
this allows 'to' and 'from' to be at different locations and therefore
this scratchpad and copying is not needed.
This in C code will look like:
result = csum_sub(csum_partial(to, to_size, seed),
csum_partial(from, from_size, 0));
2. Homogenization
--------------
The bpf_csum_diff() helper calls csum_partial() which is implemented by
some architectures like arm and x86 but other architectures rely on the
generic implementation in lib/checksum.c
The generic implementation in lib/checksum.c returns a 16 bit value but
the arch specific implementations can return more than 16 bits, this
works out in most places because before the result is used, it is passed
through csum_fold() that turns it into a 16-bit value.
bpf_csum_diff() directly returns the value from csum_partial() and
therefore the returned values could be different on different
architectures. see discussion in [1]:
for the int value 28 the calculated checksums are:
x86 : -29 : 0xffffffe3
generic (arm64, riscv) : 65507 : 0x0000ffe3
arm : 131042 : 0x0001ffe2
Pass the result of bpf_csum_diff() through from32to16() before returning
to homogenize this result for all architectures.
NOTE: from32to16() is used instead of csum_fold() because csum_fold()
does from32to16() + bitwise NOT of the result, which is not what we want
to do here.
[1] https://lore.kernel.org/bpf/CAJ+HfNiQbOcqCLxFUP2FMm5QrLXUUaj852Fxe3hn_2JNiucn6g@mail.gmail.com/
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20241026125339.26459-3-puranjay@kernel.org
|
|
Config a small gso_max_size/gso_ipv4_max_size will lead to an underflow
in sk_dst_gso_max_size(), which may trigger a BUG_ON crash,
because sk->sk_gso_max_size would be much bigger than device limits.
Call Trace:
tcp_write_xmit
tso_segs = tcp_init_tso_segs(skb, mss_now);
tcp_set_skb_tso_segs
tcp_skb_pcount_set
// skb->len = 524288, mss_now = 8
// u16 tso_segs = 524288/8 = 65535 -> 0
tso_segs = DIV_ROUND_UP(skb->len, mss_now)
BUG_ON(!tso_segs)
Add check for the minimum value of gso_max_size and gso_ipv4_max_size.
Fixes: 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation")
Fixes: 9eefedd58ae1 ("net: add gso_ipv4_max_size and gro_ipv4_max_size per device")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241023035213.517386-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 26eebdc4b005 ("rtnetlink: Return int from rtnl_af_register().")
made rtnl_af_register() return int again, and kdoc needs to be fixed up.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241022210320.86111-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ioctl(SIOCGIFCONF) calls dev_ifconf() that operates on the current netns.
Let's use per-netns RTNL helpers in dev_ifconf() and inet_gifconf().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We will need the per-netns version of rtnl_trylock().
rtnl_net_trylock() calls __rtnl_net_lock() only when rtnl_trylock()
successfully holds RTNL.
When RTNL is removed, we will use mutex_trylock() for per-netns RTNL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The following race condition could trigger a NULL pointer dereference:
sock_map_link_detach(): sock_map_link_update_prog():
mutex_lock(&sockmap_mutex);
...
sockmap_link->map = NULL;
mutex_unlock(&sockmap_mutex);
mutex_lock(&sockmap_mutex);
...
sock_map_prog_link_lookup(sockmap_link->map);
mutex_unlock(&sockmap_mutex);
<continue>
Fix it by adding a NULL pointer check. In this specific case, it makes
no sense to update a link which is being released.
Reported-by: Ruan Bonan <bonan.ruan@u.nus.edu>
Fixes: 699c23f02c65 ("bpf: Add bpf_link support for sk_msg and sk_skb progs")
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20241026185522.338562-1-xiyou.wangcong@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
mm layer is providing convenient functions, we do not have
to work around old limitations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Gilad Naaman <gnaaman@drivenets.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241022150059.1345406-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts and no adjacent changes.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Cross-merge bpf fixes after downstream PR.
No conflicts.
Adjacent changes in:
include/linux/bpf.h
include/uapi/linux/bpf.h
kernel/bpf/btf.c
kernel/bpf/helpers.c
kernel/bpf/syscall.c
kernel/bpf/verifier.c
kernel/trace/bpf_trace.c
mm/slab_common.c
tools/include/uapi/linux/bpf.h
tools/testing/selftests/bpf/Makefile
Link: https://lore.kernel.org/all/20241024215724.60017-1-daniel@iogearbox.net/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
bpf_selem_alloc()
In a later patch, the task local storage will only accept uptr
from the syscall update_elem and will not accept uptr from
the bpf prog. The reason is the bpf prog does not have a way
to provide a valid user space address.
bpf_local_storage_update() and bpf_selem_alloc() are used by
both bpf prog bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE)
and bpf syscall update_elem. "bool swap_uptrs" arg is added
to bpf_local_storage_update() and bpf_selem_alloc() to tell if
it is called by the bpf prog or by the bpf syscall. When
swap_uptrs==true, it is called by the syscall.
The arg is named (swap_)uptrs because the later patch will swap
the uptrs between the newly allocated selem and the user space
provided map_value. It will make error handling easier in case
map->ops->map_update_elem() fails and the caller can decide
if it needs to unpin the uptr in the user space provided
map_value or the bpf_local_storage_update() has already
taken the uptr ownership and will take care of unpinning it also.
Only swap_uptrs==false is passed now. The logic to handle
the true case will be added in a later patch.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
npinfo is not used in any of the ndo_netpoll_setup() methods.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241018052108.2610827-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This fixes the output of rps_default_mask and flow_limit_cpu_bitmap when
the CPU count is > 448, as it was truncated.
The underlying values are actually stored correctly when writing to
these sysctl but displaying them uses a fixed length temporary buffer in
dump_cpumask. This buffer can be too small if the CPU count is > 448.
Fix this by dynamically allocating the buffer in dump_cpumask, using a
guesstimate of what we need.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When computing the length we'll be able to use out of the buffers, one
char is removed from the temporary one to make room for a newline. It
should be removed from the output buffer length too, but in reality this
is not needed as the later call to scnprintf makes sure a null char is
written at the end of the buffer which we override with the newline.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Before adding a new line at the end of the temporary buffer in
dump_cpumask, a length check is performed to ensure there is space for
it.
len = min(sizeof(kbuf) - 1, *lenp);
len = scnprintf(kbuf, len, ...);
if (len < *lenp)
kbuf[len++] = '\n';
Note that the check is currently logically wrong, the written length is
compared against the output buffer, not the temporary one. However this
has no consequence as this is always true, even if fixed: scnprintf
includes a null char at the end of the buffer but the returned length do
not include it and there is always space for overriding it with a
newline.
Remove the condition.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
sock_{,re}set_flag() are contained in sock_valbool_flag(),
it would be cleaner to just use sock_valbool_flag().
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Link: https://patch.msgid.link/20241017133435.2552-1-yajun.deng@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We can now undo parts of 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT}
args in case of error") as discussed in [0].
Given the BPF helpers now have MEM_WRITE tag, the MEM_UNINIT can be cleared.
The mtu_len is an input as well as output argument, meaning, the BPF program
has to set it to something. It cannot be uninitialized. Therefore, allowing
uninitialized memory and zeroing it on error would be odd. It was done as
an interim step in 4b3786a6c539 as the desired behavior could not have been
expressed before the introduction of MEM_WRITE tag.
Fixes: 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/a86eb76d-f52f-dee4-e5d2-87e45de3e16f@iogearbox.net [0]
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a MEM_WRITE attribute for BPF helper functions which can be used in
bpf_func_proto to annotate an argument type in order to let the verifier
know that the helper writes into the memory passed as an argument. In
the past MEM_UNINIT has been (ab)used for this function, but the latter
merely tells the verifier that the passed memory can be uninitialized.
There have been bugs with overloading the latter but aside from that
there are also cases where the passed memory is read + written which
currently cannot be expressed, see also 4b3786a6c539 ("bpf: Zero former
ARG_PTR_TO_{LONG,INT} args in case of error").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Once RTNL is replaced with rtnl_net_lock(), we need a mechanism to
guarantee that rtnl_af_ops is alive during inflight RTM_SETLINK
even when its module is being unloaded.
Let's use SRCU to protect ops.
rtnl_af_lookup() now iterates rtnl_af_ops under RCU and returns
SRCU-protected ops pointer. The caller must call rtnl_af_put()
to release the pointer after the use.
Also, rtnl_af_unregister() unlinks the ops first and calls
synchronize_srcu() to wait for inflight RTM_SETLINK requests to
complete.
Note that rtnl_af_ops needs to be protected by its dedicated lock
when RTNL is removed.
Note also that BUG_ON() in do_setlink() is changed to the normal
error handling as a different af_ops might be found after
validate_linkmsg().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The next patch will add init_srcu_struct() in rtnl_af_register(),
then we need to handle its error.
Let's add the error handling in advance to make the following
patch cleaner.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Matt Johnston <matt@codeconstruct.com.au>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We will push RTNL down to rtnl_setlink().
RTM_SETLINK could call rtnl_link_get_net_capable() in do_setlink()
to move a dev to a new netns, but the netns needs to be fetched before
holding rtnl_net_lock().
Let's move it to rtnl_setlink() and pass the netns to do_setlink().
Now, RTM_NEWLINK paths (rtnl_changelink() and rtnl_group_changelink())
can pass the prefetched netns to do_setlink().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We will push RTNL down to rtnl_setlink().
Let's unify the error path to make it easy to place rtnl_net_lock().
While at it, keep the variables in reverse xmas order.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|