summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2022-12-13Merge tag 'lsm-pr-20221212' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Improve the error handling in the device cgroup such that memory allocation failures when updating the access policy do not potentially alter the policy. - Some minor fixes to reiserfs to ensure that it properly releases LSM-related xattr values. - Update the security_socket_getpeersec_stream() LSM hook to take sockptr_t values. Previously the net/BPF folks updated the getsockopt code in the network stack to leverage the sockptr_t type to make it easier to pass both kernel and __user pointers, but unfortunately when they did so they didn't convert the LSM hook. While there was/is no immediate risk by not converting the LSM hook, it seems like this is a mistake waiting to happen so this patch proactively does the LSM hook conversion. - Convert vfs_getxattr_alloc() to return an int instead of a ssize_t and cleanup the callers. Internally the function was never going to return anything larger than an int and the callers were doing some very odd things casting the return value; this patch fixes all that and helps bring a bit of sanity to vfs_getxattr_alloc() and its callers. - More verbose, and helpful, LSM debug output when the system is booted with "lsm.debug" on the command line. There are examples in the commit description, but the quick summary is that this patch provides better information about which LSMs are enabled and the ordering in which they are processed. - General comment and kernel-doc fixes and cleanups. * tag 'lsm-pr-20221212' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lsm: Fix description of fs_context_parse_param lsm: Add/fix return values in lsm_hooks.h and fix formatting lsm: Clarify documentation of vm_enough_memory hook reiserfs: Add missing calls to reiserfs_security_free() lsm,fs: fix vfs_getxattr_alloc() return type and caller error paths device_cgroup: Roll back to original exceptions after copy failure LSM: Better reporting of actual LSMs at boot lsm: make security_socket_getpeersec_stream() sockptr_t safe audit: Fix some kernel-doc warnings lsm: remove obsoleted comments for security hooks fs: edit a comment made in bad taste
2022-12-13Merge tag 'linux-kselftest-kunit-next-6.2-rc1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit updates from Shuah Khan: "Several enhancements, fixes, clean-ups, documentation updates, improvements to logging and KTAP compliance of KUnit test output: - log numbers in decimal and hex - parse KTAP compliant test output - allow conditionally exposing static symbols to tests when KUNIT is enabled - make static symbols visible during kunit testing - clean-ups to remove unused structure definition" * tag 'linux-kselftest-kunit-next-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (29 commits) Documentation: dev-tools: Clarify requirements for result description apparmor: test: make static symbols visible during kunit testing kunit: add macro to allow conditionally exposing static symbols to tests kunit: tool: make parser preserve whitespace when printing test log Documentation: kunit: Fix "How Do I Use This" / "Next Steps" sections kunit: tool: don't include KTAP headers and the like in the test log kunit: improve KTAP compliance of KUnit test output kunit: tool: parse KTAP compliant test output mm: slub: test: Use the kunit_get_current_test() function kunit: Use the static key when retrieving the current test kunit: Provide a static key to check if KUnit is actively running tests kunit: tool: make --json do nothing if --raw_ouput is set kunit: tool: tweak error message when no KTAP found kunit: remove KUNIT_INIT_MEM_ASSERTION macro Documentation: kunit: Remove redundant 'tips.rst' page Documentation: KUnit: reword description of assertions Documentation: KUnit: make usage.rst a superset of tips.rst, remove duplication kunit: eliminate KUNIT_INIT_*_ASSERT_STRUCT macros kunit: tool: remove redundant file.close() call in unit test kunit: tool: unit tests all check parser errors, standardize formatting a bit ...
2022-12-13Merge tag 'random-6.2-rc1-for-linus' of ↵Linus Torvalds3-24/+19
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random*_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. * tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...
2022-12-12Merge tag 'rcu.2022.12.02a' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates. This is the second in a series from an ongoing review of the RCU documentation. - Miscellaneous fixes. - Introduce a default-off Kconfig option that depends on RCU_NOCB_CPU that, on CPUs mentioned in the nohz_full or rcu_nocbs boot-argument CPU lists, causes call_rcu() to introduce delays. These delays result in significant power savings on nearly idle Android and ChromeOS systems. These savings range from a few percent to more than ten percent. This series also includes several commits that change call_rcu() to a new call_rcu_hurry() function that avoids these delays in a few cases, for example, where timely wakeups are required. Several of these are outside of RCU and thus have acks and reviews from the relevant maintainers. - Create an srcu_read_lock_nmisafe() and an srcu_read_unlock_nmisafe() for architectures that support NMIs, but which do not provide NMI-safe this_cpu_inc(). These NMI-safe SRCU functions are required by the upcoming lockless printk() work by John Ogness et al. - Changes providing minor but important increases in torture test coverage for the new RCU polled-grace-period APIs. - Changes to torturescript that avoid redundant kernel builds, thus providing about a 30% speedup for the torture.sh acceptance test. * tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits) net: devinet: Reduce refcount before grace period net: Use call_rcu_hurry() for dst_release() workqueue: Make queue_rcu_work() use call_rcu_hurry() percpu-refcount: Use call_rcu_hurry() for atomic switch scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu() rcu/rcutorture: Use call_rcu_hurry() where needed rcu/rcuscale: Use call_rcu_hurry() for async reader test rcu/sync: Use call_rcu_hurry() instead of call_rcu rcuscale: Add laziness and kfree tests rcu: Shrinker for lazy rcu rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() rcu: Make call_rcu() lazy to save power rcu: Implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC srcu: Debug NMI safety even on archs that don't require it srcu: Explain the reason behind the read side critical section on GP start srcu: Warn when NMI-unsafe API is used in NMI arch/s390: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option arch/loongarch: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() rcu-tasks: Make grace-period-age message human-readable ...
2022-12-01net: Use call_rcu_hurry() for dst_release()Joel Fernandes (Google)1-1/+1
In a networking test on ChromeOS, kernels built with the new CONFIG_RCU_LAZY=y Kconfig option fail a networking test in the teardown phase. This failure may be reproduced as follows: ip netns del <name> The CONFIG_RCU_LAZY=y Kconfig option was introduced by earlier commits in this series for the benefit of certain battery-powered systems. This Kconfig option causes call_rcu() to delay its callbacks in order to batch them. This means that a given RCU grace period covers more callbacks, thus reducing the number of grace periods, in turn reducing the amount of energy consumed, which increases battery lifetime which can be a very good thing. This is not a subtle effect: In some important use cases, the battery lifetime is increased by more than 10%. This CONFIG_RCU_LAZY=y option is available only for CPUs that offload callbacks, for example, CPUs mentioned in the rcu_nocbs kernel boot parameter passed to kernels built with CONFIG_RCU_NOCB_CPU=y. Delaying callbacks is normally not a problem because most callbacks do nothing but free memory. If the system is short on memory, a shrinker will kick all currently queued lazy callbacks out of their laziness, thus freeing their memory in short order. Similarly, the rcu_barrier() function, which blocks until all currently queued callbacks are invoked, will also kick lazy callbacks, thus enabling rcu_barrier() to complete in a timely manner. However, there are some cases where laziness is not a good option. For example, synchronize_rcu() invokes call_rcu(), and blocks until the newly queued callback is invoked. It would not be a good for synchronize_rcu() to block for ten seconds, even on an idle system. Therefore, synchronize_rcu() invokes call_rcu_hurry() instead of call_rcu(). The arrival of a non-lazy call_rcu_hurry() callback on a given CPU kicks any lazy callbacks that might be already queued on that CPU. After all, if there is going to be a grace period, all callbacks might as well get full benefit from it. Yes, this could be done the other way around by creating a call_rcu_lazy(), but earlier experience with this approach and feedback at the 2022 Linux Plumbers Conference shifted the approach to call_rcu() being lazy with call_rcu_hurry() for the few places where laziness is inappropriate. Returning to the test failure, use of ftrace showed that this failure cause caused by the aadded delays due to this new lazy behavior of call_rcu() in kernels built with CONFIG_RCU_LAZY=y. Therefore, make dst_release() use call_rcu_hurry() in order to revert to the old test-failure-free behavior. [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: David Ahern <dsahern@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: <netdev@vger.kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-24Merge branch 'master' of ↵Jakub Kicinski1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== ipsec 2022-11-23 1) Fix "disable_policy" on ipv4 early demuxP Packets after the initial packet in a flow might be incorectly dropped on early demux if there are no matching policies. From Eyal Birger. 2) Fix a kernel warning in case XFRM encap type is not available. From Eyal Birger. 3) Fix ESN wrap around for GSO to avoid a double usage of a sequence number. From Christian Langrock. 4) Fix a send_acquire race with pfkey_register. From Herbert Xu. 5) Fix a list corruption panic in __xfrm_state_delete(). Thomas Jarosch. 6) Fix an unchecked return value in xfrm6_init(). Chen Zhongjin. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Fix ignored return value in xfrm6_init() xfrm: Fix oops in __xfrm_state_delete() af_key: Fix send_acquire race with pfkey_register xfrm: replay: Fix ESN wrap around for GSO xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not available xfrm: fix "disable_policy" on ipv4 early demux ==================== Link: https://lore.kernel.org/r/20221123093117.434274-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-18netfilter: conntrack: Fix data-races around ct markDaniel Xu1-1/+1
nf_conn:mark can be read from and written to in parallel. Use READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted compiler optimizations. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-11-18net: neigh: decrement the family specific qlenThomas Zeitlhofer1-28/+30
Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") introduced the length counter qlen in struct neigh_parms. There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and while the family specific qlen is incremented in pneigh_enqueue(), the mentioned commit decrements always the IPv4/ARP specific qlen, regardless of the currently processed family, in pneigh_queue_purge() and neigh_proxy_process(). As a result, with IPv6/ND, the family specific qlen is only incremented (and never decremented) until it exceeds PROXY_QLEN, and then, according to the check in pneigh_enqueue(), neighbor solicitations are not answered anymore. As an example, this is noted when using the subnet-router anycast address to access a Linux router. After a certain amount of time (in the observed case, qlen exceeded PROXY_QLEN after two days), the Linux router stops answering neighbor solicitations for its subnet-router anycast address and effectively becomes unreachable. Another result with IPv6/ND is that the IPv4/ARP specific qlen is decremented more often than incremented. This leads to negative qlen values, as a signed integer has been used for the length counter qlen, and potentially to an integer overflow. Fix this by introducing the helper function neigh_parms_qlen_dec(), which decrements the family specific qlen. Thereby, make use of the existing helper function neigh_get_dev_parms_rcu(), whose definition therefore needs to be placed earlier in neighbour.c. Take the family member from struct neigh_table to determine the currently processed family and appropriately call neigh_parms_qlen_dec() from pneigh_queue_purge() and neigh_proxy_process(). Additionally, use an unsigned integer for the length counter qlen. Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-18treewide: use get_random_u32_inclusive() when possibleJason A. Donenfeld1-15/+10
These cases were done with this Coccinelle: @@ expression H; expression L; @@ - (get_random_u32_below(H) + L) + get_random_u32_inclusive(L, H + L - 1) @@ expression H; expression L; expression E; @@ get_random_u32_inclusive(L, H - + E - - E ) @@ expression H; expression L; expression E; @@ get_random_u32_inclusive(L, H - - E - + E ) @@ expression H; expression L; expression E; expression F; @@ get_random_u32_inclusive(L, H - - E + F - + E ) @@ expression H; expression L; expression E; expression F; @@ get_random_u32_inclusive(L, H - + E + F - - E ) And then subsequently cleaned up by hand, with several automatic cases rejected if it didn't make sense contextually. Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18treewide: use get_random_u32_below() instead of deprecated functionJason A. Donenfeld3-17/+17
This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-05lsm: make security_socket_getpeersec_stream() sockptr_t safePaul Moore1-1/+2
Commit 4ff09db1b79b ("bpf: net: Change sk_getsockopt() to take the sockptr_t argument") made it possible to call sk_getsockopt() with both user and kernel address space buffers through the use of the sockptr_t type. Unfortunately at the time of conversion the security_socket_getpeersec_stream() LSM hook was written to only accept userspace buffers, and in a desire to avoid having to change the LSM hook the commit author simply passed the sockptr_t's userspace buffer pointer. Since the only sk_getsockopt() callers at the time of conversion which used kernel sockptr_t buffers did not allow SO_PEERSEC, and hence the security_socket_getpeersec_stream() hook, this was acceptable but also very fragile as future changes presented the possibility of silently passing kernel space pointers to the LSM hook. There are several ways to protect against this, including careful code review of future commits, but since relying on code review to catch bugs is a recipe for disaster and the upstream eBPF maintainer is "strongly against defensive programming", this patch updates the LSM hook, and all of the implementations to support sockptr_t and safely handle both user and kernel space buffers. Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-11-04net: gso: fix panic on frag_list with mixed head alloc typesJiri Benc1-17/+19
Since commit 3dcbdb134f32 ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list"), it is allowed to change gso_size of a GRO packet. However, that commit assumes that "checking the first list_skb member suffices; i.e if either of the list_skb members have non head_frag head, then the first one has too". It turns out this assumption does not hold. We've seen BUG_ON being hit in skb_segment when skbs on the frag_list had differing head_frag with the vmxnet3 driver. This happens because __netdev_alloc_skb and __napi_alloc_skb can return a skb that is page backed or kmalloced depending on the requested size. As the result, the last small skb in the GRO packet can be kmalloced. There are three different locations where this can be fixed: (1) We could check head_frag in GRO and not allow GROing skbs with different head_frag. However, that would lead to performance regression on normal forward paths with unmodified gso_size, where !head_frag in the last packet is not a problem. (2) Set a flag in bpf_skb_net_grow and bpf_skb_net_shrink indicating that NETIF_F_SG is undesirable. That would need to eat a bit in sk_buff. Furthermore, that flag can be unset when all skbs on the frag_list are page backed. To retain good performance, bpf_skb_net_grow/shrink would have to walk the frag_list. (3) Walk the frag_list in skb_segment when determining whether NETIF_F_SG should be cleared. This of course slows things down. This patch implements (3). To limit the performance impact in skb_segment, the list is walked only for skbs with SKB_GSO_DODGY set that have gso_size changed. Normal paths thus will not hit it. We could check only the last skb but since we need to walk the whole list anyway, let's stay on the safe side. Fixes: 3dcbdb134f32 ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list") Signed-off-by: Jiri Benc <jbenc@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/e04426a6a91baf4d1081e1b478c82b5de25fdf21.1667407944.git.jbenc@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-04Merge tag 'for-netdev' of ↵Jakub Kicinski2-8/+6
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== bpf 2022-11-04 We've added 8 non-merge commits during the last 3 day(s) which contain a total of 10 files changed, 113 insertions(+), 16 deletions(-). The main changes are: 1) Fix memory leak upon allocation failure in BPF verifier's stack state tracking, from Kees Cook. 2) Fix address leakage when BPF progs release reference to an object, from Youlin Li. 3) Fix BPF CI breakage from buggy in.h uapi header dependency, from Andrii Nakryiko. 4) Fix bpftool pin sub-command's argument parsing, from Pu Lehui. 5) Fix BPF sockmap lockdep warning by cancelling psock work outside of socket lock, from Cong Wang. 6) Follow-up for BPF sockmap to fix sk_forward_alloc accounting, from Wang Yufen. bpf-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add verifier test for release_reference() bpf: Fix wrong reg type conversion in release_reference() bpf, sock_map: Move cancel_work_sync() out of sock lock tools/headers: Pull in stddef.h to uapi to fix BPF selftests build in CI net/ipv4: Fix linux/in.h header dependencies bpftool: Fix NULL pointer dereference when pin {PROG, MAP, LINK} without FILE bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues bpf, verifier: Fix memory leak in array reallocation for stack state ==================== Link: https://lore.kernel.org/r/20221104000445.30761-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-03bpf, sock_map: Move cancel_work_sync() out of sock lockCong Wang2-8/+6
Stanislav reported a lockdep warning, which is caused by the cancel_work_sync() called inside sock_map_close(), as analyzed below by Jakub: psock->work.func = sk_psock_backlog() ACQUIRE psock->work_mutex sk_psock_handle_skb() skb_send_sock() __skb_send_sock() sendpage_unlocked() kernel_sendpage() sock->ops->sendpage = inet_sendpage() sk->sk_prot->sendpage = tcp_sendpage() ACQUIRE sk->sk_lock tcp_sendpage_locked() RELEASE sk->sk_lock RELEASE psock->work_mutex sock_map_close() ACQUIRE sk->sk_lock sk_psock_stop() sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED) cancel_work_sync() __cancel_work_timer() __flush_work() // wait for psock->work to finish RELEASE sk->sk_lock We can move the cancel_work_sync() out of the sock lock protection, but still before saved_close() was called. Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") Reported-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20221102043417.279409-1-xiyou.wangcong@gmail.com
2022-11-03net, neigh: Fix null-ptr-deref in neigh_table_clear()Chen Zhongjin1-1/+1
When IPv6 module gets initialized but hits an error in the middle, kenel panic with: KASAN: null-ptr-deref in range [0x0000000000000598-0x000000000000059f] CPU: 1 PID: 361 Comm: insmod Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:__neigh_ifdown.isra.0+0x24b/0x370 RSP: 0018:ffff888012677908 EFLAGS: 00000202 ... Call Trace: <TASK> neigh_table_clear+0x94/0x2d0 ndisc_cleanup+0x27/0x40 [ipv6] inet6_init+0x21c/0x2cb [ipv6] do_one_initcall+0xd3/0x4d0 do_init_module+0x1ae/0x670 ... Kernel panic - not syncing: Fatal exception When ipv6 initialization fails, it will try to cleanup and calls: neigh_table_clear() neigh_ifdown(tbl, NULL) pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev == NULL)) # dev_net(NULL) triggers null-ptr-deref. Fix it by passing NULL to pneigh_queue_purge() in neigh_ifdown() if dev is NULL, to make kernel not panic immediately. Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Link: https://lore.kernel.org/r/20221101121552.21890-1-chenzhongjin@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27net: do not sense pfmemalloc status in skb_append_pagefrags()Eric Dumazet1-1/+1
skb_append_pagefrags() is used by af_unix and udp sendpage() implementation so far. In commit 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") we explained why we should not sense pfmemalloc status for pages owned by user space. We should also use skb_fill_page_desc_noacc() in skb_append_pagefrags() to avoid following KCSAN report: BUG: KCSAN: data-race in lru_add_fn / skb_append_pagefrags write to 0xffffea00058fc1c8 of 8 bytes by task 17319 on cpu 0: __list_add include/linux/list.h:73 [inline] list_add include/linux/list.h:88 [inline] lruvec_add_folio include/linux/mm_inline.h:323 [inline] lru_add_fn+0x327/0x410 mm/swap.c:228 folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246 lru_add_drain_cpu+0x73/0x250 mm/swap.c:669 lru_add_drain+0x21/0x60 mm/swap.c:773 free_pages_and_swap_cache+0x16/0x70 mm/swap_state.c:311 tlb_batch_pages_flush mm/mmu_gather.c:59 [inline] tlb_flush_mmu_free mm/mmu_gather.c:256 [inline] tlb_flush_mmu+0x5b2/0x640 mm/mmu_gather.c:263 tlb_finish_mmu+0x86/0x100 mm/mmu_gather.c:363 exit_mmap+0x190/0x4d0 mm/mmap.c:3098 __mmput+0x27/0x1b0 kernel/fork.c:1185 mmput+0x3d/0x50 kernel/fork.c:1207 copy_process+0x19fc/0x2100 kernel/fork.c:2518 kernel_clone+0x166/0x550 kernel/fork.c:2671 __do_sys_clone kernel/fork.c:2812 [inline] __se_sys_clone kernel/fork.c:2796 [inline] __x64_sys_clone+0xc3/0xf0 kernel/fork.c:2796 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffffea00058fc1c8 of 8 bytes by task 17325 on cpu 1: page_is_pfmemalloc include/linux/mm.h:1817 [inline] __skb_fill_page_desc include/linux/skbuff.h:2432 [inline] skb_fill_page_desc include/linux/skbuff.h:2453 [inline] skb_append_pagefrags+0x210/0x600 net/core/skbuff.c:3974 unix_stream_sendpage+0x45e/0x990 net/unix/af_unix.c:2338 kernel_sendpage+0x184/0x300 net/socket.c:3561 sock_sendpage+0x5a/0x70 net/socket.c:1054 pipe_to_sendpage+0x128/0x160 fs/splice.c:361 splice_from_pipe_feed fs/splice.c:415 [inline] __splice_from_pipe+0x222/0x4d0 fs/splice.c:559 splice_from_pipe fs/splice.c:594 [inline] generic_splice_sendpage+0x89/0xc0 fs/splice.c:743 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:931 splice_direct_to_actor+0x305/0x620 fs/splice.c:886 do_splice_direct+0xfb/0x180 fs/splice.c:974 do_sendfile+0x3bf/0x910 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0xffffea00058fc188 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17325 Comm: syz-executor.0 Not tainted 6.1.0-rc1-syzkaller-00158-g440b7895c990-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 Fixes: 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221027040346.1104204-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27kunit: Use KUNIT_EXPECT_MEMEQ macroMaíra Canal1-2/+2
Use KUNIT_EXPECT_MEMEQ to compare memory blocks in replacement of the KUNIT_EXPECT_EQ macro. Therefor, the statement KUNIT_EXPECT_EQ(test, memcmp(foo, bar, size), 0); is replaced by: KUNIT_EXPECT_MEMEQ(test, foo, bar, size); Signed-off-by: Maíra Canal <mairacanal@riseup.net> Acked-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-10-24net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failedZhengchao Shao1-0/+7
When the ops_init() interface is invoked to initialize the net, but ops->init() fails, data is released. However, the ptr pointer in net->gen is invalid. In this case, when nfqnl_nf_hook_drop() is invoked to release the net, invalid address access occurs. The process is as follows: setup_net() ops_init() data = kzalloc(...) ---> alloc "data" net_assign_generic() ---> assign "date" to ptr in net->gen ... ops->init() ---> failed ... kfree(data); ---> ptr in net->gen is invalid ... ops_exit_list() ... nfqnl_nf_hook_drop() *q = nfnl_queue_pernet(net) ---> q is invalid The following is the Call Trace information: BUG: KASAN: use-after-free in nfqnl_nf_hook_drop+0x264/0x280 Read of size 8 at addr ffff88810396b240 by task ip/15855 Call Trace: <TASK> dump_stack_lvl+0x8e/0xd1 print_report+0x155/0x454 kasan_report+0xba/0x1f0 nfqnl_nf_hook_drop+0x264/0x280 nf_queue_nf_hook_drop+0x8b/0x1b0 __nf_unregister_net_hook+0x1ae/0x5a0 nf_unregister_net_hooks+0xde/0x130 ops_exit_list+0xb0/0x170 setup_net+0x7ac/0xbd0 copy_net_ns+0x2e6/0x6b0 create_new_namespaces+0x382/0xa50 unshare_nsproxy_namespaces+0xa6/0x1c0 ksys_unshare+0x3a4/0x7e0 __x64_sys_unshare+0x2d/0x40 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> Allocated by task 15855: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0xa1/0xb0 __kmalloc+0x49/0xb0 ops_init+0xe7/0x410 setup_net+0x5aa/0xbd0 copy_net_ns+0x2e6/0x6b0 create_new_namespaces+0x382/0xa50 unshare_nsproxy_namespaces+0xa6/0x1c0 ksys_unshare+0x3a4/0x7e0 __x64_sys_unshare+0x2d/0x40 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Freed by task 15855: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x40 ____kasan_slab_free+0x155/0x1b0 slab_free_freelist_hook+0x11b/0x220 __kmem_cache_free+0xa4/0x360 ops_init+0xb9/0x410 setup_net+0x5aa/0xbd0 copy_net_ns+0x2e6/0x6b0 create_new_namespaces+0x382/0xa50 unshare_nsproxy_namespaces+0xa6/0x1c0 ksys_unshare+0x3a4/0x7e0 __x64_sys_unshare+0x2d/0x40 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: f875bae06533 ("net: Automatically allocate per namespace data.") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-21Merge tag 'net-6.1-rc2' of ↵Linus Torvalds3-4/+24
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Current release - regressions: - revert "net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}" - revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" - dsa: uninitialized variable in dsa_slave_netdevice_event() - eth: sunhme: uninitialized variable in happy_meal_init() Current release - new code bugs: - eth: octeontx2: fix resource not freed after malloc Previous releases - regressions: - sched: fix return value of qdisc ingress handling on success - sched: fix race condition in qdisc_graft() - udp: update reuse->has_conns under reuseport_lock. - tls: strp: make sure the TCP skbs do not have overlapping data - hsr: avoid possible NULL deref in skb_clone() - tipc: fix an information leak in tipc_topsrv_kern_subscr - phylink: add mac_managed_pm in phylink_config structure - eth: i40e: fix DMA mappings leak - eth: hyperv: fix a RX-path warning - eth: mtk: fix memory leaks Previous releases - always broken: - sched: cake: fix null pointer access issue when cake_init() fails" * tag 'net-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits) net: phy: dp83822: disable MDI crossover status change interrupt net: sched: fix race condition in qdisc_graft() net: hns: fix possible memory leak in hnae_ae_register() wwan_hwsim: fix possible memory leak in wwan_hwsim_dev_new() sfc: include vport_id in filter spec hash and equal() genetlink: fix kdoc warnings selftests: add selftest for chaining of tc ingress handling to egress net: Fix return value of qdisc ingress handling on success net: sched: sfb: fix null pointer access issue when sfb_init() fails Revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" net: sched: cake: fix null pointer access issue when cake_init() fails ethernet: marvell: octeontx2 Fix resource not freed after malloc netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces. ionic: catch NULL pointer issue on reconfig net: hsr: avoid possible NULL deref in skb_clone() bnxt_en: fix memory leak in bnxt_nvm_test() ip6mr: fix UAF issue in ip6mr_sk_done() when addrconf_init_net() failed udp: Update reuse->has_conns under reuseport_lock. net: ethernet: mediatek: ppe: Remove the unused function mtk_foe_entry_usable() ...
2022-10-19net: Fix return value of qdisc ingress handling on successPaul Blakey1-0/+4
Currently qdisc ingress handling (sch_handle_ingress()) doesn't set a return value and it is left to the old return value of the caller (__netif_receive_skb_core()) which is RX drop, so if the packet is consumed, caller will stop and return this value as if the packet was dropped. This causes a problem in the kernel tcp stack when having a egress tc rule forwarding to a ingress tc rule. The tcp stack sending packets on the device having the egress rule will see the packets as not successfully transmitted (although they actually were), will not advance it's internal state of sent data, and packets returning on such tcp stream will be dropped by the tcp stack with reason ack-of-unsent-data. See reproduction in [0] below. Fix that by setting the return value to RX success if the packet was handled successfully. [0] Reproduction steps: $ ip link add veth1 type veth peer name peer1 $ ip link add veth2 type veth peer name peer2 $ ifconfig peer1 5.5.5.6/24 up $ ip netns add ns0 $ ip link set dev peer2 netns ns0 $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up $ ifconfig veth2 0 up $ ifconfig veth1 0 up #ingress forwarding veth1 <-> veth2 $ tc qdisc add dev veth2 ingress $ tc qdisc add dev veth1 ingress $ tc filter add dev veth2 ingress prio 1 proto all flower \ action mirred egress redirect dev veth1 $ tc filter add dev veth1 ingress prio 1 proto all flower \ action mirred egress redirect dev veth2 #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe $ tc qdisc add dev peer1 clsact $ tc filter add dev peer1 egress prio 20 proto ip flower \ action mirred ingress redirect dev veth1 #run iperf and see connection not running $ iperf3 -s& $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1 #delete egress rule, and run again, now should work $ tc filter del dev peer1 egress $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1 Fixes: f697c3e8b35c ("[NET]: Avoid unnecessary cloning for ingress filtering") Signed-off-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-18udp: Update reuse->has_conns under reuseport_lock.Kuniyuki Iwashima1-0/+16
When we call connect() for a UDP socket in a reuseport group, we have to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel could select a unconnected socket wrongly for packets sent to the connected socket. However, the current way to set has_conns is illegal and possible to trigger that problem. reuseport_has_conns() changes has_conns under rcu_read_lock(), which upgrades the RCU reader to the updater. Then, it must do the update under the updater's lock, reuseport_lock, but it doesn't for now. For this reason, there is a race below where we fail to set has_conns resulting in the wrong socket selection. To avoid the race, let's split the reader and updater with proper locking. cpu1 cpu2 +----+ +----+ __ip[46]_datagram_connect() reuseport_grow() . . |- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size) | . | | |- rcu_read_lock() | |- reuse = rcu_dereference(sk->sk_reuseport_cb) | | | | | /* reuse->has_conns == 0 here */ | | |- more_reuse->has_conns = reuse->has_conns | |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */ | | | | | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, | | | more_reuse) | `- rcu_read_unlock() `- kfree_rcu(reuse, rcu) | |- sk->sk_state = TCP_ESTABLISHED Note the likely(reuse) in reuseport_has_conns_set() is always true, but we put the test there for ease of review. [0] For the record, usually, sk_reuseport_cb is changed under lock_sock(). The only exception is reuseport_grow() & TCP reqsk migration case. 1) shutdown() TCP listener, which is moved into the latter part of reuse->socks[] to migrate reqsk. 2) New listen() overflows reuse->socks[] and call reuseport_grow(). 3) reuse->max_socks overflows u16 with the new listener. 4) reuseport_grow() pops the old shutdown()ed listener from the array and update its sk->sk_reuseport_cb as NULL without lock_sock(). shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(), but, reuseport_has_conns_set() is called only for UDP under lock_sock(), so likely(reuse) never be false in reuseport_has_conns_set(). [0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/ Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-17Merge tag 'random-6.1-rc1-for-linus' of ↵Linus Torvalds3-26/+25
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull more random number generator updates from Jason Donenfeld: "This time with some large scale treewide cleanups. The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are: - If you want a secure or an insecure random u64, use get_random_u64() - If you want a secure or an insecure random u32, use get_random_u32() The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int(). - If you want a secure or an insecure random u16, use get_random_u16() - If you want a secure or an insecure random u8, use get_random_u8() - If you want secure or insecure random bytes, use get_random_bytes(). The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes() - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max() I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_*() namespace, not the get_random_*() namespace. I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that. By applying these rules uniformly, we get several benefits: - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput. - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead. - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput. This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that. So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small" * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: remove unused functions treewide: use get_random_bytes() when possible treewide: use get_random_u32() when possible treewide: use get_random_{u8,u16}() when possible, part 2 treewide: use get_random_{u8,u16}() when possible, part 1 treewide: use prandom_u32_max() when possible, part 2 treewide: use prandom_u32_max() when possible, part 1
2022-10-16skmsg: pass gfp argument to alloc_sk_msg()Eric Dumazet1-4/+4
syzbot found that alloc_sk_msg() could be called from a non sleepable context. sk_psock_verdict_recv() uses rcu_read_lock() protection. We need the callers to pass a gfp_t argument to avoid issues. syzbot report was: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 3613, name: syz-executor414 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 INFO: lockdep is turned off. CPU: 0 PID: 3613 Comm: syz-executor414 Not tainted 6.0.0-syzkaller-09589-g55be6084c8e0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e3/0x2cb lib/dump_stack.c:106 __might_resched+0x538/0x6a0 kernel/sched/core.c:9877 might_alloc include/linux/sched/mm.h:274 [inline] slab_pre_alloc_hook mm/slab.h:700 [inline] slab_alloc_node mm/slub.c:3162 [inline] slab_alloc mm/slub.c:3256 [inline] kmem_cache_alloc_trace+0x59/0x310 mm/slub.c:3287 kmalloc include/linux/slab.h:600 [inline] kzalloc include/linux/slab.h:733 [inline] alloc_sk_msg net/core/skmsg.c:507 [inline] sk_psock_skb_ingress_self+0x5c/0x330 net/core/skmsg.c:600 sk_psock_verdict_apply+0x395/0x440 net/core/skmsg.c:1014 sk_psock_verdict_recv+0x34d/0x560 net/core/skmsg.c:1201 tcp_read_skb+0x4a1/0x790 net/ipv4/tcp.c:1770 tcp_rcv_established+0x129d/0x1a10 net/ipv4/tcp_input.c:5971 tcp_v4_do_rcv+0x479/0xac0 net/ipv4/tcp_ipv4.c:1681 sk_backlog_rcv include/net/sock.h:1109 [inline] __release_sock+0x1d8/0x4c0 net/core/sock.c:2906 release_sock+0x5d/0x1c0 net/core/sock.c:3462 tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1483 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] __sys_sendto+0x46d/0x5f0 net/socket.c:2117 __do_sys_sendto net/socket.c:2129 [inline] __se_sys_sendto net/socket.c:2125 [inline] __x64_sys_sendto+0xda/0xf0 net/socket.c:2125 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 43312915b5ba ("skmsg: Get rid of unncessary memset()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <cong.wang@bytedance.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-13ipv6: Fix data races around sk->sk_prot.Kuniyuki Iwashima1-2/+4
Commit 086d49058cd8 ("ipv6: annotate some data-races around sk->sk_prot") fixed some data-races around sk->sk_prot but it was not enough. Some functions in inet6_(stream|dgram)_ops still access sk->sk_prot without lock_sock() or rtnl_lock(), so they need READ_ONCE() to avoid load tearing. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-12xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not availableEyal Birger1-1/+3
Ido reported that a kernel warning [1] can be triggered from user space when the kernel is compiled with CONFIG_MODULES=y and CONFIG_XFRM=n when adding an xfrm encap type route, e.g: $ ip route add 198.51.100.0/24 dev dummy1 encap xfrm if_id 1 Error: lwt encapsulation type not supported. The reason for the warning is that the LWT infrastructure has an autoloading feature which is meant only for encap types that don't use a net device, which is not the case in xfrm encap. Mute this warning for xfrm encap as there's no encap module to autoload in this case. [1] WARNING: CPU: 3 PID: 2746262 at net/core/lwtunnel.c:57 lwtunnel_valid_encap_type+0x4f/0x120 [...] Call Trace: <TASK> rtm_to_fib_config+0x211/0x350 inet_rtm_newroute+0x3a/0xa0 rtnetlink_rcv_msg+0x154/0x3c0 netlink_rcv_skb+0x49/0xf0 netlink_unicast+0x22f/0x350 netlink_sendmsg+0x208/0x440 ____sys_sendmsg+0x21f/0x250 ___sys_sendmsg+0x83/0xd0 __sys_sendmsg+0x54/0xa0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported-by: Ido Schimmel <idosch@idosch.org> Fixes: 2c2493b9da91 ("xfrm: lwtunnel: add lwtunnel support for xfrm interfaces in collect_md mode") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-10-12treewide: use get_random_u32() when possibleJason A. Donenfeld1-2/+2
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-12treewide: use prandom_u32_max() when possible, part 1Jason A. Donenfeld3-24/+23
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-03Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski3-21/+118
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-03 We've added 143 non-merge commits during the last 27 day(s) which contain a total of 151 files changed, 8321 insertions(+), 1402 deletions(-). The main changes are: 1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu. 2) Add support for struct-based arguments for trampoline based BPF programs, from Yonghong Song. 3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa. 4) Batch of improvements to veristat selftest tool in particular to add CSV output, a comparison mode for CSV outputs and filtering, from Andrii Nakryiko. 5) Add preparatory changes needed for the BPF core for upcoming BPF HID support, from Benjamin Tissoires. 6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program types, from Daniel Xu. 7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler. 8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer / single-kernel-consumer semantics for BPF ring buffer, from David Vernet. 9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF hashtab's bucket lock, from Hou Tao. 10) Allow creating an iterator that loops through only the resources of one task/thread instead of all, from Kui-Feng Lee. 11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi. 12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT entry which is not yet inserted, from Lorenzo Bianconi. 13) Remove invalid recursion check for struct_ops for TCP congestion control BPF programs, from Martin KaFai Lau. 14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu. 15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa. 16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen. 17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu. 18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta. 19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits) net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Documentation: bpf: Add implementation notes documentations to table of contents bpf, docs: Delete misformatted table. selftests/xsk: Fix double free bpftool: Fix error message of strerror libbpf: Fix overrun in netlink attribute iteration selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged" samples/bpf: Fix typo in xdp_router_ipv4 sample bpftool: Remove unused struct event_ring_info bpftool: Remove unused struct btf_attach_point bpf, docs: Add TOC and fix formatting. bpf, docs: Add Clang note about BPF_ALU bpf, docs: Move Clang notes to a separate file bpf, docs: Linux byteswap note bpf, docs: Move legacy packet instructions to a separate file selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline ... ==================== Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03gro: add support of (hw)gro packets to gro stackCoco Li1-4/+14
Current GRO stack only supports incoming packets containing one frame/MSS. This patch changes GRO to accept packets that are already GRO. HW-GRO (aka RSC for some vendors) is very often limited in presence of interleaved packets. Linux SW GRO stack can complete the job and provide larger GRO packets, thus reducing rate of ACK packets and cpu overhead. This also means BIG TCP can still be used, even if HW-GRO/RSC was able to cook ~64 KB GRO packets. v2: fix logic in tcp_gro_receive() Only support TCP for the moment (Paolo) Co-Developed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-03Merge branch 'master' of ↵David S. Miller1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Refactor selftests to use an array of structs in xfrm_fill_key(). From Gautam Menghani. 2) Drop an unused argument from xfrm_policy_match. From Hongbin Wang. 3) Support collect metadata mode for xfrm interfaces. From Eyal Birger. 4) Add netlink extack support to xfrm. From Sabrina Dubroca. Please note, there is a merge conflict in: include/net/dst_metadata.h between commit: 0a28bfd4971f ("net/macsec: Add MACsec skb_metadata_dst Tx Data path support") from the net-next tree and commit: 5182a5d48c3d ("net: allow storing xfrm interface metadata in metadata_dst") from the ipsec-next tree. Can be solved as done in linux-next. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-01net: devlink: add port_init/fini() helpers to allow ↵Jiri Pirko1-3/+43
pre-register/post-unregister functions Lifetime of some of the devlink objects, like regions, is currently forced to be different for devlink instance and devlink port instance (per-port regions). The reason is that for devlink ports, the internal structures initialization happens only after devlink_port_register() is called. To resolve this inconsistency, introduce new set of helpers to allow driver to initialize devlink pointer and region list before devlink_register() is called. That allows port regions to be created before devlink port registration and destroyed after devlink port unregistration. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net: devlink: introduce a flag to indicate devlink port being registeredJiri Pirko1-2/+4
Instead of relying on devlink pointer not being initialized, introduce an extra flag to indicate if devlink port is registered. This is needed as later on devlink pointer is going to be initialized even in case devlink port is not registered yet. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net: devlink: introduce port registered assert helper and use itJiri Pirko1-13/+19
Instead of checking devlink_port->devlink pointer for not being NULL which indicates that devlink port is registered, put this check to new pair of helpers similar to what we have for devlink and use them in other functions. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30net-sysfs: Convert to use sysfs_emit() APIsWang Yufen1-29/+29
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-30net: skb: introduce and use a single page frag cachePaolo Abeni2-22/+103
After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") we are observing 10-20% regressions in performance tests with small packets. The perf trace points to high pressure on the slab allocator. This change tries to improve the allocation schema for small packets using an idea originally suggested by Eric: a new per CPU page frag is introduced and used in __napi_alloc_skb to cope with small allocation requests. To ensure that the above does not lead to excessive truesize underestimation, the frag size for small allocation is inflated to 1K and all the above is restricted to build with 4K page size. Note that we need to update accordingly the run-time check introduced with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper"). Alex suggested a smart page refcount schema to reduce the number of atomic operations and deal properly with pfmemalloc pages. Under small packet UDP flood, I measure a 15% peak tput increases. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-7/+0
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itselfMartin KaFai Lau1-1/+27
When a bad bpf prog '.init' calls bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop: .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ... ... => .init => bpf_setsockopt(tcp_cc). It was prevented by the prog->active counter before but the prog->active detection cannot be used in struct_ops as explained in the earlier patch of the set. In this patch, the second bpf_setsockopt(tcp_cc) is not allowed in order to break the loop. This is done by using a bit of an existing 1 byte hole in tcp_sock to check if there is on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock. Note that this essentially limits only the first '.init' can call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer does not support ECN) and the second '.init' cannot fallback to another cc. This applies even the second bpf_setsockopt(TCP_CONGESTION) will not cause a loop. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220929070407.965581-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-29bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another functionMartin KaFai Lau1-17/+28
This patch moves the bpf_setsockopt(TCP_CONGESTION) logic into another function. The next patch will add extra logic to avoid recursion and this will make the latter patch easier to follow. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220929070407.965581-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-29bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt()Martin KaFai Lau1-6/+7
The check on the tcp-cc, "cdg", is done in the bpf_sk_setsockopt which is used by the bpf_tcp_ca, bpf_lsm, cg_sockopt, and tcp_iter hooks. However, it is not done for cg sock_ddr, cg sockops, and some of the bpf_lsm_cgroup hooks. The tcp-cc "cdg" should have very limited usage. This patch is to move the "cdg" check to the common sol_tcp_sockopt() so that all hooks have a consistent behavior. The motivation to make this check consistent now is because the latter patch will refactor the bpf_setsockopt(TCP_CONGESTION) into another function, so it is better to take this chance to refactor this piece also. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220929070407.965581-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-29net: drop the weight argument from netif_napi_addJakub Kicinski1-2/+1
We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight argument, those who really need to tweak the weight can use netif_napi_add_weight(). Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-29net: shrink struct ubuf_infoPavel Begunkov1-17/+21
We can benefit from a smaller struct ubuf_info, so leave only mandatory fields and let users to decide how they want to extend it. Convert MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields. This reduces the size from 48 bytes to just 16. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28Revert "net: set proper memcg for net_init hooks allocations"Shakeel Butt1-7/+0
This reverts commit 1d0403d20f6c281cb3d14c5f1db5317caeec48e9. Anatoly Pugachev reported that the commit 1d0403d20f6c ("net: set proper memcg for net_init hooks allocations") is somehow causing the sparc64 VMs failed to boot and the VMs boot fine with that patch reverted. So, revert the patch for now and later we can debug the issue. Link: https://lore.kernel.org/all/20220918092849.GA10314@u164.east.ru/ Reported-by: Anatoly Pugachev <matorola@gmail.com> Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Vasily Averin <vvs@openvz.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: cgroups@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Tested-by: Anatoly Pugachev <matorola@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Fixes: 1d0403d20f6c ("net: set proper memcg for net_init hooks allocations") Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-09-26xdp: improve page_pool xdp_return performanceJesper Dangaard Brouer1-6/+4
During LPC2022 I meetup with my page_pool co-maintainer Ilias. When discussing page_pool code we realised/remembered certain optimizations had not been fully utilised. Since commit c07aea3ef4d4 ("mm: add a signature in struct page") struct page have a direct pointer to the page_pool object this page was allocated from. Thus, with this info it is possible to skip the rhashtable_lookup to find the page_pool object in __xdp_return(). The rcu_read_lock can be removed as it was tied to xdp_mem_allocator. The page_pool object is still safe to access as it tracks inflight pages and (potentially) schedules final release from a work queue. Created a micro benchmark of XDP redirecting from mlx5 into veth with XDP_DROP bpf-prog on the peer veth device. This increased performance 6.5% from approx 8.45Mpps to 9Mpps corresponding to using 7 nanosec (27 cycles at 3.8GHz) less per packet. Suggested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/166377993287.1737053.10258297257583703949.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-26skmsg: Schedule psock work if the cached skb exists on the psockLiu Jian1-4/+8
In sk_psock_backlog function, for ingress direction skb, if no new data packet arrives after the skb is cached, the cached skb does not have a chance to be added to the receive queue of psock. As a result, the cached skb cannot be received by the upper-layer application. Fix this by reschedule the psock work to dispose the cached skb in sk_msg_recvmsg function. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220907071311.60534-1-liujian56@huawei.com
2022-09-26net: If sock is dead don't access sock's sk_wq in sk_stream_wait_memoryLiu Jian1-1/+2
Fixes the below NULL pointer dereference: [...] [ 14.471200] Call Trace: [ 14.471562] <TASK> [ 14.471882] lock_acquire+0x245/0x2e0 [ 14.472416] ? remove_wait_queue+0x12/0x50 [ 14.473014] ? _raw_spin_lock_irqsave+0x17/0x50 [ 14.473681] _raw_spin_lock_irqsave+0x3d/0x50 [ 14.474318] ? remove_wait_queue+0x12/0x50 [ 14.474907] remove_wait_queue+0x12/0x50 [ 14.475480] sk_stream_wait_memory+0x20d/0x340 [ 14.476127] ? do_wait_intr_irq+0x80/0x80 [ 14.476704] do_tcp_sendpages+0x287/0x600 [ 14.477283] tcp_bpf_push+0xab/0x260 [ 14.477817] tcp_bpf_sendmsg_redir+0x297/0x500 [ 14.478461] ? __local_bh_enable_ip+0x77/0xe0 [ 14.479096] tcp_bpf_send_verdict+0x105/0x470 [ 14.479729] tcp_bpf_sendmsg+0x318/0x4f0 [ 14.480311] sock_sendmsg+0x2d/0x40 [ 14.480822] ____sys_sendmsg+0x1b4/0x1c0 [ 14.481390] ? copy_msghdr_from_user+0x62/0x80 [ 14.482048] ___sys_sendmsg+0x78/0xb0 [ 14.482580] ? vmf_insert_pfn_prot+0x91/0x150 [ 14.483215] ? __do_fault+0x2a/0x1a0 [ 14.483738] ? do_fault+0x15e/0x5d0 [ 14.484246] ? __handle_mm_fault+0x56b/0x1040 [ 14.484874] ? lock_is_held_type+0xdf/0x130 [ 14.485474] ? find_held_lock+0x2d/0x90 [ 14.486046] ? __sys_sendmsg+0x41/0x70 [ 14.486587] __sys_sendmsg+0x41/0x70 [ 14.487105] ? intel_pmu_drain_pebs_core+0x350/0x350 [ 14.487822] do_syscall_64+0x34/0x80 [ 14.488345] entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] The test scenario has the following flow: thread1 thread2 ----------- --------------- tcp_bpf_sendmsg tcp_bpf_send_verdict tcp_bpf_sendmsg_redir sock_close tcp_bpf_push_locked __sock_release tcp_bpf_push //inet_release do_tcp_sendpages sock->ops->release sk_stream_wait_memory // tcp_close sk_wait_event sk->sk_prot->close release_sock(__sk); *** lock_sock(sk); __tcp_close sock_orphan(sk) sk->sk_wq = NULL release_sock **** lock_sock(__sk); remove_wait_queue(sk_sleep(sk), &wait); sk_sleep(sk) //NULL pointer dereference &rcu_dereference_raw(sk->sk_wq)->wait While waiting for memory in thread1, the socket is released with its wait queue because thread2 has closed it. This caused by tcp_bpf_send_verdict didn't increase the f_count of psock->sk_redir->sk_socket->file in thread1. We should check if SOCK_DEAD flag is set on wakeup in sk_stream_wait_memory before accessing the wait queue. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20220823133755.314697-2-liujian56@huawei.com
2022-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-3/+2
drivers/net/ethernet/freescale/fec.h 7b15515fc1ca ("Revert "fec: Restart PPS after link state change"") 40c79ce13b03 ("net: fec: add stop mode support for imx8 platform") https://lore.kernel.org/all/20220921105337.62b41047@canb.auug.org.au/ drivers/pinctrl/pinctrl-ocelot.c c297561bc98a ("pinctrl: ocelot: Fix interrupt controller") 181f604b33cd ("pinctrl: ocelot: add ability to be used in a non-mmio configuration") https://lore.kernel.org/all/20220921110032.7cd28114@canb.auug.org.au/ tools/testing/selftests/drivers/net/bonding/Makefile bbb774d921e2 ("net: Add tests for bonding and team address list management") 152e8ec77640 ("selftests/bonding: add a test for bonding lladdr target") https://lore.kernel.org/all/20220921110437.5b7dbd82@canb.auug.org.au/ drivers/net/can/usb/gs_usb.c 5440428b3da6 ("can: gs_usb: gs_can_open(): fix race dev->can.state condition") 45dfa45f52e6 ("can: gs_usb: add RX and TX hardware timestamp support") https://lore.kernel.org/all/84f45a7d-92b6-4dc5-d7a1-072152fab6ff@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-22flow_dissector: Do not count vlan tags inside tunnel payloadQingqing Yang1-2/+2
We've met the problem that when there is a vlan tag inside GRE encapsulation, the match of num_of_vlans fails. It is caused by the vlan tag inside GRE payload has been counted into num_of_vlans, which is not expected. One example packet is like this: Ethernet II, Src: Broadcom_68:56:07 (00:10:18:68:56:07) Dst: Broadcom_68:56:08 (00:10:18:68:56:08) 802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 100 Internet Protocol Version 4, Src: 192.168.1.4, Dst: 192.168.1.200 Generic Routing Encapsulation (Transparent Ethernet bridging) Ethernet II, Src: Broadcom_68:58:07 (00:10:18:68:58:07) Dst: Broadcom_68:58:08 (00:10:18:68:58:08) 802.1Q Virtual LAN, PRI: 0, DEI: 0, ID: 200 ... It should match the (num_of_vlans 1) rule, but it matches the (num_of_vlans 2) rule. The vlan tags inside the GRE or other tunnel encapsulated payload should not be taken into num_of_vlans. The fix is to stop counting the vlan number when the encapsulation bit is set. Fixes: 34951fcf26c5 ("flow_dissector: Add number of vlan tags dissector") Signed-off-by: Qingqing Yang <qingqing.yang@broadcom.com> Reviewed-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Link: https://lore.kernel.org/r/20220919074808.136640-1-qingqing.yang@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-21bpf: Rename nfct_bsa to nfct_btf_struct_accessDaniel Xu1-9/+9
The former name was a little hard to guess. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/73adc72385c8b162391fbfb404f0b6d4c5cc55d7.1663683114.git.dxu@dxuuu.xyz Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-20tcp: Access &tcp_hashinfo via net.Kuniyuki Iwashima1-2/+3
We will soon introduce an optional per-netns ehash. This means we cannot use tcp_hashinfo directly in most places. Instead, access it via net->ipv4.tcp_death_row.hashinfo. The access will be valid only while initialising tcp_hashinfo itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-20net: rtnetlink: Enslave device before bringing it upPhil Sutter1-7/+7
Unlike with bridges, one can't add an interface to a bond and set it up at the same time: | # ip link set dummy0 down | # ip link set dummy0 master bond0 up | Error: Device can not be enslaved while up. Of all drivers with ndo_add_slave callback, bond and team decline if IFF_UP flag is set, vrf cycles the interface (i.e., sets it down and immediately up again) and the others just don't care. Support the common notion of setting the interface up after enslaving it by sorting the operations accordingly. Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220914150623.24152-1-phil@nwl.cc Signed-off-by: Jakub Kicinski <kuba@kernel.org>