summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-07-08ath11k: Fix typo in commentsZhang Jiaming1-1/+1
There is a typo(isn't') in comments. It maybe 'isn't' instead of 'isn't''. Signed-off-by: Zhang Jiaming <jiaming@nfschina.com> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com> Link: https://lore.kernel.org/r/20220704030004.16484-1-jiaming@nfschina.com
2022-07-08ovl: turn of SB_POSIXACL with idmapped layers temporarilyChristian Brauner2-1/+28
This cycle we added support for mounting overlayfs on top of idmapped mounts. Recently I've started looking into potential corner cases when trying to add additional tests and I noticed that reporting for POSIX ACLs is currently wrong when using idmapped layers with overlayfs mounted on top of it. I have sent out an patch that fixes this and makes POSIX ACLs work correctly but the patch is a bit bigger and we're already at -rc5 so I recommend we simply don't raise SB_POSIXACL when idmapped layers are used. Then we can fix the VFS part described below for the next merge window so we can have good exposure in -next. I'm going to give a rather detailed explanation to both the origin of the problem and mention the solution so people know what's going on. Let's assume the user creates the following directory layout and they have a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would expect files on your host system to be owned. For example, ~/.bashrc for your regular user would be owned by 1000:1000 and /root/.bashrc would be owned by 0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs filesystem. The user chooses to set POSIX ACLs using the setfacl binary granting the user with uid 4 read, write, and execute permissions for their .bashrc file: setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc Now they to expose the whole rootfs to a container using an idmapped mount. So they first create: mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap} mkdir -pv /vol/contpool/ctrover/{over,work} chown 10000000:10000000 /vol/contpool/ctrover/{over,work} The user now creates an idmapped mount for the rootfs: mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \ /var/lib/lxc/c2/rootfs \ /vol/contpool/lowermap This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at /vol/contpool/lowermap/home/ubuntu/.bashrc. Assume the user wants to expose these idmapped mounts through an overlayfs mount to a container. mount -t overlay overlay \ -o lowerdir=/vol/contpool/lowermap, \ upperdir=/vol/contpool/overmap/over, \ workdir=/vol/contpool/overmap/work \ /vol/contpool/merge The user can do this in two ways: (1) Mount overlayfs in the initial user namespace and expose it to the container. (2) Mount overlayfs on top of the idmapped mounts inside of the container's user namespace. Let's assume the user chooses the (1) option and mounts overlayfs on the host and then changes into a container which uses the idmapping 0:10000000:65536 which is the same used for the two idmapped mounts. Now the user tries to retrieve the POSIX ACLs using the getfacl command getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc and to their surprise they see: # file: vol/contpool/merge/home/ubuntu/.bashrc # owner: 1000 # group: 1000 user::rw- user:4294967295:rwx group::r-- mask::rwx other::r-- indicating the uid wasn't correctly translated according to the idmapped mount. The problem is how we currently translate POSIX ACLs. Let's inspect the callchain in this example: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get == ovl_posix_acl_xattr_get() | -> ovl_xattr_get() | -> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get() /* lower filesystem callback */ |> posix_acl_fix_xattr_to_user() { 4 = make_kuid(&init_user_ns, 4); 4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4); /* FAILURE */ -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4); } If the user chooses to use option (2) and mounts overlayfs on top of idmapped mounts inside the container things don't look that much better: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get == ovl_posix_acl_xattr_get() | -> ovl_xattr_get() | -> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get() /* lower filesystem callback */ |> posix_acl_fix_xattr_to_user() { 4 = make_kuid(&init_user_ns, 4); 4 = mapped_kuid_fs(&init_user_ns, 4); /* FAILURE */ -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4); } As is easily seen the problem arises because the idmapping of the lower mount isn't taken into account as all of this happens in do_gexattr(). But do_getxattr() is always called on an overlayfs mount and inode and thus cannot possible take the idmapping of the lower layers into account. This problem is similar for fscaps but there the translation happens as part of vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain: setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc The expected outcome here is that we'll receive the cap_net_raw capability as we are able to map the uid associated with the fscap to 0 within our container. IOW, we want to see 0 as the result of the idmapping translations. If the user chooses option (1) we get the following callchain for fscaps: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() -> vfs_getxattr() -> xattr_getsecurity() -> security_inode_getsecurity() ________________________________ -> cap_inode_getsecurity() | | { V | 10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); | 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); | /* Expected result is 0 and thus that we own the fscap. */ | 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); | } | -> vfs_getxattr_alloc() | -> handler->get == ovl_other_xattr_get() | -> vfs_getxattr() | -> xattr_getsecurity() | -> security_inode_getsecurity() | -> cap_inode_getsecurity() | { | 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); | 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); | 10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); | |____________________________________________________________________| } -> vfs_getxattr_alloc() -> handler->get == /* lower filesystem callback */ And if the user chooses option (2) we get: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() -> vfs_getxattr() -> xattr_getsecurity() -> security_inode_getsecurity() _______________________________ -> cap_inode_getsecurity() | | { V | 10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); | 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); | /* Expected result is 0 and thus that we own the fscap. */ | 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); | } | -> vfs_getxattr_alloc() | -> handler->get == ovl_other_xattr_get() | |-> vfs_getxattr() | -> xattr_getsecurity() | -> security_inode_getsecurity() | -> cap_inode_getsecurity() | { | 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); | 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); | 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); | |____________________________________________________________________| } -> vfs_getxattr_alloc() -> handler->get == /* lower filesystem callback */ We can see how the translation happens correctly in those cases as the conversion happens within the vfs_getxattr() helper. For POSIX ACLs we need to do something similar. However, in contrast to fscaps we cannot apply the fix directly to the kernel internal posix acl data structure as this would alter the cached values and would also require a rework of how we currently deal with POSIX ACLs in general which almost never take the filesystem idmapping into account (the noteable exception being FUSE but even there the implementation is special) and instead retrieve the raw values based on the initial idmapping. The correct values are then generated right before returning to userspace. The fix for this is to move taking the mount's idmapping into account directly in vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user(). To this end we simply move the idmapped mount translation into a separate step performed in vfs_{g,s}etxattr() instead of in posix_acl_fix_xattr_{from,to}_user(). To see how this fixes things let's go back to the original example. Assume the user chose option (1) and mounted overlayfs on top of idmapped mounts on the host: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | |> __vfs_getxattr() | | -> handler->get == ovl_posix_acl_xattr_get() | | -> ovl_xattr_get() | | -> vfs_getxattr() | | |> __vfs_getxattr() | | | -> handler->get() /* lower filesystem callback */ | | |> posix_acl_getxattr_idmapped_mnt() | | { | | 4 = make_kuid(&init_user_ns, 4); | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4); | | 10000004 = from_kuid(&init_user_ns, 10000004); | | |_______________________ | | } | | | | | |> posix_acl_getxattr_idmapped_mnt() | | { | | V | 10000004 = make_kuid(&init_user_ns, 10000004); | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004); | 10000004 = from_kuid(&init_user_ns, 10000004); | } |_________________________________________________ | | | | |> posix_acl_fix_xattr_to_user() | { V 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004); /* SUCCESS */ 4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004); } And similarly if the user chooses option (1) and mounted overayfs on top of idmapped mounts inside the container: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | |> __vfs_getxattr() | | -> handler->get == ovl_posix_acl_xattr_get() | | -> ovl_xattr_get() | | -> vfs_getxattr() | | |> __vfs_getxattr() | | | -> handler->get() /* lower filesystem callback */ | | |> posix_acl_getxattr_idmapped_mnt() | | { | | 4 = make_kuid(&init_user_ns, 4); | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4); | | 10000004 = from_kuid(&init_user_ns, 10000004); | | |_______________________ | | } | | | | | |> posix_acl_getxattr_idmapped_mnt() | | { V | 10000004 = make_kuid(&init_user_ns, 10000004); | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004); | 10000004 = from_kuid(0(&init_user_ns, 10000004); | |_________________________________________________ | } | | | |> posix_acl_fix_xattr_to_user() | { V 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004); /* SUCCESS */ 4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004); } The last remaining problem we need to fix here is ovl_get_acl(). During ovl_permission() overlayfs will call: ovl_permission() -> generic_permission() -> acl_permission_check() -> check_acl() -> get_acl() -> inode->i_op->get_acl() == ovl_get_acl() > get_acl() /* on the underlying filesystem) ->inode->i_op->get_acl() == /*lower filesystem callback */ -> posix_acl_permission() passing through the get_acl request to the underlying filesystem. This will retrieve the acls stored in the lower filesystem without taking the idmapping of the underlying mount into account as this would mean altering the cached values for the lower filesystem. The simple solution is to have ovl_get_acl() simply duplicate the ACLs, update the values according to the idmapped mount and return it to acl_permission_check() so it can be used in posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be released right after. Link: https://github.com/brauner/mount-idmapped/issues/9 Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: linux-unionfs@vger.kernel.org Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Fixes: bc70682a497c ("ovl: support idmapped layers") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-08net: ag71xx: switch to napi_build_skb() to reuse skbuff_headsSieng-Piaw Liew1-5/+5
napi_build_skb() reuses NAPI skbuff_head cache in order to save some cycles on freeing/allocating skbuff_heads on every new Rx or completed Tx. Use napi_consume_skb() to feed the cache with skbuff_heads of completed Tx, so it's never empty. The budget parameter is added to indicate NAPI context, as a value of zero can be passed in the case of netpoll. Signed-off-by: Sieng-Piaw Liew <liew.s.piaw@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08net: minor optimization in __alloc_skb()Eric Dumazet1-2/+1
TCP allocates 'fast clones' skbs for packets in tx queues. Currently, __alloc_skb() initializes the companion fclone field to SKB_FCLONE_CLONE, and leaves other fields untouched. It makes sense to defer this init much later in skb_clone(), because all fclone fields are copied and hot in cpu caches at that time. This removes one cache line miss in __alloc_skb(), cost seen on an host with 256 cpus all competing on memory accesses. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08octeontx2-af: Don't reset previous pfc configHariprasad Kelam3-13/+23
Current implementation is such that driver first resets the existing PFC config before applying new pfc configuration. This creates a problem like once PF or VFs requests PFC config previous pfc config by other PFVfs is getting reset. This patch fixes the problem by removing unnecessary resetting of PFC config. Also configure Pause quanta value to smaller as current value is too high. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08selftests/bpf: Add test involving restrict type qualifierDaniel Müller3-2/+13
This change adds a type based test involving the restrict type qualifier to the BPF selftests. On the btfgen path, this will verify that bpftool correctly handles the corresponding RESTRICT BTF kind. Signed-off-by: Daniel Müller <deso@posteo.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20220706212855.1700615-3-deso@posteo.net
2022-07-08bpftool: Add support for KIND_RESTRICT to gen min_core_btf commandDaniel Müller1-0/+1
This change adjusts bpftool's type marking logic, as used in conjunction with TYPE_EXISTS relocations, to correctly recognize and handle the RESTRICT BTF kind. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Müller <deso@posteo.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20220623212205.2805002-1-deso@posteo.net/T/#m4c75205145701762a4b398e0cdb911d5b5305ffc Link: https://lore.kernel.org/bpf/20220706212855.1700615-2-deso@posteo.net
2022-07-08MAINTAINERS: Add entry for AF_XDP selftests filesMaciej Fijalkowski1-0/+1
Lukas reported that after commit f36600634282 ("libbpf: move xsk.{c,h} into selftests/bpf") MAINTAINERS file needed an update. In the meantime, Magnus removed AF_XDP samples in commit cfb5a2dbf141 ("bpf, samples: Remove AF_XDP samples"), but selftests part still misses its entry in MAINTAINERS. Now that xdpxceiver became xskxceiver, tools/testing/selftests/bpf/*xsk* will match all of the files related to AF_XDP testing (test_xsk.sh, xskxceiver, xsk_prereqs.sh, xsk.{c,h}). Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220707111613.49031-3-maciej.fijalkowski@intel.com
2022-07-08selftests, xsk: Rename AF_XDP testing appMaciej Fijalkowski6-13/+13
Recently, xsk part of libbpf was moved to selftests/bpf directory and lives on its own because there is an AF_XDP testing application that needs it called xdpxceiver. That name makes it a bit hard to indicate who maintains it as there are other XDP samples in there, whereas this one is strictly about AF_XDP. Do s/xdpxceiver/xskxceiver so that it will be easier to figure out who maintains it. A follow-up patch will correct MAINTAINERS file. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220707111613.49031-2-maciej.fijalkowski@intel.com
2022-07-08bpf, docs: Remove deprecated xsk libbpf APIs descriptionPu Lehui1-11/+2
Since xsk APIs has been removed from libbpf, let's clean up the BPF docs simutaneously. Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20220708042736.669132-1-pulehui@huawei.com
2022-07-08l2tp: l2tp_debugfs: fix Clang -Wformat warningsJustin Stitt1-3/+3
When building with Clang we encounter the following warnings: | net/l2tp/l2tp_debugfs.c:187:40: error: format specifies type 'unsigned | short' but the argument has type 'u32' (aka 'unsigned int') | [-Werror,-Wformat] seq_printf(m, " nr %hu, ns %hu\n", session->nr, | session->ns); - | net/l2tp/l2tp_debugfs.c:196:32: error: format specifies type 'unsigned | short' but the argument has type 'int' [-Werror,-Wformat] | session->l2specific_type, l2tp_get_l2specific_len(session)); - | net/l2tp/l2tp_debugfs.c:219:6: error: format specifies type 'unsigned | short' but the argument has type 'u32' (aka 'unsigned int') | [-Werror,-Wformat] session->nr, session->ns, Both session->nr and ->nc are of type `u32`. The currently used format specifier is `%hu` which describes a `u16`. My proposed fix is to listen to Clang and use the correct format specifier `%u`. For the warning at line 196, l2tp_get_l2specific_len() returns an int and should therefore be using the `%d` format specifier. Link: https://github.com/ClangBuiltLinux/linux/issues/378 Signed-off-by: Justin Stitt <justinstitt@google.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08eth: sp7021: switch to netif_napi_add_tx()Jakub Kicinski1-1/+1
The Tx NAPI should use netif_napi_add_tx(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Wells Lu <wellslutw@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08eth: mtk: switch to netif_napi_add_tx()Jakub Kicinski1-2/+1
netif_napi_add_tx() does not require the weight argument. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08Merge branch 'sysctl-data-races'David S. Miller8-26/+37
Kuniyuki Iwashima says: ==================== sysctl: Fix data-races around ipv4_table. A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. The first half of this series changes some proc handlers used in ipv4_table to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. Then, the second half adds READ_ONCE() to the other readers of ipv4_table. Changes: v2: * Drop some changes that makes backporting difficult * First cleanup patch * Lockless helpers and .proc_handler changes * Drop the tracing part for .sysctl_mem * Steve already posted a fix * Drop int-to-bool change for cipso * Should be posted to net-next later * Drop proc_dobool() change * Can be included in another series v1: https://lore.kernel.org/netdev/20220706052130.16368-1-kuniyu@amazon.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08ipv4: Fix a data-race around sysctl_fib_sync_mem.Kuniyuki Iwashima1-1/+1
While reading sysctl_fib_sync_mem, it can be changed concurrently. So, we need to add READ_ONCE() to avoid a data-race. Fixes: 9ab948a91b2c ("ipv4: Allow amount of dirty memory from fib resizing to be controllable") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08icmp: Fix data-races around sysctl.Kuniyuki Iwashima1-2/+3
While reading icmp sysctl variables, they can be changed concurrently. So, we need to add READ_ONCE() to avoid data-races. Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08cipso: Fix data-races around sysctl.Kuniyuki Iwashima2-6/+8
While reading cipso sysctl variables, they can be changed concurrently. So, we need to add READ_ONCE() to avoid data-races. Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08net: Fix data-races around sysctl_mem.Kuniyuki Iwashima1-1/+1
While reading .sysctl_mem, it can be changed concurrently. So, we need to add READ_ONCE() to avoid data-races. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08inetpeer: Fix data-races around sysctl.Kuniyuki Iwashima1-4/+8
While reading inetpeer sysctl variables, they can be changed concurrently. So, we need to add READ_ONCE() to avoid data-races. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08tcp: Fix a data-race around sysctl_tcp_max_orphans.Kuniyuki Iwashima1-1/+2
While reading sysctl_tcp_max_orphans, it can be changed concurrently. So, we need to add READ_ONCE() to avoid a data-race. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08sysctl: Fix data races in proc_dointvec_jiffies().Kuniyuki Iwashima1-2/+5
A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. This patch changes proc_dointvec_jiffies() to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. For now, proc_dointvec_jiffies() itself is tolerant to a data-race, but we still need to add annotations on the other subsystem's side. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08sysctl: Fix data races in proc_doulongvec_minmax().Kuniyuki Iwashima1-2/+2
A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. This patch changes proc_doulongvec_minmax() to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. For now, proc_doulongvec_minmax() itself is tolerant to a data-race, but we still need to add annotations on the other subsystem's side. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08sysctl: Fix data races in proc_douintvec_minmax().Kuniyuki Iwashima1-1/+1
A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. This patch changes proc_douintvec_minmax() to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. For now, proc_douintvec_minmax() itself is tolerant to a data-race, but we still need to add annotations on the other subsystem's side. Fixes: 61d9b56a8920 ("sysctl: add unsigned int range support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08sysctl: Fix data races in proc_dointvec_minmax().Kuniyuki Iwashima1-1/+1
A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. This patch changes proc_dointvec_minmax() to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. For now, proc_dointvec_minmax() itself is tolerant to a data-race, but we still need to add annotations on the other subsystem's side. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08sysctl: Fix data races in proc_douintvec().Kuniyuki Iwashima1-2/+2
A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. This patch changes proc_douintvec() to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. For now, proc_douintvec() itself is tolerant to a data-race, but we still need to add annotations on the other subsystem's side. Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08sysctl: Fix data races in proc_dointvec().Kuniyuki Iwashima1-3/+3
A sysctl variable is accessed concurrently, and there is always a chance of data-race. So, all readers and writers need some basic protection to avoid load/store-tearing. This patch changes proc_dointvec() to use READ_ONCE() and WRITE_ONCE() internally to fix data-races on the sysctl side. For now, proc_dointvec() itself is tolerant to a data-race, but we still need to add annotations on the other subsystem's side. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08net: sock: tracing: Fix sock_exceed_buf_limit not to dereference stale pointerSteven Rostedt (Google)1-2/+4
The trace event sock_exceed_buf_limit saves the prot->sysctl_mem pointer and then dereferences it in the TP_printk() portion. This is unsafe as the TP_printk() portion is executed at the time the buffer is read. That is, it can be seconds, minutes, days, months, even years later. If the proto is freed, then this dereference will can also lead to a kernel crash. Instead, save the sysctl_mem array into the ring buffer and have the TP_printk() reference that instead. This is the proper and safe way to read pointers in trace events. Link: https://lore.kernel.org/all/20220706052130.16368-12-kuniyu@amazon.com/ Cc: stable@vger.kernel.org Fixes: 3847ce32aea9f ("core: add tracepoints for queueing skb to rcvbuf") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-08x86/bugs: Do not enable IBPB-on-entry when IBPB is not supportedThadeu Lima de Souza Cascardo1-2/+5
There are some VM configurations which have Skylake model but do not support IBPB. In those cases, when using retbleed=ibpb, userspace is going to be killed and kernel is going to panic. If the CPU does not support IBPB, warn and proceed with the auto option. Also, do not fallback to IBPB on AMD/Hygon systems if it is not supported. Fixes: 3ebc17006888 ("x86/bugs: Add retbleed=ibpb") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-07-08bpf: Add flags arg to bpf_dynptr_read and bpf_dynptr_write APIsJoanne Koong5-19/+29
Commit 13bbbfbea759 ("bpf: Add bpf_dynptr_read and bpf_dynptr_write") added the bpf_dynptr_write() and bpf_dynptr_read() APIs. However, it will be needed for some dynptr types to pass in flags as well (e.g. when writing to a skb, the user may like to invalidate the hash or recompute the checksum). This patch adds a "u64 flags" arg to the bpf_dynptr_read() and bpf_dynptr_write() APIs before their UAPI signature freezes where we then cannot change them anymore with a 5.19.x released kernel. Fixes: 13bbbfbea759 ("bpf: Add bpf_dynptr_read and bpf_dynptr_write") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20220706232547.4016651-1-joannelkoong@gmail.com
2022-07-08MAINTAINERS: Remove iommu@lists.linux-foundation.orgJoerg Roedel1-11/+0
The IOMMU mailing list has moved to iommu@lists.linux.dev and the old list should bounce by now. Remove it from the MAINTAINERS file. Cc: stable@vger.kernel.org Signed-off-by: Joerg Roedel <jroedel@suse.de> Link: https://lore.kernel.org/r/20220706103331.10215-1-joro@8bytes.org
2022-07-08Merge branch 'polarfire-soc-macb-reset-support'Jakub Kicinski2-51/+56
Conor Dooley says: ==================== PolarFire SoC macb reset support The Cadence MACBs on PolarFire SoC (MPFS) have reset capability and are compatible with the zynqmp's init function. I have removed the zynqmp specific comments from that function & renamed it to reflect what it does, since it is no longer zynqmp only. MPFS's MACB had previously used the generic binding, so I also added the required specific binding. For v2, I noticed some low hanging cleanup fruit so there are extra patches added for that: moving the init function out of the config structs, aligning the alignment of the zynqmp & default config structs with the other dozen or so structs & simplifing the error paths to use dev_err_probe(). Feel free to apply as many or as few of those as you like. ==================== Link: https://lore.kernel.org/r/20220706095129.828253-1-conor.dooley@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: macb: sort init_reset_optional() with other init()sConor Dooley1-34/+34
init_reset_optional() is somewhat oddly placed amidst the macb_config struct definitions. Move it to a more reasonable location alongside the fu540 init functions. Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: macb: simplify error paths in init_reset_optional()Conor Dooley1-13/+7
The error handling paths in init_reset_optional() can all be simplified to return dev_err_probe(). Do so. Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: macb: unify macb_config alignment styleConor Dooley1-8/+8
The various macb_config structs have taken different approaches to alignment when broken over newlines. Pick one style and make them match. Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: macb: add polarfire soc reset supportConor Dooley1-8/+18
To date, the Microchip PolarFire SoC (MPFS) has been using the cdns,macb compatible, however the generic device does not have reset support. Add a new compatible & .data for MPFS to hook into the reset functionality added for zynqmp support (and make the zynqmp init function generic in the process). Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08dt-bindings: net: cdns,macb: document polarfire soc's macbConor Dooley1-0/+1
Until now the PolarFire SoC (MPFS) has been using the generic "cdns,macb" compatible but has optional reset support. Add a specific compatible which falls back to the currently used generic binding. Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: l2tp: fix clang -Wformat warningJustin Stitt1-1/+1
When building with clang we encounter this warning: | net/l2tp/l2tp_ppp.c:1557:6: error: format specifies type 'unsigned | short' but the argument has type 'u32' (aka 'unsigned int') | [-Werror,-Wformat] session->nr, session->ns, Both session->nr and session->ns are of type u32. The format specifier previously used is `%hu` which would truncate our unsigned integer from 32 to 16 bits. This doesn't seem like intended behavior, if it is then perhaps we need to consider suppressing the warning with pragma clauses. This patch should get us closer to the goal of enabling the -Wformat flag for Clang builds. Link: https://github.com/ClangBuiltLinux/linux/issues/378 Signed-off-by: Justin Stitt <justinstitt@google.com> Acked-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/20220706230833.535238-1-justinstitt@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: ocelot: fix wrong time_after usagePavel Skripkin1-9/+8
Accidentally noticed, that this driver is the only user of while (time_after(jiffies...)). It looks like typo, because likely this while loop will finish after 1st iteration, because time_after() returns true when 1st argument _is after_ 2nd one. There is one possible problem with this poll loop: the scheduler could put the thread to sleep, and it does not get woken up for OCELOT_FDMA_CH_SAFE_TIMEOUT_US. During that time, the hardware has done its thing, but you exit the while loop and return -ETIMEDOUT. Fix it by using sane poll API that avoids all problems described above Fixes: 753a026cfec1 ("net: ocelot: add FDMA support") Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220706132845.27968-1-paskripkin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08Merge tag 'mlx5-fixes-2022-07-06' of ↵Jakub Kicinski11-38/+76
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2022-07-06 This series provides bug fixes to mlx5 driver. * tag 'mlx5-fixes-2022-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: Ring the TX doorbell on DMA errors net/mlx5e: Fix capability check for updating vnic env counters net/mlx5e: CT: Use own workqueue instead of mlx5e priv net/mlx5: Lag, correct get the port select mode str net/mlx5e: Fix enabling sriov while tc nic rules are offloaded net/mlx5e: kTLS, Fix build time constant test in RX net/mlx5e: kTLS, Fix build time constant test in TX net/mlx5: Lag, decouple FDB selection and shared FDB net/mlx5: TC, allow offload from uplink to other PF's VF ==================== Link: https://lore.kernel.org/r/20220706231309.38579-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: ethernet: ti: am65-cpsw: Fix devlink port register sequenceSiddharth Vadapalli1-7/+10
Renaming interfaces using udevd depends on the interface being registered before its netdev is registered. Otherwise, udevd reads an empty phys_port_name value, resulting in the interface not being renamed. Fix this by registering the interface before registering its netdev by invoking am65_cpsw_nuss_register_devlink() before invoking register_netdev() for the interface. Move the function call to devlink_port_type_eth_set(), invoking it after register_netdev() is invoked, to ensure that netlink notification for the port state change is generated after the netdev is completely initialized. Fixes: 58356eb31d60 ("net: ti: am65-cpsw-nuss: Add devlink support") Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Link: https://lore.kernel.org/r/20220706070208.12207-1-s-vadapalli@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: stmmac: dwc-qos: Disable split header for Tegra194Jon Hunter1-0/+1
There is a long-standing issue with the Synopsys DWC Ethernet driver for Tegra194 where random system crashes have been observed [0]. The problem occurs when the split header feature is enabled in the stmmac driver. In the bad case, a larger than expected buffer length is received and causes the calculation of the total buffer length to overflow. This results in a very large buffer length that causes the kernel to crash. Why this larger buffer length is received is not clear, however, the feedback from the NVIDIA design team is that the split header feature is not supported for Tegra194. Therefore, disable split header support for Tegra194 to prevent these random crashes from occurring. [0] https://lore.kernel.org/linux-tegra/b0b17697-f23e-8fa5-3757-604a86f3a095@nvidia.com/ Fixes: 67afd6d1cfdf ("net: stmmac: Add Split Header support and enable it in XGMAC cores") Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Link: https://lore.kernel.org/r/20220706083913.13750-1-jonathanh@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08net: page_pool: optimize page pool page allocation in NUMA scenarioJie Wang1-1/+2
Currently NIC packet receiving performance based on page pool deteriorates occasionally. To analysis the causes of this problem page allocation stats are collected. Here are the stats when NIC rx performance deteriorates: bandwidth(Gbits/s) 16.8 6.91 rx_pp_alloc_fast 13794308 21141869 rx_pp_alloc_slow 108625 166481 rx_pp_alloc_slow_h 0 0 rx_pp_alloc_empty 8192 8192 rx_pp_alloc_refill 0 0 rx_pp_alloc_waive 100433 158289 rx_pp_recycle_cached 0 0 rx_pp_recycle_cache_full 0 0 rx_pp_recycle_ring 362400 420281 rx_pp_recycle_ring_full 6064893 9709724 rx_pp_recycle_released_ref 0 0 The rx_pp_alloc_waive count indicates that a large number of pages' numa node are inconsistent with the NIC device numa node. Therefore these pages can't be reused by the page pool. As a result, many new pages would be allocated by __page_pool_alloc_pages_slow which is time consuming. This causes the NIC rx performance fluctuations. The main reason of huge numa mismatch pages in page pool is that page pool uses alloc_pages_bulk_array to allocate original pages. This function is not suitable for page allocation in NUMA scenario. So this patch uses alloc_pages_bulk_array_node which has a NUMA id input parameter to ensure the NUMA consistent between NIC device and allocated pages. Repeated NIC rx performance tests are performed 40 times. NIC rx bandwidth is higher and more stable compared to the datas above. Here are three test stats, the rx_pp_alloc_waive count is zero and rx_pp_alloc_slow which indicates pages allocated from slow patch is relatively low. bandwidth(Gbits/s) 93 93.9 93.8 rx_pp_alloc_fast 60066264 61266386 60938254 rx_pp_alloc_slow 16512 16517 16539 rx_pp_alloc_slow_ho 0 0 0 rx_pp_alloc_empty 16512 16517 16539 rx_pp_alloc_refill 473841 481910 481585 rx_pp_alloc_waive 0 0 0 rx_pp_recycle_cached 0 0 0 rx_pp_recycle_cache_full 0 0 0 rx_pp_recycle_ring 29754145 30358243 30194023 rx_pp_recycle_ring_full 0 0 0 rx_pp_recycle_released_ref 0 0 0 Signed-off-by: Jie Wang <wangjie125@huawei.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20220705113515.54342-1-huangguangbin2@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-08Merge tag 'nvme-5.19-2022-07-07' of git://git.infradead.org/nvme into block-5.19Jens Axboe3-2/+5
Pull NVMe fixes from Christoph: "nvme fixes for Linux 5.19 - another bogus identifier quirk (Keith Busch) - use struct group in the tracer to avoid a gcc warning (Keith Busch)" * tag 'nvme-5.19-2022-07-07' of git://git.infradead.org/nvme: nvme: use struct group for generic command dwords nvme-pci: phison e16 has bogus namespace ids
2022-07-08io_uring: explicit sqe padding for ioctl commandsPavel Begunkov2-2/+5
32 bit sqe->cmd_op is an union with 64 bit values. It's always a good idea to do padding explicitly. Also zero check it in prep, so it can be used in the future if needed without compatibility concerns. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e6b95a05e970af79000435166185e85b196b2ba2.1657202417.git.asml.silence@gmail.com [axboe: turn bitwise OR into logical variant] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-08i2c: cadence: Unregister the clk notifier in error pathSatish Nagireddy1-0/+1
This patch ensures that the clock notifier is unregistered when driver probe is returning error. Fixes: df8eb5691c48 ("i2c: Add driver for Cadence I2C controller") Signed-off-by: Satish Nagireddy <satish.nagireddy@getcruise.com> Tested-by: Lars-Peter Clausen <lars@metafoo.de> Reviewed-by: Michal Simek <michal.simek@amd.com> Signed-off-by: Wolfram Sang <wsa@kernel.org>
2022-07-07Merge tag 'devfreq-fixes-for-5.19-rc6' of ↵Rafael J. Wysocki1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux Pull a devfreq fix for 5.19-rc6 from Chanwoo Choi: "- Fix exynos-bus NULL pointer dereference by correctly using the local generated freq_table to output the debug values instead of using the profile freq_table that is not used in the driver." * tag 'devfreq-fixes-for-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux: PM / devfreq: exynos-bus: Fix NULL pointer dereference
2022-07-07PM / devfreq: exynos-bus: Fix NULL pointer dereferenceChristian Marangi1-3/+3
Fix exynos-bus NULL pointer dereference by correctly using the local generated freq_table to output the debug values instead of using the profile freq_table that is not used in the driver. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Fixes: b5d281f6c16d ("PM / devfreq: Rework freq_table to be local to devfreq struct") Cc: stable@vger.kernel.org Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Acked-by: Chanwoo Choi <cw00.choi@samsung.com> Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
2022-07-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski253-1665/+3141
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-07netfilter: conntrack: fix crash due to confirmed bit load reorderingFlorian Westphal3-0/+26
Kajetan Puchalski reports crash on ARM, with backtrace of: __nf_ct_delete_from_lists nf_ct_delete early_drop __nf_conntrack_alloc Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier. conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly' allocated object is still in use on another CPU: CPU1 CPU2 encounter 'ct' during hlist walk delete_from_lists refcount drops to 0 kmem_cache_free(ct); __nf_conntrack_alloc() // returns same object refcount_inc_not_zero(ct); /* might fail */ /* If set, ct is public/in the hash table */ test_bit(IPS_CONFIRMED_BIT, &ct->status); In case CPU1 already set refcount back to 1, refcount_inc_not_zero() will succeed. The expected possibilities for a CPU that obtained the object 'ct' (but no reference so far) are: 1. refcount_inc_not_zero() fails. CPU2 ignores the object and moves to the next entry in the list. This happens for objects that are about to be free'd, that have been free'd, or that have been reallocated by __nf_conntrack_alloc(), but where the refcount has not been increased back to 1 yet. 2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit in ct->status. If set, the object is public/in the table. If not, the object must be skipped; CPU2 calls nf_ct_put() to un-do the refcount increment and moves to the next object. Parallel deletion from the hlists is prevented by a 'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one cpu will do the unlink, the other one will only drop its reference count. Because refcount_inc_not_zero is not a full barrier, CPU2 may try to delete an object that is not on any list: 1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU) 2. CONFIRMED test also successful (load was reordered or zeroing of ct->status not yet visible) 3. delete_from_lists unlinks entry not on the hlist, because IPS_DYING_BIT is 0 (already cleared). 2) is already wrong: CPU2 will handle a partially initited object that is supposed to be private to CPU1. Add needed barriers when refcount_inc_not_zero() is successful. It also inserts a smp_wmb() before the refcount is set to 1 during allocation. Because other CPU might still see the object, refcount_set(1) "resurrects" it, so we need to make sure that other CPUs will also observe the right content. In particular, the CONFIRMED bit test must only pass once the object is fully initialised and either in the hash or about to be inserted (with locks held to delay possible unlink from early_drop or gc worker). I did not change flow_offload_alloc(), as far as I can see it should call refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to the skb so its refcount should be >= 1 in all cases. v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon). v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call add comment in nf_conntrack_netlink, no control dependency there due to locks. Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/all/Yr7WTfd6AVTQkLjI@e126311.manchester.arm.com/ Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com> Diagnosed-by: Will Deacon <will@kernel.org> Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api") Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Will Deacon <will@kernel.org>
2022-07-07bpf: Make sure mac_header was set before using itEric Dumazet1-3/+5
Classic BPF has a way to load bytes starting from the mac header. Some skbs do not have a mac header, and skb_mac_header() in this case is returning a pointer that 65535 bytes after skb->head. Existing range check in bpf_internal_load_pointer_neg_helper() was properly kicking and no illegal access was happening. New sanity check in skb_mac_header() is firing, so we need to avoid it. WARNING: CPU: 1 PID: 28990 at include/linux/skbuff.h:2785 skb_mac_header include/linux/skbuff.h:2785 [inline] WARNING: CPU: 1 PID: 28990 at include/linux/skbuff.h:2785 bpf_internal_load_pointer_neg_helper+0x1b1/0x1c0 kernel/bpf/core.c:74 Modules linked in: CPU: 1 PID: 28990 Comm: syz-executor.0 Not tainted 5.19.0-rc4-syzkaller-00865-g4874fb9484be #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/29/2022 RIP: 0010:skb_mac_header include/linux/skbuff.h:2785 [inline] RIP: 0010:bpf_internal_load_pointer_neg_helper+0x1b1/0x1c0 kernel/bpf/core.c:74 Code: ff ff 45 31 f6 e9 5a ff ff ff e8 aa 27 40 00 e9 3b ff ff ff e8 90 27 40 00 e9 df fe ff ff e8 86 27 40 00 eb 9e e8 2f 2c f3 ff <0f> 0b eb b1 e8 96 27 40 00 e9 79 fe ff ff 90 41 57 41 56 41 55 41 RSP: 0018:ffffc9000309f668 EFLAGS: 00010216 RAX: 0000000000000118 RBX: ffffffffffeff00c RCX: ffffc9000e417000 RDX: 0000000000040000 RSI: ffffffff81873f21 RDI: 0000000000000003 RBP: ffff8880842878c0 R08: 0000000000000003 R09: 000000000000ffff R10: 000000000000ffff R11: 0000000000000001 R12: 0000000000000004 R13: ffff88803ac56c00 R14: 000000000000ffff R15: dffffc0000000000 FS: 00007f5c88a16700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fdaa9f6c058 CR3: 000000003a82c000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ____bpf_skb_load_helper_32 net/core/filter.c:276 [inline] bpf_skb_load_helper_32+0x191/0x220 net/core/filter.c:264 Fixes: f9aefd6b2aa3 ("net: warn if mac header was not set") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220707123900.945305-1-edumazet@google.com