<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/net_namespace.h, branch v5.10.7</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v5.10.7</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v5.10.7'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2020-09-30T18:50:35+00:00</updated>
<entry>
<title>bpf, net: Rework cookie generator as per-cpu one</title>
<updated>2020-09-30T18:50:35+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2020-09-30T15:18:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=92acdc58ab11af66fcaef485433fde61b5e32fac'/>
<id>urn:sha1:92acdc58ab11af66fcaef485433fde61b5e32fac</id>
<content type='text'>
With its use in BPF, the cookie generator can be called very frequently
in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
and attached to the root cgroup, for example, when used in v1/v2 mixed
environments. In particular, when there's a high churn on sockets in the
system there can be many parallel requests to the bpf_get_socket_cookie()
and bpf_get_netns_cookie() helpers which then cause contention on the
atomic counter.

As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino
allocator"), add a small helper library that both can use for the 64 bit
counters. Given this can be called from different contexts, we also need
to deal with potential nested calls even though in practice they are
considered extremely rare. One idea as suggested by Eric Dumazet was
to use a reverse counter for this situation since we don't expect 64 bit
overflows anyways; that way, we can avoid bigger gaps in the 64 bit
counter space compared to just batch-wise increase. Even on machines
with small number of cores (e.g. 4) the cookie generation shrinks from
min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run
in parallel from multiple CPUs.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net
</content>
</entry>
<entry>
<title>net: Introduce netns_bpf for BPF programs attached to netns</title>
<updated>2020-06-01T22:21:02+00:00</updated>
<author>
<name>Jakub Sitnicki</name>
<email>jakub@cloudflare.com</email>
</author>
<published>2020-05-31T08:28:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a3fd7ceee05431d2c51ed86c6cae015d236a51f0'/>
<id>urn:sha1:a3fd7ceee05431d2c51ed86c6cae015d236a51f0</id>
<content type='text'>
In order to:

 (1) attach more than one BPF program type to netns, or
 (2) support attaching BPF programs to netns with bpf_link, or
 (3) support multi-prog attach points for netns

we will need to keep more state per netns than a single pointer like we
have now for BPF flow dissector program.

Prepare for the above by extracting netns_bpf that is part of struct net,
for storing all state related to BPF programs attached to netns.

Turn flow dissector callbacks for querying/attaching/detaching a program
into generic ones that operate on netns_bpf. Next patch will move the
generic callbacks into their own module.

This is similar to how it is organized for cgroup with cgroup_bpf.

Signed-off-by: Jakub Sitnicki &lt;jakub@cloudflare.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Stanislav Fomichev &lt;sdf@google.com&gt;
Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com
</content>
</entry>
<entry>
<title>ipv6: Use global sernum for dst validation with nexthop objects</title>
<updated>2020-05-01T19:46:30+00:00</updated>
<author>
<name>David Ahern</name>
<email>dsahern@kernel.org</email>
</author>
<published>2020-05-01T14:53:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8f34e53b60b337e559f1ea19e2780ff95ab2fa65'/>
<id>urn:sha1:8f34e53b60b337e559f1ea19e2780ff95ab2fa65</id>
<content type='text'>
Nik reported a bug with pcpu dst cache when nexthop objects are
used illustrated by the following:
    $ ip netns add foo
    $ ip -netns foo li set lo up
    $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
    $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
    $ ip li add veth1 type veth peer name veth2
    $ ip li set veth1 up
    $ ip addr add 2001:db8:10::1/64 dev veth1
    $ ip li set dev veth2 netns foo
    $ ip -netns foo li set veth2 up
    $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
    $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
    $ ip -6 route add 2001:db8:11::1/128 nhid 100

    Create a pcpu entry on cpu 0:
    $ taskset -a -c 0 ip -6 route get 2001:db8:11::1

    Re-add the route entry:
    $ ip -6 ro del 2001:db8:11::1
    $ ip -6 route add 2001:db8:11::1/128 nhid 100

    Route get on cpu 0 returns the stale pcpu:
    $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
    RTNETLINK answers: Network is unreachable

    While cpu 1 works:
    $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
    2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium

Conversion of FIB entries to work with external nexthop objects
missed an important difference between IPv4 and IPv6 - how dst
entries are invalidated when the FIB changes. IPv4 has a per-network
namespace generation id (rt_genid) that is bumped on changes to the FIB.
Checking if a dst_entry is still valid means comparing rt_genid in the
rtable to the current value of rt_genid for the namespace.

IPv6 also has a per network namespace counter, fib6_sernum, but the
count is saved per fib6_node. With the per-node counter only dst_entries
based on fib entries under the node are invalidated when changes are
made to the routes - limiting the scope of invalidations. IPv6 uses a
reference in the rt6_info, 'from', to track the corresponding fib entry
used to create the dst_entry. When validating a dst_entry, the 'from'
is used to backtrack to the fib6_node and check the sernum of it to the
cookie passed to the dst_check operation.

With the inline format (nexthop definition inline with the fib6_info),
dst_entries cached in the fib6_nh have a 1:1 correlation between fib
entries, nexthop data and dst_entries. With external nexthops, IPv6
looks more like IPv4 which means multiple fib entries across disparate
fib6_nodes can all reference the same fib6_nh. That means validation
of dst_entries based on external nexthops needs to use the IPv4 format
- the per-network namespace counter.

Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
rt6_get_cookie to return sernum if it is set and update dst_check for
IPv6 to look for sernum set and based the check on it if so. Finally,
rt6_get_pcpu_route needs to validate the cached entry before returning
a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
__mkroute_output for IPv4).

This problem only affects routes using the new, external nexthops.

Thanks to the kbuild test robot for catching the IS_ENABLED needed
around rt_genid_ipv6 before I sent this out.

Fixes: 5b98324ebe29 ("ipv6: Allow routes to use nexthop objects")
Reported-by: Nikolay Aleksandrov &lt;nikolay@cumulusnetworks.com&gt;
Signed-off-by: David Ahern &lt;dsahern@kernel.org&gt;
Reviewed-by: Nikolay Aleksandrov &lt;nikolay@cumulusnetworks.com&gt;
Tested-by: Nikolay Aleksandrov &lt;nikolay@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf, net: Fix build issue when net ns not configured</title>
<updated>2020-03-29T23:16:50+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2020-03-29T19:59:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5a95cbb80ef8d8f2db29ab10777cd4742e6fc8ec'/>
<id>urn:sha1:5a95cbb80ef8d8f2db29ab10777cd4742e6fc8ec</id>
<content type='text'>
Fix a redefinition of 'net_gen_cookie' error that was overlooked
when net ns is not configured.

Fixes: f318903c0bf4 ("bpf: Add netns cookie and enable it for bpf cgroup hooks")
Reported-by: kbuild test robot &lt;lkp@intel.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>bpf: Add netns cookie and enable it for bpf cgroup hooks</title>
<updated>2020-03-28T02:40:38+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2020-03-27T15:58:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f318903c0bf42448b4c884732df2bbb0ef7a2284'/>
<id>urn:sha1:f318903c0bf42448b4c884732df2bbb0ef7a2284</id>
<content type='text'>
In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between initial network namespaces and non-initial one. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace as one example. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.

  (*) Both externalTrafficPolicy={Local|Cluster} types
  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
</content>
</entry>
<entry>
<title>netns: Constify exported functions</title>
<updated>2020-01-17T12:25:24+00:00</updated>
<author>
<name>Guillaume Nault</name>
<email>gnault@redhat.com</email>
</author>
<published>2020-01-16T20:16:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=56f200c78ce4d94680a27a1ce97a29ebeb4f23e1'/>
<id>urn:sha1:56f200c78ce4d94680a27a1ce97a29ebeb4f23e1</id>
<content type='text'>
Mark function parameters as 'const' where possible.

Signed-off-by: Guillaume Nault &lt;gnault@redhat.com&gt;
Acked-by: Nicolas Dichtel &lt;nicolas.dichtel@6wind.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2019-11-02T20:54:56+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2019-11-02T20:12:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d31e95585ca697fb31440c6fe30113adc85ecfbd'/>
<id>urn:sha1:d31e95585ca697fb31440c6fe30113adc85ecfbd</id>
<content type='text'>
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.

The rest were (relatively) trivial in nature.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netns: fix GFP flags in rtnl_net_notifyid()</title>
<updated>2019-10-26T03:14:42+00:00</updated>
<author>
<name>Guillaume Nault</name>
<email>gnault@redhat.com</email>
</author>
<published>2019-10-23T16:39:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d4e4fdf9e4a27c87edb79b1478955075be141f67'/>
<id>urn:sha1:d4e4fdf9e4a27c87edb79b1478955075be141f67</id>
<content type='text'>
In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
but there are a few paths calling rtnl_net_notifyid() from atomic
context or from RCU critical sections. The later also precludes the use
of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
call is wrong too, as it uses GFP_KERNEL unconditionally.

Therefore, we need to pass the GFP flags as parameter and propagate it
through function calls until the proper flags can be determined.

In most cases, GFP_KERNEL is fine. The exceptions are:
  * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
    indirectly call rtnl_net_notifyid() from RCU critical section,

  * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
    parameter.

Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
by nlmsg_new(). The function is allowed to sleep, so better make the
flags consistent with the ones used in the following
ovs_vport_cmd_fill_info() call.

Found by code inspection.

Fixes: 9a9634545c70 ("netns: notify netns id events")
Signed-off-by: Guillaume Nault &lt;gnault@redhat.com&gt;
Acked-by: Nicolas Dichtel &lt;nicolas.dichtel@6wind.com&gt;
Acked-by: Pravin B Shelar &lt;pshelar@ovn.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net</title>
<updated>2019-10-20T17:43:00+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2019-10-20T05:51:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2f184393e0c2d409c62262f57f2a57efdf9370b8'/>
<id>urn:sha1:2f184393e0c2d409c62262f57f2a57efdf9370b8</id>
<content type='text'>
Several cases of overlapping changes which were for the most
part trivially resolvable.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: reorder 'struct net' fields to avoid false sharing</title>
<updated>2019-10-19T19:21:53+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2019-10-18T22:20:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2a06b8982f8f2f40d03a3daf634676386bd84dbc'/>
<id>urn:sha1:2a06b8982f8f2f40d03a3daf634676386bd84dbc</id>
<content type='text'>
Intel test robot reported a ~7% regression on TCP_CRR tests
that they bisected to the cited commit.

Indeed, every time a new TCP socket is created or deleted,
the atomic counter net-&gt;count is touched (via get_net(net)
and put_net(net) calls)

So cpus might have to reload a contended cache line in
net_hash_mix(net) calls.

We need to reorder 'struct net' fields to move @hash_mix
in a read mostly cache line.

We move in the first cache line fields that can be
dirtied often.

We probably will have to address in a followup patch
the __randomize_layout that was added in linux-4.13,
since this might break our placement choices.

Fixes: 355b98553789 ("netns: provide pure entropy for net_hash_mix()")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
