<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/filter.h, branch v4.17.1</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v4.17.1</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v4.17.1'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2018-03-31T00:15:18+00:00</updated>
<entry>
<title>bpf: Hooks for sys_bind</title>
<updated>2018-03-31T00:15:18+00:00</updated>
<author>
<name>Andrey Ignatov</name>
<email>rdna@fb.com</email>
</author>
<published>2018-03-30T22:08:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4fbac77d2d092b475dda9eea66da674369665427'/>
<id>urn:sha1:4fbac77d2d092b475dda9eea66da674369665427</id>
<content type='text'>
== The problem ==

There is a use-case when all processes inside a cgroup should use one
single IP address on a host that has multiple IP configured.  Those
processes should use the IP for both ingress and egress, for TCP and UDP
traffic. So TCP/UDP servers should be bound to that IP to accept
incoming connections on it, and TCP/UDP clients should make outgoing
connections from that IP. It should not require changing application
code since it's often not possible.

Currently it's solved by intercepting glibc wrappers around syscalls
such as `bind(2)` and `connect(2)`. It's done by a shared library that
is preloaded for every process in a cgroup so that whenever TCP/UDP
server calls `bind(2)`, the library replaces IP in sockaddr before
passing arguments to syscall. When application calls `connect(2)` the
library transparently binds the local end of connection to that IP
(`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).

Shared library approach is fragile though, e.g.:
* some applications clear env vars (incl. `LD_PRELOAD`);
* `/etc/ld.so.preload` doesn't help since some applications are linked
  with option `-z nodefaultlib`;
* other applications don't use glibc and there is nothing to intercept.

== The solution ==

The patch provides much more reliable in-kernel solution for the 1st
part of the problem: binding TCP/UDP servers on desired IP. It does not
depend on application environment and implementation details (whether
glibc is used or not).

It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
(similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).

The new program type is intended to be used with sockets (`struct sock`)
in a cgroup and provided by user `struct sockaddr`. Pointers to both of
them are parts of the context passed to programs of newly added types.

The new attach types provides hooks in `bind(2)` system call for both
IPv4 and IPv6 so that one can write a program to override IP addresses
and ports user program tries to bind to and apply such a program for
whole cgroup.

== Implementation notes ==

[1]
Separate attach types for `AF_INET` and `AF_INET6` are added
intentionally to prevent reading/writing to offsets that don't make
sense for corresponding socket family. E.g. if user passes `sockaddr_in`
it doesn't make sense to read from / write to `user_ip6[]` context
fields.

[2]
The write access to `struct bpf_sock_addr_kern` is implemented using
special field as an additional "register".

There are just two registers in `sock_addr_convert_ctx_access`: `src`
with value to write and `dst` with pointer to context that can't be
changed not to break later instructions. But the fields, allowed to
write to, are not available directly and to access them address of
corresponding pointer has to be loaded first. To get additional register
the 1st not used by `src` and `dst` one is taken, its content is saved
to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
address of pointer field, and finally the register's content is restored
from the temporary field after writing `src` value.

Signed-off-by: Andrey Ignatov &lt;rdna@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>bpf: Check attach type at prog load time</title>
<updated>2018-03-31T00:14:44+00:00</updated>
<author>
<name>Andrey Ignatov</name>
<email>rdna@fb.com</email>
</author>
<published>2018-03-30T22:08:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5e43f899b03a3492ce5fc44e8900becb04dae9c0'/>
<id>urn:sha1:5e43f899b03a3492ce5fc44e8900becb04dae9c0</id>
<content type='text'>
== The problem ==

There are use-cases when a program of some type can be attached to
multiple attach points and those attach points must have different
permissions to access context or to call helpers.

E.g. context structure may have fields for both IPv4 and IPv6 but it
doesn't make sense to read from / write to IPv6 field when attach point
is somewhere in IPv4 stack.

Same applies to BPF-helpers: it may make sense to call some helper from
some attach point, but not from other for same prog type.

== The solution ==

Introduce `expected_attach_type` field in in `struct bpf_attr` for
`BPF_PROG_LOAD` command. If scenario described in "The problem" section
is the case for some prog type, the field will be checked twice:

1) At load time prog type is checked to see if attach type for it must
   be known to validate program permissions correctly. Prog will be
   rejected with EINVAL if it's the case and `expected_attach_type` is
   not specified or has invalid value.

2) At attach time `attach_type` is compared with `expected_attach_type`,
   if prog type requires to have one, and, if they differ, attach will
   be rejected with EINVAL.

The `expected_attach_type` is now available as part of `struct bpf_prog`
in both `bpf_verifier_ops-&gt;is_valid_access()` and
`bpf_verifier_ops-&gt;get_func_proto()` () and can be used to check context
accesses and calls to helpers correspondingly.

Initially the idea was discussed by Alexei Starovoitov &lt;ast@fb.com&gt; and
Daniel Borkmann &lt;daniel@iogearbox.net&gt; here:
https://marc.info/?l=linux-netdev&amp;m=152107378717201&amp;w=2

Signed-off-by: Andrey Ignatov &lt;rdna@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT:</title>
<updated>2018-03-29T22:09:43+00:00</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2018-03-28T19:49:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=fa246693a111fab32bd51d20f07a347e42773ee9'/>
<id>urn:sha1:fa246693a111fab32bd51d20f07a347e42773ee9</id>
<content type='text'>
Add support for the BPF_F_INGRESS flag in skb redirect helper. To
do this convert skb into a scatterlist and push into ingress queue.
This is the same logic that is used in the sk_msg redirect helper
so it should feel familiar.

Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>bpf: sockmap redirect ingress support</title>
<updated>2018-03-29T22:09:43+00:00</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2018-03-28T19:49:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8934ce2fd08171e8605f7fada91ee7619fe17ab8'/>
<id>urn:sha1:8934ce2fd08171e8605f7fada91ee7619fe17ab8</id>
<content type='text'>
Add support for the BPF_F_INGRESS flag in sk_msg redirect helper.
To do this add a scatterlist ring for receiving socks to check
before calling into regular recvmsg call path. Additionally, because
the poll wakeup logic only checked the skb recv queue we need to
add a hook in TCP stack (similar to write side) so that we have
a way to wake up polling socks when a scatterlist is redirected
to that sock.

After this all that is needed is for the redirect helper to
push the scatterlist into the psock receive queue.

Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>bpf: add parenthesis around argument of BPF_LDST_BYTES()</title>
<updated>2018-03-29T02:36:13+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>jakub.kicinski@netronome.com</email>
</author>
<published>2018-03-29T00:48:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e59ac634908f4ea90066e6db7dd7ae8ca02815ff'/>
<id>urn:sha1:e59ac634908f4ea90066e6db7dd7ae8ca02815ff</id>
<content type='text'>
BPF_LDST_BYTES() does not put it's argument in parenthesis
when referencing it.  This makes it impossible to pass pointers
obtained by address-of operator (e.g. BPF_LDST_BYTES(&amp;insn)).
Add the parenthesis.

Signed-off-by: Jakub Kicinski &lt;jakub.kicinski@netronome.com&gt;
Reviewed-by: Quentin Monnet &lt;quentin.monnet@netronome.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data</title>
<updated>2018-03-19T20:14:38+00:00</updated>
<author>
<name>John Fastabend</name>
<email>john.fastabend@gmail.com</email>
</author>
<published>2018-03-18T19:57:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4f738adba30a7cfc006f605707e7aee847ffefa0'/>
<id>urn:sha1:4f738adba30a7cfc006f605707e7aee847ffefa0</id>
<content type='text'>
This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.

Similar to previous sockmap usages when a sock is added to a
sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
program type attached then the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in sendmsg case and page/offset in sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
case and in the sendpage case leaves the data untouched. Both cases
return -EACESS to the user. Returning SK_PASS will allow the msg to
be sent.

In the sendmsg case data is copied into kernel space buffers before
running the BPF program. The kernel space buffers are stored in a
scatterlist object where each element is a kernel memory buffer.
Some effort is made to coalesce data from the sendmsg call here.
For example a sendmsg call with many one byte iov entries will
likely be pushed into a single entry. The BPF program is run with
data pointers (start/end) pointing to the first sg element.

In the sendpage case data is not copied. We opt not to copy the
data by default here, because the BPF infrastructure does not
know what bytes will be needed nor when they will be needed. So
copying all bytes may be wasteful. Because of this the initial
start/end data pointers are (0,0). Meaning no data can be read or
written. This avoids reading data that may be modified by the
user. A new helper is added later in this series if reading and
writing the data is needed. The helper call will do a copy by
default so that the page is exclusively owned by the BPF call.

The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
in the sendmsg() case and the entire page/offset in the sendpage case.
This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. Again a helper is added later in the series if
a verdict needs to apply to multiple system calls and/or only
a subpart of the currently being processed message.

The helper msg_redirect_map() can be used to select the socket to
send the data on. This is used similar to existing redirect use
cases. This allows policy to redirect msgs.

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
		&amp;obj, &amp;msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &amp;i, fd, BPF_ANY);

After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.

For a complete example see BPF selftests or sockmap samples.

Implementation notes:

It seemed the simplest, to me at least, to use a refcnt to ensure
psock is not lost across the sendmsg copy into the sg, the bpf program
running on the data in sg_data, and the final pass to the TCP stack.
Some performance testing may show a better method to do this and avoid
the refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is
supporting MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other features TBD include ZEROCOPY and the
TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
shortly.

Future work could improve size limits on the scatterlist rings used
here. Currently, we use MAX_SKB_FRAGS simply because this was being
used already in the TLS case. Future work could extend the kernel sk
APIs to tune this depending on workload. This is a trade-off
between memory usage and throughput performance.

Signed-off-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Acked-by: David S. Miller &lt;davem@davemloft.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
</content>
</entry>
<entry>
<title>net: avoid including xdp.h in filter.h</title>
<updated>2018-02-14T01:50:10+00:00</updated>
<author>
<name>Jesper Dangaard Brouer</name>
<email>brouer@redhat.com</email>
</author>
<published>2018-02-13T13:15:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=297dd12cb104151797fd649433a2157b585f1718'/>
<id>urn:sha1:297dd12cb104151797fd649433a2157b585f1718</id>
<content type='text'>
If is sufficient with a forward declaration of struct xdp_rxq_info in
linux/filter.h, which avoids including net/xdp.h.  This was originally
suggested by John Fastabend during the review phase, but wasn't
included in the final patchset revision.  Thus, this followup.

Suggested-by: John Fastabend &lt;john.fastabend@gmail.com&gt;
Signed-off-by: Jesper Dangaard Brouer &lt;brouer@redhat.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: make unknown opcode handling more robust</title>
<updated>2018-01-27T00:42:05+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2018-01-26T22:33:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5e581dad4fec0e6d062740dc35b8dc248b39d224'/>
<id>urn:sha1:5e581dad4fec0e6d062740dc35b8dc248b39d224</id>
<content type='text'>
Recent findings by syzcaller fixed in 7891a87efc71 ("bpf: arsh is
not supported in 32 bit alu thus reject it") triggered a warning
in the interpreter due to unknown opcode not being rejected by
the verifier. The 'return 0' for an unknown opcode is really not
optimal, since with BPF to BPF calls, this would go untracked by
the verifier.

Do two things here to improve the situation: i) perform basic insn
sanity check early on in the verification phase and reject every
non-uapi insn right there. The bpf_opcode_in_insntable() table
reuses the same mapping as the jumptable in ___bpf_prog_run() sans
the non-public mappings. And ii) in ___bpf_prog_run() we do need
to BUG in the case where the verifier would ever create an unknown
opcode due to some rewrites.

Note that JITs do not have such issues since they would punt to
interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON
would also help to avoid such unknown opcodes in the first place.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Support passing args to sock_ops bpf function</title>
<updated>2018-01-26T00:41:14+00:00</updated>
<author>
<name>Lawrence Brakmo</name>
<email>brakmo@fb.com</email>
</author>
<published>2018-01-26T00:14:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=de525be2ca2734865d29c4b67ddd29913b214906'/>
<id>urn:sha1:de525be2ca2734865d29c4b67ddd29913b214906</id>
<content type='text'>
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reusues the reply union, so the bpf_sock_ops structures are not
increased in size.

Signed-off-by: Lawrence Brakmo &lt;brakmo@fb.com&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>bpf: Add write access to tcp_sock and sock fields</title>
<updated>2018-01-26T00:41:14+00:00</updated>
<author>
<name>Lawrence Brakmo</name>
<email>brakmo@fb.com</email>
</author>
<published>2018-01-26T00:14:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b73042b8a28e2603ac178295ab96c876ba5a97a1'/>
<id>urn:sha1:b73042b8a28e2603ac178295ab96c876ba5a97a1</id>
<content type='text'>
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.

Signed-off-by: Lawrence Brakmo &lt;brakmo@fb.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
</feed>
