Age | Commit message | Author | Files | Lines
2019-12-20 | io_uring: move all prep state for IORING_OP_{SEND,RECV}MSG to prep handler | Jens Axboe | 1 | -31/+33
Add struct io_sr_msg in our io_kiocb per-command union, and ensure that the send/recvmsg prep handlers have grabbed what they need from the SQE by the time prep is done. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 | io_uring: move all prep state for IORING_OP_CONNECT to prep handler | Jens Axboe | 1 | -18/+22
Add struct io_connect in our io_kiocb per-command union, and ensure that io_connect_prep() has grabbed what it needs from the SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 | io_uring: add and use struct io_rw for read/writes | Jens Axboe | 1 | -46/+50
Put the kiocb in struct io_rw, and add the addr/len for the request as well. Use the kiocb->private field for the buffer index for fixed reads and writes. Any use of kiocb->ki_filp is flipped to req->file. It's the same thing, and less confusing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-20 | rxrpc: Fix missing security check on incoming calls | David Howells | 6 | -60/+59
Fix rxrpc_new_incoming_call() to check that we have a suitable service key available for the combination of service ID and security class of a new incoming call, and to reject calls for which we don't. Without this check, an assertion like the following can appear:
    rxrpc: Assertion failed - 6(0x6) == 12(0xc) is false
    kernel BUG at net/rxrpc/call_object.c:456!
where call->state is RXRPC_CALL_SERVER_SECURING (6) rather than RXRPC_CALL_COMPLETE (12).
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-12-20 | rxrpc: Don't take call->user_mutex in rxrpc_new_incoming_call() | David Howells | 1 | -17/+3
Standard kernel mutexes cannot be used in any way from interrupt or softirq context, so the user_mutex which manages access to a call cannot be a mutex since on a new call the mutex must start off locked and be unlocked within the softirq handler to prevent userspace interfering with a call we're setting up. Commit a0855d24fc22d49cdc25664fb224caee16998683 ("locking/mutex: Complain upon mutex API misuse in IRQ contexts") causes big warnings to be splashed in dmesg for each new call that comes in from the server. Whilst it *seems* like it should be okay, since the accept path uses trylock, there are issues with PI boosting and marking the wrong task as the owner. Fix this by not taking the mutex in the softirq path at all. It's not obvious that there should be any need for it as the state is set before the first notification is generated for the new call. There's also no particular reason why the link-assessing ping should be triggered inside the mutex. It's not actually transmitted there anyway, but rather it has to be deferred to a workqueue. Further, I don't think that there's any particular reason that the socket notification needs to be done from within rx->incoming_lock, so the amount of time that lock is held can be shortened too and the ping prepared before the new call notification is sent. Fixes: 540b1c48c37a ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg") Signed-off-by: David Howells <dhowells@redhat.com> cc: Peter Zijlstra (Intel) <peterz@infradead.org> cc: Ingo Molnar <mingo@redhat.com> cc: Will Deacon <will@kernel.org> cc: Davidlohr Bueso <dave@stgolabs.net>
2019-12-20 | rxrpc: Unlock new call in rxrpc_new_incoming_call() rather than the caller | David Howells | 2 | -26/+28
Move the unlock and the ping transmission for a new incoming call into rxrpc_new_incoming_call() rather than doing it in the caller. This makes it clearer to see what's going on. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> cc: Ingo Molnar <mingo@redhat.com> cc: Will Deacon <will@kernel.org> cc: Davidlohr Bueso <dave@stgolabs.net>
2019-12-20 | xfs: Make the symbol 'xfs_rtalloc_log_count' static | Chen Wandun | 1 | -1/+1
Fix the following sparse warning: fs/xfs/libxfs/xfs_trans_resv.c:206:1: warning: symbol 'xfs_rtalloc_log_count' was not declared. Should it be static? Fixes: b1de6fc7520f ("xfs: fix log reservation overflows when allocating large rt extents") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-20 | io_uring: use u64_to_user_ptr() consistently | Jens Axboe | 1 | -9/+7
We use it in some spots, but not consistently. Convert the rest over, makes it easier to read as well. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
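As a rough userspace illustration of what such a helper does: the SQE carries user addresses as 64-bit integers, and the conversion goes through uintptr_t so that it stays well-defined when pointers are narrower than 64 bits. This is a sketch only. `toy_ptr_to_u64` and `toy_u64_to_ptr` are hypothetical names, and the kernel's u64_to_user_ptr() additionally type-checks its argument and yields a `__user`-annotated pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in: widen a pointer into a 64-bit field. */
static inline uint64_t toy_ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Hypothetical userspace stand-in: narrow a 64-bit field back to a pointer. */
static inline void *toy_u64_to_ptr(uint64_t v)
{
	/* The value must be representable as a pointer on this platform. */
	assert(v <= UINTPTR_MAX);
	return (void *)(uintptr_t)v;
}
```

Routing every conversion through one helper keeps the casts in a single place, which is the readability gain the commit message refers to.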
2019-12-20 | xen/grant-table: remove multiple BUG_ON on gnttab_interface | Aditya Pakki | 1 | -4/+0
gnttab_request_version() always sets the gnttab_interface variable, so the assertions that check for an empty gnttab_interface are unnecessary. The patch removes these assertions. Signed-off-by: Aditya Pakki <pakki001@umn.edu> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-12-20 | xen-blkback: support dynamic unbind/bind | Paul Durrant | 1 | -18/+38
By simply re-attaching to shared rings during connect_ring() rather than assuming they are freshly allocated (i.e. assuming the counters are zero) it is possible for vbd instances to be unbound and re-bound from and to (respectively) a running guest. This has been tested by running:
    while true; do fio --name=randwrite --ioengine=libaio --iodepth=16 \
        --rw=randwrite --bs=4k --direct=1 --size=1G --verify=crc32; done
in a PV guest whilst running:
    while true; do echo vbd-$DOMID-$VBD >unbind; echo unbound; sleep 5; echo vbd-$DOMID-$VBD >bind; echo bound; sleep 3; done
in dom0 from /sys/bus/xen-backend/drivers/vbd to continuously unbind and re-bind its system disk image. This is a highly useful feature for a backend module as it allows it to be unloaded and re-loaded (i.e. updated) without requiring domUs to be halted. This was also tested by running:
    while true; do echo vbd-$DOMID-$VBD >unbind; echo unbound; sleep 5; rmmod xen-blkback; echo unloaded; sleep 1; modprobe xen-blkback; echo bound; cd $(pwd); sleep 3; done
in dom0 whilst running the same loop as above in the (single) PV guest. Some (less stressful) testing has also been done using a Windows HVM guest with the latest 9.0 PV drivers installed.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-12-20 | xen/interface: re-define FRONT/BACK_RING_ATTACH() | Paul Durrant | 1 | -20/+9
Currently these macros are defined to re-initialize a front/back ring (respectively) to values read from the shared ring in such a way that any requests/responses that are added to the shared ring whilst the front/back is detached will be skipped over. This, in general, is not a desirable semantic since most frontend implementations will eventually block waiting for a response which would either never appear or never be processed. Since the macros are currently unused, take this opportunity to re-define them to re-initialize a front/back ring using specified values. This also allows FRONT/BACK_RING_INIT() to be re-defined in terms of FRONT/BACK_RING_ATTACH() using a specified value of 0. NOTE: BACK_RING_ATTACH() will be used directly in a subsequent patch. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-12-20 | xenbus: limit when state is forced to closed | Paul Durrant | 2 | -2/+11
If a driver probe() fails then leave the xenstore state alone. There is no reason to modify it as the failure may be due to transient resource allocation issues and hence a subsequent probe() may succeed. If the driver supports re-binding then force state to closed during remove() only in the case when the toolstack may need to clean up. This can be detected by checking whether the state in xenstore has been set to closing prior to device removal. NOTE: Re-bind support is indicated by a new boolean in struct xenbus_driver, which defaults to false. Subsequent patches will add support to some backend drivers. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-12-20 | xenbus: move xenbus_dev_shutdown() into frontend code... | Paul Durrant | 4 | -27/+23
...and make it static. xenbus_dev_shutdown() is seemingly intended to cause clean shutdown of PV frontends when a guest is rebooted. Indeed the function waits for a completion which is only set by a call to xenbus_frontend_closed(). This patch removes the shutdown() method from backends and moves xenbus_dev_shutdown() from xenbus_probe.c into xenbus_probe_frontend.c, renaming it appropriately and making it static. NOTE: In the case where the backend is running in a driver domain, the toolstack should have already terminated any frontends that may be using it (since Xen does not support re-startable PV driver domains) so xenbus_dev_shutdown() should never be called. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-12-20 | xen/blkfront: Adjust indentation in xlvbd_alloc_gendisk | Nathan Chancellor | 1 | -2/+2
Clang warns: ../drivers/block/xen-blkfront.c:1117:4: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] nr_parts = PARTS_PER_DISK; ^ ../drivers/block/xen-blkfront.c:1115:3: note: previous statement is here if (err) ^ This is because there is a space at the beginning of this line; remove it so that the indentation is consistent according to the Linux kernel coding style and clang no longer warns. While we are here, the previous line has some trailing whitespace; clean that up as well. Fixes: c80a420995e7 ("xen-blkfront: handle Xen major numbers other than XENVBD") Link: https://github.com/ClangBuiltLinux/linux/issues/791 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2019-12-20 | riscv: move sifive_l2_cache.c to drivers/soc | Christoph Hellwig | 8 | -2/+17
The sifive_l2_cache.c is in no way related to RISC-V architecture memory management. It is a little stub driver working around the fact that the EDAC maintainers prefer their drivers to be structured in a certain way that doesn't fit the SiFive SOCs. Move the file to drivers/soc and add a Kconfig option for it, as well as the whole drivers/soc boilerplate for CONFIG_SOC_SIFIVE. Fixes: a967a289f169 ("RISC-V: sifive_l2_cache: Add L2 cache controller driver for SiFive SoCs") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Borislav Petkov <bp@suse.de> [paul.walmsley@sifive.com: keep the MAINTAINERS change specific to the L2$ controller code] Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-12-20 | riscv: define vmemmap before pfn_to_page calls | David Abdurachmanov | 1 | -17/+21
pfn_to_page & page_to_pfn depend on vmemmap being defined before they are called if the kernel is configured with CONFIG_SPARSEMEM_VMEMMAP=y. The breakage was caused by the NOMMU changes, which moved the vmemmap definition below the definitions of functions calling pfn_to_page & page_to_pfn. Noticed while compiling a 5.5-rc2 kernel for Fedora/RISCV. v2: - Add a comment for vmemmap in source Signed-off-by: David Abdurachmanov <david.abdurachmanov@sifive.com> Fixes: 6bd33e1ece52 ("riscv: add nommu support") Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-12-20 | riscv: fix scratch register clearing in M-mode. | Greentime Hu | 1 | -1/+1
This patch fixes the scratch register clearing in M-mode. The code cleared the sscratch register in M-mode, but it should clear the mscratch register. Accessing sscratch causes a kernel trap if the CPU core doesn't support S-mode. Fixes: 9e80635619b5 ("riscv: clear the instruction cache and all registers when booting") Signed-off-by: Greentime Hu <greentime.hu@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-12-20 | riscv: Fix use of undefined config option CONFIG_CONFIG_MMU | Andreas Schwab | 1 | -1/+1
In Kconfig files, config options are written without the CONFIG_ prefix. Fixes: 6bd33e1ece52 ("riscv: add nommu support") Signed-off-by: Andreas Schwab <schwab@suse.de> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-12-20 | Merge branch 'replace-cg_bpf-prog' | Alexei Starovoitov | 15 | -626/+652
Andrey Ignatov says: ==================== v3->v4: - use OPTS_VALID and OPTS_GET to handle bpf_prog_attach_opts. v2->v3: - rely on DECLARE_LIBBPF_OPTS from libbpf_common.h; - separate "required" and "optional" arguments in bpf_prog_attach_xattr; - convert test_cgroup_attach to prog_tests; - move the new selftest to prog_tests/cgroup_attach_multi. v1->v2: - move DECLARE_LIBBPF_OPTS from libbpf.h to bpf.h (patch 4); - switch new libbpf API to OPTS framework; - switch selftest to libbpf OPTS framework. This patch set adds support for replacing cgroup-bpf programs attached with BPF_F_ALLOW_MULTI flag so that any program in a list can be updated to a new version without service interruption and order of programs can be preserved. Please see patch 3 for details on the use-case and API changes. Other patches: Patch 1 is preliminary refactoring of __cgroup_bpf_attach to simplify it. Patch 2 is minor cleanup of hierarchy_allows_attach. Patch 4 extends libbpf API to support new set of attach attributes. Patch 5 converts test_cgroup_attach to prog_tests. Patch 6 adds selftest coverage for the new API. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-12-20 | selftests/bpf: Test BPF_F_REPLACE in cgroup_attach_multi | Andrey Ignatov | 1 | -3/+50
Test replacing a cgroup-bpf program attached with BPF_F_ALLOW_MULTI and possible failure modes: invalid combination of flags, invalid replace_bpf_fd, replacing a program that is not attached to the specified cgroup. Example of program replacing:
    # gdb -q --args ./test_progs --name=cgroup_attach_multi
    ...
    Breakpoint 1, test_cgroup_attach_multi () at cgroup_attach_multi.c:227
    (gdb)
    [1]+ Stopped gdb -q --args ./test_progs --name=cgroup_attach_multi
    # bpftool c s /mnt/cgroup2/cgroup-test-work-dir/cg1
    ID       AttachType      AttachFlags     Name
    2133     egress          multi
    2134     egress          multi
    # fg
    gdb -q --args ./test_progs --name=cgroup_attach_multi
    (gdb) c
    Continuing.
    Breakpoint 2, test_cgroup_attach_multi () at cgroup_attach_multi.c:233
    (gdb)
    [1]+ Stopped gdb -q --args ./test_progs --name=cgroup_attach_multi
    # bpftool c s /mnt/cgroup2/cgroup-test-work-dir/cg1
    ID       AttachType      AttachFlags     Name
    2139     egress          multi
    2134     egress          multi
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/7b9b83e8d5fb82e15b034341bd40b6fb2431eeba.1576741281.git.rdna@fb.com
2019-12-20 | selftests/bpf: Convert test_cgroup_attach to prog_tests | Andrey Ignatov | 6 | -574/+498
Convert test_cgroup_attach to prog_tests. This change does a lot of things but in many cases it's pretty expensive to separate them, so they go in one commit. Nevertheless the logic is kept as is and changes made are just moving things around, simplifying them (w/o changing the meaning of the tests) and making prog_tests compatible:
* split the 3 tests in the file into 3 separate files in prog_tests/;
* rename the test functions to test_<file_base_name>;
* remove unused includes, constants, variables and functions from every test;
* replace `if`-s with `if (CHECK())` where additional context should be logged and with `if (CHECK_FAIL())` where line number is enough;
* switch from `log_err()` to logging via `CHECK()`;
* replace `assert`-s with `CHECK_FAIL()` to avoid crashing the whole test_progs if one assertion fails;
* replace cgroup_helpers with test__join_cgroup() in cgroup_attach_override only, other tests need more fine-grained control for cgroup creation/deletion so cgroup_helpers are still used there;
* simplify cgroup_attach_autodetach by switching to easiest possible program since this test doesn't really need such a complicated program as cgroup_attach_multi does;
* remove test_cgroup_attach.c itself.
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/0ff19cc64d2dc5cf404349f07131119480e10e32.1576741281.git.rdna@fb.com
2019-12-20 | libbpf: Introduce bpf_prog_attach_xattr | Andrey Ignatov | 3 | -1/+28
Introduce a new bpf_prog_attach_xattr function that, in addition to program fd, target fd and attach type, accepts an extendable struct bpf_prog_attach_opts. bpf_prog_attach_opts relies on DECLARE_LIBBPF_OPTS macro to maintain backward and forward compatibility and has the following "optional" attach attributes: * existing attach_flags, since it's not required when attaching in NONE mode. Even though it's quite often used in MULTI and OVERRIDE mode it seems to be a good idea to reduce number of arguments to bpf_prog_attach_xattr; * newly introduced attribute of BPF_PROG_ATTACH command: replace_prog_fd that is fd of previously attached cgroup-bpf program to replace if BPF_F_REPLACE flag is used. The new function is named to be consistent with other xattr-functions (bpf_prog_test_run_xattr, bpf_create_map_xattr, bpf_load_program_xattr). The struct bpf_prog_attach_opts is supposed to be used with DECLARE_LIBBPF_OPTS macro. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/bd6e0732303eb14e4b79cb128268d9e9ad6db208.1576741281.git.rdna@fb.com
2019-12-20 | bpf: Support replacing cgroup-bpf program in MULTI mode | Andrey Ignatov | 6 | -9/+54
The common use-case in production is to have multiple cgroup-bpf programs per attach type that cover multiple use-cases. Such programs are attached with BPF_F_ALLOW_MULTI and can be maintained by different people. Order of programs usually matters, for example imagine two egress programs: the first one drops packets and the second one counts packets. If they're swapped the result of the counting program will be different. It brings operational challenges with updating cgroup-bpf program(s) attached with BPF_F_ALLOW_MULTI since there is no way to replace a program:
* One way to update is to detach all programs first and then attach the new version(s) again in the right order. This introduces an interruption in the work a program is doing and may not be acceptable (e.g. if it's egress firewall);
* Another way is to attach the new version of a program first and only then detach the old version. This introduces a time interval when two versions of the same program are working, which may not be acceptable if a program is not idempotent. It also imposes additional burden on program developers to make sure that two versions of their program can co-exist.
Solve the problem by introducing a "replace" mode in BPF_PROG_ATTACH command for cgroup-bpf programs being attached with BPF_F_ALLOW_MULTI flag. This mode is enabled by the newly introduced BPF_F_REPLACE attach flag and the bpf_attr.replace_bpf_fd attribute to pass the fd of the old program to replace. That way the user can replace any program among those attached with BPF_F_ALLOW_MULTI flag without the problems described above. Details of the new API:
* If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid descriptor of BPF program, BPF_PROG_ATTACH will return corresponding error (EINVAL or EBADF).
* If replace_bpf_fd has valid descriptor of BPF program but such a program is not attached to specified cgroup, BPF_PROG_ATTACH will return ENOENT.
BPF_F_REPLACE is introduced to make the user intent clear, since replace_bpf_fd alone can't be used for this (its default value, 0, is a valid fd). BPF_F_REPLACE also makes it possible to extend the API in the future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed). Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Narkyiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com
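The ordering guarantee can be illustrated with a toy model of an attach list: replacing an entry swaps it in place, so the other programs keep their position, and asking to replace a program that is not in the list yields ENOENT. This is a sketch under stated assumptions, not the kernel code: `toy_cgroup` and `toy_replace` are hypothetical names, and the model identifies programs by numeric ID rather than by fd.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_PROGS 8

/* Toy model of the programs attached to one cgroup, in execution order. */
struct toy_cgroup {
	int prog_id[TOY_MAX_PROGS];
	size_t n;
};

/*
 * Replace old_id with new_id without changing its position in the list.
 * Returns 0 on success or -ENOENT if old_id is not attached.
 */
static int toy_replace(struct toy_cgroup *cg, int old_id, int new_id)
{
	size_t i;

	assert(cg != NULL && cg->n <= TOY_MAX_PROGS);
	for (i = 0; i < cg->n; i++) {
		if (cg->prog_id[i] == old_id) {
			cg->prog_id[i] = new_id;
			return 0;
		}
	}
	return -ENOENT;
}
```

With a list holding IDs 2133 and 2134, replacing 2133 with 2139 leaves 2134 in second place, so the work done at that position in the list is never interrupted.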
2019-12-20 | bpf: Remove unused new_flags in hierarchy_allows_attach() | Andrey Ignatov | 1 | -3/+2
new_flags is unused, remove it. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/2c49b30ab750f93cfef04a1e40b097d70c3a39a1.1576741281.git.rdna@fb.com
2019-12-20 | bpf: Simplify __cgroup_bpf_attach | Andrey Ignatov | 1 | -39/+23
__cgroup_bpf_attach has a lot of identical code to handle two scenarios: BPF_F_ALLOW_MULTI is set and unset. Simplify it by splitting the two main steps: * First, the decision is made whether a new bpf_prog_list entry should be allocated or existing entry should be reused for the new program. This decision is saved in replace_pl pointer; * Next, replace_pl pointer is used to handle both possible states of BPF_F_ALLOW_MULTI flag (set / unset) instead of doing similar work for them separately. This splitting, in turn, allows to make further simplifications: * The check for attaching same program twice in BPF_F_ALLOW_MULTI mode can be done before allocating cgroup storage, so that if user tries to attach same program twice no alloc/free happens as it was before; * pl_was_allocated becomes redundant so it's removed. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/c6193db6fe630797110b0d3ff06c125d093b834c.1576741281.git.rdna@fb.com
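The two-step structure described above can be sketched on a toy list: first decide whether an existing entry is reused (recorded in a pointer), then let a single code path handle both the set and unset states of the multi flag. All names here (`toy_prog_list`, `toy_attach`, `TOY_ALLOW_MULTI`) are hypothetical, the fixed-size array stands in for the kernel's list, and the error codes are choices made for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_ALLOW_MULTI 0x1u	/* hypothetical stand-in for BPF_F_ALLOW_MULTI */
#define TOY_MAX_PROGS   4

struct toy_prog_list {
	int prog[TOY_MAX_PROGS];
	size_t n;
};

static int toy_attach(struct toy_prog_list *pl, int prog, unsigned int flags)
{
	int *replace = NULL;	/* entry to reuse, or NULL to add a new one */
	size_t i;

	assert(pl != NULL && pl->n <= TOY_MAX_PROGS);

	/* Step 1: decide between reusing an entry and adding a new one. */
	if (flags & TOY_ALLOW_MULTI) {
		/* Reject a duplicate before anything is allocated. */
		for (i = 0; i < pl->n; i++)
			if (pl->prog[i] == prog)
				return -EINVAL;
	} else if (pl->n > 0) {
		/* Without multi there is at most one entry: reuse it. */
		replace = &pl->prog[0];
	}

	/* Step 2: one code path for both flag states. */
	if (replace) {
		*replace = prog;
		return 0;
	}
	if (pl->n == TOY_MAX_PROGS)
		return -E2BIG;
	pl->prog[pl->n++] = prog;
	return 0;
}
```

Because the duplicate check runs in step 1, a repeated attach returns before any new entry is created, which mirrors the "no alloc/free happens" point in the message.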
2019-12-20 | Merge branch 'simplify-do_redirect' | Alexei Starovoitov | 8 | -197/+75
Björn Töpel says: ==================== This series aims to simplify the XDP maps and xdp_do_redirect_map()/xdp_do_flush_map(), and to crank out some more performance from XDP_REDIRECT scenarios. The first part of the series simplifies all XDP_REDIRECT capable maps, so that __XXX_flush_map() does not require the map parameter, by moving the flush list from the map to global scope. This results in that the map_to_flush member can be removed from struct bpf_redirect_info, and its corresponding logic. Simpler code, and more performance due to that checks/code per-packet is moved to flush.
Pre-series performance:
    $ sudo taskset -c 22 ./xdpsock -i enp134s0f0 -q 20 -n 1 -r -z
    sock0@enp134s0f0:20 rxdrop xdp-drv
                    pps          pkts          1.00
    rx              20,797,350   230,942,399
    tx              0            0
    $ sudo ./xdp_redirect_cpu --dev enp134s0f0 --cpu 22 xdp_cpu_map0
    Running XDP/eBPF prog_name:xdp_cpu_map5_lb_hash_ip_pairs
    XDP-cpumap      CPU:to  pps       drop-pps  extra-info
    XDP-RX          20      7723038   0         0
    XDP-RX          total   7723038   0
    cpumap_kthread  total   0         0         0
    redirect_err    total   0         0
    xdp_exception   total   0         0
Post-series performance:
    $ sudo taskset -c 22 ./xdpsock -i enp134s0f0 -q 20 -n 1 -r -z
    sock0@enp134s0f0:20 rxdrop xdp-drv
                    pps          pkts          1.00
    rx              21,524,979   86,835,327
    tx              0            0
    $ sudo ./xdp_redirect_cpu --dev enp134s0f0 --cpu 22 xdp_cpu_map0
    Running XDP/eBPF prog_name:xdp_cpu_map5_lb_hash_ip_pairs
    XDP-cpumap      CPU:to  pps       drop-pps  extra-info
    XDP-RX          20      7840124   0         0
    XDP-RX          total   7840124   0
    cpumap_kthread  total   0         0         0
    redirect_err    total   0         0
    xdp_exception   total   0         0
Results: +3.5% and +1.5% for the ubenchmarks. v1->v2 [1]: * Removed 'unused-variable' compiler warning (Jakub) [1] https://lore.kernel.org/bpf/20191218105400.2895-1-bjorn.topel@gmail.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-12-20 | xdp: Simplify __bpf_tx_xdp_map() | Björn Töpel | 1 | -26/+7
The explicit error checking is not needed. Simply return the error instead. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-9-bjorn.topel@gmail.com
2019-12-20 | xdp: Remove map_to_flush and map swap detection | Björn Töpel | 2 | -25/+3
Now that all XDP maps that can be used with bpf_redirect_map() track entries to be flushed in a global fashion, there is no need to track that the map has changed and flush from xdp_do_generic_map() anymore. All entries will be flushed in xdp_do_flush_map(). This means that map_to_flush can be removed, along with the corresponding checks. Moving the flush logic to one place, xdp_do_flush_map(), gives a bulking behavior and a performance boost. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-8-bjorn.topel@gmail.com
2019-12-20 | xdp: Make cpumap flush_list common for all map instances | Björn Töpel | 3 | -21/+21
The cpumap flush list is used to track entries that need to be flushed via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all cpumaps, which simplifies __cpu_map_flush() and cpu_map_alloc(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-7-bjorn.topel@gmail.com
2019-12-20 | xdp: Make devmap flush_list common for all map instances | Björn Töpel | 3 | -25/+16
The devmap flush list is used to track entries that need to be flushed via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all devmaps, which simplifies __dev_map_flush() and dev_map_init_map(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-6-bjorn.topel@gmail.com
2019-12-20 | xsk: Make xskmap flush_list common for all map instances | Björn Töpel | 4 | -35/+20
The xskmap flush list is used to track entries that need to be flushed via the xdp_do_flush_map() function. This list used to be per-map, but there is really no reason for that. Instead make the flush list global for all xskmaps, which simplifies __xsk_map_flush() and xsk_map_alloc(). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-5-bjorn.topel@gmail.com
2019-12-20 | xdp: Fix graze->grace type-o in cpumap comments | Björn Töpel | 1 | -3/+3
Simple spelling fix. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-4-bjorn.topel@gmail.com
2019-12-20 | xdp: Simplify cpumap cleanup | Björn Töpel | 1 | -29/+5
After the RCU flavor consolidation [1], call_rcu() and synchronize_rcu() wait for preempt-disable regions (NAPI) in addition to the read-side critical sections. As a result, the cleanup code in cpumap can be simplified:
* There is no longer a need to flush in __cpu_map_entry_free, since we know that this has been done when the call_rcu() callback is triggered.
* When freeing the map, there is no need to explicitly wait for a flush. It's guaranteed to be done after the synchronize_rcu() call in cpu_map_free().
[1] https://lwn.net/Articles/777036/ Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-3-bjorn.topel@gmail.com
2019-12-20 | xdp: Simplify devmap cleanup | Björn Töpel | 1 | -38/+5
After the RCU flavor consolidation [1], call_rcu() and synchronize_rcu() wait for preempt-disable regions (NAPI) in addition to the read-side critical sections. As a result, the cleanup code in devmap can be simplified:
* There is no longer a need to flush in __dev_map_entry_free, since we know that this has been done when the call_rcu() callback is triggered.
* When freeing the map, there is no need to explicitly wait for a flush. It's guaranteed to be done after the synchronize_rcu() call in dev_map_free(). The rcu_barrier() is still needed, so that the map is not freed prior to the elements.
[1] https://lwn.net/Articles/777036/ Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20191219061006.21980-2-bjorn.topel@gmail.com
2019-12-20 | NFC: pn544: Adjust indentation in pn544_hci_check_presence | Nathan Chancellor | 1 | -1/+1
Clang warns ../drivers/nfc/pn544/pn544.c:696:4: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] return nfc_hci_send_cmd(hdev, NFC_HCI_RF_READER_A_GATE, ^ ../drivers/nfc/pn544/pn544.c:692:3: note: previous statement is here if (target->nfcid1_len != 4 && target->nfcid1_len != 7 && ^ 1 warning generated. This warning occurs because there is a space after the tab on this line. Remove it so that the indentation is consistent with the Linux kernel coding style and clang no longer warns. Fixes: da052850b911 ("NFC: Add pn544 presence check for different targets") Link: https://github.com/ClangBuiltLinux/linux/issues/814 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 | Merge branch 'bcmgenet-Turn-on-offloads-by-default' | David S. Miller | 2 | -50/+67
Doug Berger says: ==================== net: bcmgenet: Turn on offloads by default This commit stack is based on Florian's commit 4e8aedfe78c7 ("net: systemport: Turn on offloads by default") and enables the offloads for the bcmgenet driver by default. The first commit adds support for the HIGHDMA feature to the driver. The second converts the Tx checksum implementation to use the generic hardware logic rather than the deprecated IP centric methods. The third modifies the Rx checksum implementation to use the hardware offload to compute the complete checksum rather than filtering out bad packets detected by the hardware's IP centric implementation. This may increase processing load by passing bad packets to the network stack, but it provides for more flexible handling of packets by the network stack without requiring software computation of the checksum. The remaining commits mirror the extensions Florian made to the sysport driver to retain symmetry with that driver and to make the benefits of the hardware offloads more ubiquitous. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 | net: bcmgenet: Add software counters to track reallocations | Doug Berger | 2 | -0/+8
When inserting the TSB, keep track of how many times we had to reallocate headroom and whether doing so failed. This helps profile the driver for possibly incorrect headroom settings. Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 | net: bcmgenet: Be drop monitor friendly while re-allocating headroom | Doug Berger | 1 | -1/+2
During bcmgenet_put_tx_csum() make sure we differentiate a SKB headroom re-allocation failure from the normal swap and replace path. Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 | net: bcmgenet: Turn on offloads by default | Doug Berger | 1 | -3/+5
We can turn on the RX/TX checksum offloads and the scatter/gather features by default and make sure that those are properly reflected back to e.g. stacked devices such as VLAN. Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: bcmgenet: Utilize bcmgenet_set_features() during resume/openDoug Berger1-0/+8
During driver resume and open, the HW may have lost its context/state. Utilize bcmgenet_set_features() to make sure we restore the correct set of features that were previously configured. Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: bcmgenet: Refactor bcmgenet_set_features()Doug Berger1-19/+19
In preparation for unconditionally enabling TX and RX checksum offloads, refactor bcmgenet_set_features() a bit such that __netdev_update_features() during register_netdev() can make sure that features are correctly programmed during network device registration. Since we can now be called during register_netdev() with clocks gated, we need to temporarily turn them on/off in order to have a successful register programming. We also move the CRC forward setting read into bcmgenet_set_features() since priv->crc_fwd_en matters while turning on RX checksum offload, that way we are guaranteed they are in sync in case we ever add support for NETIF_F_RXFCS at some point in the future. Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: bcmgenet: use CHECKSUM_COMPLETE for NETIF_F_RXCSUMDoug Berger2-13/+8
This commit updates the Rx checksum offload behavior of the driver to use the more generic CHECKSUM_COMPLETE method that supports all protocols over the CHECKSUM_UNNECESSARY method that only applies to some protocols known by the hardware. This behavior is perceived to be superior. Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
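With CHECKSUM_COMPLETE the hardware hands the stack a ones'-complement sum over the packet data, and the stack checks it against whatever protocol it finds, rather than the hardware vouching only for protocols it understands. The arithmetic can be sketched in plain C as below. This is an illustration, not the bcmgenet code: the function name `ones_complement_sum` is hypothetical, and the 32-bit accumulator assumes packet-sized inputs (well under 128 KiB).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: 16-bit ones'-complement sum over a byte buffer,
 * treating the data as big-endian 16-bit words (odd trailing byte is
 * padded with zero). Assumes len is small enough that the 32-bit
 * accumulator cannot overflow before folding.
 */
uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint32_t)data[i] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)sum;
}
```

A block that already contains a correct checksum field sums to 0xFFFF, which is the property a protocol handler can test without the hardware having to know the protocol.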
2019-12-20net: bcmgenet: enable NETIF_F_HW_CSUM featureDoug Berger1-17/+12
The GENET hardware should be capable of generating IP checksums using the NETIF_F_HW_CSUM feature, so switch to using that feature instead of the deprecated NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM. Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: bcmgenet: enable NETIF_F_HIGHDMA flagDoug Berger1-2/+10
This commit configures the DMA masks for the GENET driver and sets the NETIF_F_HIGHDMA flag to report support of the feature. Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net: systemport: Set correct DMA maskFlorian Fainelli1-0/+8
SYSTEMPORT is capable of addressing up to 40 bits of physical address space, so set an appropriate DMA mask to permit that. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
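As a rough illustration of what a 40-bit DMA mask means, the userspace sketch below reproduces the arithmetic of the kernel's DMA_BIT_MASK(n) macro as a function and checks whether a buffer falls entirely inside the addressable range. The names `dma_bit_mask` and `dma_addr_reachable` are hypothetical helpers for this example and are not part of the driver or the DMA API.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the kernel's DMA_BIT_MASK(n), written as a function. */
uint64_t dma_bit_mask(unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}

/*
 * Hypothetical check: is the buffer [addr, addr + len) entirely
 * addressable by a device limited to `bits` address bits?
 */
int dma_addr_reachable(uint64_t addr, uint64_t len, unsigned int bits)
{
    uint64_t mask = dma_bit_mask(bits);

    if (addr > mask)
        return 0;
    if (len == 0)
        return 1;
    /* Compare without computing addr + len, which could wrap. */
    return len - 1 <= mask - addr;
}
```

With 40 bits the mask is 0xFFFFFFFFFF, so a buffer starting at or above 1 TiB is out of reach and would otherwise have to be bounced.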
2019-12-20Merge branch 'cls_u32-fix-refcount-leak'David S. Miller3-22/+230
Davide Caratti says:
====================
net/sched: cls_u32: fix refcount leak

A refcount leak in the error path of u32_change() has been recently introduced. It can be observed with the following commands:

  [root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 97 \
  > u32 match ip src 127.0.0.1/32 indev notexist20 flowid 1:1 action drop
  RTNETLINK answers: Invalid argument
  We have an error talking to the kernel
  [root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 98 \
  > handle 42:42 u32 divisor 256
  Error: cls_u32: Divisor can only be used on a hash table.
  We have an error talking to the kernel
  [root@f31 ~]# tc filter replace dev eth0 ingress protocol ip prio 99 \
  > u32 ht 47:47
  Error: cls_u32: Specified hash table not found.
  We have an error talking to the kernel

They all legitimately return -EINVAL; however, they leave semi-configured filters at eth0 tc ingress:

  [root@f31 ~]# tc filter show dev eth0 ingress
  filter protocol ip pref 97 u32 chain 0
  filter protocol ip pref 97 u32 chain 0 fh 800: ht divisor 1
  filter protocol ip pref 98 u32 chain 0
  filter protocol ip pref 98 u32 chain 0 fh 801: ht divisor 1
  filter protocol ip pref 99 u32 chain 0
  filter protocol ip pref 99 u32 chain 0 fh 802: ht divisor 1

With older kernels, filters were unconditionally considered empty (and thus de-refcounted) on the error path of ->change(). After commit 8b64678e0af8 ("net: sched: refactor tp insert/delete for concurrent execution"), filters were considered empty when the walk() function didn't set 'walker.stop' to 1. Finally, with commit 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty"), tc filters are considered empty unless the walker function is called with a non-NULL handle. This last change doesn't fit cls_u32 design, because at least the "root hnode" is (almost) always non-NULL, as it's allocated in u32_init().

- patch 1/2 is a proposal to restore the original kernel behavior, where no filter was installed in the error path of u32_change().
- patch 2/2 adds tdc selftests that can be used to verify the correct behavior of u32 in the error path of ->change().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20tc-testing: initial tdc selftests for cls_u32Davide Caratti2-22/+205
- move test "e9a3 - Add u32 with source match" to u32.json, and change the match pattern to catch all hnodes - add testcases for relevant error paths of cls_u32 module Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20net/sched: cls_u32: fix refcount leak in the error path of u32_change()Davide Caratti1-0/+25
when users replace cls_u32 filters with new ones having wrong parameters, so that u32_change() fails to validate them, the kernel doesn't roll-back correctly, and leaves semi-configured rules. Fix this in u32_walk(), avoiding a call to the walker function on filters that don't have a match rule connected. The side effect is, these "empty" filters are not even dumped when present; but that shouldn't be a problem as long as we are restoring the original behaviour, where semi-configured filters were not even added in the error path of u32_change(). Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20Merge branch 'nfp-tls-implement-the-stream-sync-RX-resync'David S. Miller11-26/+190
Jakub Kicinski says: ==================== nfp: tls: implement the stream sync RX resync This small series adds support for using the device in stream scan RX resync mode which improves the RX resync success rate. Without stream scan it's pretty much impossible to successfully resync a continuous stream. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20nfp: tls: implement the stream sync RX resyncJakub Kicinski9-16/+172
The simple RX resync strategy controlled by the kernel does not guarantee results as good as when the device helps by detecting the potential record boundaries and keeping track of them. We've called this strategy stream scan in the tls-offload doc. Implement this strategy for the NFP. The device sends a request for record boundary confirmation, which is then recorded in per-TLS socket state and responded to once the record is reached. Because the device keeps track of records passing after the request was sent, the response is not as latency sensitive as when the kernel just tries to tell the device the information about the next record. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
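The request/response flow described above can be sketched as a tiny per-socket state machine. This is a simplified illustration under assumptions and is not the NFP or kernel TLS code: the type `resync_state`, the functions `resync_request` and `resync_on_record`, and the explicit "reject" outcome when the stream has moved past the guessed boundary are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum resync_result { RESYNC_NONE, RESYNC_CONFIRM, RESYNC_REJECT };

/* Hypothetical per-socket state: one outstanding device request. */
struct resync_state {
    bool pending;
    uint32_t req_seq;   /* TCP sequence the device believes starts a record */
};

/* The device asks: "is there a record boundary at seq?" */
void resync_request(struct resync_state *s, uint32_t seq)
{
    assert(s != NULL);
    s->pending = true;
    s->req_seq = seq;
}

/* Called as the stack parses each record header starting at rec_seq. */
enum resync_result resync_on_record(struct resync_state *s, uint32_t rec_seq)
{
    assert(s != NULL);
    if (!s->pending)
        return RESYNC_NONE;
    if (rec_seq == s->req_seq) {
        s->pending = false;
        return RESYNC_CONFIRM;          /* the guess was a real boundary */
    }
    if ((int32_t)(rec_seq - s->req_seq) > 0) {
        s->pending = false;
        return RESYNC_REJECT;           /* assumed: passed it, guess was wrong */
    }
    return RESYNC_NONE;                 /* not reached yet, keep waiting */
}
```

The point of the design is visible in the last branch: the answer can wait until the stream reaches the guessed position, so nothing has to race the next incoming record.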