summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-12-11bpf: states_equal() must build idmap for all function framesEduard Zingerman2-3/+4
verifier.c:states_equal() must maintain register ID mapping across all function frames. Otherwise the following example might be erroneously marked as safe: main: fp[-24] = map_lookup_elem(...) ; frame[0].fp[-24].id == 1 fp[-32] = map_lookup_elem(...) ; frame[0].fp[-32].id == 2 r1 = &fp[-24] r2 = &fp[-32] call foo() r0 = 0 exit foo: 0: r9 = r1 1: r8 = r2 2: r7 = ktime_get_ns() 3: r6 = ktime_get_ns() 4: if (r6 > r7) goto skip_assign 5: r9 = r8 skip_assign: ; <--- checkpoint 6: r9 = *r9 ; (a) frame[1].r9.id == 2 ; (b) frame[1].r9.id == 1 7: if r9 == 0 goto exit: ; mark_ptr_or_null_regs() transfers != 0 info ; for all regs sharing ID: ; (a) r9 != 0 => &frame[0].fp[-32] != 0 ; (b) r9 != 0 => &frame[0].fp[-24] != 0 8: r8 = *r8 ; (a) r8 == &frame[0].fp[-32] ; (b) r8 == &frame[0].fp[-32] 9: r0 = *r8 ; (a) safe ; (b) unsafe exit: 10: exit While processing call to foo() verifier considers the following execution paths: (a) 0-10 (b) 0-4,6-10 (There is also path 0-7,10 but it is not interesting for the issue at hand. (a) is verified first.) Suppose that checkpoint is created at (6) when path (a) is verified, next path (b) is verified and (6) is reached. If states_equal() maintains separate 'idmap' for each frame the mapping at (6) for frame[1] would be empty and regsafe(r9)::check_ids() would add a pair 2->1 and return true, which is an error. If states_equal() maintains single 'idmap' for all frames the mapping at (6) would be { 1->1, 2->2 } and regsafe(r9)::check_ids() would return false when trying to add a pair 2->1. This issue was suggested in the following discussion: https://lore.kernel.org/bpf/CAEf4BzbFB5g4oUfyxk9rHy-PJSLQ3h8q9mV=rVoXfr_JVm8+1Q@mail.gmail.com/ Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-11selftests/bpf: test cases for regsafe() bug skipping check_id()Eduard Zingerman2-0/+103
Under certain conditions it was possible for verifier.c:regsafe() to skip check_id() call. This commit adds negative test cases previously errorneously accepted as safe. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-11bpf: regsafe() must not skip check_ids()Eduard Zingerman1-21/+8
The verifier.c:regsafe() has the following shortcut: equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0; ... if (equal) return true; Which is executed regardless old register type. This is incorrect for register types that might have an ID checked by check_ids(), namely: - PTR_TO_MAP_KEY - PTR_TO_MAP_VALUE - PTR_TO_PACKET_META - PTR_TO_PACKET The following pattern could be used to exploit this: 0: r9 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=1. 1: r8 = map_lookup_elem(...) ; Returns PTR_TO_MAP_VALUE_OR_NULL id=2. 2: r7 = ktime_get_ns() ; Unbound SCALAR_VALUE. 3: r6 = ktime_get_ns() ; Unbound SCALAR_VALUE. 4: if r6 > r7 goto +1 ; No new information about the state ; is derived from this check, thus ; produced verifier states differ only ; in 'insn_idx'. 5: r9 = r8 ; Optionally make r9.id == r8.id. --- checkpoint --- ; Assume is_state_visisted() creates a ; checkpoint here. 6: if r9 == 0 goto <exit> ; Nullness info is propagated to all ; registers with matching ID. 7: r1 = *(u64 *) r8 ; Not always safe. Verifier first visits path 1-7 where r8 is verified to be not null at (6). Later the jump from 4 to 6 is examined. The checkpoint for (6) looks as follows: R8_rD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R9_rwD=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R10=fp0 The current state is: R0=... R6=... R7=... fp-8=... R8=map_value_or_null(id=2,off=0,ks=4,vs=8,imm=0) R9=map_value_or_null(id=1,off=0,ks=4,vs=8,imm=0) R10=fp0 Note that R8 states are byte-to-byte identical, so regsafe() would exit early and skip call to check_ids(), thus ID mapping 2->2 will not be added to 'idmap'. Next, states for R9 are compared: these are not identical and check_ids() is executed, but 'idmap' is empty, so check_ids() adds mapping 2->1 to 'idmap' and returns success. This commit pushes the 'equal' down to register types that don't need check_ids(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20221209135733.28851-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-10can: rcar_canfd: Add multi_channel_irqs to struct rcar_canfd_hw_infoBiju Das1-12/+4
RZ/G2L has separate IRQ lines for tx and error interrupt for each channel whereas R-Car has a combined IRQ line for all the channel specific tx and error interrupts. Add multi_channel_irqs to struct rcar_canfd_hw_info to select the driver to choose between combined and separate irq registration for channel interrupts. This patch also removes enum rcanfd_chip_id and chip_id from both struct rcar_canfd_hw_info, as it is unused. Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/all/20221027082158.95895-6-biju.das.jz@bp.renesas.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-10can: rcar_canfd: Add postdiv to struct rcar_canfd_hw_infoBiju Das1-2/+6
R-Car has a clock divider for CAN FD clock within the IP, whereas it is not available on RZ/G2L. Add postdiv variable to struct rcar_canfd_hw_info to take care of this difference. Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/all/20221027082158.95895-5-biju.das.jz@bp.renesas.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-10can: rcar_canfd: Add shared_global_irqs to struct rcar_canfd_hw_infoBiju Das1-2/+6
RZ/G2L has separate IRQ lines for receive FIFO and global error interrupt whereas R-Car has shared IRQ line. Add shared_global_irqs to struct rcar_canfd_hw_info to select the driver to choose between shared and separate irq registration for global interrupts. Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/all/20221027082158.95895-4-biju.das.jz@bp.renesas.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-10can: rcar_canfd: Add max_channels to struct rcar_canfd_hw_infoBiju Das1-15/+15
R-Car V3U supports a maximum of 8 channels whereas rest of the SoCs support 2 channels. Add max_channels variable to struct rcar_canfd_hw_info to handle this difference. Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/all/20221027082158.95895-3-biju.das.jz@bp.renesas.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-10can: m_can: sort header inclusion alphabeticallyVivek Yadav3-13/+13
Sort header inclusion alphabetically. Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Vivek Yadav <vivek.2311@samsung.com> Link: https://lore.kernel.org/all/20221104051617.21173-1-vivek.2311@samsung.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-10can: rcar_canfd: rcar_canfd_probe: Add struct rcar_canfd_hw_info to driver dataBiju Das1-13/+30
The CAN FD IP found on RZ/G2L SoC has some HW features different to that of R-Car. For example, it has multiple resets and multiple IRQs for global and channel interrupts. Also, it does not have ECC error flag registers and clk post divider present on R-Car. Similarly, R-Car V3U has 8 channels whereas other SoCs has only 2 channels. This patch adds the struct rcar_canfd_hw_info to take care of the HW feature differences and driver data present on both IPs. It also replaces the driver data chip type with struct rcar_canfd_hw_info by moving chip type to it. Whilst started using driver data instead of chip_id for detecting R-Car V3U SoCs. Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/all/20221027082158.95895-2-biju.das.jz@bp.renesas.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-10can: kvaser_usb: kvaser_usb_set_{,data}bittiming(): remove empty lines in ↵Marc Kleine-Budde1-2/+0
variable declaration Fix coding style by removing empty lines in variable declaration. Fixes: 39d3df6b0ea8 ("can: kvaser_usb: Compare requested bittiming parameters with actual parameters in do_set_{,data}_bittiming") Cc: Jimmy Assarsson <extja@kvaser.com> Cc: Anssi Hannula <anssi.hannula@bitwise.fi> Link: https://lore.kernel.org/all/20221031114513.81214-2-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-10can: kvaser_usb: kvaser_usb_set_bittiming(): fix redundant initialization ↵Marc Kleine-Budde1-1/+1
warning for err The variable err is initialized, but the initialized value is Overwritten before it is read. Fix the warning by not initializing the variable err at all. Fixes: 39d3df6b0ea8 ("can: kvaser_usb: Compare requested bittiming parameters with actual parameters in do_set_{,data}_bittiming") Cc: Jimmy Assarsson <extja@kvaser.com> Cc: Anssi Hannula <anssi.hannula@bitwise.fi> Link: https://lore.kernel.org/all/20221031114513.81214-1-mkl@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-12-10Merge tag 'ipsec-next-2022-12-09' of ↵Jakub Kicinski30-512/+2141
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next 2022-12-09 1) Add xfrm packet offload core API. From Leon Romanovsky. 2) Add xfrm packet offload support for mlx5. From Leon Romanovsky and Raed Salem. 3) Fix a typto in a error message. From Colin Ian King. * tag 'ipsec-next-2022-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: (38 commits) xfrm: Fix spelling mistake "oflload" -> "offload" net/mlx5e: Open mlx5 driver to accept IPsec packet offload net/mlx5e: Handle ESN update events net/mlx5e: Handle hardware IPsec limits events net/mlx5e: Update IPsec soft and hard limits net/mlx5e: Store all XFRM SAs in Xarray net/mlx5e: Provide intermediate pointer to access IPsec struct net/mlx5e: Skip IPsec encryption for TX path without matching policy net/mlx5e: Add statistics for Rx/Tx IPsec offloaded flows net/mlx5e: Improve IPsec flow steering autogroup net/mlx5e: Configure IPsec packet offload flow steering net/mlx5e: Use same coding pattern for Rx and Tx flows net/mlx5e: Add XFRM policy offload logic net/mlx5e: Create IPsec policy offload tables net/mlx5e: Generalize creation of default IPsec miss group and rule net/mlx5e: Group IPsec miss handles into separate struct net/mlx5e: Make clear what IPsec rx_err does net/mlx5e: Flatten the IPsec RX add rule path net/mlx5e: Refactor FTE setup code to be more clear net/mlx5e: Move IPsec flow table creation to separate function ... ==================== Link: https://lore.kernel.org/r/20221209093310.4018731-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-10net: devlink: Add missing error check to devlink_resource_put()Gavrilov Ilia1-3/+4
When the resource size changes, the return value of the 'nla_put_u64_64bit' function is not checked. That has been fixed to avoid rechecking at the next step. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Note that this is harmless, we'd error out at the next put(). Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru> Link: https://lore.kernel.org/r/20221208082821.3927937-1-Ilia.Gavrilov@infotecs.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-10skbuff: Introduce slab_build_skb()Kees Cook5-11/+66
syzkaller reported: BUG: KASAN: slab-out-of-bounds in __build_skb_around+0x235/0x340 net/core/skbuff.c:294 Write of size 32 at addr ffff88802aa172c0 by task syz-executor413/5295 For bpf_prog_test_run_skb(), which uses a kmalloc()ed buffer passed to build_skb(). When build_skb() is passed a frag_size of 0, it means the buffer came from kmalloc. In these cases, ksize() is used to find its actual size, but since the allocation may not have been made to that size, actually perform the krealloc() call so that all the associated buffer size checking will be correctly notified (and use the "new" pointer so that compiler hinting works correctly). Split this logic out into a new interface, slab_build_skb(), but leave the original 0 checking for now to catch any stragglers. Reported-by: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com Link: https://groups.google.com/g/syzkaller-bugs/c/UnIKxTtU5-0/m/-wbXinkgAQAJ Fixes: 38931d8989b5 ("mm: Make ksize() a reporting-only function") Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: pepsipu <soopthegoop@gmail.com> Cc: syzbot+fda18eaa8c12534ccb3b@syzkaller.appspotmail.com Cc: Vlastimil Babka <vbabka@suse.cz> Cc: kasan-dev <kasan-dev@googlegroups.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: ast@kernel.org Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Hao Luo <haoluo@google.com> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: jolsa@kernel.org Cc: KP Singh <kpsingh@kernel.org> Cc: martin.lau@linux.dev Cc: Stanislav Fomichev <sdf@google.com> Cc: song@kernel.org Cc: Yonghong Song <yhs@fb.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221208060256.give.994-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-10net: bcmgenet: Remove the unused functionJiapeng Chong1-18/+0
The function dmadesc_get_addr() is defined in the bcmgenet.c file, but not called elsewhere, so remove this unused function. drivers/net/ethernet/broadcom/genet/bcmgenet.c:120:26: warning: unused function 'dmadesc_get_addr'. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3401 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20221209033723.32452-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-10Merge branch 'mptcp-miscellaneous-cleanup'Jakub Kicinski2-7/+7
Mat Martineau says: ==================== mptcp: Miscellaneous cleanup Two code cleanup patches for the 6.2 merge window that don't change behavior: Patch 1 makes proper use of nlmsg_free(), as suggested by Jakub while reviewing f8c9dfbd875b ("mptcp: add pm listener events"). Patch 2 clarifies success status in a few mptcp functions, which prevents some smatch false positives. ==================== Link: https://lore.kernel.org/r/20221209004431.143701-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-10mptcp: return 0 instead of 'err' varMatthieu Baerts2-3/+3
When 'err' is 0, it looks clearer to return '0' instead of the variable called 'err'. The behaviour is then not modified, just a clearer code. By doing this, we can also avoid false positive smatch warnings like this one: net/mptcp/pm_netlink.c:1169 mptcp_pm_parse_pm_addr_attr() warn: missing error code? 'err' Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-10mptcp: use nlmsg_free instead of kfree_skbGeliang Tang1-4/+4
Use nlmsg_free() instead of kfree_skb() in pm_netlink.c. The SKB's have been created by nlmsg_new(). The proper cleaning way should then be done with nlmsg_free(). For the moment, nlmsg_free() is simply calling kfree_skb() so we don't change the behaviour here. Suggested-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-10Merge tag 'mlx5-updates-2022-12-08' of ↵Jakub Kicinski33-195/+1373
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2022-12-08 1) Support range match action in SW steering Yevgeny Kliteynik says: ======================= The following patch series adds support for a range match action in SW Steering. SW steering is able to match only on the exact values of the packet fields, as requested by the user: the user provides mask for the fields that are of interest, and the exact values to be matched on when the traffic is handled. The following patch series add new type of action - Range Match, where the user provides a field to be matched on and a range of values (min to max) that will be considered as hit. There are several new notions that were implemented in order to support Range Match: - MATCH_RANGES Steering Table Entry (STE): the new STE type that allows matching the packets' fields on the range of values instead of a specific value. - Match Definer: this is a general FW object that defines which fields in the packet will be referenced by the mask and tag of each STE. Match definer ID is part of STE fields, and it defines how the HW needs to interpret the STE's mask/tag values. Till now SW steering used the definers that were managed by FW and implemented the STE layout as described by the HW spec. Now that we're adding a new type of STE, SW steering needs to also be able to define this new STE's layout, and this is do ======================= 2) From OZ add support for meter mtu offload 2.1: Refactor the code to allow both metering and range post actions as a pre-step for adding police mtu offload support. 2.2: Instantiate mtu green/red flow tables with a single match-all rule. Add the green/red actions to the hit/miss table accordingly 2.3: Initialize the meter object with the TC police mtu parameter. Use the hardware range match action feature. 3) From MaorD, support routes with more than 2 nexthops in multipath 4) Michael and Or, improve and extend vport representor counters. * tag 'mlx5-updates-2022-12-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Expose steering dropped packets counter net/mlx5: Refactor and expand rep vport stat group net/mlx5e: multipath, support routes with more than 2 nexthops net/mlx5e: TC, add support for meter mtu offload net/mlx5e: meter, add mtu post meter tables net/mlx5e: meter, refactor to allow multiple post meter tables net/mlx5: DR, Add support for range match action net/mlx5: DR, Add function that tells if STE miss addr has been initialized net/mlx5: DR, Some refactoring of miss address handling net/mlx5: DR, Manage definers with refcounts net/mlx5: DR, Handle FT action in a separate function net/mlx5: DR, Rework is_fw_table function net/mlx5: DR, Add functions to create/destroy MATCH_DEFINER general object net/mlx5: fs, add match on ranges API net/mlx5: mlx5_ifc updates for MATCH_DEFINER general object ==================== Link: https://lore.kernel.org/r/20221209001420.142794-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-10Merge branch '100GbE' of ↵Jakub Kicinski5-475/+463
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2022-12-08 (ice) Jacob Keller says: This series of patches primarily consists of changes to fix some corner cases that can cause Tx timestamp failures. The issues were discovered and reported by Siddaraju DH and primarily affect E822 hardware, though this series also includes some improvements that affect E810 hardware as well. The primary issue is regarding the way that E822 determines when to generate timestamp interrupts. If the driver reads timestamp indexes which do not have a valid timestamp, the E822 interrupt tracking logic can get stuck. This is due to the way that E822 hardware tracks timestamp index reads internally. I was previously unaware of this behavior as it is significantly different in E810 hardware. Most of the fixes target refactors to ensure that the ice driver does not read timestamp indexes which are not valid on E822 hardware. This is done by using the Tx timestamp ready bitmap register from the PHY. This register indicates what timestamp indexes have outstanding timestamps waiting to be captured. Care must be taken in all cases where we read the timestamp registers, and thus all flows which might have read these registers are refactored. The ice_ptp_tx_tstamp function is modified to consolidate as much of the logic relating to these registers as possible. It now handles discarding stale timestamps which are old or which occurred after a PHC time update. This replaces previously standalone thread functions like the periodic work function and the ice_ptp_flush_tx_tracker function. In addition, some minor cleanups noticed while writing these refactors are included. The remaining patches refactor the E822 implementation to remove the "bypass" mode for timestamps. The E822 hardware has the ability to provide a more precise timestamp by making use of measurements of the precise way that packets flow through the hardware pipeline. These measurements are known as "Vernier" calibration. The "bypass" mode disables many of these measurements in favor of a faster start up time for Tx and Rx timestamping. Instead, once these measurements were captured, the driver tries to reconfigure the PHY to enable the vernier calibrations. Unfortunately this recalibration does not work. Testing indicates that the PHY simply remains in bypass mode without the increased timestamp precision. Remove the attempt at recalibration and always use vernier mode. This has one disadvantage that Tx and Rx timestamps cannot begin until after at least one packet of that type goes through the hardware pipeline. Because of this, further refactor the driver to separate Tx and Rx vernier calibration. Complete the Tx and Rx independently, enabling the appropriate type of timestamp as soon as the relevant packet has traversed the hardware pipeline. This was reported by Milena Olech. Note that although these might be considered "bug fixes", the required changes in order to appropriately resolve these issues is large. Thus it does not feel suitable to send this series to net. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: reschedule ice_ptp_wait_for_offset_valid during reset ice: make Tx and Rx vernier offset calibration independent ice: only check set bits in ice_ptp_flush_tx_tracker ice: handle flushing stale Tx timestamps in ice_ptp_tx_tstamp ice: cleanup allocations in ice_ptp_alloc_tx_tracker ice: protect init and calibrating check in ice_ptp_request_ts ice: synchronize the misc IRQ when tearing down Tx tracker ice: check Tx timestamp memory register for ready timestamps ice: handle discarding old Tx requests in ice_ptp_tx_tstamp ice: always call ice_ptp_link_change and make it void ice: fix misuse of "link err" with "link status" ice: Reset TS memory for all quads ice: Remove the E822 vernier "bypass" logic ice: Use more generic names for ice_ptp_tx fields ==================== Link: https://lore.kernel.org/r/20221208213932.1274143-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09docs/bpf: Add documentation for BPF_MAP_TYPE_SK_STORAGEDonald Hunter1-0/+155
Add documentation for the BPF_MAP_TYPE_SK_STORAGE including kernel version introduced, usage and examples. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221209112401.69319-1-donald.hunter@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-09net: openvswitch: Add support to count upcall packetswangchuanlei4-0/+121
Add support to count upall packets, when kmod of openvswitch upcall to count the number of packets for upcall succeed and failed, which is a better way to see how many packets upcalled on every interfaces. Signed-off-by: wangchuanlei <wangchuanlei@inspur.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09rhashtable: Allow rhashtable to be used from irq-safe contextsTejun Heo2-31/+46
rhashtable currently only does bh-safe synchronization making it impossible to use from irq-safe contexts. Switch it to use irq-safe synchronization to remove the restriction. v2: Update the lock functions to return the ulong flags value and unlock functions to take the value directly instead of passing around the pointer. Suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Vernet <dvernet@meta.com> Acked-by: Josh Don <joshdon@google.com> Acked-by: Hao Luo <haoluo@google.com> Acked-by: Barret Rhoden <brho@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09Merge branch 'net-sched-retpoline'David S. Miller38-72/+391
Pedro Tammela says: ==================== net/sched: retpoline wrappers for tc In tc all qdics, classifiers and actions can be compiled as modules. This results today in indirect calls in all transitions in the tc hierarchy. Due to CONFIG_RETPOLINE, CPUs with mitigations=on might pay an extra cost on indirect calls. For newer Intel cpus with IBRS the extra cost is nonexistent, but AMD Zen cpus and older x86 cpus still go through the retpoline thunk. Known built-in symbols can be optimized into direct calls, thus avoiding the retpoline thunk. So far, tc has not been leveraging this build information and leaving out a performance optimization for some CPUs. In this series we wire up 'tcf_classify()' and 'tcf_action_exec()' with direct calls when known modules are compiled as built-in as an opt-in optimization. We measured these changes in one AMD Zen 4 cpu (Retpoline), one AMD Zen 3 cpu (Retpoline), one Intel 10th Gen CPU (IBRS), one Intel 3rd Gen cpu (Retpoline) and one Intel Xeon CPU (IBRS) using pktgen with 64b udp packets. Our test setup is a dummy device with clsact and matchall in a kernel compiled with every tc module as built-in. We observed a 3-8% speed up on the retpoline CPUs, when going through 1 tc filter, and a 60-100% speed up when going through 100 filters. For the IBRS cpus we observed a 1-2% degradation in both scenarios, we believe the extra branches check introduced a small overhead therefore we added a static key that bypasses the wrapper on kernels not using the retpoline mitigation, but compiled with CONFIG_RETPOLINE. 1 filter: CPU | before (pps) | after (pps) | diff R9 7950X | 5914980 | 6380227 | +7.8% R9 5950X | 4237838 | 4412241 | +4.1% R9 5950X | 4265287 | 4413757 | +3.4% [*] i5-3337U | 1580565 | 1682406 | +6.4% i5-10210U | 3006074 | 3006857 | +0.0% i5-10210U | 3160245 | 3179945 | +0.6% [*] Xeon 6230R | 3196906 | 3197059 | +0.0% Xeon 6230R | 3190392 | 3196153 | +0.01% [*] 100 filters: CPU | before (pps) | after (pps) | diff R9 7950X | 373598 | 820396 | +119.59% R9 5950X | 313469 | 633303 | +102.03% R9 5950X | 313797 | 633150 | +101.77% [*] i5-3337U | 127454 | 211210 | +65.71% i5-10210U | 389259 | 381765 | -1.9% i5-10210U | 408812 | 412730 | +0.9% [*] Xeon 6230R | 415420 | 406612 | -2.1% Xeon 6230R | 416705 | 405869 | -2.6% [*] [*] In these tests we ran pktgen with clone set to 1000. On the 7950x system we also tested the impact of filters if iteration order placement varied, first by compiling a kernel with the filter under test being the first one in the static iteration and then repeating it with being last (of 15 classifiers existing today). We saw a difference of +0.5-1% in pps between being the first in the iteration vs being the last. Therefore we order the classifiers and actions according to relevance per our current thinking. v5->v6: - Address Eric Dumazet suggestions v4->v5: - Rebase v3->v4: - Address Eric Dumazet suggestions v2->v3: - Address suggestions by Jakub, Paolo and Eric - Dropped RFC tag (I forgot to add it on v2) v1->v2: - Fix build errors found by the bots - Address Kuniyuki Iwashima suggestions ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09net/sched: avoid indirect classify functions on retpoline kernelsPedro Tammela14-25/+49
Expose the necessary tc classifier functions and wire up cls_api to use direct calls in retpoline kernels. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09net/sched: avoid indirect act functions on retpoline kernelsPedro Tammela21-42/+81
Expose the necessary tc act functions and wire up act_api to use direct calls in retpoline kernels. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09net/sched: add retpoline wrapper for tcPedro Tammela2-0/+256
On kernels using retpoline as a spectrev2 mitigation, optimize actions and filters that are compiled as built-ins into a direct call. On subsequent patches we expose the classifiers and actions functions and wire up the wrapper into tc. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09net/sched: move struct action_ops definition out of ifdefPedro Tammela1-5/+5
The type definition should be visible even in configurations not using CONFIG_NET_CLS_ACT. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-09xfrm: Fix spelling mistake "oflload" -> "offload"Colin Ian King1-1/+1
There is a spelling mistake in a NL_SET_ERR_MSG message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-09Merge branch 'mlx5 IPsec packet offload support (Part II)'Steffen Klassert12-118/+1047
Leon Romanovsky says: ============ This is second part with implementation of packet offload. ============ Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-12-09net: phy: remove redundant "depends on" linesRandy Dunlap1-3/+0
Delete a few lines of "depends on PHYLIB" since they are inside an "if PHYLIB / endif # PHYLIB" block, i.e., they are redundant and the other 50+ drivers there don't use "depends on PHYLIB" since it is not needed. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Link: https://lore.kernel.org/r/20221207044257.30036-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09net_tstamp: add SOF_TIMESTAMPING_OPT_ID_TCPWillem de Bruijn5-6/+45
Add an option to initialize SOF_TIMESTAMPING_OPT_ID for TCP from write_seq sockets instead of snd_una. This should have been the behavior from the start. Because processes may now exist that rely on the established behavior, do not change behavior of the existing option, but add the right behavior with a new flag. It is encouraged to always set SOF_TIMESTAMPING_OPT_ID_TCP on stream sockets along with the existing SOF_TIMESTAMPING_OPT_ID. Intuitively the contract is that the counter is zero after the setsockopt, so that the next write N results in a notification for the last byte N - 1. On idle sockets snd_una == write_seq and this holds for both. But on sockets with data in transmission, snd_una records the unacked offset in the stream. This depends on the ACK response from the peer. A process cannot learn this in a race free manner (ioctl SIOCOUTQ is one racy approach). write_seq records the offset at the last byte written by the process. This is a better starting point. It matches the intuitive contract in all circumstances, unaffected by external behavior. The new timestamp flag necessitates increasing sk_tsflags to 32 bits. Move the field in struct sock to avoid growing the socket (for some common CONFIG variants). The UAPI interface so_timestamping.flags is already int, so 32 bits wide. Reported-by: Sotirios Delimanolis <sotodel@meta.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20221207143701.29861-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09Merge branch 'fix-possible-deadlock-during-wed-attach'Jakub Kicinski2-10/+23
Lorenzo Bianconi says: ==================== fix possible deadlock during WED attach Fix a possible deadlock in mtk_wed_attach if mtk_wed_wo_init routine fails. Check wo pointer is properly allocated before running mtk_wed_wo_reset() and mtk_wed_wo_deinit(). ==================== Link: https://lore.kernel.org/r/cover.1670421354.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09net: ethernet: mtk_wed: fix possible deadlock if mtk_wed_wo_init failsLorenzo Bianconi1-5/+12
Introduce __mtk_wed_detach() in order to avoid a deadlock in mtk_wed_attach routine if mtk_wed_wo_init fails since both mtk_wed_attach and mtk_wed_detach run holding hw_lock mutex. Fixes: 4c5de09eb0d0 ("net: ethernet: mtk_wed: add configure wed wo support") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09net: ethernet: mtk_wed: fix some possible NULL pointer dereferencesLorenzo Bianconi2-5/+11
Fix possible NULL pointer dereference in mtk_wed_detach routine checking wo pointer is properly allocated before running mtk_wed_wo_reset() and mtk_wed_wo_deinit(). Even if it is just a theoretical issue at the moment check wo pointer is not NULL in mtk_wed_mcu_msg_update. Moreover, honor mtk_wed_mcu_send_msg return value in mtk_wed_wo_reset() Fixes: 799684448e3e ("net: ethernet: mtk_wed: introduce wed wo support") Fixes: 4c5de09eb0d0 ("net: ethernet: mtk_wed: add configure wed wo support") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09nfp: Fix spelling mistake "tha" -> "the"Colin Ian King1-1/+1
There is a spelling mistake in a nn_dp_warn message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20221207094312.2281493-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09selftests: net: Fix O=dir buildsBjörn Töpel1-3/+3
The BPF Makefile in net/bpf did incorrect path substitution for O=dir builds, e.g. make O=/tmp/kselftest headers make O=/tmp/kselftest -C tools/testing/selftests would fail in selftest builds [1] net/ with clang-16: error: no such file or directory: 'kselftest/net/bpf/nat6to4.c' clang-16: error: no input files Add a pattern prerequisite and an order-only-prerequisite (for creating the directory), to resolve the issue. [1] https://lore.kernel.org/all/202212060009.34CkQmCN-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Fixes: 837a3d66d698 ("selftests: net: Add cross-compilation support for BPF programs") Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20221206102838.272584-1-bjorn@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09Merge branch 'mlxsw-add-spectrum-1-ip6gre-support'Jakub Kicinski5-131/+138
Petr Machata says: ==================== mlxsw: Add Spectrum-1 ip6gre support Ido Schimmel writes: Currently, mlxsw only supports ip6gre offload on Spectrum-2 and newer ASICs. Spectrum-1 can also offload ip6gre tunnels, but it needs double entry router interfaces (RIFs) for the RIFs representing these tunnels. In addition, the RIF index needs to be even. This is handled in patches #1-#3. The implementation can otherwise be shared between all Spectrum generations. This is handled in patches #4-#5. Patch #6 moves a mlxsw ip6gre selftest to a shared directory, as ip6gre is no longer only supported on Spectrum-2 and newer ASICs. This work is motivated by users that require multiple GRE tunnels that all share the same underlay VRF. Currently, mlxsw only supports decapsulation based on the underlay destination IP (i.e., not taking the GRE key into account), so users need to configure these tunnels with different source IPs and IPv6 addresses are easier to spare than IPv4. Tested using existing ip6gre forwarding selftests. ==================== Link: https://lore.kernel.org/r/cover.1670414573.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09selftests: mlxsw: Move IPv6 decap_error test to shared directoryIdo Schimmel1-1/+1
Now that Spectrum-1 gained ip6gre support we can move the test out of the Spectrum-2 directory. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09mlxsw: spectrum_ipip: Add Spectrum-1 ip6gre supportIdo Schimmel1-75/+8
As explained in the previous patch, the existing Spectrum-2 ip6gre implementation can be reused for Spectrum-1. Change the Spectrum-1 ip6gre operations structure to use the common operations. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09mlxsw: spectrum_ipip: Rename Spectrum-2 ip6gre operationsIdo Schimmel1-47/+47
There are two main differences between Spectrum-1 and newer ASICs in terms of IP-in-IP support: 1. In Spectrum-1, RIFs representing ip6gre tunnels require two entries in the RIF table. 2. In Spectrum-2 and newer ASICs, packets ingress the underlay (during encapsulation) and egress the underlay (during decapsulation) via a special generic loopback RIF. The first difference was handled in previous patches by adding the 'double_rif_entry' field to the Spectrum-1 operations structure of ip6gre RIFs. The second difference is handled during RIF creation, by only creating a generic loopback RIF in Spectrum-2 and newer ASICs. Therefore, the ip6gre operations can be shared between Spectrum-1 and newer ASIC in a similar fashion to how the ipgre operations are shared. Rename the operations to not be Spectrum-2 specific and move them earlier in the file so that they could later be used for Spectrum-1. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09mlxsw: spectrum_router: Add support for double entry RIFsIdo Schimmel3-1/+5
In Spectrum-1, loopback router interfaces (RIFs) used for IP-in-IP encapsulation with an IPv6 underlay require two RIF entries and the RIF index must be even. Prepare for this change by extending the RIF parameters structure with a 'double_entry' field that indicates if the RIF being created requires two RIF entries or not. Only set it for RIFs representing ip6gre tunnels in Spectrum-1. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09mlxsw: spectrum_router: Parametrize RIF allocation sizeIdo Schimmel1-14/+27
Currently, each router interface (RIF) consumes one entry in the RIFs table. This is going to change in subsequent patches where some RIFs will consume two table entries. Prepare for this change by parametrizing the RIF allocation size. For now, always pass '1'. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09mlxsw: spectrum_router: Use gen_pool for RIF index allocationIdo Schimmel2-10/+67
Currently, each router interface (RIF) consumes one entry in the RIFs table and there are no alignment constraints. This is going to change in subsequent patches where some RIFs will consume two table entries and their indexes will need to be aligned to the allocation size (even). Prepare for this change by converting the RIF index allocation to use gen_pool with the 'gen_pool_first_fit_order_align' algorithm. No Kconfig changes necessary as mlxsw already selects 'GENERIC_ALLOCATOR'. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-09Merge branch 'Dynptr refactorings'Alexei Starovoitov13-202/+369
Kumar Kartikeya Dwivedi says: ==================== This is part 1 of https://lore.kernel.org/bpf/20221018135920.726360-1-memxor@gmail.com. This thread also gives some background on why the refactor is being done: https://lore.kernel.org/bpf/CAEf4Bzb4beTHgVo+G+jehSj8oCeAjRbRcm6MRe=Gr+cajRBwEw@mail.gmail.com As requested in patch 6 by Alexei, it only includes patches which refactors the code, on top of which further fixes will be made in part 2. The refactor itself fixes another issue as a side effect. No functional change is intended (except a few modified log messages). Changelog: ---------- v1 -> v2 v1: https://lore.kernel.org/bpf/20221115000130.1967465-1-memxor@gmail.com * Address feedback from Joanne and David, add acks Fixes v1 -> v1 Fixes v1: https://lore.kernel.org/bpf/20221018135920.726360-1-memxor@gmail.com * Collect acks from Joanne and David * Fix misc nits pointed out by Joanne, David * Split move of reg->off alignment check for dynptr into separate change (Alexei) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-09selftests/bpf: Add test for dynptr reinit in user_ringbuf callbackKumar Kartikeya Dwivedi2-8/+45
The original support for bpf_user_ringbuf_drain callbacks simply short-circuited checks for the dynptr state, allowing users to pass PTR_TO_DYNPTR (now CONST_PTR_TO_DYNPTR) to helpers that initialize a dynptr. This bug would have also surfaced with other dynptr helpers in the future that changed dynptr view or modified it in some way. Include test cases for all cases, i.e. both bpf_dynptr_from_mem and bpf_ringbuf_reserve_dynptr, and ensure verifier rejects both of them. Without the fix, both of these programs load and pass verification. While at it, remove sys_nanosleep target from failure cases' SEC definition, as there is no such tracepoint. Acked-by: David Vernet <void@manifault.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-09bpf: Use memmove for bpf_dynptr_{read,write}Kumar Kartikeya Dwivedi1-2/+10
It may happen that destination buffer memory overlaps with memory dynptr points to. Hence, we must use memmove to correctly copy from dynptr to destination buffer, or source buffer to dynptr. This actually isn't a problem right now, as memcpy implementation falls back to memmove on detecting overlap and warns about it, but we shouldn't be relying on that. Acked-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-09bpf: Move PTR_TO_STACK alignment check to process_dynptr_funcKumar Kartikeya Dwivedi1-5/+8
After previous commit, we are minimizing helper specific assumptions from check_func_arg_reg_off, making it generic, and offloading checks for a specific argument type to their respective functions called after check_func_arg_reg_off has been called. This allows relying on a consistent set of guarantees after that call and then relying on them in code that deals with registers for each argument type later. This is in line with how process_spin_lock, process_timer_func, process_kptr_func check reg->var_off to be constant. The same reasoning is used here to move the alignment check into process_dynptr_func. Note that it also needs to check for constant var_off, and accumulate the constant var_off when computing the spi in get_spi, but that fix will come in later changes. Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-09bpf: Rework check_func_arg_reg_offKumar Kartikeya Dwivedi3-27/+40
While check_func_arg_reg_off is the place which performs generic checks needed by various candidates of reg->type, there is some handling for special cases, like ARG_PTR_TO_DYNPTR, OBJ_RELEASE, and ARG_PTR_TO_RINGBUF_MEM. This commit aims to streamline these special cases and instead leave other things up to argument type specific code to handle. The function will be restrictive by default, and cover all possible cases when OBJ_RELEASE is set, without having to update the function again (and missing to do that being a bug). This is done primarily for two reasons: associating back reg->type to its argument leaves room for the list getting out of sync when a new reg->type is supported by an arg_type. The other case is ARG_PTR_TO_RINGBUF_MEM. The problem there is something we already handle, whenever a release argument is expected, it should be passed as the pointer that was received from the acquire function. Hence zero fixed and variable offset. There is nothing special about ARG_PTR_TO_RINGBUF_MEM, where technically its target register type PTR_TO_MEM | MEM_RINGBUF can already be passed with non-zero offset to other helper functions, which makes sense. Hence, lift the arg_type_is_release check for reg->off and cover all possible register types, instead of duplicating the same kind of check twice for current OBJ_RELEASE arg_types (alloc_mem and ptr_to_btf_id). For the release argument, arg_type_is_dynptr is the special case, where we go to actual object being freed through the dynptr, so the offset of the pointer still needs to allow fixed and variable offset and process_dynptr_func will verify them later for the release argument case as well. This is not specific to ARG_PTR_TO_DYNPTR though, we will need to make this exception for any future object on the stack that needs to be released. In this sense, PTR_TO_STACK as a candidate for object on stack argument is a special case for release offset checks, and they need to be done by the helper releasing the object on stack. Since the check has been lifted above all register type checks, remove the duplicated check that is being done for PTR_TO_BTF_ID. Acked-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-09bpf: Rework process_dynptr_funcKumar Kartikeya Dwivedi7-79/+191
Recently, user ringbuf support introduced a PTR_TO_DYNPTR register type for use in callback state, because in case of user ringbuf helpers, there is no dynptr on the stack that is passed into the callback. To reflect such a state, a special register type was created. However, some checks have been bypassed incorrectly during the addition of this feature. First, for arg_type with MEM_UNINIT flag which initialize a dynptr, they must be rejected for such register type. Secondly, in the future, there are plans to add dynptr helpers that operate on the dynptr itself and may change its offset and other properties. In all of these cases, PTR_TO_DYNPTR shouldn't be allowed to be passed to such helpers, however the current code simply returns 0. The rejection for helpers that release the dynptr is already handled. For fixing this, we take a step back and rework existing code in a way that will allow fitting in all classes of helpers and have a coherent model for dealing with the variety of use cases in which dynptr is used. First, for ARG_PTR_TO_DYNPTR, it can either be set alone or together with a DYNPTR_TYPE_* constant that denotes the only type it accepts. Next, helpers which initialize a dynptr use MEM_UNINIT to indicate this fact. To make the distinction clear, use MEM_RDONLY flag to indicate that the helper only operates on the memory pointed to by the dynptr, not the dynptr itself. In C parlance, it would be equivalent to taking the dynptr as a point to const argument. When either of these flags are not present, the helper is allowed to mutate both the dynptr itself and also the memory it points to. Currently, the read only status of the memory is not tracked in the dynptr, but it would be trivial to add this support inside dynptr state of the register. With these changes and renaming PTR_TO_DYNPTR to CONST_PTR_TO_DYNPTR to better reflect its usage, it can no longer be passed to helpers that initialize a dynptr, i.e. bpf_dynptr_from_mem, bpf_ringbuf_reserve_dynptr. A note to reviewers is that in code that does mark_stack_slots_dynptr, and unmark_stack_slots_dynptr, we implicitly rely on the fact that PTR_TO_STACK reg is the only case that can reach that code path, as one cannot pass CONST_PTR_TO_DYNPTR to helpers that don't set MEM_RDONLY. In both cases such helpers won't be setting that flag. The next patch will add a couple of selftest cases to make sure this doesn't break. Fixes: 205715673844 ("bpf: Add bpf_user_ringbuf_drain() helper") Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>