2022-07-28  net/mlx5: DR, Add support for flow metering ASO  (Yevgeny Kliteynik, 6 files changed, -0/+227)
Add support for the ASO action of type flow metering on devices that support STEv1. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Hamdan Igbaria <hamdani@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5e: Fix wrong use of skb_tcp_all_headers() with encapsulation  (Gal Pressman, 1 file changed, -1/+1)
Use skb_inner_tcp_all_headers() instead of skb_tcp_all_headers() when transmitting an encapsulated packet in mlx5e_tx_get_gso_ihs(). Fixes: 504148fedb85 ("net: add skb_[inner_]tcp_all_headers helpers") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5: Fix driver use of uninitialized timeout  (Shay Drory, 3 files changed, -10/+4)
Currently, the driver sets default values for all timeouts during function setup. The offending commit uses a timeout before function setup, meaning the timeout is 0 (or garbage), since no value has been set. This may result in a failure to probe the driver:

  mlx5_function_setup:1034:(pid 69850): Firmware over 4294967296 MS in pre-initializing state, aborting
  probe_one:1591:(pid 69850): mlx5_init_one failed with error code -16

Hence, set default values for the timeouts during tout_init(). Fixes: 37ca95e62ee2 ("net/mlx5: Increase FW pre-init timeout for health recovery") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5: DR, Fix SMFS steering info dump format  (Yevgeny Kliteynik, 1 file changed, -5/+8)
Fix several issues in SMFS steering info dump:

 - Fix outdated macro value for matcher mask in the SMFS debug dump format. The existing value denotes the old format of the matcher mask, as it was used during the early stages of development, and it results in wrong parsing by the steering dump parser - wrong fields are shown in the parsed output.

 - Add the missing destination table to the dumped action. The missing dest table handle breaks the ability to associate between the "go to table" action and the actual table in the steering info.

Fixes: 9222f0b27da2 ("net/mlx5: DR, Add support for dumping steering info") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5: Adjust log_max_qp to be 18 at most  (Maher Sanalla, 1 file changed, -1/+1)
The cited commit limited log_max_qp to be 17 due to FW capabilities. Recently, it turned out that there are old FW versions that supported more than 17, so the cited commit caused a degradation. Thus, set the maximum log_max_qp back to 18 as it was before the cited commit. Fixes: 7f839965b2d7 ("net/mlx5: Update log_max_qp value to be 17 at most") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5e: Modify slow path rules to go to slow fdb  (Vlad Buslov, 1 file changed, -6/+17)
While extending the available range of supported chains/prios, the referenced commit also modified slow path rules to go to the FT chain instead of the actual slow FDB. However, neither of the existing users of the MLX5_ATTR_FLAG_SLOW_PATH flag (tunnel encap entries with invalid encap and flows with trap action) needs to match on the FT chain. After bridge offload was implemented, packets of such flows can also be matched by bridge priority tables, which is undesirable. Restore the slow path flows implementation to redirect packets to the slow_fdb. Fixes: 278d51f24330 ("net/mlx5: E-Switch, Increase number of chains and priorities") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5e: Fix calculations related to max MPWQE size  (Maxim Mikityanskiy, 1 file changed, -9/+10)
Before commit 76c31e5f7585 ("net/mlx5e: Use FW limitation for max MPW WQEBBs"), the maximum size of MPWQE in WQEBBs was hardcoded as a driver constant. That commit started using the firmware capability that can further limit the size; however, it unintentionally changed a few things:

 1. The calculation of MLX5E_MAX_KLM_PER_WQE used the size in DS, which was replaced by the size in WQEBBs, making the resulting value 4 times smaller.

 2. MLX5E_TX_MPW_MAX_WQEBBS used to be aligned to the cache line size (either 64 or 128 bytes, i.e. 1 or 2 WQEBBs), but it's no longer the case if the firmware capability is smaller than the driver maximum.

Fix both issues by using the correct units for MLX5E_MAX_KLM_PER_WQE and by aligning mlx5e_get_sw_max_sq_mpw_wqebbs after taking the minimum. Besides fixing the arithmetic in the calculation of MLX5E_MAX_KLM_PER_WQE, also use appropriate constants: `size of BSF * num of DS per WQEBB * number of WQEBBs` (the calculation before the blamed commit) doesn't make much sense for calculating the WQE size in bytes, so just use `size of WQEBB * number of WQEBBs`. While at it, replace the types that hold the number of WQEBBs by u8. These values don't exceed 16, and it allows filling holes in two structs. Fixes: 76c31e5f7585 ("net/mlx5e: Use FW limitation for max MPW WQEBBs") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5e: xsk: Account for XSK RQ UMRs when calculating ICOSQ size  (Maxim Mikityanskiy, 1 file changed, -0/+12)
ICOSQ is used to post UMR WQEs for both the regular RQ and the XSK RQ. However, space in ICOSQ is reserved only for the regular RQ, which may cause ICOSQ overflows when using XSK (the highest risk is when activating channels). This commit fixes the issue by reserving space for XSK UMR WQEs as well. As XSK may be enabled without restarting the channel and recreating the ICOSQ, this space is reserved unconditionally. Fixes: db05815b36cb ("net/mlx5e: Add XSK zero-copy support") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5e: Fix the value of MLX5E_MAX_RQ_NUM_MTTS  (Maxim Mikityanskiy, 1 file changed, -1/+1)
MLX5E_MAX_RQ_NUM_MTTS should be the maximum value, so that MLX5_MTT_OCTW(MLX5E_MAX_RQ_NUM_MTTS) fits into u16. The current value of 1 << 17 results in MLX5_MTT_OCTW(1 << 17) = 1 << 16, which doesn't fit into u16. This commit replaces it with the maximum value that still fits u16. Fixes: 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5e: TC, Fix post_act to not match on in_port metadata  (Maor Dickman, 1 file changed, -0/+1)
The cited commit changed CT to use the multi table actions post act infrastructure instead of its own post act infrastructure. This broke decap during VF tunnel offload (stack devices) with CT, due to a wrong match on in_port metadata in the post act table. The change only broke VF tunnel offload because that path modifies the packet's in_port metadata to be the VF metadata, and this isn't propagated to the post act creation. Fix by modifying post act rules to match only on fte_id and not on in_port metadata, which isn't needed. Fixes: a81283263bb0 ("net/mlx5e: Use multi table support for CT and sample actions") Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  net/mlx5e: Remove WARN_ON when trying to offload an unsupported TLS cipher/version  (Gal Pressman, 1 file changed, -1/+1)
The driver reports whether TX/RX TLS device offloads are supported, but not which ciphers/versions; these should be handled by returning -EOPNOTSUPP when .tls_dev_add() is called. Remove the WARN_ON kernel trace when the driver gets a request to offload a cipher/version that is not supported, as this is expected. Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28  nouveau/svm: Fix to migrate all requested pages  (Alistair Popple, 1 file changed, -1/+5)
Users may request that pages from an OpenCL SVM allocation be migrated to the GPU with clEnqueueSVMMigrateMem(). In Nouveau this will call into nouveau_dmem_migrate_vma() to do the migration. If the total range to be migrated exceeds SG_MAX_SINGLE_ALLOC, the pages will be migrated in chunks of size SG_MAX_SINGLE_ALLOC. However, a typo in updating the starting address means that only the first chunk will get migrated. Fix the calculation so that the entire range will get migrated if possible. Signed-off-by: Alistair Popple <apopple@nvidia.com> Fixes: e3d8b0890469 ("drm/nouveau/svm: map pages after migration") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Lyude Paul <lyude@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220720062745.960701-1-apopple@nvidia.com Cc: <stable@vger.kernel.org> # v5.8+
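For illustration only, a minimal hedged sketch of the chunked-migration pattern described above. The helper, the constant, and the loop body are hypothetical stand-ins, not the Nouveau code; the point is simply that the start address must advance past each migrated chunk, otherwise every iteration re-processes the first chunk.

  #include <linux/minmax.h>
  #include <linux/mm.h>

  #define EXAMPLE_MAX_PAGES 512        /* stand-in for SG_MAX_SINGLE_ALLOC */

  /* Hypothetical helper: migrate the pages covering [addr, next). */
  static int example_migrate_chunk(unsigned long addr, unsigned long next)
  {
          /* ... set up and run the migration for this chunk ... */
          return 0;
  }

  /* Migrate [start, end) in chunks of at most EXAMPLE_MAX_PAGES pages. */
  static int example_migrate_range(unsigned long start, unsigned long end)
  {
          unsigned long addr = start, next;
          int ret;

          while (addr < end) {
                  next = min(end, addr + (EXAMPLE_MAX_PAGES << PAGE_SHIFT));
                  ret = example_migrate_chunk(addr, next);
                  if (ret)
                          return ret;
                  addr = next;    /* advance by the range actually covered */
          }
          return 0;
  }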
2022-07-28  Documentation: KUnit: Fix example with compilation error  (Maíra Canal, 1 file changed, -1/+1)
The Parameterized Testing example contains a compilation error, as the signature of the description helper function is void (*)(const struct sha1_test_case *, char *), while the struct in the example is non-const. Clang warns:

  error: initialization of ‘void (*)(struct sha1_test_case *, char *)’ from incompatible pointer type ‘void (*)(const struct sha1_test_case *, char *)’ [-Werror=incompatible-pointer-types]
     33 | KUNIT_ARRAY_PARAM(sha1, cases, case_to_desc);
        |                                ^~~~~~~~~~~~
  ../include/kunit/test.h:1339:70: note: in definition of macro ‘KUNIT_ARRAY_PARAM’
   1339 |   void (*__get_desc)(typeof(__next), char *) = get_desc; \

Signed-off-by: Maíra Canal <mairacanal@riseup.net> Reviewed-by: Daniel Latypov <dlatypov@google.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
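For reference, a hedged sketch of how the types have to line up in such an example, loosely based on the KUnit documentation's sha1 case. This is not the exact documentation patch, and making the array const is only one way to resolve the mismatch; the requirement is that the array element type and the description helper's parameter type agree, const qualification included.

  #include <kunit/test.h>
  #include <linux/string.h>

  struct sha1_test_case {
          const char *str;
          const char *sha1;
  };

  /* The element type of 'cases' is what KUNIT_ARRAY_PARAM passes to the
   * description helper, so both must be const (or both non-const). */
  static const struct sha1_test_case cases[] = {
          {
                  .str = "hello world",
                  .sha1 = "2aae6c35c94fcfb415dbe95f408b9ce91ee846ed",
          },
  };

  static void case_to_desc(const struct sha1_test_case *t, char *desc)
  {
          strcpy(desc, t->str);
  }

  KUNIT_ARRAY_PARAM(sha1, cases, case_to_desc);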
2022-07-28  Merge tag 'net-5.19-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds, 56 files changed, -321/+481)
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth and netfilter, no known blockers for the release.

  Current release - regressions:
   - wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop(), fix taking the lock before its initialized
   - Bluetooth: mgmt: fix double free on error path

  Current release - new code bugs:
   - eth: ice: fix tunnel checksum offload with fragmented traffic

  Previous releases - regressions:
   - tcp: md5: fix IPv4-mapped support after refactoring, don't take the pure v6 path
   - Revert "tcp: change pingpong threshold to 3", improving detection of interactive sessions
   - mld: fix netdev refcount leak in mld_{query | report}_work() due to a race
   - Bluetooth:
     - always set event mask on suspend, avoid early wake ups
     - L2CAP: fix use-after-free caused by l2cap_chan_put
   - bridge: do not send empty IFLA_AF_SPEC attribute

  Previous releases - always broken:
   - ping6: fix memleak in ipv6_renew_options()
   - sctp: prevent null-deref caused by over-eager error paths
   - virtio-net: fix the race between refill work and close, resulting in NAPI scheduled after close and a BUG()
   - macsec:
     - fix three netlink parsing bugs
     - avoid breaking the device state on invalid change requests
     - fix a memleak in another error path

  Misc:
   - dt-bindings: net: ethernet-controller: rework 'fixed-link' schema
   - two more batches of sysctl data race adornment"

* tag 'net-5.19-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
  stmmac: dwmac-mediatek: fix resource leak in probe
  ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
  net: ping6: Fix memleak in ipv6_renew_options().
  net/funeth: Fix fun_xdp_tx() and XDP packet reclaim
  sctp: leave the err path free in sctp_stream_init to sctp_stream_free
  sfc: disable softirqs for ptp TX
  ptp: ocp: Select CRC16 in the Kconfig.
  tcp: md5: fix IPv4-mapped support
  virtio-net: fix the race between refill work and close
  mptcp: Do not return EINPROGRESS when subflow creation succeeds
  Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
  Bluetooth: Always set event mask on suspend
  Bluetooth: mgmt: Fix double free on error path
  wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop()
  ice: do not setup vlan for loopback VSI
  ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
  ice: Fix VSIs unable to share unicast MAC
  ice: Fix tunnel checksum offload with fragmented traffic
  ice: Fix max VLANs available for VF
  netfilter: nft_queue: only allow supported familes and hooks
  ...
2022-07-28  ice: allow toggling loopback mode via ndo_set_features callback  (Maciej Fijalkowski, 1 file changed, -1/+31)
Add support for NETIF_F_LOOPBACK. This feature can be set via:

  $ ethtool -K eth0 loopback <on|off>

The feature can be useful for local data path tests. Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  ice: compress branches in ice_set_features()  (Maciej Fijalkowski, 1 file changed, -21/+19)
Instead of a rather verbose comparison of the current netdev->features bits vs the incoming ones from user space, let us compress them by a helper features set that will be the result of netdev->features XOR features. This way, the current, extensive branches:

  if (features & NETIF_F_BIT && !(netdev->features & NETIF_F_BIT))
          set_feature(true);
  else if (!(features & NETIF_F_BIT) && netdev->features & NETIF_F_BIT)
          set_feature(false);

can become:

  netdev_features_t changed = netdev->features ^ features;

  if (changed & NETIF_F_BIT)
          set_feature(!!(features & NETIF_F_BIT));

This is nothing new, as currently several other drivers use this approach, which I find much more convenient. Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  ice: Fix promiscuous mode not turning off  (Michal Wilczynski, 3 files changed, -19/+72)
When trust is turned off for the VF, the expectation is that promiscuous and allmulticast filters are removed. Currently the default VSI filter is not getting cleared in this flow. Example:

  ip link set enp236s0f0 vf 0 trust on
  ip link set enp236s0f0v0 promisc on
  ip link set enp236s0f0 vf 0 trust off
  /* promiscuous mode is still enabled on VF0 */

Remove switch filters for both cases. This commit fixes the above behavior by removing the default VSI filters and the allmulticast filters when vf-true-promisc-support is OFF. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  ice: Introduce enabling promiscuous mode on multiple VF's  (Michal Wilczynski, 12 files changed, -169/+155)
In the current implementation, the default VSI switch filter is only able to forward traffic to a single VSI. This limits promiscuous mode with the private flag 'vf-true-promisc-support' to a single VF; enabling it on a second VF won't work. Also, allmulticast support doesn't seem to be properly implemented when vf-true-promisc-support is true.

Use the standard ice_add_rule_internal() function that already implements forwarding to multiple VSIs instead of constructing the AQ call manually. Add a switch filter for allmulticast mode when vf-true-promisc-support is enabled. The same filter is added regardless of the flag - it doesn't matter for this case. Remove unnecessary fields in the switch structure; from now on, bookkeeping will be done by ice_add_rule_internal(). Refactor unnecessarily passed function arguments.

To test:
 1) Create 2 VMs and two VFs. Attach the VFs to the VMs.
 2) Enable promiscuous mode on both of them and check if traffic is seen on both of them.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  tools/power turbostat: version 2022.07.28  (Len Brown, 1 file changed, -1/+1)
update version number Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: do not decode ACC for ICX and SPR  (Artem Bityutskiy, 1 file changed, -2/+0)
The ACC (automatic C-state conversion) feature was available on Sky Lake and Cascade Lake Xeons (SKX and CLX), but it is not available on Ice Lake and Sapphire Rapids Xeons (ICX and SPR). Therefore, stop decoding it for ICX and SPR. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: fix SPR PC6 limits  (Artem Bityutskiy, 1 file changed, -1/+1)
Sapphire Rapids Xeon (SPR) supports 2 flavors of PC6 - PC6N (non-retention) and PC6R (retention). Before this patch we used ICX package C-state limits, which was wrong, because ICX has only one PC6 flavor. With this patch, we use SKX PC6 limits for SPR, because they are the same. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: cleanup 'automatic_cstate_conversion_probe()'  (Artem Bityutskiy, 1 file changed, -1/+9)
The 'automatic_cstate_conversion_probe()' function has an overly long 'if' statement; convert it to a 'switch' statement in order to improve code readability a bit. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: separate SPR from ICX  (Artem Bityutskiy, 1 file changed, -5/+26)
Before this patch, SPR platform was considered identical to ICX platform. This patch separates SPR support from ICX. This patch is a preparation for adding SPR-specific package C-state limits support. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbosstat: fix comment  (Jiang Jian, 1 file changed, -1/+1)
remove duplicate "the" in comment Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: Support RAPTORLAKE P  (George D Sworo, 1 file changed, -0/+1)
Add initial support for Raptorlake model Signed-off-by: George D Sworo <george.d.sworo@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: add support for ALDERLAKE_N  (Zhang Rui, 1 file changed, -0/+1)
Add support for ALDERLAKE_N platform. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: dump secondary Turbo-Ratio-Limit  (Len Brown, 2 files changed, -5/+11)
Intel Performance Hybrid processors have a 2nd MSR describing the turbo limits enforced on the Ecores. Note, TRL and Secondary-TRL are usually R/O information, but on overclock-capable parts, they can be written. Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: simplify dump_turbo_ratio_limits()  (Len Brown, 1 file changed, -46/+9)
code cleanup only. no functional change. Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: dump CPUID.7.EDX.Hybrid  (Len Brown, 1 file changed, -1/+5)
CPUID leaf 7 EDX now tells us if the processor has hybrid CPUs Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: update turbostat.8  (Len Brown, 1 file changed, -76/+124)
Update turbostat.8 to reflect new uncore frequency output (UncMHz) Also, refresh examples. Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: Show uncore frequency  (Len Brown, 1 file changed, -1/+88)
When CONFIG_INTEL_UNCORE_FREQ_CONTROL is effective (Linux 5.9 and later), print the current (and default) min and max uncore frequency limits. When that driver provides the current uncore frequency (Linux 5.18 and later), print a UncMHz column reflecting the current uncore frequency. Note that UncMHz is an instantaneous sample, not an average. e.g.

  $ sudo ./turbostat -S --show frequency
  ...
  Uncore Frequency pkg0 die0: 800 - 3900 MHz (800 - 3900 MHz)
  ...
  Avg_MHz  Busy%  Bzy_MHz  TSC_MHz  UncMHz
  28       0.70   4049     3095     3900

Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: Fix file pointer leak  (Colin Ian King, 1 file changed, -1/+1)
Currently, if an fscanf fails, then an early return leaks an open file pointer. Fix this by fclosing the file before the return. Detected using static analysis with cppcheck:

  tools/power/x86/turbostat/turbostat.c:2039:3: error: Resource leak: fp [resourceLeak]

Fixes: eae97e053fe3 ("tools/power turbostat: Support thermal throttle count print") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Tom Rix <trix@redhat.com> Signed-off-by: Len Brown <len.brown@intel.com>
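The pattern being fixed, as a hedged standalone sketch; the file path, format string, and function name are hypothetical, not the actual turbostat code.

  #include <stdio.h>

  /* Sketch: every early-error return taken after a successful fopen()
   * must fclose() the stream, otherwise the FILE pointer leaks. */
  static int read_count(const char *path, int *count)
  {
          FILE *fp = fopen(path, "r");

          if (!fp)
                  return -1;

          if (fscanf(fp, "%d", count) != 1) {
                  fclose(fp);     /* the close that the leak report flags as missing */
                  return -1;
          }

          fclose(fp);
          return 0;
  }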
2022-07-28  tools/power turbostat: replace strncmp with single character compare  (Colin Ian King, 1 file changed, -1/+1)
Using strncmp for a single character comparison is overly complicated, just use a simpler single character comparison instead. Also stops static analyzers (such as cppcheck) from complaining about strncmp on non-null terminated strings. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>
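A hedged illustration of the general pattern (hypothetical helpers, not the exact turbostat line):

  #include <string.h>
  #include <stdbool.h>

  /* Before: strncmp() for one character - works, but is overly complicated
   * and makes static analyzers worry about non-NUL-terminated buffers. */
  static bool starts_with_dash_old(const char *buf)
  {
          return strncmp(buf, "-", 1) == 0;
  }

  /* After: a plain single-character comparison with the same behavior. */
  static bool starts_with_dash_new(const char *buf)
  {
          return buf[0] == '-';
  }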
2022-07-28  tools/power turbostat: print the kernel boot commandline  (Chen Yu, 1 file changed, -1/+26)
It would be handy to have the kernel cmdline in the turbostat output. For example, if according to the turbostat output there are no C-states requested, the user is very curious whether something like intel_idle.max_cstate=0 was used, or maybe idle=none. It is also interesting whether things like intel_pstate=nohwp were used. Print the boot command line accordingly:

  turbostat version 21.05.04 - Len Brown <lenb@kernel.org>
  Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.16.0+ root=UUID=b42359ed-1e05-42eb-8757-6bf2a1c19070 ro quiet splash vt.handoff=7

Suggested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  tools/power turbostat: Introduce support for RaptorLake  (Zhang Rui, 1 file changed, -0/+1)
RaptorLake is compatible with AlderLake. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2022-07-28  igb: convert .adjfreq to .adjfine  (Jacob Keller, 1 file changed, -8/+7)
The 82576 PTP implementation still uses .adjfreq instead of using the newer .adjfine. This implementation uses a pre-simplified calculation since the base increment value for the 82576 is just 16 * 2^19. Converting this into scaled_ppm is tricky, and makes the intent a bit less clear. Simply convert to the normal flow of multiplying the base increment value by the scaled_ppm and then dividing by 1000000ULL << 16. This can be implemented using mul_u64_u64_div_u64 which can avoid the possible overflow that might occur for large adjustments. Use of .adjfine can improve the precision of small adjustments and gets us one driver closer to removing the old implementation from the kernel entirely. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
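This entry and the ixgbe/i40e/e1000e/ice entries below all describe the same conversion, so a single hedged sketch of the generic .adjfine pattern may help. The nominal increment constant and the register-write helper are hypothetical placeholders, not the igb code; only mul_u64_u64_div_u64() and the .adjfine callback shape come from the kernel.

  #include <linux/math64.h>
  #include <linux/ptp_clock_kernel.h>

  #define EXAMPLE_BASE_INCVAL 0x500000000ULL      /* hypothetical nominal increment */

  /* Hypothetical register write for the adjusted increment value. */
  static void example_write_incval(struct ptp_clock_info *ptp, u64 incval)
  {
  }

  /* Generic .adjfine handler: scale the nominal increment by scaled_ppm
   * (parts per million in 16.16 fixed point), using a 128-bit intermediate
   * so that large adjustments cannot overflow. */
  static int example_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
  {
          bool negative = scaled_ppm < 0;
          u64 diff, incval;

          if (negative)
                  scaled_ppm = -scaled_ppm;

          /* diff = base * scaled_ppm / (1000000 << 16) */
          diff = mul_u64_u64_div_u64(EXAMPLE_BASE_INCVAL, (u64)scaled_ppm,
                                     1000000ULL << 16);
          incval = negative ? EXAMPLE_BASE_INCVAL - diff
                            : EXAMPLE_BASE_INCVAL + diff;

          example_write_incval(ptp, incval);
          return 0;
  }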
2022-07-28  ixgbe: convert .adjfreq to .adjfine  (Jacob Keller, 1 file changed, -33/+40)
Convert the ixgbe PTP frequency adjustment implementations from .adjfreq to .adjfine. This allows using the scaled parts per million adjustment from the PTP core and results in a more precise adjustment for small corrections. To avoid overflow, use mul_u64_u64_div_u64 to perform the calculation. On X86 platforms, this will use instructions that perform the operations with 128bit intermediate values. For other architectures, the implementation will limit the loss of precision as much as possible. This change slightly improves the precision of frequency adjustments for all ixgbe based devices, and gets us one driver closer to being able to remove the older .adjfreq implementation from the kernel. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  i40e: convert .adjfreq to .adjfine  (Jacob Keller, 1 file changed, -10/+12)
The i40e driver currently implements the .adjfreq handler for frequency adjustments. This takes the adjustment parameter in parts per billion. The PTP core supports .adjfine which provides an adjustment in scaled parts per million. This has a higher resolution and can result in more precise adjustments for small corrections. Convert the existing .adjfreq implementation to the newer .adjfine implementation. This is trivial since it just requires changing the divisor from 1000000000ULL to (1000000ULL << 16) in the mul_u64_u64_div_u64 call. This improves the precision of the adjustments and gets us one driver closer to removing the old .adjfreq support from the kernel. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  i40e: use mul_u64_u64_div_u64 for PTP frequency calculation  (Jacob Keller, 1 file changed, -13/+4)
The i40e device has a different clock rate depending on the current link speed. This requires using a different increment rate for the PTP clock registers. For slower link speeds, the base increment value is larger. Directly multiplying the larger increment value by the parts per billion adjustment might overflow. To avoid this, the i40e implementation defaults to using the lower increment value and then multiplying the adjustment afterwards. This causes a loss of precision for lower link speeds. We can fix this by using mul_u64_u64_div_u64 instead of performing the multiplications using standard C operations. On X86, this will use special instructions that perform the multiplication and division with 128bit intermediate values. For other architectures, the fallback implementation will limit the loss of precision for large values. Small adjustments don't overflow anyways and won't lose precision at all. This allows first multiplying the base increment value and then performing the adjustment calculation, since we no longer fear overflowing. It also makes it easier to convert to the even more precise .adjfine implementation in a following change. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  e1000e: convert .adjfreq to .adjfine  (Jacob Keller, 3 files changed, -10/+11)
The PTP implementation for the e1000e driver uses the older .adjfreq method. This method takes an adjustment in parts per billion. The newer .adjfine implementation uses scaled_ppm. The use of scaled_ppm allows for finer grained adjustments and is preferred over using the older implementation. Make use of mul_u64_u64_div_u64 in order to handle possible overflow of the multiplication used to calculate the desired adjustment to the hardware increment value. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  e1000e: remove unnecessary range check in e1000e_phc_adjfreq  (Jacob Keller, 1 file changed, -3/+0)
The e1000e_phc_adjfreq function validates that the input delta is within the maximum range. This is already handled by the core PTP code and this is a duplicate and thus unnecessary check. It also complicates refactoring to use the newer .adjfine implementation, where the input is no longer specified in parts per billion. Remove the range validation check. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  ice: implement adjfine with mul_u64_u64_div_u64  (Jacob Keller, 1 file changed, -13/+3)
The PTP frequency adjustment code needs to determine an appropriate adjustment given an input scaled_ppm adjustment. We calculate the adjustment to the register by multiplying the base (nominal) increment value by the scaled_ppm and then dividing by the scaled one million value. For very large adjustments, this might overflow. To avoid this, both the scaled_ppm and divisor values are downshifted. We can avoid that on X86 architectures by using mul_u64_u64_div_u64. This helper function will perform the multiplication and division with 128bit intermediate values. We know that scaled_ppm is never larger than the divisor so this operation will never result in an overflow. This improves the accuracy of the calculations for large adjustment values on X86. It is likely an improvement on other architectures as well because the default implementation of mul_u64_u64_div_u64 is smarter than the original approach taken in the ice code. Additionally, this implementation is easier to read, using fewer local variables and lines of code to implement. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-07-28  stmmac: dwmac-mediatek: fix resource leak in probe  (Dan Carpenter, 1 file changed, -4/+5)
If mediatek_dwmac_clks_config() fails, then call stmmac_remove_config_dt() before returning. Otherwise it is a resource leak. Fixes: fa4b3ca60e80 ("stmmac: dwmac-mediatek: fix clock issue") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/YuJ4aZyMUlG6yGGa@kili Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28  ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr  (Ziyang Xuan, 1 file changed, -0/+3)
Change the net device's MTU to smaller than IPV6_MIN_MTU, or unregister the device, while a route is being matched. That may trigger a null-ptr-deref bug on ip6_ptr with some probability, as follows:

  =========================================================
  BUG: KASAN: null-ptr-deref in find_match.part.0+0x70/0x134
  Read of size 4 at addr 0000000000000308 by task ping6/263
  CPU: 2 PID: 263 Comm: ping6 Not tainted 5.19.0-rc7+ #14
  Call trace:
   dump_backtrace+0x1a8/0x230
   show_stack+0x20/0x70
   dump_stack_lvl+0x68/0x84
   print_report+0xc4/0x120
   kasan_report+0x84/0x120
   __asan_load4+0x94/0xd0
   find_match.part.0+0x70/0x134
   __find_rr_leaf+0x408/0x470
   fib6_table_lookup+0x264/0x540
   ip6_pol_route+0xf4/0x260
   ip6_pol_route_output+0x58/0x70
   fib6_rule_lookup+0x1a8/0x330
   ip6_route_output_flags_noref+0xd8/0x1a0
   ip6_route_output_flags+0x58/0x160
   ip6_dst_lookup_tail+0x5b4/0x85c
   ip6_dst_lookup_flow+0x98/0x120
   rawv6_sendmsg+0x49c/0xc70
   inet_sendmsg+0x68/0x94

Reproducer as follows. Firstly, prepare the conditions:

  $ip netns add ns1
  $ip netns add ns2
  $ip link add veth1 type veth peer name veth2
  $ip link set veth1 netns ns1
  $ip link set veth2 netns ns2
  $ip netns exec ns1 ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1
  $ip netns exec ns2 ip -6 addr add 2001:0db8:0:f101::2/64 dev veth2
  $ip netns exec ns1 ifconfig veth1 up
  $ip netns exec ns2 ifconfig veth2 up
  $ip netns exec ns1 ip -6 route add 2000::/64 dev veth1 metric 1
  $ip netns exec ns2 ip -6 route add 2001::/64 dev veth2 metric 1

Secondly, execute the following two commands in two ssh windows respectively:

  $ip netns exec ns1 sh
  $while true; do ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1; ip -6 route add 2000::/64 dev veth1 metric 1; ping6 2000::2; done

  $ip netns exec ns1 sh
  $while true; do ip link set veth1 mtu 1000; ip link set veth1 mtu 1500; sleep 5; done

This happens because ip6_ptr is first set to NULL in addrconf_ifdown(), and then ip6_ignore_linkdown() accesses ip6_ptr directly without a NULL check:

  cpu0                             cpu1
  fib6_table_lookup
   __find_rr_leaf
                                   addrconf_notify [ NETDEV_CHANGEMTU ]
                                    addrconf_ifdown
                                     RCU_INIT_POINTER(dev->ip6_ptr, NULL)
    find_match
     ip6_ignore_linkdown

So add a NULL check for ip6_ptr before using it in ip6_ignore_linkdown() to fix the null-ptr-deref bug. Fixes: dcd1f572954f ("net/ipv6: Remove fib6_idev") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220728013307.656257-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28  net: ping6: Fix memleak in ipv6_renew_options().  (Kuniyuki Iwashima, 1 file changed, -0/+6)
When we close ping6 sockets, some resources are left unfreed because pingv6_prot is missing sk->sk_prot->destroy(). As reported by syzbot [0], just three syscalls leak 96 bytes and easily cause OOM.

  struct ipv6_sr_hdr *hdr;
  char data[24] = {0};
  int fd;

  hdr = (struct ipv6_sr_hdr *)data;
  hdr->hdrlen = 2;
  hdr->type = IPV6_SRCRT_TYPE_4;

  fd = socket(AF_INET6, SOCK_DGRAM, NEXTHDR_ICMP);
  setsockopt(fd, IPPROTO_IPV6, IPV6_RTHDR, data, 24);
  close(fd);

To fix memory leaks, let's add a destroy function.

Note the socket() syscall checks if the GID is within the range of net.ipv4.ping_group_range. The default value is [1, 0] so that no GID meets the condition (1 <= GID <= 0). Thus, the local DoS does not succeed until we change the default value. However, at least Ubuntu/Fedora/RHEL loosen it.

  $ cat /usr/lib/sysctl.d/50-default.conf
  ...
  -net.ipv4.ping_group_range = 0 2147483647

Also, there could be another path reported with these options, and some of them require CAP_NET_RAW.

  setsockopt
    IPV6_ADDRFORM       (inet6_sk(sk)->pktoptions)
    IPV6_RECVPATHMTU    (inet6_sk(sk)->rxpmtu)
    IPV6_HOPOPTS        (inet6_sk(sk)->opt)
    IPV6_RTHDRDSTOPTS   (inet6_sk(sk)->opt)
    IPV6_RTHDR          (inet6_sk(sk)->opt)
    IPV6_DSTOPTS        (inet6_sk(sk)->opt)
    IPV6_2292PKTOPTIONS (inet6_sk(sk)->opt)

  getsockopt
    IPV6_FLOWLABEL_MGR  (inet6_sk(sk)->ipv6_fl_list)

For the record, I'm including a splat that differs from syzbot's one.

  unreferenced object 0xffff888006270c60 (size 96):
    comm "repro2", pid 231, jiffies 4294696626 (age 13.118s)
    hex dump (first 32 bytes):
      01 00 00 00 44 00 00 00 00 00 00 00 00 00 00 00  ....D...........
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<00000000f6bc7ea9>] sock_kmalloc (net/core/sock.c:2564 net/core/sock.c:2554)
      [<000000006d699550>] do_ipv6_setsockopt.constprop.0 (net/ipv6/ipv6_sockglue.c:715)
      [<00000000c3c3b1f5>] ipv6_setsockopt (net/ipv6/ipv6_sockglue.c:1024)
      [<000000007096a025>] __sys_setsockopt (net/socket.c:2254)
      [<000000003a8ff47b>] __x64_sys_setsockopt (net/socket.c:2265 net/socket.c:2262 net/socket.c:2262)
      [<000000007c409dcb>] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
      [<00000000e939c4a9>] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

[0]: https://syzkaller.appspot.com/bug?extid=a8430774139ec3ab7176

Fixes: 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.") Reported-by: syzbot+a8430774139ec3ab7176@syzkaller.appspotmail.com Reported-by: Ayushman Dutta <ayudutta@amazon.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220728012220.46918-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28  cgroup: Skip subtree root in cgroup_update_dfl_csses()  (Waiman Long, 1 file changed, -0/+9)
The cgroup_update_dfl_csses() function updates css associations when a cgroup's subtree_control file is modified. Any changes made to a cgroup's subtree_control file, however, will only affect its descendants but not the cgroup itself. So there is no point in migrating csses associated with that cgroup. We can skip them instead. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-07-28  watch_queue: Fix missing locking in add_watch_to_object()  (Linus Torvalds, 1 file changed, -22/+36)
If a watch is being added to a queue, it needs to guard against interference from addition of a new watch, manual removal of a watch and removal of a watch due to some other queue being destroyed. KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by holding the key->sem writelocked and by holding refs on both the key and the queue - but that doesn't prevent interaction from other {key,queue} pairs. While add_watch_to_object() does take the spinlock on the event queue, it doesn't take the lock on the source's watch list. The assumption was that the caller would prevent that (say by taking key->sem) - but that doesn't prevent interference from the destruction of another queue. Fix this by locking the watcher list in add_watch_to_object(). Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> cc: keyrings@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-28  watch_queue: Fix missing rcu annotation  (David Howells, 1 file changed, -1/+1)
Since __post_watch_notification() walks wlist->watchers with only the RCU read lock held, we need to use RCU methods to add to the list (we already use RCU methods to remove from the list). Fix add_watch_to_object() to use hlist_add_head_rcu() instead of hlist_add_head() for that list. Fixes: c73be61cede5 ("pipe: Add general notification queue support") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
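A hedged, generic sketch of the rule being applied; the structures here are illustrative stand-ins, not the watch_queue types. Entries added to an hlist that lockless RCU readers walk must be published with the _rcu add variant.

  #include <linux/rculist.h>
  #include <linux/spinlock.h>

  struct example_watch {
          int id;
          struct hlist_node list_node;
  };

  /* Writers still serialize among themselves with a lock, but the insertion
   * itself must use hlist_add_head_rcu() (not hlist_add_head()) so that
   * concurrent RCU readers only ever see a fully initialized entry. */
  static void example_add_watch(struct hlist_head *watchers,
                                spinlock_t *lock, struct example_watch *w)
  {
          spin_lock(lock);
          hlist_add_head_rcu(&w->list_node, watchers);
          spin_unlock(lock);
  }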
2022-07-28  docs: efi-stub: Fix paths for x86 / arm stubs  (João Paulo Rechi Vita, 1 file changed, -2/+2)
This fixes the paths of x86 / arm efi-stub source files. Signed-off-by: João Paulo Rechi Vita <jprvita@endlessos.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220727140539.10021-1-jprvita@endlessos.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-28  Docs/zh_CN: Update the translation of sched-stats to 5.19-rc8  (Yanteng Si, 1 file changed, -4/+4)
Update to commits:

  6c757e9f55f0 ("docs/scheduler: fix unit error")
  ddb21d27a6a5 ("docs/scheduler: Change unit of cpu_time and rq_time to nanoseconds")

Signed-off-by: Yanteng Si <siyanteng@loongson.cn> Reviewed-by: Wu XiangCheng <bobwxc@email.cn> Reviewed-by: Alex Shi <alexs@kernel.org> Link: https://lore.kernel.org/r/3cb1c4c466dfa38d72a867dc6e2c833ceb69ecb7.1658983157.git.siyanteng@loongson.cn Signed-off-by: Jonathan Corbet <corbet@lwn.net>