summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-02-06Merge tag 'bcachefs-2024-02-05' of https://evilpiepirate.org/git/bcachefsLinus Torvalds4-6/+7
Pull bcachefs fixes from Kent Overstreet: "Two serious ones here that we'll want to backport to stable: a fix for a race in the thread_with_file code, and another locking fixup in the subvolume deletion path" * tag 'bcachefs-2024-02-05' of https://evilpiepirate.org/git/bcachefs: bcachefs: time_stats: Check for last_event == 0 when updating freq stats bcachefs: install fd later to avoid race with close bcachefs: unlock parent dir if entry is not found in subvolume deletion bcachefs: Fix build on parisc by avoiding __multi3()
2024-02-06LoongArch: vDSO: Disable UBSAN instrumentationKees Cook1-0/+1
The vDSO executes in userspace, so the kernel's UBSAN should not instrument it. Solves these kind of build errors: loongarch64-linux-ld: arch/loongarch/vdso/vgettimeofday.o: in function `vdso_shift_ns': lib/vdso/gettimeofday.c:23:(.text+0x3f8): undefined reference to `__ubsan_handle_shift_out_of_bounds' Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401310530.lZHCj1Zl-lkp@intel.com/ Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Fangrui Song <maskray@google.com> Cc: loongarch@lists.linux.dev Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-02-06LoongArch: Fix earlycon parameter if KASAN enabledHuacai Chen1-0/+3
The earlycon parameter is based on fixmap, and fixmap addresses are not supposed to be shadowed by KASAN. So return the kasan_early_shadow_page in kasan_mem_to_shadow() if the input address is above FIXADDR_START. Otherwise earlycon cannot work after kasan_init(). Cc: stable@vger.kernel.org Fixes: 5aa4ac64e6add3e ("LoongArch: Add KASAN (Kernel Address Sanitizer) support") Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-02-06LoongArch: Change acpi_core_pic[NR_CPUS] to acpi_core_pic[MAX_CORE_PIC]Huacai Chen2-4/+4
With default config, the value of NR_CPUS is 64. When HW platform has more then 64 cpus, system will crash on these platforms. MAX_CORE_PIC is the maximum cpu number in MADT table (max physical number) which can exceed the supported maximum cpu number (NR_CPUS, max logical number), but kernel should not crash. Kernel should boot cpus with NR_CPUS, let the remainder cpus stay in BIOS. The potential crash reason is that the array acpi_core_pic[NR_CPUS] can be overflowed when parsing MADT table, and it is obvious that CORE_PIC should be corresponding to physical core rather than logical core, so it is better to define the array as acpi_core_pic[MAX_CORE_PIC]. With the patch, system can boot up 64 vcpus with qemu parameter -smp 128, otherwise system will crash with the following message. [ 0.000000] CPU 0 Unable to handle kernel paging request at virtual address 0000420000004259, era == 90000000037a5f0c, ra == 90000000037a46ec [ 0.000000] Oops[#1]: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-rc2+ #192 [ 0.000000] Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 2/2/2022 [ 0.000000] pc 90000000037a5f0c ra 90000000037a46ec tp 9000000003c90000 sp 9000000003c93d60 [ 0.000000] a0 0000000000000019 a1 9000000003d93bc0 a2 0000000000000000 a3 9000000003c93bd8 [ 0.000000] a4 9000000003c93a74 a5 9000000083c93a67 a6 9000000003c938f0 a7 0000000000000005 [ 0.000000] t0 0000420000004201 t1 0000000000000000 t2 0000000000000001 t3 0000000000000001 [ 0.000000] t4 0000000000000003 t5 0000000000000000 t6 0000000000000030 t7 0000000000000063 [ 0.000000] t8 0000000000000014 u0 ffffffffffffffff s9 0000000000000000 s0 9000000003caee98 [ 0.000000] s1 90000000041b0480 s2 9000000003c93da0 s3 9000000003c93d98 s4 9000000003c93d90 [ 0.000000] s5 9000000003caa000 s6 000000000a7fd000 s7 000000000f556b60 s8 000000000e0a4330 [ 0.000000] ra: 90000000037a46ec platform_init+0x214/0x250 [ 0.000000] ERA: 90000000037a5f0c efi_runtime_init+0x30/0x94 [ 0.000000] CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) [ 0.000000] PRMD: 00000000 (PPLV0 -PIE -PWE) [ 0.000000] EUEN: 00000000 (-FPE -SXE -ASXE -BTE) [ 0.000000] ECFG: 00070800 (LIE=11 VS=7) [ 0.000000] ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) [ 0.000000] BADV: 0000420000004259 [ 0.000000] PRID: 0014c010 (Loongson-64bit, Loongson-3A5000) [ 0.000000] Modules linked in: [ 0.000000] Process swapper (pid: 0, threadinfo=(____ptrval____), task=(____ptrval____)) [ 0.000000] Stack : 9000000003c93a14 9000000003800898 90000000041844f8 90000000037a46ec [ 0.000000] 000000000a7fd000 0000000008290000 0000000000000000 0000000000000000 [ 0.000000] 0000000000000000 0000000000000000 00000000019d8000 000000000f556b60 [ 0.000000] 000000000a7fd000 000000000f556b08 9000000003ca7700 9000000003800000 [ 0.000000] 9000000003c93e50 9000000003800898 9000000003800108 90000000037a484c [ 0.000000] 000000000e0a4330 000000000f556b60 000000000a7fd000 000000000f556b08 [ 0.000000] 9000000003ca7700 9000000004184000 0000000000200000 000000000e02b018 [ 0.000000] 000000000a7fd000 90000000037a0790 9000000003800108 0000000000000000 [ 0.000000] 0000000000000000 000000000e0a4330 000000000f556b60 000000000a7fd000 [ 0.000000] 000000000f556b08 000000000eaae298 000000000eaa5040 0000000000200000 [ 0.000000] ... [ 0.000000] Call Trace: [ 0.000000] [<90000000037a5f0c>] efi_runtime_init+0x30/0x94 [ 0.000000] [<90000000037a46ec>] platform_init+0x214/0x250 [ 0.000000] [<90000000037a484c>] setup_arch+0x124/0x45c [ 0.000000] [<90000000037a0790>] start_kernel+0x90/0x670 [ 0.000000] [<900000000378b0d8>] kernel_entry+0xd8/0xdc Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-02-06LoongArch: Select HAVE_ARCH_SECCOMP to use the common SECCOMP menuMasahiro Yamada1-17/+1
LoongArch missed the refactoring made by commit 282a181b1a0d ("seccomp: Move config option SECCOMP to arch/Kconfig") because LoongArch was not mainlined at that time. The 'depends on PROC_FS' statement is stale as described in that commit. Select HAVE_ARCH_SECCOMP, and remove the duplicated config entry. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-02-06LoongArch: Select ARCH_ENABLE_THP_MIGRATION instead of redefining itMasahiro Yamada1-4/+1
ARCH_ENABLE_THP_MIGRATION is supposed to be selected by arch Kconfig. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-02-06net/mlx5e: XDP, Exclude headroom and tailroom from memory calculationsCarolina Jubran1-1/+8
In the case of XDP Multi-Buffer with Striding RQ, an extra page is allocated for the linear part of non-linear SKBs. Including headroom and tailroom in the calculation may result in an unnecessary increase in the amount of memory allocated. This could be critical, particularly for large MTUs (e.g. 7975B) and large RQ sizes (e.g. 8192). In this case, the requested page pool size is 64K, but 32K would be sufficient. This causes a failure due to exceeding the page pool size limit of 32K. Exclude headroom and tailroom from SKB size calculations to reduce page pool size. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5e: XSK, Exclude tailroom from non-linear SKBs memory calculationsCarolina Jubran1-4/+11
Packet data buffers lack reserved headroom or tailroom, and SKBs are allocated on a side memory when needed. Exclude the tailroom from the SKB size calculations. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5: DR, Change SWS usage to debug fs seq_file interfaceHamdan Igbaria2-134/+620
In current SWS debug dump mechanism we implement the seq_file interface, but we only implement the 'show' callback to dump the whole steering DB with a single call to this callback. However, for large data size the seq_printf function will fail to allocate a buffer with the adequate capacity to hold such data. This patch solves this problem by utilizing the seq_file interface mechanism in the following way: - when the user triggers a dump procedure, we will allocate a list of buffers that hold the whole data dump (in the start callback) - using the start, next, show and stop callbacks of the seq_file API we iterate through the list and dump the whole data Signed-off-by: Hamdan Igbaria <hamdani@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5: Change missing SyncE capability print to debugGal Pressman1-1/+1
Lack of SyncE capability should not emit a warning, change the print to debug level. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5: Remove initial segmentation duplicate definitionsGal Pressman4-20/+14
Device definitions belong in mlx5_ifc, remove the duplicates in mlx5_core.h. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5: Return specific error code for timeout on wait_fw_initMoshe Shemesh1-19/+19
The function wait_fw_init() returns same error code either if it breaks waiting due to timeout or other reason. Thus, the function callers print error message on timeout without checking error type. Return different error code for different failure reason and print error message accordingly on wait_fw_init(). Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5: SF, Stop waiting for FW as teardown was calledMoshe Shemesh1-8/+13
When PF/VF teardown is called the driver sets the flag MLX5_BREAK_FW_WAIT to stop waiting for FW loading and initializing. Same should be applied to SF driver teardown to cut waiting time. On mlx5_sf_dev_remove() set the flag before draining health WQ as recovery flow may also wait for FW reloading while it is not relevant anymore. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5: remove fw reporter dump option for non PFMoshe Shemesh1-3/+10
In case function is not a Physical Function it is not allowed to get FW core dump, so if tried it will fail the fw health reporter dump option. Instead of failing, remove the option of fw_fatal health reporter dump for such function. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5: remove fw_fatal reporter dump option for non PFMoshe Shemesh1-2/+10
In case function is not a Physical Function it is not allowed to collect crdump, so if tried it will fail the fw_fatal health reporter dump option. Instead of failing on permission, remove the option of fw_fatal health reporter dump for such function. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5: Rename mlx5_sf_dev_removeMoshe Shemesh1-4/+5
Mlx5 has two functions with the same name mlx5_sf_dev_remove. Both are static, in different files, so no compilation or logical issue, but it makes it hard to follow the code and some traces even can get both as one leads to the other [1]. Rename one to mlx5_sf_dev_remove_aux() as it actually removes the auxiliary device of the SF. [1] mlx5_sf_dev_remove+0x2a/0x70 [mlx5_core] auxiliary_bus_remove+0x18/0x30 device_release_driver_internal+0x199/0x200 bus_remove_device+0xd7/0x140 device_del+0x153/0x3d0 ? process_one_work+0x16a/0x4b0 mlx5_sf_dev_remove+0x2e/0x90 [mlx5_core] mlx5_sf_dev_table_destroy+0xa0/0x100 [mlx5_core] Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06Documentation: Fix counter name of mlx5 vnic reporterMoshe Shemesh1-2/+3
Fix counter name in documentation of mlx5 vnic health reporter diagnose output: total_error_queues. While here fix alignment in the documentation file of another counter, comp_eq_overrun, as it should have its own line and not be part of another counter's description. Example: $ devlink health diagnose pci/0000:00:04.0 reporter vnic vNIC env counters: total_error_queues: 0 send_queue_priority_update_flow: 0 comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0 invalid_command: 0 quota_exceeded_command: 0 nic_receive_steering_discard: 0 Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5e: Delete obsolete IPsec codeLeon Romanovsky4-26/+2
After addition of HW managed counters and implementation drop in flow steering logic, the code in driver which checks syndrome is not reachable anymore. Let's delete it. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06net/mlx5e: Connect mlx5 IPsec statistics with XFRM coreLeon Romanovsky1-2/+20
Fill integrity, replay and bad trailer counters. As an example, after simulating replay window attack with 5 packets: [leonro@c ~]$ grep XfrmInStateSeqError /proc/net/xfrm_stat XfrmInStateSeqError 5 [leonro@c ~]$ sudo ip -s x s <...> stats: replay-window 0 replay 5 failed 0 Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06xfrm: get global statistics from the offloaded deviceLeon Romanovsky4-1/+19
Iterate over all SAs in order to fill global IPsec statistics. Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-06xfrm: generalize xdo_dev_state_update_curlft to allow statistics updateLeon Romanovsky6-16/+14
In order to allow drivers to fill all statistics, change the name of xdo_dev_state_update_curlft to be xdo_dev_state_update_stats. Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-05wifi: mt76: mt7996: fix fortify warningFelix Fietkau1-1/+2
Copy cck and ofdm separately in order to avoid __read_overflow2_field warning. Fixes: f75e4779d215 ("wifi: mt76: mt7996: add txpower setting support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://msgid.link/20240203132446.54790-1-nbd@nbd.name
2024-02-05netdevsim: add Makefile for selftestsDavid Wei2-0/+18
Add a Makefile for netdevsim selftests and add selftests path to MAINTAINERS Signed-off-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/20240130214620.3722189-5-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-05nfsd: don't take fi_lock in nfsd_break_deleg_cb()NeilBrown1-6/+5
A recent change to check_for_locks() changed it to take ->flc_lock while holding ->fi_lock. This creates a lock inversion (reported by lockdep) because there is a case where ->fi_lock is taken while holding ->flc_lock. ->flc_lock is held across ->fl_lmops callbacks, and nfsd_break_deleg_cb() is one of those and does take ->fi_lock. However it doesn't need to. Prior to v4.17-rc1~110^2~22 ("nfsd: create a separate lease for each delegation") nfsd_break_deleg_cb() would walk the ->fi_delegations list and so needed the lock. Since then it doesn't walk the list and doesn't need the lock. Two actions are performed under the lock. One is to call nfsd_break_one_deleg which calls nfsd4_run_cb(). These doesn't act on the nfs4_file at all, so don't need the lock. The other is to set ->fi_had_conflict which is in the nfs4_file. This field is only ever set here (except when initialised to false) so there is no possible problem will multiple threads racing when setting it. The field is tested twice in nfs4_set_delegation(). The first test does not hold a lock and is documented as an opportunistic optimisation, so it doesn't impose any need to hold ->fi_lock while setting ->fi_had_conflict. The second test in nfs4_set_delegation() *is* make under ->fi_lock, so removing the locking when ->fi_had_conflict is set could make a change. The change could only be interesting if ->fi_had_conflict tested as false even though nfsd_break_one_deleg() ran before ->fi_lock was unlocked. i.e. while hash_delegation_locked() was running. As hash_delegation_lock() doesn't interact in any way with nfs4_run_cb() there can be no importance to this interaction. So this patch removes the locking from nfsd_break_one_deleg() and moves the final test on ->fi_had_conflict out of the locked region to make it clear that locking isn't important to the test. It is still tested *after* vfs_setlease() has succeeded. This might be significant and as vfs_setlease() takes ->flc_lock, and nfsd_break_one_deleg() is called under ->flc_lock this "after" is a true ordering provided by a spinlock. Fixes: edcf9725150e ("nfsd: fix RELEASE_LOCKOWNER") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-02-05Merge branch 'qca8k-cleanup-fixes'David S. Miller1-10/+9
Vladimir Oltean says: ==================== Fixups for qca8k ds->user_mii_bus cleanup The series "ds->user_mii_bus cleanup (part 1)" from the last development cycle: https://patchwork.kernel.org/project/netdevbpf/cover/20240104140037.374166-1-vladimir.oltean@nxp.com/ had some review comments I didn't have the time to address at the time. One from Alvin and one from Luiz. They can reasonably be treated as improvements for v6.9. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05net: dsa: qca8k: consistently use "ret" rather than "err" for error codesVladimir Oltean1-8/+8
It was pointed out during the review [1] of commit 68e1010cda79 ("net: dsa: qca8k: put MDIO bus OF node on qca8k_mdio_register() failure") that the rest of the qca8k driver uses "int ret" rather than "int err". Make everything consistent in that regard, not only qca8k_mdio_register(), but also qca8k_setup_mdio_bus(). [1] https://lore.kernel.org/netdev/qyl2w3ownx5q7363kqxib52j5htar4y6pkn7gen27rj45xr4on@pvy5agi6o2te/ Suggested-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05net: dsa: qca8k: put MDIO controller OF node if unavailableVladimir Oltean1-2/+1
It was pointed out during the review [1] of commit e66bf63a7f67 ("net: dsa: qca8k: skip MDIO bus creation if its OF node has status = "disabled"") that we now leak a reference to the "mdio" OF node if it is disabled. This is only a concern when using dynamic OF as far as I can tell (like probing on an overlay), since OF nodes are never freed in the regular case. Additionally, I'm unaware of any actual device trees (in production or elsewhere) which have status = "disabled" for the MDIO OF node. So handling this as a simple enhancement. [1] https://lore.kernel.org/netdev/CAJq09z4--Ug+3FAmp=EimQ8HTQYOWOuVon-PUMGB5a1N=RPv4g@mail.gmail.com/ Suggested-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05net: dsa: reindent arguments of dsa_user_vlan_for_each()Vladimir Oltean1-4/+4
These got misaligned after commit 6ca80638b90c ("net: dsa: Use conduit and user terms"). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05selftests: net: let big_tcp test cope with slow envPaolo Abeni1-1/+3
In very slow environments, most big TCP cases including segmentation and reassembly of big TCP packets have a good chance to fail: by default the TCP client uses write size well below 64K. If the host is low enough autocorking is unable to build real big TCP packets. Address the issue using much larger write operations. Note that is hard to observe the issue without an extremely slow and/or overloaded environment; reduce the TCP transfer time to allow for much easier/faster reproducibility. Fixes: 6bb382bcf742 ("selftests: add a selftest for big tcp") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05net: ocelot: update the MODULE_DESCRIPTION()Breno Leitao1-1/+1
commit 1c870c63d7d2 ("net: fill in MODULE_DESCRIPTION()s for ocelot") got a suggestion from Vladimir Oltean after it had landed in net-next. Rewrite the module description according to Vladimir's suggestion. Fixes: 1c870c63d7d2 ("net: fill in MODULE_DESCRIPTION()s for ocelot") Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05Merge branch 'rxrpc-fixes'David S. Miller9-43/+154
David Howells says: ==================== rxrpc: Miscellaneous fixes Here some miscellaneous fixes for AF_RXRPC: (1) The zero serial number has a special meaning in an ACK packet serial reference, so skip it when assigning serial numbers to transmitted packets. (2) Don't set the reference serial number in a delayed ACK as the ACK cannot be used for RTT calculation. (3) Don't emit a DUP ACK response to a PING RESPONSE ACK coming back to a call that completed in the meantime. (4) Fix the counting of acks and nacks in ACK packet to better drive congestion management. We want to know if there have been new acks/nacks since the last ACK packet, not that there are still acks/nacks. This is more complicated as we have to save the old SACK table and compare it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05rxrpc: Fix counting of new acks and nacksDavid Howells5-28/+122
Fix the counting of new acks and nacks when parsing a packet - something that is used in congestion control. As the code stands, it merely notes if there are any nacks whereas what we really should do is compare the previous SACK table to the new one, assuming we get two successive ACK packets with nacks in them. However, we really don't want to do that if we can avoid it as the tables might not correspond directly as one may be shifted from the other - something that will only get harder to deal with once extended ACK tables come into full use (with a capacity of up to 8192). Instead, count the number of nacks shifted out of the old SACK, the number of nacks retained in the portion still active and the number of new acks and nacks in the new table then calculate what we need. Note this ends up a bit of an estimate as the Rx protocol allows acks to be withdrawn by the receiver and packets requested to be retransmitted. Fixes: d57a3a151660 ("rxrpc: Save last ACK's SACK table rather than marking txbufs") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05rxrpc: Fix response to PING RESPONSE ACKs to a dead callDavid Howells1-0/+8
Stop rxrpc from sending a DUP ACK in response to a PING RESPONSE ACK on a dead call. We may have initiated the ping but the call may have beaten the response to completion. Fixes: 18bfeba50dfd ("rxrpc: Perform terminal call ACK/ABORT retransmission from conn processor") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05rxrpc: Fix delayed ACKs to not set the reference serial numberDavid Howells2-6/+1
Fix the construction of delayed ACKs to not set the reference serial number as they can't be used as an RTT reference. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05rxrpc: Fix generation of serial numbers to skip zeroDavid Howells5-9/+23
In the Rx protocol, every packet generated is marked with a per-connection monotonically increasing serial number. This number can be referenced in an ACK packet generated in response to an incoming packet - thereby allowing the sender to use this for RTT determination, amongst other things. However, if the reference field in the ACK is zero, it doesn't refer to any incoming packet (it could be a ping to find out if a packet got lost, for example) - so we shouldn't generate zero serial numbers. Fix the generation of serial numbers to retry if it comes up with a zero. Furthermore, since the serial numbers are only ever allocated within the I/O thread this connection is bound to, there's no need for atomics so remove that too. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05net: blackhole_dev: fix build warning for ethh set but not usedBreno Leitao1-2/+1
lib/test_blackhole_dev.c sets a variable that is never read, causing this following building warning: lib/test_blackhole_dev.c:32:17: warning: variable 'ethh' set but not used [-Wunused-but-set-variable] Remove the variable struct ethhdr *ethh, which is unused. Fixes: 509e56b37cc3 ("blackhole_dev: add a selftest") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05Merge branch 'mptcp-annotate-lockless'David S. Miller7-49/+55
Matthieu Baerts says: ==================== mptcp: annotate lockless access This is a series of 5 patches from Paolo to annotate lockless access. The MPTCP locking schema is already quite complex. We need to clarify it and make the lockless access already there consistent, or later changes will be even harder to follow and understand. This series goes through all the msk fields accessed in the RX and TX path and makes the lockless annotation consistent with the in-use locking schema. As a bonus, this should fix data races eventually found by fuzzers -- even if we haven't seen many such reports so far. Patch 1/5 hints we could remove "local_key" and "remote_key" from the subflow context, and always use the ones from the msk socket, possibly reducing the context memory usage. That change is left over as a possible follow-up. ==================== Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05mptcp: annotate lockless accesses around read-mostly fieldsPaolo Abeni2-8/+8
The following MPTCP socket fields: - can_ack - fully_established - rcv_data_fin - snd_data_fin_enable - rcv_fastclose - use_64bit_ack are accessed without any lock, add the appropriate annotation. The schema is safe as each field can change its value at most once in the whole mptcp socket life cycle. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05mptcp: annotate lockless access for tokenPaolo Abeni3-7/+7
The token field is manipulated under the msk socket lock and accessed lockless in a few spots, add proper ONCE annotation Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05mptcp: annotate lockless access for RX path fieldsPaolo Abeni2-11/+14
The following fields: - ack_seq - snd_una - wnd_end - rmem_fwd_alloc are protected by the data lock end accessed lockless in a few spots. Ensure ONCE annotation for write (under such lock) and for lockless read. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05mptcp: annotate lockless access for the tx pathPaolo Abeni3-10/+9
The mptcp-level TX path info (write_seq, bytes_sent, snd_nxt) are under the msk socket lock protection, and are accessed lockless in a few spots. Always mark the write operations with WRITE_ONCE, read operations outside the lock with READ_ONCE and drop the annotation for read under such lock. To simplify the annotations move mptcp_pending_data_fin_ack() from __mptcp_data_acked() to __mptcp_clean_una(), under the msk socket lock, where such call would belong. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05mptcp: annotate access for msk keysPaolo Abeni4-13/+17
Both the local and the remote key follow the same locking schema, put in place the proper ONCE accessors. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05tsnep: Add helper for RX XDP_RING_NEED_WAKEUP flagGerhard Engleder1-12/+11
Similar chunk of code is used in tsnep_rx_poll_zc() and tsnep_rx_reopen_xsk() to maintain the RX XDP_RING_NEED_WAKEUP flag. Consolidate the code to common helper function. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05Merge branch 'nfp-fixes'David S. Miller3-3/+6
Louis Peens says: ==================== nfp: a few simple driver fixes This is combining a few unrelated one-liner fixes which have been floating around internally into a single series. I'm not sure what is the least amount of overhead for reviewers, this or a separate submission per-patch? I guess it probably depends on personal preference, but please let me know if there is a strong preference to rather split these in the future. Summary: Patch1: Fixes an old issue which was hidden because 0 just so happens to be the correct value. Patch2: Fixes a corner case for flower offloading with bond ports Patch3: Re-enables the 'NETDEV_XDP_ACT_REDIRECT', which was accidentally disabled after a previous refactor. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05nfp: enable NETDEV_XDP_ACT_REDIRECT feature flagJames Hershaw1-0/+1
Enable previously excluded xdp feature flag for NFD3 devices. This feature flag is required in order to bind nfp interfaces to an xdp socket and the nfp driver does in fact support the feature. Fixes: 66c0e13ad236 ("drivers: net: turn on XDP features") Cc: stable@vger.kernel.org # 6.3+ Signed-off-by: James Hershaw <james.hershaw@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05nfp: flower: prevent re-adding mac index for bonded portDaniel de Villiers1-1/+1
When physical ports are reset (either through link failure or manually toggled down and up again) that are slaved to a Linux bond with a tunnel endpoint IP address on the bond device, not all tunnel packets arriving on the bond port are decapped as expected. The bond dev assigns the same MAC address to itself and each of its slaves. When toggling a slave device, the same MAC address is therefore offloaded to the NFP multiple times with different indexes. The issue only occurs when re-adding the shared mac. The nfp_tunnel_add_shared_mac() function has a conditional check early on that checks if a mac entry already exists and if that mac entry is global: (entry && nfp_tunnel_is_mac_idx_global(entry->index)). In the case of a bonded device (For example br-ex), the mac index is obtained, and no new index is assigned. We therefore modify the conditional in nfp_tunnel_add_shared_mac() to check if the port belongs to the LAG along with the existing checks to prevent a new global mac index from being re-assigned to the slave port. Fixes: 20cce8865098 ("nfp: flower: enable MAC address sharing for offloadable devs") CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Daniel de Villiers <daniel.devilliers@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05nfp: use correct macro for LengthSelect in BAR configDaniel Basilio1-2/+4
The 1st and 2nd expansion BAR configuration registers are configured, when the driver starts up, in variables 'barcfg_msix_general' and 'barcfg_msix_xpb', respectively. The 'LengthSelect' field is ORed in from bit 0, which is incorrect. The 'LengthSelect' field should start from bit 27. This has largely gone un-noticed because NFP_PCIE_BAR_PCIE2CPP_LengthSelect_32BIT happens to be 0. Fixes: 4cb584e0ee7d ("nfp: add CPP access core") Cc: stable@vger.kernel.org # 4.11+ Signed-off-by: Daniel Basilio <daniel.basilio@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05sctp: preserve const qualifier in sctp_sk()Eric Dumazet1-4/+1
We can change sctp_sk() to propagate its argument const qualifier, thanks to container_of_const(). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-05bcachefs: time_stats: Check for last_event == 0 when updating freq statsKent Overstreet1-2/+3
This fixes spurious outliers in the frequency stats. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-05bcachefs: install fd later to avoid race with closeMathias Krause1-1/+1
Calling fd_install() makes a file reachable for userland, including the possibility to close the file descriptor, which leads to calling its 'release' hook. If that happens before the code had a chance to bump the reference of the newly created task struct, the release callback will call put_task_struct() too early, leading to the premature destruction of the kernel thread. Avoid that race by calling fd_install() later, after all the setup is done. Fixes: 1c6fdbd8f246 ("bcachefs: Initial commit") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>