Age | Commit message | Author | Files | Lines
2022-02-09net: dsa: ar9331: register the mdiobus under devresVladimir Oltean1-2/+1
As explained in commits: 74b6d7d13307 ("net: dsa: realtek: register the MDIO bus under devres") 5135e96a3dd2 ("net: dsa: don't allocate the slave_mii_bus using devres") mdiobus_free() will panic when called from devm_mdiobus_free() <- devres_release_all() <- __device_release_driver(), and that mdiobus was not previously unregistered. The ar9331 is an MDIO device, so the initial set of constraints that I thought would cause this (I2C or SPI buses which call ->remove on ->shutdown) do not apply. But there is one more which applies here. If the DSA master itself is on a bus that calls ->remove from ->shutdown (like dpaa2-eth, which is on the fsl-mc bus), there is a device link between the switch and the DSA master, and device_links_unbind_consumers() will unbind the ar9331 switch driver on shutdown. So the same treatment must be applied to all DSA switch drivers, which is: either use devres for both the mdiobus allocation and registration, or don't use devres at all. The ar9331 driver doesn't have a complex code structure for mdiobus removal, so just replace of_mdiobus_register with the devres variant in order to be all-devres and ensure that we don't free a still-registered bus. Fixes: ac3a68d56651 ("net: phy: don't abuse devres in devm_mdiobus_register()") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09net: dsa: mv88e6xxx: don't use devres for mdiobusVladimir Oltean1-3/+8
As explained in commits: 74b6d7d13307 ("net: dsa: realtek: register the MDIO bus under devres") 5135e96a3dd2 ("net: dsa: don't allocate the slave_mii_bus using devres") mdiobus_free() will panic when called from devm_mdiobus_free() <- devres_release_all() <- __device_release_driver(), and that mdiobus was not previously unregistered. The mv88e6xxx is an MDIO device, so the initial set of constraints that I thought would cause this (I2C or SPI buses which call ->remove on ->shutdown) do not apply. But there is one more which applies here. If the DSA master itself is on a bus that calls ->remove from ->shutdown (like dpaa2-eth, which is on the fsl-mc bus), there is a device link between the switch and the DSA master, and device_links_unbind_consumers() will unbind the Marvell switch driver on shutdown. systemd-shutdown[1]: Powering off. mv88e6085 0x0000000008b96000:00 sw_gl0: Link is Down fsl-mc dpbp.9: Removing from iommu group 7 fsl-mc dpbp.8: Removing from iommu group 7 ------------[ cut here ]------------ kernel BUG at drivers/net/phy/mdio_bus.c:677! 
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.16.5-00040-gdc05f73788e5 #15 pc : mdiobus_free+0x44/0x50 lr : devm_mdiobus_free+0x10/0x20 Call trace: mdiobus_free+0x44/0x50 devm_mdiobus_free+0x10/0x20 devres_release_all+0xa0/0x100 __device_release_driver+0x190/0x220 device_release_driver_internal+0xac/0xb0 device_links_unbind_consumers+0xd4/0x100 __device_release_driver+0x4c/0x220 device_release_driver_internal+0xac/0xb0 device_links_unbind_consumers+0xd4/0x100 __device_release_driver+0x94/0x220 device_release_driver+0x28/0x40 bus_remove_device+0x118/0x124 device_del+0x174/0x420 fsl_mc_device_remove+0x24/0x40 __fsl_mc_device_remove+0xc/0x20 device_for_each_child+0x58/0xa0 dprc_remove+0x90/0xb0 fsl_mc_driver_remove+0x20/0x5c __device_release_driver+0x21c/0x220 device_release_driver+0x28/0x40 bus_remove_device+0x118/0x124 device_del+0x174/0x420 fsl_mc_bus_remove+0x80/0x100 fsl_mc_bus_shutdown+0xc/0x1c platform_shutdown+0x20/0x30 device_shutdown+0x154/0x330 kernel_power_off+0x34/0x6c __do_sys_reboot+0x15c/0x250 __arm64_sys_reboot+0x20/0x30 invoke_syscall.constprop.0+0x4c/0xe0 do_el0_svc+0x4c/0x150 el0_svc+0x24/0xb0 el0t_64_sync_handler+0xa8/0xb0 el0t_64_sync+0x178/0x17c So the same treatment must be applied to all DSA switch drivers, which is: either use devres for both the mdiobus allocation and registration, or don't use devres at all. The Marvell driver already has a good structure for mdiobus removal, so just plug in mdiobus_free and get rid of devres. Fixes: ac3a68d56651 ("net: phy: don't abuse devres in devm_mdiobus_register()") Reported-by: Rafael Richter <Rafael.Richter@gin.de> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Daniel Klauer <daniel.klauer@gin.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09net: add dev->dev_registered_trackerEric Dumazet2-2/+7
Convert one dev_hold()/dev_put() pair in register_netdevice() and unregister_netdevice_many() to dev_hold_track() and dev_put_track(). This allows a rogue dev_put() to be detected a bit earlier. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09Merge branch 'fix bpf_prog_pack build errors'Alexei Starovoitov3-3/+7
Song Liu says: ==================== Fix build errors reported by kernel test robot. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-02-09bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZESong Liu1-1/+5
Fix build with CONFIG_TRANSPARENT_HUGEPAGE=n with BPF_PROG_PACK_SIZE as PAGE_SIZE. Fixes: 57631054fae6 ("bpf: Introduce bpf_prog_pack allocator") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220208220509.4180389-3-song@kernel.org
2022-02-09bonding: pair enable_port with slave_arr_updatesMahesh Bandewar1-1/+2
When 802.3ad mode enables a participating port, it should update the slave-array. I have observed that the member links are participating and are part of the active aggregator while the traffic is egressing via only one member link (in a case where two links are participating). Via kprobes I discovered that slave-arr has only one link added while the other participating link wasn't part of the slave-arr. I couldn't see what caused that situation, but a simple code walk-through gave me hints that enable_port wasn't always associated with the slave-array update. Fixes: ee6377147409 ("bonding: Simplify the xmit function for modes that use xmit_hash") Signed-off-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://lore.kernel.org/r/20220207222901.1795287-1-maheshb@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09gve: Recording rx queue before sending to napiTao Liu1-0/+1
This caused a significant performance degradation when using generic XDP with multiple queues. Fixes: f5cedc84a30d2 ("gve: Add transmit and receive support") Signed-off-by: Tao Liu <xliutaox@google.com> Link: https://lore.kernel.org/r/20220207175901.2486596-1-jeroendb@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09et131x: support arbitrary MAX_SKB_FRAGSEric Dumazet1-4/+10
This NIC does not support TSO, it is very unlikely it would have to send packets with many fragments. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220208004855.1887345-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09Merge branch 'iwl-next' of ↵Jakub Kicinski2-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux Nguyen, Anthony L says: ==================== iwl-next Intel Wired LAN Driver Updates 2022-02-07 Dave adds support for ice driver to provide DSCP QoS mappings to irdma driver. [1] https://lore.kernel.org/netdev/20220202191921.1638-1-shiraz.saleem@intel.com/ * 'iwl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux: ice: add support for DSCP QoS for IDC ==================== Link: https://lore.kernel.org/r/20220207235921.1303522-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09bpf: Fix leftover header->pages in sparc and powerpc code.Song Liu2-2/+2
Replace header->pages * PAGE_SIZE with new header->size. Fixes: ed2d9e1a26cc ("bpf: Use size instead of pages in bpf_binary_header") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220208220509.4180389-2-song@kernel.org
2022-02-09libbpf: Fix signedness bug in btf_dump_array_data()Dan Carpenter1-2/+3
The btf__resolve_size() function returns negative error codes so "elem_size" must be signed for the error handling to work. Fixes: 920d16af9b42 ("libbpf: BTF dumper support for typed data") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220208071552.GB10495@kili
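The signedness problem described above can be illustrated in user space. This is a minimal sketch, not the libbpf code: resolve_size() is a hypothetical stand-in for a helper that returns either a size or a negative errno value, and the two callers only differ in the type used to hold the result.

```c
#include <assert.h>	/* only for the usage checks mentioned below */
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in: returns a byte size, or a negative errno value. */
int64_t resolve_size(int type_id)
{
	return type_id > 0 ? 4 : -EINVAL;
}

/*
 * Buggy pattern: the result is stored in an unsigned variable, so a
 * negative error code silently becomes a very large "size".
 */
uint32_t elem_size_unsigned(int type_id)
{
	uint32_t elem_size = (uint32_t)resolve_size(type_id);

	return elem_size;	/* a check such as "elem_size < 0" could never fire */
}

/* Fixed pattern: keep the value signed until the error check is done. */
int elem_size_checked(int type_id, uint32_t *out)
{
	int64_t elem_size = resolve_size(type_id);

	if (elem_size < 0)
		return (int)elem_size;	/* propagate the error code */
	*out = (uint32_t)elem_size;
	return 0;
}
```

With an invalid type_id, elem_size_unsigned() returns (uint32_t)-EINVAL, a large positive number, while elem_size_checked() returns -EINVAL and leaves the output untouched.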
2022-02-08selftests/bpf: Do not export subtest as standalone testHou Tao5-8/+8
Two subtests in ksyms_module.c are not qualified as static, so these subtests are exported as standalone tests in tests.h and lead to confusion for the output of "./test_progs -t ksyms_module". By using the following command ... grep "^void \(serial_\)\?test_[a-zA-Z0-9_]\+(\(void\)\?)" \ tools/testing/selftests/bpf/prog_tests/*.c | \ awk -F : '{print $1}' | sort | uniq -c | awk '$1 != 1' ... one finds out that other tests also have a similar problem, so fix these tests by marking subtests in these tests as static. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220208065444.648778-1-houtao1@huawei.com
2022-02-08Documentation: KUnit: Fix usage bugAkira Kawata1-1/+1
Fix a bug of kunit documentation. Link: https://bugzilla.kernel.org/show_bug.cgi?id=205773 : Quoting Steve Pfetsch: : : kunit documentation is incorrect: : https://kunit.dev/third_party/stable_kernel/docs/usage.html : struct rectangle *self = container_of(this, struct shape, parent); : : : Shouldn't it be: : struct rectangle *self = container_of(this, struct rectangle, parent); : ? Signed-off-by: Akira Kawata <akirakawata1@gmail.com> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
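The corrected container_of() usage quoted above can be tried outside the kernel. This is a minimal sketch under stated assumptions: container_of() is a simplified user-space version (the kernel macro adds type checking), and the struct layouts are illustrative, with parent deliberately placed last so that a wrong type argument would yield a wrong pointer.

```c
#include <assert.h>	/* only for the usage checks mentioned below */
#include <stddef.h>

/* Simplified user-space container_of(); an assumption, not the kernel macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct shape {
	int (*area)(struct shape *this);
};

struct rectangle {
	int length;
	int width;
	struct shape parent;	/* embedded "base class" */
};

/*
 * The second argument names the containing type (struct rectangle),
 * the third names the member that "this" points to (parent).
 */
int rectangle_area(struct shape *this)
{
	struct rectangle *self = container_of(this, struct rectangle, parent);

	return self->length * self->width;
}
```

Given a struct rectangle r with length 3, width 4 and parent.area set to rectangle_area, calling r.parent.area(&r.parent) returns 12.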
2022-02-08Merge tag 'nfs-for-5.17-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds8-11/+36
Pull NFS client fixes from Anna Schumaker: "Stable Fixes: - Fix initialization of nfs_client cl_flags Other Fixes: - Fix performance issues with uncached readdir calls - Fix potential pointer dereferences in rpcrdma_ep_create - Fix nfs4_proc_get_locations() kernel-doc comment - Fix locking during sunrpc sysfs reads - Update my email address in the MAINTAINERS file to my new kernel.org email" * tag 'nfs-for-5.17-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: SUNRPC: lock against ->sock changing during sysfs read MAINTAINERS: Update my email address NFS: Fix nfs4_proc_get_locations() kernel-doc comment xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_create NFS: Fix initialisation of nfs_client cl_flags field NFS: Avoid duplicate uncached readdir calls on eof NFS: Don't skip directory entries when doing uncached readdir NFS: Don't overfill uncached readdir pages
2022-02-08bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failuresSong Liu1-2/+6
Instead of BUG_ON(), fail gracefully and return orig_prog. Fixes: 1022a5498f6f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc") Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220208062533.3802081-1-song@kernel.org
2022-02-08i40e: Add a stat for tracking busy rx pagesJoe Damato5-5/+15
In some cases, pages cannot be reused by i40e because the page is busy. Add a counter for this event. Busy page count is accessible via ethtool. Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-02-08i40e: Add a stat for tracking pages waivedJoe Damato5-4/+17
In some cases, pages can not be reused because they are not associated with the correct NUMA zone. Knowing how often pages are waived helps users to understand the interaction between the driver's memory usage and their system. Pass rx_stats through to i40e_can_reuse_rx_page to allow tracking when pages are waived. The page waive count is accessible via ethtool. Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-02-08i40e: Add a stat tracking new RX page allocationsJoe Damato5-1/+9
Add a counter for new page allocations in the i40e RX path. This stat is accessible with ethtool. Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-02-08i40e: Aggregate and export RX page reuse statJoe Damato3-1/+6
rx page reuse was already being tracked by the i40e driver per RX ring. Aggregate the counts and make them accessible via ethtool. Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-02-08i40e: Remove rx page reuse double countJoe Damato1-2/+0
Page reuse was being tracked from two locations: - i40e_reuse_rx_page (via i40e_clean_rx_irq), and - i40e_alloc_mapped_page Remove the double count and only count reuse from i40e_alloc_mapped_page when the page is about to be reused. Signed-off-by: Joe Damato <jdamato@fastly.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-02-08SUNRPC: lock against ->sock changing during sysfs readNeilBrown2-2/+10
->sock can be set to NULL asynchronously unless ->recv_mutex is held. So it is important to hold that mutex. Otherwise a sysfs read can trigger an oops. Commit 17f09d3f619a ("SUNRPC: Check if the xprt is connected before handling sysfs reads") appears to attempt to fix this problem, but it only narrows the race window. Fixes: 17f09d3f619a ("SUNRPC: Check if the xprt is connected before handling sysfs reads") Fixes: a8482488a7d6 ("SUNRPC query transport's source port") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08MAINTAINERS: Update my email addressAnna Schumaker1-1/+1
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08NFS: Fix nfs4_proc_get_locations() kernel-doc commentYang Li1-1/+2
Add the description of @server and @fhandle, and remove the excess @inode in nfs4_proc_get_locations() kernel-doc comment to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/nfs/nfs4proc.c:8219: warning: Function parameter or member 'server' not described in 'nfs4_proc_get_locations' fs/nfs/nfs4proc.c:8219: warning: Function parameter or member 'fhandle' not described in 'nfs4_proc_get_locations' fs/nfs/nfs4proc.c:8219: warning: Excess function parameter 'inode' description in 'nfs4_proc_get_locations' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08xprtrdma: fix pointer derefs in error cases of rpcrdma_ep_createDan Aloni1-0/+3
If there are failures then we must not leave the non-NULL pointers with the error value, otherwise `rpcrdma_ep_destroy` gets confused and tries to free them, resulting in an Oops. Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08NFS: Fix initialisation of nfs_client cl_flags fieldTrond Myklebust1-1/+1
For some long forgotten reason, the nfs_client cl_flags field is initialised in nfs_get_client() instead of being initialised at allocation time. This quirk was harmless until we moved the call to nfs_create_rpc_client(). Fixes: dd99e9f98fbf ("NFSv4: Initialise connection to the server in nfs4_alloc_client()") Cc: stable@vger.kernel.org # 4.8.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-02-08Merge branch 'inet-separate-dscp-from-ecn-bits-using-new-dscp_t-type'Jakub Kicinski12-50/+278
Guillaume Nault says: ==================== inet: Separate DSCP from ECN bits using new dscp_t type The networking stack currently doesn't clearly distinguish between DSCP and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or structure fields), and each part of the stack handles them in their own way, using different macros. This has created several bugs in the past and some uncommon code paths are still unfixed. Such bugs generally manifest by selecting invalid routes because of ECN bits interfering with FIB routes and rules lookups (more details in the LPC 2021 talk[1] and in the RFC of this series[2]). This patch series aims at preventing the introduction of such bugs (and detecting existing ones), by introducing a dscp_t type, representing "sanitised" DSCP values (that is, with no ECN information), as opposed to plain u8 values that contain both DSCP and ECN information. dscp_t makes it clear for the reader what we're working on, and Sparse can flag invalid interactions between dscp_t and plain u8. This series converts only a few variables and structures: * Patch 1 converts the tclass field of struct fib6_rule. It effectively forbids the use of ECN bits in the tos/dsfield option of ip -6 rule. Rules now match packets solely based on their DSCP bits, so ECN doesn't influence the result any more. This contrasts with the previous behaviour where all 8 bits of the Traffic Class field were used. It is believed that this change is acceptable as matching ECN bits wasn't usable for IPv4, so only IPv6-only deployments could be depending on it. Also the previous behaviour made DSCP-based ip6-rules fail for packets with both a DSCP and an ECN mark, which is another reason why any such deploy is unlikely. * Patch 2 converts the tos field of struct fib4_rule. This one too effectively forbids defining ECN bits, this time in ip -4 rule. Before that, setting ECN bit 1 was accepted, while ECN bit 0 was rejected. 
But even when accepted, the rule would never match, as the packets would have their ECN bits cleared before doing the rule lookup. * Patch 3 converts the fc_tos field of struct fib_config. This is equivalent to patch 2, but for IPv4 routes. Routes using a tos/dsfield option with any ECN bit set are now rejected. Before this patch, they were accepted but, as with ip4 rules, these routes couldn't match any packet, since their ECN bits are cleared before the lookup. * Patch 4 converts the fa_tos field of struct fib_alias. This one is a purely internal u8 to dscp_t conversion. While patches 1-3 had user-facing consequences, this patch shouldn't have any side effect and is there to give an overview of what future conversion patches will look like. Conversions are quite mechanical, but imply some code churn, which is the price for the extra clarity and the possibility of type checking. To summarise, all the behaviour changes required for the dscp_t type approach to work should be contained in patches 1-3. These changes are edge cases of ip-route and ip-rule that don't currently work properly. So they should be safe. Also, a kernel selftest is added for each of them. Finally, this work also paves the way for allowing the usage of the 3 high order DSCP bits in IPv4 (a few call paths already handle them, but in general the stack clears them before IPv4 rule and route lookups). References: [1] LPC 2021 talk: - https://linuxplumbersconf.org/event/11/contributions/943/ - Direct link to slide deck: https://linuxplumbersconf.org/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf [2] RFC version of this series: - https://lore.kernel.org/netdev/cover.1638814614.git.gnault@redhat.com/ ==================== Link: https://lore.kernel.org/r/cover.1643981839.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08ipv4: Use dscp_t in struct fib_aliasGuillaume Nault4-34/+45
Use the new dscp_t type to replace the fa_tos field of fib_alias. This ensures ECN bits are ignored and makes the field compatible with the fc_dscp field of struct fib_config. Converting old *tos variables and fields to dscp_t allows sparse to flag incorrect uses of DSCP and ECN bits. This patch is entirely about type annotation and shouldn't change any existing behaviour. Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08ipv4: Reject routes specifying ECN bits in rtm_tosGuillaume Nault4-4/+93
Use the new dscp_t type to replace the fc_tos field of fib_config, to ensure IPv4 routes aren't influenced by ECN bits when configured with non-zero rtm_tos. Before this patch, IPv4 routes specifying an rtm_tos with some of the ECN bits set were accepted. However they wouldn't work (never match) as IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB lookup (although a few buggy code paths don't). After this patch, IPv4 routes specifying an rtm_tos with any ECN bit set are rejected. Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted, but treated as if it were 0. Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08ipv4: Stop taking ECN bits into account in fib4-rulesGuillaume Nault2-9/+39
Use the new dscp_t type to replace the tos field of struct fib4_rule, so that fib4-rules consistently ignore ECN bits. Before this patch, fib4-rules did accept rules with the high order ECN bit set (but not the low order one). Also, it relied on its callers masking the ECN bits of ->flowi4_tos to prevent those from influencing the result. This was brittle and a few call paths still do the lookup without masking the ECN bits first. After this patch fib4-rules only compare the DSCP bits. ECN can't influence the result anymore, even if the caller didn't mask these bits. Also, fib4-rules now must have both ECN bits cleared or they will be rejected. Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rulesGuillaume Nault4-7/+105
Define a dscp_t type and its appropriate helpers that ensure ECN bits are not taken into account when handling DSCP. Use this new type to replace the tclass field of struct fib6_rule, so that fib6-rules don't get influenced by ECN bits anymore. Before this patch, fib6-rules didn't make any distinction between the DSCP and ECN bits. Therefore, rules specifying a DSCP (tos or dsfield options in iproute2) stopped working as soon as a packet had at least one of its ECN bits set (as a workaround one could create four rules for each DSCP value to match, one for each possible ECN value). After this patch fib6-rules only compare the DSCP bits. ECN doesn't influence the result anymore. Also, fib6-rules now must have the ECN bits cleared or they will be rejected. Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: David Ahern <dsahern@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
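The idea of a "sanitised" DSCP type can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: the kernel relies on Sparse annotations, whereas this sketch wraps the value in a struct so that an ordinary C compiler rejects mixing it with a plain byte. The helper names and DSCP_MASK are hypothetical; the mask value 0xfc follows from DSCP occupying the upper six bits of the DS field and ECN the lower two.

```c
#include <assert.h>	/* only for the usage checks mentioned below */
#include <stdbool.h>
#include <stdint.h>

#define DSCP_MASK 0xfc	/* upper 6 bits: DSCP; lower 2 bits: ECN */

/* Wrapper type: cannot be compared with a raw uint8_t by accident. */
typedef struct {
	uint8_t bits;
} dscp_t;

/* Strip the ECN bits from a raw DS field / Traffic Class byte. */
dscp_t dsfield_to_dscp(uint8_t dsfield)
{
	return (dscp_t){ .bits = dsfield & DSCP_MASK };
}

/* A user-supplied rule value is acceptable only with both ECN bits clear. */
bool dsfield_is_valid_dscp(uint8_t val)
{
	return (val & ~DSCP_MASK) == 0;
}

/* Rule matching compares DSCP only, whatever the packet's ECN bits are. */
bool rule_matches(dscp_t rule, uint8_t pkt_dsfield)
{
	return rule.bits == dsfield_to_dscp(pkt_dsfield).bits;
}
```

A rule built from 0x28 matches packets carrying 0x28, 0x29, 0x2a or 0x2b, so a single rule replaces the four-rules-per-DSCP workaround mentioned above, while a rule value of 0x29 fails validation.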
2022-02-08net: stmmac: optimize locking around PTP clock readsYannick Vignon5-20/+20
Reading the PTP clock is a simple operation requiring only 3 register reads. Under a PREEMPT_RT kernel, protecting those reads by a spin_lock is counter-productive: if the 2nd task preempting the 1st has a higher prio but needs to read time as well, it will require 2 context switches, which will pretty much always be more costly than just disabling preemption for the duration of the reads. Moreover, with the code logic recently added to get_systime(), disabling preemption is not even required anymore: reads and writes just need to be protected from each other, to prevent a clock read while the clock is being updated. Improve the above situation by replacing the PTP spinlock by a rwlock, and using read_lock for PTP clock reads so simultaneous reads do not block each other. Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com> Link: https://lore.kernel.org/r/20220204135545.2770625-1-yannick.vignon@oss.nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08net: phy: marvell: Fix RGMII Tx/Rx delays setting in 88e1121-compatible PHYsPavel Parkhomenko1-4/+6
It is mandatory for software to issue a reset upon modifying the RGMII Receive Timing Control and RGMII Transmit Timing Control bit fields of MAC Specific Control register 2 (page 2, register 21), otherwise the changes won't be perceived by the PHY (the same is applicable for a lot of other registers). Not setting the RGMII delays on platforms that expect this to be done on the PHY side will consequently cause traffic loss. We discovered that the denoted soft-reset is missing in the m88e1121_config_aneg() method for the case where the RGMII delays are modified but the MDIx polarity isn't changed or the auto-negotiation is left enabled, thus causing traffic loss on our platform with Marvell Alaska 88E1510 installed. Let's fix that by issuing the soft-reset if the delays have actually been set in the m88e1121_config_aneg_rgmii_delays() method. Cc: stable@vger.kernel.org Fixes: d6ab93364734 ("net: phy: marvell: Avoid unnecessary soft reset") Signed-off-by: Pavel Parkhomenko <Pavel.Parkhomenko@baikalelectronics.ru> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Serge Semin <fancer.lancer@gmail.com> Link: https://lore.kernel.org/r/20220205203932.26899-1-Pavel.Parkhomenko@baikalelectronics.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08net: typhoon: include <net/vxlan.h>Eric Dumazet1-0/+3
We need this to get vxlan_features_check() definition. Fixes: d2692eee05b8 ("net: typhoon: implement ndo_features_check method") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220208003502.1799728-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-08bpf: test_run: Fix overflow in bpf_test_finish frags parsingStanislav Fomichev1-2/+3
This place also uses signed min_t and passes this signed int to copy_to_user (which accepts an unsigned argument). I don't think there is an issue, but let's be consistent. Fixes: 7855e0db150ad ("bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220204235849.14658-2-sdf@google.com
2022-02-08bpf: test_run: Fix overflow in xdp frags parsingStanislav Fomichev1-2/+2
When kattr->test.data_size_in > INT_MAX, signed min_t will assign negative value to data_len. This negative value then gets passed over to copy_from_user where it is converted to (big) unsigned. Use unsigned min_t to avoid this overflow. usercopy: Kernel memory overwrite attempt detected to wrapped address (offset 0, size 18446612140539162846)! ------------[ cut here ]------------ kernel BUG at mm/usercopy.c:102! invalid opcode: 0000 [#1] SMP KASAN Modules linked in: CPU: 0 PID: 3781 Comm: syz-executor226 Not tainted 4.15.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:usercopy_abort+0xbd/0xbf mm/usercopy.c:102 RSP: 0018:ffff8801e9703a38 EFLAGS: 00010286 RAX: 000000000000006c RBX: ffffffff84fc7040 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff816560a2 RDI: ffffed003d2e0739 RBP: ffff8801e9703a90 R08: 000000000000006c R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff84fc73a0 R13: ffffffff84fc7180 R14: ffffffff84fc7040 R15: ffffffff84fc7040 FS: 00007f54e0bec300(0000) GS:ffff8801f6600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000280 CR3: 00000001e90ea000 CR4: 00000000003426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: check_bogus_address mm/usercopy.c:155 [inline] __check_object_size mm/usercopy.c:263 [inline] __check_object_size.cold+0x8c/0xad mm/usercopy.c:253 check_object_size include/linux/thread_info.h:112 [inline] check_copy_size include/linux/thread_info.h:143 [inline] copy_from_user include/linux/uaccess.h:142 [inline] bpf_prog_test_run_xdp+0xe57/0x1240 net/bpf/test_run.c:989 bpf_prog_test_run kernel/bpf/syscall.c:3377 [inline] __sys_bpf+0xdf2/0x4a50 kernel/bpf/syscall.c:4679 SYSC_bpf kernel/bpf/syscall.c:4765 [inline] SyS_bpf+0x26/0x50 kernel/bpf/syscall.c:4763 do_syscall_64+0x21a/0x3e0 
arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x46/0xbb Fixes: 1c1949982524 ("bpf: introduce frags support to bpf_prog_test_run_xdp()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220204235849.14658-1-sdf@google.com
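The overflow described above can be reproduced in user space with a simplified min_t. This is a sketch under stated assumptions: the macro below is a reduced version of the kernel's min_t (which also guards against double evaluation), the function names are hypothetical, and the conversion of an out-of-range unsigned value to int is implementation-defined in C (it wraps to a negative number on the usual two's complement compilers such as GCC and Clang).

```c
#include <assert.h>	/* only for the usage checks mentioned below */
#include <stddef.h>
#include <stdint.h>

/* Simplified min_t: compare both operands after casting them to "type". */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/*
 * Buggy pattern: a user-controlled u32 larger than INT_MAX becomes negative
 * when cast to int, wins the "min", and turns into a huge size_t when it is
 * later passed as a copy length.
 */
size_t copy_len_signed(uint32_t data_size_in, uint32_t limit)
{
	int data_len = min_t(int, data_size_in, limit);

	return (size_t)data_len;
}

/* Fixed pattern: do the comparison in an unsigned type. */
size_t copy_len_unsigned(uint32_t data_size_in, uint32_t limit)
{
	uint32_t data_len = min_t(uint32_t, data_size_in, limit);

	return (size_t)data_len;
}
```

For data_size_in = 0x80000000 and limit = 4096, copy_len_unsigned() returns 4096, whereas copy_len_signed() returns a value far above the limit, which is the kind of length the usercopy check in the trace above rejects.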
2022-02-08Merge branch 'bpf_prog_pack allocator'Alexei Starovoitov9-62/+349
Song Liu says: ==================== Changes v8 => v9: 1. Fix an error with multi function program, in 4/9. Changes v7 => v8: 1. Rebase and fix conflicts. 2. Lock text_mutex for text_poke_copy. (Daniel) Changes v6 => v7: 1. Redesign the interface between generic and arch logic, based on feedback from Alexei and Ilya. 2. Split 6/7 of v6 to 7/9 and 8/9 in v7, for cleaner logic. 3. Add bpf_arch_text_copy in 6/9. Changes v5 => v6: 1. Make jit_hole_buffer 128 byte long. Only fill the first and last 128 bytes of header with INT3. (Alexei) 2. Use kvmalloc for temporary buffer. (Alexei) 3. Rename tmp_header/tmp_image => rw_header/rw_image. Remove tmp_image from x64_jit_data. (Alexei) 4. Change fall back round_up_to in bpf_jit_binary_alloc_pack() from BPF_PROG_MAX_PACK_PROG_SIZE to PAGE_SIZE. Changes v4 => v5: 1. Do not use atomic64 for bpf_jit_current. (Alexei) Changes v3 => v4: 1. Rename text_poke_jit() => text_poke_copy(). (Peter) 2. Change comment style. (Peter) Changes v2 => v3: 1. Fix tailcall. Changes v1 => v2: 1. Use text_poke instead of writing through linear mapping. (Peter) 2. Avoid making changes to non-x86_64 code. Most BPF programs are small, but they consume a page each. For systems with busy traffic and many BPF programs, this could also add significant pressure to the instruction TLB. High iTLB pressure usually causes a slowdown for the whole system, which includes visible performance degradation for production workloads. This set tries to solve this problem with a customized allocator that packs multiple programs into a huge page. Patches 1-6 prepare the work. Patch 7 contains the key logic of the bpf_prog_pack allocator. Patch 8 contains the bpf_jit_binary_pack_alloc logic on top of the bpf_prog_pack allocator. Patch 9 uses this allocator in the x86_64 jit. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-02-08bpf, x86_64: Use bpf_jit_binary_pack_allocSong Liu1-27/+31
Use bpf_jit_binary_pack_alloc in x86_64 jit. The jit engine first writes the program to the rw buffer. When the jit is done, the program is copied to the final location with bpf_jit_binary_pack_finalize. Note that we need to do bpf_tail_call_direct_fixup after finalize. Therefore, the text_live = false logic in __bpf_arch_text_poke is no longer needed. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-10-song@kernel.org
2022-02-08bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]Song Liu3-10/+120
This is the jit binary allocator built on top of bpf_prog_pack. bpf_prog_pack allocates RO memory, which cannot be used directly by the JIT engine. Therefore, a temporary rw buffer is allocated for the JIT engine. Once JIT is done, bpf_jit_binary_pack_finalize is used to copy the program to the RO memory. bpf_jit_binary_pack_alloc reserves 16 bytes of extra space for illegal instructions, which is smaller than the 128 bytes of space reserved by bpf_jit_binary_alloc. This change is necessary for bpf_jit_binary_hdr to find the correct header. Also, flag use_bpf_prog_pack is added to differentiate a program allocated by bpf_jit_binary_pack_alloc. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-9-song@kernel.org
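The two-phase flow described in this commit (write into a temporary rw buffer, then copy to the read-only location at finalize time) can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: every `demo_*` name is hypothetical, plain `memcpy` stands in for the arch text-copy hook, and none of it is the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an arch text-copy hook; here just memcpy. */
static void *demo_text_copy(void *dst, const void *src, size_t len)
{
	return memcpy(dst, src, len);
}

/* Hypothetical image descriptor: final location plus staging buffer. */
struct demo_jit_image {
	unsigned char *ro;	/* final location, never written by the JIT */
	unsigned char *rw;	/* temporary writable staging buffer */
	size_t size;
};

/* Phase 1: hand the JIT a zeroed writable buffer of the same size. */
static int demo_image_alloc(struct demo_jit_image *img,
			    unsigned char *ro, size_t size)
{
	img->rw = calloc(1, size);
	if (!img->rw)
		return -1;
	img->ro = ro;
	img->size = size;
	return 0;
}

/* Phase 2: copy the finished program to its final place, drop staging. */
static int demo_image_finalize(struct demo_jit_image *img)
{
	void *ret = demo_text_copy(img->ro, img->rw, img->size);

	free(img->rw);
	img->rw = NULL;
	return ret == img->ro ? 0 : -1;
}
```

In this model the final location stays untouched until `demo_image_finalize()` runs, which is the property that lets the real target stay read-only while the JIT is still emitting code.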
2022-02-08bpf: Introduce bpf_prog_pack allocatorSong Liu1-0/+127
Most BPF programs are small, but they consume a page each. For systems with busy traffic and many BPF programs, this could add significant pressure to the instruction TLB. High iTLB pressure usually causes a slowdown for the whole system, which includes visible performance degradation for production workloads. Introduce bpf_prog_pack allocator to pack multiple BPF programs in a huge page. The memory is then allocated in 64 byte chunks. Memory allocated by bpf_prog_pack allocator is RO protected after initial allocation. To write to it, the user (jit engine) needs to use the text poke API. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-8-song@kernel.org
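To illustrate the idea of carving one large region into 64-byte chunks, here is a minimal userspace sketch of a first-fit chunk allocator tracked by a bitmap. It assumes a 2MB pack and a first-fit policy; all `demo_*` names are hypothetical, and the sketch does not model RO protection or the text poke path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes: one 2MB pack split into 64-byte chunks. */
#define DEMO_PACK_SIZE	(2u * 1024 * 1024)
#define DEMO_CHUNK_SIZE	64u
#define DEMO_NCHUNKS	(DEMO_PACK_SIZE / DEMO_CHUNK_SIZE)

/* Hypothetical pack bookkeeping: one bit per chunk, set when in use. */
struct demo_pack {
	uint8_t used[DEMO_NCHUNKS / 8];
};

static int demo_test_bit(const struct demo_pack *p, size_t i)
{
	return (p->used[i / 8] >> (i % 8)) & 1;
}

static void demo_set_bits(struct demo_pack *p, size_t start, size_t n, int on)
{
	for (size_t i = start; i < start + n; i++) {
		if (on)
			p->used[i / 8] |= (uint8_t)(1u << (i % 8));
		else
			p->used[i / 8] &= (uint8_t)~(1u << (i % 8));
	}
}

/* Round a byte size up to a whole number of chunks. */
static size_t demo_size_to_chunks(size_t size)
{
	return (size + DEMO_CHUNK_SIZE - 1) / DEMO_CHUNK_SIZE;
}

/* First-fit search for a free run; returns first chunk index or -1. */
static long demo_pack_alloc(struct demo_pack *p, size_t size)
{
	size_t need = demo_size_to_chunks(size);
	size_t run = 0;

	if (need == 0 || need > DEMO_NCHUNKS)
		return -1;
	for (size_t i = 0; i < DEMO_NCHUNKS; i++) {
		run = demo_test_bit(p, i) ? 0 : run + 1;
		if (run == need) {
			size_t start = i + 1 - need;

			demo_set_bits(p, start, need, 1);
			return (long)start;
		}
	}
	return -1;
}

/* Release the chunks that a previous allocation of 'size' bytes took. */
static void demo_pack_free(struct demo_pack *p, long start, size_t size)
{
	demo_set_bits(p, (size_t)start, demo_size_to_chunks(size), 0);
}
```

With this layout a 100-byte program occupies two chunks (128 bytes) instead of a full page, so many small programs can share the same huge-page mapping.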
2022-02-08bpf: Introduce bpf_arch_text_copySong Liu3-0/+14
This will be used to copy JITed text to RO protected module memory. On x86, bpf_arch_text_copy is implemented with text_poke_copy. bpf_arch_text_copy returns pointer to dst on success, and ERR_PTR(errno) on errors. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-7-song@kernel.org
2022-02-08x86/alternative: Introduce text_poke_copySong Liu2-0/+35
This will be used by the BPF jit compiler to dump JITed binary to a RX huge page, and thus allow multiple BPF programs to share a huge (2MB) page. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-6-song@kernel.org
2022-02-08bpf: Use prog->jited_len in bpf_prog_ksym_set_addr()Song Liu2-4/+2
Using prog->jited_len is simpler and more accurate than the current estimation (header + header->size). Also, fix missing prog->jited_len with multi function program. This hasn't been a real issue before this. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-5-song@kernel.org
2022-02-08bpf: Use size instead of pages in bpf_binary_headerSong Liu2-9/+8
This is necessary to charge sub page memory for the BPF program. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-4-song@kernel.org
2022-02-08bpf: Use bytes instead of pages for bpf_jit_[charge|uncharge]_modmemSong Liu3-14/+13
This enables sub-page memory charge and allocation. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-3-song@kernel.org
2022-02-08x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAPSong Liu1-0/+1
This enables module_alloc() to allocate a huge page for 2MB+ requests. To check the difference of this change, we need to enable config CONFIG_PTDUMP_DEBUGFS, and call module_alloc(2MB). Before the change, /sys/kernel/debug/page_tables/kernel shows pte for this map. With the change, /sys/kernel/debug/page_tables/ shows pmd for this map. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-2-song@kernel.org
2022-02-08Merge tag '5.17-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds6-14/+70
Pull ksmbd server fixes from Steve French: - NTLMSSP authentication improvement - RDMA (smbdirect) fix allowing broader set of NICs to be supported - improved buffer validation - additional small fixes, including a posix extensions fix for stable * tag '5.17-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: add support for key exchange ksmbd: reduce smb direct max read/write size ksmbd: don't align last entry offset in smb2 query directory ksmbd: fix same UniqueId for dot and dotdot entries ksmbd: smbd: validate buffer descriptor structures ksmbd: fix SMB 3.11 posix extension mount failure
2022-02-08igb: refactor XDP registrationCorinna Vinschen2-10/+13
On changing the RX ring parameters igb uses a hack to avoid a warning when calling xdp_rxq_info_reg via igb_setup_rx_resources. It just clears the struct xdp_rxq_info content. Instead, change this to unregister if we're already registered. Align code to the igc code. Fixes: 9cbc948b5a20c ("igb: add XDP support") Signed-off-by: Corinna Vinschen <vinschen@redhat.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-02-08igc: avoid kernel warning when changing RX ring parametersCorinna Vinschen1-0/+3
Calling ethtool changing the RX ring parameters like this: $ ethtool -G eth0 rx 1024 on igc triggers kernel warnings like this: [ 225.198467] ------------[ cut here ]------------ [ 225.198473] Missing unregister, handled but fix driver [ 225.198485] WARNING: CPU: 7 PID: 959 at net/core/xdp.c:168 xdp_rxq_info_reg+0x79/0xd0 [...] [ 225.198601] Call Trace: [ 225.198604] <TASK> [ 225.198609] igc_setup_rx_resources+0x3f/0xe0 [igc] [ 225.198617] igc_ethtool_set_ringparam+0x30e/0x450 [igc] [ 225.198626] ethnl_set_rings+0x18a/0x250 [ 225.198631] genl_family_rcv_msg_doit+0xca/0x110 [ 225.198637] genl_rcv_msg+0xce/0x1c0 [ 225.198640] ? rings_prepare_data+0x60/0x60 [ 225.198644] ? genl_get_cmd+0xd0/0xd0 [ 225.198647] netlink_rcv_skb+0x4e/0xf0 [ 225.198652] genl_rcv+0x24/0x40 [ 225.198655] netlink_unicast+0x20e/0x330 [ 225.198659] netlink_sendmsg+0x23f/0x480 [ 225.198663] sock_sendmsg+0x5b/0x60 [ 225.198667] __sys_sendto+0xf0/0x160 [ 225.198671] ? handle_mm_fault+0xb2/0x280 [ 225.198676] ? do_user_addr_fault+0x1eb/0x690 [ 225.198680] __x64_sys_sendto+0x20/0x30 [ 225.198683] do_syscall_64+0x38/0x90 [ 225.198687] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 225.198693] RIP: 0033:0x7f7ae38ac3aa igc_ethtool_set_ringparam() copies the igc_ring structure but neglects to reset the xdp_rxq_info member before calling igc_setup_rx_resources(). This in turn calls xdp_rxq_info_reg() with an already registered xdp_rxq_info. Make sure to unregister the xdp_rxq_info structure first in igc_setup_rx_resources. Fixes: 73f1071c1d29 ("igc: Add support for XDP_TX action") Reported-by: Lennert Buytenhek <buytenh@arista.com> Signed-off-by: Corinna Vinschen <vinschen@redhat.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Tested-by: Dvora Fuxbrumer <dvorax.fuxbrumer@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
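The failure mode and the fix can be modelled in userspace as below. This is a simplified sketch, not driver code: the `demo_*` types and functions are hypothetical stand-ins for the registration state, and a counter stands in for the kernel warning.

```c
#include <assert.h>

/* Hypothetical model of a queue-info record with a registration state. */
enum demo_reg_state { DEMO_UNREGISTERED = 0, DEMO_REGISTERED };

struct demo_rxq_info {
	enum demo_reg_state state;
	int queue_index;
};

/* Counts how often registration saw an already-registered record. */
static int demo_warnings;

static int demo_rxq_is_reg(const struct demo_rxq_info *q)
{
	return q->state == DEMO_REGISTERED;
}

static void demo_rxq_unreg(struct demo_rxq_info *q)
{
	q->state = DEMO_UNREGISTERED;
}

/* Registering twice is tolerated but flagged, like the warning above. */
static void demo_rxq_reg(struct demo_rxq_info *q, int queue_index)
{
	if (demo_rxq_is_reg(q)) {
		demo_warnings++;	/* "Missing unregister" case */
		demo_rxq_unreg(q);
	}
	q->queue_index = queue_index;
	q->state = DEMO_REGISTERED;
}

/* Fixed setup path: drop any stale registration before registering. */
static void demo_setup_rx(struct demo_rxq_info *q, int queue_index)
{
	if (demo_rxq_is_reg(q))
		demo_rxq_unreg(q);
	demo_rxq_reg(q, queue_index);
}
```

Copying a ring structure also copies its registration state, so the setup path in this model has to handle a record that already looks registered.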
2022-02-07Merge branch 'bpf: Fix strict mode calculation'Andrii Nakryiko3-13/+2
Mauricio Vásquez <mauricio@kinvolk.io> says: ==================== This series fixes a bad calculation of strict mode in two places. It also updates libbpf to make it easier for the users to disable a specific LIBBPF_STRICT_* flag. v1 -> v2: - remove check in libbpf_set_strict_mode() - split in different commits v1: https://lore.kernel.org/bpf/20220204220435.301896-1-mauricio@kinvolk.io/ ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2022-02-07selftests/bpf: Fix strict mode calculationMauricio Vásquez1-1/+1
"(__LIBBPF_STRICT_LAST - 1) & ~LIBBPF_STRICT_MAP_DEFINITIONS" is wrong as it is equal to 0 (LIBBPF_STRICT_NONE). Let's use "LIBBPF_STRICT_ALL & ~LIBBPF_STRICT_MAP_DEFINITIONS" now that the previous commit makes it possible in libbpf. Fixes: 93b8952d223a ("libbpf: deprecate legacy BPF map definitions") Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220207145052.124421-4-mauricio@kinvolk.io
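The arithmetic behind this fix can be checked with a small stand-alone sketch. The enum below is a local stand-in with made-up names and values, not libbpf's header; the only property it assumes is that the flag being masked out is the highest bit and that the sentinel is declared right after it, so it takes that value plus one.

```c
#include <assert.h>

/* Hypothetical flag set; the sentinel is implicitly newest flag + 1. */
enum demo_strict_flag {
	DEMO_STRICT_NONE	= 0x00,
	DEMO_STRICT_OLD_A	= 0x01,
	DEMO_STRICT_OLD_B	= 0x02,
	DEMO_STRICT_NEWEST	= 0x20,
	DEMO_STRICT_LAST_,	/* 0x21, assigned by the compiler */
};

#define DEMO_STRICT_ALL 0xffffffffu

/* Old expression: (0x21 - 1) is just 0x20, so masking it leaves 0. */
static unsigned int demo_mode_buggy(void)
{
	return (unsigned int)((DEMO_STRICT_LAST_ - 1) & ~DEMO_STRICT_NEWEST);
}

/* New expression: start from all bits, then clear the unwanted flag. */
static unsigned int demo_mode_fixed(void)
{
	return DEMO_STRICT_ALL & ~(unsigned int)DEMO_STRICT_NEWEST;
}
```

Here `demo_mode_buggy()` evaluates to 0 (no strict flags at all), while `demo_mode_fixed()` keeps every bit except the one being cleared.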