path: root/include
Age | Commit message | Author | Files | Lines
2021-03-29 | nexthop: Rename artifacts related to legacy multipath nexthop groups | Petr Machata | 1 | -2/+2
After resilient next-hop groups have been added recently, there are two types of multipath next-hop groups: the legacy "mpath", and the new "resilient". Calling the legacy next-hop group type "mpath" is unfortunate, because that describes the fact that a packet could be forwarded in one of several paths, which is also true for the resilient next-hop groups. Therefore, to make the naming clearer, rename various artifacts to reflect the assumptions made. As of this patch: - The flag for multipath groups is nh_grp_entry::is_multipath. This includes the legacy and resilient groups, as well as any future group types that behave as multipath groups. Functions that assume this have "mpath" in the name. - The flag for legacy multipath groups is nh_grp_entry::hash_threshold. Functions that assume this have "hthr" in the name. - The flag for resilient groups is nh_grp_entry::resilient. Functions that assume this have "res" in the name. Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as well. UAPI artifacts were obviously left intact. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-27 | mld: add mc_lock for protecting per-interface mld data | Taehee Yoo | 1 | -0/+1
The purpose of this lock is to avoid a bottleneck in the query/report event handler logic. After the previous patches, almost all mld data is protected by RTNL, so the query and report event handlers, which are data-path logic, acquire RTNL too. If a lot of query and report events are received, they hold RTNL for a long time, turning it into a control-plane bottleneck. In order to avoid this, mc_lock is added: it protects only per-interface mld data, which is exactly what the query/report event handlers use, so rtnl_lock is no longer needed there and the bottleneck disappears. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-27 | mld: add new workqueues for processing mld events | Taehee Yoo | 2 | -1/+11
When query/report packets are received, the mld module processes them. But they are processed in BH context, so sleepable functions cannot be used. In order to switch context, two workqueues are added to process query and report events. mc_{query | report}_queue are added to struct inet6_dev as per-interface queues, and mc_{query | report}_work are the corresponding work structures. When a query or report event is received, the skb is queued to the proper queue and the worker function is scheduled immediately. The queues are protected by the mc_{query | report}_lock spinlocks, and the worker functions are protected by RTNL. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
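The queue-in-BH, process-in-a-worker pattern described above can be summarized in a short, self-contained sketch. All names below (mld_event_queue, mld_event_work, ...) are hypothetical and simplified; the real code adds mc_{query | report}_queue/work/lock to struct inet6_dev in net/ipv6/mcast.c and differs in detail:

    #include <linux/skbuff.h>
    #include <linux/workqueue.h>

    struct mld_event_queue {
            struct sk_buff_head skbs;   /* has its own internal spinlock */
            struct work_struct  work;
    };

    static void mld_event_work(struct work_struct *work)
    {
            struct mld_event_queue *q =
                    container_of(work, struct mld_event_queue, work);
            struct sk_buff *skb;

            /* Process context: sleepable; the real worker also takes RTNL. */
            while ((skb = skb_dequeue(&q->skbs)) != NULL)
                    consume_skb(skb);   /* real code parses the query/report here */
    }

    static void mld_event_queue_init(struct mld_event_queue *q)
    {
            skb_queue_head_init(&q->skbs);
            INIT_WORK(&q->work, mld_event_work);
    }

    /* Called from the BH receive path: park the skb, defer to process context. */
    static void mld_event_enqueue(struct mld_event_queue *q, struct sk_buff *skb)
    {
            skb_queue_tail(&q->skbs, skb);
            schedule_work(&q->work);
    }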
2021-03-27 | mld: convert ifmcaddr6 to RCU | Taehee Yoo | 1 | -3/+4
The ifmcaddr6 has been protected by inet6_dev->lock (an rwlock), so its critical sections run in atomic context. In order to switch this context, the locking needs to change. ifmcaddr6 is actually already protected by RTNL, so if it is converted to RCU, its control-path context can become sleepable. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
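For readers unfamiliar with this kind of conversion, here is a generic, hypothetical sketch of the pattern (not the actual mcast.c code): the data path walks the list under rcu_read_lock(), while the control path, already serialized by RTNL, publishes new entries with rcu_assign_pointer() and can therefore sleep (frees would likewise go through kfree_rcu()):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct mc_entry {
            struct mc_entry __rcu *next;
            struct rcu_head        rcu;
    };

    struct mc_dev {
            struct mc_entry __rcu *mc_list; /* writers serialized by RTNL */
    };

    /* Fast path (e.g. packet reception): lockless read side, atomic-safe. */
    static int mc_count(struct mc_dev *dev)
    {
            struct mc_entry *mc;
            int n = 0;

            rcu_read_lock();
            for (mc = rcu_dereference(dev->mc_list); mc;
                 mc = rcu_dereference(mc->next))
                    n++;
            rcu_read_unlock();
            return n;
    }

    /* Control path, caller holds RTNL: may sleep, e.g. GFP_KERNEL allocation. */
    static int mc_add(struct mc_dev *dev)
    {
            struct mc_entry *mc = kzalloc(sizeof(*mc), GFP_KERNEL);

            if (!mc)
                    return -ENOMEM;
            RCU_INIT_POINTER(mc->next,
                             rcu_dereference_protected(dev->mc_list, 1));
            rcu_assign_pointer(dev->mc_list, mc);   /* publish to readers */
            return 0;
    }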
2021-03-27 | mld: convert ip6_sf_list to RCU | Taehee Yoo | 1 | -3/+4
The ip6_sf_list has been protected by mca_lock (a spinlock), so its critical sections run in atomic context. In order to switch this context, the locking needs to change. ip6_sf_list is actually already protected by RTNL, so if it is converted to RCU, its control-path context can become sleepable. mca_lock is not removed yet, however, because ifmcaddr6 has not been converted to RCU, so the code is not fully converted to a sleepable context at this point. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-27 | mld: convert ipv6_mc_socklist->sflist to RCU | Taehee Yoo | 1 | -2/+2
The sflist has been protected by an rwlock, so its critical sections run in atomic context. In order to switch this context, the locking needs to change. The sflist is actually already protected by RTNL, so if it is converted to RCU, its control-path context can become sleepable. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-27 | mld: get rid of inet6_dev->mc_lock | Taehee Yoo | 1 | -1/+0
The purpose of mc_lock is to protect inet6_dev->mc_tomb. But mc_tomb is already protected by RTNL and all functions that manipulate it are called under RTNL, so mc_lock is not needed. Furthermore, it is a spinlock, so its critical section is atomic; in order to reduce atomic context, it should be removed. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-27 | mld: convert from timer to delayed work | Taehee Yoo | 1 | -4/+4
mcast.c has several timers for deferring work. A timer's expire handler runs in atomic context, so it can't use sleepable things such as GFP_KERNEL allocations, mutexes, etc. In order to use sleepable APIs, convert the timers to delayed work. Some critical sections are still used by both process and BH context, so spin_lock_bh() and the rwlock remain for now. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
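A minimal sketch of a timer-to-delayed-work conversion, with made-up names (the real mcast.c state and helpers differ); the key point is that the work callback runs in process context while a timer callback does not:

    #include <linux/workqueue.h>

    struct mld_state {
            struct delayed_work gq_work;    /* was: struct timer_list gq_timer */
    };

    static void mld_gq_work(struct work_struct *work)
    {
            struct mld_state *st = container_of(to_delayed_work(work),
                                                struct mld_state, gq_work);

            /* Process context: GFP_KERNEL, mutexes and RTNL are all fine here. */
            (void)st;
    }

    static void mld_state_init(struct mld_state *st)
    {
            INIT_DELAYED_WORK(&st->gq_work, mld_gq_work);   /* was: timer_setup() */
    }

    static void mld_gq_start(struct mld_state *st, unsigned long delay)
    {
            /* was: mod_timer(&st->gq_timer, jiffies + delay) */
            mod_delayed_work(system_wq, &st->gq_work, delay);
    }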
2021-03-27 | ethtool: document the enum values not defines | Jakub Kicinski | 1 | -10/+10
kdoc does not have good support for documenting defines, and we can't abuse the enum documentation because it generates warnings. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-27 | ethtool: fec: add note about reuse of reserved | Jakub Kicinski | 1 | -0/+4
struct ethtool_fecparam::reserved can't be used in SET, because ethtool user space doesn't zero-initialize the structure. Make this clear. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | tcp: convert eligible sysctls to u8 | Eric Dumazet | 1 | -34/+34
Many tcp sysctls are either bools or small ints that can fit into u8. Reducing the space taken by sysctls can save a few cache line misses when sending/receiving data while cpu caches are empty, for example after a cpu idle period. This is hard to measure with typical network performance tests, but after this patch, struct netns_ipv4 has shrunk by three cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | inet: convert tcp_early_demux and udp_early_demux to u8 | Eric Dumazet | 1 | -2/+2
For these sysctls, their dedicated helpers have to use proc_dou8vec_minmax(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | ipv4: convert ip_forward_update_priority sysctl to u8 | Eric Dumazet | 1 | -1/+1
This sysctl uses ip_fwd_update_priority() helper, so the conversion needs to change it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | ipv4: shrink netns_ipv4 with sysctl conversions | Eric Dumazet | 1 | -16/+16
These sysctls that can fit in one byte instead of one int are converted to save space and thus reduce cache line misses. - icmp_echo_ignore_all, icmp_echo_ignore_broadcasts, - icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr - tcp_ecn, tcp_ecn_fallback - ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu - ip_nonlocal_bind, ip_autobind_reuse - ip_dynaddr, ip_early_demux, raw_l3mdev_accept - nexthop_compat_mode, fwmark_reflect Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | sysctl: add proc_dou8vec_minmax() | Eric Dumazet | 1 | -0/+2
Networking has many sysctls that could fit in one u8. This patch adds proc_dou8vec_minmax() for this purpose. Note that the .extra1 and .extra2 fields are pointing to integers, because it makes conversions easier. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
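A hypothetical example of how such a table entry looks with the new handler; the sysctl name and variable below are made up, while proc_dou8vec_minmax(), SYSCTL_ZERO and SYSCTL_ONE are the helpers the commit refers to (registration would go through the usual register_net_sysctl()/register_sysctl() paths):

    #include <linux/sysctl.h>
    #include <linux/types.h>

    static u8 example_flag;     /* one byte instead of a full int */

    static struct ctl_table example_table[] = {
            {
                    .procname     = "example_flag",
                    .data         = &example_flag,
                    .maxlen       = sizeof(u8),
                    .mode         = 0644,
                    .proc_handler = proc_dou8vec_minmax,
                    .extra1       = SYSCTL_ZERO,   /* int bounds, as noted above */
                    .extra2       = SYSCTL_ONE,
            },
            { }
    };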
2021-03-26 | net: stmmac: use interrupt mode INTM=1 for multi-MSI | Wong, Vee Khee | 1 | -0/+1
For interrupt mode INTM=0, TX/RX transfer complete will trigger signal not only on sbd_perch_[tx|rx]_intr_o (Transmit/Receive Per Channel) but also on the sbd_intr_o (Common). As for multi-MSI implementation, setting interrupt mode INTM=1 is more efficient as each TX intr and RX intr (TI/RI) will be handled by TX/RX ISR without the need of calling the common MAC ISR. Updated the TX/RX NORMAL interrupts status checking process as the NIS status bit is not asserted for any RI/TI events for INTM=1. Signed-off-by: Wong, Vee Khee <vee.khee.wong@intel.com> Co-developed-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | net: stmmac: introduce MSI Interrupt routines for mac, safety, RX & TX | Ong Boon Leong | 1 | -0/+8
Now we introduce MSI interrupt service routines and hook them up when stmmac_open() sees a valid irq line being requested: stmmac_mac_interrupt() for MAC (dev->irq), WOL (wol_irq) and LPI (lpi_irq); stmmac_safety_interrupt() for Safety Feat Correctible Error (sfty_ce_irq) and Uncorrectible Error (sfty_ue_irq); stmmac_msi_intr_rx() for all RX MSI irqs (rx_irq); stmmac_msi_intr_tx() for all TX MSI irqs (tx_irq). Each IRQ has a unique name so that they can easily be differentiated under /proc/interrupts. Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | ethtool: clarify the ethtool FEC interface | Jakub Kicinski | 1 | -7/+30
The definition of the FEC driver interface is quite unclear. Improve the documentation. This is based on current driver and user space code, as well as the discussions about the interface: RFC v1 (24 Oct 2016): https://lore.kernel.org/netdev/1477363849-36517-1-git-send-email-vidya@cumulusnetworks.com/ - this version has the autoneg field - no active_fec field - none vs off confusion is already present RFC v2 (10 Feb 2017): https://lore.kernel.org/netdev/1486727004-11316-1-git-send-email-vidya@cumulusnetworks.com/ - autoneg removed - active_fec added v1 (10 Feb 2017): https://lore.kernel.org/netdev/1486751311-42019-1-git-send-email-vidya@cumulusnetworks.com/ - no changes in the code v1 (24 Jun 2017): https://lore.kernel.org/netdev/1498331985-8525-1-git-send-email-roopa@cumulusnetworks.com/ - include in tree user v2 (27 Jul 2017): https://lore.kernel.org/netdev/1501199248-24695-1-git-send-email-roopa@cumulusnetworks.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | ethtool: fec: sanitize ethtool_fecparam->active_fec | Jakub Kicinski | 1 | -1/+1
struct ethtool_fecparam::active_fec is a GET-only field, all in-tree drivers correctly ignore it on SET. Clear the field on SET to avoid any confusion. Again, we can't reject non-zero now since ethtool user space does not zero-init the param correctly. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | ethtool: fec: sanitize ethtool_fecparam->reserved | Jakub Kicinski | 1 | -1/+1
struct ethtool_fecparam::reserved is never looked at by the core. Make sure it's actually 0. Unfortunately we can't return an error because old ethtool doesn't zero-initialize the structure for SET. On GET we can be more verbose, since there are no in-tree (ab)users. Fix up the kdoc on the structure. Remove the mention of FEC bypass; it seems like a niche thing to configure in the first place. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
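Taken together, the two sanitization patches above boil down to clearing the GET-only and reserved fields before handing the request to the driver. A rough, hypothetical sketch of that SET path (the real code lives in net/ethtool/ and differs in detail):

    #include <linux/errno.h>
    #include <linux/ethtool.h>
    #include <linux/netdevice.h>

    static int set_fecparam_sanitized(struct net_device *dev,
                                      struct ethtool_fecparam *fecparam)
    {
            if (!dev->ethtool_ops->set_fecparam)
                    return -EOPNOTSUPP;

            /* active_fec is GET-only and reserved is unused; old user space
             * does not zero the struct, so clear instead of rejecting.
             */
            fecparam->active_fec = 0;
            fecparam->reserved = 0;

            return dev->ethtool_ops->set_fecparam(dev, fecparam);
    }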
2021-03-26 | ethtool: fec: remove long structure description | Jakub Kicinski | 1 | -4/+0
Digging through the mailing list archive, @autoneg was part of the first version of the RFC; this left-over comment was pointed out twice in review but wasn't removed. The sentence is an exact copy-paste from pauseparam. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | ethtool: fec: fix typo in kdoc | Jakub Kicinski | 1 | -1/+1
s/porte/the port/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next | David S. Miller | 1 | -0/+1
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-03-24 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 15 day(s) which contain a total of 65 files changed, 3200 insertions(+), 738 deletions(-). The main changes are: 1) Static linking of multiple BPF ELF files, from Andrii. 2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo. 3) Spelling fixes from various folks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-26 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 56 | -115/+283
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 4 | -12/+35
Merge misc fixes from Andrew Morton: "14 patches. Subsystems affected by this patch series: mm (hugetlb, kasan, gup, selftests, z3fold, kfence, memblock, and highmem), squashfs, ia64, gcov, and mailmap" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mailmap: update Andrey Konovalov's email address mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP mm: memblock: fix section mismatch warning again kfence: make compatible with kmemleak gcov: fix clang-11+ support ia64: fix format strings for err_inject ia64: mca: allocate early mca with GFP_ATOMIC squashfs: fix xattr id and id lookup sanity checks squashfs: fix inode lookup sanity checks z3fold: prevent reclaim/free race for headless pages selftests/vm: fix out-of-tree build mm/mmu_notifiers: ensure range_end() is paired with range_start() kasan: fix per-page tags for non-page_alloc pages hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
2021-03-25 | mm: memblock: fix section mismatch warning again | Mike Rapoport | 1 | -2/+2
Commit 34dc2efb39a2 ("memblock: fix section mismatch warning") marked memblock_bottom_up() and memblock_set_bottom_up() as __init, but they could be referenced from non-init functions like memblock_find_in_range_node() on architectures that enable CONFIG_ARCH_KEEP_MEMBLOCK. For such builds kernel test robot reports: WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up() The function memblock_find_in_range_node() references the function __init memblock_bottom_up(). This is often because memblock_find_in_range_node lacks a __init annotation or the annotation of memblock_bottom_up is wrong. Replace __init annotations with __init_memblock annotations so that the appropriate section will be selected depending on CONFIG_ARCH_KEEP_MEMBLOCK. Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org Fixes: 34dc2efb39a2 ("memblock: fix section mismatch warning") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
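For context, a paraphrased sketch of the annotation scheme (the exact definitions live in mm/memblock.c and may differ slightly from what is shown here):

    #include <linux/init.h>
    #include <linux/types.h>

    #ifdef CONFIG_ARCH_KEEP_MEMBLOCK
    #define __init_memblock                       /* kept: runtime callers remain */
    #define __initdata_memblock
    #else
    #define __init_memblock     __meminit         /* may be discarded after boot */
    #define __initdata_memblock __meminitdata
    #endif

    /* The fix replaces the plain __init annotation with __init_memblock: */
    void __init_memblock memblock_set_bottom_up(bool enable);
    bool __init_memblock memblock_bottom_up(void);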
2021-03-25 | mm/mmu_notifiers: ensure range_end() is paired with range_start() | Sean Christopherson | 1 | -5/+5
If one or more notifiers fails .invalidate_range_start(), invoke .invalidate_range_end() for "all" notifiers. If there are multiple notifiers, those that did not fail are expecting _start() and _end() to be paired, e.g. KVM's mmu_notifier_count would become imbalanced. Disallow notifiers that can fail _start() from implementing _end() so that it's unnecessary to either track which notifiers rejected _start(), or which had already succeeded prior to a failed _start(). Note, the existing behavior of calling _start() on all notifiers even after a previous notifier failed _start() was an unintended "feature". Make it canon now that the behavior is depended on for correctness. As of today, the bug is likely benign: 1. The only caller of the non-blocking notifier is OOM kill. 2. The only notifiers that can fail _start() are the i915 and Nouveau drivers. 3. The only notifiers that utilize _end() are the SGI UV GRU driver and KVM. 4. The GRU driver will never coincide with the i915/Nouveau drivers. 5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the _guest_, and the guest is already doomed due to being an OOM victim. Fix the bug now to play nice with future usage, e.g. KVM has a potential use case for blocking memslot updates in KVM while an invalidation is in-progress, and failure to unblock would result in said updates being blocked indefinitely and hanging. Found by inspection. Verified by adding a second notifier in KVM that periodically returns -EAGAIN on non-blockable ranges, triggering OOM, and observing that KVM exits with an elevated notifier count. Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") Signed-off-by: Sean Christopherson <seanjc@google.com> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: David Rientjes <rientjes@google.com> Cc: Ben Gardon <bgardon@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
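The pairing rule is easier to see in a toy model. This is not the mm/mmu_notifier.c code; the structures and the WARN_ON placement are illustrative only:

    #include <linux/bug.h>
    #include <linux/types.h>

    struct range_notifier_ops {
            int  (*range_start)(void *priv, bool blockable);
            void (*range_end)(void *priv);
    };

    struct range_notifier {
            const struct range_notifier_ops *ops;
            void *priv;
    };

    /* Call _start() on every notifier; if any of them fails, still run _end()
     * for every notifier that implements it, so nobody is left unbalanced.
     */
    static int range_start_all(struct range_notifier *n, size_t count,
                               bool blockable)
    {
            int ret = 0;
            size_t i;

            for (i = 0; i < count; i++) {
                    int r = n[i].ops->range_start(n[i].priv, blockable);

                    if (r) {
                            /* Notifiers that can fail must not implement _end(). */
                            WARN_ON(n[i].ops->range_end);
                            ret = r;
                    }
            }

            if (ret)
                    for (i = 0; i < count; i++)
                            if (n[i].ops->range_end)
                                    n[i].ops->range_end(n[i].priv);

            return ret;
    }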
2021-03-25 | kasan: fix per-page tags for non-page_alloc pages | Andrey Konovalov | 1 | -3/+15
To allow performing tag checks on page_alloc addresses obtained via page_address(), tag-based KASAN modes store tags for page_alloc allocations in page->flags. Currently, the default tag value stored in page->flags is 0x00. Therefore, page_address() returns a 0x00ffff... address for pages that were not allocated via page_alloc. This might cause problems. A particular case we encountered is a conflict with KFENCE. If a KFENCE-allocated slab object is being freed via kfree(page_address(page) + offset), the address passed to kfree() will get tagged with 0x00 (as slab pages keep the default per-page tags). This leads to the is_kfence_address() check failing, and a KFENCE object ending up in the normal slab freelist, which causes memory corruption. This patch changes the way KASAN stores tags in page flags: they are now stored xor'ed with 0xff. This way, KASAN doesn't need to initialize per-page flags for every created page, which might be slow. With this change, page_address() returns natively-tagged (with 0xff) pointers for pages that didn't have tags set explicitly. This patch fixes the encountered conflict with KFENCE and prevents more similar issues that can occur in the future. Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
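A simplified sketch of the xor-with-0xff encoding (the real helpers are the page_kasan_tag() accessors in include/linux/mm.h; the names and bit positions here are illustrative only):

    #include <linux/types.h>

    /* A page whose flags were never written stores 0x00 in the tag bits; with
     * the xor applied on both store and load, that decodes to the native 0xff
     * tag instead of producing a 0x00-tagged pointer.
     */
    static inline u8 page_tag_decode(unsigned long flags, unsigned int shift)
    {
            return ((flags >> shift) & 0xff) ^ 0xff;
    }

    static inline void page_tag_encode(unsigned long *flags, unsigned int shift,
                                       u8 tag)
    {
            *flags &= ~(0xffUL << shift);
            *flags |= (unsigned long)(tag ^ 0xff) << shift;
    }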
2021-03-25 | hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings | Miaohe Lin | 1 | -2/+13
The current implementation of hugetlb_cgroup for shared mappings could have different behavior. Consider the following two scenarios: 1. Assume the initial css reference count of hugetlb_cgroup is 1: 1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So the css reference count is 2, associated with 1 file_region. 1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So the css reference count is 3, associated with 2 file_regions. 1.3 coalesce_file_region will coalesce these two file_regions into one. So the css reference count is 3, associated with 1 file_region now. 2. Assume the initial css reference count of hugetlb_cgroup is 1 again: 2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So the css reference count is 2, associated with 1 file_region. Therefore, we might have one file_region while holding one or more css reference counts. This inconsistency could lead to an imbalanced css_get() and css_put() pair. If we do css_put one by one (e.g. the hole punch case), scenario 2 would put one more css reference. If we do css_put all together (e.g. the truncate case), scenario 1 will leak one css reference. The imbalanced css_get() and css_put() pair would result in a non-zero reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup directory is removed __but__ the associated resource is not freed. This might ultimately result in OOM, or in being unable to create a new hugetlb cgroup in a busy workload. In order to fix this, we have to make sure that one file_region must hold exactly one css reference. So in the coalesce_file_region case, we should release one css reference before coalescence. Also only put the css reference when the entire file_region is removed. The last thing to note is that the caller of region_add() will only hold one reference to h_cg->css for the whole contiguous reservation region. But this area might be scattered when there are already some file_regions residing in it. As a result, many file_regions may share only one h_cg->css reference. In order to ensure that one file_region must hold exactly one css reference, we should do css_get() for each file_region and release the reference held by the caller when they are done. [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings] Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings") Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR) Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-25 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 16 | -27/+104
Pull networking fixes from David Miller: "Various fixes, all over: 1) Fix overflow in ptp_qoriq_adjfine(), from Yangbo Lu. 2) Always store the rx queue mapping in veth, from Maciej Fijalkowski. 3) Don't allow vmlinux btf in map_create, from Alexei Starovoitov. 4) Fix memory leak in octeontx2-af from Colin Ian King. 5) Use kvalloc in bpf x86 JIT for storing jit'd addresses, from Yonghong Song. 6) Fix tx ptp stats in mlx5, from Aya Levin. 7) Check correct ip version in tun decap, from Roi Dayan. 8) Fix rate calculation in mlx5 E-Switch code, from Parav Pandit. 9) Work item memory leak in mlx5, from Shay Drory. 10) Fix ip6ip6 tunnel crash with bpf, from Daniel Borkmann. 11) Lack of preemption awareness in macvlan, from Eric Dumazet. 12) Fix data race in pxa168_eth, from Pavel Andrianov. 13) Range validate stab in red_check_params(), from Eric Dumazet. 14) Inherit vlan filtering setting properly in b53 driver, from Florian Fainelli. 15) Fix rtnl locking in igc driver, from Sasha Neftin. 16) Pause handling fixes in igc driver, from Muhammad Husaini Zulkifli. 17) Missing rtnl locking in e1000_reset_task, from Vitaly Lifshits. 18) Use after free in qlcnic, from Lv Yunlong. 19) Fix crash in fritzpci mISDN, from Tong Zhang. 20) Premature rx buffer reuse in igb, from Li RongQing. 21) Missing termination of ipa driver message handler arrays, from Alex Elder. 22) Fix race between "x25_close" and "x25_xmit"/"x25_rx" in hdlc_x25 driver, from Xie He. 23) Use after free in c_can_pci_remove(), from Tong Zhang. 24) Uninitialized variable use in nl80211, from Jarod Wilson. 25) Off by one size calc in bpf verifier, from Piotr Krysiuk. 26) Use delayed work instead of deferrable for flowtable GC, from Yinjun Zhang. 27) Fix infinite loop in NPC unmap of octeontx2 driver, from Hariprasad Kelam. 28) Fix being unable to change MTU of dwmac-sun8i devices due to lack of fifo sizes, from Corentin Labbe. 29) DMA use after free in r8169 with WoL, from Heiner Kallweit. 30) Mismatched prototypes in isdn-capi, from Arnd Bergmann. 31) Fix psample UAPI breakage, from Ido Schimmel" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (171 commits) psample: Fix user API breakage math: Export mul_u64_u64_div_u64 ch_ktls: fix enum-conversion warning octeontx2-af: Fix memory leak of object buf ptp_qoriq: fix overflow in ptp_qoriq_adjfine() u64 calcalation net: bridge: don't notify switchdev for local FDB addresses net/sched: act_ct: clear post_ct if doing ct_clear net: dsa: don't assign an error value to tag_ops isdn: capi: fix mismatched prototypes net/mlx5: SF, do not use ecpu bit for vhca state processing net/mlx5e: Fix division by 0 in mlx5e_select_queue net/mlx5e: Fix error path for ethtool set-priv-flag net/mlx5e: Offload tuple rewrite for non-CT flows net/mlx5e: Allow to match on MPLS parameters only for MPLS over UDP net/mlx5: Add back multicast stats for uplink representor net: ipconfig: ic_dev can be NULL in ic_close_devs MAINTAINERS: Combine "QLOGIC QLGE 10Gb ETHERNET DRIVER" sections into one docs: networking: Fix a typo r8169: fix DMA being used after buffer free if WoL is enabled net: ipa: fix init header command validation ...
2021-03-25 | net: stmmac: support FPE link partner hand-shaking procedure | Ong Boon Leong | 1 | -0/+27
In order to discover whether the remote station supports frame preemption, the local station sends a verify mPacket and expects a response mPacket in return from the remote station. So, we add functions to send mPackets and to handle the events generated when verify and response mPackets are exchanged between the networked stations. The mechanism for handling the different FPE states of the local and remote station (link partner) is implemented using a workqueue, which starts a task each time the FPE IRQ event shows some sign of a verify and response mPacket exchange. The task retries a couple of times to try to spot the state in which both stations are ready to enter FPE ON. This allows different endpoints to enable FPE at different times, and the verify-response mPacket exchange can happen asynchronously. Ultimately, the task will only turn FPE ON when the local station has seen the response exchange completed in both directions. Thanks to Voon Weifeng for implementing the core functions for detecting FPE events, sending mPackets, and the phylink-related changes. Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Co-developed-by: Voon Weifeng <weifeng.voon@intel.com> Signed-off-by: Voon Weifeng <weifeng.voon@intel.com> Co-developed-by: Tan Tee Min <tee.min.tan@intel.com> Signed-off-by: Tan Tee Min <tee.min.tan@intel.com> Co-developed-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com> Signed-off-by: Mohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | tcp_metrics: tcpm_hash_bucket is strictly local | Eric Dumazet | 1 | -1/+0
After commit 098a697b497e ("tcp_metrics: Use a single hash table for all network namespaces."), tcpm_hash_bucket is local to net/ipv4/tcp_metrics.c Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 | psample: Fix user API breakage | Ido Schimmel | 1 | -4/+1
The cited commit added a new attribute before the existing group reference count attribute, thereby changing its value and breaking existing applications on new kernels. Before: # psample -l libpsample ERROR psample_group_foreach: failed to recv message: Operation not supported After: # psample -l Group Num Refcount Group Seq 1 1 0 Fix by restoring the value of the old attribute and removing the misleading comments from the enumerator to avoid future bugs. Cc: stable@vger.kernel.org Fixes: d8bed686ab96 ("net: psample: Add tunnel support") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reported-by: Adiel Bidani <adielb@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
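The underlying UAPI rule is worth spelling out: netlink attribute enums assign values implicitly, so new attributes may only be appended. A hypothetical enum (not the real psample UAPI) illustrating the rule:

    enum {
            EXAMPLE_ATTR_UNSPEC,
            EXAMPLE_ATTR_IFINDEX,           /* 1 */
            EXAMPLE_ATTR_GROUP_REFCOUNT,    /* 2: this value is ABI, keep it forever */
            /* new attributes are appended here, never inserted above */
            EXAMPLE_ATTR_TUNNEL,            /* 3 */
            __EXAMPLE_ATTR_MAX,
    };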
2021-03-25 | Add Open Routing Protocol ID to `rtnetlink.h` | Cooper Lees | 1 | -0/+1
- The Open Routing (Open/R) network protocol netlink handler uses ID 99 - Will also add to `/etc/iproute2/rt_protos` once this is accepted - For more information: https://github.com/facebook/openr Signed-off-by: Cooper Lees <me@cooperlees.com> Signed-off-by: David S. Miller <davem@davemloft.net>
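The commit only states the value 99; assuming the define lands in include/uapi/linux/rtnetlink.h under the name used upstream (the macro name below is an assumption based on this summary), it would look like:

    /* include/uapi/linux/rtnetlink.h (assumed macro name) */
    #define RTPROT_OPENR    99      /* Open Routing (Open/R) routes */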
2021-03-25 | net: phy: add genphy_c45_loopback | Wong Vee Khee | 1 | -0/+1
Add generic code to enable C45 PHY loopback in the common phy-c45.c file. This allows C45 PHY drivers to use it by setting .set_loopback. Suggested-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
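A hypothetical C45 PHY driver would then only need to point .set_loopback at the new helper; the driver name and PHY ID below are made up, and this is a sketch of the wiring rather than a real driver:

    #include <linux/module.h>
    #include <linux/phy.h>

    static struct phy_driver example_c45_driver[] = {
            {
                    PHY_ID_MATCH_MODEL(0x00112233),
                    .name         = "Example C45 PHY",
                    .set_loopback = genphy_c45_loopback,
            },
    };
    module_phy_driver(example_c45_driver);

    MODULE_LICENSE("GPL");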
2021-03-25 | net: stmmac: Add hardware supported cross-timestamp | Tan Tee Min | 1 | -0/+4
Cross timestamping is supported on the Integrated Ethernet Controller in Intel SoCs such as EHL and TGL with an Always Running Timer. The hardware cross-timestamp result is made available to applications through the PTP_SYS_OFFSET_PRECISE ioctl, which calls stmmac_getcrosststamp(). Device time is stored in the MAC Auxiliary register. The 64-bit System time (ART timestamp) is stored in registers that are only addressable by using MDIO space. Signed-off-by: Tan Tee Min <tee.min.tan@intel.com> Co-developed-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | net: flow_offload: add FLOW_ACTION_PPPOE_PUSH | Pablo Neira Ayuso | 1 | -0/+4
Add an action to represent the PPPoE hardware offload support that includes the session ID. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | netfilter: flowtable: bridge vlan hardware offload and switchdev | Felix Fietkau | 2 | -3/+6
The switch might have already added the VLAN tag through PVID hardware offload. Keep this extra VLAN in the flowtable but skip it on egress. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | netfilter: nft_flow_offload: use direct xmit if hardware offload is enabled | Pablo Neira Ayuso | 1 | -0/+2
If there is a forward path to reach an ethernet device and hardware offload is enabled, then use the direct xmit path. Moreover, store the real device in the direct xmit path info since software datapath uses dev_hard_header() to push the layer encapsulation headers while hardware offload refers to the real device. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | netfilter: flowtable: add vlan support | Pablo Neira Ayuso | 1 | -3/+14
Add the vlan id and protocol to the flow tuple to uniquely identify flows from the receive path. For the transmit path, dev_hard_header() on the vlan device pushes the headers. This patch includes support for two vlan headers (QinQ) from the ingress path. Add a generic encap field to the flowtable entry which stores the protocol and the tag id. This allows these fields to be reused by the PPPoE support coming in a later patch. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | netfilter: flowtable: use dev_fill_forward_path() to obtain egress device | Pablo Neira Ayuso | 1 | -2/+14
The egress device in the tuple is obtained from the route. Use dev_fill_forward_path() instead to provide the real egress device for this flow whenever this is available. The new FLOW_OFFLOAD_XMIT_DIRECT type uses dev_queue_xmit() to transmit ethernet frames. Cache the source and destination hardware addresses to use dev_queue_xmit() to transfer packets. FLOW_OFFLOAD_XMIT_DIRECT replaces FLOW_OFFLOAD_XMIT_NEIGH if dev_fill_forward_path() finds a direct transmit path. In case of topology updates, if the peer is moved to a different bridge port, the connection will time out; reconnecting will result in a new entry with the correct path. Snooping fdb updates would allow for cleaning up stale flowtable entries. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | netfilter: flowtable: use dev_fill_forward_path() to obtain ingress device | Pablo Neira Ayuso | 1 | -0/+3
Obtain the ingress device in the tuple from the route in the reply direction. Use dev_fill_forward_path() instead to get the real ingress device for this flow. Fall back to use the ingress device that the IP forwarding route provides if: - dev_fill_forward_path() finds no real ingress device. - the ingress device that is obtained is not part of the flowtable devices. - this route has a xfrm policy. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | netfilter: flowtable: add xmit path types | Pablo Neira Ayuso | 1 | -2/+9
Add the xmit_type field that defines the two supported xmit paths in the flowtable data plane, which are the neighbour and the xfrm xmit paths. This patch prepares for new flowtable xmit path types to come. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
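The selector sketched below follows the names used in this series (FLOW_OFFLOAD_XMIT_DIRECT arrives with the later direct-xmit patch); the exact definition in include/net/netfilter/nf_flow_table.h may differ:

    enum flow_offload_xmit_type {
            FLOW_OFFLOAD_XMIT_NEIGH,        /* classic neighbour output path */
            FLOW_OFFLOAD_XMIT_XFRM,         /* xfrm/IPsec transformation path */
            FLOW_OFFLOAD_XMIT_DIRECT,       /* dev_queue_xmit() to the cached real device */
    };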
2021-03-24 | net: dsa: resolve forwarding path for dsa slave ports | Felix Fietkau | 1 | -0/+5
Add .ndo_fill_forward_path for dsa slave port devices Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | net: ppp: resolve forwarding path for bridge pppoe devices | Felix Fietkau | 2 | -0/+5
Pass on the PPPoE session ID, destination hardware address and the real device. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | net: bridge: resolve forwarding path for VLAN tag actions in bridge devices | Felix Fietkau | 1 | -0/+16
Depending on the VLAN settings of the bridge and the port, the bridge can either add or remove a tag. When vlan filtering is enabled, the fdb lookup also needs to know the VLAN tag/proto for the destination address. To provide this, keep track of the stack of VLAN tags for the path in the lookup context. Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | net: bridge: resolve forwarding path for bridge devices | Pablo Neira Ayuso | 1 | -0/+1
Add .ndo_fill_forward_path for bridge devices. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | net: 8021q: resolve forwarding path for vlan devices | Pablo Neira Ayuso | 1 | -0/+7
Add .ndo_fill_forward_path for vlan devices. For instance, assuming the following topology:

                       IP forwarding
                      /             \
                 eth0.100            eth0
                     |
                    eth0
                     .
                     .
                     .
                    ethX
             ab:cd:ef:ab:cd:ef

For packets going through IP forwarding to eth0.100 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path:

        eth0.100 -> eth0

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 | net: resolve forwarding path from virtual netdevice and HW destination address | Pablo Neira Ayuso | 1 | -0/+27
This patch adds dev_fill_forward_path() which resolves the path to reach the real netdevice from the IP forwarding side. This function takes as input the netdevice and the destination hardware address and it walks down the devices calling .ndo_fill_forward_path() for each device until the real device is found. For instance, assuming the following topology:

                       IP forwarding
                      /             \
                   br0              eth0
                   / \
               eth1  eth2
                .
                .
                .
               ethX
        ab:cd:ef:ab:cd:ef

where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity. ethX is the interface in another box which is connected to the eth1 bridge port. For packets going through IP forwarding to br0 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path:

        br0 -> eth1

.ndo_fill_forward_path for br0 looks up the FDB using the destination MAC address to get the bridge port eth1. This information allows creating a fast path that bypasses the classic bridge and IP forwarding paths, so packets go directly from the bridge port eth1 to eth0 (wan interface) and vice versa.

                 fast path
        .------------------------.
       /                          \
      |         IP forwarding      |
      |        /             \     \/
      |      br0             eth0
      .      / \
       -> eth1  eth2
           .
           .
           .
          ethX
   ab:cd:ef:ab:cd:ef

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
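A heavily simplified toy model of that walk, using made-up types (the real API is built around struct net_device_path_ctx and struct net_device_path_stack in net/core/dev.c and the .ndo_fill_forward_path() callback):

    #include <stddef.h>

    struct toy_path;

    struct toy_dev {
            const char *name;
            /* Fills *path for this hop and returns the lower device, or NULL
             * when this is already the real (physical) device.
             */
            const struct toy_dev *(*fill_forward_path)(const struct toy_dev *dev,
                                                       const unsigned char *daddr,
                                                       struct toy_path *path);
    };

    struct toy_path {
            const struct toy_dev *dev;      /* device at this hop */
            int vlan_id;                    /* example per-hop metadata */
    };

    /* Walk from the virtual device down to the real device, recording hops. */
    static const struct toy_dev *toy_fill_forward_path(const struct toy_dev *dev,
                                                       const unsigned char *daddr,
                                                       struct toy_path *stack,
                                                       size_t max, size_t *num)
    {
            *num = 0;
            while (dev && dev->fill_forward_path && *num < max) {
                    const struct toy_dev *lower;

                    lower = dev->fill_forward_path(dev, daddr, &stack[*num]);
                    if (!lower)
                            break;
                    (*num)++;
                    dev = lower;    /* e.g. br0 -> eth1, or eth0.100 -> eth0 */
            }
            return dev;     /* the real egress device */
    }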
2021-03-24 | net: make unregister netdev warning timeout configurable | Dmitry Vyukov | 1 | -0/+1
netdev_wait_allrefs() issues a warning if the refcount does not drop to 0 after 10 seconds. While a 10-second wait generally should not happen under normal workloads in a normal environment, it seems to fire falsely very often during fuzzing and/or in qemu emulation (~10x slower). At the very least, it's not possible to tell whether it's really a false positive or not. Automated testing generally bumps all timeouts to very high values to avoid flaky failures. Add the net.core.netdev_unregister_timeout_secs sysctl to make the timeout configurable for automated testing systems. Lowering the timeout may also be useful for e.g. manual bisection. The default value matches the current behavior. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877 Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>