summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2021-06-25net: mdiobus: withdraw fwnode_mdbiobus_registerMarcin Wojtas1-12/+0
The newly implemented fwnode_mdbiobus_register turned out to be problematic - in case the fwnode_/of_/acpi_mdio are built as modules, a dependency cycle can be observed during the depmod phase of modules_install, eg.: depmod: ERROR: Cycle detected: fwnode_mdio -> of_mdio -> fwnode_mdio depmod: ERROR: Found 2 modules in dependency cycles! OR: depmod: ERROR: Cycle detected: acpi_mdio -> fwnode_mdio -> acpi_mdio depmod: ERROR: Found 2 modules in dependency cycles! A possible solution could be to rework fwnode_mdiobus_register, so that to merge the contents of acpi_mdiobus_register and of_mdiobus_register. However feasible, such change would be very intrusive and affect huge amount of the of_mdiobus_register users. Since there are currently 2 users of ACPI and MDIO (xgmac_mdio and mvmdio), withdraw the fwnode_mdbiobus_register and roll back to a simple 'if' condition in affected drivers. Fixes: 62a6ef6a996f ("net: mdiobus: Introduce fwnode_mdbiobus_register()") Signed-off-by: Marcin Wojtas <mw@semihalf.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24net: mdiobus: fix fwnode_mdbiobus_register() fallback caseMarcin Wojtas1-1/+1
The fallback case of fwnode_mdbiobus_register() (relevant for !CONFIG_FWNODE_MDIO) was defined with wrong argument name, causing a compilation error. Fix that. Signed-off-by: Marcin Wojtas <mw@semihalf.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22sctp: add pad chunk and its make function and event tableXin Long1-0/+7
This chunk is defined in rfc4820#section-3, and used to pad an SCTP packet. The receiver must discard this chunk and continue processing the rest of the chunks in the packet. Add it now, as it will be bundled with a heartbeat chunk to probe pmtu in the following patches. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22ethtool: Use kernel data types for internal EEPROM structIdo Schimmel1-6/+6
The struct is not visible to user space and therefore should not use the user visible data types. Instead, use internal data types like other structures in the file. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22wwan: core: add WWAN common private data for netdevSergey Ryazanov1-0/+18
The WWAN core not only multiplex the netdev configuration data, but process it too, and needs some space to store its private data associated with the netdev. Add a structure to keep common WWAN core data. The structure will be stored inside the netdev private data before WWAN driver private data and have a field to make it easier to access the driver data. Also add a helper function that simplifies drivers access to their data. At the moment we use the common WWAN private data to store the WWAN data link (channel) id at the time the link is created, and report it back to user using the .fill_info() RTNL callback. This should help the user to be aware which network interface is bound to which WWAN device data channel. Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> CC: M Chetan Kumar <m.chetan.kumar@intel.com> CC: Intel Corporation <linuxwwan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22wwan: core: support default netdev creationSergey Ryazanov1-1/+7
Most, if not each WWAN device driver will create a netdev for the default data channel. Therefore, add an option for the WWAN netdev ops registration function to create a default netdev for the WWAN device. A WWAN device driver should pass a default data channel link id to the ops registering function to request the creation of a default netdev, or a special value WWAN_NO_DEFAULT_LINK to inform the WWAN core that the default netdev should not be created. For now, only wwan_hwsim utilize the default link creation option. Other drivers will be reworked next. Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> CC: M Chetan Kumar <m.chetan.kumar@intel.com> CC: Intel Corporation <linuxwwan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22wwan: core: no more hold netdev ops owning moduleSergey Ryazanov1-2/+0
The WWAN netdev ops owner holding was used to protect from the unexpected memory disappear. This approach causes a dependency cycle (driver -> core -> driver) and effectively prevents a WWAN driver unloading. E.g. WWAN hwsim could not be unloaded until all simulated devices are removed: ~# modprobe wwan_hwsim devices=2 ~# lsmod | grep wwan wwan_hwsim 16384 2 wwan 20480 1 wwan_hwsim ~# rmmod wwan_hwsim rmmod: ERROR: Module wwan_hwsim is in use ~# echo > /sys/kernel/debug/wwan_hwsim/hwsim0/destroy ~# echo > /sys/kernel/debug/wwan_hwsim/hwsim1/destroy ~# lsmod | grep wwan wwan_hwsim 16384 0 wwan 20480 1 wwan_hwsim ~# rmmod wwan_hwsim For a real device driver this will cause an inability to unload module until a served device is physically detached. Since the last commit we are removing all child netdev(s) when a driver unregister the netdev ops. This allows us to permit the driver unloading, since any sane driver will call ops unregistering on a device deinitialization. So, remove the holding of an ops owner to make it easier to unload a driver module. The owner field has also beed removed from the ops structure as there are no more users of this field. Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22net: mdiobus: Introduce fwnode_mdbiobus_register()Marcin Wojtas1-0/+12
This patch introduces a new helper function that wraps acpi_/of_ mdiobus_register() and allows its usage via common fwnode_ interface. Fall back to raw mdiobus_register() in case CONFIG_FWNODE_MDIO is not enabled, in order to satisfy compatibility in all future user drivers. Signed-off-by: Marcin Wojtas <mw@semihalf.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22net: handle ARPHRD_IP6GRE in dev_is_mac_header_xmit()Guillaume Nault1-0/+1
Similar to commit 3b707c3008ca ("net: dev_is_mac_header_xmit() true for ARPHRD_RAWIP"), add ARPHRD_IP6GRE to dev_is_mac_header_xmit(), to make ip6gre compatible with act_mirred and __bpf_redirect(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski32-46/+150
Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-19Merge tag 'net-5.13-rc7' of ↵Linus Torvalds4-3/+6
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.13-rc7, including fixes from wireless, bpf, bluetooth, netfilter and can. Current release - regressions: - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class() to fix modifying offloaded qdiscs - lantiq: net: fix duplicated skb in rx descriptor ring - rtnetlink: fix regression in bridge VLAN configuration, empty info is not an error, bot-generated "fix" was not needed - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem creation Current release - new code bugs: - ethtool: fix NULL pointer dereference during module EEPROM dump via the new netlink API - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue should not be visible to the stack - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs - mlx5e: verify dev is present in get devlink port ndo, avoid a panic Previous releases - regressions: - neighbour: allow NUD_NOARP entries to be force GCed - further fixes for fallout from reorg of WiFi locking (staging: rtl8723bs, mac80211, cfg80211) - skbuff: fix incorrect msg_zerocopy copy notifications - mac80211: fix NULL ptr deref for injected rate info - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs Previous releases - always broken: - bpf: more speculative execution fixes - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local - udp: fix race between close() and udp_abort() resulting in a panic - fix out of bounds when parsing TCP options before packets are validated (in netfilter: synproxy, tc: sch_cake and mptcp) - mptcp: improve operation under memory pressure, add missing wake-ups - mptcp: fix double-lock/soft lookup in subflow_error_report() - bridge: fix races (null pointer deref and UAF) in vlan tunnel egress - ena: fix DMA mapping function issues in XDP - rds: fix memory leak in rds_recvmsg Misc: - vrf: allow larger MTUs - icmp: don't send out ICMP messages with a source address of 0.0.0.0 - cdc_ncm: switch to eth%d interface naming" * tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits) net: ethernet: fix potential use-after-free in ec_bhf_remove selftests/net: Add icmp.sh for testing ICMP dummy address responses icmp: don't send out ICMP messages with a source address of 0.0.0.0 net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY net: ll_temac: Fix TX BD buffer overwrite net: ll_temac: Add memory-barriers for TX BD access net: ll_temac: Make sure to free skb when it is completely used MAINTAINERS: add Guvenc as SMC maintainer bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path bnxt_en: Fix TQM fastpath ring backing store computation bnxt_en: Rediscover PHY capabilities after firmware reset cxgb4: fix wrong shift. mac80211: handle various extensible elements correctly mac80211: reset profile_periodicity/ema_ap cfg80211: avoid double free of PMSR request cfg80211: make certificate generation more robust mac80211: minstrel_ht: fix sample time check net: qed: Fix memcpy() overflow of qed_dcbx_params() net: cdc_eem: fix tx fixup skb leak net: hamradio: fix memory leak in mkiss_close ...
2021-06-18net: wwan: Allow WWAN drivers to provide blocking tx and poll functionStephan Gerhold1-2/+11
At the moment, the WWAN core provides wwan_port_txon/off() to implement blocking writes. The tx() port operation should not block, instead wwan_port_txon/off() should be called when the TX queue is full or has free space again. However, in some cases it is not straightforward to make use of that functionality. For example, the RPMSG API used by rpmsg_wwan_ctrl.c does not provide any way to be notified when the TX queue has space again. Instead, it only provides the following operations: - rpmsg_send(): blocking write (wait until there is space) - rpmsg_trysend(): non-blocking write (return error if no space) - rpmsg_poll(): set poll flags depending on TX queue state Generally that's totally sufficient for implementing a char device, but it does not fit well to the currently provided WWAN port ops. Most of the time, using the non-blocking rpmsg_trysend() in the WWAN tx() port operation works just fine. However, with high-frequent writes to the char device it is possible to trigger a situation where this causes issues. For example, consider the following (somewhat unrealistic) example: # dd if=/dev/zero bs=1000 of=/dev/wwan0qmi0 dd: error writing '/dev/wwan0qmi0': Resource temporarily unavailable 1+0 records out This fails immediately after writing the first record. It's likely only a matter of time until this triggers issues for some real application (e.g. ModemManager sending a lot of large QMI packets). The rpmsg_char device does not have this problem, because it uses rpmsg_trysend() and rpmsg_poll() to support non-blocking operations. Make it possible to use the same in the RPMSG WWAN driver by adding two new optional wwan_port_ops: - tx_blocking(): send data blocking if allowed - tx_poll(): set additional TX poll flags This integrates nicely with the RPMSG API and does not require any change in existing WWAN drivers. With these changes, the dd example above blocks instead of exiting with an error. Cc: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: Stephan Gerhold <stephan@gerhold.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18rpmsg: core: Add driver_data for rpmsg_device_idStephan Gerhold1-0/+1
Most device_id structs provide a driver_data field that can be used by drivers to associate data more easily for a particular device ID. Add the same for the rpmsg_device_id. Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Stephan Gerhold <stephan@gerhold.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18Merge tag 'pm-5.13-rc7' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Remove recently added frequency invariance support from the CPPC cpufreq driver, because it has turned out to be problematic and it cannot be fixed properly on time for 5.13 (Viresh Kumar)" * tag 'pm-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: Revert "cpufreq: CPPC: Add support for frequency invariance"
2021-06-17driver core: add a helper to setup both the of_node and fwnode of a deviceIoana Ciornei1-0/+1
There are many places where both the fwnode_handle and the of_node of a device need to be populated. Add a function which does both so that we have consistency. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-6/+42
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-17 The following pull-request contains BPF updates for your *net-next* tree. We've added 50 non-merge commits during the last 25 day(s) which contain a total of 148 files changed, 4779 insertions(+), 1248 deletions(-). The main changes are: 1) BPF infrastructure to migrate TCP child sockets from a listener to another in the same reuseport group/map, from Kuniyuki Iwashima. 2) Add a provably sound, faster and more precise algorithm for tnum_mul() as noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan. 3) Streamline error reporting changes in libbpf as planned out in the 'libbpf: the road to v1.0' effort, from Andrii Nakryiko. 4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu. 5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek. 6) Support new LLVM relocations in libbpf to make them more linker friendly, also add a doc to describe the BPF backend relocations, from Yonghong Song. 7) Silence long standing KUBSAN complaints on register-based shifts in interpreter, from Daniel Borkmann and Eric Biggers. 8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when target arch cannot be determined, from Lorenz Bauer. 9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson. 10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi. 11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF programs to bpf_helpers.h header, from Florent Revest. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17net: fix mistake path for netdev_features_stringsJian Shen1-1/+1
Th_strings arrays netdev_features_strings, tunable_strings, and phy_tunable_strings has been moved to file net/ethtool/common.c. So fixes the comment. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17net/mlx5e: Don't create devices during unload flowDmytro Linkin1-0/+4
Running devlink reload command for port in switchdev mode cause resources to corrupt: driver can't release allocated EQ and reclaim memory pages, because "rdma" auxiliary device had add CQs which blocks EQ from deletion. Erroneous sequence happens during reload-down phase, and is following: 1. detach device - suspends auxiliary devices which support it, destroys others. During this step "eth-rep" and "rdma-rep" are destroyed, "eth" - suspended. 2. disable SRIOV - moves device to legacy mode; as part of disablement - rescans drivers. This step adds "rdma" auxiliary device. 3. destroy EQ table - <failure>. Driver shouldn't create any device during unload flows. To handle that implement MLX5_PRIV_FLAGS_DETACH flag, set it on device detach and unset on device attach. If flag is set do no-op on drivers rescan. Fixes: a925b5e309c9 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()Hugh Dickins1-0/+3
There is a race between THP unmapping and truncation, when truncate sees pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it, but before its page_remove_rmap() gets to decrement compound_mapcount: generating false "BUG: Bad page cache" reports that the page is still mapped when deleted. This commit fixes that, but not in the way I hoped. The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK) instead of unmap_mapping_range() in truncate_cleanup_page(): it has often been an annoyance that we usually call unmap_mapping_range() with no pages locked, but there apply it to a single locked page. try_to_unmap() looks more suitable for a single locked page. However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page): it is used to insert THP migration entries, but not used to unmap THPs. Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB needs are different, I'm too ignorant of the DAX cases, and couldn't decide how far to go for anon+swap. Set that aside. The second attempt took a different tack: make no change in truncate.c, but modify zap_huge_pmd() to insert an invalidated huge pmd instead of clearing it initially, then pmd_clear() between page_remove_rmap() and unlocking at the end. Nice. But powerpc blows that approach out of the water, with its serialize_against_pte_lookup(), and interesting pgtable usage. It would need serious help to get working on powerpc (with a minor optimization issue on s390 too). Set that aside. Just add an "if (page_mapped(page)) synchronize_rcu();" or other such delay, after unmapping in truncate_cleanup_page()? Perhaps, but though that's likely to reduce or eliminate the number of incidents, it would give less assurance of whether we had identified the problem correctly. This successful iteration introduces "unmap_mapping_page(page)" instead of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route, with an addition to details. Then zap_pmd_range() watches for this case, and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but safe. Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to assert its interface; but currently that's only used to make sure that page->mapping is stable, and zap_pmd_range() doesn't care if the page is locked or not. Along these lines, in invalidate_inode_pages2_range() move the initial unmap_mapping_range() out from under page lock, before then calling unmap_mapping_page() under page lock if still mapped. Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com Fixes: fc127da085c2 ("truncate: handle file thp") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm/thp: try_to_unmap() use TTU_SYNC for safe splittingHugh Dickins1-0/+1
Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE (!unmap_success): with dump_page() showing mapcount:1, but then its raw struct page output showing _mapcount ffffffff i.e. mapcount 0. And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed, it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)), and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG(): all indicative of some mapcount difficulty in development here perhaps. But the !CONFIG_DEBUG_VM path handles the failures correctly and silently. I believe the problem is that once a racing unmap has cleared pte or pmd, try_to_unmap_one() may skip taking the page table lock, and emerge from try_to_unmap() before the racing task has reached decrementing mapcount. Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC to the options, and passing that from unmap_page(). When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same for both: the slight overhead added should rarely matter, except perhaps if splitting sparsely-populated multiply-mapped shmem. Once confident that bugs are fixed, TTU_SYNC here can be removed, and the race tolerated. Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm/thp: make is_huge_zero_pmd() safe and quickerHugh Dickins1-1/+7
Most callers of is_huge_zero_pmd() supply a pmd already verified present; but a few (notably zap_huge_pmd()) do not - it might be a pmd migration entry, in which the pfn is encoded differently from a present pmd: which might pass the is_huge_zero_pmd() test (though not on x86, since L1TF forced us to protect against that); or perhaps even crash in pmd_page() applied to a swap-like entry. Make it safe by adding pmd_present() check into is_huge_zero_pmd() itself; and make it quicker by saving huge_zero_pfn, so that is_huge_zero_pmd() will not need to do that pmd_page() lookup each time. __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked, but is unnecessary now that is_huge_zero_pmd() checks present. Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm/hugetlb: expand restore_reserve_on_error functionalityMike Kravetz1-0/+2
The routine restore_reserve_on_error is called to restore reservation information when an error occurs after page allocation. The routine alloc_huge_page modifies the mapping reserve map and potentially the reserve count during allocation. If code calling alloc_huge_page encounters an error after allocation and needs to free the page, the reservation information needs to be adjusted. Currently, restore_reserve_on_error only takes action on pages for which the reserve count was adjusted(HPageRestoreReserve flag). There is nothing wrong with these adjustments. However, alloc_huge_page ALWAYS modifies the reserve map during allocation even if the reserve count is not adjusted. This can cause issues as observed during development of this patch [1]. One specific series of operations causing an issue is: - Create a shared hugetlb mapping Reservations for all pages created by default - Fault in a page in the mapping Reservation exists so reservation count is decremented - Punch a hole in the file/mapping at index previously faulted Reservation and any associated pages will be removed - Allocate a page to fill the hole No reservation entry, so reserve count unmodified Reservation entry added to map by alloc_huge_page - Error after allocation and before instantiating the page Reservation entry remains in map - Allocate a page to fill the hole Reservation entry exists, so decrement reservation count This will cause a reservation count underflow as the reservation count was decremented twice for the same index. A user would observe a very large number for HugePages_Rsvd in /proc/meminfo. This would also likely cause subsequent allocations of hugetlb pages to fail as it would 'appear' that all pages are reserved. This sequence of operations is unlikely to happen, however they were easily reproduced and observed using hacked up code as described in [1]. Address the issue by having the routine restore_reserve_on_error take action on pages where HPageRestoreReserve is not set. In this case, we need to remove any reserve map entry created by alloc_huge_page. A new helper routine vma_del_reservation assists with this operation. There are three callers of alloc_huge_page which do not currently call restore_reserve_on error before freeing a page on error paths. Add those missing calls. [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/ Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when comparePeter Xu1-4/+11
I found it by pure code review, that pte_same_as_swp() of unuse_vma() didn't take uffd-wp bit into account when comparing ptes. pte_same_as_swp() returning false negative could cause failure to swapoff swap ptes that was wr-protected by userfaultfd. Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [5.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm,hwpoison: fix race with hugetlb page allocationNaoya Horiguchi1-0/+6
When hugetlb page fault (under overcommitting situation) and memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race: CPU0: CPU1: gather_surplus_pages() page = alloc_surplus_huge_page() memory_failure_hugetlb() get_hwpoison_page(page) __get_hwpoison_page(page) get_page_unless_zero(page) zero = put_page_testzero(page) VM_BUG_ON_PAGE(!zero, page) enqueue_huge_page(h, page) put_page(page) __get_hwpoison_page() only checks the page refcount before taking an additional one for memory error handling, which is not enough because there's a time window where compound pages have non-zero refcount during hugetlb page initialization. So make __get_hwpoison_page() check page status a bit more for hugetlb pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags under hugetlb_lock makes sure that the hugetlb page is not transitive. It's notable that another new function, HWPoisonHandlable(), is helpful to prevent a race against other transitive page states (like a generic compound page just before PageHuge becomes true). Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-15ptp: improve max_adj check against unreasonable valuesJakub Kicinski1-1/+1
Scaled PPM conversion to PPB may (on 64bit systems) result in a value larger than s32 can hold (freq/scaled_ppm is a long). This means the kernel will not correctly reject unreasonably high ->freq values (e.g. > 4294967295ppb, 281474976645 scaled PPM). The conversion is equivalent to a division by ~66 (65.536), so the value of ppb is always smaller than ppm, but not small enough to assume narrowing the type from long -> s32 is okay. Note that reasonable user space (e.g. ptp4l) will not use such high values, anyway, 4289046510ppb ~= 4.3x, so the fix is somewhat pedantic. Fixes: d39a743511cd ("ptp: validate the requested frequency adjustment.") Fixes: d94ba80ebbea ("ptp: Added a brand new class driver for ptp clocks.") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15bpf: Support socket migration by eBPF.Kuniyuki Iwashima2-0/+3
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, we run it for socket migration if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or net.ipv4.tcp_migrate_req is enabled. Currently, the expected_attach_type is not enforced for the BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type(). Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to select a new listener based on the child socket. migrating_sk varies depending on if it is migrating a request in the accept queue or during 3WHS. - accept_queue : sock (ESTABLISHED/SYN_RECV) - 3WHS : request_sock (NEW_SYN_RECV) In the eBPF program, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fallbacks to the random selection - SK_DROP, cancel the migration. There is a noteworthy point. We select a listening socket in three places, but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. On the other hand, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
2021-06-15net/mlx5: Enlarge interrupt field in CREATE_EQShay Drory1-2/+2
FW is now supporting more than 256 MSI-X per PF (up to 2K). Hence, enlarge interrupt field in CREATE_EQ to make use of the new MSI-X's. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-15net/mlx5: Provide cpumask at EQ creation phaseLeon Romanovsky1-0/+1
The users of EQ are running their code on different CPUs and with various affinity patterns. Move the cpumask setting close to their actual usage. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-14net: phy: micrel: ksz886x/ksz8081: add cabletest supportOleksij Rempel1-0/+1
This patch support for cable test for the ksz886x switches and the ksz8081 PHY. The patch was tested on a KSZ8873RLL switch with following results: - port 1: - provides invalid values, thus return -ENOTSUPP (Errata: DS80000830A: "LinkMD does not work on Port 1", http://ww1.microchip.com/downloads/en/DeviceDoc/KSZ8873-Errata-DS80000830A.pdf) - port 2: - can detect distance - can detect open on each wire of pair A (wire 1 and 2) - can detect open only on one wire of pair B (only wire 3) - can detect short between wires of a pair (wires 1 + 2 or 3 + 6) - short between pairs is detected as open. For example short between wires 2 + 3 is detected as open. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14net: phy/dsa micrel/ksz886x add MDI-X supportOleksij Rempel1-0/+2
Add support for MDI-X status and configuration Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14net: phy: micrel: move phy reg offsets to common headerMichael Grzeschik1-0/+13
Some micrel devices share the same PHY register defines. This patch moves them to one common header so other drivers can reuse them. And reuse generic MII_* defines where possible. Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14Revert "cpufreq: CPPC: Add support for frequency invariance"Viresh Kumar1-1/+0
This reverts commit 4c38f2df71c8e33c0b64865992d693f5022eeaad. There are few races in the frequency invariance support for CPPC driver, namely the driver doesn't stop the kthread_work and irq_work on policy exit during suspend/resume or CPU hotplug. A proper fix won't be possible for the 5.13-rc, as it requires a lot of changes. Lets revert the patch instead for now. Fixes: 4c38f2df71c8 ("cpufreq: CPPC: Add support for frequency invariance") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-12mm: relocate 'write_protect_seq' in struct mm_structFeng Tang1-7/+20
0day robot reported a 9.2% regression for will-it-scale mmap1 test case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast from racing with COW during fork"). Further debug shows the regression is due to that commit changes the offset of hot fields 'mmap_lock' inside structure 'mm_struct', thus some cache alignment changes. From the perf data, the contention for 'mmap_lock' is very severe and takes around 95% cpu cycles, and it is a rw_semaphore struct rw_semaphore { atomic_long_t count; /* 8 bytes */ atomic_long_t owner; /* 8 bytes */ struct optimistic_spin_queue osq; /* spinner MCS lock */ ... Before commit 57efa1fe5957 adds the 'write_protect_seq', it happens to have a very optimal cache alignment layout, as Linus explained: "and before the addition of the 'write_protect_seq' field, the mmap_sem was at offset 120 in 'struct mm_struct'. Which meant that count and owner were in two different cachelines, and then when you have contention and spend time in rwsem_down_write_slowpath(), this is probably *exactly* the kind of layout you want. Because first the rwsem_write_trylock() will do a cmpxchg on the first cacheline (for the optimistic fast-path), and then in the case of contention, rwsem_down_write_slowpath() will just access the second cacheline. Which is probably just optimal for a load that spends a lot of time contended - new waiters touch that first cacheline, and then they queue themselves up on the second cacheline." After the commit, the rw_semaphore is at offset 128, which means the 'count' and 'owner' fields are now in the same cacheline, and causes more cache bouncing. Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will affect its offset: CONFIG_MMU CONFIG_MEMBARRIER CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES The layout above is on 64 bits system with 0day's default kernel config (similar to RHEL-8.3's config), in which all these 3 options are 'y'. And the layout can vary with different kernel configs. Relayouting a structure is usually a double-edged sword, as sometimes it can helps one case, but hurt other cases. For this case, one solution is, as the newly added 'write_protect_seq' is a 4 bytes long seqcount_t (when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4 bytes hole in 'mm_struct' will not change other fields' alignment, while restoring the regression. Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1] Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-12net: qualcomm: rmnet: trailer value is a checksumAlex Elder1-1/+1
The csum_value field in the rmnet_map_dl_csum_trailer structure is a "real" Internet checksum. It is a 16 bit value, in big endian format, which represents an inverted ones' complement sum over pairs of bytes. Make that clear by changing its type to __sum16. This makes a typecast in rmnet_map_ipv4_dl_csum_trailer() and another in rmnet_map_ipv6_dl_csum_trailer() unnecessary. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12wwan: add interface creation supportJohannes Berg1-0/+24
Add support to create (and destroy) interfaces via a new rtnetlink kind "wwan". The responsible driver has to use the new wwan_register_ops() to make this possible. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12net: make get_net_ns return error if NET_NS is disabledChangbin Du1-2/+0
There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled. The reason is that nsfs tries to access ns->ops but the proc_ns_operations is not implemented in this case. [7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010 [7.670268] pgd = 32b54000 [7.670544] [00000010] *pgd=00000000 [7.671861] Internal error: Oops: 5 [#1] SMP ARM [7.672315] Modules linked in: [7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16 [7.673309] Hardware name: Generic DT based system [7.673642] PC is at nsfs_evict+0x24/0x30 [7.674486] LR is at clear_inode+0x20/0x9c The same to tun SIOCGSKNS command. To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is disabled. Meanwhile move it to right place net/core/net_namespace.c. Signed-off-by: Changbin Du <changbin.du@gmail.com> Fixes: c62cce2caee5 ("net: add an ioctl to get a socket network namespace") Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12net: phy: Add 25G BASE-R interface modeSteen Hegelund1-0/+4
Add 25gbase-r phy interface mode Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12Merge tag 'usb-5.13-rc6' of ↵Linus Torvalds2-5/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are a number of tiny USB fixes for 5.13-rc6. There are more than I would normally like, but there's been a bunch of people banging on the gadget and dwc3 and typec code recently for I think an Android release, which has resulted in a number of small fixes. It's nice to see companies send fixes upstream for this type of work, a notable change from years ago. Anyway, fixes in here are: - usb-serial device id updates - usb-serial cp210x driver fixes for broken firmware versions - typec fixes for crazy charging devices and other reported problems - dwc3 fixes for reported problems found - gadget fixes for reported problems - tiny xhci fixes - other small fixes for reported issues. - revert of a problem fix found by linux-next testing All of these have passed 0-day and linux-next testing with no reported problems (the revert for the found linux-next build problem included)" * tag 'usb-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (44 commits) Revert "usb: gadget: fsl: Re-enable driver for ARM SoCs" usb: typec: mux: Fix copy-paste mistake in typec_mux_match usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path usb: gadget: fsl: Re-enable driver for ARM SoCs usb: typec: wcove: Use LE to CPU conversion when accessing msg->header USB: serial: cp210x: fix CP2102N-A01 modem control USB: serial: cp210x: fix alternate function for CP2102N QFN20 usb: misc: brcmstb-usb-pinmap: check return value after calling platform_get_resource() usb: dwc3: ep0: fix NULL pointer exception usb: gadget: eem: fix wrong eem header operation usb: typec: intel_pmc_mux: Put ACPI device using acpi_dev_put() usb: typec: intel_pmc_mux: Add missed error check for devm_ioremap_resource() usb: typec: intel_pmc_mux: Put fwnode in error case during ->probe() usb: typec: tcpm: Do not finish VDM AMS for retrying Responses usb: fix various gadget panics on 10gbps cabling usb: fix various gadgets null ptr deref on 10gbps cabling. usb: pci-quirks: disable D3cold on xhci suspend for s2idle on AMD Renoir usb: f_ncm: only first packet of aggregate needs to start timer USB: f_ncm: ncm_bitrate (speed) is unsigned MAINTAINERS: usb: add entry for isp1760 ...
2021-06-12Merge tag 'char-misc-5.13-rc6' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small misc driver fixes for 5.13-rc6 that fix some reported problems: - Tiny phy driver fixes for reported issues - rtsx regression for when the device suspended - mhi driver fix for a use-after-free All of these have been in linux-next for a few days with no reported issues" * tag 'char-misc-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: misc: rtsx: separate aspm mode into MODE_REG and MODE_CFG bus: mhi: pci-generic: Fix hibernation bus: mhi: pci_generic: Fix possible use-after-free in mhi_pci_remove() bus: mhi: pci_generic: T99W175: update channel name from AT to DUN phy: Sparx5 Eth SerDes: check return value after calling platform_get_resource() phy: ralink: phy-mt7621-pci: drop 'of_match_ptr' to fix -Wunused-const-variable phy: ti: Fix an error code in wiz_probe() phy: phy-mtk-tphy: Fix some resource leaks in mtk_phy_init() phy: cadence: Sierra: Fix error return code in cdns_sierra_phy_probe() phy: usb: Fix misuse of IS_ENABLED
2021-06-12Merge tag 'sched-urgent-2021-06-12' of ↵Linus Torvalds3-1/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: - Fix performance regression caused by lack of intended batching of RCU callbacks by over-eager NOHZ-full code. - Fix cgroups related corruption of load_avg and load_sum metrics. - Three fixes to fix blocked load, util_sum/runnable_sum and util_est tracking bugs" * tag 'sched-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling sched/pelt: Ensure that *_sum is always synced with *_avg tick/nohz: Only check for RCU deferred wakeup on user/guest entry when needed sched/fair: Make sure to update tg contrib for blocked load sched/fair: Keep load_avg and load_sum synced
2021-06-11net: pcs: xpcs: export xpcs_do_config and xpcs_link_upVladimir Oltean1-0/+4
The sja1105 hardware has a quirk in that some changes require a switch reset, which loses all configuration. When the reset is initiated, everything needs to be reprogrammed, including the MACs and the PCS. This is currently done in sja1105_static_config_reload() - we manually call sja1105_adjust_port_config(), sja1105_sgmii_pcs_config() and sja1105_sgmii_pcs_force_speed() which are all internal functions. There is a desire for sja1105 to use the common xpcs driver, and that means that the equivalents of those functions, xpcs_do_config() and xpcs_link_up() respectively, will no longer be local functions. Forcing phylink to retrigger a resolve somehow, say by doing dev_close() followed by dev_open() is not really an option, because the CPU port might have a PCS as well, and there is no net device which we can close and reopen for that. Additionally, the dev_close/dev_open sequence might force a renegotiation of the copper-side link for SGMII ports connected to a PHY, and this is undesirable as well, because the switch reset is much quicker than a PHY autoneg, so we would have a lot more downtime. The only solution I see is for the sja1105 driver to keep doing what it's doing, and that means we need to export the equivalents from xpcs for sja1105_sgmii_pcs_config and sja1105_sgmii_pcs_force_speed, and call them directly in sja1105_static_config_reload(). This will be done during the conversion patch. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: pcs: xpcs: add support for NXP SJA1110Vladimir Oltean1-0/+1
The NXP SJA1110 switch integrates its own, non-Synopsys PMA, but it manages it through the register space of the XPCS itself, in a small register window inside MDIO_MMD_VEND2 from address 0x8030 to 0x806e. This coincides with where the registers for the default Synopsys PMA are, but the register definitions are of course not the same. This situation is an odd hardware quirk, but the simplest way to manage it is to drive the SJA1110's PMA from within the XPCS driver. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: pcs: xpcs: add support for NXP SJA1105Vladimir Oltean1-0/+2
The NXP SJA1105 DSA switch integrates a Synopsys SGMII XPCS on port 4. The generic code works fine, except there is an integration issue which needs to be dealt with: in this switch, the XPCS is integrated with a PMA that has the TX lane polarity inverted by default (PLUS is MINUS, MINUS is PLUS). To obtain normal non-inverted behavior, the TX lane polarity must be inverted in the PCS, via the DIGITAL_CONTROL_2 register. We introduce a pma_config() method in xpcs_compat which is called by the phylink_pcs_config() implementation. Also, the NXP SJA1105 returns all zeroes in the PHY ID registers 2 and 3. We need to hack up an ad-hoc PHY ID (OUI is zero, device ID is 1) in order for the XPCS driver to recognize it. This PHY ID is added to the public include/linux/pcs/pcs-xpcs.h for that reason (for the sja1105 driver to be able to use it in a later patch). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: pcs: xpcs: rename mdio_xpcs_args to dw_xpcsVladimir Oltean1-7/+7
The struct mdio_xpcs_args is reminiscent of when a similarly named struct mdio_xpcs_ops existed. Now that that is removed, we can shorten the name to dw_xpcs (dw for DesignWare). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11Merge branch '100GbE' of ↵David S. Miller1-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Jake Keller says: ==================== 100GbE Intel Wired LAN Driver Updates 2021-06-11 Extend the ice driver to support basic PTP clock functionality for E810 devices. This includes some tangential work required to setup the sideband queue and driver shared parameters as well. This series only supports E810-based devices. This is because other devices based on the E822 MAC use a different and more complex PHY. The low level device functionality is kept within ice_ptp_hw.c and is designed to be extensible for supporting E822 devices in a future series. This series also only supports very basic functionality including the ptp_clock device and timestamping. Support for configuring periodic outputs and external input timestamps will be implemented in a future series. There are a couple of potential "what? why?" bits in this series I want to point out: 1) the PTP hardware functionality is shared between multiple functions. This means that the same clock registers are shared across multiple PFs. In order to avoid contention or clashing between PFs, firmware assigns "ownership" to one PF, while other PFs are merely "associated" with the timer. Because we share the hardware resource, only the clock owner will allocate and register a PTP clock device. Other PFs determine the appropriate PTP clock index to report by using a firmware interface to read a shared parameter that is set by the owning PF. 2) the ice driver uses its own kthread instead of using do_aux_work. This is because the periodic and asynchronous tasks are necessary for all PFs, but only one PF will allocate the clock. The series is broken up into functional pieces to allow easy review. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: rest of SOCK_SEQPACKET supportArseny Krasnov1-0/+5
Small updates to make SOCK_SEQPACKET work: 1) Send SHUTDOWN on socket close for SEQPACKET type. 2) Set SEQPACKET packet type during send. 3) Set 'VIRTIO_VSOCK_SEQ_EOR' bit in flags for last packet of message. 4) Implement data check function for SEQPACKET. 5) Check for max datagram size. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11virtio/vsock: dequeue callback for SOCK_SEQPACKETArseny Krasnov1-0/+5
Callback fetches RW packets from rx queue of socket until whole record is copied(if user's buffer is full, user is not woken up). This is done to not stall sender, because if we wake up user and it leaves syscall, nobody will send credit update for rest of record, and sender will wait for next enter of read syscall at receiver's side. So if user buffer is full, we just send credit update and drop data. Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: phylink: introduce phylink_fwnode_phy_connect()Calvin Johnson1-0/+3
Define phylink_fwnode_phy_connect() to connect phy specified by a fwnode to a phylink instance. Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Grant Likely <grant.likely@arm.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: mdio: Add ACPI support code for mdioCalvin Johnson1-0/+26
Define acpi_mdiobus_register() to Register mii_bus and create PHYs for each ACPI child node. Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Grant Likely <grant.likely@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11ACPI: utils: Introduce acpi_get_local_address()Calvin Johnson1-0/+7
Introduce a wrapper around the _ADR evaluation. Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Grant Likely <grant.likely@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>