summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2020-12-06mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPINGMinchan Kim1-1/+0
While I was doing zram testing, I found sometimes decompression failed since the compression buffer was corrupted. With investigation, I found below commit calls cond_resched unconditionally so it could make a problem in atomic context if the task is reschedule. BUG: sleeping function called from invalid context at mm/vmalloc.c:108 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog 3 locks held by memhog/946: #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160 #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160 #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0 CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 Call Trace: unmap_kernel_range_noflush+0x2eb/0x350 unmap_kernel_range+0x14/0x30 zs_unmap_object+0xd5/0xe0 zram_bvec_rw.isra.0+0x38c/0x8e0 zram_rw_page+0x90/0x101 bdev_write_page+0x92/0xe0 __swap_writepage+0x94/0x4a0 pageout+0xe3/0x3a0 shrink_page_list+0xb94/0xd60 shrink_inactive_list+0x158/0x460 We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which contains the offending calling code) from zsmalloc. Even though this option showed some amount improvement(e.g., 30%) in some arm32 platforms, it has been headache to maintain since it have abused APIs[1](e.g., unmap_kernel_range in atomic context). Since we are approaching to deprecate 32bit machines and already made the config option available for only builtin build since v5.8, lastly it has been not default option in zsmalloc, it's time to drop the option for better maintenance. [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range") Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Lindgren <tony@atomide.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Harish Sriram <harish@linux.ibm.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-05Merge tag 'for-5.10/dm-fixes-2' of ↵Linus Torvalds1-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull fix for device mapper fixes from Mike Snitzer: "Apologies for the glaring bug I introduced with my previous pull request! Fix incorrect branching at top of blk_max_size_offset()" * tag 'for-5.10/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: block: fix incorrect branching in blk_max_size_offset()
2020-12-05block: fix incorrect branching in blk_max_size_offset()Mike Snitzer1-4/+6
If non-zero 'chunk_sectors' is passed in to blk_max_size_offset() that override will be incorrectly ignored. Old blk_max_size_offset() branching, prior to commit 3ee16db390b4, must be used only if passed 'chunk_sectors' override is zero. Fixes: 3ee16db390b4 ("dm: fix IO splitting") Cc: stable@vger.kernel.org # 5.9 Reported-by: John Dorminy <jdorminy@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-12-05Merge tag 'for-5.10/dm-fixes' of ↵Linus Torvalds1-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM's bio splitting changes that were made during v5.9. This restores splitting in terms of varied per-target ti->max_io_len rather than use block core's single stacked 'chunk_sectors' limit. - Like DM crypt, update DM integrity to not use crypto drivers that have CRYPTO_ALG_ALLOCATES_MEMORY set. - Fix DM writecache target's argument parsing and status display. - Remove needless BUG() from dm writecache's persistent_memory_claim() - Remove old gcc workaround in DM cache target's block_div() for ARM link errors now that gcc >= 4.9 is required. - Fix RCU locking in dm_blk_report_zones and dm_dax_zero_page_range. - Remove old, and now frowned upon, BUG_ON(in_interrupt()) in dm_table_event(). - Remove invalid sparse annotations from dm_prepare_ioctl() and dm_unprepare_ioctl(). * tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: remove invalid sparse __acquires and __releases annotations dm: fix double RCU unlock in dm_dax_zero_page_range() error path dm: fix IO splitting dm writecache: remove BUG() and fail gracefully instead dm table: Remove BUG_ON(in_interrupt()) dm: fix bug with RCU locking in dm_blk_report_zones Revert "dm cache: fix arm link errors with inline" dm writecache: fix the maximum number of arguments dm writecache: advance the number of arguments when reporting max_age dm integrity: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
2020-12-04dm: fix IO splittingMike Snitzer1-5/+6
Commit 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting") caused a couple regressions: 1) Using lcm_not_zero() when stacking chunk_sectors was a bug because chunk_sectors must reflect the most limited of all devices in the IO stack. 2) DM targets that set max_io_len but that do _not_ provide an .iterate_devices method no longer had there IO split properly. And commit 5091cdec56fa ("dm: change max_io_len() to use blk_max_size_offset()") also caused a regression where DM no longer supported varied (per target) IO splitting. The implication being the potential for severely reduced performance for IO stacks that use a DM target like dm-cache to hide performance limitations of a slower device (e.g. one that requires 4K IO splitting). Coming full circle: Fix all these issues by discontinuing stacking chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(), add optional chunk_sectors override argument to blk_max_size_offset() and update DM's max_io_len() to pass ti->max_io_len to its blk_max_size_offset() call. Passing in an optional chunk_sectors override to blk_max_size_offset() allows for code reuse of block's centralized calculation for max IO size based on provided offset and split boundary. Fixes: 882ec4e609c1 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting") Fixes: 5091cdec56fa ("dm: change max_io_len() to use blk_max_size_offset()") Cc: stable@vger.kernel.org Reported-by: John Dorminy <jdorminy@redhat.com> Reported-by: Bruce Johnston <bjohnsto@redhat.com> Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: John Dorminy <jdorminy@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk>
2020-12-04Merge tag 'net-5.10-rc7' of ↵Linus Torvalds2-2/+21
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.10-rc7, including fixes from bpf, netfilter, wireless drivers, wireless mesh and can. Current release - regressions: - mt76: usb: fix crash on device removal Current release - always broken: - xsk: Fix umem cleanup from wrong context in socket destruct Previous release - regressions: - net: ip6_gre: set dev->hard_header_len when using header_ops - ipv4: Fix TOS mask in inet_rtm_getroute() - net, xsk: Avoid taking multiple skbuff references Previous release - always broken: - net/x25: prevent a couple of overflows - netfilter: ipset: prevent uninit-value in hash_ip6_add - geneve: pull IP header before ECN decapsulation - mpls: ensure LSE is pullable in TC and openvswitch paths - vxlan: respect needed_headroom of lower device - batman-adv: Consider fragmentation for needed packet headroom - can: drivers: don't count arbitration loss as an error - netfilter: bridge: reset skb->pkt_type after POST_ROUTING traversal - inet_ecn: Fix endianness of checksum update when setting ECT(1) - ibmvnic: fix various corner cases around reset handling - net/mlx5: fix rejecting unsupported Connect-X6DX SW steering - net/mlx5: Enforce HW TX csum offload with kTLS" * tag 'net-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits) net/mlx5: DR, Proper handling of unsupported Connect-X6DX SW steering net/mlx5e: kTLS, Enforce HW TX csum offload with kTLS net: mlx5e: fix fs_tcp.c build when IPV6 is not enabled net/mlx5: Fix wrong address reclaim when command interface is down net/sched: act_mpls: ensure LSE is pullable before reading it net: openvswitch: ensure LSE is pullable before reading it net: skbuff: ensure LSE is pullable before decrementing the MPLS ttl net: mvpp2: Fix error return code in mvpp2_open() chelsio/chtls: fix a double free in chtls_setkey() rtw88: debug: Fix uninitialized memory in debugfs code vxlan: fix error return code in __vxlan_dev_create() net: pasemi: fix error return code in pasemi_mac_open() cxgb3: fix error return code in t3_sge_alloc_qset() net/x25: prevent a couple of overflows dpaa_eth: copy timestamp fields to new skb in A-050385 workaround net: ip6_gre: set dev->hard_header_len when using header_ops mt76: usb: fix crash on device removal iwlwifi: pcie: add some missing entries for AX210 iwlwifi: pcie: invert values of NO_160 device config entries iwlwifi: pcie: add one missing entry for AX210 ...
2020-12-03net/mlx5: DR, Proper handling of unsupported Connect-X6DX SW steeringYevgeny Kliteynik1-1/+8
STEs format for Connect-X5 and Connect-X6DX different. Currently, on Connext-X6DX the SW steering would break at some point when building STEs w/o giving a proper error message. Fix this by checking the STE format of the current device when initializing domain: add mlx5_ifc definitions for Connect-X6DX SW steering, read FW capability to get the current format version, and check this version when domain is being created. Fixes: 26d688e33f88 ("net/mlx5: DR, Add Steering entry (STE) utilities") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02Merge tag 'trace-v5.10-rc6' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Use correct timestamp variable for ring buffer write stamp update - Fix up before stamp and write stamp when crossing ring buffer sub buffers - Keep a zero delta in ring buffer in slow path if cmpxchg fails - Fix trace_printk static buffer for archs that care - Fix ftrace record accounting for ftrace ops with trampolines - Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency - Remove WARN_ON in hwlat tracer that triggers on something that is OK - Make "my_tramp" trampoline in ftrace direct sample code global - Fixes in the bootconfig tool for better alignment management * tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Always check to put back before stamp when crossing pages ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency ftrace: Fix updating FTRACE_FL_TRAMP tracing: Fix alignment of static buffer tracing: Remove WARN_ON in start_thread() samples/ftrace: Mark my_tramp[12]? global ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next() ring-buffer: Update write stamp with the correct ts docs: bootconfig: Update file format on initrd image tools/bootconfig: Align the bootconfig applied initrd image size to 4 tools/bootconfig: Fix to check the write failure correctly tools/bootconfig: Fix errno reference after printf()
2020-11-28Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski1-1/+13
Daniel Borkmann says: ==================== pull-request: bpf 2020-11-28 1) Do not reference the skb for xsk's generic TX side since when looped back into RX it might crash in generic XDP, from Björn Töpel. 2) Fix umem cleanup on a partially set up xsk socket when being destroyed, from Magnus Karlsson. 3) Fix an incorrect netdev reference count when failing xsk_bind() operation, from Marek Majtyka. 4) Fix bpftool to set an error code on failed calloc() in build_btf_type_table(), from Zhen Lei. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Add MAINTAINERS entry for BPF LSM bpftool: Fix error return value in build_btf_type_table net, xsk: Avoid taking multiple skbuff references xsk: Fix incorrect netdev reference count xsk: Fix umem cleanup bug at socket destruct MAINTAINERS: Update XDP and AF_XDP entries ==================== Link: https://lore.kernel.org/r/20201128005104.1205-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28Merge tag 'asm-generic-fixes-5.10-2' of ↵Linus Torvalds1-0/+13
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic fix from Arnd Bergmann: "Add correct MAX_POSSIBLE_PHYSMEM_BITS setting to asm-generic. This is a single bugfix for a bug that Stefan Agner found on 32-bit Arm, but that exists on several other architectures" * tag 'asm-generic-fixes-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: arch: pgtable: define MAX_POSSIBLE_PHYSMEM_BITS where needed
2020-11-28Merge tag 'arm-soc-fixes-v5.10-3' of ↵Linus Torvalds2-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "Another set of patches for devicetree files and Arm SoC specific drivers: - A fix for OP-TEE shared memory on non-SMP systems - multiple code fixes for the OMAP platform, including one regression for the CPSW network driver and a few runtime warning fixes - Some DT patches for the Rockchip RK3399 platform, in particular fixing the MMC device ordering that recently became nondeterministic with async probe. - Multiple DT fixes for the Tegra platform, including a regression fix for suspend/resume on TX2 - A regression fix for a user-triggered fault in the NXP dpio driver - A regression fix for a bug caused by an earlier bug fix in the xilinx firmware driver - Two more DTC warning fixes - Sylvain Lemieux steps down as maintainer for the NXP LPC32xx platform" * tag 'arm-soc-fixes-v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (24 commits) arm64: tegra: Fix Tegra234 VDK node names arm64: tegra: Wrong AON HSP reg property size arm64: tegra: Fix USB_VBUS_EN0 regulator on Jetson TX1 arm64: tegra: Correct the UART for Jetson Xavier NX arm64: tegra: Disable the ACONNECT for Jetson TX2 optee: add writeback to valid memory type firmware: xilinx: Use hash-table for api feature check firmware: xilinx: Fix SD DLL node reset issue soc: fsl: dpio: Get the cpumask through cpumask_of(cpu) ARM: dts: dra76x: m_can: fix order of clocks bus: ti-sysc: suppress err msg for timers used as clockevent/source MAINTAINERS: Remove myself as LPC32xx maintainers arm64: dts: qcom: clear the warnings caused by empty dma-ranges arm64: dts: broadcom: clear the warnings caused by empty dma-ranges ARM: dts: am437x-l4: fix compatible for cpsw switch dt node arm64: dts: rockchip: Reorder LED triggers from mmc devices on rk3399-roc-pc. arm64: dts: rockchip: Assign a fixed index to mmc devices on rk3399 boards. arm64: dts: rockchip: Remove system-power-controller from pmic on Odroid Go Advance arm64: dts: rockchip: fix NanoPi R2S GMAC clock name ARM: OMAP2+: Manage MPU state properly for omap_enter_idle_coupled() ...
2020-11-28Merge tag 'net-5.10-rc6' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.10-rc6, including fixes from the WiFi driver, and CAN subtrees. Current release - regressions: - gro_cells: reduce number of synchronize_net() calls - ch_ktls: release a lock before jumping to an error path Current release - always broken: - tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header Previous release - regressions: - net/tls: fix missing received data after fast remote close - vsock/virtio: discard packets only when socket is really closed - sock: set sk_err to ee_errno on dequeue from errq - cxgb4: fix the panic caused by non smac rewrite Previous release - always broken: - tcp: fix corner cases around setting ECN with BPF selection of congestion control - tcp: fix race condition when creating child sockets from syncookies on loopback interface - usbnet: ipheth: fix connectivity with iOS 14 - tun: honor IOCB_NOWAIT flag - net/packet: fix packet receive on L3 devices without visible hard header - devlink: Make sure devlink instance and port are in same net namespace - net: openvswitch: fix TTL decrement action netlink message format - bonding: wait for sysfs kobject destruction before freeing struct slave - net: stmmac: fix upstream patch applied to the wrong context - bnxt_en: fix return value and unwind in probe error paths Misc: - devlink: add extra layer of categorization to the reload stats uAPI before it's released" * tag 'net-5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits) sock: set sk_err to ee_errno on dequeue from errq mptcp: fix NULL ptr dereference on bad MPJ net: openvswitch: fix TTL decrement action netlink message format can: af_can: can_rx_unregister(): remove WARN() statement from list operation sanity check can: m_can: m_can_dev_setup(): add support for bosch mcan version 3.3.0 can: m_can: fix nominal bitiming tseg2 min for version >= 3.1 can: m_can: m_can_open(): remove IRQF_TRIGGER_FALLING from request_threaded_irq()'s flags can: mcp251xfd: mcp251xfd_probe(): bail out if no IRQ was given can: gs_usb: fix endianess problem with candleLight firmware ch_ktls: lock is not freed net/tls: Protect from calling tls_dev_del for TLS RX twice devlink: Make sure devlink instance and port are in same net namespace devlink: Hold rtnl lock while reading netdev attributes ptp: clockmatrix: bug fix for idtcm_strverscmp enetc: Let the hardware auto-advance the taprio base-time of 0 gro_cells: reduce number of synchronize_net() calls net: stmmac: fix incorrect merge of patch upstream ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init Documentation: netdev-FAQ: suggest how to post co-dependent series ibmvnic: enhance resetting status check during module exit ...
2020-11-27Merge tag 'omap-for-v5.10/fixes-rc5-signed' of ↵Arnd Bergmann1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes Fixes for omaps for various issues noticed during the -rc cycle: - Earlier omap4 cpuidle fix was incomplete and needs to use a configured idle state instead - Fix am4 cpsw driver compatible to avoid invalid resource error for the legacy driver - Two kconfig fixes for genpd support that we added for for v5.10 for proper location of the option and adding missing option - Fix ti-sysc reset status checking on enabling modules to ignore quirky modules with reset status only usable when the quirk is activated during reset. Also fix bogus resetdone warning for cpsw and modules with no sysst register reset status bit - Suppress a ti-sysc warning for timers reserved as system timers - Fix the ordering of clocks for dra7 m_can * tag 'omap-for-v5.10/fixes-rc5-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: ARM: dts: dra76x: m_can: fix order of clocks bus: ti-sysc: suppress err msg for timers used as clockevent/source ARM: dts: am437x-l4: fix compatible for cpsw switch dt node ARM: OMAP2+: Manage MPU state properly for omap_enter_idle_coupled() bus: ti-sysc: Fix bogus resetdone warning on enable for cpsw bus: ti-sysc: Fix reset status check for modules with quirks ARM: OMAP2+: Fix missing select PM_GENERIC_DOMAINS_OF ARM: OMAP2+: Fix location for select PM_GENERIC_DOMAINS Link: https://lore.kernel.org/r/pull-1606460270-864284@atomide.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-11-26mm: memcg: relayout structure mem_cgroup to avoid cache interferenceFeng Tang1-14/+14
0day reported one -22.7% regression for will-it-scale page_fault2 case [1] on a 4 sockets 144 CPU platform, and bisected to it to be caused by Waiman's optimization (commit bd0b230fe1) of saving one 'struct page_counter' space for 'struct mem_cgroup'. Initially we thought it was due to the cache alignment change introduced by the patch, but further debug shows that it is due to some hot data members ('vmstats_local', 'vmstats_percpu', 'vmstats') sit in 2 adjacent cacheline (2N and 2N+1 cacheline), and when adjacent cache line prefetch is enabled, it triggers an "extended level" of cache false sharing for 2 adjacent cache lines. So exchange the 2 member blocks, while keeping mostly the original cache alignment, which can restore and even enhance the performance, and save 64 bytes of space for 'struct mem_cgroup' (from 2880 to 2816, with 0day's default RHEL-8.3 kernel config) [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/ Fixes: bd0b230fe145 ("mm/memcg: unify swap and memsw page counters") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-25net, xsk: Avoid taking multiple skbuff referencesBjörn Töpel1-1/+13
Commit 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY") addressed the problem that packets were discarded from the Tx AF_XDP ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was bumping the skbuff reference count, so that the buffer would not be freed by dev_direct_xmit(). A reference count larger than one means that the skbuff is "shared", which is not the case. If the "shared" skbuff is sent to the generic XDP receive path, netif_receive_generic_xdp(), and pskb_expand_head() is entered the BUG_ON(skb_shared(skb)) will trigger. This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(), where a user can select the skbuff free policy. This allows AF_XDP to avoid bumping the reference count, but still keep the NETDEV_TX_BUSY behavior. Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY") Reported-by: Yonghong Song <yhs@fb.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com
2020-11-24firmware: xilinx: Use hash-table for api feature checkAmit Sunil Dhamne1-4/+0
Currently array of fix length PM_API_MAX is used to cache the pm_api version (valid or invalid). However ATF based PM APIs values are much higher then PM_API_MAX. So to include ATF based PM APIs also, use hash-table to store the pm_api version status. Signed-off-by: Amit Sunil Dhamne <amit.sunil.dhamne@xilinx.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ravi Patel <ravi.patel@xilinx.com> Signed-off-by: Rajan Vaja <rajan.vaja@xilinx.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Tested-by: Michal Simek <michal.simek@xilinx.com> Fixes: f3217d6f2f7a ("firmware: xilinx: fix out-of-bounds access") Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/1606197161-25976-1-git-send-email-rajan.vaja@xilinx.com Signed-off-by: Michal Simek <michal.simek@xilinx.com>
2020-11-24net/packet: fix packet receive on L3 devices without visible hard headerEyal Birger1-0/+5
In the patchset merged by commit b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which did not have header_ops were given one for the purpose of protocol parsing on af_packet transmit path. That change made af_packet receive path regard these devices as having a visible L3 header and therefore aligned incoming skb->data to point to the skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not reset their mac_header prior to ingress and therefore their incoming packets became malformed. Ideally these devices would reset their mac headers, or af_packet would be able to rely on dev->hard_header_len being 0 for such cases, but it seems this is not the case. Fix by changing af_packet RX ll visibility criteria to include the existence of a '.create()' header operation, which is used when creating a device hard header - via dev_hard_header() - by upper layers, and does not exist in these L3 devices. As this predicate may be useful in other situations, add it as a common dev_has_header() helper in netdevice.h. Fixes: b9fcf0a0d826 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Acked-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23Merge tag 'sched-urgent-2020-11-22' of ↵Linus Torvalds1-2/+24
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "A couple of scheduler fixes: - Make the conditional update of the overutilized state work correctly by caching the relevant flags state before overwriting them and checking them afterwards. - Fix a data race in the wakeup path which caused loadavg on ARM64 platforms to become a random number generator. - Fix the ordering of the iowaiter accounting operations so it can't be decremented before it is incremented. - Fix a bug in the deadline scheduler vs. priority inheritance when a non-deadline task A has inherited the parameters of a deadline task B and then blocks on a non-deadline task C. The second inheritance step used the static deadline parameters of task A, which are usually 0, instead of further propagating task B's parameters. The zero initialized parameters trigger a bug in the deadline scheduler" * tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix priority inheritance with multiple scheduling classes sched: Fix rq->nr_iowait ordering sched: Fix data-race in wakeup sched/fair: Fix overutilized update in enqueue_task_fair()
2020-11-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds4-15/+33
Merge misc fixes from Andrew Morton: "8 patches. Subsystems affected by this patch series: mm (madvise, pagemap, readahead, memcg, userfaultfd), kbuild, and vfs" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: fix madvise WILLNEED performance problem libfs: fix error cast of negative value in simple_attr_write() mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault() mm: memcg/slab: fix root memcg vmstats mm: fix readahead_page_batch for retry entries mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports compiler-clang: remove version check for BPF Tracing mm/madvise: fix memory leak from process_madvise
2020-11-22Merge tag 'ext4_for_linus_fixes2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A final set of miscellaneous bug fixes for ext4" * tag 'ext4_for_linus_fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix bogus warning in ext4_update_dx_flag() jbd2: fix kernel-doc markups ext4: drop fast_commit from /proc/mounts
2020-11-22mm: fix readahead_page_batch for retry entriesMatthew Wilcox (Oracle)1-0/+2
Both btrfs and fuse have reported faults caused by seeing a retry entry instead of the page they were looking for. This was caused by a missing check in the iterator. As can be seen in the below panic log, the accessing 0x402 causes a panic. In the xarray.h, 0x402 means RETRY_ENTRY. BUG: kernel NULL pointer dereference, address: 0000000000000402 CPU: 14 PID: 306003 Comm: as Not tainted 5.9.0-1-amd64 #1 Debian 5.9.1-1 Hardware name: Lenovo ThinkSystem SR665/7D2VCTO1WW, BIOS D8E106Q-1.01 05/30/2020 RIP: 0010:fuse_readahead+0x152/0x470 [fuse] Code: 41 8b 57 18 4c 8d 54 10 ff 4c 89 d6 48 8d 7c 24 10 e8 d2 e3 28 f9 48 85 c0 0f 84 fe 00 00 00 44 89 f2 49 89 04 d4 44 8d 72 01 <48> 8b 10 41 8b 4f 1c 48 c1 ea 10 83 e2 01 80 fa 01 19 d2 81 e2 01 RSP: 0018:ffffad99ceaebc50 EFLAGS: 00010246 RAX: 0000000000000402 RBX: 0000000000000001 RCX: 0000000000000002 RDX: 0000000000000000 RSI: ffff94c5af90bd98 RDI: ffffad99ceaebc60 RBP: ffff94ddc1749a00 R08: 0000000000000402 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000100 R12: ffff94de6c429ce0 R13: ffff94de6c4d3700 R14: 0000000000000001 R15: ffffad99ceaebd68 FS: 00007f228c5c7040(0000) GS:ffff94de8ed80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000402 CR3: 0000001dbd9b4000 CR4: 0000000000350ee0 Call Trace: read_pages+0x83/0x270 page_cache_readahead_unbounded+0x197/0x230 generic_file_buffered_read+0x57a/0xa20 new_sync_read+0x112/0x1a0 vfs_read+0xf8/0x180 ksys_read+0x5f/0xe0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 042124cc64c3 ("mm: add new readahead_control API") Reported-by: David Sterba <dsterba@suse.com> Reported-by: Wonhyuk Yang <vvghjk1234@gmail.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201103142852.8543-1-willy@infradead.org Link: https://lkml.kernel.org/r/20201103124349.16722-1-vvghjk1234@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-22mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exportsDan Williams2-15/+29
The core-mm has a default __weak implementation of phys_to_target_node() to mirror the weak definition of memory_add_physaddr_to_nid(). That symbol is exported for modules. However, while the export in mm/memory_hotplug.c exported the symbol in the configuration cases of: CONFIG_NUMA_KEEP_MEMINFO=y CONFIG_MEMORY_HOTPLUG=y ...and: CONFIG_NUMA_KEEP_MEMINFO=n CONFIG_MEMORY_HOTPLUG=y ...it failed to export the symbol in the case of: CONFIG_NUMA_KEEP_MEMINFO=y CONFIG_MEMORY_HOTPLUG=n Not only is that broken, but Christoph points out that the kernel should not be exporting any __weak symbol, which means that memory_add_physaddr_to_nid() example that phys_to_target_node() copied is broken too. Rework the definition of phys_to_target_node() and memory_add_physaddr_to_nid() to not require weak symbols. Move to the common arch override design-pattern of an asm header defining a symbol to replace the default implementation. The only common header that all memory_add_physaddr_to_nid() producing architectures implement is asm/sparsemem.h. In fact, powerpc already defines its memory_add_physaddr_to_nid() helper in sparsemem.h. Double-down on that observation and define phys_to_target_node() where necessary in asm/sparsemem.h. An alternate consideration that was discarded was to put this override in asm/numa.h, but that entangles with the definition of MAX_NUMNODES relative to the inclusion of linux/nodemask.h, and requires powerpc to grow a new header. The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid now that the symbol is properly exported / stubbed in all combinations of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG. [dan.j.williams@intel.com: v4] Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning] Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation") Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-22compiler-clang: remove version check for BPF TracingNick Desaulniers1-0/+2
bpftrace parses the kernel headers and uses Clang under the hood. Remove the version check when __BPF_TRACING__ is defined (as bpftrace does) so that this tool can continue to parse kernel headers, even with older clang sources. Fixes: commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1") Reported-by: Chen Yu <yu.chen.surf@gmail.com> Reported-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Link: https://lkml.kernel.org/r/20201104191052.390657-1-ndesaulniers@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-20Merge tag 'iommu-fixes' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull iommu fixes from Will Deacon: "Two straightforward vt-d fixes: - Fix boot when intel iommu initialisation fails under TXT (tboot) - Fix intel iommu compilation error when DMAR is enabled without ATS and temporarily update IOMMU MAINTAINERs entry" * tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: MAINTAINERS: Temporarily add myself to the IOMMU entry iommu/vt-d: Fix compile error with CONFIG_PCI_ATS not set iommu/vt-d: Avoid panic if iommu init fails in tboot system
2020-11-20jbd2: fix kernel-doc markupsMauro Carvalho Chehab1-1/+1
Kernel-doc markup should use this format: identifier - description They should not have any type before that, as otherwise the parser won't do the right thing. Also, some identifiers have different names between their prototypes and the kernel-doc markup. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/72f5c6628f5f278d67625f60893ffbc2ca28d46e.1605521731.git.mchehab+huawei@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-20Merge tag 'net-5.10-rc5' of ↵Linus Torvalds1-0/+21
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.10-rc5, including fixes from the WiFi (mac80211), can and bpf (including the strncpy_from_user fix). Current release - regressions: - mac80211: fix memory leak of filtered powersave frames - mac80211: free sta in sta_info_insert_finish() on errors to avoid sleeping in atomic context - netlabel: fix an uninitialized variable warning added in -rc4 Previous release - regressions: - vsock: forward all packets to the host when no H2G is registered, un-breaking AWS Nitro Enclaves - net: Exempt multicast addresses from five-second neighbor lifetime requirement, decreasing the chances neighbor tables fill up - net/tls: fix corrupted data in recvmsg - qed: fix ILT configuration of SRC block - can: m_can: process interrupt only when not runtime suspended Previous release - always broken: - page_frag: Recover from memory pressure by not recycling pages allocating from the reserves - strncpy_from_user: Mask out bytes after NUL terminator - ip_tunnels: Set tunnel option flag only when tunnel metadata is present, always setting it confuses Open vSwitch - bpf, sockmap: - Fix partial copy_page_to_iter so progress can still be made - Fix socket memory accounting and obeying SO_RCVBUF - net: Have netpoll bring-up DSA management interface - net: bridge: add missing counters to ndo_get_stats64 callback - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt - enetc: Workaround MDIO register access HW bug - net/ncsi: move netlink family registration to a subsystem init, instead of tying it to driver probe - net: ftgmac100: unregister NC-SI when removing driver to avoid crash - lan743x: - prevent interrupt storm on open - fix freeing skbs in the wrong context - net/mlx5e: Fix socket refcount leak on kTLS RX resync - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097 - fix 21 unset return codes and other mistakes on error paths, mostly detected by the Hulk Robot" * tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (115 commits) fail_function: Remove a redundant mutex unlock selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL lib/strncpy_from_user.c: Mask out bytes after NUL terminator. net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid() net/smc: fix matching of existing link groups ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module libbpf: Fix VERSIONED_SYM_COUNT number parsing net/mlx4_core: Fix init_hca fields offset atm: nicstar: Unmap DMA on send error page_frag: Recover from memory pressure net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset mlxsw: core: Use variable timeout for EMAD retries mlxsw: Fix firmware flashing net: Have netpoll bring-up DSA management interface atl1e: fix error return code in atl1e_probe() atl1c: fix error return code in atl1c_probe() ah6: fix error return code in ah6_input() net: usb: qmi_wwan: Set DTR quirk for MR400 can: m_can: process interrupt only when not runtime suspended can: flexcan: flexcan_chip_start(): fix erroneous flexcan_transceiver_enable() during bus-off recovery ...
2020-11-19Merge tag 'spi-fix-v5.10-rc4' of ↵Linus Torvalds1-0/+19
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "This is a relatively large set of fixes, the bulk of it being a series from Lukas Wunner which fixes confusion with the lifetime of driver data allocated along with the SPI controller structure that's been created as part of the conversion to devm APIs. The simplest fix, explained in detail in Lukas' commit message, is to move to a devm_ function for allocation of the controller and hence driver data in order to push the free of that after anything tries to reference the driver data in the remove path. This results in a relatively large diff due to the addition of a new function but isn't particularly complex. There's also a fix from Sven van Asbroeck which fixes yet more fallout from the conflicts between the various different places one can configure the polarity of GPIOs in modern systems. Otherwise everything is fairly small and driver specific" * tag 'spi-fix-v5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: npcm-fiu: Don't leak SPI master in probe error path spi: dw: Set transfer handler before unmasking the IRQs spi: cadence-quadspi: Fix error return code in cqspi_probe spi: bcm2835aux: Restore err assignment in bcm2835aux_spi_probe spi: lpspi: Fix use-after-free on unbind spi: bcm-qspi: Fix use-after-free on unbind spi: bcm2835aux: Fix use-after-free on unbind spi: bcm2835: Fix use-after-free on unbind spi: Introduce device-managed SPI controller allocation spi: fsi: Fix transfer returning without finalizing message spi: fix client driver breakages when using GPIO descriptors
2020-11-19tools/bootconfig: Align the bootconfig applied initrd image size to 4Masami Hiramatsu1-0/+3
Align the bootconfig applied initrd image size to 4. To fill the gap, the bootconfig command uses null characters in between the bootconfig data and the footer. This will expands the footer size but don't change the checksum. Thus the block image of the initrd file with bootconfig is as follows. [initrd][bootconfig][(pad)][size][csum]["#BOOTCONFIG\n"] Link: https://lkml.kernel.org/r/160576522046.320071.8550680670010950634.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-18iommu/vt-d: Avoid panic if iommu init fails in tboot systemZhenzhong Duan1-1/+0
"intel_iommu=off" command line is used to disable iommu but iommu is force enabled in a tboot system for security reason. However for better performance on high speed network device, a new option "intel_iommu=tboot_noforce" is introduced to disable the force on. By default kernel should panic if iommu init fail in tboot for security reason, but it's unnecessory if we use "intel_iommu=tboot_noforce,off". Fix the code setting force_on and move intel_iommu_tboot_noforce from tboot code to intel iommu code. Fixes: 7304e8f28bb2 ("iommu/vt-d: Correctly disable Intel IOMMU force on") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com> Tested-by: Lukasz Hawrylko <lukasz.hawrylko@linux.intel.com> Acked-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/20201110071908.3133-1-zhenzhong.duan@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
2020-11-17sched/deadline: Fix priority inheritance with multiple scheduling classesJuri Lelli1-1/+9
Glenn reported that "an application [he developed produces] a BUG in deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS task that was boosted by a SCHED_DEADLINE task boosts another CFS task (nested priority inheritance). ------------[ cut here ]------------ kernel BUG at kernel/sched/deadline.c:1462! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ... Hardware name: ... RIP: 0010:enqueue_task_dl+0x335/0x910 Code: ... RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600 FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? intel_pstate_update_util_hwp+0x13/0x170 rt_mutex_setprio+0x1cc/0x4b0 task_blocks_on_rt_mutex+0x225/0x260 rt_spin_lock_slowlock_locked+0xab/0x2d0 rt_spin_lock_slowlock+0x50/0x80 hrtimer_grab_expiry_lock+0x20/0x30 hrtimer_cancel+0x13/0x30 do_nanosleep+0xa0/0x150 hrtimer_nanosleep+0xe1/0x230 ? __hrtimer_init_sleeper+0x60/0x60 __x64_sys_nanosleep+0x8d/0xa0 do_syscall_64+0x4a/0x100 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fa58b52330d ... ---[ end trace 0000000000000002 ]— He also provided a simple reproducer creating the situation below: So the execution order of locking steps are the following (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2 are mutexes that are enabled * with priority inheritance.) Time moves forward as this timeline goes down: N1 N2 D1 | | | | | | Lock(M1) | | | | | | Lock(M2) | | | | | | Lock(M2) | | | | Lock(M1) | | (!!bug triggered!) | Daniel reported a similar situation as well, by just letting ksoftirqd run with DEADLINE (and eventually block on a mutex). Problem is that boosted entities (Priority Inheritance) use static DEADLINE parameters of the top priority waiter. However, there might be cases where top waiter could be a non-DEADLINE entity that is currently boosted by a DEADLINE entity from a different lock chain (i.e., nested priority chains involving entities of non-DEADLINE classes). In this case, top waiter static DEADLINE parameters could be null (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG(). Fix this by keeping track of the original donor and using its parameters when a task is boosted. Reported-by: Glenn Elliott <glenn@aurora.tech> Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
2020-11-17sched: Fix data-race in wakeupPeter Zijlstra1-1/+15
Mel reported that on some ARM64 platforms loadavg goes bananas and Will tracked it down to the following race: CPU0 CPU1 schedule() prev->sched_contributes_to_load = X; deactivate_task(prev); try_to_wake_up() if (p->on_rq &&) // false if (smp_load_acquire(&p->on_cpu) && // true ttwu_queue_wakelist()) p->sched_remote_wakeup = Y; smp_store_release(prev->on_cpu, 0); where both p->sched_contributes_to_load and p->sched_remote_wakeup are in the same word, and thus the stores X and Y race (and can clobber one another's data). Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") the p->on_cpu handoff serialized access to p->sched_remote_wakeup (just as it still does with p->sched_contributes_to_load) that commit broke that by calling ttwu_queue_wakelist() with p->on_cpu != 0. However, due to p->XXX = X ttwu() schedule() if (p->on_rq && ...) // false smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) && deactivate_task() ttwu_queue_wakelist()) p->on_rq = 0; p->sched_remote_wakeup = Y; We can be sure any 'current' store is complete and 'current' is guaranteed asleep. Therefore we can move p->sched_remote_wakeup into the current flags word. Note: while the observed failure was loadavg accounting gone wrong due to ttwu() cobbering p->sched_contributes_to_load, the reverse problem is also possible where schedule() clobbers p->sched_remote_wakeup, this could result in enqueue_entity() wrecking ->vruntime and causing scheduling artifacts. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Mel Gorman <mgorman@techsingularity.net> Debugged-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net
2020-11-17Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-0/+1
Pull vhost fixes from Michael Tsirkin: "Fixes all over the place, most notably vhost scsi IO error fixes" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost scsi: Add support for LUN resets. vhost scsi: add lun parser helper vhost scsi: fix cmd completion race vhost scsi: alloc cmds per vq instead of session vhost: add helper to check if a vq has been setup vdpasim: fix "mac_pton" undefined error swiotlb: using SIZE_MAX needs limits.h included
2020-11-16PM: runtime: Add pm_runtime_resume_and_get to deal with usage counterZhang Qilong1-0/+21
In many case, we need to check return value of pm_runtime_get_sync, but it brings a trouble to the usage counter processing. Many callers forget to decrease the usage counter when it failed, which could resulted in reference leak. It has been discussed a lot[0][1]. So we add a function to deal with the usage counter for better coding. [0]https://lkml.org/lkml/2020/6/14/88 [1]https://patchwork.ozlabs.org/project/linux-tegra/list/?series=178139 Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16arch: pgtable: define MAX_POSSIBLE_PHYSMEM_BITS where neededArnd Bergmann1-0/+13
Stefan Agner reported a bug when using zsram on 32-bit Arm machines with RAM above the 4GB address boundary: Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = a27bd01c [00000000] *pgd=236a0003, *pmd=1ffa64003 Internal error: Oops: 207 [#1] SMP ARM Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1 Hardware name: BCM2711 PC is at zs_map_object+0x94/0x338 LR is at zram_bvec_rw.constprop.0+0x330/0xa64 pc : [<c0602b38>] lr : [<c0bda6a0>] psr: 60000013 sp : e376bbe0 ip : 00000000 fp : c1e2921c r10: 00000002 r9 : c1dda730 r8 : 00000000 r7 : e8ff7a00 r6 : 00000000 r5 : 02f9ffa0 r4 : e3710000 r3 : 000fdffe r2 : c1e0ce80 r1 : ebf979a0 r0 : 00000000 Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user Control: 30c5383d Table: 235c2a80 DAC: fffffffd Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6) Stack: (0xe376bbe0 to 0xe376c000) As it turns out, zsram needs to know the maximum memory size, which is defined in MAX_PHYSMEM_BITS when CONFIG_SPARSEMEM is set, or in MAX_POSSIBLE_PHYSMEM_BITS on the x86 architecture. The same problem will be hit on all 32-bit architectures that have a physical address space larger than 4GB and happen to not enable sparsemem and include asm/sparsemem.h from asm/pgtable.h. After the initial discussion, I suggested just always defining MAX_POSSIBLE_PHYSMEM_BITS whenever CONFIG_PHYS_ADDR_T_64BIT is set, or provoking a build error otherwise. This addresses all configurations that can currently have this runtime bug, but leaves all other configurations unchanged. I looked up the possible number of bits in source code and datasheets, here is what I found: - on ARC, CONFIG_ARC_HAS_PAE40 controls whether 32 or 40 bits are used - on ARM, CONFIG_LPAE enables 40 bit addressing, without it we never support more than 32 bits, even though supersections in theory allow up to 40 bits as well. - on MIPS, some MIPS32r1 or later chips support 36 bits, and MIPS32r5 XPA supports up to 60 bits in theory, but 40 bits are more than anyone will ever ship - On PowerPC, there are three different implementations of 36 bit addressing, but 32-bit is used without CONFIG_PTE_64BIT - On RISC-V, the normal page table format can support 34 bit addressing. There is no highmem support on RISC-V, so anything above 2GB is unused, but it might be useful to eventually support CONFIG_ZRAM for high pages. Fixes: 61989a80fb3a ("staging: zsmalloc: zsmalloc memory allocation library") Fixes: 02390b87a945 ("mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS") Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Reviewed-by: Stefan Agner <stefan@agner.ch> Tested-by: Stefan Agner <stefan@agner.ch> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/linux-mm/bdfa44bf1c570b05d6c70898e2bbb0acf234ecdf.1604762181.git.stefan@agner.ch/ Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-11-15Merge tag 'perf-urgent-2020-11-15' of ↵Linus Torvalds2-12/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A set of fixes for perf: - A set of commits which reduce the stack usage of various perf event handling functions which allocated large data structs on stack causing stack overflows in the worst case - Use the proper mechanism for detecting soft interrupts in the recursion protection - Make the resursion protection simpler and more robust - Simplify the scheduling of event groups to make the code more robust and prepare for fixing the issues vs. scheduling of exclusive event groups - Prevent event multiplexing and rotation for exclusive event groups - Correct the perf event attribute exclusive semantics to take pinned events, e.g. the PMU watchdog, into account - Make the anythread filtering conditional for Intel's generic PMU counters as it is not longer guaranteed to be supported on newer CPUs. Check the corresponding CPUID leaf to make sure - Fixup a duplicate initialization in an array which was probably caused by the usual 'copy & paste - forgot to edit' mishap" * tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Fix Add BW copypasta perf/x86/intel: Make anythread filter support conditional perf: Tweak perf_event_attr::exclusive semantics perf: Fix event multiplexing for exclusive groups perf: Simplify group_sched_in() perf: Simplify group_sched_out() perf/x86: Make dummy_iregs static perf/arch: Remove perf_sample_data::regs_user_copy perf: Optimize get_recursion_context() perf: Fix get_recursion_context() perf/x86: Reduce stack usage for x86_pmu::drain_pebs() perf: Reduce stack usage of perf_output_begin()
2020-11-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds4-29/+25
Merge fixes from Andrew Morton: "14 patches. Subsystems affected by this patch series: mm (migration, vmscan, slub, gup, memcg, hugetlbfs), mailmap, kbuild, reboot, watchdog, panic, and ocfs2" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: ocfs2: initialize ip_next_orphan panic: don't dump stack twice on warn hugetlbfs: fix anon huge page migration race mm: memcontrol: fix missing wakeup polling thread kernel/watchdog: fix watchdog_allowed_mask not used warning reboot: fix overflow parsing reboot cpu number Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint" compiler.h: fix barrier_data() on clang mm/gup: use unpin_user_pages() in __gup_longterm_locked() mm/slub: fix panic in slab_alloc_node() mailmap: fix entry for Dmitry Baryshkov/Eremin-Solenikov mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate mm/compaction: count pages and stop correctly during page isolation
2020-11-14mm: memcontrol: fix missing wakeup polling threadMuchun Song1-2/+9
When we poll the swap.events, we can miss being woken up when the swap event occurs. Because we didn't notify. Fixes: f3a53a3a1e5b ("mm, memcontrol: implement memory.swap.events") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Cc: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20201105161936.98312-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14compiler.h: fix barrier_data() on clangArvind Sankar3-27/+16
Commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive") neglected to copy barrier_data() from compiler-gcc.h into compiler-clang.h. The definition in compiler-gcc.h was really to work around clang's more aggressive optimization, so this broke barrier_data() on clang, and consequently memzero_explicit() as well. For example, this results in at least the memzero_explicit() call in lib/crypto/sha256.c:sha256_transform() being optimized away by clang. Fix this by moving the definition of barrier_data() into compiler.h. Also move the gcc/clang definition of barrier() into compiler.h, __memory_barrier() is icc-specific (and barrier() is already defined using it in compiler-intel.h) and doesn't belong in compiler.h. [rdunlap@infradead.org: fix ALPHA builds when SMP is not enabled] Link: https://lkml.kernel.org/r/20201101231835.4589-1-rdunlap@infradead.org Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive") Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201014212631.207844-1-nivedita@alum.mit.edu Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14Merge tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-11/+27
Pull fs freeze fix and cleanups from Darrick Wong: "A single vfs fix for 5.10, along with two subsequent cleanups. A very long time ago, a hack was added to the vfs fs freeze protection code to work around lockdep complaints about XFS, which would try to run a transaction (which requires intwrite protection) to finalize an xfs freeze (by which time the vfs had already taken intwrite). Fast forward a few years, and XFS fixed the recursive intwrite problem on its own, and the hack became unnecessary. Fast forward almost a decade, and latent bugs in the code converting this hack from freeze flags to freeze locks combine with lockdep bugs to make this reproduce frequently enough to notice page faults racing with freeze. Since the hack is unnecessary and causes thread race errors, just get rid of it completely. Making this kind of vfs change midway through a cycle makes me nervous, but a large enough number of the usual VFS/ext4/XFS/btrfs suspects have said this looks good and solves a real problem vector. And once that removal is done, __sb_start_write is now simple enough that it becomes possible to refactor the function into smaller, simpler static inline helpers in linux/fs.h. The cleanup is straightforward. Summary: - Finally remove the "convert to trylock" weirdness in the fs freezer code. It was necessary 10 years ago to deal with nested transactions in XFS, but we've long since removed that; and now this is causing subtle race conditions when lockdep goes offline and sb_start_* aren't prepared to retry a trylock failure. - Minor cleanups of the sb_start_* fs freeze helpers" * tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: vfs: move __sb_{start,end}_write* to fs.h vfs: separate __sb_start_write into blocking and non-blocking helpers vfs: remove lockdep bogosity in __sb_start_write
2020-11-14Merge tag 'block-5.10-2020-11-13' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block fixes from Jens Axboe: "A few small fixes: - NVMe pull request from Christoph: - don't clear the read-only bit on a revalidate (Sagi Grimberg) - nbd error case refcount leak (Christoph) - loop/generic uevent fix (Christoph, Petr)" * tag 'block-5.10-2020-11-13' of git://git.kernel.dk/linux-block: loop: Fix occasional uevent drop block: add a return value to set_capacity_revalidate_and_notify nbd: fix a block_device refcount leak in nbd_release nvme: fix incorrect behavior when BLKROSET is called by the user
2020-11-13Merge tag 'net-5.10-rc4' of ↵Linus Torvalds3-17/+11
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Current release - regressions: - arm64: dts: fsl-ls1028a-kontron-sl28: specify in-band mode for ENETC Current release - bugs in new features: - mptcp: provide rmem[0] limit offset to fix oops Previous release - regressions: - IPv6: Set SIT tunnel hard_header_len to zero to fix path MTU calculations - lan743x: correctly handle chips with internal PHY - bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE - mlx5e: Fix VXLAN port table synchronization after function reload Previous release - always broken: - bpf: Zero-fill re-used per-cpu map element - fix out-of-order UDP packets when forwarding with UDP GSO fraglists turned on: - fix UDP header access on Fast/frag0 UDP GRO - fix IP header access and skb lookup on Fast/frag0 UDP GRO - ethtool: netlink: add missing netdev_features_change() call - net: Update window_clamp if SOCK_RCVBUF is set - igc: Fix returning wrong statistics - ch_ktls: fix multiple leaks and corner cases in Chelsio TLS offload - tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies - r8169: disable hw csum for short packets on all chip versions - vrf: Fix fast path output packet handling with async Netfilter rules" * tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits) lan743x: fix use of uninitialized variable net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO net: udp: fix UDP header access on Fast/frag0 UDP GRO devlink: Avoid overwriting port attributes of registered port vrf: Fix fast path output packet handling with async Netfilter rules cosa: Add missing kfree in error path of cosa_write net: switch to the kernel.org patchwork instance ch_ktls: stop the txq if reaches threshold ch_ktls: tcb update fails sometimes ch_ktls/cxgb4: handle partial tag alone SKBs ch_ktls: don't free skb before sending FIN ch_ktls: packet handling prior to start marker ch_ktls: Correction in middle record handling ch_ktls: missing handling of header alone ch_ktls: Correction in trimmed_len calculation cxgb4/ch_ktls: creating skbs causes panic ch_ktls: Update cheksum information ch_ktls: Correction in finding correct length cxgb4/ch_ktls: decrypted bit is not enough net/x25: Fix null-ptr-deref in x25_connect ...
2020-11-12block: add a return value to set_capacity_revalidate_and_notifyChristoph Hellwig1-1/+1
Return if the function ended up sending an uevent or not. Cc: stable@vger.kernel.org # v5.9 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-12Merge tag 'pm-5.10-rc4' of ↵Linus Torvalds1-2/+16
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "Make the intel_pstate driver behave as expected when it operates in the passive mode with HWP enabled and the 'powersave' governor on top of it" * tag 'pm-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Take CPUFREQ_GOV_STRICT_TARGET into account cpufreq: Add strict_target to struct cpufreq_policy cpufreq: Introduce CPUFREQ_GOV_STRICT_TARGET cpufreq: Introduce governor flags
2020-11-12spi: Introduce device-managed SPI controller allocationLukas Wunner1-0/+19
SPI driver probing currently comprises two steps, whereas removal comprises only one step: spi_alloc_master() spi_register_controller() spi_unregister_controller() That's because spi_unregister_controller() calls device_unregister() instead of device_del(), thereby releasing the reference on the spi_controller which was obtained by spi_alloc_master(). An SPI driver's private data is contained in the same memory allocation as the spi_controller struct. Thus, once spi_unregister_controller() has been called, the private data is inaccessible. But some drivers need to access it after spi_unregister_controller() to perform further teardown steps. Introduce devm_spi_alloc_master() and devm_spi_alloc_slave(), which release a reference on the spi_controller struct only after the driver has unbound, thereby keeping the memory allocation accessible. Change spi_unregister_controller() to not release a reference if the spi_controller was allocated by one of these new devm functions. The present commit is small enough to be backportable to stable. It allows fixing drivers which use the private data in their ->remove() hook after it's been freed. It also allows fixing drivers which neglect to release a reference on the spi_controller in the probe error path. Long-term, most SPI drivers shall be moved over to the devm functions introduced herein. The few that can't shall be changed in a treewide commit to explicitly release the last reference on the controller. That commit shall amend spi_unregister_controller() to no longer release a reference, thereby completing the migration. As a result, the behaviour will be less surprising and more consistent with subsystems such as IIO, which also includes the private data in the allocation of the generic iio_dev struct, but calls device_del() in iio_device_unregister(). Signed-off-by: Lukas Wunner <lukas@wunner.de> Link: https://lore.kernel.org/r/272bae2ef08abd21388c98e23729886663d19192.1605121038.git.lukas@wunner.de Signed-off-by: Mark Brown <broonie@kernel.org>
2020-11-12Merge branch 'stable/for-linus-5.10-rc2' of ↵Linus Torvalds1-7/+3
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb Pull swiotlb fixes from Konrad Rzeszutek Wilk: "Two tiny fixes for issues that make drivers under Xen unhappy under certain conditions" * 'stable/for-linus-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb: swiotlb: remove the tbl_dma_addr argument to swiotlb_tbl_map_single swiotlb: fix "x86: Don't panic if can not alloc buffer for swiotlb"
2020-11-11vfs: move __sb_{start,end}_write* to fs.hDarrick J. Wong1-3/+18
Now that we've straightened out the callers, move these three functions to fs.h since they're fairly trivial. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz>
2020-11-11vfs: separate __sb_start_write into blocking and non-blocking helpersDarrick J. Wong1-10/+11
Break this function into two helpers so that it's obvious that the trylock versions return a value that must be checked, and the blocking versions don't require that. While we're at it, clean up the return type mismatch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-10cpufreq: Add strict_target to struct cpufreq_policyRafael J. Wysocki1-0/+6
Add a new field to be set when the CPUFREQ_GOV_STRICT_TARGET flag is set for the current governor to struct cpufreq_policy, so that the drivers needing to check CPUFREQ_GOV_STRICT_TARGET do not have to access the governor object during every frequency transition. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-11-10cpufreq: Introduce CPUFREQ_GOV_STRICT_TARGETRafael J. Wysocki1-0/+3
Introduce a new governor flag, CPUFREQ_GOV_STRICT_TARGET, for the governors that want the target frequency to be set exactly to the given value without leaving any room for adjustments on the hardware side and set this flag for the powersave and performance governors. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-11-10cpufreq: Introduce governor flagsRafael J. Wysocki1-2/+7
A new cpufreq governor flag will be added subsequently, so replace the bool dynamic_switching fleid in struct cpufreq_governor with a flags field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING to set for the "dynamic switching" governors instead of it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>