Age    Commit message    Author    Files    Lines
2023-08-28Merge branch kvm-arm64/6.6/pmu-fixes into kvmarm-master/nextMarc Zyngier7-12/+55
* kvm-arm64/6.6/pmu-fixes: : . : Another set of PMU fixes, courtesy of Reiji Watanabe. : From the cover letter: : : "This series fixes a couple of PMUver related handling of : vPMU support. : : On systems where the PMUVer is not uniform across all PEs, : KVM currently does not advertise PMUv3 to the guest, : even if userspace successfully runs KVM_ARM_VCPU_INIT with : KVM_ARM_VCPU_PMU_V3." : : Additionally, a fix for an obscure counter oversubscription : issue happening when the host profiles the guest's EL0. : . KVM: arm64: pmu: Guard PMU emulation definitions with CONFIG_KVM KVM: arm64: pmu: Resync EL0 state on counter rotation KVM: arm64: PMU: Don't advertise STALL_SLOT_{FRONTEND,BACKEND} KVM: arm64: PMU: Don't advertise the STALL_SLOT event KVM: arm64: PMU: Avoid inappropriate use of host's PMUVer KVM: arm64: PMU: Disallow vPMU on non-uniform PMUVer Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-08-28Merge branch kvm-arm64/tlbi-range into kvmarm-master/nextMarc Zyngier21-132/+286
* kvm-arm64/tlbi-range: : . : FEAT_TLBIRANGE support, courtesy of Raghavendra Rao Ananta. : From the cover letter: : : "In certain code paths, KVM/ARM currently invalidates the entire VM's : page-tables instead of just invalidating a necessary range. For example, : when collapsing a table PTE to a block PTE, instead of iterating over : each PTE and flushing them, KVM uses 'vmalls12e1is' TLBI operation to : flush all the entries. This is inefficient since the guest would have : to refill the TLBs again, even for the addresses that aren't covered : by the table entry. The performance impact would scale poorly if many : addresses in the VM is going through this remapping. : : For architectures that implement FEAT_TLBIRANGE, KVM can replace such : inefficient paths by performing the invalidations only on the range of : addresses that are in scope. This series tries to achieve the same in : the areas of stage-2 map, unmap and write-protecting the pages." : . KVM: arm64: Use TLBI range-based instructions for unmap KVM: arm64: Invalidate the table entries upon a range KVM: arm64: Flush only the memslot after write-protect KVM: arm64: Implement kvm_arch_flush_remote_tlbs_range() KVM: arm64: Define kvm_tlb_flush_vmid_range() KVM: arm64: Implement __kvm_tlb_flush_vmid_range() arm64: tlb: Implement __flush_s2_tlb_range_op() arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range KVM: Move kvm_arch_flush_remote_tlbs_memslot() to common code KVM: Allow range-based TLB invalidation from common code KVM: Remove CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL KVM: arm64: Use kvm_arch_flush_remote_tlbs() KVM: Declare kvm_arch_flush_remote_tlbs() globally KVM: Rename kvm_arch_flush_remote_tlb() to kvm_arch_flush_remote_tlbs() Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-08-28Merge branch kvm-arm64/nv-trap-forwarding into kvmarm-master/nextMarc Zyngier15-41/+2488
* kvm-arm64/nv-trap-forwarding: (30 commits) : . : This implements the so-called "trap forwarding" infrastructure, which : gets used when we take a trap from an L2 guest that the L1 guest : wants to see for itself. : . KVM: arm64: nv: Add trap description for SPSR_EL2 and ELR_EL2 KVM: arm64: nv: Select XARRAY_MULTI to fix build error KVM: arm64: nv: Add support for HCRX_EL2 KVM: arm64: Move HCRX_EL2 switch to load/put on VHE systems KVM: arm64: nv: Expose FGT to nested guests KVM: arm64: nv: Add switching support for HFGxTR/HDFGxTR KVM: arm64: nv: Expand ERET trap forwarding to handle FGT KVM: arm64: nv: Add SVC trap forwarding KVM: arm64: nv: Add trap forwarding for HDFGxTR_EL2 KVM: arm64: nv: Add trap forwarding for HFGITR_EL2 KVM: arm64: nv: Add trap forwarding for HFGxTR_EL2 KVM: arm64: nv: Add fine grained trap forwarding infrastructure KVM: arm64: nv: Add trap forwarding for CNTHCTL_EL2 KVM: arm64: nv: Add trap forwarding for MDCR_EL2 KVM: arm64: nv: Expose FEAT_EVT to nested guests KVM: arm64: nv: Add trap forwarding for HCR_EL2 KVM: arm64: nv: Add trap forwarding infrastructure KVM: arm64: Restructure FGT register switching KVM: arm64: nv: Add FGT registers KVM: arm64: Add missing HCR_EL2 trap bits ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-08-28Linux 6.5Linus Torvalds1-1/+1
2023-08-27Merge tag 'scsi-fixes' of ↵Linus Torvalds5-57/+6
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Three small driver fixes and one larger unused function set removal in the raid class (so no external impact)" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: snic: Fix double free in snic_tgt_create() scsi: core: raid_class: Remove raid_component_add() scsi: ufs: ufs-qcom: Clear qunipro_g4_sel for HW major version > 5 scsi: ufs: mcq: Fix the search/wrap around logic
2023-08-26Merge tag 'x86-urgent-2023-08-26' of ↵Linus Torvalds3-3/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Fix an FPU invalidation bug on exec(), and fix a performance regression due to a missing setting of X86_FEATURE_OSXSAVE" * tag 'x86-urgent-2023-08-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Set X86_FEATURE_OSXSAVE feature after enabling OSXSAVE in CR4 x86/fpu: Invalidate FPU state correctly on exec()
2023-08-26Merge tag 'irq-urgent-2023-08-26' of ↵Linus Torvalds1-1/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "A last-minute fix for a regression introduced in the v6.5 merge window. The conversion of the software based interrupt resend mechanism to hlist missed adding a check for whether the descriptor is already enqueued, and dropped the interrupt descriptor lookup for nested interrupts. The missing check for whether the descriptor is already queued causes hlist corruption and can be observed in the wild. The dropped parent descriptor lookup has not yet caused problems, but it would result in a stale interrupt line in the worst case. Add the missing enqueued check and bring the descriptor lookup back to cure this" * tag 'irq-urgent-2023-08-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix software resend lockup and nested resend
2023-08-26Merge tag 'loongarch-fixes-6.5-2' of ↵Linus Torvalds22-38/+42
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch fixes from Huacai Chen: "Fix a ptrace bug, a hw_breakpoint bug, some build errors/warnings and some trivial cleanups" * tag 'loongarch-fixes-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: LoongArch: Fix hw_breakpoint_control() for watchpoints LoongArch: Ensure FP/SIMD registers in the core dump file is up to date LoongArch: Put the body of play_dead() into arch_cpu_idle_dead() LoongArch: Add identifier names to arguments of die() declaration LoongArch: Return earlier in die() if notify_die() returns NOTIFY_STOP LoongArch: Do not kill the task in die() if notify_die() returns NOTIFY_STOP LoongArch: Remove <asm/export.h> LoongArch: Replace #include <asm/export.h> with #include <linux/export.h> LoongArch: Remove unneeded #include <asm/export.h> LoongArch: Replace -ffreestanding with finer-grained -fno-builtin's LoongArch: Remove redundant "source drivers/firmware/Kconfig"
2023-08-26genirq: Fix software resend lockup and nested resendJohan Hovold1-1/+6
The switch to using hlist for managing software resend of interrupts broke resend in at least two ways: First, unconditionally adding interrupt descriptors to the resend list can corrupt the list when the descriptor in question has already been added. This causes the resend tasklet to loop indefinitely with interrupts disabled as was recently reported with the Lenovo ThinkPad X13s after threaded NAPI was disabled in the ath11k WiFi driver. This bug is easily fixed by restoring the old semantics of irq_sw_resend() so that it can be called also for descriptors that have already been marked for resend. Second, the offending commit also broke software resend of nested interrupts by simply discarding the code that made sure that such interrupts are retriggered using the parent interrupt. Add back the corresponding code that adds the parent descriptor to the resend list. Fixes: bc06a9e08742 ("genirq: Use hlist for managing resend handlers") Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/lkml/20230809073432.4193-1-johan+linaro@kernel.org/ Link: https://lore.kernel.org/r/20230826154004.1417-1-johan+linaro@kernel.org
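A minimal sketch of the restored irq_sw_resend() semantics described above. The field and helper names (resend_node, irq_resend_list, parent_irq, irq_settings_is_nested_thread()) are assumed from the surrounding genirq code rather than quoted from the patch:

static void irq_sw_resend(struct irq_desc *desc)
{
        /*
         * Nested interrupts cannot be retriggered directly; queue the
         * parent descriptor instead, as was done before the hlist
         * conversion.
         */
        if (irq_settings_is_nested_thread(desc)) {
                desc = irq_to_desc(desc->parent_irq);
                if (!desc)
                        return;
        }

        /* Enqueue only if not already pending, otherwise the hlist corrupts. */
        if (hlist_unhashed(&desc->resend_node))
                hlist_add_head(&desc->resend_node, &irq_resend_list);

        tasklet_schedule(&resend_tasklet);
}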
2023-08-26LoongArch: Fix hw_breakpoint_control() for watchpointsHuacai Chen1-2/+1
In hw_breakpoint_control(), encode_ctrl_reg() has already encoded the MWPnCFG3_LoadEn/MWPnCFG3_StoreEn bits in info->ctrl. We don't need to add (1 << MWPnCFG3_LoadEn | 1 << MWPnCFG3_StoreEn) unconditionally. Otherwise we can't set read watchpoint and write watchpoint separately. Cc: stable@vger.kernel.org Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-26LoongArch: Ensure FP/SIMD registers in the core dump file is up to dateHuacai Chen2-4/+22
This is a port of commit 379eb01c21795edb4c ("riscv: Ensure the value of FP registers in the core dump file is up to date"). The values of FP/SIMD registers in the core dump file come from the thread.fpu. However, the kernel saves the FP/SIMD registers only before scheduling out the process. If no process switch happens during the exception handling, the kernel will not have a chance to save the latest values of FP/SIMD registers. So their values in the core dump file may be incorrect. To solve this problem, force fpr_get()/simd_get() to save the FP/SIMD registers into the thread.fpu if the target task equals the current task. Cc: stable@vger.kernel.org Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
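A rough sketch of the regset-side change described above. The callback shape (struct membuf) is the kernel's standard regset API; the flush helper below is hypothetical and only stands in for "save the live FPU state into thread.fpu":

static int fpr_get(struct task_struct *target,
                   const struct user_regset *regset,
                   struct membuf to)
{
        /*
         * When dumping ourselves, thread.fpu may be stale: flush the live
         * registers into it first (hypothetical helper).
         */
        if (target == current)
                save_current_fpu_to_thread(target);

        return membuf_write(&to, &target->thread.fpu,
                            sizeof(target->thread.fpu));
}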
2023-08-26KVM: arm64: Remove size-order align in the nVHE hyp private VA rangeVincent Donnefort6-85/+138
commit f922c13e778d ("KVM: arm64: Introduce pkvm_alloc_private_va_range()") and commit 92abe0f81e13 ("KVM: arm64: Introduce hyp_alloc_private_va_range()") added an alignment for the start address of any allocation into the nVHE hypervisor private VA range. This alignment (order of the size of the allocation) is intended to enable efficient stack verification (if the PAGE_SHIFT bit is zero, the stack pointer is on the guard page and a stack overflow occurred). But this is only necessary for stack allocation and can waste a lot of VA space. So instead, add stack-specific functions that handle the guard page requirements, while other users (e.g. fixmap) only get page alignment. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811112037.1147863-1-vdonnefort@google.com
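A tiny illustration of the overflow check that this alignment scheme enables, written as a hedged C sketch (the real check lives in the nVHE hypervisor entry path): with the stack and its guard page kept as one naturally aligned two-page block, the PAGE_SHIFT bit of the stack pointer tells the two pages apart.

static inline bool hyp_stack_overflowed(unsigned long sp)
{
        /* Bit PAGE_SHIFT clear means SP has strayed onto the guard page. */
        return !(sp & PAGE_SIZE);
}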
2023-08-26Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds3-48/+51
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "One clk driver fix and two clk framework fixes: - Fix an OOB access when devm_get_clk_from_child() is used and devm_clk_release() casts the void pointer to the wrong type - Move clk_rate_exclusive_{get,put}() within the correct ifdefs in clk.h so that the stubs are used when CONFIG_COMMON_CLK=n - Register the proper clk provider function depending on the value of #clock-cells in the TI keystone driver" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: Fix slab-out-of-bounds error in devm_clk_release() clk: Fix undefined reference to `clk_rate_exclusive_{get,put}' clk: keystone: syscon-clk: Fix audio refclk
2023-08-25lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernelsHelge Deller1-26/+6
On some architectures the gcc compiler translates the 64-bit __builtin_clzll() function to a call to the libgcc function __clzdi2(), which should take a 64-bit parameter on 32- and 64-bit platforms. But in the current kernel code, the built-in __clzdi2() function is defined to operate (wrongly) on 32-bit parameters if BITS_PER_LONG == 32, thus the return values on 32-bit kernels are in the range [0..31] instead of the expected [0..63] range. This patch fixes the in-kernel functions __clzdi2() and __ctzdi2() to take a 64-bit parameter on 32-bit kernels as well, thus making the functions identical for 32- and 64-bit kernels. This bug went unnoticed since kernel 3.11 for over 10 years, and here are some possible reasons for that:
a) Some architectures have assembly instructions to count the bits, which are used instead of calling __clzdi2(), e.g. on x86 the bsr instruction and on ppc cntlz is used. On such architectures the wrong __clzdi2() implementation isn't used and as such the bug has no effect and won't be noticed.
b) Some architectures link to libgcc.a, and the in-kernel weak functions get replaced by the correct 64-bit variants from libgcc.a.
c) __builtin_clzll() and __clzdi2() don't seem to be used in many places in the kernel, and most likely only in uncritical functions, e.g. when printing hex values via seq_put_hex_ll(). The wrong return value will still print the correct number, but just in a wrong formatting (e.g. with too many leading zeroes).
d) 32-bit kernels aren't used that much any longer, so they are less tested.
A trivial testcase to verify if the currently running 32-bit kernel is affected by the bug is to look at the output of /proc/self/maps. Here the kernel uses a correct implementation of __clzdi2():
  root@debian:~# cat /proc/self/maps
  00010000-00019000 r-xp 00000000 08:05 787324 /usr/bin/cat
  00019000-0001a000 rwxp 00009000 08:05 787324 /usr/bin/cat
  0001a000-0003b000 rwxp 00000000 00:00 0 [heap]
  f7551000-f770d000 r-xp 00000000 08:05 794765 /usr/lib/hppa-linux-gnu/libc.so.6
  ...
and this kernel uses the broken implementation of __clzdi2():
  root@debian:~# cat /proc/self/maps
  0000000010000-0000000019000 r-xp 00000000 000000008:000000005 787324 /usr/bin/cat
  0000000019000-000000001a000 rwxp 000000009000 000000008:000000005 787324 /usr/bin/cat
  000000001a000-000000003b000 rwxp 00000000 00:00 0 [heap]
  00000000f73d1000-00000000f758d000 r-xp 00000000 000000008:000000005 794765 /usr/lib/hppa-linux-gnu/libc.so.6
  ...
Signed-off-by: Helge Deller <deller@gmx.de> Fixes: 4df87bb7b6a22 ("lib: add weak clz/ctz functions") Cc: Chanho Min <chanho.min@lge.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: stable@vger.kernel.org # v3.11+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
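For illustration, a sketch of what the corrected helpers look like per the description above, using the kernel's 64-bit bit-scan helpers; the actual lib/clz_ctz.c implementation may differ in detail:

/* Always operate on a 64-bit argument, also when BITS_PER_LONG == 32. */
int __weak __clzdi2(u64 val)
{
        return 64 - fls64(val);         /* leading zeros of a 64-bit value */
}

int __weak __ctzdi2(u64 val)
{
        return __ffs64(val);            /* trailing zeros of a 64-bit value */
}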
2023-08-25Merge tag 'mm-hotfixes-stable-2023-08-25-11-07' of ↵Linus Torvalds32-73/+279
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4 issues or aren't considered suitable for a -stable backport" * tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: shmem: fix smaps BUG sleeping while atomic selftests: cachestat: catch failing fsync test on tmpfs selftests: cachestat: test for cachestat availability maple_tree: disable mas_wr_append() when other readers are possible madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check mm: multi-gen LRU: don't spin during memcg release mm: memory-failure: fix unexpected return value in soft_offline_page() radix tree: remove unused variable mm: add a call to flush_cache_vmap() in vmap_pfn() selftests/mm: FOLL_LONGTERM need to be updated to 0x100 nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers() mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast selftests: cgroup: fix test_kmem_basic less than error mm: enable page walking API to lock vmas during the walk smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
2023-08-25Merge tag 'riscv-for-linus-6.5-rc8' of ↵Linus Torvalds5-75/+7
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: "This is obviously not ideal, particularly for something this late in the cycle. Unfortunately we found some uABI issues in the vector support while reviewing the GDB port, which has triggered a revert -- probably a good sign we should have reviewed GDB before merging this; I guess I just dropped the ball because I was so worried about the context extension and libc stuff that I forgot. Hence the late revert. There's some risk here as we're still exposing the vector context for signal handlers, but changing that would have meant reverting all of the vector support. The issues we've found so far have been fixed already and they weren't absolute showstoppers, so we're essentially just playing it safe by holding ptrace support for another release (or until we get through a proper userspace code review). Summary: - The vector ucontext extension has been extended with vlenb - The vector registers ELF core dump note type has been changed to avoid aliasing with the CSR type used in embedded systems - Support for accessing vector registers via ptrace() has been reverted - Another build fix for the ISA spec changes around Zifencei/Zicsr that manifests on some systems built with binutils-2.37 and gcc-11.2" * tag 'riscv-for-linus-6.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Fix build errors using binutils2.37 toolchains RISC-V: vector: export VLENB csr in __sc_riscv_v_state RISC-V: Remove ptrace support for vectors
2023-08-25Merge tag 'gpio-fixes-for-v6.5' of ↵Linus Torvalds1-1/+14
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix an irq mapping leak in gpio-sim - associate the GPIO device's software node with the irq domain in gpio-sim * tag 'gpio-fixes-for-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpio: sim: pass the GPIO device's software node to irq domain gpio: sim: dispose of irq mappings before destroying the irq_sim domain
2023-08-25Merge tag 'pinctrl-v6.5-4' of ↵Linus Torvalds4-7/+68
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: "Here are some Renesas and AMD driver fixes, the AMD fix affects important laptops in the wild so this one is pretty important. It seems a bit tough to get this right. - Fix DT parsing and related locking in the Renesas driver. - Fix wakeup IRQs in the AMD driver once again. Really tricky this one" * tag 'pinctrl-v6.5-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: amd: Mask wake bits on probe again pinctrl: renesas: rza2: Add lock around pinctrl_generic{{add,remove}_group,{add,remove}_function} pinctrl: renesas: rzv2m: Fix NULL pointer dereference in rzv2m_dt_subnode_to_map() pinctrl: renesas: rzg2l: Fix NULL pointer dereference in rzg2l_dt_subnode_to_map()
2023-08-25KVM: VMX: Delete ancient pr_warn() about KVM_SET_TSS_ADDR not being setSean Christopherson1-7/+0
Delete KVM's printk about KVM_SET_TSS_ADDR not being called. When the printk was added by commit 776e58ea3d37 ("KVM: unbreak userspace that does not sets tss address"), KVM also stuffed a "hopefully safe" value, i.e. the message wasn't purely informational. For reasons unknown, ostensibly to try and help people running outdated qemu-kvm versions, the message got left behind when KVM's stuffing was removed by commit 4918c6ca6838 ("KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU"). Today, the message is completely nonsensical, as it has been over a decade since KVM supported userspace running a Real Mode guest, on a CPU without unrestricted guest support, without doing KVM_SET_TSS_ADDR before KVM_RUN. I.e. KVM's ABI has required KVM_SET_TSS_ADDR for 10+ years. To make matters worse, the message is prone to false positives as it triggers when simply *creating* a vCPU due to RESET putting vCPUs into Real Mode, even when the user has no intention of ever *running* the vCPU in a Real Mode. E.g. KVM selftests stuff 64-bit mode and never touch Real Mode, but trigger the message even though they run just fine without doing KVM_SET_TSS_ADDR. Creating "dummy" vCPUs, e.g. to probe features, can also trigger the message. In both scenarios, the message confuses users and falsely implies that they've done something wrong. Reported-by: Thorsten Glaser <t.glaser@tarent.de> Closes: https://lkml.kernel.org/r/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956%40tarent.de Link: https://lore.kernel.org/r/20230815174215.433222-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: x86: Update MAINTAINERS to include selftestsSean Christopherson1-0/+2
Give KVM x86 the same treatment as all other KVM architectures, and officially take ownership of x86 specific KVM selftests (changes have been routed through kvm and/or kvm-x86 for quite some time). Cc: kvm@vger.kernel.org Link: https://lore.kernel.org/r/20230817234114.1420092-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: selftests: Explicit set #UD when *potentially* injecting exceptionSean Christopherson1-0/+3
Explicitly set the exception vector to #UD when potentially injecting an exception in sync_regs_test's subtests that try to detect TOCTOU bugs in KVM's handling of exceptions injected by userspace. A side effect of the original KVM bug was that KVM would clear the vector, but relying on KVM to clear the vector (i.e. make it #DE) makes it less likely that the test would ever find *new* KVM bugs, e.g. because only the first iteration would run with a legal vector to start. Explicitly inject #UD for race_events_inj_pen() as well, e.g. so that it doesn't inherit the illegal 255 vector from race_events_exc(), which currently runs first. Link: https://lore.kernel.org/r/20230817233430.1416463-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: selftests: Reload "good" vCPU state if vCPU hits shutdownSean Christopherson1-1/+13
Reload known good vCPU state if the vCPU triple faults in any of the race_sync_regs() subtests, e.g. if KVM successfully injects an exception (the vCPU isn't configured to handle exceptions). On Intel, the VMCS is preserved even after shutdown, but AMD's APM states that the VMCB is undefined after a shutdown and so KVM synthesizes an INIT to sanitize vCPU/VMCB state, e.g. to guard against running with a garbage VMCB. The synthetic INIT results in the vCPU never exiting to userspace, as it gets put into Real Mode at the reset vector, which is full of zeros (as is GPA 0 and beyond), and so executes ADD for a very, very long time. Fixes: 60c4063b4752 ("KVM: selftests: Extend x86's sync_regs_test to check for event vector races") Cc: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230817233430.1416463-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
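A sketch of the selftest-side recovery described above, using the usual KVM selftest helpers; the saved_* variables are assumed to hold a snapshot captured before the run loop:

vcpu_run(vcpu);

/*
 * A triple fault leaves AMD vCPU state undefined (KVM synthesizes an
 * INIT), so restore a known-good snapshot before the next iteration.
 */
if (vcpu->run->exit_reason == KVM_EXIT_SHUTDOWN) {
        vcpu_sregs_set(vcpu, &saved_sregs);
        vcpu_regs_set(vcpu, &saved_regs);
        vcpu_events_set(vcpu, &saved_events);
}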
2023-08-25KVM: SVM: Require nrips support for SEV guests (and beyond)Sean Christopherson3-8/+6
Disallow SEV (and beyond) if nrips is disabled via module param, as KVM can't read guest memory to partially emulate and skip an instruction. All CPUs that support SEV support NRIPS, i.e. this is purely stopping the user from shooting themselves in the foot. Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230825013621.2845700-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
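The shape of the setup-time gate implied above, as a simplified sketch (the parameter names are illustrative, not the exact kvm_amd internals):

static bool sev_can_be_enabled(bool sev_module_param, bool npt_enabled, bool nrips)
{
        /*
         * Without nrips, skipping an instruction requires reading guest
         * memory, which is impossible for encrypted (SEV) guests.
         */
        return sev_module_param && npt_enabled && nrips;
}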
2023-08-25KVM: SVM: Don't inject #UD if KVM attempts to skip SEV guest insnSean Christopherson1-8/+27
Don't inject a #UD if KVM attempts to "emulate" to skip an instruction for an SEV guest, and instead resume the guest and hope that it can make forward progress. When commit 04c40f344def ("KVM: SVM: Inject #UD on attempted emulation for SEV guest w/o insn buffer") added the completely arbitrary #UD behavior, there were no known scenarios where a well-behaved guest would induce a VM-Exit that triggered emulation, i.e. it was thought that injecting #UD would be helpful. However, now that KVM (correctly) attempts to re-inject INT3/INTO, e.g. if a #NPF is encountered when attempting to deliver the INT3/INTO, an SEV guest can trigger emulation without a buffer, through no fault of its own. Resuming the guest and retrying the INT3/INTO is architecturally wrong, e.g. the vCPU will incorrectly re-hit code #DBs, but for SEV guests there is literally no other option that has a chance of making forward progress. Drop the #UD injection for all "skip" emulation, not just those related to INT3/INTO, even though that means that the guest will likely end up in an infinite loop instead of getting a #UD (the vCPU may also crash, e.g. if KVM emulated everything about an instruction except for advancing RIP). There's no evidence that suggests that an unexpected #UD is actually better than hanging the vCPU, e.g. a soft-hung vCPU can still respond to IRQs and NMIs to generate a backtrace. Reported-by: Wu Zongyo <wuzongyo@mail.ustc.edu.cn> Closes: https://lore.kernel.org/all/8eb933fd-2cf3-d7a9-32fe-2a1d82eac42a@mail.ustc.edu.cn Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") Cc: stable@vger.kernel.org Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230825013621.2845700-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: SVM: Skip VMSA init in sev_es_init_vmcb() if pointer is NULLSean Christopherson1-2/+5
Skip initializing the VMSA physical address in the VMCB if the VMSA is NULL, which occurs during intrahost migration as KVM initializes the VMCB before copying over state from the source to the destination (including the VMSA and its physical address). In normal builds, __pa() is just math, so the bug isn't fatal, but with CONFIG_DEBUG_VIRTUAL=y, the validity of the virtual address is verified and passing in NULL will make the kernel unhappy. Fixes: 6defa24d3b12 ("KVM: SEV: Init target VMCBs in sev_migrate_from") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Reviewed-by: Peter Gonda <pgonda@google.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20230825022357.2852133-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
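The gist of the guard, sketched from the description above (field names as used by the SVM code, context simplified):

/*
 * Leave vmsa_pa untouched when the VMSA page hasn't been set up yet,
 * e.g. during intrahost migration, so __pa(NULL) is never evaluated.
 */
if (svm->sev_es.vmsa)
        svm->vmcb->control.vmsa_pa = __pa(svm->sev_es.vmsa);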
2023-08-25KVM: SVM: Get source vCPUs from source VM for SEV-ES intrahost migrationSean Christopherson1-1/+1
Fix a goof where KVM tries to grab source vCPUs from the destination VM when doing intrahost migration. Grabbing the wrong vCPU not only hoses the guest, it also crashes the host due to the VMSA pointer being left NULL. BUG: unable to handle page fault for address: ffffe38687000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 39 PID: 17143 Comm: sev_migrate_tes Tainted: GO 6.5.0-smp--fff2e47e6c3b-next #151 Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 34.28.0 07/10/2023 RIP: 0010:__free_pages+0x15/0xd0 RSP: 0018:ffff923fcf6e3c78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffe38687000000 RCX: 0000000000000100 RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffffe38687000000 RBP: ffff923fcf6e3c88 R08: ffff923fcafb0000 R09: 0000000000000000 R10: 0000000000000000 R11: ffffffff83619b90 R12: ffff923fa9540000 R13: 0000000000080007 R14: ffff923f6d35d000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff929d0d7c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffe38687000000 CR3: 0000005224c34005 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> sev_free_vcpu+0xcb/0x110 [kvm_amd] svm_vcpu_free+0x75/0xf0 [kvm_amd] kvm_arch_vcpu_destroy+0x36/0x140 [kvm] kvm_destroy_vcpus+0x67/0x100 [kvm] kvm_arch_destroy_vm+0x161/0x1d0 [kvm] kvm_put_kvm+0x276/0x560 [kvm] kvm_vm_release+0x25/0x30 [kvm] __fput+0x106/0x280 ____fput+0x12/0x20 task_work_run+0x86/0xb0 do_exit+0x2e3/0x9c0 do_group_exit+0xb1/0xc0 __x64_sys_exit_group+0x1b/0x20 do_syscall_64+0x41/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> CR2: ffffe38687000000 Fixes: 6defa24d3b12 ("KVM: SEV: Init target VMCBs in sev_migrate_from") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Reviewed-by: Peter Gonda <pgonda@google.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20230825022357.2852133-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
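The heart of the fix, sketched with approximate variable names; the per-vCPU copy helper below is hypothetical and only marks where the migration work happens:

kvm_for_each_vcpu(i, dst_vcpu, dst_kvm) {
        struct kvm_vcpu *src_vcpu = kvm_get_vcpu(src_kvm, i);   /* was: dst_kvm */

        if (!src_vcpu)
                return -EINVAL;

        sev_migrate_vcpu_state(dst_vcpu, src_vcpu);
}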
2023-08-25Merge tag 'sound-6.5' of ↵Linus Torvalds9-32/+93
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "Hopefully the last bits for 6.5. It's a slightly higher LOC count than wished, but it doesn't look scary. The biggest change is the MAINTAINERS update for TI; it's good to have the update before the final release, so that people can contact the right persons for bug reports (which shouldn't happen, of course!) The rest are all device-specific fixes and quirks, most for various ASoC platforms" * tag 'sound-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ASoC: amd: yc: Fix a non-functional mic on Lenovo 82SJ ALSA: ymfpci: Fix the missing snd_card_free() call at probe error ASoC: cs35l41: Correct amp_gain_tlv values ASoC: amd: yc: Add VivoBook Pro 15 to quirks list for acp6x ASoC: tas2781: fixed register access error when switching to other chips ASoC: cs35l56: Add an ACPI match table ASoC: cs35l56: Read firmware uuid from a device property instead of _SUB ASoC: SOF: ipc4-pcm: fix possible null pointer deference MAINTAINERS: Add entries for TEXAS INSTRUMENTS ASoC DRIVERS
2023-08-25LoongArch: Put the body of play_dead() into arch_cpu_idle_dead()Tiezhu Yang3-10/+1
The initial aim is to silence the following objtool warning: arch/loongarch/kernel/process.o: warning: objtool: arch_cpu_idle_dead() falls through to next function start_thread() According to tools/objtool/Documentation/objtool.txt, this is because the last instruction of arch_cpu_idle_dead() is a call to a noreturn function play_dead(). In order to silence the warning, one simple way is to add the noreturn function play_dead() to objtool's hard-coded global_noreturns array, that is to say, just put "NORETURN(play_dead)" into tools/objtool/noreturns.h; this works well. But since play_dead() is only defined once and only called by arch_cpu_idle_dead(), putting the body of play_dead() into its caller arch_cpu_idle_dead() and then removing the noreturn function play_dead() is an alternative that also avoids the overhead of the function call. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-25LoongArch: Add identifier names to arguments of die() declarationTiezhu Yang1-1/+1
Add identifier names to arguments of die() declaration in ptrace.h to fix the following checkpatch warnings: WARNING: function definition argument 'const char *' should also have an identifier name WARNING: function definition argument 'struct pt_regs *' should also have an identifier name Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-25LoongArch: Return earlier in die() if notify_die() returns NOTIFY_STOPTiezhu Yang1-2/+4
After the call to oops_exit(), it should not panic or execute the crash kernel if the oops is to be suppressed. Suggested-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-25LoongArch: Do not kill the task in die() if notify_die() returns NOTIFY_STOPTiezhu Yang2-7/+7
If notify_die() returns NOTIFY_STOP, honor the return value from the handler chain invocation in die() and return without killing the task as, through a debugger, the fault may have been fixed. It makes sense even if ignoring the event will make the system unstable: by allowing access through a debugger it has been compromised already anyway. It makes our port consistent with x86, arm64, riscv and csky. Commit 20c0d2d44029 ("[PATCH] i386: pass proper trap numbers to die chain handlers") may be the earliest of similar changes. Link: https://lore.kernel.org/r/43DDF02E.76F0.0078.0@novell.com/ Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
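A simplified sketch of the resulting die() flow, covering this change and the "Return earlier in die()" change above (LoongArch-specific register dumping omitted; trap number and error code passed as zero for brevity):

static void die_sketch(const char *str, struct pt_regs *regs)
{
        int ret = notify_die(DIE_OOPS, str, regs, 0, 0, SIGSEGV);

        /* ... print the oops and bump the oops counter ... */
        oops_exit();

        if (ret == NOTIFY_STOP)
                return;         /* e.g. a debugger fixed it up: don't kill */

        if (panic_on_oops)
                panic("Fatal exception");

        make_task_dead(SIGSEGV);
}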
2023-08-25LoongArch: Remove <asm/export.h>Masahiro Yamada1-1/+0
All *.S files under arch/loongarch/ have been converted to include <linux/export.h> instead of <asm/export.h>. Remove <asm/export.h>. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-25LoongArch: Replace #include <asm/export.h> with #include <linux/export.h>Masahiro Yamada8-8/+8
Commit ddb5cdbafaaad ("kbuild: generate KSYMTAB entries by modpost") deprecated <asm/export.h>, which is now a wrapper of <linux/export.h>. Replace #include <asm/export.h> with #include <linux/export.h>. After all the <asm/export.h> lines are converted, <asm/export.h> and <asm-generic/export.h> will be removed. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-25LoongArch: Remove unneeded #include <asm/export.h>Masahiro Yamada3-3/+0
There is no EXPORT_SYMBOL() line there, hence #include <asm/export.h> is unneeded. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-25LoongArch: Replace -ffreestanding with finer-grained -fno-builtin'sWANG Xuerui1-1/+1
As explained by Nick in the original issue: the kernel usually does a good job of providing library helpers that have similar semantics as their ordinary userspace libc equivalents, but -ffreestanding disables such libcall optimization and other related features in the compiler, which can lead to unexpected things such as CONFIG_FORTIFY_SOURCE not working (!). However, due to the desire for better control over unaligned accesses with respect to CONFIG_ARCH_STRICT_ALIGN, and also for avoiding the GCC bug https://gcc.gnu.org/PR109465, we do want to still disable optimizations for the memory libcalls (memcpy, memmove and memset for now). Use finer-grained -fno-builtin-* toggles to achieve this without losing source fortification and other libcall optimizations. Closes: https://github.com/ClangBuiltLinux/linux/issues/1897 Reported-by: Nathan Chancellor <nathan@kernel.org> Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: WANG Xuerui <git@xen0n.name> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-25LoongArch: Remove redundant "source drivers/firmware/Kconfig"Xi Ruoyao1-2/+0
In drivers/Kconfig, drivers/firmware/Kconfig is sourced for all ports, so there is no need to source it in the port-specific Kconfig file. Sourcing it here also caused the "Firmware Drivers" menu to appear twice: once in the "Device Drivers" menu and once in the top-level menu. This is really puzzling, so remove it. Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Xi Ruoyao <xry111@xry111.site> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2023-08-25Merge tag 'drm-fixes-2023-08-25' of git://anongit.freedesktop.org/drm/drmLinus Torvalds16-94/+136
Pull drm fixes from Dave Airlie: "A bit bigger than I'd care for, but it's mostly a single vmwgfx fix and a fix for an i915 hotplug probing. Otherwise misc i915, bridge, panfrost and dma-buf fixes. core: - add a HPD poll helper i915: - fix regression in i915 polling - fix docs build warning - fix DG2 idle power consumption bridge: - samsung-dsim: init fix panfrost: - fix speed binning issue dma-buf: - fix recursive lock in fence signal vmwgfx: - fix shader stage validation - fix NULL ptr derefs in gem put" * tag 'drm-fixes-2023-08-25' of git://anongit.freedesktop.org/drm/drm: drm/i915: Fix HPD polling, reenabling the output poll work as needed drm: Add an HPD poll helper to reschedule the poll work drm/vmwgfx: Fix possible invalid drm gem put calls drm/vmwgfx: Fix shader stage validation dma-buf/sw_sync: Avoid recursive lock during fence signal drm/i915: fix Sphinx indentation warning drm/i915/dgfx: Enable d3cold at s2idle drm/display/dp: Fix the DP DSC Receiver cap size drm/panfrost: Skip speed binning on EOPNOTSUPP drm: bridge: samsung-dsim: Fix init during host transfer
2023-08-25Merge tag 'asoc-fix-v6.5-rc7-2' of ↵Takashi Iwai1-1/+1
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Quirk for v6.5 One additional fix for v6.5, an additional quirk. As with the other fixes this could wait for the merge window.
2023-08-25Merge tag 'trace-v6.5-rc6' of ↵Linus Torvalds12-78/+166
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix ring buffer being permanently disabled due to missed record_disabled() Changing the trace cpu mask will disable the ring buffers for the CPUs no longer in the mask. But it fails to update the snapshot buffer. If a snapshot takes place, the accounting for the ring buffer being disabled is corrupted and this can lead to the ring buffer being permanently disabled. - Add test case for snapshot and cpu mask working together - Fix memleak caused by the function graph tracer not getting closed properly. The iterator is used to read the ring buffer. When it opens, it calls the open function of a tracer, and when it is closed, it calls the close function. While a trace is being read, it is still possible to change the tracer. If this happens between the function graph tracer and the wakeup tracer (which uses function graph tracing), the tracers are not closed properly when the iterator sees the switch, and the wakeup function did not initialize its private pointer to NULL, which is used to know if the function graph tracer was the last tracer. It could be fooled into thinking it is, but then on exit it does not call the close function of the function graph tracer to clean up its data. - Fix synthetic events on big endian machines, by introducing a union that does the conversions properly. - Fix synthetic events so they don't print out the number of elements in the stacktrace when they shouldn't. - Fix synthetic events stacktrace to not print a bogus value at the end. - Introduce a pipe_cpumask that prevents the trace_pipe files from being opened by more than one task (file descriptor). There was a race found where if splice is called, the iter->ent could become stale and events could be missed. There's no point in more than one task reading a producer/consumer file, as they will corrupt each other anyway. Add a cpumask that keeps track of the per_cpu trace_pipe files as well as the global trace_pipe file, which prevents more than one open of a trace_pipe file that represents the same ring buffer. This prevents the race from happening. - Fix ftrace samples for arm64 to work with older compilers. * tag 'trace-v6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: samples: ftrace: Replace bti assembly with hint for older compiler tracing: Introduce pipe_cpumask to avoid race on trace_pipes tracing: Fix memleak due to race between current_tracer and trace tracing/synthetic: Allocate one additional element for size tracing/synthetic: Skip first entry for stack traces tracing/synthetic: Use union instead of casts selftests/ftrace: Add a basic testcase for snapshot tracing: Fix cpu buffers unavailable due to 'record_disabled' missed
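For the pipe_cpumask change, a minimal sketch of the open-side exclusion (names follow the description above; locking and the extra bit reserved for the global trace_pipe are left out):

static int trace_pipe_claim(struct cpumask *pipe_cpumask, int cpu)
{
        if (cpumask_test_cpu(cpu, pipe_cpumask))
                return -EBUSY;                  /* already being read */

        cpumask_set_cpu(cpu, pipe_cpumask);     /* cleared again on release */
        return 0;
}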
2023-08-25scsi: snic: Fix double free in snic_tgt_create()Zhu Wang1-2/+1
Commit 41320b18a0e0 ("scsi: snic: Fix possible memory leak if device_add() fails") fixed the memory leak caused by dev_set_name() when device_add() failed. However, it did not consider that 'tgt' has already been released when put_device(&tgt->dev) is called. Remove kfree(tgt) in the error path to avoid double free of 'tgt' and move put_device(&tgt->dev) after the removed kfree(tgt) to avoid a use-after-free. Fixes: 41320b18a0e0 ("scsi: snic: Fix possible memory leak if device_add() fails") Signed-off-by: Zhu Wang <wangzhu9@huawei.com> Link: https://lore.kernel.org/r/20230819083941.164365-1-wangzhu9@huawei.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
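The corrected error path, as a hedged sketch of what the description implies (surrounding snic code omitted): once device_add() fails, put_device() drops the last reference and the release callback frees 'tgt', so an explicit kfree(tgt) would be the double free.

if (device_add(&tgt->dev)) {
        put_device(&tgt->dev);  /* release callback frees tgt */
        return NULL;            /* no kfree(tgt) here */
}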
2023-08-25Merge tag 'media/v6.5-4' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fix from Mauro Carvalho Chehab: "Fix a potential array out-of-bounds in the mediatek vcodec driver" * tag 'media/v6.5-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: media: vcodec: Fix potential array out-of-bounds in encoder queue_setup
2023-08-25scsi: core: raid_class: Remove raid_component_add()Zhu Wang2-52/+0
The raid_component_add() function was added to the kernel tree via patch "[SCSI] embryonic RAID class" (2005). Remove this function since it has never had any callers in the Linux kernel. Since raid_component_release() is only used by raid_component_add(), remove it as well. Signed-off-by: Zhu Wang <wangzhu9@huawei.com> Link: https://lore.kernel.org/r/20230822015254.184270-1-wangzhu9@huawei.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Fixes: 04b5b5cb0136 ("scsi: core: Fix possible memory leak if device_add() fails") Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-08-25Merge tag 'drm-intel-fixes-2023-08-24' of ↵Dave Airlie5-39/+69
git://anongit.freedesktop.org/drm/drm-intel into drm-fixes - Fix power consumption at s2idle on DG2 (Anshuman) - Fix documentation build warning (Jani) - Fix Display HPD (Imre) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/ZOdPRFSJpo0ErPX/@intel.com
2023-08-25shmem: fix smaps BUG sleeping while atomicHugh Dickins1-2/+4
smaps_pte_hole_lookup() is calling shmem_partial_swap_usage() with page table lock held: but shmem_partial_swap_usage() does cond_resched_rcu() if need_resched(): "BUG: sleeping function called from invalid context". Since shmem_partial_swap_usage() is designed to count across a range, but smaps_pte_hole_lookup() only calls it for a single page slot, just break out of the loop on the last or only page, before checking need_resched(). Link: https://lkml.kernel.org/r/6fe3b3ec-abdf-332f-5c23-6a3b3a3b11a9@google.com Fixes: 230100321518 ("mm/smaps: simplify shmem handling of pte holes") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [5.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
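A sketch of the adjusted counting loop, simplified from the description of shmem_partial_swap_usage() above (XA_STATE setup and surrounding code omitted): leave the loop once the last slot of the range has been handled, before the resched check, so a single-slot caller holding the page table lock never sleeps.

rcu_read_lock();
xas_for_each(&xas, page, end - 1) {
        if (xas_retry(&xas, page))
                continue;
        if (xa_is_value(page))
                swapped++;
        if (xas.xa_index == end - 1)    /* last or only slot: done */
                break;
        if (need_resched()) {
                xas_pause(&xas);
                cond_resched_rcu();
        }
}
rcu_read_unlock();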
2023-08-25selftests: cachestat: catch failing fsync test on tmpfsAndre Przywara1-15/+47
The cachestat kselftest runs a test on a normal file, which is created temporarily in the current directory. Among the tests it runs there is a call to fsync(), which is expected to clean all dirty pages used by the file. However the tmpfs filesystem implements fsync() as noop_fsync(), so the call will not even attempt to clean anything when this test file happens to live on a tmpfs instance. This happens in an initramfs, or when the current directory is in /dev/shm or sometimes /tmp. To avoid this test failing wrongly, use statfs() to check which filesystem the test file lives on. If that is "tmpfs", we skip the fsync() test. Since the fsync test is only one part of the "normal file" test, we now execute this twice, skipping the fsync part on the first call. This way only the second test, including the fsync part, would be skipped. Link: https://lkml.kernel.org/r/20230821160534.3414911-3-andre.przywara@arm.com Signed-off-by: Andre Przywara <andre.przywara@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
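A minimal sketch of the filesystem check described above; statfs() and TMPFS_MAGIC are standard, while the way the result is wired into the test's skip logic is assumed:

#include <sys/vfs.h>
#include <linux/magic.h>

static int is_on_tmpfs(const char *path)
{
        struct statfs sb;

        if (statfs(path, &sb) != 0)
                return 0;               /* assume not tmpfs on error */

        return sb.f_type == TMPFS_MAGIC;
}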
2023-08-25selftests: cachestat: test for cachestat availabilityAndre Przywara1-1/+19
Patch series "selftests: cachestat: fix run on older kernels", v2. I ran all kernel selftests on some test machine, and stumbled upon cachestat failing (among others). These patches fix the run on older kernels and when the current directory is on a tmpfs instance. This patch (of 2): As cachestat is a new syscall, it won't be available on older kernels, for instance those running on a development machine. At the moment the test reports all tests as "not ok" in this case. Test for the cachestat syscall availability first, before doing further tests, and bail out early with a TAP SKIP comment. This also uses the opportunity to add the proper TAP headers, and add one check for proper error handling (illegal file descriptor). Link: https://lkml.kernel.org/r/20230821160534.3414911-1-andre.przywara@arm.com Link: https://lkml.kernel.org/r/20230821160534.3414911-2-andre.przywara@arm.com Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-25maple_tree: disable mas_wr_append() when other readers are possibleLiam R. Howlett1-0/+7
The current implementation of append may cause duplicate data and/or incorrect ranges to be returned to a reader during an update. Although this has not been reported or seen, disable the append write operation while the tree is in rcu mode out of an abundance of caution. During the analysis of mas_next_slot(), the following was artificially created by separating the writer and reader code:

  Writer:                                 Reader:
  mas_wr_append
      set end pivot
      updates end metadata
      Detects write to last slot
      last slot write is to start of slot
      store current contents in slot
      overwrite old end pivot
                                          mas_next_slot():
                                              read end metadata
                                              read old end pivot
                                              return with incorrect range
      store new value

  Alternatively:

  Writer:                                 Reader:
  mas_wr_append
      set end pivot
      updates end metadata
      Detects write to last slot
      last slot write is to end of slot
      store value
                                          mas_next_slot():
                                              read end metadata
                                              read old end pivot
                                              read new end pivot
                                              return with incorrect range
      set old end pivot

There may be other accesses that are not safe since we are now updating both metadata and pointers, so disabling append if there could be rcu readers is the safest action. Link: https://lkml.kernel.org/r/20230819004356.1454718-2-Liam.Howlett@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
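The guard this change adds, as a simplified sketch (function shape approximated from the maple tree code):

static bool mas_wr_append_sketch(struct ma_wr_state *wr_mas)
{
        /*
         * Never take the append fast path while RCU readers may be walking
         * the tree; fall back to the reader-safe store path instead.
         */
        if (mt_in_rcu(wr_mas->mas->tree))
                return false;

        /* ... existing append logic runs only in non-RCU mode ... */
        return true;
}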
2023-08-25madvise:madvise_free_pte_range(): don't use mapcount() against large folio ↵Yin Fengwei1-1/+1
for sharing check Commit 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by another mapping. That is not correct for large folios: folio_mapcount() returns the total mapcount of a large folio, which is not suitable to detect whether the folio is shared. Use folio_estimated_sharers(), which returns an estimated number of sharers. That means it's not 100% correct; it should be OK for the madvise case here. The user-visible effect is that the THP is skipped when userspace calls madvise, but the correct behavior is that the THP should be split and processed. NOTE: this change is a temporary fix to reduce the user-visible effects before the long-term fix from David is ready. Link: https://lkml.kernel.org/r/20230808020917.2230692-4-fengwei.yin@intel.com Fixes: 98b211d6415f ("madvise: convert madvise_free_pte_range() to use a folio") Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Reviewed-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-25madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio ↵Yin Fengwei1-1/+1
for sharing check Commit fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by another mapping. That is not correct for large folios: folio_mapcount() returns the total mapcount of a large folio, which is not suitable to detect whether the folio is shared. Use folio_estimated_sharers(), which returns an estimated number of sharers. That means it's not 100% correct; it should be OK for the madvise case here. The user-visible effect is that the THP is skipped when userspace calls madvise, but the correct behavior is that the THP should be split and processed. NOTE: this change is a temporary fix to reduce the user-visible effects before the long-term fix from David is ready. Link: https://lkml.kernel.org/r/20230808020917.2230692-3-fengwei.yin@intel.com Fixes: fc986a38b670 ("mm: huge_memory: convert madvise_free_huge_pmd to use a folio") Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Reviewed-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-25madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against ↵Yin Fengwei1-2/+2
large folio for sharing check Patch series "don't use mapcount() to check large folio sharing", v2. In madvise_cold_or_pageout_pte_range() and madvise_free_pte_range(), folio_mapcount() is used to check whether the folio is shared. But it's not correct as folio_mapcount() returns the total mapcount of a large folio. Use folio_estimated_sharers() here, as the estimated number is enough. This patchset fixes the following case: a userspace application calls madvise() with MADV_FREE, MADV_COLD or MADV_PAGEOUT for a specific address range that has THPs mapped into it. Without the patchset, the THP is skipped. With the patch, the THP will be split and handled accordingly. David reported that the cow selftest skips some cases because MADV_PAGEOUT skips THPs: https://lore.kernel.org/linux-mm/9e92e42d-488f-47db-ac9d-75b24cd0d037@intel.com/T/#mbf0f2ec7fbe45da47526de1d7036183981691e81 and I confirmed this patchset makes it work again. This patch (of 3): Commit 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios") replaced the page_mapcount() with folio_mapcount() to check whether the folio is shared by another mapping. That is not correct for large folios: folio_mapcount() returns the total mapcount of a large folio, which is not suitable to detect whether the folio is shared. Use folio_estimated_sharers(), which returns an estimated number of sharers. That means it's not 100% correct; it should be OK for the madvise case here. The user-visible effect is that the THP is skipped when userspace calls madvise, but the correct behavior is that the THP should be split and processed. NOTE: this change is a temporary fix to reduce the user-visible effects before the long-term fix from David is ready. Link: https://lkml.kernel.org/r/20230808020917.2230692-1-fengwei.yin@intel.com Link: https://lkml.kernel.org/r/20230808020917.2230692-2-fengwei.yin@intel.com Fixes: 07e8c82b5eff ("madvise: convert madvise_cold_or_pageout_pte_range() to use folios") Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Reviewed-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
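For illustration, the predicate this series switches to for large folios, as a small hedged sketch (the wrapper is illustrative; the series patches the checks in place):

static inline bool madvise_folio_is_shared(struct folio *folio)
{
        /*
         * folio_mapcount() sums the mapcounts of every page in a large
         * folio and so overstates sharing; folio_estimated_sharers()
         * gives a good-enough estimate of how many processes share it.
         */
        return folio_estimated_sharers(folio) != 1;
}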