summaryrefslogtreecommitdiff
path: root/drivers/hv
AgeCommit message (Collapse)AuthorFilesLines
8 daysmshv: Fix error handling in mshv_region_pinStanislav Kinsburskii1-2/+4
[ Upstream commit c0e296f257671ba10249630fe58026f29e4804d9 ] The current error handling has two issues: First, pin_user_pages_fast() can return a short pin count (less than requested but greater than zero) when it cannot pin all requested pages. This is treated as success, leading to partially pinned regions being used, which causes memory corruption. Second, when an error occurs mid-loop, already pinned pages from the current batch are not properly accounted for before calling mshv_region_invalidate_pages(), causing a page reference leak. Treat short pins as errors and fix partial batch accounting before cleanup. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-25mshv: Fix use-after-free in mshv_map_user_memory error pathStanislav Kinsburskii1-1/+1
[ Upstream commit 6922db250422a0dfee34de322f86b7a73d713d33 ] In the error path of mshv_map_user_memory(), calling vfree() directly on the region leaves the MMU notifier registered. When userspace later unmaps the memory, the notifier fires and accesses the freed region, causing a use-after-free and potential kernel panic. Replace vfree() with mshv_partition_put() to properly unregister the MMU notifier before freeing the region. Fixes: b9a66cd5ccbb9 ("mshv: Add support for movable memory regions") Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04mshv: clear eventfd counter on irqfd shutdownCarlos López2-4/+2
[ Upstream commit 2b4246153e2184e3a3b4edc8cc35337d7a2455a6 ] While unhooking from the irqfd waitqueue, clear the internal eventfd counter by using eventfd_ctx_remove_wait_queue() instead of remove_wait_queue(), preventing potential spurious interrupts. This removes the need to store a pointer into the workqueue, as the eventfd already keeps track of it. This mimicks what other similar subsystems do on their equivalent paths with their irqfds (KVM, Xen, ACRN support, etc). Signed-off-by: Carlos López <clopez@suse.de> Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04mshv: Ignore second stats page map result failurePurna Pavan Chandra Aekkaladevi2-4/+51
[ Upstream commit 7538b80e5a4b473b73428d13b3a47ceaad9a8a7c ] Older versions of the hypervisor do not have a concept of separate SELF and PARENT stats areas. In this case, mapping the HV_STATS_AREA_SELF page is sufficient - it's the only page and it contains all available stats. Mapping HV_STATS_AREA_PARENT returns HV_STATUS_INVALID_PARAMETER which currently causes module init to fail on older hypevisor versions. Detect this case and gracefully fall back to populating stats_pages[HV_STATS_AREA_PARENT] with the already-mapped SELF page. Add comments to clarify the behavior, including a clarification of why this isn't needed for hv_call_map_stats_page2() which always supports PARENT and SELF areas. Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-27Drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RTJan Kiszka1-1/+65
commit f8e6343b7a89c7c649db5a9e309ba7aa20401813 upstream. Resolves the following lockdep report when booting PREEMPT_RT on Hyper-V with related guest support enabled: [ 1.127941] hv_vmbus: registering driver hyperv_drm [ 1.132518] ============================= [ 1.132519] [ BUG: Invalid wait context ] [ 1.132521] 6.19.0-rc8+ #9 Not tainted [ 1.132524] ----------------------------- [ 1.132525] swapper/0/0 is trying to lock: [ 1.132526] ffff8b9381bb3c90 (&channel->sched_lock){....}-{3:3}, at: vmbus_chan_sched+0xc4/0x2b0 [ 1.132543] other info that might help us debug this: [ 1.132544] context-{2:2} [ 1.132545] 1 lock held by swapper/0/0: [ 1.132547] #0: ffffffffa010c4c0 (rcu_read_lock){....}-{1:3}, at: vmbus_chan_sched+0x31/0x2b0 [ 1.132557] stack backtrace: [ 1.132560] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.19.0-rc8+ #9 PREEMPT_{RT,(lazy)} [ 1.132565] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025 [ 1.132567] Call Trace: [ 1.132570] <IRQ> [ 1.132573] dump_stack_lvl+0x6e/0xa0 [ 1.132581] __lock_acquire+0xee0/0x21b0 [ 1.132592] lock_acquire+0xd5/0x2d0 [ 1.132598] ? vmbus_chan_sched+0xc4/0x2b0 [ 1.132606] ? lock_acquire+0xd5/0x2d0 [ 1.132613] ? vmbus_chan_sched+0x31/0x2b0 [ 1.132619] rt_spin_lock+0x3f/0x1f0 [ 1.132623] ? vmbus_chan_sched+0xc4/0x2b0 [ 1.132629] ? vmbus_chan_sched+0x31/0x2b0 [ 1.132634] vmbus_chan_sched+0xc4/0x2b0 [ 1.132641] vmbus_isr+0x2c/0x150 [ 1.132648] __sysvec_hyperv_callback+0x5f/0xa0 [ 1.132654] sysvec_hyperv_callback+0x88/0xb0 [ 1.132658] </IRQ> [ 1.132659] <TASK> [ 1.132660] asm_sysvec_hyperv_callback+0x1a/0x20 As code paths that handle vmbus IRQs use sleepy locks under PREEMPT_RT, the vmbus_isr execution needs to be moved into thread context. Open- coding this allows to skip the IPI that irq_work would additionally bring and which we do not need, being an IRQ, never an NMI. This affects both x86 and arm64, therefore hook into the common driver logic. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com> Tested-by: Florian Bezdeka <florian.bezdeka@siemens.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-27mshv: fix SRCU protection in irqfd resampler ack handlerLi RongQing1-2/+3
[ Upstream commit 2e7577cd5ddc1f86d1b6c48caf3cfa87dbb14e34 ] Replace hlist_for_each_entry_rcu() with hlist_for_each_entry_srcu() in mshv_irqfd_resampler_ack() to correctly handle SRCU-protected linked list traversal. The function uses SRCU (sleepable RCU) synchronization via partition->pt_irq_srcu, but was incorrectly using the RCU variant for list iteration. This could lead to race conditions when the list is modified concurrently. Also add srcu_read_lock_held() assertion as required by hlist_for_each_entry_srcu() to ensure we're in the proper read-side critical section. Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs") Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-15mshv: handle gpa intercepts for arm64Anirudh Rayabharam (Microsoft)1-7/+8
The mshv driver now uses movable pages for guests. For arm64 guests to be functional, handle gpa intercepts for arm64 too (the current code implements handling only for x86). Move some arch-agnostic functions out of #ifdefs so that they can be re-used. Fixes: b9a66cd5ccbb ("mshv: Add support for movable memory regions") Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-01-15mshv: Add __user attribute to argument passed to access_ok()Michael Kelley1-1/+1
access_ok() expects its first argument to have the __user attribute since it is checking access to user space. Current code passes an argument that lacks that attribute, resulting in 'sparse' flagging the incorrect usage. However, the compiler doesn't generate code based on the attribute, so there's no actual bug. In the interest of general correctness and to avoid noise from sparse, add the __user attribute. No functional change. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512141339.791TCKnB-lkp@intel.com/ Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-01-15mshv: Store the result of vfs_poll in a variable of type __poll_tMichael Kelley1-1/+1
vfs_poll() returns a result of type __poll_t, but current code is using an "unsigned int" local variable. The difference is that __poll_t carries the "bitwise" attribute. This attribute is not interpreted by the C compiler; it is only used by 'sparse' to flag incorrect usage of the return value. The return value is used correctly here, so there's no bug, but sparse complains about the type mismatch. In the interest of general correctness and to avoid noise from sparse, change the local variable to type __poll_t. No functional change. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512141339.791TCKnB-lkp@intel.com/ Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-01-15mshv: Align huge page stride with guest mappingStanislav Kinsburskii1-31/+62
Ensure that a stride larger than 1 (huge page) is only used when page points to a head of a huge page and both the guest frame number (gfn) and the operation size (page_count) are aligned to the huge page size (PTRS_PER_PMD). This matches the hypervisor requirement that map/unmap operations for huge pages must be guest-aligned and cover a full huge page. Add mshv_chunk_stride() to encapsulate this alignment and page-order validation, and plumb a huge_page flag into the region chunk handlers. This prevents issuing large-page map/unmap/share operations that the hypervisor would reject due to misaligned guest mappings. Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal") Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-01-15Drivers: hv: Always do Hyper-V panic notification in hv_kmsg_dump()Michael Kelley1-5/+7
hv_kmsg_dump() currently skips the panic notification entirely if it doesn't get any message bytes to pass to Hyper-V due to an error from kmsg_dump_get_buffer(). Skipping the notification is undesirable because it leaves the Hyper-V host uncertain about the state of a panic'ed guest. Fix this by always doing the panic notification, even if bytes_written is zero. Also ensure that bytes_written is initialized, which fixes a kernel test robot warning. The warning is actually bogus because kmsg_dump_get_buffer() happens to set bytes_written even if it fails, and in the kernel test robot's CONFIG_PRINTK not set case, hv_kmsg_dump() is never called. But do the initialization for robustness and to quiet the static checker. Fixes: 9c318a1d9b50 ("Drivers: hv: move panic report code from vmbus to hv early init code") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/202512172103.OcUspn1Z-lkp@intel.com/ Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Roman Kisel <vdso@mailbox.org> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2026-01-15Drivers: hv: vmbus: fix typo in function name referenceJulia Lawall1-1/+1
Replace cmxchg by cmpxchg. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: Roman Kisel <vdso@mailbox.org> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-18mshv: release mutex on region invalidation failureAnirudh Rayabharam (Microsoft)1-1/+3
In the region invalidation failure path in mshv_region_interval_invalidate(), the region mutex is not released. Fix it by releasing the mutex in the failure path. Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Fixes: b9a66cd5ccbb ("mshv: Add support for movable memory regions") Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Roman Kisel <vdso@mailbox.org> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-17mshv: hide x86-specific functions on arm64Arnd Bergmann1-0/+2
The hv_sleep_notifiers_register() and hv_machine_power_off() functions are only called and declared on x86, but cause a warning on arm64: drivers/hv/mshv_common.c:210:6: error: no previous prototype for 'hv_sleep_notifiers_register' [-Werror=missing-prototypes] 210 | void hv_sleep_notifiers_register(void) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/hv/mshv_common.c:224:6: error: no previous prototype for 'hv_machine_power_off' [-Werror=missing-prototypes] 224 | void hv_machine_power_off(void) | ^~~~~~~~~~~~~~~~~~~~ Hide both inside of an #ifdef block. Fixes: f0be2600ac55 ("mshv: Use reboot notifier to configure sleep state") Fixes: 615a6e7d83f9 ("mshv: Cleanly shutdown root partition with MSHV") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-15mshv: Initialize local variables early upon region invalidationStanislav Kinsburskii1-7/+7
Ensure local variables are initialized before use so that the warning can print the right values if locking the region to invalidate fails due to inability to lock the region. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: b9a66cd5ccbb ("mshv: Add support for movable memory regions") Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-15mshv: Use PMD_ORDER instead of HPAGE_PMD_ORDER when processing regionsStanislav Kinsburskii1-1/+1
Fix page order determination logic when CONFIG_PGTABLE_HAS_HUGE_LEAVES is undefined, as HPAGE_PMD_SHIFT is defined as BUILD_BUG in that case. Fixes: abceb4297bf8 ("mshv: Fix huge page handling in memory region traversal") Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-09Merge tag 'hyperv-next-signed-20251207' of ↵Linus Torvalds21-602/+3306
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Enhancements to Linux as the root partition for Microsoft Hypervisor: - Support a new mode called L1VH, which allows Linux to drive the hypervisor running the Azure Host directly - Support for MSHV crash dump collection - Allow Linux's memory management subsystem to better manage guest memory regions - Fix issues that prevented a clean shutdown of the whole system on bare metal and nested configurations - ARM64 support for the MSHV driver - Various other bug fixes and cleanups - Add support for Confidential VMBus for Linux guest on Hyper-V - Secure AVIC support for Linux guests on Hyper-V - Add the mshv_vtl driver to allow Linux to run as the secure kernel in a higher virtual trust level for Hyper-V * tag 'hyperv-next-signed-20251207' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (58 commits) mshv: Cleanly shutdown root partition with MSHV mshv: Use reboot notifier to configure sleep state mshv: Add definitions for MSHV sleep state configuration mshv: Add support for movable memory regions mshv: Add refcount and locking to mem regions mshv: Fix huge page handling in memory region traversal mshv: Move region management to mshv_regions.c mshv: Centralize guest memory region destruction mshv: Refactor and rename memory region handling functions mshv: adjust interrupt control structure for ARM64 Drivers: hv: use kmalloc_array() instead of kmalloc() mshv: Add ioctl for self targeted passthrough hvcalls Drivers: hv: Introduce mshv_vtl driver Drivers: hv: Export some symbols for mshv_vtl static_call: allow using STATIC_CALL_TRAMP_STR() from assembly mshv: Extend create partition ioctl to support cpu features mshv: Allow mappings that overlap in uaddr mshv: Fix create memory region overlap check mshv: add WQ_PERCPU to alloc_workqueue users Drivers: hv: Use kmalloc_array() instead of kmalloc() ...
2025-12-06Merge tag 'soc-drivers-6.19' of ↵Linus Torvalds1-5/+9
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC driver updates from Arnd Bergmann: "This is the first half of the driver changes: - A treewide interface change to the "syscore" operations for power management, as a preparation for future Tegra specific changes - Reset controller updates with added drivers for LAN969x, eic770 and RZ/G3S SoCs - Protection of system controller registers on Renesas and Google SoCs, to prevent trivially triggering a system crash from e.g. debugfs access - soc_device identification updates on Nvidia, Exynos and Mediatek - debugfs support in the ST STM32 firewall driver - Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI - Cleanups for memory controller support on Nvidia and Renesas" * tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits) memory: tegra186-emc: Fix missing put_bpmp Documentation: reset: Remove reset_controller_add_lookup() reset: fix BIT macro reference reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe reset: th1520: Support reset controllers in more subsystems reset: th1520: Prepare for supporting multiple controllers dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets reset: remove legacy reset lookup code clk: davinci: psc: drop unused reset lookup reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support reset: eswin: Add eic7700 reset driver dt-bindings: reset: eswin: Documentation for eic7700 SoC reset: sparx5: add LAN969x support dt-bindings: reset: microchip: Add LAN969x support soc: rockchip: grf: Add select correct PWM implementation on RK3368 soc/tegra: pmc: Add USB wake events for Tegra234 amba: tegra-ahb: Fix device leak on SMMU enable ...
2025-12-06mshv: Cleanly shutdown root partition with MSHVPraveen K Paladugu1-0/+21
Root partitions running on MSHV currently attempt ACPI power-off, which MSHV intercepts and triggers a Machine Check Exception (MCE), leading to a kernel panic. Root partitions panic with a trace similar to: [ 81.306348] reboot: Power down [ 81.314709] mce: [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 0: b2000000c0060001 [ 81.314711] mce: [Hardware Error]: TSC 3b8cb60a66 PPIN 11d98332458e4ea9 [ 81.314713] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1759339405 SOCKET 0 APIC 0 microcode ffffffff [ 81.314715] mce: [Hardware Error]: Run the above through 'mcelog --ascii' [ 81.314716] mce: [Hardware Error]: Machine check: Processor context corrupt [ 81.314717] Kernel panic - not syncing: Fatal machine check To avoid this, configure the sleep state in the hypervisor and invoke the HVCALL_ENTER_SLEEP_STATE hypercall as the final step in the shutdown sequence. This ensures a clean and safe shutdown of the root partition. Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com> Co-developed-by: Anatol Belski <anbelski@linux.microsoft.com> Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com> Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: Use reboot notifier to configure sleep statePraveen K Paladugu1-0/+78
Configure sleep state information from ACPI in MSHV hypervisor using a reboot notifier. This data allows the hypervisor to correctly power off the host during shutdown. Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com> Co-developed-by: Anatol Belski <anbelski@linux.microsoft.com> Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com> Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Reviewed-by: Stansialv Kinsburskii <skinsburskii@linux.miscrosoft.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: Add support for movable memory regionsStanislav Kinsburskii4-36/+346
Introduce support for movable memory regions in the Hyper-V root partition driver to improve memory management flexibility and enable advanced use cases such as dynamic memory remapping. Mirror the address space between the Linux root partition and guest VMs using HMM. The root partition owns the memory, while guest VMs act as devices with page tables managed via hypercalls. MSHV handles VP intercepts by invoking hmm_range_fault() and updating SLAT entries. When memory is reclaimed, HMM invalidates the relevant regions, prompting MSHV to clear SLAT entries; guest VMs will fault again on access. Integrate mmu_interval_notifier for movable regions, implement handlers for HMM faults and memory invalidation, and update memory region mapping logic to support movable regions. While MMU notifiers are commonly used in virtualization drivers, this implementation leverages HMM (Heterogeneous Memory Management) for its specialized functionality. HMM provides a framework for mirroring, invalidation, and fault handling, reducing boilerplate and improving maintainability compared to generic MMU notifiers. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: Add refcount and locking to mem regionsStanislav Kinsburskii3-12/+45
Introduce kref-based reference counting and spinlock protection for memory regions in Hyper-V partition management. This change improves memory region lifecycle management and ensures thread-safe access to the region list. Previously, the regions list was protected by the partition mutex. However, this approach is too heavy for frequent fault and invalidation operations. Finer grained locking is now used to improve efficiency and concurrency. This is a precursor to supporting movable memory regions. Fault and invalidation handling for movable regions will require safe traversal of the region list and holding a region reference while performing invalidation or fault operations. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: Fix huge page handling in memory region traversalStanislav Kinsburskii2-31/+191
The previous code assumed that if a region's first page was huge, the entire region consisted of huge pages and stored this in a large_pages flag. This premise is incorrect not only for movable regions (where pages can be split and merged on invalidate callbacks or page faults), but even for pinned regions: THPs can be split and merged during allocation, so a large, pinned region may contain a mix of huge and regular pages. This change removes the large_pages flag and replaces region-wide assumptions with per-chunk inspection of the actual page size when mapping, unmapping, sharing, and unsharing. This makes huge page handling correct for mixed-page regions and avoids relying on stale metadata that can easily become invalid as memory is remapped. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: Move region management to mshv_regions.cStanislav Kinsburskii4-165/+198
Refactor memory region management functions from mshv_root_main.c into mshv_regions.c for better modularity and code organization. Adjust function calls and headers to use the new implementation. Improve maintainability and separation of concerns in the mshv_root module. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: Centralize guest memory region destructionStanislav Kinsburskii1-31/+34
Centralize guest memory region destruction to prevent resource leaks and inconsistent cleanup across unmap and partition destruction paths. Unify region removal, encrypted partition access recovery, and region invalidation to improve maintainability and reliability. Reduce code duplication and make future updates less error-prone by encapsulating cleanup logic in a single helper. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: Refactor and rename memory region handling functionsStanislav Kinsburskii1-44/+36
Simplify and unify memory region management to improve code clarity and reliability. Consolidate pinning and invalidation logic, adopt consistent naming, and remove redundant checks to reduce complexity. Enhance documentation and update call sites for maintainability. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: adjust interrupt control structure for ARM64Jinank Jain3-0/+16
Interrupt control structure (union hv_interupt_control) has different fields when it comes to x86 vs ARM64. Bring in the correct structure from HyperV header files and adjust the existing interrupt routing code accordingly. Signed-off-by: Jinank Jain <jinankjain@microsoft.com> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06Drivers: hv: use kmalloc_array() instead of kmalloc()Gongwei Li1-1/+1
Replace kmalloc() with kmalloc_array() to prevent potential overflow, as recommended in Documentation/process/deprecated.rst. Signed-off-by: Gongwei Li <ligongwei@kylinos.cn> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06mshv: Add ioctl for self targeted passthrough hvcallsAnirudh Rayabharam (Microsoft)1-9/+38
Allow MSHV_ROOT_HVCALL IOCTL on the /dev/mshv fd. This IOCTL would execute a passthrough hypercall targeting the root/parent partition i.e. HV_PARTITION_ID_SELF. This will be useful for the VMM to query things like supported synthetic processor features, supported VMM capabiliites etc. Since hypercalls targeting the host partition could potentially perform privileged operations, allow only a limited set of hypercalls. To begin with, allow only: HVCALL_GET_PARTITION_PROPERTY HVCALL_GET_PARTITION_PROPERTY_EX Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-06Drivers: hv: Introduce mshv_vtl driverNaman Jain4-2/+1449
Provide an interface for Virtual Machine Monitor like OpenVMM and its use as OpenHCL paravisor to control VTL0 (Virtual trust Level). Expose devices and support IOCTLs for features like VTL creation, VTL0 memory management, context switch, making hypercalls, mapping VTL0 address space to VTL2 userspace, getting new VMBus messages and channel events in VTL2 etc. Co-developed-by: Roman Kisel <romank@linux.microsoft.com> Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Naman Jain <namjain@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-12-02Merge tag 'core-rseq-2025-11-30' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq updates from Thomas Gleixner: "A large overhaul of the restartable sequences and CID management: The recent enablement of RSEQ in glibc resulted in regressions which are caused by the related overhead. It turned out that the decision to invoke the exit to user work was not really a decision. More or less each context switch caused that. There is a long list of small issues which sums up nicely and results in a 3-4% regression in I/O benchmarks. The other detail which caused issues due to extra work in context switch and task migration is the CID (memory context ID) management. It also requires to use a task work to consolidate the CID space, which is executed in the context of an arbitrary task and results in sporadic uncontrolled exit latencies. The rewrite addresses this by: - Removing deprecated and long unsupported functionality - Moving the related data into dedicated data structures which are optimized for fast path processing. - Caching values so actual decisions can be made - Replacing the current implementation with a optimized inlined variant. - Separating fast and slow path for architectures which use the generic entry code, so that only fault and error handling goes into the TIF_NOTIFY_RESUME handler. - Rewriting the CID management so that it becomes mostly invisible in the context switch path. That moves the work of switching modes into the fork/exit path, which is a reasonable tradeoff. That work is only required when a process creates more threads than the cpuset it is allowed to run on or when enough threads exit after that. An artificial thread pool benchmarks which triggers this did not degrade, it actually improved significantly. The main effect in migration heavy scenarios is that runqueue lock held time and therefore contention goes down significantly" * tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched/mmcid: Switch over to the new mechanism sched/mmcid: Implement deferred mode change irqwork: Move data struct to a types header sched/mmcid: Provide CID ownership mode fixup functions sched/mmcid: Provide new scheduler CID mechanism sched/mmcid: Introduce per task/CPU ownership infrastructure sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex sched/mmcid: Provide precomputed maximal value sched/mmcid: Move initialization out of line signal: Move MMCID exit out of sighand lock sched/mmcid: Convert mm CID mask to a bitmap cpumask: Cache num_possible_cpus() sched/mmcid: Use cpumask_weighted_or() cpumask: Introduce cpumask_weighted_or() sched/mmcid: Prevent pointless work in mm_update_cpus_allowed() sched/mmcid: Move scheduler code out of global header sched: Fixup whitespace damage sched/mmcid: Cacheline align MM CID storage sched/mmcid: Use proper data structures sched/mmcid: Revert the complex CID management ...
2025-11-28hv: convert mshv_ioctl_create_partition() to FD_ADD()Christian Brauner1-24/+6
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-39-b6efa1706cfd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-15Drivers: hv: Export some symbols for mshv_vtlNaman Jain3-1/+7
MSHV_VTL driver is going to be introduced, which is supposed to provide interface for Virtual Machine Monitors (VMMs) to control Virtual Trust Level (VTL). Export the symbols needed to make it work (vmbus_isr, hv_context and hv_post_message). Co-developed-by: Roman Kisel <romank@linux.microsoft.com> Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506110544.q0NDMQVc-lkp@intel.com/ Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Naman Jain <namjain@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Extend create partition ioctl to support cpu featuresMuminul Islam1-21/+97
The existing mshv create partition ioctl does not provide a way to specify which cpu features are enabled in the guest. Instead, it attempts to enable all features and those that are not supported are silently disabled by the hypervisor. This was done to reduce unnecessary complexity and is sufficient for many cases. However, new scenarios require fine-grained control over these features. Define a new mshv_create_partition_v2 structure which supports passing the disabled processor and xsave feature bits through to the create partition hypercall directly. Introduce a new flag MSHV_PT_BIT_CPU_AND_XSAVE_FEATURES which enables the new structure. If unset, the original mshv_create_partition struct is used, with the old behavior of enabling all features. Co-developed-by: Jinank Jain <jinankjain@microsoft.com> Signed-off-by: Jinank Jain <jinankjain@microsoft.com> Signed-off-by: Muminul Islam <muislam@microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Allow mappings that overlap in uaddrMagnus Kulke1-6/+2
Currently the MSHV driver rejects mappings that would overlap in userspace. Some VMMs require the same memory to be mapped to different parts of the guest's address space, and so working around this restriction is difficult. The hypervisor itself doesn't prohibit mappings that overlap in uaddr, (really in SPA; system physical addresses), so supporting this in the driver doesn't require any extra work: only the checks need to be removed. Since no userspace code until now has been able to overlap regions in userspace, relaxing this constraint can't break any existing code. Signed-off-by: Magnus Kulke <magnuskulke@linux.microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Fix create memory region overlap checkNuno Das Neves1-20/+11
The current check is incorrect; it only checks if the beginning or end of a region is within an existing region. This doesn't account for userspace specifying a region that begins before and ends after an existing region. Change the logic to a range intersection check against gfns and uaddrs for each region. Remove mshv_partition_region_by_uaddr() as it is no longer used. Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs") Reported-by: Michael Kelley <mhklinux@outlook.com> Closes: https://lore.kernel.org/linux-hyperv/SN6PR02MB41575BE0406D3AB22E1D7DB5D4C2A@SN6PR02MB4157.namprd02.prod.outlook.com/ Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: add WQ_PERCPU to alloc_workqueue usersMarco Crivellari1-1/+1
Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Use kmalloc_array() instead of kmalloc()Rahul Kumar1-1/+1
Documentation/process/deprecated.rst recommends against the use of kmalloc with dynamic size calculations due to the risk of overflow and smaller allocation being made than the caller was expecting. Replace kmalloc() with kmalloc_array() in hv_common.c to make the intended allocation size clearer and avoid potential overflow issues. The number of pages (pgcount) is bounded, so overflow is not a practical concern here. However, using kmalloc_array() better reflects the intent to allocate an array and improves consistency with other allocations. No functional change intended. Signed-off-by: Rahul Kumar <rk0006818@gmail.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Resolve ambiguity in hypervisor version logStanislav Kinsburskii1-2/+2
Update the log message in hv_common_init to explicitly state that the reported version is for the Microsoft Hypervisor, not the host OS. Previously, this message was accurate for guests running on Windows hosts, where the host and hypervisor versions matched. With support for Linux hosts running the Hyper-V hypervisor, the host OS and hypervisor versions may differ. This change avoids confusion by making it clear that the version refers to the Microsoft Hypervisor regardless of the host operating system. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: fix missing kernel-doc description for 'size' in request_arr_init()Kriish Sharma1-1/+1
Add missing kernel-doc entry for the @size parameter in request_arr_init(), fixing the following documentation warning reported by the kernel test robot and detected via kernel-doc: Warning: drivers/hv/channel.c:595 function parameter 'size' not described in 'request_arr_init' Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503021934.wH1BERla-lkp@intel.com Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com> Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Introduce new hypercall to map stats page for L1VH partitionsJinank Jain3-20/+107
Introduce HVCALL_MAP_STATS_PAGE2 which provides a map location (GPFN) to map the stats to. This hypercall is required for L1VH partitions, depending on the hypervisor version. This uses the same check as the state page map location; mshv_use_overlay_gpfn(). Add mshv_map_vp_state_page() helpers to use this new hypercall or the old one depending on availability. For unmapping, the original HVCALL_UNMAP_STATS_PAGE works for both cases. Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Allocate vp state page for HVCALL_MAP_VP_STATE_PAGE on L1VHJinank Jain3-50/+101
Introduce mshv_use_overlay_gpfn() to check if a page needs to be allocated and passed to the hypervisor to map VP state pages. This is only needed on L1VH, and only on some (newer) versions of the hypervisor, hence the need to check vmm_capabilities. Introduce functions hv_map/unmap_vp_state_page() to handle the allocation and freeing. Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Praveen K Paladugu <prapal@linux.microsoft.com> Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam <anirudh@anirudhrb.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Get the vmm capabilities offered by the hypervisorPurna Pavan Chandra Aekkaladevi2-0/+19
Some hypervisor APIs are gated by feature bits in the "vmm capabilities" partition property. Store the capabilities on mshv_root module init, using HVCALL_GET_PARTITION_PROPERTY_EX. This is not supported on all hypervisors. In that case, just set the capabilities to 0 and proceed as normal. Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Praveen K Paladugu <prapal@linux.microsoft.com> Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Add the HVCALL_GET_PARTITION_PROPERTY_EX hypercallPurna Pavan Chandra Aekkaladevi2-0/+33
This hypercall can be used to fetch extended properties of a partition. Extended properties are properties with values larger than a u64. Some of these also need additional input arguments. Add helper function for using the hypercall in the mshv_root driver. Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam <anirudh@anirudhrb.com> Reviewed-by: Praveen K Paladugu <prapal@linux.microsoft.com> Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Only map vp->vp_stats_pages if on root schedulerNuno Das Neves1-4/+8
This mapping is only used for checking if the dispatch thread is blocked. This is only relevant for the root scheduler, so check the scheduler type to determine whether to map/unmap these pages, instead of the current check, which is incorrect. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam <anirudh@anirudhrb.com> Reviewed-by: Praveen K Paladugu <prapal@linux.microsoft.com> Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Support establishing the confidential VMBus connectionRoman Kisel1-62/+106
To establish the confidential VMBus connection the CoCo VM, the guest first checks on the confidential VMBus availability, and then proceeds to initializing the communication stack. Implement that in the VMBus driver initialization. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Set the default VMBus version to 6.0Roman Kisel1-1/+2
The confidential VMBus is supported by the protocol version 6.0 onwards. Attempt to establish the VMBus 6.0 connection thus enabling the confidential VMBus features when available. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Support confidential VMBus channelsRoman Kisel2-0/+22
To make use of Confidential VMBus channels, initialize the co_ring_buffers and co_external_memory fields of the channel structure. Advertise support upon negotiating the version and compute values for those fields and initialize them. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Free msginfo when the buffer fails to decryptRoman Kisel1-6/+18
The early failure path in __vmbus_establish_gpadl() doesn't deallocate msginfo if the buffer fails to decrypt. Fix the leak by breaking out the cleanup code into a separate function and calling it where required. Fixes: d4dccf353db80 ("Drivers: hv: vmbus: Mark vmbus ring buffer visible to host in Isolation VM") Reported-by: Michael Kelley <mkhlinux@outlook.com> Closes: https://lore.kernel.org/linux-hyperv/SN6PR02MB41573796F9787F67E0E97049D472A@SN6PR02MB4157.namprd02.prod.outlook.com Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Allocate encrypted buffers when requestedRoman Kisel3-23/+34
Confidential VMBus is built around using buffers not shared with the host. Support allocating encrypted buffers when requested. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>