<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/Documentation/virt/kvm/locking.rst, branch linux-7.1.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-7.1.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-7.1.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-02T17:52:09+00:00</updated>
<entry>
<title>Documentation: KVM: Formalizing taking vcpu-&gt;mutex *outside* of kvm-&gt;slots_lock</title>
<updated>2026-03-02T17:52:09+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2026-03-02T17:02:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f8211e95dfda702ba81ea2e3e7a8c6c967f385fa'/>
<id>urn:sha1:f8211e95dfda702ba81ea2e3e7a8c6c967f385fa</id>
<content type='text'>
Explicitly document the ordering of vcpu-&gt;mutex being taken *outside* of
kvm-&gt;slots_lock.  While somewhat unintuitive since vCPUs conceptually have
narrower scope than VMs, the scope of the owning object (vCPU versus VM)
doesn't automatically carry over to the lock.  In this case, vcpu-&gt;mutex
has far broader scope than kvm-&gt;slots_lock.  As Paolo put it, it's a
"don't worry about multiple ioctls at the same time" mutex that's intended
to be taken at the outer edges of KVM.

More importantly, arm64 and x86 have gained flows that take kvm-&gt;slots_lock
inside of vcpu-&gt;mutex.  x86's kvm_inhibit_apic_access_page() is particularly
nasty, as slots_lock is taken quite deep within KVM_RUN, i.e. simply
swapping the ordering isn't an option.

Commit to the vcpu-&gt;mutex =&gt; kvm-&gt;slots_lock ordering, as vcpu-&gt;mutex
really is intended to be a "top-level" lock, whereas kvm-&gt;slots_lock is
"just" a helper lock.
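
As a minimal userspace sketch of the ordering described above (not kernel
code), the following uses pthread mutexes as stand-ins.  The toy_vm and
toy_vcpu types and the toy_* helpers are hypothetical names invented for
illustration; only the rule "vcpu-&gt;mutex first, kvm-&gt;slots_lock second"
comes from this commit.

<![CDATA[
```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the KVM objects; not the kernel's definitions. */
struct toy_vm {
	pthread_mutex_t slots_lock;	/* "helper" lock, always taken second */
	int nr_memslots;
};

struct toy_vcpu {
	pthread_mutex_t mutex;		/* "top-level" lock, always taken first */
	struct toy_vm *vm;
};

static void toy_vm_init(struct toy_vm *vm)
{
	pthread_mutex_init(&vm->slots_lock, NULL);
	vm->nr_memslots = 0;
}

static void toy_vcpu_init(struct toy_vcpu *vcpu, struct toy_vm *vm)
{
	pthread_mutex_init(&vcpu->mutex, NULL);
	vcpu->vm = vm;
}

/*
 * Models a flow that runs in vCPU context and needs to touch memslots:
 * the vCPU mutex is the outer lock, slots_lock is the inner lock, and
 * the locks are released in the reverse order.
 */
static int toy_vcpu_add_memslot(struct toy_vcpu *vcpu)
{
	int ret;

	pthread_mutex_lock(&vcpu->mutex);		/* outer */
	pthread_mutex_lock(&vcpu->vm->slots_lock);	/* inner */
	ret = ++vcpu->vm->nr_memslots;
	pthread_mutex_unlock(&vcpu->vm->slots_lock);
	pthread_mutex_unlock(&vcpu->mutex);

	return ret;
}
```
]]>

Any path that took the two locks in the opposite order could deadlock
against toy_vcpu_add_memslot(), which is why a single ordering has to be
documented and followed everywhere.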

Opportunistically document that vcpu-&gt;mutex is also taken outside of
slots_arch_lock, e.g. when allocating shadow roots on x86 (which is the
entire reason slots_arch_lock exists, as shadow roots must be allocated
while holding kvm-&gt;srcu)

  kvm_mmu_new_pgd()
  |
  -&gt; kvm_mmu_reload()
     |
     -&gt; kvm_mmu_load()
        |
        -&gt; mmu_alloc_shadow_roots()
           |
           -&gt; mmu_first_shadow_root_alloc()

but also when manipulating memslots in vCPU context, e.g. when inhibiting
the APIC-access page via the aforementioned kvm_inhibit_apic_access_page()

  kvm_inhibit_apic_access_page()
  |
  -&gt; __x86_set_memory_region()
     |
     -&gt; kvm_set_internal_memslot()
        |
        -&gt; kvm_set_memory_region()
           |
           -&gt; kvm_set_memslot()

Cc: Oliver Upton &lt;oliver.upton@linux.dev&gt;
Cc: Marc Zyngier &lt;maz@kernel.org&gt;
Link: https://patch.msgid.link/20260302170239.596810-1-seanjc@google.com
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
</content>
</entry>
<entry>
<title>KVM: x86/mmu: Don't force atomic update if only the Accessed bit is volatile</title>
<updated>2025-02-14T15:16:45+00:00</updated>
<author>
<name>James Houghton</name>
<email>jthoughton@google.com</email>
</author>
<published>2025-02-04T00:40:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=61d65f2dc766c70673d45a4b787f49317384642c'/>
<id>urn:sha1:61d65f2dc766c70673d45a4b787f49317384642c</id>
<content type='text'>
Don't force SPTE modifications to be done atomically if the only volatile
bit in the SPTE is the Accessed bit.  KVM and the primary MMU tolerate
stale aging state, and the probability of an Accessed bit A/D assist being
clobbered *and* affecting aging is likely far lower than the probability
of consuming stale information due to not flushing TLBs when aging.

Rename spte_has_volatile_bits() to spte_needs_atomic_update() to better
capture the nature of the helper.

Opportunistically do s/write/update on the TDP MMU wrapper, as it's not
simply the "write" that needs to be done atomically, it's the entire
update, i.e. the entire read-modify-write operation needs to be done
atomically so that KVM has an accurate view of the old SPTE.
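
To illustrate why the whole read-modify-write matters, here is a minimal
userspace sketch using C11 atomics rather than KVM's helpers.  The
TOY_SPTE_DIRTY bit position and the toy_* function names are assumptions
made up for this example and do not reflect the real SPTE layout.

<![CDATA[
```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit position, chosen only for this sketch. */
#define TOY_SPTE_DIRTY	(UINT64_C(1) << 6)

/*
 * Non-atomic update: the load and the store are separate operations.  If
 * another agent (e.g. hardware) sets a bit between them, the returned
 * "old" value does not reflect what was actually overwritten.
 */
static uint64_t toy_update_spte_nonatomic(_Atomic uint64_t *sptep,
					  uint64_t new_spte)
{
	uint64_t old_spte = atomic_load(sptep);

	atomic_store(sptep, new_spte);
	return old_spte;
}

/*
 * Atomic update: a single exchange both installs the new value and
 * returns exactly the value that was replaced, so no concurrently set
 * bit can be missed.
 */
static uint64_t toy_update_spte_atomic(_Atomic uint64_t *sptep,
				       uint64_t new_spte)
{
	return atomic_exchange(sptep, new_spte);
}
```
]]>

The non-atomic variant is cheaper and is sufficient when losing a
concurrently set bit is tolerable, which is the argument made above for
the Accessed bit.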

Leave kvm_tdp_mmu_write_spte_atomic() as is.  While the name is imperfect,
it pairs with kvm_tdp_mmu_write_spte(), which in turn pairs with
kvm_tdp_mmu_read_spte().  And renaming all of those isn't obviously a net
positive, and would require significant churn.

Signed-off-by: James Houghton &lt;jthoughton@google.com&gt;
Link: https://lore.kernel.org/r/20250204004038.1680123-6-jthoughton@google.com
Co-developed-by: Sean Christopherson &lt;seanjc@google.com&gt;
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
</content>
</entry>
<entry>
<title>Documentation: KVM: fix malformed table</title>
<updated>2024-11-13T12:20:01+00:00</updated>
<author>
<name>Paolo Bonzini</name>
<email>pbonzini@redhat.com</email>
</author>
<published>2024-11-13T12:19:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=35ff7bfb04af6d07af677357a15493ca0fbd739e'/>
<id>urn:sha1:35ff7bfb04af6d07af677357a15493ca0fbd739e</id>
<content type='text'>
Reported-by: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Fixes: 5f6a3badbb74 ("KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs")
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Drop @atomic param from gfn=&gt;pfn and hva=&gt;pfn APIs</title>
<updated>2024-10-25T16:57:58+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2024-10-10T18:23:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e2d2ca71ac03c748dbc44e0dd7dc1557befb1ab6'/>
<id>urn:sha1:e2d2ca71ac03c748dbc44e0dd7dc1557befb1ab6</id>
<content type='text'>
Drop @atomic from the myriad "to_pfn" APIs now that all callers pass
"false", and remove a comment blurb about KVM running only the "GUP fast"
part in atomic context.

No functional change intended.

Reviewed-by: Alex Bennée &lt;alex.bennee@linaro.org&gt;
Tested-by: Alex Bennée &lt;alex.bennee@linaro.org&gt;
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Tested-by: Dmitry Osipenko &lt;dmitry.osipenko@collabora.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Message-ID: &lt;20241010182427.1434605-13-seanjc@google.com&gt;
</content>
</entry>
<entry>
<title>KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs</title>
<updated>2024-10-25T16:54:42+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2024-10-10T18:23:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5f6a3badbb74231aaf2dc9996d689c538101ffb6'/>
<id>urn:sha1:5f6a3badbb74231aaf2dc9996d689c538101ffb6</id>
<content type='text'>
Now that KVM doesn't clobber Accessed bits of shadow-present SPTEs,
e.g. when prefetching, mark folios as accessed only when zapping leaf
SPTEs, which is a rough heuristic for "only in response to an mmu_notifier
invalidation".  Page aging and LRUs are tolerant of false negatives, i.e.
KVM doesn't need to be precise for correctness, and re-marking folios as
accessed when zapping entire roots or when zapping collapsible SPTEs is
expensive and adds very little value.

E.g. when a VM is dying, all of its memory is being freed; marking folios
accessed at that time provides no known value.  Similarly, because KVM
marks folios as accessed when creating SPTEs, marking all folios as
accessed when userspace happens to delete a memslot doesn't add value.
The folio was marked accessed when the old SPTE was created, and will be
marked accessed yet again if a vCPU accesses the pfn again after reloading
a new root.  Zapping collapsible SPTEs is a similar story; marking folios
accessed just because userspace disables dirty logging is a side effect of
KVM behavior, not a deliberate goal.

As an intermediate step, a.k.a. bisection point, towards *never* marking
folios accessed when dropping SPTEs, mark folios accessed when the primary
MMU might be invalidating mappings, as such zappings are not KVM initiated,
i.e. might actually be related to page aging and LRU activity.

Note, x86 is the only KVM architecture that "double dips"; every other
arch marks pfns as accessed only when mapping into the guest, not when
mapping into the guest _and_ when removing from the guest.

Tested-by: Alex Bennée &lt;alex.bennee@linaro.org&gt;
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Tested-by: Dmitry Osipenko &lt;dmitry.osipenko@collabora.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Message-ID: &lt;20241010182427.1434605-10-seanjc@google.com&gt;
</content>
</entry>
<entry>
<title>KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic</title>
<updated>2024-10-20T11:05:51+00:00</updated>
<author>
<name>Dr. David Alan Gilbert</name>
<email>linux@treblig.org</email>
</author>
<published>2024-10-01T14:13:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=bc07eea2f3b330127242df2e0ec2d6cd16b4f2e8'/>
<id>urn:sha1:bc07eea2f3b330127242df2e0ec2d6cd16b4f2e8</id>
<content type='text'>
The last use of kvm_vcpu_gfn_to_pfn_atomic was removed by commit
1bbc60d0c7e5 ("KVM: x86/mmu: Remove MMU auditing").

Remove it.

Signed-off-by: Dr. David Alan Gilbert &lt;linux@treblig.org&gt;
Message-ID: &lt;20241001141354.18009-3-linux@treblig.org&gt;
[Adjust Documentation/virt/kvm/locking.rst. - Paolo]
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>Documentation: KVM: fix warning in "make htmldocs"</title>
<updated>2024-09-27T15:45:50+00:00</updated>
<author>
<name>Paolo Bonzini</name>
<email>pbonzini@redhat.com</email>
</author>
<published>2024-09-27T15:45:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=efbc6bd090f48ccf64f7a8dd5daea775821d57ec'/>
<id>urn:sha1:efbc6bd090f48ccf64f7a8dd5daea775821d57ec</id>
<content type='text'>
The warning

 Documentation/virt/kvm/locking.rst:31: ERROR: Unexpected indentation.

is caused by incorrectly treating a line as the continuation of a paragraph,
rather than as the first line in a bullet list.

Fixed: 44d174596260 ("KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlock")
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Register cpuhp and syscore callbacks when enabling hardware</title>
<updated>2024-09-04T15:02:33+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2024-08-30T04:35:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9a798b1337afe4fdcce53efa77953b068d8614f5'/>
<id>urn:sha1:9a798b1337afe4fdcce53efa77953b068d8614f5</id>
<content type='text'>
Register KVM's cpuhp and syscore callbacks when enabling virtualization
in hardware instead of registering the callbacks during initialization,
and let the CPU up/down framework invoke the inner enable/disable
functions.  Registering the callbacks during initialization makes things
more complex than they need to be, as KVM needs to be very careful about
handling races between CPUs being onlined/offlined and hardware being
enabled/disabled.

Intel TDX support will require KVM to enable virtualization during KVM
initialization, i.e. will add another wrinkle to things, at which point
sorting out the potential races with kvm_usage_count would become even
more complex.

Note, using the cpuhp framework has a subtle behavioral change: enabling
will be done serially across all CPUs, whereas KVM currently sends an IPI
to all CPUs in parallel.  While serializing virtualization enabling could
create undesirable latency, the issue is limited to the 0=&gt;1 transition of
VM creation.  And even that can be mitigated, e.g. by letting userspace
force virtualization to be enabled when KVM is initialized.

Cc: Chao Gao &lt;chao.gao@intel.com&gt;
Reviewed-by: Kai Huang &lt;kai.huang@intel.com&gt;
Acked-by: Kai Huang &lt;kai.huang@intel.com&gt;
Tested-by: Farrah Chen &lt;farrah.chen@intel.com&gt;
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Message-ID: &lt;20240830043600.127750-3-seanjc@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlock</title>
<updated>2024-09-04T15:02:33+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2024-08-30T04:35:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=44d17459626052a2390457e550a12cb973506b2f'/>
<id>urn:sha1:44d17459626052a2390457e550a12cb973506b2f</id>
<content type='text'>
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
on x86 due to a chain of locks and SRCU synchronizations.  Translating the
below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
fairness of r/w semaphores).

    CPU0                     CPU1                     CPU2
1   lock(&amp;kvm-&gt;slots_lock);
2                                                     lock(&amp;vcpu-&gt;mutex);
3                                                     lock(&amp;kvm-&gt;srcu);
4                            lock(cpu_hotplug_lock);
5                            lock(kvm_lock);
6                            lock(&amp;kvm-&gt;slots_lock);
7                                                     lock(cpu_hotplug_lock);
8   sync(&amp;kvm-&gt;srcu);
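
The shape of the fix can be sketched in userspace as follows.  This is a
simplified illustration under stated assumptions, not the kernel
implementation: toy_usage_lock, toy_usage_count, toy_hw_enabled and the
two helpers are hypothetical names, and a pthread mutex stands in for the
dedicated kernel mutex.

<![CDATA[
```c
#include <assert.h>
#include <pthread.h>

/*
 * Dedicated lock that protects only the usage count.  Because it is not
 * shared with anything else, it adds no edge to the lock chain above.
 */
static pthread_mutex_t toy_usage_lock = PTHREAD_MUTEX_INITIALIZER;
static int toy_usage_count;
static int toy_hw_enabled;	/* models "virtualization enabled in hardware" */

/* Called when a VM is created; enables hardware on the 0=>1 transition. */
static void toy_enable_virtualization(void)
{
	pthread_mutex_lock(&toy_usage_lock);
	if (toy_usage_count++ == 0)
		toy_hw_enabled = 1;
	pthread_mutex_unlock(&toy_usage_lock);
}

/* Called when a VM is destroyed; disables hardware on the 1=>0 transition. */
static void toy_disable_virtualization(void)
{
	pthread_mutex_lock(&toy_usage_lock);
	assert(toy_usage_count > 0);
	if (--toy_usage_count == 0)
		toy_hw_enabled = 0;
	pthread_mutex_unlock(&toy_usage_lock);
}
```
]]>

Keeping the count under its own narrowly scoped lock means the count can
be updated without holding a lock that is also taken around unrelated
work.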

Note, there are likely more potential deadlocks in KVM x86, e.g. the same
pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
__kvmclock_cpufreq_notifier():

  cpuhp_cpufreq_online()
  |
  -&gt; cpufreq_online()
     |
     -&gt; cpufreq_gov_performance_limits()
        |
        -&gt; __cpufreq_driver_target()
           |
           -&gt; __target_index()
              |
              -&gt; cpufreq_freq_transition_begin()
                 |
                 -&gt; cpufreq_notify_transition()
                    |
                    -&gt; ... __kvmclock_cpufreq_notifier()

But, actually triggering such deadlocks is beyond rare due to the
combination of dependencies and timings involved.  E.g. the cpufreq
notifier is only used on older CPUs without a constant TSC, mucking with
the NX hugepage mitigation while VMs are running is very uncommon, and
doing so while also onlining/offlining a CPU (necessary to generate
contention on cpu_hotplug_lock) would be even more unusual.

The most robust solution to the general cpu_hotplug_lock issue is likely
to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
notifier doesn't need to take kvm_lock.  For now, settle for fixing the most
blatant deadlock, as switching to an RCU-protected list is a much more
involved change, but add a comment in locking.rst to call out that care
needs to be taken when holding kvm_lock and walking vm_list.

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S         O
  ------------------------------------------------------
  tee/35048 is trying to acquire lock:
  ff6a80eced71e0a8 (&amp;kvm-&gt;slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]

  but task is already holding lock:
  ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]

  which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

  -&gt; #3 (kvm_lock){+.+.}-{3:3}:
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         kvm_dev_ioctl+0x4fb/0xe50 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -&gt; #2 (cpu_hotplug_lock){++++}-{0:0}:
         cpus_read_lock+0x2e/0xb0
         static_key_slow_inc+0x16/0x30
         kvm_lapic_set_base+0x6a/0x1c0 [kvm]
         kvm_set_apic_base+0x8f/0xe0 [kvm]
         kvm_set_msr_common+0x9ae/0xf80 [kvm]
         vmx_set_msr+0xa54/0xbe0 [kvm_intel]
         __kvm_set_msr+0xb6/0x1a0 [kvm]
         kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
         kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -&gt; #1 (&amp;kvm-&gt;srcu){.+.+}-{0:0}:
         __synchronize_srcu+0x44/0x1a0
         synchronize_srcu_expedited+0x21/0x30
         kvm_swap_active_memslots+0x110/0x1c0 [kvm]
         kvm_set_memslot+0x360/0x620 [kvm]
         __kvm_set_memory_region+0x27b/0x300 [kvm]
         kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
         kvm_vm_ioctl+0x295/0x650 [kvm]
         __se_sys_ioctl+0x7b/0xd0
         __x64_sys_ioctl+0x21/0x30
         x64_sys_call+0x15d0/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -&gt; #0 (&amp;kvm-&gt;slots_lock){+.+.}-{3:3}:
         __lock_acquire+0x15ef/0x2e30
         lock_acquire+0xe0/0x260
         __mutex_lock+0x6a/0xb40
         mutex_lock_nested+0x1f/0x30
         set_nx_huge_pages+0x179/0x1e0 [kvm]
         param_attr_store+0x93/0x100
         module_attr_store+0x22/0x40
         sysfs_kf_write+0x81/0xb0
         kernfs_fop_write_iter+0x133/0x1d0
         vfs_write+0x28d/0x380
         ksys_write+0x70/0xe0
         __x64_sys_write+0x1f/0x30
         x64_sys_call+0x281b/0x2e60
         do_syscall_64+0x83/0x160
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: Chao Gao &lt;chao.gao@intel.com&gt;
Fixes: 0bf50497f03b ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang &lt;kai.huang@intel.com&gt;
Acked-by: Kai Huang &lt;kai.huang@intel.com&gt;
Tested-by: Farrah Chen &lt;farrah.chen@intel.com&gt;
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Message-ID: &lt;20240830043600.127750-2-seanjc@google.com&gt;
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
</entry>
<entry>
<title>KVM: x86/mmu: always take tdp_mmu_pages_lock</title>
<updated>2023-12-01T15:52:08+00:00</updated>
<author>
<name>Paolo Bonzini</name>
<email>pbonzini@redhat.com</email>
</author>
<published>2023-11-25T08:33:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=250ce1b4d21a94f910c3df5141ff6434ea92524e'/>
<id>urn:sha1:250ce1b4d21a94f910c3df5141ff6434ea92524e</id>
<content type='text'>
It is cheap to take tdp_mmu_pages_lock in all write-side critical sections.
We already do it all the time when zapping with read_lock(), so it is not
a problem to do it from the kvm_tdp_mmu_zap_all() path (aka
kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier release).

Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
</content>
</entry>
</feed>
