author Linus Torvalds <torvalds@linux-foundation.org> 2026-05-09 05:42:10 +0300
committer Linus Torvalds <torvalds@linux-foundation.org> 2026-05-09 05:42:10 +0300
commit 7f0023215262221ca08d56be2203e8a4770be033
tree 33c8dec5486e41d6b1e29dbaab172b3cbc0ddc41 /Documentation
parent e5cf0260a7472b4f34a46c418c14bec272aac404
parent 9f6d929ee2c6f0266edb564bcd2bd47fd6e884a8
Merge tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Fix spurious failures in rseq self-tests (Mark Brown)

 - Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative
   use of the supposedly read-only field

   The fix is to introduce a new ABI variant based on a new (larger) rseq
   area registration size, to keep the TCMalloc use of rseq backwards
   compatible on new kernels (Thomas Gleixner)

 - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot)

 - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng)

* tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix wakeup_preempt_fair() for not waking up task
  sched/fair: Fix overflow in vruntime_eligible()
  selftests/rseq: Expand for optimized RSEQ ABI v2
  rseq: Reenable performance optimizations conditionally
  rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
  selftests/rseq: Validate legacy behavior
  selftests/rseq: Make registration flexible for legacy and optimized mode
  selftests/rseq: Skip tests if time slice extensions are not available
  rseq: Revert to historical performance killing behaviour
  rseq: Don't advertise time slice extensions if disabled
  rseq: Protect rseq_reset() against interrupts
  rseq: Set rseq::cpu_id_start to 0 on unregistration
  selftests/rseq: Don't run tests with runner scripts outside of the scripts
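
For illustration only (this is not part of the merged changes): the pull message ties the new ABI variant to a larger rseq area registration size, and the documentation hunk in this commit says the size is advertised via getauxval(AT_RSEQ_FEATURE_SIZE). A minimal C sketch of how user space might choose the registration length under that rule follows. The helper names rseq_registration_len() and rseq_query_registration_len() are hypothetical, and the fallback to 32 bytes when the kernel advertises nothing larger is an assumption of this sketch, not something the commit specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/auxv.h>

/* Fallback for older libc headers; the value matches <linux/auxvec.h>. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif

/* Original rseq area size in bytes, as named in the documentation hunk. */
#ifndef RSEQ_ORIG_SIZE
#define RSEQ_ORIG_SIZE 32
#endif

/*
 * Hypothetical helper: choose the length to register the rseq area with.
 * If the kernel advertises a feature size larger than the original
 * 32 bytes, use it; otherwise (including getauxval() returning 0 because
 * the entry is absent) fall back to the original size. The fallback
 * policy is an assumption made for this sketch.
 */
static size_t rseq_registration_len(unsigned long feature_size)
{
	if (feature_size > RSEQ_ORIG_SIZE)
		return (size_t)feature_size;
	return RSEQ_ORIG_SIZE;
}

/* Hypothetical wrapper: query the auxiliary vector and apply the rule. */
static size_t rseq_query_registration_len(void)
{
	size_t len = rseq_registration_len(getauxval(AT_RSEQ_FEATURE_SIZE));

	/* The result is never smaller than the original area size. */
	assert(len >= RSEQ_ORIG_SIZE);
	return len;
}
```

The sketch only computes a length and deliberately does not call the rseq system call, because the C library may already have registered an rseq area for the thread. A caller doing its own registration would also need an area that is suitably sized and aligned for the chosen length.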
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/userspace-api/rseq.rst | 94
1 files changed, 93 insertions, 1 deletions
diff --git a/Documentation/userspace-api/rseq.rst b/Documentation/userspace-api/rseq.rst
index 3cd27a3c7c7e..8549a6c61531 100644
--- a/Documentation/userspace-api/rseq.rst
+++ b/Documentation/userspace-api/rseq.rst
@@ -24,6 +24,97 @@ Quick access to CPU number, node ID
Allows to implement per CPU data efficiently. Documentation is in code and
selftests. :(
+Optimized RSEQ V2
+-----------------
+
+On architectures which utilize the generic entry code and generic TIF bits
+the kernel supports runtime optimizations for RSEQ, which also enable
+enhanced features like scheduler time slice extensions.
+
+To enable them a task has to register the RSEQ region with at least the
+length advertised by getauxval(AT_RSEQ_FEATURE_SIZE).
+
+If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel
+keeps the legacy low performance mode enabled to fulfil the expectations
+of existing users regarding the original RSEQ implementation behaviour.
+
+The following table documents the ABI and behavioral guarantees of the
+legacy and the optimized V2 mode.
+
+.. list-table:: RSEQ modes
+   :header-rows: 1
+
+   * - Nr
+     - What
+
+     - Legacy
+     - Optimized V2
+
+   * - 1
+     - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read
+       only)
+     .. Legacy
+     - Updated by the kernel unconditionally after each context switch and
+       before signal delivery
+     .. Optimized V2
+     - Updated by the kernel if and only if they change, i.e. if the task
+       is migrated or mm_cid changes
+
+   * - 2
+     - The rseq_cs critical section field
+     .. Legacy
+     - Evaluated and handled unconditionally after each context switch and
+       before signal delivery
+     .. Optimized V2
+     - Evaluated and handled conditionally only when user space was
+       interrupted and was scheduled out or before delivering a signal in
+       the interrupted context.
+
+   * - 3
+     - Read only fields
+     .. Legacy
+     - No strict enforcement except in debug mode
+     .. Optimized V2
+     - Strict enforcement
+
+   * - 4
+     - membarrier(...RSEQ)
+     .. Legacy
+     - All running threads of the process are interrupted, the ID fields
+       are rewritten, and any active critical sections are aborted
+       before they return to user space. All threads which are scheduled
+       out, whether voluntarily or not, are covered by #1/#2 above.
+     .. Optimized V2
+     - All running threads of the process are interrupted and any
+       active critical sections are aborted before these threads return to
+       user space. The ID fields are only updated if changed as a
+       consequence of the interrupt. All threads which are scheduled out,
+       whether voluntarily or not, are covered by #1/#2 above.
+
+   * - 5
+     - Time slice extensions
+     .. Legacy
+     - Not supported
+     .. Optimized V2
+     - Supported
+
+The legacy mode is less performant because it does unconditional
+updates and critical section checks even when the ABI contract does not
+strictly require them. That can't be changed anymore, as some users depend
+on the observed behavior, which in turn enables them to violate the ABI and
+overwrite the cpu_id_start field for their own purposes. This is strongly
+discouraged as it renders RSEQ incompatible with the intended usage and
+breaks the expectations of other libraries in the same application.
+
+The ABI compliant optimized V2 mode, which respects the read only fields,
+does not require unconditional updates and is therefore considerably more
+performant. The kernel validates the read only fields for compliance. If
+user space modifies them, the process is killed. Compliant usage allows
+multiple libraries in the same application to benefit from the RSEQ
+functionality without disturbing each other. The ABI compliant optimized V2
+mode also enables extended RSEQ features like time slice extensions.
+
+
Scheduler time slice extensions
-------------------------------
@@ -37,7 +128,8 @@ The prerequisites for this functionality are:
* Enabled at boot time (default is enabled)
- * A rseq userspace pointer has been registered for the thread
+ * A rseq userspace pointer has been registered for the thread in
+ optimized V2 mode
The thread has to enable the functionality via prctl(2)::