1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
|
.. SPDX-License-Identifier: GPL-2.0
.. _kernel_hacking_locktypes:
==========================
Lock types and their rules
==========================
Introduction
============
The kernel provides a variety of locking primitives which can be divided
into two categories:
- Sleeping locks
- Spinning locks
This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.
Lock categories
===============
Sleeping locks
--------------
Sleeping locks can only be acquired in preemptible task context.
Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock(). Furthermore, it is also necessary to evaluate the debugging
versions of these primitives. In short, don't acquire sleeping locks from
other contexts unless there is no other option.
Sleeping lock types:
- mutex
- rt_mutex
- semaphore
- rw_semaphore
- ww_mutex
- percpu_rw_semaphore
On PREEMPT_RT kernels, these lock types are converted to sleeping locks:
- spinlock_t
- rwlock_t
Spinning locks
--------------
- raw_spinlock_t
- bit spinlocks
On non-PREEMPT_RT kernels, these lock types are also spinning locks:
- spinlock_t
- rwlock_t
Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:
=================== ====================================================
_bh() Disable / enable bottom halves (soft interrupts)
_irq() Disable / enable interrupts
_irqsave/restore() Save and disable / restore interrupt disabled state
=================== ====================================================
Owner semantics
===============
The aforementioned lock types except semaphores have strict owner
semantics:
The context (task) that acquired the lock must release it.
rw_semaphores have a special interface which allows non-owner release for
readers.
rtmutex
=======
RT-mutexes are mutexes with support for priority inheritance (PI).
PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.
PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts. This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.
semaphore
=========
semaphore is a counting semaphore implementation.
Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.
semaphores and PREEMPT_RT
----------------------------
PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores. After all, an unknown
owner cannot be boosted. As a consequence, blocking on semaphores can
result in priority inversion.
rw_semaphore
============
rw_semaphore is a multiple readers and single writer lock mechanism.
On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.
rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independent of the kernel configuration.
rw_semaphore and PREEMPT_RT
---------------------------
PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:
Because an rw_semaphore writer cannot grant its priority to multiple
readers, a preempted low-priority reader will continue holding its lock,
thus starving even high-priority writers. In contrast, because readers
can grant their priority to a writer, a preempted low-priority writer will
have its priority boosted until it releases the lock, thus preventing that
writer from starving readers.
raw_spinlock_t and spinlock_t
=============================
raw_spinlock_t
--------------
raw_spinlock_t is a strict spinning lock implementation regardless of the
kernel configuration including PREEMPT_RT enabled kernels.
raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state. raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.
spinlock_t
----------
The semantics of spinlock_t change with the state of PREEMPT_RT.
On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.
spinlock_t and PREEMPT_RT
-------------------------
On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:
- Preemption is not disabled.
- The hard interrupt related suffixes for spin_lock / spin_unlock
operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
interrupt disabled state.
- The soft interrupt related suffix (_bh()) still disables softirq
handlers.
Non-PREEMPT_RT kernels disable preemption to get this effect.
PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
preemption disabled. The lock disables softirq handlers and also
prevents reentrancy due to task preemption.
PREEMPT_RT kernels preserve all other spinlock_t semantics:
- Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels
avoid migration by disabling preemption. PREEMPT_RT kernels instead
disable migration, which ensures that pointers to per-CPU variables
remain valid even if the task is preempted.
- Task state is preserved across spinlock acquisition, ensuring that the
task-state rules apply to all kernel configurations. Non-PREEMPT_RT
kernels leave task state untouched. However, PREEMPT_RT must change
task state if the task blocks during acquisition. Therefore, it saves
the current task state before blocking and the corresponding lock wakeup
restores it, as shown below::
task->state = TASK_INTERRUPTIBLE
lock()
block()
task->saved_state = task->state
task->state = TASK_UNINTERRUPTIBLE
schedule()
lock wakeup
task->state = task->saved_state
Other types of wakeups would normally unconditionally set the task state
to RUNNING, but that does not work here because the task must remain
blocked until the lock becomes available. Therefore, when a non-lock
wakeup attempts to awaken a task blocked waiting for a spinlock, it
instead sets the saved state to RUNNING. Then, when the lock
acquisition completes, the lock wakeup sets the task state to the saved
state, in this case setting it to RUNNING::
task->state = TASK_INTERRUPTIBLE
lock()
block()
task->saved_state = task->state
task->state = TASK_UNINTERRUPTIBLE
schedule()
non lock wakeup
task->saved_state = TASK_RUNNING
lock wakeup
task->state = task->saved_state
This ensures that the real wakeup cannot be lost.
rwlock_t
========
rwlock_t is a multiple readers and single writer lock mechanism.
Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.
rwlock_t and PREEMPT_RT
-----------------------
PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:
- All the spinlock_t changes also apply to rwlock_t.
- Because an rwlock_t writer cannot grant its priority to multiple
readers, a preempted low-priority reader will continue holding its lock,
thus starving even high-priority writers. In contrast, because readers
can grant their priority to a writer, a preempted low-priority writer
will have its priority boosted until it releases the lock, thus
preventing that writer from starving readers.
PREEMPT_RT caveats
==================
spinlock_t and rwlock_t
-----------------------
These changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications. For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::
local_irq_disable();
spin_lock(&lock);
and is fully equivalent to::
spin_lock_irq(&lock);
Same applies to rwlock_t and the _irqsave() suffix variants.
On PREEMPT_RT kernel this code sequence breaks because RT-mutex requires a
fully preemptible context. Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts. In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.
raw_spinlock_t
--------------
Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t, for example, the critical section must avoid
allocating memory. Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::
raw_spin_lock(&lock);
p = kmalloc(sizeof(*p), GFP_ATOMIC);
But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts. However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::
spin_lock(&lock);
p = kmalloc(sizeof(*p), GFP_ATOMIC);
bit spinlocks
-------------
PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex. Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.
Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site. In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implemementation
enable the compiler to do the substitution transparently.
Lock type nesting rules
=======================
The most basic rules are:
- Lock types of the same lock category (sleeping, spinning) can nest
arbitrarily as long as they respect the general lock ordering rules to
prevent deadlocks.
- Sleeping lock types cannot nest inside spinning lock types.
- Spinning lock types can nest inside sleeping lock types.
These constraints apply both in PREEMPT_RT and otherwise.
The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping means that they cannot be acquired while
holding a raw spinlock. This results in the following nesting ordering:
1) Sleeping locks
2) spinlock_t and rwlock_t
3) raw_spinlock_t and bit spinlocks
Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
|