Documentation/core-api/pin_user_pages.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279

.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions  (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each by a special
value: GUP_PIN_COUNTING_BIAS.

For huge pages (and in fact, any compound page of more than 2 pages), the
GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
is achieved, by using the 3rd struct page in the compound page. A new struct
page field, hpage_pinned_refcount, has been added in order to support this.

This approach for compound pages avoids the counting upper limit problems that
are discussed below. Those limitations would have been aggravated severely by
huge pages, because each tail page adds a refcount to the head page. And in
fact, testing revealed that, without a separate hpage_pinned_refcount field,
page overflows were seen in some huge page stress tests.

This also means that huge pages and compound pages (of order > 1) do not suffer
from the false positives problem that is mentioned below.::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1.::

 Function
 --------
 get_user_pages           FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote    FOLL_GET is sometimes set internally by this function.

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of", means that,
  rather than dividing page->_refcount into bit fields, we simple add a medium-
  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
  on it 1024 times, then it will appear to have a single dma-pinned count.
  And again, that's acceptable.

This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_page() and related, must be used.

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback , stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as for example explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

        static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/vm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

    /proc/vmstat/nr_foll_pin_acquired
    /proc/vmstat/nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Becaused of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)

Other diagnostics
=================

dump_page() has been enhanced slightly, to handle these new counting fields, and
to better report on compound pages in general. Specifically, for compound pages
with order > 1, the exact (hpage_pinned_refcount) pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019