summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/00-INDEX2
-rw-r--r--Documentation/RCU/00-INDEX2
-rw-r--r--Documentation/RCU/Design/Data-Structures/Data-Structures.html233
-rw-r--r--Documentation/RCU/Design/Data-Structures/nxtlist.svg34
-rw-r--r--Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html47
-rw-r--r--Documentation/RCU/Design/Requirements/Requirements.html195
-rw-r--r--Documentation/RCU/rcu_dereference.txt9
-rw-r--r--Documentation/RCU/rculist_nulls.txt6
-rw-r--r--Documentation/RCU/stallwarn.txt190
-rw-r--r--Documentation/RCU/whatisRCU.txt32
-rw-r--r--Documentation/admin-guide/LSM/LoadPin.rst (renamed from Documentation/security/LoadPin.txt)12
-rw-r--r--Documentation/admin-guide/LSM/SELinux.rst (renamed from Documentation/security/SELinux.txt)18
-rw-r--r--Documentation/admin-guide/LSM/Smack.rst (renamed from Documentation/security/Smack.txt)273
-rw-r--r--Documentation/admin-guide/LSM/Yama.rst (renamed from Documentation/security/Yama.txt)55
-rw-r--r--Documentation/admin-guide/LSM/apparmor.rst (renamed from Documentation/security/apparmor.txt)36
-rw-r--r--Documentation/admin-guide/LSM/index.rst (renamed from Documentation/security/LSM.txt)28
-rw-r--r--Documentation/admin-guide/LSM/tomoyo.rst (renamed from Documentation/security/tomoyo.txt)22
-rw-r--r--Documentation/admin-guide/index.rst1
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt29
-rw-r--r--Documentation/arm64/tagged-pointers.txt62
-rw-r--r--Documentation/block/bfq-iosched.txt17
-rw-r--r--Documentation/cgroup-v2.txt12
-rw-r--r--Documentation/core-api/assoc_array.rst5
-rw-r--r--Documentation/crypto/asymmetric-keys.txt2
-rw-r--r--Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.txt31
-rw-r--r--Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt1
-rw-r--r--Documentation/devicetree/bindings/arm/mediatek/mediatek,imgsys.txt1
-rw-r--r--Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt1
-rw-r--r--Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.txt1
-rw-r--r--Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt1
-rw-r--r--Documentation/devicetree/bindings/arm/mediatek/mediatek,vdecsys.txt1
-rw-r--r--Documentation/devicetree/bindings/arm/mediatek/mediatek,vencsys.txt3
-rw-r--r--Documentation/devicetree/bindings/clock/idt,versaclock5.txt16
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.txt (renamed from Documentation/devicetree/bindings/clock/rockchip,rk1108-cru.txt)12
-rw-r--r--Documentation/devicetree/bindings/clock/sunxi-ccu.txt18
-rw-r--r--Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt2
-rw-r--r--Documentation/devicetree/bindings/iommu/arm,smmu.txt28
-rw-r--r--Documentation/devicetree/bindings/mtd/atmel-nand.txt107
-rw-r--r--Documentation/devicetree/bindings/mtd/denali-nand.txt7
-rw-r--r--Documentation/devicetree/bindings/mtd/gpio-control-nand.txt4
-rw-r--r--Documentation/devicetree/bindings/mtd/stm32-quadspi.txt43
-rw-r--r--Documentation/devicetree/bindings/power/power_domain.txt2
-rw-r--r--Documentation/devicetree/bindings/power/supply/axp20x_battery.txt20
-rw-r--r--Documentation/devicetree/bindings/powerpc/ibm,powerpc-cpu-features.txt248
-rw-r--r--Documentation/devicetree/bindings/pwm/atmel-pwm.txt1
-rw-r--r--Documentation/devicetree/bindings/pwm/nvidia,tegra20-pwm.txt45
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-mediatek.txt34
-rw-r--r--Documentation/devicetree/bindings/rtc/cpcap-rtc.txt18
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-sh.txt28
-rw-r--r--Documentation/devicetree/bindings/soc/fsl/cpm_qe/gpio.txt21
-rw-r--r--Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.txt32
-rw-r--r--Documentation/devicetree/bindings/thermal/brcm,ns-thermal37
-rw-r--r--Documentation/devicetree/bindings/thermal/da9062-thermal.txt36
-rw-r--r--Documentation/devicetree/bindings/trivial-devices.txt1
-rw-r--r--Documentation/devicetree/bindings/usb/da8xx-usb.txt41
-rw-r--r--Documentation/devicetree/bindings/vendor-prefixes.txt2
-rw-r--r--Documentation/filesystems/bfs.txt2
-rw-r--r--Documentation/filesystems/nfs/idmapper.txt2
-rw-r--r--Documentation/filesystems/nfs/pnfs.txt37
-rw-r--r--Documentation/filesystems/overlayfs.txt9
-rw-r--r--Documentation/ioctl/ioctl-number.txt1
-rw-r--r--Documentation/kbuild/makefiles.txt74
-rw-r--r--Documentation/memory-barriers.txt2
-rw-r--r--Documentation/networking/dns_resolver.txt2
-rw-r--r--Documentation/security/00-INDEX26
-rw-r--r--Documentation/security/IMA-templates.rst (renamed from Documentation/security/IMA-templates.txt)46
-rw-r--r--Documentation/security/LSM.rst14
-rw-r--r--Documentation/security/conf.py8
-rw-r--r--Documentation/security/credentials.rst (renamed from Documentation/security/credentials.txt)275
-rw-r--r--Documentation/security/index.rst8
-rw-r--r--Documentation/security/keys/core.rst (renamed from Documentation/security/keys.txt)314
-rw-r--r--Documentation/security/keys/ecryptfs.rst (renamed from Documentation/security/keys-ecryptfs.txt)19
-rw-r--r--Documentation/security/keys/index.rst11
-rw-r--r--Documentation/security/keys/request-key.rst (renamed from Documentation/security/keys-request-key.txt)73
-rw-r--r--Documentation/security/keys/trusted-encrypted.rst (renamed from Documentation/security/keys-trusted-encrypted.txt)32
-rw-r--r--Documentation/security/self-protection.rst (renamed from Documentation/security/self-protection.txt)99
-rwxr-xr-xDocumentation/target/target-export-device80
-rw-r--r--Documentation/tee.txt118
-rw-r--r--Documentation/thermal/sysfs-api.txt21
-rw-r--r--Documentation/userspace-api/index.rst2
-rw-r--r--Documentation/userspace-api/no_new_privs.rst (renamed from Documentation/prctl/no_new_privs.txt)44
-rw-r--r--Documentation/userspace-api/seccomp_filter.rst (renamed from Documentation/prctl/seccomp_filter.txt)116
-rw-r--r--Documentation/userspace-api/unshare.rst2
-rw-r--r--Documentation/virtual/kvm/devices/arm-vgic-its.txt121
-rw-r--r--Documentation/virtual/kvm/devices/arm-vgic-v3.txt6
-rw-r--r--Documentation/x86/intel_rdt_ui.txt2
86 files changed, 2652 insertions, 1079 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index d0165461b024..f35473f8c630 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -410,6 +410,8 @@ sysctl/
- directory with info on the /proc/sys/* files.
target/
- directory with info on generating TCM v4 fabric .ko modules
+tee.txt
+ - info on the TEE subsystem and drivers
this_cpu_ops.txt
- List rationale behind and the way to use this_cpu operations.
thermal/
diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index f773a264ae02..1672573b037a 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -17,7 +17,7 @@ rcu_dereference.txt
rcubarrier.txt
- RCU and Unloadable Modules
rculist_nulls.txt
- - RCU list primitives for use with SLAB_DESTROY_BY_RCU
+ - RCU list primitives for use with SLAB_TYPESAFE_BY_RCU
rcuref.txt
- Reference-count design for elements of lists/arrays protected by RCU
rcu.txt
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
index d583c653a703..38d6d800761f 100644
--- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
@@ -19,6 +19,8 @@ to each other.
The <tt>rcu_state</tt> Structure</a>
<li> <a href="#The rcu_node Structure">
The <tt>rcu_node</tt> Structure</a>
+<li> <a href="#The rcu_segcblist Structure">
+ The <tt>rcu_segcblist</tt> Structure</a>
<li> <a href="#The rcu_data Structure">
The <tt>rcu_data</tt> Structure</a>
<li> <a href="#The rcu_dynticks Structure">
@@ -841,6 +843,134 @@ for lockdep lock-class names.
Finally, lines&nbsp;64-66 produce an error if the maximum number of
CPUs is too large for the specified fanout.
+<h3><a name="The rcu_segcblist Structure">
+The <tt>rcu_segcblist</tt> Structure</a></h3>
+
+The <tt>rcu_segcblist</tt> structure maintains a segmented list of
+callbacks as follows:
+
+<pre>
+ 1 #define RCU_DONE_TAIL 0
+ 2 #define RCU_WAIT_TAIL 1
+ 3 #define RCU_NEXT_READY_TAIL 2
+ 4 #define RCU_NEXT_TAIL 3
+ 5 #define RCU_CBLIST_NSEGS 4
+ 6
+ 7 struct rcu_segcblist {
+ 8 struct rcu_head *head;
+ 9 struct rcu_head **tails[RCU_CBLIST_NSEGS];
+10 unsigned long gp_seq[RCU_CBLIST_NSEGS];
+11 long len;
+12 long len_lazy;
+13 };
+</pre>
+
+<p>
+The segments are as follows:
+
+<ol>
+<li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed.
+ These callbacks are ready to be invoked.
+<li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the
+ current grace period.
+ Note that different CPUs can have different ideas about which
+ grace period is current, hence the <tt>-&gt;gp_seq</tt> field.
+<li> <tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next
+ grace period to start.
+<li> <tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been
+ associated with a grace period.
+</ol>
+
+<p>
+The <tt>-&gt;head</tt> pointer references the first callback or
+is <tt>NULL</tt> if the list contains no callbacks (which is
+<i>not</i> the same as being empty).
+Each element of the <tt>-&gt;tails[]</tt> array references the
+<tt>-&gt;next</tt> pointer of the last callback in the corresponding
+segment of the list, or the list's <tt>-&gt;head</tt> pointer if
+that segment and all previous segments are empty.
+If the corresponding segment is empty but some previous segment is
+not empty, then the array element is identical to its predecessor.
+Older callbacks are closer to the head of the list, and new callbacks
+are added at the tail.
+This relationship between the <tt>-&gt;head</tt> pointer, the
+<tt>-&gt;tails[]</tt> array, and the callbacks is shown in this
+diagram:
+
+</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
+
+</p><p>In this figure, the <tt>-&gt;head</tt> pointer references the
+first
+RCU callback in the list.
+The <tt>-&gt;tails[RCU_DONE_TAIL]</tt> array element references
+the <tt>-&gt;head</tt> pointer itself, indicating that none
+of the callbacks is ready to invoke.
+The <tt>-&gt;tails[RCU_WAIT_TAIL]</tt> array element references callback
+CB&nbsp;2's <tt>-&gt;next</tt> pointer, which indicates that
+CB&nbsp;1 and CB&nbsp;2 are both waiting on the current grace period,
+give or take possible disagreements about exactly which grace period
+is the current one.
+The <tt>-&gt;tails[RCU_NEXT_READY_TAIL]</tt> array element
+references the same RCU callback that <tt>-&gt;tails[RCU_WAIT_TAIL]</tt>
+does, which indicates that there are no callbacks waiting on the next
+RCU grace period.
+The <tt>-&gt;tails[RCU_NEXT_TAIL]</tt> array element references
+CB&nbsp;4's <tt>-&gt;next</tt> pointer, indicating that all the
+remaining RCU callbacks have not yet been assigned to an RCU grace
+period.
+Note that the <tt>-&gt;tails[RCU_NEXT_TAIL]</tt> array element
+always references the last RCU callback's <tt>-&gt;next</tt> pointer
+unless the callback list is empty, in which case it references
+the <tt>-&gt;head</tt> pointer.
+
+<p>
+There is one additional important special case for the
+<tt>-&gt;tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt>
+when this list is <i>disabled</i>.
+Lists are disabled when the corresponding CPU is offline or when
+the corresponding CPU's callbacks are offloaded to a kthread,
+both of which are described elsewhere.
+
+</p><p>CPUs advance their callbacks from the
+<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
+<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
+as grace periods advance.
+
+</p><p>The <tt>-&gt;gp_seq[]</tt> array records grace-period
+numbers corresponding to the list segments.
+This is what allows different CPUs to have different ideas as to
+which is the current grace period while still avoiding premature
+invocation of their callbacks.
+In particular, this allows CPUs that go idle for extended periods
+to determine which of their callbacks are ready to be invoked after
+reawakening.
+
+</p><p>The <tt>-&gt;len</tt> counter contains the number of
+callbacks in <tt>-&gt;head</tt>, and the
+<tt>-&gt;len_lazy</tt> contains the number of those callbacks that
+are known to only free memory, and whose invocation can therefore
+be safely deferred.
+
+<p><b>Important note</b>: It is the <tt>-&gt;len</tt> field that
+determines whether or not there are callbacks associated with
+this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>-&gt;head</tt>
+pointer.
+The reason for this is that all the ready-to-invoke callbacks
+(that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted
+all at once at callback-invocation time.
+If callback invocation must be postponed, for example, because a
+high-priority process just woke up on this CPU, then the remaining
+callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment.
+Either way, the <tt>-&gt;len</tt> and <tt>-&gt;len_lazy</tt> counts
+are adjusted after the corresponding callbacks have been invoked, and so
+again it is the <tt>-&gt;len</tt> count that accurately reflects whether
+or not there are callbacks associated with this <tt>rcu_segcblist</tt>
+structure.
+Of course, off-CPU sampling of the <tt>-&gt;len</tt> count requires
+the use of appropriate synchronization, for example, memory barriers.
+This synchronization can be a bit subtle, particularly in the case
+of <tt>rcu_barrier()</tt>.
+
<h3><a name="The rcu_data Structure">
The <tt>rcu_data</tt> Structure</a></h3>
@@ -983,62 +1113,18 @@ choice.
as follows:
<pre>
- 1 struct rcu_head *nxtlist;
- 2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
- 3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
- 4 long qlen_lazy;
- 5 long qlen;
- 6 long qlen_last_fqs_check;
+ 1 struct rcu_segcblist cblist;
+ 2 long qlen_last_fqs_check;
+ 3 unsigned long n_cbs_invoked;
+ 4 unsigned long n_nocbs_invoked;
+ 5 unsigned long n_cbs_orphaned;
+ 6 unsigned long n_cbs_adopted;
7 unsigned long n_force_qs_snap;
- 8 unsigned long n_cbs_invoked;
- 9 unsigned long n_cbs_orphaned;
-10 unsigned long n_cbs_adopted;
-11 long blimit;
+ 8 long blimit;
</pre>
-<p>The <tt>-&gt;nxtlist</tt> pointer and the
-<tt>-&gt;nxttail[]</tt> array form a four-segment list with
-older callbacks near the head and newer ones near the tail.
-Each segment contains callbacks with the corresponding relationship
-to the current grace period.
-The pointer out of the end of each of the four segments is referenced
-by the element of the <tt>-&gt;nxttail[]</tt> array indexed by
-<tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period),
-<tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period),
-<tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next
-grace period), and
-<tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated
-with a specific grace period)
-respectively, as shown in the following figure.
-
-</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
-
-</p><p>In this figure, the <tt>-&gt;nxtlist</tt> pointer references the
-first
-RCU callback in the list.
-The <tt>-&gt;nxttail[RCU_DONE_TAIL]</tt> array element references
-the <tt>-&gt;nxtlist</tt> pointer itself, indicating that none
-of the callbacks is ready to invoke.
-The <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt> array element references callback
-CB&nbsp;2's <tt>-&gt;next</tt> pointer, which indicates that
-CB&nbsp;1 and CB&nbsp;2 are both waiting on the current grace period.
-The <tt>-&gt;nxttail[RCU_NEXT_READY_TAIL]</tt> array element
-references the same RCU callback that <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt>
-does, which indicates that there are no callbacks waiting on the next
-RCU grace period.
-The <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element references
-CB&nbsp;4's <tt>-&gt;next</tt> pointer, indicating that all the
-remaining RCU callbacks have not yet been assigned to an RCU grace
-period.
-Note that the <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element
-always references the last RCU callback's <tt>-&gt;next</tt> pointer
-unless the callback list is empty, in which case it references
-the <tt>-&gt;nxtlist</tt> pointer.
-
-</p><p>CPUs advance their callbacks from the
-<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
-<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
-as grace periods advance.
+<p>The <tt>-&gt;cblist</tt> structure is the segmented callback list
+described earlier.
The CPU advances the callbacks in its <tt>rcu_data</tt> structure
whenever it notices that another RCU grace period has completed.
The CPU detects the completion of an RCU grace period by noticing
@@ -1049,16 +1135,7 @@ Recall that each <tt>rcu_node</tt> structure's
<tt>-&gt;completed</tt> field is updated at the end of each
grace period.
-</p><p>The <tt>-&gt;nxtcompleted[]</tt> array records grace-period
-numbers corresponding to the list segments.
-This allows CPUs that go idle for extended periods to determine
-which of their callbacks are ready to be invoked after reawakening.
-
-</p><p>The <tt>-&gt;qlen</tt> counter contains the number of
-callbacks in <tt>-&gt;nxtlist</tt>, and the
-<tt>-&gt;qlen_lazy</tt> contains the number of those callbacks that
-are known to only free memory, and whose invocation can therefore
-be safely deferred.
+<p>
The <tt>-&gt;qlen_last_fqs_check</tt> and
<tt>-&gt;n_force_qs_snap</tt> coordinate the forcing of quiescent
states from <tt>call_rcu()</tt> and friends when callback
@@ -1069,6 +1146,10 @@ lists grow excessively long.
fields count the number of callbacks invoked,
sent to other CPUs when this CPU goes offline,
and received from other CPUs when those other CPUs go offline.
+The <tt>-&gt;n_nocbs_invoked</tt> is used when the CPU's callbacks
+are offloaded to a kthread.
+
+<p>
Finally, the <tt>-&gt;blimit</tt> counter is the maximum number of
RCU callbacks that may be invoked at a given time.
@@ -1104,6 +1185,9 @@ Its fields are as follows:
1 int dynticks_nesting;
2 int dynticks_nmi_nesting;
3 atomic_t dynticks;
+ 4 bool rcu_need_heavy_qs;
+ 5 unsigned long rcu_qs_ctr;
+ 6 bool rcu_urgent_qs;
</pre>
<p>The <tt>-&gt;dynticks_nesting</tt> field counts the
@@ -1117,11 +1201,32 @@ NMIs are counted by the <tt>-&gt;dynticks_nmi_nesting</tt>
field, except that NMIs that interrupt non-dyntick-idle execution
are not counted.
-</p><p>Finally, the <tt>-&gt;dynticks</tt> field counts the corresponding
+</p><p>The <tt>-&gt;dynticks</tt> field counts the corresponding
CPU's transitions to and from dyntick-idle mode, so that this counter
has an even value when the CPU is in dyntick-idle mode and an odd
value otherwise.
+</p><p>The <tt>-&gt;rcu_need_heavy_qs</tt> field is used
+to record the fact that the RCU core code would really like to
+see a quiescent state from the corresponding CPU, so much so that
+it is willing to call for heavy-weight dyntick-counter operations.
+This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
+code, which provide a momentary idle sojourn in response.
+
+</p><p>The <tt>-&gt;rcu_qs_ctr</tt> field is used to record
+quiescent states from <tt>cond_resched()</tt>.
+Because <tt>cond_resched()</tt> can execute quite frequently, this
+must be quite lightweight, as in a non-atomic increment of this
+per-CPU field.
+
+</p><p>Finally, the <tt>-&gt;rcu_urgent_qs</tt> field is used to record
+the fact that the RCU core code would really like to see a quiescent
+state from the corresponding CPU, with the various other fields indicating
+just how badly RCU wants this quiescent state.
+This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
+code, which, if nothing else, non-atomically increment <tt>-&gt;rcu_qs_ctr</tt>
+in response.
+
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
diff --git a/Documentation/RCU/Design/Data-Structures/nxtlist.svg b/Documentation/RCU/Design/Data-Structures/nxtlist.svg
index abc4cc73a097..0223e79c38e0 100644
--- a/Documentation/RCU/Design/Data-Structures/nxtlist.svg
+++ b/Documentation/RCU/Design/Data-Structures/nxtlist.svg
@@ -19,7 +19,7 @@
id="svg2"
version="1.1"
inkscape:version="0.48.4 r9939"
- sodipodi:docname="nxtlist.fig">
+ sodipodi:docname="segcblist.svg">
<metadata
id="metadata94">
<rdf:RDF>
@@ -28,7 +28,7 @@
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
- <dc:title></dc:title>
+ <dc:title />
</cc:Work>
</rdf:RDF>
</metadata>
@@ -241,61 +241,51 @@
xml:space="preserve"
x="225"
y="675"
- fill="#000000"
- font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
- text-anchor="start"
- id="text64">nxtlist</text>
+ id="text64"
+ style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;head</text>
<!-- Text -->
<text
xml:space="preserve"
x="225"
y="1800"
- fill="#000000"
- font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
- text-anchor="start"
- id="text66">nxttail[RCU_DONE_TAIL]</text>
+ id="text66"
+ style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;tails[RCU_DONE_TAIL]</text>
<!-- Text -->
<text
xml:space="preserve"
x="225"
y="2925"
- fill="#000000"
- font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
- text-anchor="start"
- id="text68">nxttail[RCU_WAIT_TAIL]</text>
+ id="text68"
+ style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;tails[RCU_WAIT_TAIL]</text>
<!-- Text -->
<text
xml:space="preserve"
x="225"
y="4050"
- fill="#000000"
- font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
- text-anchor="start"
- id="text70">nxttail[RCU_NEXT_READY_TAIL]</text>
+ id="text70"
+ style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;tails[RCU_NEXT_READY_TAIL]</text>
<!-- Text -->
<text
xml:space="preserve"
x="225"
y="5175"
- fill="#000000"
- font-family="Courier"
font-style="normal"
font-weight="bold"
font-size="324"
- text-anchor="start"
- id="text72">nxttail[RCU_NEXT_TAIL]</text>
+ id="text72"
+ style="font-size:324px;font-style:normal;font-weight:bold;text-anchor:start;fill:#000000;font-family:Courier">-&gt;tails[RCU_NEXT_TAIL]</text>
<!-- Text -->
<text
xml:space="preserve"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
index 7a3194c5559a..e5d0bbd0230b 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
@@ -284,6 +284,7 @@ Expedited Grace Period Refinements</a></h2>
Funnel locking and wait/wakeup</a>.
<li> <a href="#Use of Workqueues">Use of Workqueues</a>.
<li> <a href="#Stall Warnings">Stall warnings</a>.
+<li> <a href="#Mid-Boot Operation">Mid-boot operation</a>.
</ol>
<h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3>
@@ -524,7 +525,7 @@ their grace periods and carrying out their wakeups.
In earlier implementations, the task requesting the expedited
grace period also drove it to completion.
This straightforward approach had the disadvantage of needing to
-account for signals sent to user tasks,
+account for POSIX signals sent to user tasks,
so more recent implemementations use the Linux kernel's
<a href="https://www.kernel.org/doc/Documentation/workqueue.txt">workqueues</a>.
@@ -533,8 +534,8 @@ The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock
does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>
so that a workqueue kthread does the actual grace-period processing.
-Because workqueue kthreads do not accept signals, grace-period-wait
-processing need not allow for signals.
+Because workqueue kthreads do not accept POSIX signals, grace-period-wait
+processing need not allow for POSIX signals.
In addition, this approach allows wakeups for the previous expedited
grace period to be overlapped with processing for the next expedited
@@ -586,6 +587,46 @@ blocking the current grace period are printed.
Each stall warning results in another pass through the loop, but the
second and subsequent passes use longer stall times.
+<h3><a name="Mid-Boot Operation">Mid-boot operation</a></h3>
+
+<p>
+The use of workqueues has the advantage that the expedited
+grace-period code need not worry about POSIX signals.
+Unfortunately, it has the
+corresponding disadvantage that workqueues cannot be used until
+they are initialized, which does not happen until some time after
+the scheduler spawns the first task.
+Given that there are parts of the kernel that really do want to
+execute grace periods during this mid-boot &ldquo;dead zone&rdquo;,
+expedited grace periods must do something else during thie time.
+
+<p>
+What they do is to fall back to the old practice of requiring that the
+requesting task drive the expedited grace period, as was the case
+before the use of workqueues.
+However, the requesting task is only required to drive the grace period
+during the mid-boot dead zone.
+Before mid-boot, a synchronous grace period is a no-op.
+Some time after mid-boot, workqueues are used.
+
+<p>
+Non-expedited non-SRCU synchronous grace periods must also operate
+normally during mid-boot.
+This is handled by causing non-expedited grace periods to take the
+expedited code path during mid-boot.
+
+<p>
+The current code assumes that there are no POSIX signals during
+the mid-boot dead zone.
+However, if an overwhelming need for POSIX signals somehow arises,
+appropriate adjustments can be made to the expedited stall-warning code.
+One such adjustment would reinstate the pre-workqueue stall-warning
+checks, but only during the mid-boot dead zone.
+
+<p>
+With this refinement, synchronous grace periods can now be used from
+task context pretty much any time during the life of the kernel.
+
<h3><a name="Summary">
Summary</a></h3>
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 21593496aca6..f60adf112663 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -659,8 +659,9 @@ systems with more than one CPU:
In other words, a given instance of <tt>synchronize_rcu()</tt>
can avoid waiting on a given RCU read-side critical section only
if it can prove that <tt>synchronize_rcu()</tt> started first.
+ </font>
- <p>
+ <p><font color="ffffff">
A related question is &ldquo;When <tt>rcu_read_lock()</tt>
doesn't generate any code, why does it matter how it relates
to a grace period?&rdquo;
@@ -675,8 +676,9 @@ systems with more than one CPU:
within the critical section, in which case none of the accesses
within the critical section may observe the effects of any
access following the grace period.
+ </font>
- <p>
+ <p><font color="ffffff">
As of late 2016, mathematical models of RCU take this
viewpoint, for example, see slides&nbsp;62 and&nbsp;63
of the
@@ -1616,8 +1618,8 @@ CPUs should at least make reasonable forward progress.
In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt>
is permitted to impose modest degradation of real-time latency
on non-idle online CPUs.
-That said, it will likely be necessary to take further steps to reduce this
-degradation, hopefully to roughly that of a scheduling-clock interrupt.
+Here, &ldquo;modest&rdquo; means roughly the same latency
+degradation as a scheduling-clock interrupt.
<p>
There are a number of situations where even
@@ -1913,12 +1915,9 @@ This requirement is another factor driving batching of grace periods,
but it is also the driving force behind the checks for large numbers
of queued RCU callbacks in the <tt>call_rcu()</tt> code path.
Finally, high update rates should not delay RCU read-side critical
-sections, although some read-side delays can occur when using
+sections, although some small read-side delays can occur when using
<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use
-of <tt>try_stop_cpus()</tt>.
-(In the future, <tt>synchronize_rcu_expedited()</tt> will be
-converted to use lighter-weight inter-processor interrupts (IPIs),
-but this will still disturb readers, though to a much smaller degree.)
+of <tt>smp_call_function_single()</tt>.
<p>
Although all three of these corner cases were understood in the early
@@ -2154,7 +2153,8 @@ as will <tt>rcu_assign_pointer()</tt>.
<p>
Although <tt>call_rcu()</tt> may be invoked at any
time during boot, callbacks are not guaranteed to be invoked until after
-the scheduler is fully up and running.
+all of RCU's kthreads have been spawned, which occurs at
+<tt>early_initcall()</tt> time.
This delay in callback invocation is due to the fact that RCU does not
invoke callbacks until it is fully initialized, and this full initialization
cannot occur until after the scheduler has initialized itself to the
@@ -2167,8 +2167,10 @@ on what operations those callbacks could invoke.
Perhaps surprisingly, <tt>synchronize_rcu()</tt>,
<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a>
(<a href="#Bottom-Half Flavor">discussed below</a>),
-and
-<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>
+<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>,
+<tt>synchronize_rcu_expedited()</tt>,
+<tt>synchronize_rcu_bh_expedited()</tt>, and
+<tt>synchronize_sched_expedited()</tt>
will all operate normally
during very early boot, the reason being that there is only one CPU
and preemption is disabled.
@@ -2178,45 +2180,59 @@ state and thus a grace period, so the early-boot implementation can
be a no-op.
<p>
-Both <tt>synchronize_rcu_bh()</tt> and <tt>synchronize_sched()</tt>
-continue to operate normally through the remainder of boot, courtesy
-of the fact that preemption is disabled across their RCU read-side
-critical sections and also courtesy of the fact that there is still
-only one CPU.
-However, once the scheduler starts initializing, preemption is enabled.
-There is still only a single CPU, but the fact that preemption is enabled
-means that the no-op implementation of <tt>synchronize_rcu()</tt> no
-longer works in <tt>CONFIG_PREEMPT=y</tt> kernels.
-Therefore, as soon as the scheduler starts initializing, the early-boot
-fastpath is disabled.
-This means that <tt>synchronize_rcu()</tt> switches to its runtime
-mode of operation where it posts callbacks, which in turn means that
-any call to <tt>synchronize_rcu()</tt> will block until the corresponding
-callback is invoked.
-Unfortunately, the callback cannot be invoked until RCU's runtime
-grace-period machinery is up and running, which cannot happen until
-the scheduler has initialized itself sufficiently to allow RCU's
-kthreads to be spawned.
-Therefore, invoking <tt>synchronize_rcu()</tt> during scheduler
-initialization can result in deadlock.
+However, once the scheduler has spawned its first kthread, this early
+boot trick fails for <tt>synchronize_rcu()</tt> (as well as for
+<tt>synchronize_rcu_expedited()</tt>) in <tt>CONFIG_PREEMPT=y</tt>
+kernels.
+The reason is that an RCU read-side critical section might be preempted,
+which means that a subsequent <tt>synchronize_rcu()</tt> really does have
+to wait for something, as opposed to simply returning immediately.
+Unfortunately, <tt>synchronize_rcu()</tt> can't do this until all of
+its kthreads are spawned, which doesn't happen until some time during
+<tt>early_initcalls()</tt> time.
+But this is no excuse: RCU is nevertheless required to correctly handle
+synchronous grace periods during this time period.
+Once all of its kthreads are up and running, RCU starts running
+normally.
<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
- So what happens with <tt>synchronize_rcu()</tt> during
- scheduler initialization for <tt>CONFIG_PREEMPT=n</tt>
- kernels?
+ How can RCU possibly handle grace periods before all of its
+ kthreads have been spawned???
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
- In <tt>CONFIG_PREEMPT=n</tt> kernel, <tt>synchronize_rcu()</tt>
- maps directly to <tt>synchronize_sched()</tt>.
- Therefore, <tt>synchronize_rcu()</tt> works normally throughout
- boot in <tt>CONFIG_PREEMPT=n</tt> kernels.
- However, your code must also work in <tt>CONFIG_PREEMPT=y</tt> kernels,
- so it is still necessary to avoid invoking <tt>synchronize_rcu()</tt>
- during scheduler initialization.
+ Very carefully!
+ </font>
+
+ <p><font color="ffffff">
+ During the &ldquo;dead zone&rdquo; between the time that the
+ scheduler spawns the first task and the time that all of RCU's
+ kthreads have been spawned, all synchronous grace periods are
+ handled by the expedited grace-period mechanism.
+ At runtime, this expedited mechanism relies on workqueues, but
+ during the dead zone the requesting task itself drives the
+ desired expedited grace period.
+ Because dead-zone execution takes place within task context,
+ everything works.
+ Once the dead zone ends, expedited grace periods go back to
+ using workqueues, as is required to avoid problems that would
+ otherwise occur when a user task received a POSIX signal while
+ driving an expedited grace period.
+ </font>
+
+ <p><font color="ffffff">
+ And yes, this does mean that it is unhelpful to send POSIX
+ signals to random tasks between the time that the scheduler
+ spawns its first kthread and the time that RCU's kthreads
+ have all been spawned.
+ If there ever turns out to be a good reason for sending POSIX
+ signals during that time, appropriate adjustments will be made.
+ (If it turns out that POSIX signals are sent during this time for
+ no good reason, other adjustments will be made, appropriate
+ or otherwise.)
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
@@ -2295,12 +2311,61 @@ situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU.
The need for <tt>rcu_barrier()</tt> for module unloading became
apparent later.
+<p>
+<b>Important note</b>: The <tt>rcu_barrier()</tt> function is not,
+repeat, <i>not</i>, obligated to wait for a grace period.
+It is instead only required to wait for RCU callbacks that have
+already been posted.
+Therefore, if there are no RCU callbacks posted anywhere in the system,
+<tt>rcu_barrier()</tt> is within its rights to return immediately.
+Even if there are callbacks posted, <tt>rcu_barrier()</tt> does not
+necessarily need to wait for a grace period.
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Wait a minute!
+ Each RCU callbacks must wait for a grace period to complete,
+ and <tt>rcu_barrier()</tt> must wait for each pre-existing
+ callback to be invoked.
+ Doesn't <tt>rcu_barrier()</tt> therefore need to wait for
+ a full grace period if there is even one callback posted anywhere
+ in the system?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Absolutely not!!!
+ </font>
+
+ <p><font color="ffffff">
+ Yes, each RCU callbacks must wait for a grace period to complete,
+ but it might well be partly (or even completely) finished waiting
+ by the time <tt>rcu_barrier()</tt> is invoked.
+ In that case, <tt>rcu_barrier()</tt> need only wait for the
+ remaining portion of the grace period to elapse.
+ So even if there are quite a few callbacks posted,
+ <tt>rcu_barrier()</tt> might well return quite quickly.
+ </font>
+
+ <p><font color="ffffff">
+ So if you need to wait for a grace period as well as for all
+ pre-existing callbacks, you will need to invoke both
+ <tt>synchronize_rcu()</tt> and <tt>rcu_barrier()</tt>.
+ If latency is a concern, you can always use workqueues
+ to invoke them concurrently.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
<h3><a name="Hotplug CPU">Hotplug CPU</a></h3>
<p>
The Linux kernel supports CPU hotplug, which means that CPUs
can come and go.
-It is of course illegal to use any RCU API member from an offline CPU.
+It is of course illegal to use any RCU API member from an offline CPU,
+with the exception of <a href="#Sleepable RCU">SRCU</a> read-side
+critical sections.
This requirement was present from day one in DYNIX/ptx, but
on the other hand, the Linux kernel's CPU-hotplug implementation
is &ldquo;interesting.&rdquo;
@@ -2310,19 +2375,18 @@ The Linux-kernel CPU-hotplug implementation has notifiers that
are used to allow the various kernel subsystems (including RCU)
to respond appropriately to a given CPU-hotplug operation.
Most RCU operations may be invoked from CPU-hotplug notifiers,
-including even normal synchronous grace-period operations
-such as <tt>synchronize_rcu()</tt>.
-However, expedited grace-period operations such as
-<tt>synchronize_rcu_expedited()</tt> are not supported,
-due to the fact that current implementations block CPU-hotplug
-operations, which could result in deadlock.
+including even synchronous grace-period operations such as
+<tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>.
<p>
-In addition, all-callback-wait operations such as
+However, all-callback-wait operations such as
<tt>rcu_barrier()</tt> are also not supported, due to the
fact that there are phases of CPU-hotplug operations where
the outgoing CPU's callbacks will not be invoked until after
the CPU-hotplug operation ends, which could also result in deadlock.
+Furthermore, <tt>rcu_barrier()</tt> blocks CPU-hotplug operations
+during its execution, which results in another type of deadlock
+when invoked from a CPU-hotplug notifier.
<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3>
@@ -2864,6 +2928,27 @@ API, which, in combination with <tt>srcu_read_unlock()</tt>,
guarantees a full memory barrier.
<p>
+Also unlike other RCU flavors, SRCU's callbacks-wait function
+<tt>srcu_barrier()</tt> may be invoked from CPU-hotplug notifiers,
+though this is not necessarily a good idea.
+The reason that this is possible is that SRCU is insensitive
+to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt>
+need not exclude CPU-hotplug operations.
+
+<p>
+As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating
+a locking bottleneck present in prior kernel versions.
+Although this will allow users to put much heavier stress on
+<tt>call_srcu()</tt>, it is important to note that SRCU does not
+yet take any special steps to deal with callback flooding.
+So if you are posting (say) 10,000 SRCU callbacks per second per CPU,
+you are probably totally OK, but if you intend to post (say) 1,000,000
+SRCU callbacks per second per CPU, please run some tests first.
+SRCU just might need a few adjustment to deal with that sort of load.
+Of course, your mileage may vary based on the speed of your CPUs and
+the size of your memory.
+
+<p>
The
<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a>
includes
@@ -3021,8 +3106,8 @@ to do some redesign to avoid this scalability problem.
<p>
RCU disables CPU hotplug in a few places, perhaps most notably in the
-expedited grace-period and <tt>rcu_barrier()</tt> operations.
-If there is a strong reason to use expedited grace periods in CPU-hotplug
+<tt>rcu_barrier()</tt> operations.
+If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug
notifiers, it will be necessary to avoid disabling CPU hotplug.
This would introduce some complexity, so there had better be a <i>very</i>
good reason.
@@ -3096,9 +3181,5 @@ Andy Lutomirski for their help in rendering
this article human readable, and to Michelle Rankin for her support
of this effort.
Other contributions are acknowledged in the Linux kernel's git archive.
-The cartoon is copyright (c) 2013 by Melissa Broussard,
-and is provided
-under the terms of the Creative Commons Attribution-Share Alike 3.0
-United States license.
</body></html>
diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
index c0bf2441a2ba..b2a613f16d74 100644
--- a/Documentation/RCU/rcu_dereference.txt
+++ b/Documentation/RCU/rcu_dereference.txt
@@ -138,6 +138,15 @@ o Be very careful about comparing pointers obtained from
This sort of comparison occurs frequently when scanning
RCU-protected circular linked lists.
+ Note that if checks for being within an RCU read-side
+ critical section are not required and the pointer is never
+ dereferenced, rcu_access_pointer() should be used in place
+ of rcu_dereference(). The rcu_access_pointer() primitive
+ does not require an enclosing read-side critical section,
+ and also omits the smp_read_barrier_depends() included in
+ rcu_dereference(), which in turn should provide a small
+ performance gain in some CPUs (e.g., the DEC Alpha).
+
o The comparison is against a pointer that references memory
that was initialized "a long time ago." The reason
this is safe is that even if misordering occurs, the
diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
index 18f9651ff23d..8151f0195f76 100644
--- a/Documentation/RCU/rculist_nulls.txt
+++ b/Documentation/RCU/rculist_nulls.txt
@@ -1,5 +1,5 @@
Using hlist_nulls to protect read-mostly linked lists and
-objects using SLAB_DESTROY_BY_RCU allocations.
+objects using SLAB_TYPESAFE_BY_RCU allocations.
Please read the basics in Documentation/RCU/listRCU.txt
@@ -7,7 +7,7 @@ Using special makers (called 'nulls') is a convenient way
to solve following problem :
A typical RCU linked list managing objects which are
-allocated with SLAB_DESTROY_BY_RCU kmem_cache can
+allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
use following algos :
1) Lookup algo
@@ -96,7 +96,7 @@ unlock_chain(); // typically a spin_unlock()
3) Remove algo
--------------
Nothing special here, we can use a standard RCU hlist deletion.
-But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
+But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)
if (put_last_reference_on(obj) {
diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index e93d04133fe7..96a3d81837e1 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -1,9 +1,102 @@
Using RCU's CPU Stall Detector
-The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall
-detector, which detects conditions that unduly delay RCU grace periods.
-This module parameter enables CPU stall detection by default, but
-may be overridden via boot-time parameter or at runtime via sysfs.
+This document first discusses what sorts of issues RCU's CPU stall
+detector can locate, and then discusses kernel parameters and Kconfig
+options that can be used to fine-tune the detector's operation. Finally,
+this document explains the stall detector's "splat" format.
+
+
+What Causes RCU CPU Stall Warnings?
+
+So your kernel printed an RCU CPU stall warning. The next question is
+"What caused it?" The following problems can result in RCU CPU stall
+warnings:
+
+o A CPU looping in an RCU read-side critical section.
+
+o A CPU looping with interrupts disabled.
+
+o A CPU looping with preemption disabled. This condition can
+ result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
+ stalls.
+
+o A CPU looping with bottom halves disabled. This condition can
+ result in RCU-sched and RCU-bh stalls.
+
+o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
+ kernel without invoking schedule(). Note that cond_resched()
+ does not necessarily prevent RCU CPU stall warnings. Therefore,
+ if the looping in the kernel is really expected and desirable
+ behavior, you might need to replace some of the cond_resched()
+ calls with calls to cond_resched_rcu_qs().
+
+o Booting Linux using a console connection that is too slow to
+ keep up with the boot-time console-message rate. For example,
+ a 115Kbaud serial console can be -way- too slow to keep up
+ with boot-time message rates, and will frequently result in
+ RCU CPU stall warning messages. Especially if you have added
+ debug printk()s.
+
+o Anything that prevents RCU's grace-period kthreads from running.
+ This can result in the "All QSes seen" console-log message.
+ This message will include information on when the kthread last
+ ran and how often it should be expected to run.
+
+o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
+ happen to preempt a low-priority task in the middle of an RCU
+ read-side critical section. This is especially damaging if
+ that low-priority task is not permitted to run on any other CPU,
+ in which case the next RCU grace period can never complete, which
+ will eventually cause the system to run out of memory and hang.
+ While the system is in the process of running itself out of
+ memory, you might see stall-warning messages.
+
+o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
+ is running at a higher priority than the RCU softirq threads.
+ This will prevent RCU callbacks from ever being invoked,
+ and in a CONFIG_PREEMPT_RCU kernel will further prevent
+ RCU grace periods from ever completing. Either way, the
+ system will eventually run out of memory and hang. In the
+ CONFIG_PREEMPT_RCU case, you might see stall-warning
+ messages.
+
+o A hardware or software issue shuts off the scheduler-clock
+ interrupt on a CPU that is not in dyntick-idle mode. This
+ problem really has happened, and seems to be most likely to
+ result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
+
+o A bug in the RCU implementation.
+
+o A hardware failure. This is quite unlikely, but has occurred
+ at least once in real life. A CPU failed in a running system,
+ becoming unresponsive, but not causing an immediate crash.
+ This resulted in a series of RCU CPU stall warnings, eventually
+ leading the realization that the CPU had failed.
+
+The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
+warning. Note that SRCU does -not- have CPU stall warnings. Please note
+that RCU only detects CPU stalls when there is a grace period in progress.
+No grace period, no CPU stall warnings.
+
+To diagnose the cause of the stall, inspect the stack traces.
+The offending function will usually be near the top of the stack.
+If you have a series of stall warnings from a single extended stall,
+comparing the stack traces can often help determine where the stall
+is occurring, which will usually be in the function nearest the top of
+that portion of the stack which remains the same from trace to trace.
+If you can reliably trigger the stall, ftrace can be quite helpful.
+
+RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
+and with RCU's event tracing. For information on RCU's event tracing,
+see include/trace/events/rcu.h.
+
+
+Fine-Tuning the RCU CPU Stall Detector
+
+The rcuupdate.rcu_cpu_stall_suppress module parameter disables RCU's
+CPU stall detector, which detects conditions that unduly delay RCU grace
+periods. This module parameter enables CPU stall detection by default,
+but may be overridden via boot-time parameter or at runtime via sysfs.
The stall detector's idea of what constitutes "unduly delayed" is
controlled by a set of kernel configuration variables and cpp macros:
@@ -56,6 +149,9 @@ rcupdate.rcu_task_stall_timeout
And continues with the output of sched_show_task() for each
task stalling the current RCU-tasks grace period.
+
+Interpreting RCU's CPU Stall-Detector "Splats"
+
For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
it will print a message similar to the following:
@@ -178,89 +274,3 @@ grace period is in flight.
It is entirely possible to see stall warnings from normal and from
expedited grace periods at about the same time from the same run.
-
-
-What Causes RCU CPU Stall Warnings?
-
-So your kernel printed an RCU CPU stall warning. The next question is
-"What caused it?" The following problems can result in RCU CPU stall
-warnings:
-
-o A CPU looping in an RCU read-side critical section.
-
-o A CPU looping with interrupts disabled. This condition can
- result in RCU-sched and RCU-bh stalls.
-
-o A CPU looping with preemption disabled. This condition can
- result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
- stalls.
-
-o A CPU looping with bottom halves disabled. This condition can
- result in RCU-sched and RCU-bh stalls.
-
-o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
- kernel without invoking schedule(). Note that cond_resched()
- does not necessarily prevent RCU CPU stall warnings. Therefore,
- if the looping in the kernel is really expected and desirable
- behavior, you might need to replace some of the cond_resched()
- calls with calls to cond_resched_rcu_qs().
-
-o Booting Linux using a console connection that is too slow to
- keep up with the boot-time console-message rate. For example,
- a 115Kbaud serial console can be -way- too slow to keep up
- with boot-time message rates, and will frequently result in
- RCU CPU stall warning messages. Especially if you have added
- debug printk()s.
-
-o Anything that prevents RCU's grace-period kthreads from running.
- This can result in the "All QSes seen" console-log message.
- This message will include information on when the kthread last
- ran and how often it should be expected to run.
-
-o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
- happen to preempt a low-priority task in the middle of an RCU
- read-side critical section. This is especially damaging if
- that low-priority task is not permitted to run on any other CPU,
- in which case the next RCU grace period can never complete, which
- will eventually cause the system to run out of memory and hang.
- While the system is in the process of running itself out of
- memory, you might see stall-warning messages.
-
-o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
- is running at a higher priority than the RCU softirq threads.
- This will prevent RCU callbacks from ever being invoked,
- and in a CONFIG_PREEMPT_RCU kernel will further prevent
- RCU grace periods from ever completing. Either way, the
- system will eventually run out of memory and hang. In the
- CONFIG_PREEMPT_RCU case, you might see stall-warning
- messages.
-
-o A hardware or software issue shuts off the scheduler-clock
- interrupt on a CPU that is not in dyntick-idle mode. This
- problem really has happened, and seems to be most likely to
- result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
-
-o A bug in the RCU implementation.
-
-o A hardware failure. This is quite unlikely, but has occurred
- at least once in real life. A CPU failed in a running system,
- becoming unresponsive, but not causing an immediate crash.
- This resulted in a series of RCU CPU stall warnings, eventually
- leading the realization that the CPU had failed.
-
-The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
-warning. Note that SRCU does -not- have CPU stall warnings. Please note
-that RCU only detects CPU stalls when there is a grace period in progress.
-No grace period, no CPU stall warnings.
-
-To diagnose the cause of the stall, inspect the stack traces.
-The offending function will usually be near the top of the stack.
-If you have a series of stall warnings from a single extended stall,
-comparing the stack traces can often help determine where the stall
-is occurring, which will usually be in the function nearest the top of
-that portion of the stack which remains the same from trace to trace.
-If you can reliably trigger the stall, ftrace can be quite helpful.
-
-RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
-and with RCU's event tracing. For information on RCU's event tracing,
-see include/trace/events/rcu.h.
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 5cbd8b2395b8..8ed6c9f6133c 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -562,7 +562,9 @@ This section presents a "toy" RCU implementation that is based on
familiar locking primitives. Its overhead makes it a non-starter for
real-life use, as does its lack of scalability. It is also unsuitable
for realtime use, since it allows scheduling latency to "bleed" from
-one read-side critical section to another.
+one read-side critical section to another. It also assumes recursive
+reader-writer locks: If you try this with non-recursive locks, and
+you allow nested rcu_read_lock() calls, you can deadlock.
However, it is probably the easiest implementation to relate to, so is
a good starting point.
@@ -587,20 +589,21 @@ It is extremely simple:
write_unlock(&rcu_gp_mutex);
}
-[You can ignore rcu_assign_pointer() and rcu_dereference() without
-missing much. But here they are anyway. And whatever you do, don't
-forget about them when submitting patches making use of RCU!]
+[You can ignore rcu_assign_pointer() and rcu_dereference() without missing
+much. But here are simplified versions anyway. And whatever you do,
+don't forget about them when submitting patches making use of RCU!]
- #define rcu_assign_pointer(p, v) ({ \
- smp_wmb(); \
- (p) = (v); \
- })
+ #define rcu_assign_pointer(p, v) \
+ ({ \
+ smp_store_release(&(p), (v)); \
+ })
- #define rcu_dereference(p) ({ \
- typeof(p) _________p1 = p; \
- smp_read_barrier_depends(); \
- (_________p1); \
- })
+ #define rcu_dereference(p) \
+ ({ \
+ typeof(p) _________p1 = p; \
+ smp_read_barrier_depends(); \
+ (_________p1); \
+ })
The rcu_read_lock() and rcu_read_unlock() primitive read-acquire
@@ -925,7 +928,8 @@ d. Do you need RCU grace periods to complete even in the face
e. Is your workload too update-intensive for normal use of
RCU, but inappropriate for other synchronization mechanisms?
- If so, consider SLAB_DESTROY_BY_RCU. But please be careful!
+ If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
+ named SLAB_DESTROY_BY_RCU). But please be careful!
f. Do you need read-side critical sections that are respected
even though they are in the middle of the idle loop, during
diff --git a/Documentation/security/LoadPin.txt b/Documentation/admin-guide/LSM/LoadPin.rst
index e11877f5d3d4..32070762d24c 100644
--- a/Documentation/security/LoadPin.txt
+++ b/Documentation/admin-guide/LSM/LoadPin.rst
@@ -1,3 +1,7 @@
+=======
+LoadPin
+=======
+
LoadPin is a Linux Security Module that ensures all kernel-loaded files
(modules, firmware, etc) all originate from the same filesystem, with
the expectation that such a filesystem is backed by a read-only device
@@ -5,13 +9,13 @@ such as dm-verity or CDROM. This allows systems that have a verified
and/or unchangeable filesystem to enforce module and firmware loading
restrictions without needing to sign the files individually.
-The LSM is selectable at build-time with CONFIG_SECURITY_LOADPIN, and
+The LSM is selectable at build-time with ``CONFIG_SECURITY_LOADPIN``, and
can be controlled at boot-time with the kernel command line option
-"loadpin.enabled". By default, it is enabled, but can be disabled at
-boot ("loadpin.enabled=0").
+"``loadpin.enabled``". By default, it is enabled, but can be disabled at
+boot ("``loadpin.enabled=0``").
LoadPin starts pinning when it sees the first file loaded. If the
block device backing the filesystem is not read-only, a sysctl is
-created to toggle pinning: /proc/sys/kernel/loadpin/enabled. (Having
+created to toggle pinning: ``/proc/sys/kernel/loadpin/enabled``. (Having
a mutable filesystem means pinning is mutable too, but having the
sysctl allows for easy testing on systems with a mutable filesystem.)
diff --git a/Documentation/security/SELinux.txt b/Documentation/admin-guide/LSM/SELinux.rst
index 07eae00f3314..f722c9b4173a 100644
--- a/Documentation/security/SELinux.txt
+++ b/Documentation/admin-guide/LSM/SELinux.rst
@@ -1,27 +1,33 @@
+=======
+SELinux
+=======
+
If you want to use SELinux, chances are you will want
to use the distro-provided policies, or install the
latest reference policy release from
+
http://oss.tresys.com/projects/refpolicy
However, if you want to install a dummy policy for
-testing, you can do using 'mdp' provided under
+testing, you can do using ``mdp`` provided under
scripts/selinux. Note that this requires the selinux
userspace to be installed - in particular you will
need checkpolicy to compile a kernel, and setfiles and
fixfiles to label the filesystem.
1. Compile the kernel with selinux enabled.
- 2. Type 'make' to compile mdp.
+ 2. Type ``make`` to compile ``mdp``.
3. Make sure that you are not running with
SELinux enabled and a real policy. If
you are, reboot with selinux disabled
before continuing.
- 4. Run install_policy.sh:
+ 4. Run install_policy.sh::
+
cd scripts/selinux
sh install_policy.sh
Step 4 will create a new dummy policy valid for your
kernel, with a single selinux user, role, and type.
-It will compile the policy, will set your SELINUXTYPE to
-dummy in /etc/selinux/config, install the compiled policy
-as 'dummy', and relabel your filesystem.
+It will compile the policy, will set your ``SELINUXTYPE`` to
+``dummy`` in ``/etc/selinux/config``, install the compiled policy
+as ``dummy``, and relabel your filesystem.
diff --git a/Documentation/security/Smack.txt b/Documentation/admin-guide/LSM/Smack.rst
index 945cc633d883..6a5826a13aea 100644
--- a/Documentation/security/Smack.txt
+++ b/Documentation/admin-guide/LSM/Smack.rst
@@ -1,3 +1,6 @@
+=====
+Smack
+=====
"Good for you, you've decided to clean the elevator!"
@@ -14,6 +17,7 @@ available to determine which is best suited to the problem
at hand.
Smack consists of three major components:
+
- The kernel
- Basic utilities, which are helpful but not required
- Configuration data
@@ -39,16 +43,24 @@ The current git repository for Smack user space is:
This should make and install on most modern distributions.
There are five commands included in smackutil:
-chsmack - display or set Smack extended attribute values
-smackctl - load the Smack access rules
-smackaccess - report if a process with one label has access
- to an object with another
+chsmack:
+ display or set Smack extended attribute values
+
+smackctl:
+ load the Smack access rules
+
+smackaccess:
+ report if a process with one label has access
+ to an object with another
These two commands are obsolete with the introduction of
the smackfs/load2 and smackfs/cipso2 interfaces.
-smackload - properly formats data for writing to smackfs/load
-smackcipso - properly formats data for writing to smackfs/cipso
+smackload:
+ properly formats data for writing to smackfs/load
+
+smackcipso:
+ properly formats data for writing to smackfs/cipso
In keeping with the intent of Smack, configuration data is
minimal and not strictly required. The most important
@@ -56,15 +68,15 @@ configuration step is mounting the smackfs pseudo filesystem.
If smackutil is installed the startup script will take care
of this, but it can be manually as well.
-Add this line to /etc/fstab:
+Add this line to ``/etc/fstab``::
smackfs /sys/fs/smackfs smackfs defaults 0 0
-The /sys/fs/smackfs directory is created by the kernel.
+The ``/sys/fs/smackfs`` directory is created by the kernel.
Smack uses extended attributes (xattrs) to store labels on filesystem
objects. The attributes are stored in the extended attribute security
-name space. A process must have CAP_MAC_ADMIN to change any of these
+name space. A process must have ``CAP_MAC_ADMIN`` to change any of these
attributes.
The extended attributes that Smack uses are:
@@ -73,14 +85,17 @@ SMACK64
Used to make access control decisions. In almost all cases
the label given to a new filesystem object will be the label
of the process that created it.
+
SMACK64EXEC
The Smack label of a process that execs a program file with
this attribute set will run with this attribute's value.
+
SMACK64MMAP
Don't allow the file to be mmapped by a process whose Smack
label does not allow all of the access permitted to a process
with the label contained in this attribute. This is a very
specific use case for shared libraries.
+
SMACK64TRANSMUTE
Can only have the value "TRUE". If this attribute is present
on a directory when an object is created in the directory and
@@ -89,27 +104,29 @@ SMACK64TRANSMUTE
gets the label of the directory instead of the label of the
creating process. If the object being created is a directory
the SMACK64TRANSMUTE attribute is set as well.
+
SMACK64IPIN
This attribute is only available on file descriptors for sockets.
Use the Smack label in this attribute for access control
decisions on packets being delivered to this socket.
+
SMACK64IPOUT
This attribute is only available on file descriptors for sockets.
Use the Smack label in this attribute for access control
decisions on packets coming from this socket.
-There are multiple ways to set a Smack label on a file:
+There are multiple ways to set a Smack label on a file::
# attr -S -s SMACK64 -V "value" path
# chsmack -a value path
A process can see the Smack label it is running with by
-reading /proc/self/attr/current. A process with CAP_MAC_ADMIN
+reading ``/proc/self/attr/current``. A process with ``CAP_MAC_ADMIN``
can set the process Smack by writing there.
Most Smack configuration is accomplished by writing to files
in the smackfs filesystem. This pseudo-filesystem is mounted
-on /sys/fs/smackfs.
+on ``/sys/fs/smackfs``.
access
Provided for backward compatibility. The access2 interface
@@ -120,6 +137,7 @@ access
this file. The next read will indicate whether the access
would be permitted. The text will be either "1" indicating
access, or "0" indicating denial.
+
access2
This interface reports whether a subject with the specified
Smack label has a particular access to an object with a
@@ -127,13 +145,17 @@ access2
this file. The next read will indicate whether the access
would be permitted. The text will be either "1" indicating
access, or "0" indicating denial.
+
ambient
This contains the Smack label applied to unlabeled network
packets.
+
change-rule
This interface allows modification of existing access control rules.
- The format accepted on write is:
+ The format accepted on write is::
+
"%s %s %s %s"
+
where the first string is the subject label, the second the
object label, the third the access to allow and the fourth the
access to deny. The access strings may contain only the characters
@@ -141,47 +163,63 @@ change-rule
modified by enabling the permissions in the third string and disabling
those in the fourth string. If there is no such rule it will be
created using the access specified in the third and the fourth strings.
+
cipso
Provided for backward compatibility. The cipso2 interface
is preferred and should be used instead.
This interface allows a specific CIPSO header to be assigned
- to a Smack label. The format accepted on write is:
+ to a Smack label. The format accepted on write is::
+
"%24s%4d%4d"["%4d"]...
+
The first string is a fixed Smack label. The first number is
the level to use. The second number is the number of categories.
- The following numbers are the categories.
- "level-3-cats-5-19 3 2 5 19"
+ The following numbers are the categories::
+
+ "level-3-cats-5-19 3 2 5 19"
+
cipso2
This interface allows a specific CIPSO header to be assigned
- to a Smack label. The format accepted on write is:
- "%s%4d%4d"["%4d"]...
+ to a Smack label. The format accepted on write is::
+
+ "%s%4d%4d"["%4d"]...
+
The first string is a long Smack label. The first number is
the level to use. The second number is the number of categories.
- The following numbers are the categories.
- "level-3-cats-5-19 3 2 5 19"
+ The following numbers are the categories::
+
+ "level-3-cats-5-19 3 2 5 19"
+
direct
This contains the CIPSO level used for Smack direct label
representation in network packets.
+
doi
This contains the CIPSO domain of interpretation used in
network packets.
+
ipv6host
This interface allows specific IPv6 internet addresses to be
treated as single label hosts. Packets are sent to single
label hosts only from processes that have Smack write access
to the host label. All packets received from single label hosts
- are given the specified label. The format accepted on write is:
+ are given the specified label. The format accepted on write is::
+
"%h:%h:%h:%h:%h:%h:%h:%h label" or
"%h:%h:%h:%h:%h:%h:%h:%h/%d label".
+
The "::" address shortcut is not supported.
If label is "-DELETE" a matched entry will be deleted.
+
load
Provided for backward compatibility. The load2 interface
is preferred and should be used instead.
This interface allows access control rules in addition to
the system defined rules to be specified. The format accepted
- on write is:
+ on write is::
+
"%24s%24s%5s"
+
where the first string is the subject label, the second the
object label, and the third the requested access. The access
string may contain only the characters "rwxat-", and specifies
@@ -189,17 +227,21 @@ load
permissions that are not allowed. The string "r-x--" would
specify read and execute access. Labels are limited to 23
characters in length.
+
load2
This interface allows access control rules in addition to
the system defined rules to be specified. The format accepted
- on write is:
+ on write is::
+
"%s %s %s"
+
where the first string is the subject label, the second the
object label, and the third the requested access. The access
string may contain only the characters "rwxat-", and specifies
which sort of access is allowed. The "-" is a placeholder for
permissions that are not allowed. The string "r-x--" would
specify read and execute access.
+
load-self
Provided for backward compatibility. The load-self2 interface
is preferred and should be used instead.
@@ -208,66 +250,83 @@ load-self
otherwise be permitted, and are intended to provide additional
restrictions on the process. The format is the same as for
the load interface.
+
load-self2
This interface allows process specific access rules to be
defined. These rules are only consulted if access would
otherwise be permitted, and are intended to provide additional
restrictions on the process. The format is the same as for
the load2 interface.
+
logging
This contains the Smack logging state.
+
mapped
This contains the CIPSO level used for Smack mapped label
representation in network packets.
+
netlabel
This interface allows specific internet addresses to be
treated as single label hosts. Packets are sent to single
label hosts without CIPSO headers, but only from processes
that have Smack write access to the host label. All packets
received from single label hosts are given the specified
- label. The format accepted on write is:
+ label. The format accepted on write is::
+
"%d.%d.%d.%d label" or "%d.%d.%d.%d/%d label".
+
If the label specified is "-CIPSO" the address is treated
as a host that supports CIPSO headers.
+
onlycap
This contains labels processes must have for CAP_MAC_ADMIN
- and CAP_MAC_OVERRIDE to be effective. If this file is empty
+ and ``CAP_MAC_OVERRIDE`` to be effective. If this file is empty
these capabilities are effective at for processes with any
label. The values are set by writing the desired labels, separated
by spaces, to the file or cleared by writing "-" to the file.
+
ptrace
This is used to define the current ptrace policy
- 0 - default: this is the policy that relies on Smack access rules.
- For the PTRACE_READ a subject needs to have a read access on
- object. For the PTRACE_ATTACH a read-write access is required.
- 1 - exact: this is the policy that limits PTRACE_ATTACH. Attach is
+
+ 0 - default:
+ this is the policy that relies on Smack access rules.
+ For the ``PTRACE_READ`` a subject needs to have a read access on
+ object. For the ``PTRACE_ATTACH`` a read-write access is required.
+
+ 1 - exact:
+ this is the policy that limits ``PTRACE_ATTACH``. Attach is
only allowed when subject's and object's labels are equal.
- PTRACE_READ is not affected. Can be overridden with CAP_SYS_PTRACE.
- 2 - draconian: this policy behaves like the 'exact' above with an
- exception that it can't be overridden with CAP_SYS_PTRACE.
+ ``PTRACE_READ`` is not affected. Can be overridden with ``CAP_SYS_PTRACE``.
+
+ 2 - draconian:
+ this policy behaves like the 'exact' above with an
+ exception that it can't be overridden with ``CAP_SYS_PTRACE``.
+
revoke-subject
Writing a Smack label here sets the access to '-' for all access
rules with that subject label.
+
unconfined
- If the kernel is configured with CONFIG_SECURITY_SMACK_BRINGUP
- a process with CAP_MAC_ADMIN can write a label into this interface.
+ If the kernel is configured with ``CONFIG_SECURITY_SMACK_BRINGUP``
+ a process with ``CAP_MAC_ADMIN`` can write a label into this interface.
Thereafter, accesses that involve that label will be logged and
the access permitted if it wouldn't be otherwise. Note that this
is dangerous and can ruin the proper labeling of your system.
It should never be used in production.
+
relabel-self
This interface contains a list of labels to which the process can
- transition to, by writing to /proc/self/attr/current.
+ transition to, by writing to ``/proc/self/attr/current``.
Normally a process can change its own label to any legal value, but only
- if it has CAP_MAC_ADMIN. This interface allows a process without
- CAP_MAC_ADMIN to relabel itself to one of labels from predefined list.
- A process without CAP_MAC_ADMIN can change its label only once. When it
+ if it has ``CAP_MAC_ADMIN``. This interface allows a process without
+ ``CAP_MAC_ADMIN`` to relabel itself to one of labels from predefined list.
+ A process without ``CAP_MAC_ADMIN`` can change its label only once. When it
does, this list will be cleared.
The values are set by writing the desired labels, separated
by spaces, to the file or cleared by writing "-" to the file.
If you are using the smackload utility
-you can add access rules in /etc/smack/accesses. They take the form:
+you can add access rules in ``/etc/smack/accesses``. They take the form::
subjectlabel objectlabel access
@@ -277,14 +336,14 @@ object with objectlabel. If there is no rule no access is allowed.
Look for additional programs on http://schaufler-ca.com
-From the Smack Whitepaper:
-
-The Simplified Mandatory Access Control Kernel
+The Simplified Mandatory Access Control Kernel (Whitepaper)
+===========================================================
Casey Schaufler
casey@schaufler-ca.com
Mandatory Access Control
+------------------------
Computer systems employ a variety of schemes to constrain how information is
shared among the people and services using the machine. Some of these schemes
@@ -297,6 +356,7 @@ access control mechanisms because you don't have a choice regarding the users
or programs that have access to pieces of data.
Bell & LaPadula
+---------------
From the middle of the 1980's until the turn of the century Mandatory Access
Control (MAC) was very closely associated with the Bell & LaPadula security
@@ -306,6 +366,7 @@ within the Capital Beltway and Scandinavian supercomputer centers but was
often sited as failing to address general needs.
Domain Type Enforcement
+-----------------------
Around the turn of the century Domain Type Enforcement (DTE) became popular.
This scheme organizes users, programs, and data into domains that are
@@ -316,6 +377,7 @@ necessary to provide a secure domain mapping leads to the scheme being
disabled or used in limited ways in the majority of cases.
Smack
+-----
Smack is a Mandatory Access Control mechanism designed to provide useful MAC
while avoiding the pitfalls of its predecessors. The limitations of Bell &
@@ -326,46 +388,55 @@ Enforcement and avoided by defining access controls in terms of the access
modes already in use.
Smack Terminology
+-----------------
The jargon used to talk about Smack will be familiar to those who have dealt
with other MAC systems and shouldn't be too difficult for the uninitiated to
pick up. There are four terms that are used in a specific way and that are
especially important:
- Subject: A subject is an active entity on the computer system.
+ Subject:
+ A subject is an active entity on the computer system.
On Smack a subject is a task, which is in turn the basic unit
of execution.
- Object: An object is a passive entity on the computer system.
+ Object:
+ An object is a passive entity on the computer system.
On Smack files of all types, IPC, and tasks can be objects.
- Access: Any attempt by a subject to put information into or get
+ Access:
+ Any attempt by a subject to put information into or get
information from an object is an access.
- Label: Data that identifies the Mandatory Access Control
+ Label:
+ Data that identifies the Mandatory Access Control
characteristics of a subject or an object.
These definitions are consistent with the traditional use in the security
community. There are also some terms from Linux that are likely to crop up:
- Capability: A task that possesses a capability has permission to
+ Capability:
+ A task that possesses a capability has permission to
violate an aspect of the system security policy, as identified by
the specific capability. A task that possesses one or more
capabilities is a privileged task, whereas a task with no
capabilities is an unprivileged task.
- Privilege: A task that is allowed to violate the system security
+ Privilege:
+ A task that is allowed to violate the system security
policy is said to have privilege. As of this writing a task can
have privilege either by possessing capabilities or by having an
effective user of root.
Smack Basics
+------------
Smack is an extension to a Linux system. It enforces additional restrictions
on what subjects can access which objects, based on the labels attached to
each of the subject and the object.
Labels
+~~~~~~
Smack labels are ASCII character strings. They can be up to 255 characters
long, but keeping them to twenty-three characters is recommended.
@@ -377,7 +448,7 @@ contain unprintable characters, the "/" (slash), the "\" (backslash), the "'"
(quote) and '"' (double-quote) characters.
Smack labels cannot begin with a '-'. This is reserved for special options.
-There are some predefined labels:
+There are some predefined labels::
_ Pronounced "floor", a single underscore character.
^ Pronounced "hat", a single circumflex character.
@@ -390,14 +461,18 @@ of a process will usually be assigned by the system initialization
mechanism.
Access Rules
+~~~~~~~~~~~~
Smack uses the traditional access modes of Linux. These modes are read,
execute, write, and occasionally append. There are a few cases where the
access mode may not be obvious. These include:
- Signals: A signal is a write operation from the subject task to
+ Signals:
+ A signal is a write operation from the subject task to
the object task.
- Internet Domain IPC: Transmission of a packet is considered a
+
+ Internet Domain IPC:
+ Transmission of a packet is considered a
write operation from the source task to the destination task.
Smack restricts access based on the label attached to a subject and the label
@@ -417,6 +492,7 @@ order:
7. Any other access is denied.
Smack Access Rules
+~~~~~~~~~~~~~~~~~~
With the isolation provided by Smack access separation is simple. There are
many interesting cases where limited access by subjects to objects with
@@ -427,8 +503,9 @@ be "born" highly classified. To accommodate such schemes Smack includes a
mechanism for specifying rules allowing access between labels.
Access Rule Format
+~~~~~~~~~~~~~~~~~~
-The format of an access rule is:
+The format of an access rule is::
subject-label object-label access
@@ -446,7 +523,7 @@ describe access modes:
Uppercase values for the specification letters are allowed as well.
Access mode specifications can be in any order. Examples of acceptable rules
-are:
+are::
TopSecret Secret rx
Secret Unclass R
@@ -456,7 +533,7 @@ are:
New Old rRrRr
Closed Off -
-Examples of unacceptable rules are:
+Examples of unacceptable rules are::
Top Secret Secret rx
Ace Ace r
@@ -469,6 +546,7 @@ access specifications. The dash is a placeholder, so "a-r" is the same
as "ar". A lone dash is used to specify that no access should be allowed.
Applying Access Rules
+~~~~~~~~~~~~~~~~~~~~~
The developers of Linux rarely define new sorts of things, usually importing
schemes and concepts from other systems. Most often, the other systems are
@@ -511,6 +589,7 @@ one process to another requires that the sender have write access to the
receiver. The receiver is not required to have read access to the sender.
Setting Access Rules
+~~~~~~~~~~~~~~~~~~~~
The configuration file /etc/smack/accesses contains the rules to be set at
system startup. The contents are written to the special file
@@ -520,6 +599,7 @@ one rule, with the most recently specified overriding any earlier
specification.
Task Attribute
+~~~~~~~~~~~~~~
The Smack label of a process can be read from /proc/<pid>/attr/current. A
process can read its own Smack label from /proc/self/attr/current. A
@@ -527,12 +607,14 @@ privileged process can change its own Smack label by writing to
/proc/self/attr/current but not the label of another process.
File Attribute
+~~~~~~~~~~~~~~
The Smack label of a filesystem object is stored as an extended attribute
named SMACK64 on the file. This attribute is in the security namespace. It can
only be changed by a process with privilege.
Privilege
+~~~~~~~~~
A process with CAP_MAC_OVERRIDE or CAP_MAC_ADMIN is privileged.
CAP_MAC_OVERRIDE allows the process access to objects it would
@@ -540,6 +622,7 @@ be denied otherwise. CAP_MAC_ADMIN allows a process to change
Smack data, including rules and attributes.
Smack Networking
+~~~~~~~~~~~~~~~~
As mentioned before, Smack enforces access control on network protocol
transmissions. Every packet sent by a Smack process is tagged with its Smack
@@ -551,6 +634,7 @@ packet has write access to the receiving process and if that is not the case
the packet is dropped.
CIPSO Configuration
+~~~~~~~~~~~~~~~~~~~
It is normally unnecessary to specify the CIPSO configuration. The default
values used by the system handle all internal cases. Smack will compose CIPSO
@@ -571,13 +655,13 @@ discarded. The DOI is 3 by default. The value can be read from
The label and category set are mapped to a Smack label as defined in
/etc/smack/cipso.
-A Smack/CIPSO mapping has the form:
+A Smack/CIPSO mapping has the form::
smack level [category [category]*]
Smack does not expect the level or category sets to be related in any
particular way and does not assume or assign accesses based on them. Some
-examples of mappings:
+examples of mappings::
TopSecret 7
TS:A,B 7 1 2
@@ -597,25 +681,30 @@ value can be read from /sys/fs/smackfs/direct and changed by writing to
/sys/fs/smackfs/direct.
Socket Attributes
+~~~~~~~~~~~~~~~~~
There are two attributes that are associated with sockets. These attributes
can only be set by privileged tasks, but any task can read them for their own
sockets.
- SMACK64IPIN: The Smack label of the task object. A privileged
+ SMACK64IPIN:
+ The Smack label of the task object. A privileged
program that will enforce policy may set this to the star label.
- SMACK64IPOUT: The Smack label transmitted with outgoing packets.
+ SMACK64IPOUT:
+ The Smack label transmitted with outgoing packets.
A privileged program may set this to match the label of another
task with which it hopes to communicate.
Smack Netlabel Exceptions
+~~~~~~~~~~~~~~~~~~~~~~~~~
You will often find that your labeled application has to talk to the outside,
unlabeled world. To do this there's a special file /sys/fs/smackfs/netlabel
-where you can add some exceptions in the form of :
-@IP1 LABEL1 or
-@IP2/MASK LABEL2
+where you can add some exceptions in the form of::
+
+ @IP1 LABEL1 or
+ @IP2/MASK LABEL2
It means that your application will have unlabeled access to @IP1 if it has
write access on LABEL1, and access to the subnet @IP2/MASK if it has write
@@ -624,28 +713,32 @@ access on LABEL2.
Entries in the /sys/fs/smackfs/netlabel file are matched by longest mask
first, like in classless IPv4 routing.
-A special label '@' and an option '-CIPSO' can be used there :
-@ means Internet, any application with any label has access to it
--CIPSO means standard CIPSO networking
+A special label '@' and an option '-CIPSO' can be used there::
-If you don't know what CIPSO is and don't plan to use it, you can just do :
-echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel
-echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel
+ @ means Internet, any application with any label has access to it
+ -CIPSO means standard CIPSO networking
+
+If you don't know what CIPSO is and don't plan to use it, you can just do::
+
+ echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel
+ echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel
If you use CIPSO on your 192.168.0.0/16 local network and need also unlabeled
-Internet access, you can have :
-echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel
-echo 192.168.0.0/16 -CIPSO > /sys/fs/smackfs/netlabel
-echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel
+Internet access, you can have::
+ echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel
+ echo 192.168.0.0/16 -CIPSO > /sys/fs/smackfs/netlabel
+ echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel
Writing Applications for Smack
+------------------------------
There are three sorts of applications that will run on a Smack system. How an
application interacts with Smack will determine what it will have to do to
work properly under Smack.
Smack Ignorant Applications
+---------------------------
By far the majority of applications have no reason whatever to care about the
unique properties of Smack. Since invoking a program has no impact on the
@@ -653,12 +746,14 @@ Smack label associated with the process the only concern likely to arise is
whether the process has execute access to the program.
Smack Relevant Applications
+---------------------------
Some programs can be improved by teaching them about Smack, but do not make
any security decisions themselves. The utility ls(1) is one example of such a
program.
Smack Enforcing Applications
+----------------------------
These are special programs that not only know about Smack, but participate in
the enforcement of system policy. In most cases these are the programs that
@@ -666,15 +761,16 @@ set up user sessions. There are also network services that provide information
to processes running with various labels.
File System Interfaces
+----------------------
Smack maintains labels on file system objects using extended attributes. The
Smack label of a file, directory, or other file system object can be obtained
-using getxattr(2).
+using getxattr(2)::
len = getxattr("/", "security.SMACK64", value, sizeof (value));
will put the Smack label of the root directory into value. A privileged
-process can set the Smack label of a file system object with setxattr(2).
+process can set the Smack label of a file system object with setxattr(2)::
len = strlen("Rubble");
rc = setxattr("/foo", "security.SMACK64", "Rubble", len, 0);
@@ -683,17 +779,18 @@ will set the Smack label of /foo to "Rubble" if the program has appropriate
privilege.
Socket Interfaces
+-----------------
The socket attributes can be read using fgetxattr(2).
A privileged process can set the Smack label of outgoing packets with
-fsetxattr(2).
+fsetxattr(2)::
len = strlen("Rubble");
rc = fsetxattr(fd, "security.SMACK64IPOUT", "Rubble", len, 0);
will set the Smack label "Rubble" on packets going out from the socket if the
-program has appropriate privilege.
+program has appropriate privilege::
rc = fsetxattr(fd, "security.SMACK64IPIN, "*", strlen("*"), 0);
@@ -701,33 +798,40 @@ will set the Smack label "*" as the object label against which incoming
packets will be checked if the program has appropriate privilege.
Administration
+--------------
Smack supports some mount options:
- smackfsdef=label: specifies the label to give files that lack
+ smackfsdef=label:
+ specifies the label to give files that lack
the Smack label extended attribute.
- smackfsroot=label: specifies the label to assign the root of the
+ smackfsroot=label:
+ specifies the label to assign the root of the
file system if it lacks the Smack extended attribute.
- smackfshat=label: specifies a label that must have read access to
+ smackfshat=label:
+ specifies a label that must have read access to
all labels set on the filesystem. Not yet enforced.
- smackfsfloor=label: specifies a label to which all labels set on the
+ smackfsfloor=label:
+ specifies a label to which all labels set on the
filesystem must have read access. Not yet enforced.
These mount options apply to all file system types.
Smack auditing
+--------------
If you want Smack auditing of security events, you need to set CONFIG_AUDIT
in your kernel configuration.
By default, all denied events will be audited. You can change this behavior by
-writing a single character to the /sys/fs/smackfs/logging file :
-0 : no logging
-1 : log denied (default)
-2 : log accepted
-3 : log denied & accepted
+writing a single character to the /sys/fs/smackfs/logging file::
+
+ 0 : no logging
+ 1 : log denied (default)
+ 2 : log accepted
+ 3 : log denied & accepted
Events are logged as 'key=value' pairs, for each event you at least will get
the subject, the object, the rights requested, the action, the kernel function
@@ -735,6 +839,7 @@ that triggered the event, plus other pairs depending on the type of event
audited.
Bringup Mode
+------------
Bringup mode provides logging features that can make application
configuration and system bringup easier. Configure the kernel with
diff --git a/Documentation/security/Yama.txt b/Documentation/admin-guide/LSM/Yama.rst
index d9ee7d7a6c7f..13468ea696b7 100644
--- a/Documentation/security/Yama.txt
+++ b/Documentation/admin-guide/LSM/Yama.rst
@@ -1,13 +1,14 @@
+====
+Yama
+====
+
Yama is a Linux Security Module that collects system-wide DAC security
protections that are not handled by the core kernel itself. This is
-selectable at build-time with CONFIG_SECURITY_YAMA, and can be controlled
-at run-time through sysctls in /proc/sys/kernel/yama:
-
-- ptrace_scope
+selectable at build-time with ``CONFIG_SECURITY_YAMA``, and can be controlled
+at run-time through sysctls in ``/proc/sys/kernel/yama``:
-==============================================================
-
-ptrace_scope:
+ptrace_scope
+============
As Linux grows in popularity, it will become a larger target for
malware. One particularly troubling weakness of the Linux process
@@ -25,47 +26,49 @@ exist and remain possible if ptrace is allowed to operate as before.
Since ptrace is not commonly used by non-developers and non-admins, system
builders should be allowed the option to disable this debugging system.
-For a solution, some applications use prctl(PR_SET_DUMPABLE, ...) to
+For a solution, some applications use ``prctl(PR_SET_DUMPABLE, ...)`` to
specifically disallow such ptrace attachment (e.g. ssh-agent), but many
do not. A more general solution is to only allow ptrace directly from a
parent to a child process (i.e. direct "gdb EXE" and "strace EXE" still
-work), or with CAP_SYS_PTRACE (i.e. "gdb --pid=PID", and "strace -p PID"
+work), or with ``CAP_SYS_PTRACE`` (i.e. "gdb --pid=PID", and "strace -p PID"
still work as root).
In mode 1, software that has defined application-specific relationships
between a debugging process and its inferior (crash handlers, etc),
-prctl(PR_SET_PTRACER, pid, ...) can be used. An inferior can declare which
-other process (and its descendants) are allowed to call PTRACE_ATTACH
+``prctl(PR_SET_PTRACER, pid, ...)`` can be used. An inferior can declare which
+other process (and its descendants) are allowed to call ``PTRACE_ATTACH``
against it. Only one such declared debugging process can exists for
each inferior at a time. For example, this is used by KDE, Chromium, and
Firefox's crash handlers, and by Wine for allowing only Wine processes
to ptrace each other. If a process wishes to entirely disable these ptrace
-restrictions, it can call prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, ...)
+restrictions, it can call ``prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, ...)``
so that any otherwise allowed process (even those in external pid namespaces)
may attach.
-The sysctl settings (writable only with CAP_SYS_PTRACE) are:
+The sysctl settings (writable only with ``CAP_SYS_PTRACE``) are:
-0 - classic ptrace permissions: a process can PTRACE_ATTACH to any other
+0 - classic ptrace permissions:
+ a process can ``PTRACE_ATTACH`` to any other
process running under the same uid, as long as it is dumpable (i.e.
did not transition uids, start privileged, or have called
- prctl(PR_SET_DUMPABLE...) already). Similarly, PTRACE_TRACEME is
+ ``prctl(PR_SET_DUMPABLE...)`` already). Similarly, ``PTRACE_TRACEME`` is
unchanged.
-1 - restricted ptrace: a process must have a predefined relationship
- with the inferior it wants to call PTRACE_ATTACH on. By default,
+1 - restricted ptrace:
+ a process must have a predefined relationship
+ with the inferior it wants to call ``PTRACE_ATTACH`` on. By default,
this relationship is that of only its descendants when the above
classic criteria is also met. To change the relationship, an
- inferior can call prctl(PR_SET_PTRACER, debugger, ...) to declare
- an allowed debugger PID to call PTRACE_ATTACH on the inferior.
- Using PTRACE_TRACEME is unchanged.
+ inferior can call ``prctl(PR_SET_PTRACER, debugger, ...)`` to declare
+ an allowed debugger PID to call ``PTRACE_ATTACH`` on the inferior.
+ Using ``PTRACE_TRACEME`` is unchanged.
-2 - admin-only attach: only processes with CAP_SYS_PTRACE may use ptrace
- with PTRACE_ATTACH, or through children calling PTRACE_TRACEME.
+2 - admin-only attach:
+ only processes with ``CAP_SYS_PTRACE`` may use ptrace
+ with ``PTRACE_ATTACH``, or through children calling ``PTRACE_TRACEME``.
-3 - no attach: no processes may use ptrace with PTRACE_ATTACH nor via
- PTRACE_TRACEME. Once set, this sysctl value cannot be changed.
+3 - no attach:
+ no processes may use ptrace with ``PTRACE_ATTACH`` nor via
+ ``PTRACE_TRACEME``. Once set, this sysctl value cannot be changed.
The original children-only logic was based on the restrictions in grsecurity.
-
-==============================================================
diff --git a/Documentation/security/apparmor.txt b/Documentation/admin-guide/LSM/apparmor.rst
index 93c1fd7d0635..3e9734bd0e05 100644
--- a/Documentation/security/apparmor.txt
+++ b/Documentation/admin-guide/LSM/apparmor.rst
@@ -1,4 +1,9 @@
---- What is AppArmor? ---
+========
+AppArmor
+========
+
+What is AppArmor?
+=================
AppArmor is MAC style security extension for the Linux kernel. It implements
a task centered policy, with task "profiles" being created and loaded
@@ -6,34 +11,41 @@ from user space. Tasks on the system that do not have a profile defined for
them run in an unconfined state which is equivalent to standard Linux DAC
permissions.
---- How to enable/disable ---
+How to enable/disable
+=====================
+
+set ``CONFIG_SECURITY_APPARMOR=y``
-set CONFIG_SECURITY_APPARMOR=y
+If AppArmor should be selected as the default security module then set::
-If AppArmor should be selected as the default security module then
- set CONFIG_DEFAULT_SECURITY="apparmor"
- and CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1
+ CONFIG_DEFAULT_SECURITY="apparmor"
+ CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1
Build the kernel
If AppArmor is not the default security module it can be enabled by passing
-security=apparmor on the kernel's command line.
+``security=apparmor`` on the kernel's command line.
If AppArmor is the default security module it can be disabled by passing
-apparmor=0, security=XXXX (where XXX is valid security module), on the
-kernel's command line
+``apparmor=0, security=XXXX`` (where ``XXXX`` is valid security module), on the
+kernel's command line.
For AppArmor to enforce any restrictions beyond standard Linux DAC permissions
policy must be loaded into the kernel from user space (see the Documentation
and tools links).
---- Documentation ---
+Documentation
+=============
-Documentation can be found on the wiki.
+Documentation can be found on the wiki, linked below.
---- Links ---
+Links
+=====
Mailing List - apparmor@lists.ubuntu.com
+
Wiki - http://apparmor.wiki.kernel.org/
+
User space tools - https://launchpad.net/apparmor
+
Kernel module - git://git.kernel.org/pub/scm/linux/kernel/git/jj/apparmor-dev.git
diff --git a/Documentation/security/LSM.txt b/Documentation/admin-guide/LSM/index.rst
index c2683f28ed36..c980dfe9abf1 100644
--- a/Documentation/security/LSM.txt
+++ b/Documentation/admin-guide/LSM/index.rst
@@ -1,12 +1,13 @@
-Linux Security Module framework
--------------------------------
+===========================
+Linux Security Module Usage
+===========================
The Linux Security Module (LSM) framework provides a mechanism for
various security checks to be hooked by new kernel extensions. The name
"module" is a bit of a misnomer since these extensions are not actually
loadable kernel modules. Instead, they are selectable at build-time via
CONFIG_DEFAULT_SECURITY and can be overridden at boot-time via the
-"security=..." kernel command line argument, in the case where multiple
+``"security=..."`` kernel command line argument, in the case where multiple
LSMs were built into a given kernel.
The primary users of the LSM interface are Mandatory Access Control
@@ -19,23 +20,22 @@ in the core functionality of Linux itself.
Without a specific LSM built into the kernel, the default LSM will be the
Linux capabilities system. Most LSMs choose to extend the capabilities
system, building their checks on top of the defined capability hooks.
-For more details on capabilities, see capabilities(7) in the Linux
+For more details on capabilities, see ``capabilities(7)`` in the Linux
man-pages project.
A list of the active security modules can be found by reading
-/sys/kernel/security/lsm. This is a comma separated list, and
+``/sys/kernel/security/lsm``. This is a comma separated list, and
will always include the capability module. The list reflects the
order in which checks are made. The capability module will always
be first, followed by any "minor" modules (e.g. Yama) and then
the one "major" module (e.g. SELinux) if there is one configured.
-Based on https://lkml.org/lkml/2007/10/26/215,
-a new LSM is accepted into the kernel when its intent (a description of
-what it tries to protect against and in what cases one would expect to
-use it) has been appropriately documented in Documentation/security/.
-This allows an LSM's code to be easily compared to its goals, and so
-that end users and distros can make a more informed decision about which
-LSMs suit their requirements.
+.. toctree::
+ :maxdepth: 1
-For extensive documentation on the available LSM hook interfaces, please
-see include/linux/security.h.
+ apparmor
+ LoadPin
+ SELinux
+ Smack
+ tomoyo
+ Yama
diff --git a/Documentation/security/tomoyo.txt b/Documentation/admin-guide/LSM/tomoyo.rst
index 200a2d37cbc8..a5947218fa64 100644
--- a/Documentation/security/tomoyo.txt
+++ b/Documentation/admin-guide/LSM/tomoyo.rst
@@ -1,21 +1,30 @@
---- What is TOMOYO? ---
+======
+TOMOYO
+======
+
+What is TOMOYO?
+===============
TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel.
LiveCD-based tutorials are available at
+
http://tomoyo.sourceforge.jp/1.7/1st-step/ubuntu10.04-live/
-http://tomoyo.sourceforge.jp/1.7/1st-step/centos5-live/ .
+http://tomoyo.sourceforge.jp/1.7/1st-step/centos5-live/
+
Though these tutorials use non-LSM version of TOMOYO, they are useful for you
to know what TOMOYO is.
---- How to enable TOMOYO? ---
+How to enable TOMOYO?
+=====================
-Build the kernel with CONFIG_SECURITY_TOMOYO=y and pass "security=tomoyo" on
+Build the kernel with ``CONFIG_SECURITY_TOMOYO=y`` and pass ``security=tomoyo`` on
kernel's command line.
Please see http://tomoyo.sourceforge.jp/2.3/ for details.
---- Where is documentation? ---
+Where is documentation?
+=======================
User <-> Kernel interface documentation is available at
http://tomoyo.sourceforge.jp/2.3/policy-reference.html .
@@ -42,7 +51,8 @@ History of TOMOYO?
Realities of Mainlining
http://sourceforge.jp/projects/tomoyo/docs/lfj2008.pdf
---- What is future plan? ---
+What is future plan?
+====================
We believe that inode based security and name based security are complementary
and both should be used together. But unfortunately, so far, we cannot enable
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 8c60a8a32a1a..e14c374aaf60 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -61,6 +61,7 @@ configure specific aspects of kernel behavior to your liking.
java
ras
pm/index
+ LSM/index
.. only:: subproject and html
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 230b03caee55..15f79c27748d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1578,6 +1578,15 @@
extended tables themselves, and also PASID support. With
this option set, extended tables will not be used even
on hardware which claims to support them.
+ tboot_noforce [Default Off]
+ Do not force the Intel IOMMU enabled under tboot.
+ By default, tboot will force Intel IOMMU on, which
+ could harm performance of some high-throughput
+ devices like 40GBit network cards, even if identity
+ mapping is enabled.
+ Note that using this option lowers the security
+ provided by tboot because it makes the system
+ vulnerable to DMA attacks.
intel_idle.max_cstate= [KNL,HW,ACPI,X86]
0 disables intel_idle and fall back on acpi_idle.
@@ -1644,6 +1653,12 @@
nobypass [PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
+ iommu.passthrough=
+ [ARM64] Configure DMA to bypass the IOMMU by default.
+ Format: { "0" | "1" }
+ 0 - Use IOMMU translation for DMA.
+ 1 - Bypass the IOMMU for DMA.
+ unset - Use IOMMU translation for DMA.
io7= [HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
@@ -2419,12 +2434,6 @@
and gids from such clients. This is intended to ease
migration from NFSv2/v3.
- objlayoutdriver.osd_login_prog=
- [NFS] [OBJLAYOUT] sets the pathname to the program which
- is used to automatically discover and login into new
- osd-targets. Please see:
- Documentation/filesystems/pnfs.txt for more explanations
-
nmi_debug= [KNL,SH] Specify one or more actions to take
when a NMI is triggered.
Format: [state][,regs][,debounce][,die]
@@ -3785,6 +3794,14 @@
spia_pedr=
spia_peddr=
+ srcutree.exp_holdoff [KNL]
+ Specifies how many nanoseconds must elapse
+ since the end of the last SRCU grace period for
+ a given srcu_struct until the next normal SRCU
+ grace period will be considered for automatic
+ expediting. Set to zero to disable automatic
+ expediting.
+
stacktrace [FTRACE]
Enabled the stack tracer on boot up.
diff --git a/Documentation/arm64/tagged-pointers.txt b/Documentation/arm64/tagged-pointers.txt
index d9995f1f51b3..a25a99e82bb1 100644
--- a/Documentation/arm64/tagged-pointers.txt
+++ b/Documentation/arm64/tagged-pointers.txt
@@ -11,24 +11,56 @@ in AArch64 Linux.
The kernel configures the translation tables so that translations made
via TTBR0 (i.e. userspace mappings) have the top byte (bits 63:56) of
the virtual address ignored by the translation hardware. This frees up
-this byte for application use, with the following caveats:
+this byte for application use.
- (1) The kernel requires that all user addresses passed to EL1
- are tagged with tag 0x00. This means that any syscall
- parameters containing user virtual addresses *must* have
- their top byte cleared before trapping to the kernel.
- (2) Non-zero tags are not preserved when delivering signals.
- This means that signal handlers in applications making use
- of tags cannot rely on the tag information for user virtual
- addresses being maintained for fields inside siginfo_t.
- One exception to this rule is for signals raised in response
- to watchpoint debug exceptions, where the tag information
- will be preserved.
+Passing tagged addresses to the kernel
+--------------------------------------
- (3) Special care should be taken when using tagged pointers,
- since it is likely that C compilers will not hazard two
- virtual addresses differing only in the upper byte.
+All interpretation of userspace memory addresses by the kernel assumes
+an address tag of 0x00.
+
+This includes, but is not limited to, addresses found in:
+
+ - pointer arguments to system calls, including pointers in structures
+ passed to system calls,
+
+ - the stack pointer (sp), e.g. when interpreting it to deliver a
+ signal,
+
+ - the frame pointer (x29) and frame records, e.g. when interpreting
+ them to generate a backtrace or call graph.
+
+Using non-zero address tags in any of these locations may result in an
+error code being returned, a (fatal) signal being raised, or other modes
+of failure.
+
+For these reasons, passing non-zero address tags to the kernel via
+system calls is forbidden, and using a non-zero address tag for sp is
+strongly discouraged.
+
+Programs maintaining a frame pointer and frame records that use non-zero
+address tags may suffer impaired or inaccurate debug and profiling
+visibility.
+
+
+Preserving tags
+---------------
+
+Non-zero tags are not preserved when delivering signals. This means that
+signal handlers in applications making use of tags cannot rely on the
+tag information for user virtual addresses being maintained for fields
+inside siginfo_t. One exception to this rule is for signals raised in
+response to watchpoint debug exceptions, where the tag information will
+be preserved.
The architecture prevents the use of a tagged PC, so the upper byte will
be set to a sign-extension of bit 55 on exception return.
+
+
+Other considerations
+--------------------
+
+Special care should be taken when using tagged pointers, since it is
+likely that C compilers will not hazard two virtual addresses differing
+only in the upper byte.
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 1b87df6cd476..05e2822a80b3 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -11,6 +11,13 @@ controllers), BFQ's main features are:
groups (switching back to time distribution when needed to keep
throughput high).
+In its default configuration, BFQ privileges latency over
+throughput. So, when needed for achieving a lower latency, BFQ builds
+schedules that may lead to a lower throughput. If your main or only
+goal, for a given device, is to achieve the maximum-possible
+throughput at all times, then do switch off all low-latency heuristics
+for that device, by setting low_latency to 0. Full details in Section 3.
+
On average CPUs, the current version of BFQ can handle devices
performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
reference, 30-50 KIOPS correspond to very high bandwidths with
@@ -375,11 +382,19 @@ default, low latency mode is enabled. If enabled, interactive and soft
real-time applications are privileged and experience a lower latency,
as explained in more detail in the description of how BFQ works.
-DO NOT enable this mode if you need full control on bandwidth
+DISABLE this mode if you need full control on bandwidth
distribution. In fact, if it is enabled, then BFQ automatically
increases the bandwidth share of privileged applications, as the main
means to guarantee a lower latency to them.
+In addition, as already highlighted at the beginning of this document,
+DISABLE this mode if your only goal is to achieve a high throughput.
+In fact, privileging the I/O of some application over the rest may
+entail a lower throughput. To achieve the highest-possible throughput
+on a non-rotational device, setting slice_idle to 0 may be needed too
+(at the cost of giving up any strong guarantee on fairness and low
+latency).
+
timeout_sync
------------
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index e50b95c25868..dc5e2dcdbef4 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -918,6 +918,18 @@ PAGE_SIZE multiple when read back.
Number of major page faults incurred
+ workingset_refault
+
+ Number of refaults of previously evicted pages
+
+ workingset_activate
+
+ Number of refaulted pages that were immediately activated
+
+ workingset_nodereclaim
+
+ Number of times a shadow node has been reclaimed
+
memory.swap.current
A read-only single value file which exists on non-root
diff --git a/Documentation/core-api/assoc_array.rst b/Documentation/core-api/assoc_array.rst
index d83cfff9ea43..8231b915c939 100644
--- a/Documentation/core-api/assoc_array.rst
+++ b/Documentation/core-api/assoc_array.rst
@@ -10,7 +10,10 @@ properties:
1. Objects are opaque pointers. The implementation does not care where they
point (if anywhere) or what they point to (if anything).
-.. note:: Pointers to objects _must_ be zero in the least significant bit.
+
+ .. note::
+
+ Pointers to objects _must_ be zero in the least significant bit.
2. Objects do not need to contain linkage blocks for use by the array. This
permits an object to be located in multiple arrays simultaneously.
diff --git a/Documentation/crypto/asymmetric-keys.txt b/Documentation/crypto/asymmetric-keys.txt
index 5ad6480e3fb9..b82b6ad48488 100644
--- a/Documentation/crypto/asymmetric-keys.txt
+++ b/Documentation/crypto/asymmetric-keys.txt
@@ -265,7 +265,7 @@ mandatory:
The caller passes a pointer to the following struct with all of the fields
cleared, except for data, datalen and quotalen [see
- Documentation/security/keys.txt].
+ Documentation/security/keys/core.rst].
struct key_preparsed_payload {
char *description;
diff --git a/Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.txt b/Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.txt
new file mode 100644
index 000000000000..d38834c67dff
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.txt
@@ -0,0 +1,31 @@
+OP-TEE Device Tree Bindings
+
+OP-TEE is a piece of software using hardware features to provide a Trusted
+Execution Environment. The security can be provided with ARM TrustZone, but
+also by virtualization or a separate chip.
+
+We're using "linaro" as the first part of the compatible property for
+the reference implementation maintained by Linaro.
+
+* OP-TEE based on ARM TrustZone required properties:
+
+- compatible : should contain "linaro,optee-tz"
+
+- method : The method of calling the OP-TEE Trusted OS. Permitted
+ values are:
+
+ "smc" : SMC #0, with the register assignments specified
+ in drivers/tee/optee/optee_smc.h
+
+ "hvc" : HVC #0, with the register assignments specified
+ in drivers/tee/optee/optee_smc.h
+
+
+
+Example:
+ firmware {
+ optee {
+ compatible = "linaro,optee-tz";
+ method = "smc";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt
index cb0054ac7121..cd977db7630c 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt
@@ -7,6 +7,7 @@ Required Properties:
- compatible: Should be one of:
- "mediatek,mt2701-apmixedsys"
+ - "mediatek,mt6797-apmixedsys"
- "mediatek,mt8135-apmixedsys"
- "mediatek,mt8173-apmixedsys"
- #clock-cells: Must be 1
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,imgsys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,imgsys.txt
index f6a916686f4c..047b11ae5f45 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,imgsys.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,imgsys.txt
@@ -7,6 +7,7 @@ Required Properties:
- compatible: Should be one of:
- "mediatek,mt2701-imgsys", "syscon"
+ - "mediatek,mt6797-imgsys", "syscon"
- "mediatek,mt8173-imgsys", "syscon"
- #clock-cells: Must be 1
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt
index 1620ec2a5a3f..58d58e2006b8 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt
@@ -8,6 +8,7 @@ Required Properties:
- compatible: Should be one of:
- "mediatek,mt2701-infracfg", "syscon"
+ - "mediatek,mt6797-infracfg", "syscon"
- "mediatek,mt8135-infracfg", "syscon"
- "mediatek,mt8173-infracfg", "syscon"
- #clock-cells: Must be 1
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.txt
index 67dd2e473d25..70529e0b58e9 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.txt
@@ -7,6 +7,7 @@ Required Properties:
- compatible: Should be one of:
- "mediatek,mt2701-mmsys", "syscon"
+ - "mediatek,mt6797-mmsys", "syscon"
- "mediatek,mt8173-mmsys", "syscon"
- #clock-cells: Must be 1
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt
index 9f2fe7860114..ec93ecbb9f3c 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt
@@ -7,6 +7,7 @@ Required Properties:
- compatible: Should be one of:
- "mediatek,mt2701-topckgen"
+ - "mediatek,mt6797-topckgen"
- "mediatek,mt8135-topckgen"
- "mediatek,mt8173-topckgen"
- #clock-cells: Must be 1
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,vdecsys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,vdecsys.txt
index 2440f73450c3..d150104f928a 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,vdecsys.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,vdecsys.txt
@@ -7,6 +7,7 @@ Required Properties:
- compatible: Should be one of:
- "mediatek,mt2701-vdecsys", "syscon"
+ - "mediatek,mt6797-vdecsys", "syscon"
- "mediatek,mt8173-vdecsys", "syscon"
- #clock-cells: Must be 1
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,vencsys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,vencsys.txt
index 5bb2866a2b50..8a93be643647 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,vencsys.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,vencsys.txt
@@ -5,7 +5,8 @@ The Mediatek vencsys controller provides various clocks to the system.
Required Properties:
-- compatible: Should be:
+- compatible: Should be one of:
+ - "mediatek,mt6797-vencsys", "syscon"
- "mediatek,mt8173-vencsys", "syscon"
- #clock-cells: Must be 1
diff --git a/Documentation/devicetree/bindings/clock/idt,versaclock5.txt b/Documentation/devicetree/bindings/clock/idt,versaclock5.txt
index 87e9c47a89a3..53d7e50ed875 100644
--- a/Documentation/devicetree/bindings/clock/idt,versaclock5.txt
+++ b/Documentation/devicetree/bindings/clock/idt,versaclock5.txt
@@ -6,18 +6,21 @@ from 3 to 12 output clocks.
==I2C device node==
Required properties:
-- compatible: shall be one of "idt,5p49v5923" , "idt,5p49v5933".
+- compatible: shall be one of "idt,5p49v5923" , "idt,5p49v5933" ,
+ "idt,5p49v5935".
- reg: i2c device address, shall be 0x68 or 0x6a.
- #clock-cells: from common clock binding; shall be set to 1.
- clocks: from common clock binding; list of parent clock handles,
- 5p49v5923: (required) either or both of XTAL or CLKIN
reference clock.
- - 5p49v5933: (optional) property not present (internal
+ - 5p49v5933 and
+ - 5p49v5935: (optional) property not present (internal
Xtal used) or CLKIN reference
clock.
- clock-names: from common clock binding; clock input names, can be
- 5p49v5923: (required) either or both of "xin", "clkin".
- - 5p49v5933: (optional) property not present or "clkin".
+ - 5p49v5933 and
+ - 5p49v5935: (optional) property not present or "clkin".
==Mapping between clock specifier and physical pins==
@@ -34,6 +37,13 @@ clock specifier, the following mapping applies:
1 -- OUT1
2 -- OUT4
+5P49V5935:
+ 0 -- OUT0_SEL_I2CB
+ 1 -- OUT1
+ 2 -- OUT2
+ 3 -- OUT3
+ 4 -- OUT4
+
==Example==
/* 25MHz reference crystal */
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk1108-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.txt
index 4da126116cf0..161326a4f9c1 100644
--- a/Documentation/devicetree/bindings/clock/rockchip,rk1108-cru.txt
+++ b/Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.txt
@@ -1,12 +1,12 @@
-* Rockchip RK1108 Clock and Reset Unit
+* Rockchip RV1108 Clock and Reset Unit
-The RK1108 clock controller generates and supplies clock to various
+The RV1108 clock controller generates and supplies clock to various
controllers within the SoC and also implements a reset controller for SoC
peripherals.
Required Properties:
-- compatible: should be "rockchip,rk1108-cru"
+- compatible: should be "rockchip,rv1108-cru"
- reg: physical base address of the controller and length of memory mapped
region.
- #clock-cells: should be 1.
@@ -19,7 +19,7 @@ Optional Properties:
Each clock is assigned an identifier and client nodes can use this identifier
to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rk1108-cru.h headers and can be
+preprocessor macros in the dt-bindings/clock/rv1108-cru.h headers and can be
used in device tree sources. Similar macros exist for the reset sources in
these files.
@@ -38,7 +38,7 @@ clock-output-names:
Example: Clock controller node:
cru: cru@20200000 {
- compatible = "rockchip,rk1108-cru";
+ compatible = "rockchip,rv1108-cru";
reg = <0x20200000 0x1000>;
rockchip,grf = <&grf>;
@@ -50,7 +50,7 @@ Example: UART controller node that consumes the clock generated by the clock
controller:
uart0: serial@10230000 {
- compatible = "rockchip,rk1108-uart", "snps,dw-apb-uart";
+ compatible = "rockchip,rv1108-uart", "snps,dw-apb-uart";
reg = <0x10230000 0x100>;
interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>;
reg-shift = <2>;
diff --git a/Documentation/devicetree/bindings/clock/sunxi-ccu.txt b/Documentation/devicetree/bindings/clock/sunxi-ccu.txt
index bae5668cf427..e9c5a1d9834a 100644
--- a/Documentation/devicetree/bindings/clock/sunxi-ccu.txt
+++ b/Documentation/devicetree/bindings/clock/sunxi-ccu.txt
@@ -7,9 +7,12 @@ Required properties :
- "allwinner,sun8i-a23-ccu"
- "allwinner,sun8i-a33-ccu"
- "allwinner,sun8i-h3-ccu"
+ - "allwinner,sun8i-h3-r-ccu"
- "allwinner,sun8i-v3s-ccu"
- "allwinner,sun9i-a80-ccu"
- "allwinner,sun50i-a64-ccu"
+ - "allwinner,sun50i-a64-r-ccu"
+ - "allwinner,sun50i-h5-ccu"
- reg: Must contain the registers base address and length
- clocks: phandle to the oscillators feeding the CCU. Two are needed:
@@ -19,7 +22,10 @@ Required properties :
- #clock-cells : must contain 1
- #reset-cells : must contain 1
-Example:
+For the PRCM CCUs on H3/A64, one more clock is needed:
+- "iosc": the SoC's internal frequency oscillator
+
+Example for generic CCU:
ccu: clock@01c20000 {
compatible = "allwinner,sun8i-h3-ccu";
reg = <0x01c20000 0x400>;
@@ -28,3 +34,13 @@ ccu: clock@01c20000 {
#clock-cells = <1>;
#reset-cells = <1>;
};
+
+Example for PRCM CCU:
+r_ccu: clock@01f01400 {
+ compatible = "allwinner,sun50i-a64-r-ccu";
+ reg = <0x01f01400 0x100>;
+ clocks = <&osc24M>, <&osc32k>, <&iosc>;
+ clock-names = "hosc", "losc", "iosc";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+};
diff --git a/Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt b/Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt
index 7a5c0e204c8e..e5a8b363d829 100644
--- a/Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt
+++ b/Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt
@@ -13,6 +13,8 @@ Required nodes:
Additional, the display node has to define properties:
- bits-per-pixel: Bits per pixel
- fsl,pcr: LCDC PCR value
+ A display node may optionally define
+ - fsl,aus-mode: boolean to enable AUS mode (only for imx21)
Optional properties:
- lcd-supply: Regulator for LCD supply voltage.
diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.txt b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
index 6cdf32d037fc..8a6ffce12af5 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.txt
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.txt
@@ -60,6 +60,17 @@ conditions.
aliases of secure registers have to be used during
SMMU configuration.
+- stream-match-mask : For SMMUs supporting stream matching and using
+ #iommu-cells = <1>, specifies a mask of bits to ignore
+ when matching stream IDs (e.g. this may be programmed
+ into the SMRn.MASK field of every stream match register
+ used). For cases where it is desirable to ignore some
+ portion of every Stream ID (e.g. for certain MMU-500
+ configurations given globally unique input IDs). This
+ property is not valid for SMMUs using stream indexing,
+ or using stream matching with #iommu-cells = <2>, and
+ may be ignored if present in such cases.
+
** Deprecated properties:
- mmu-masters (deprecated in favour of the generic "iommus" binding) :
@@ -109,3 +120,20 @@ conditions.
master3 {
iommus = <&smmu2 1 0x30>;
};
+
+
+ /* ARM MMU-500 with 10-bit stream ID input configuration */
+ smmu3: iommu {
+ compatible = "arm,mmu-500", "arm,smmu-v2";
+ ...
+ #iommu-cells = <1>;
+ /* always ignore appended 5-bit TBU number */
+ stream-match-mask = 0x7c00;
+ };
+
+ bus {
+ /* bus whose child devices emit one unique 10-bit stream
+ ID each, but may master through multiple SMMU TBUs */
+ iommu-map = <0 &smmu3 0 0x400>;
+ ...
+ };
diff --git a/Documentation/devicetree/bindings/mtd/atmel-nand.txt b/Documentation/devicetree/bindings/mtd/atmel-nand.txt
index 3e7ee99d3949..f6bee57e453a 100644
--- a/Documentation/devicetree/bindings/mtd/atmel-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/atmel-nand.txt
@@ -1,4 +1,109 @@
-Atmel NAND flash
+Atmel NAND flash controller bindings
+
+The NAND flash controller node should be defined under the EBI bus (see
+Documentation/devicetree/bindings/memory-controllers/atmel,ebi.txt).
+One or several NAND devices can be defined under this NAND controller.
+The NAND controller might be connected to an ECC engine.
+
+* NAND controller bindings:
+
+Required properties:
+- compatible: should be one of the following
+ "atmel,at91rm9200-nand-controller"
+ "atmel,at91sam9260-nand-controller"
+ "atmel,at91sam9261-nand-controller"
+ "atmel,at91sam9g45-nand-controller"
+ "atmel,sama5d3-nand-controller"
+- ranges: empty ranges property to forward EBI ranges definitions.
+- #address-cells: should be set to 2.
+- #size-cells: should be set to 1.
+- atmel,nfc-io: phandle to the NFC IO block. Only required for sama5d3
+ controllers.
+- atmel,nfc-sram: phandle to the NFC SRAM block. Only required for sama5d3
+ controllers.
+
+Optional properties:
+- ecc-engine: phandle to the PMECC block. Only meaningful if the SoC embeds
+ a PMECC engine.
+
+* NAND device/chip bindings:
+
+Required properties:
+- reg: describes the CS lines assigned to the NAND device. If the NAND device
+ exposes multiple CS lines (multi-dies chips), your reg property will
+ contain X tuples of 3 entries.
+ 1st entry: the CS line this NAND chip is connected to
+ 2nd entry: the base offset of the memory region assigned to this
+ device (always 0)
+ 3rd entry: the memory region size (always 0x800000)
+
+Optional properties:
+- rb-gpios: the GPIO(s) used to check the Ready/Busy status of the NAND.
+- cs-gpios: the GPIO(s) used to control the CS line.
+- det-gpios: the GPIO used to detect if a Smartmedia Card is present.
+- atmel,rb: an integer identifying the native Ready/Busy pin. Only meaningful
+ on sama5 SoCs.
+
+All generic properties described in
+Documentation/devicetree/bindings/mtd/{common,nand}.txt also apply to the NAND
+device node, and NAND partitions should be defined under the NAND node as
+described in Documentation/devicetree/bindings/mtd/partition.txt.
+
+* ECC engine (PMECC) bindings:
+
+Required properties:
+- compatible: should be one of the following
+ "atmel,at91sam9g45-pmecc"
+ "atmel,sama5d4-pmecc"
+ "atmel,sama5d2-pmecc"
+- reg: should contain 2 register ranges. The first one is pointing to the PMECC
+ block, and the second one to the PMECC_ERRLOC block.
+
+Example:
+
+ pmecc: ecc-engine@ffffc070 {
+ compatible = "atmel,at91sam9g45-pmecc";
+ reg = <0xffffc070 0x490>,
+ <0xffffc500 0x100>;
+ };
+
+ ebi: ebi@10000000 {
+ compatible = "atmel,sama5d3-ebi";
+ #address-cells = <2>;
+ #size-cells = <1>;
+ atmel,smc = <&hsmc>;
+ reg = <0x10000000 0x10000000
+ 0x40000000 0x30000000>;
+ ranges = <0x0 0x0 0x10000000 0x10000000
+ 0x1 0x0 0x40000000 0x10000000
+ 0x2 0x0 0x50000000 0x10000000
+ 0x3 0x0 0x60000000 0x10000000>;
+ clocks = <&mck>;
+
+ nand_controller: nand-controller {
+ compatible = "atmel,sama5d3-nand-controller";
+ atmel,nfc-sram = <&nfc_sram>;
+ atmel,nfc-io = <&nfc_io>;
+ ecc-engine = <&pmecc>;
+ #address-cells = <2>;
+ #size-cells = <1>;
+ ranges;
+
+ nand@3 {
+ reg = <0x3 0x0 0x800000>;
+ atmel,rb = <0>;
+
+ /*
+ * Put generic NAND/MTD properties and
+ * subnodes here.
+ */
+ };
+ };
+ };
+
+-----------------------------------------------------------------------
+
+Deprecated bindings (should not be used in new device trees):
Required properties:
- compatible: The possible values are:
diff --git a/Documentation/devicetree/bindings/mtd/denali-nand.txt b/Documentation/devicetree/bindings/mtd/denali-nand.txt
index b04d03a1d499..e593bbeb2115 100644
--- a/Documentation/devicetree/bindings/mtd/denali-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/denali-nand.txt
@@ -1,11 +1,11 @@
* Denali NAND controller
Required properties:
- - compatible : should be "denali,denali-nand-dt"
+ - compatible : should be one of the following:
+ "altr,socfpga-denali-nand" - for Altera SOCFPGA
- reg : should contain registers location and length for data and reg.
- reg-names: Should contain the reg names "nand_data" and "denali_reg"
- interrupts : The interrupt number.
- - dm-mask : DMA bit mask
The device tree may optionally contain sub-nodes describing partitions of the
address space. See partition.txt for more detail.
@@ -15,9 +15,8 @@ Examples:
nand: nand@ff900000 {
#address-cells = <1>;
#size-cells = <1>;
- compatible = "denali,denali-nand-dt";
+ compatible = "altr,socfpga-denali-nand";
reg = <0xff900000 0x100000>, <0xffb80000 0x10000>;
reg-names = "nand_data", "denali_reg";
interrupts = <0 144 4>;
- dma-mask = <0xffffffff>;
};
diff --git a/Documentation/devicetree/bindings/mtd/gpio-control-nand.txt b/Documentation/devicetree/bindings/mtd/gpio-control-nand.txt
index af8915b41ccf..486a17d533d7 100644
--- a/Documentation/devicetree/bindings/mtd/gpio-control-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/gpio-control-nand.txt
@@ -12,7 +12,7 @@ Required properties:
- #address-cells, #size-cells : Must be present if the device has sub-nodes
representing partitions.
- gpios : Specifies the GPIO pins to control the NAND device. The order of
- GPIO references is: RDY, nCE, ALE, CLE, and an optional nWP.
+ GPIO references is: RDY, nCE, ALE, CLE, and nWP. nCE and nWP are optional.
Optional properties:
- bank-width : Width (in bytes) of the device. If not present, the width
@@ -36,7 +36,7 @@ gpio-nand@1,0 {
#address-cells = <1>;
#size-cells = <1>;
gpios = <&banka 1 0>, /* RDY */
- <&banka 2 0>, /* nCE */
+ <0>, /* nCE */
<&banka 3 0>, /* ALE */
<&banka 4 0>, /* CLE */
<0>; /* nWP */
diff --git a/Documentation/devicetree/bindings/mtd/stm32-quadspi.txt b/Documentation/devicetree/bindings/mtd/stm32-quadspi.txt
new file mode 100644
index 000000000000..ddd18c135148
--- /dev/null
+++ b/Documentation/devicetree/bindings/mtd/stm32-quadspi.txt
@@ -0,0 +1,43 @@
+* STMicroelectronics Quad Serial Peripheral Interface(QuadSPI)
+
+Required properties:
+- compatible: should be "st,stm32f469-qspi"
+- reg: the first contains the register location and length.
+ the second contains the memory mapping address and length
+- reg-names: should contain the reg names "qspi" "qspi_mm"
+- interrupts: should contain the interrupt for the device
+- clocks: the phandle of the clock needed by the QSPI controller
+- A pinctrl must be defined to set pins in mode of operation for QSPI transfer
+
+Optional properties:
+- resets: must contain the phandle to the reset controller.
+
+A spi flash must be a child of the nor_flash node and could have some
+properties. Also see jedec,spi-nor.txt.
+
+Required properties:
+- reg: chip-Select number (QSPI controller may connect 2 nor flashes)
+- spi-max-frequency: max frequency of spi bus
+
+Optional property:
+- spi-rx-bus-width: see ../spi/spi-bus.txt for the description
+
+Example:
+
+qspi: spi@a0001000 {
+ compatible = "st,stm32f469-qspi";
+ reg = <0xa0001000 0x1000>, <0x90000000 0x10000000>;
+ reg-names = "qspi", "qspi_mm";
+ interrupts = <91>;
+ resets = <&rcc STM32F4_AHB3_RESET(QSPI)>;
+ clocks = <&rcc 0 STM32F4_AHB3_CLOCK(QSPI)>;
+ pinctrl-names = "default";
+ pinctrl-0 = <&pinctrl_qspi0>;
+
+ flash@0 {
+ reg = <0>;
+ spi-rx-bus-width = <4>;
+ spi-max-frequency = <108000000>;
+ ...
+ };
+};
diff --git a/Documentation/devicetree/bindings/power/power_domain.txt b/Documentation/devicetree/bindings/power/power_domain.txt
index 940707d095cc..14bd9e945ff6 100644
--- a/Documentation/devicetree/bindings/power/power_domain.txt
+++ b/Documentation/devicetree/bindings/power/power_domain.txt
@@ -81,7 +81,7 @@ Example 3:
child: power-controller@12341000 {
compatible = "foo,power-controller";
reg = <0x12341000 0x1000>;
- power-domains = <&parent 0>;
+ power-domains = <&parent>;
#power-domain-cells = <0>;
domain-idle-states = <&DOMAIN_PWR_DN>;
};
diff --git a/Documentation/devicetree/bindings/power/supply/axp20x_battery.txt b/Documentation/devicetree/bindings/power/supply/axp20x_battery.txt
new file mode 100644
index 000000000000..c24886676a60
--- /dev/null
+++ b/Documentation/devicetree/bindings/power/supply/axp20x_battery.txt
@@ -0,0 +1,20 @@
+AXP20x and AXP22x battery power supply
+
+Required Properties:
+ - compatible, one of:
+ "x-powers,axp209-battery-power-supply"
+ "x-powers,axp221-battery-power-supply"
+
+This node is a subnode of the axp20x/axp22x PMIC.
+
+The AXP20X and AXP22X can read the battery voltage, charge and discharge
+currents of the battery by reading ADC channels from the AXP20X/AXP22X
+ADC.
+
+Example:
+
+&axp209 {
+ battery_power_supply: battery-power-supply {
+ compatible = "x-powers,axp209-battery-power-supply";
+ }
+};
diff --git a/Documentation/devicetree/bindings/powerpc/ibm,powerpc-cpu-features.txt b/Documentation/devicetree/bindings/powerpc/ibm,powerpc-cpu-features.txt
new file mode 100644
index 000000000000..5af426e13334
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/ibm,powerpc-cpu-features.txt
@@ -0,0 +1,248 @@
+*** NOTE ***
+This document is copied from OPAL firmware
+(skiboot/doc/device-tree/ibm,powerpc-cpu-features/binding.txt)
+
+There is more complete overview and documentation of features in that
+source tree. All patches and modifications should go there.
+************
+
+ibm,powerpc-cpu-features binding
+================================
+
+This device tree binding describes CPU features available to software, with
+enablement, privilege, and compatibility metadata.
+
+More general description of design and implementation of this binding is
+found in design.txt, which also points to documentation of specific features.
+
+
+/cpus/ibm,powerpc-cpu-features node binding
+-------------------------------------------
+
+Node: ibm,powerpc-cpu-features
+
+Description: Container of CPU feature nodes.
+
+The node name must be "ibm,powerpc-cpu-features".
+
+It is implemented as a child of the node "/cpus", but this must not be
+assumed by parsers.
+
+The node is optional but should be provided by new OPAL firmware.
+
+Properties:
+
+- compatible
+ Usage: required
+ Value type: string
+ Definition: "ibm,powerpc-cpu-features"
+
+ This compatibility refers to backwards compatibility of the overall
+ design with parsers that behave according to these guidelines. This can
+ be extended in a backward compatible manner which would not warrant a
+ revision of the compatible property.
+
+- isa
+ Usage: required
+ Value type: <u32>
+ Definition:
+
+ isa that the CPU is currently running in. This provides instruction set
+ compatibility, less the individual feature nodes. For example, an ISA v3.0
+ implementation that lacks the "transactional-memory" cpufeature node
+ should not use transactional memory facilities.
+
+ Value corresponds to the "Power ISA Version" multiplied by 1000.
+ For example, <3000> corresponds to Version 3.0, <2070> to Version 2.07.
+ The minor digit is available for revisions.
+
+- display-name
+ Usage: optional
+ Value type: string
+ Definition:
+
+ A human readable name for the CPU.
+
+/cpus/ibm,powerpc-cpu-features/example-feature node bindings
+----------------------------------------------------------------
+
+Each child node of cpu-features represents a CPU feature / capability.
+
+Node: A string describing an architected CPU feature, e.g., "floating-point".
+
+Description: A feature or capability supported by the CPUs.
+
+The name of the node is a human readable string that forms the interface
+used to describe features to software. Features are currently documented
+in the code where they are implemented in skiboot/core/cpufeatures.c
+
+Presence of the node indicates the feature is available.
+
+Properties:
+
+- isa
+ Usage: required
+ Value type: <u32>
+ Definition:
+
+ First level of the Power ISA that the feature appears in.
+ Software should filter out features when constraining the
+ environment to a particular ISA version.
+
+ Value is defined similarly to /cpus/features/isa
+
+- usable-privilege
+ Usage: required
+ Value type: <u32> bit mask
+ Definition:
+ Bit numbers are LSB0
+ bit 0 - PR (problem state / user mode)
+ bit 1 - OS (privileged state)
+ bit 2 - HV (hypervisor state)
+ All other bits reserved and should be zero.
+
+ This property describes the privilege levels and/or software components
+ that can use the feature.
+
+ If bit 0 is set, then the hwcap-bit-nr property will exist.
+
+
+- hv-support
+ Usage: optional
+ Value type: <u32> bit mask
+ Definition:
+ Bit numbers are LSB0
+ bit 0 - HFSCR
+ All other bits reserved and should be zero.
+
+ This property describes the HV privilege support required to enable the
+ feature to lesser privilege levels. If the property does not exist then no
+ support is required.
+
+ If no bits are set, the hypervisor must have explicit/custom support for
+ this feature.
+
+ If the HFSCR bit is set, then the hfscr-bit-nr property will exist and
+ the feature may be enabled by setting this bit in the HFSCR register.
+
+
+- os-support
+ Usage: optional
+ Value type: <u32> bit mask
+ Definition:
+ Bit numbers are LSB0
+ bit 0 - FSCR
+ All other bits reserved and should be zero.
+
+ This property describes the OS privilege support required to enable the
+ feature to lesser privilege levels. If the property does not exist then no
+ support is required.
+
+ If no bits are set, the operating system must have explicit/custom support
+ for this feature.
+
+ If the FSCR bit is set, then the fscr-bit-nr property will exist and
+ the feature may be enabled by setting this bit in the FSCR register.
+
+
+- hfscr-bit-nr
+ Usage: optional
+ Value type: <u32>
+ Definition: HFSCR bit position (LSB0)
+
+ This property exists when the hv-support property HFSCR bit is set. This
+ property describes the bit number in the HFSCR register that the
+ hypervisor must set in order to enable this feature.
+
+ This property also exists if an HFSCR bit corresponds with this feature.
+ This makes CPU feature parsing slightly simpler.
+
+
+- fscr-bit-nr
+ Usage: optional
+ Value type: <u32>
+ Definition: FSCR bit position (LSB0)
+
+ This property exists when the os-support property FSCR bit is set. This
+ property describes the bit number in the FSCR register that the
+ operating system must set in order to enable this feature.
+
+ This property also exists if an FSCR bit corresponds with this feature.
+ This makes CPU feature parsing slightly simpler.
+
+
+- hwcap-bit-nr
+ Usage: optional
+ Value type: <u32>
+ Definition: Linux ELF AUX vector bit position (LSB0)
+
+ This property may exist when the usable-privilege property value has PR bit set.
+ This property describes the bit number that should be set in the ELF AUX
+ hardware capability vectors in order to advertise this feature to userspace.
+ Bits 0-31 correspond to bits 0-31 in AT_HWCAP vector. Bits 32-63 correspond
+ to 0-31 in AT_HWCAP2 vector, and so on. Missing AT_HWCAPx vectors implies
+ that the feature is not enabled or can not be advertised. Operating systems
+ may provide a number of unassigned hardware capability bits to allow for new
+ features to be advertised.
+
+ Some properties representing features created before this binding are
+ advertised to userspace without a one-to-one hwcap bit number may not specify
+ this bit. Operating system will handle those bits specifically. All new
+ features usable by userspace will have a hwcap-bit-nr property.
+
+
+- dependencies
+ Usage: optional
+ Value type: <prop-encoded-array>
+ Definition:
+
+ If this property exists then it is a list of phandles to cpu feature
+ nodes that must be enabled for this feature to be enabled.
+
+
+Example
+-------
+
+ /cpus/ibm,powerpc-cpu-features {
+ compatible = "ibm,powerpc-cpu-features";
+
+ isa = <3020>;
+
+ darn {
+ isa = <3000>;
+ usable-privilege = <1 | 2 | 4>;
+ hwcap-bit-nr = <xx>;
+ };
+
+ scv {
+ isa = <3000>;
+ usable-privilege = <1 | 2>;
+ os-support = <0>;
+ hwcap-bit-nr = <xx>;
+ };
+
+ stop {
+ isa = <3000>;
+ usable-privilege = <2 | 4>;
+ hv-support = <0>;
+ os-support = <0>;
+ };
+
+ vsx2 (hypothetical) {
+ isa = <3010>;
+ usable-privilege = <1 | 2 | 4>;
+ hv-support = <0>;
+ os-support = <0>;
+ hwcap-bit-nr = <xx>;
+ };
+
+ vsx2-newinsns {
+ isa = <3020>;
+ usable-privilege = <1 | 2 | 4>;
+ os-support = <1>;
+ fscr-bit-nr = <xx>;
+ hwcap-bit-nr = <xx>;
+ dependencies = <&vsx2>;
+ };
+
+ };
diff --git a/Documentation/devicetree/bindings/pwm/atmel-pwm.txt b/Documentation/devicetree/bindings/pwm/atmel-pwm.txt
index 02331b904d4e..c8c831d7b0d1 100644
--- a/Documentation/devicetree/bindings/pwm/atmel-pwm.txt
+++ b/Documentation/devicetree/bindings/pwm/atmel-pwm.txt
@@ -4,6 +4,7 @@ Required properties:
- compatible: should be one of:
- "atmel,at91sam9rl-pwm"
- "atmel,sama5d3-pwm"
+ - "atmel,sama5d2-pwm"
- reg: physical base address and length of the controller's registers
- #pwm-cells: Should be 3. See pwm.txt in this directory for a
description of the cells format.
diff --git a/Documentation/devicetree/bindings/pwm/nvidia,tegra20-pwm.txt b/Documentation/devicetree/bindings/pwm/nvidia,tegra20-pwm.txt
index b4e73778dda3..c57e11b8d937 100644
--- a/Documentation/devicetree/bindings/pwm/nvidia,tegra20-pwm.txt
+++ b/Documentation/devicetree/bindings/pwm/nvidia,tegra20-pwm.txt
@@ -19,6 +19,19 @@ Required properties:
- reset-names: Must include the following entries:
- pwm
+Optional properties:
+============================
+In some of the interface like PWM based regulator device, it is required
+to configure the pins differently in different states, especially in suspend
+state of the system. The configuration of pin is provided via the pinctrl
+DT node as detailed in the pinctrl DT binding document
+ Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt
+
+The PWM node will have following optional properties.
+pinctrl-names: Pin state names. Must be "default" and "sleep".
+pinctrl-0: phandle for the default/active state of pin configurations.
+pinctrl-1: phandle for the sleep state of pin configurations.
+
Example:
pwm: pwm@7000a000 {
@@ -29,3 +42,35 @@ Example:
resets = <&tegra_car 17>;
reset-names = "pwm";
};
+
+
+Example with the pin configuration for suspend and resume:
+=========================================================
+Suppose pin PE7 (On Tegra210) interfaced with the regulator device and
+it requires PWM output to be tristated when system enters suspend.
+Following will be DT binding to achieve this:
+
+#include <dt-bindings/pinctrl/pinctrl-tegra.h>
+
+ pinmux@700008d4 {
+ pwm_active_state: pwm_active_state {
+ pe7 {
+ nvidia,pins = "pe7";
+ nvidia,tristate = <TEGRA_PIN_DISABLE>;
+ };
+ };
+
+ pwm_sleep_state: pwm_sleep_state {
+ pe7 {
+ nvidia,pins = "pe7";
+ nvidia,tristate = <TEGRA_PIN_ENABLE>;
+ };
+ };
+ };
+
+ pwm@7000a000 {
+ /* Mandatory PWM properties */
+ pinctrl-names = "default", "sleep";
+ pinctrl-0 = <&pwm_active_state>;
+ pinctrl-1 = <&pwm_sleep_state>;
+ };
diff --git a/Documentation/devicetree/bindings/pwm/pwm-mediatek.txt b/Documentation/devicetree/bindings/pwm/pwm-mediatek.txt
new file mode 100644
index 000000000000..54c59b0560ad
--- /dev/null
+++ b/Documentation/devicetree/bindings/pwm/pwm-mediatek.txt
@@ -0,0 +1,34 @@
+MediaTek PWM controller
+
+Required properties:
+ - compatible: should be "mediatek,<name>-pwm":
+ - "mediatek,mt7623-pwm": found on mt7623 SoC.
+ - reg: physical base address and length of the controller's registers.
+ - #pwm-cells: must be 2. See pwm.txt in this directory for a description of
+ the cell format.
+ - clocks: phandle and clock specifier of the PWM reference clock.
+ - clock-names: must contain the following:
+ - "top": the top clock generator
+ - "main": clock used by the PWM core
+ - "pwm1-5": the five per PWM clocks
+ - pinctrl-names: Must contain a "default" entry.
+ - pinctrl-0: One property must exist for each entry in pinctrl-names.
+ See pinctrl/pinctrl-bindings.txt for details of the property values.
+
+Example:
+ pwm0: pwm@11006000 {
+ compatible = "mediatek,mt7623-pwm";
+ reg = <0 0x11006000 0 0x1000>;
+ #pwm-cells = <2>;
+ clocks = <&topckgen CLK_TOP_PWM_SEL>,
+ <&pericfg CLK_PERI_PWM>,
+ <&pericfg CLK_PERI_PWM1>,
+ <&pericfg CLK_PERI_PWM2>,
+ <&pericfg CLK_PERI_PWM3>,
+ <&pericfg CLK_PERI_PWM4>,
+ <&pericfg CLK_PERI_PWM5>;
+ clock-names = "top", "main", "pwm1", "pwm2",
+ "pwm3", "pwm4", "pwm5";
+ pinctrl-names = "default";
+ pinctrl-0 = <&pwm0_pins>;
+ };
diff --git a/Documentation/devicetree/bindings/rtc/cpcap-rtc.txt b/Documentation/devicetree/bindings/rtc/cpcap-rtc.txt
new file mode 100644
index 000000000000..45750ff3112d
--- /dev/null
+++ b/Documentation/devicetree/bindings/rtc/cpcap-rtc.txt
@@ -0,0 +1,18 @@
+Motorola CPCAP PMIC RTC
+-----------------------
+
+This module is part of the CPCAP. For more details about the whole
+chip see Documentation/devicetree/bindings/mfd/motorola-cpcap.txt.
+
+Requires node properties:
+- compatible: should contain "motorola,cpcap-rtc"
+- interrupts: An interrupt specifier for alarm and 1 Hz irq
+
+Example:
+
+&cpcap {
+ cpcap_rtc: rtc {
+ compatible = "motorola,cpcap-rtc";
+ interrupts = <39 IRQ_TYPE_NONE>, <26 IRQ_TYPE_NONE>;
+ };
+};
diff --git a/Documentation/devicetree/bindings/rtc/rtc-sh.txt b/Documentation/devicetree/bindings/rtc/rtc-sh.txt
new file mode 100644
index 000000000000..7676c7d28874
--- /dev/null
+++ b/Documentation/devicetree/bindings/rtc/rtc-sh.txt
@@ -0,0 +1,28 @@
+* Real Time Clock for Renesas SH and ARM SoCs
+
+Required properties:
+- compatible: Should be "renesas,r7s72100-rtc" and "renesas,sh-rtc" as a
+ fallback.
+- reg: physical base address and length of memory mapped region.
+- interrupts: 3 interrupts for alarm, period, and carry.
+- interrupt-names: The interrupts should be labeled as "alarm", "period", and
+ "carry".
+- clocks: The functional clock source for the RTC controller must be listed
+ first (if exists). Additionally, potential clock counting sources are to be
+ listed.
+- clock-names: The functional clock must be labeled as "fck". Other clocks
+ may be named in accordance to the SoC hardware manuals.
+
+
+Example:
+rtc: rtc@fcff1000 {
+ compatible = "renesas,r7s72100-rtc", "renesas,sh-rtc";
+ reg = <0xfcff1000 0x2e>;
+ interrupts = <GIC_SPI 276 IRQ_TYPE_EDGE_RISING
+ GIC_SPI 277 IRQ_TYPE_EDGE_RISING
+ GIC_SPI 278 IRQ_TYPE_EDGE_RISING>;
+ interrupt-names = "alarm", "period", "carry";
+ clocks = <&mstp6_clks R7S72100_CLK_RTC>, <&rtc_x1_clk>,
+ <&rtc_x3_clk>, <&extal_clk>;
+ clock-names = "fck", "rtc_x1", "rtc_x3", "extal";
+};
diff --git a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/gpio.txt b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/gpio.txt
index 349f79fd7076..626e1afa64a6 100644
--- a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/gpio.txt
+++ b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/gpio.txt
@@ -13,8 +13,17 @@ Required properties:
- #gpio-cells : Should be two. The first cell is the pin number and the
second cell is used to specify optional parameters (currently unused).
- gpio-controller : Marks the port as GPIO controller.
+Optional properties:
+- fsl,cpm1-gpio-irq-mask : For banks having interrupt capability (like port C
+ on CPM1), this item tells which ports have an associated interrupt (ports are
+ listed in the same order as in PCINT register)
+- interrupts : This property provides the list of interrupt for each GPIO having
+ one as described by the fsl,cpm1-gpio-irq-mask property. There should be as
+ many interrupts as number of ones in the mask property. The first interrupt in
+ the list corresponds to the most significant bit of the mask.
+- interrupt-parent : Parent for the above interrupt property.
-Example of three SOC GPIO banks defined as gpio-controller nodes:
+Example of four SOC GPIO banks defined as gpio-controller nodes:
CPM1_PIO_A: gpio-controller@950 {
#gpio-cells = <2>;
@@ -30,6 +39,16 @@ Example of three SOC GPIO banks defined as gpio-controller nodes:
gpio-controller;
};
+ CPM1_PIO_C: gpio-controller@960 {
+ #gpio-cells = <2>;
+ compatible = "fsl,cpm1-pario-bank-c";
+ reg = <0x960 0x10>;
+ fsl,cpm1-gpio-irq-mask = <0x0fff>;
+ interrupts = <1 2 6 9 10 11 14 15 23 24 26 31>;
+ interrupt-parent = <&CPM_PIC>;
+ gpio-controller;
+ };
+
CPM1_PIO_E: gpio-controller@ac8 {
#gpio-cells = <2>;
compatible = "fsl,cpm1-pario-bank-e";
diff --git a/Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.txt b/Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.txt
index 474531d2b2c5..da8c5b73ad10 100644
--- a/Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.txt
+++ b/Documentation/devicetree/bindings/thermal/brcm,bcm2835-thermal.txt
@@ -3,15 +3,39 @@ Binding for Thermal Sensor driver for BCM2835 SoCs.
Required parameters:
-------------------
-compatible: should be one of: "brcm,bcm2835-thermal",
- "brcm,bcm2836-thermal" or "brcm,bcm2837-thermal"
-reg: Address range of the thermal registers.
-clocks: Phandle of the clock used by the thermal sensor.
+compatible: should be one of: "brcm,bcm2835-thermal",
+ "brcm,bcm2836-thermal" or "brcm,bcm2837-thermal"
+reg: Address range of the thermal registers.
+clocks: Phandle of the clock used by the thermal sensor.
+#thermal-sensor-cells: should be 0 (see thermal.txt)
Example:
+thermal-zones {
+ cpu_thermal: cpu-thermal {
+ polling-delay-passive = <0>;
+ polling-delay = <1000>;
+
+ thermal-sensors = <&thermal>;
+
+ trips {
+ cpu-crit {
+ temperature = <80000>;
+ hysteresis = <0>;
+ type = "critical";
+ };
+ };
+
+ coefficients = <(-538) 407000>;
+
+ cooling-maps {
+ };
+ };
+};
+
thermal: thermal@7e212000 {
compatible = "brcm,bcm2835-thermal";
reg = <0x7e212000 0x8>;
clocks = <&clocks BCM2835_CLOCK_TSENS>;
+ #thermal-sensor-cells = <0>;
};
diff --git a/Documentation/devicetree/bindings/thermal/brcm,ns-thermal b/Documentation/devicetree/bindings/thermal/brcm,ns-thermal
new file mode 100644
index 000000000000..68e047170039
--- /dev/null
+++ b/Documentation/devicetree/bindings/thermal/brcm,ns-thermal
@@ -0,0 +1,37 @@
+* Broadcom Northstar Thermal
+
+This binding describes thermal sensor that is part of Northstar's DMU (Device
+Management Unit).
+
+Required properties:
+- compatible : Must be "brcm,ns-thermal"
+- reg : iomem address range of PVTMON registers
+- #thermal-sensor-cells : Should be <0>
+
+Example:
+
+thermal: thermal@1800c2c0 {
+ compatible = "brcm,ns-thermal";
+ reg = <0x1800c2c0 0x10>;
+ #thermal-sensor-cells = <0>;
+};
+
+thermal-zones {
+ cpu_thermal: cpu-thermal {
+ polling-delay-passive = <0>;
+ polling-delay = <1000>;
+ coefficients = <(-556) 418000>;
+ thermal-sensors = <&thermal>;
+
+ trips {
+ cpu-crit {
+ temperature = <125000>;
+ hysteresis = <0>;
+ type = "critical";
+ };
+ };
+
+ cooling-maps {
+ };
+ };
+};
diff --git a/Documentation/devicetree/bindings/thermal/da9062-thermal.txt b/Documentation/devicetree/bindings/thermal/da9062-thermal.txt
new file mode 100644
index 000000000000..e241bb5a5584
--- /dev/null
+++ b/Documentation/devicetree/bindings/thermal/da9062-thermal.txt
@@ -0,0 +1,36 @@
+* Dialog DA9062/61 TJUNC Thermal Module
+
+This module is part of the DA9061/DA9062. For more details about entire
+DA9062 and DA9061 chips see Documentation/devicetree/bindings/mfd/da9062.txt
+
+Junction temperature thermal module uses an interrupt signal to identify
+high THERMAL_TRIP_HOT temperatures for the PMIC device.
+
+Required properties:
+
+- compatible: should be one of the following valid compatible string lines:
+ "dlg,da9061-thermal", "dlg,da9062-thermal"
+ "dlg,da9062-thermal"
+
+Optional properties:
+
+- polling-delay-passive : Specify the polling period, measured in
+ milliseconds, between thermal zone device update checks.
+
+Example: DA9062
+
+ pmic0: da9062@58 {
+ thermal {
+ compatible = "dlg,da9062-thermal";
+ polling-delay-passive = <3000>;
+ };
+ };
+
+Example: DA9061 using a fall-back compatible for the DA9062 onkey driver
+
+ pmic0: da9061@58 {
+ thermal {
+ compatible = "dlg,da9061-thermal", "dlg,da9062-thermal";
+ polling-delay-passive = <3000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/trivial-devices.txt b/Documentation/devicetree/bindings/trivial-devices.txt
index ad10fbe61562..3e0a34c88e07 100644
--- a/Documentation/devicetree/bindings/trivial-devices.txt
+++ b/Documentation/devicetree/bindings/trivial-devices.txt
@@ -160,6 +160,7 @@ sii,s35390a 2-wire CMOS real-time clock
silabs,si7020 Relative Humidity and Temperature Sensors
skyworks,sky81452 Skyworks SKY81452: Six-Channel White LED Driver with Touch Panel Bias Supply
st,24c256 i2c serial eeprom (24cxx)
+st,m41t0 Serial real-time clock (RTC)
st,m41t00 Serial real-time clock (RTC)
st,m41t62 Serial real-time clock (RTC) with alarm
st,m41t80 M41T80 - SERIAL ACCESS RTC WITH ALARMS
diff --git a/Documentation/devicetree/bindings/usb/da8xx-usb.txt b/Documentation/devicetree/bindings/usb/da8xx-usb.txt
index ccb844aba7d4..717c5f656237 100644
--- a/Documentation/devicetree/bindings/usb/da8xx-usb.txt
+++ b/Documentation/devicetree/bindings/usb/da8xx-usb.txt
@@ -18,10 +18,26 @@ Required properties:
- phy-names: Should be "usb-phy"
+ - dmas: specifies the dma channels
+
+ - dma-names: specifies the names of the channels. Use "rxN" for receive
+ and "txN" for transmit endpoints. N specifies the endpoint number.
+
Optional properties:
~~~~~~~~~~~~~~~~~~~~
- vbus-supply: Phandle to a regulator providing the USB bus power.
+DMA
+~~~
+- compatible: ti,da830-cppi41
+- reg: offset and length of the following register spaces: CPPI DMA Controller,
+ CPPI DMA Scheduler, Queue Manager
+- reg-names: "controller", "scheduler", "queuemgr"
+- #dma-cells: should be set to 2. The first number represents the
+ channel number (0 … 3 for endpoints 1 … 4).
+ The second number is 0 for RX and 1 for TX transfers.
+- #dma-channels: should be set to 4 representing the 4 endpoints.
+
Example:
usb_phy: usb-phy {
compatible = "ti,da830-usb-phy";
@@ -30,7 +46,10 @@ Example:
};
usb0: usb@200000 {
compatible = "ti,da830-musb";
- reg = <0x00200000 0x10000>;
+ reg = <0x00200000 0x1000>;
+ ranges;
+ #address-cells = <1>;
+ #size-cells = <1>;
interrupts = <58>;
interrupt-names = "mc";
@@ -39,5 +58,25 @@ Example:
phys = <&usb_phy 0>;
phy-names = "usb-phy";
+ dmas = <&cppi41dma 0 0 &cppi41dma 1 0
+ &cppi41dma 2 0 &cppi41dma 3 0
+ &cppi41dma 0 1 &cppi41dma 1 1
+ &cppi41dma 2 1 &cppi41dma 3 1>;
+ dma-names =
+ "rx1", "rx2", "rx3", "rx4",
+ "tx1", "tx2", "tx3", "tx4";
+
status = "okay";
+
+ cppi41dma: dma-controller@201000 {
+ compatible = "ti,da830-cppi41";
+ reg = <0x201000 0x1000
+ 0x202000 0x1000
+ 0x204000 0x4000>;
+ reg-names = "controller", "scheduler", "queuemgr";
+ interrupts = <58>;
+ #dma-cells = <2>;
+ #dma-channels = <4>;
+ };
+
};
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt
index f9fe94535b46..c03d20140366 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.txt
+++ b/Documentation/devicetree/bindings/vendor-prefixes.txt
@@ -173,6 +173,7 @@ lego LEGO Systems A/S
lenovo Lenovo Group Ltd.
lg LG Corporation
licheepi Lichee Pi
+linaro Linaro Limited
linux Linux-specific binding
lltc Linear Technology Corporation
lsi LSI Corp. (LSI Logic)
@@ -196,6 +197,7 @@ minix MINIX Technology Ltd.
miramems MiraMEMS Sensing Technology Co., Ltd.
mitsubishi Mitsubishi Electric Corporation
mosaixtech Mosaix Technologies, Inc.
+motorola Motorola, Inc.
moxa Moxa
mpl MPL AG
mqmaker mqmaker Inc.
diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt
index 78043d5a8fc3..843ce91a2e40 100644
--- a/Documentation/filesystems/bfs.txt
+++ b/Documentation/filesystems/bfs.txt
@@ -54,4 +54,4 @@ The first 4 bytes should be 0x1badface.
If you have any patches, questions or suggestions regarding this BFS
implementation please contact the author:
-Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
+Tigran Aivazian <aivazian.tigran@gmail.com>
diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt
index fe03d10bb79a..b86831acd583 100644
--- a/Documentation/filesystems/nfs/idmapper.txt
+++ b/Documentation/filesystems/nfs/idmapper.txt
@@ -55,7 +55,7 @@ request-key will find the first matching line and corresponding program. In
this case, /some/other/program will handle all uid lookups and
/usr/sbin/nfs.idmap will handle gid, user, and group lookups.
-See <file:Documentation/security/keys-request-key.txt> for more information
+See <file:Documentation/security/keys/request-key.rst> for more information
about the request-key function.
diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt
index 8de578a98222..80dc0bdc302a 100644
--- a/Documentation/filesystems/nfs/pnfs.txt
+++ b/Documentation/filesystems/nfs/pnfs.txt
@@ -64,46 +64,9 @@ table which are called by the nfs-client pnfs-core to implement the
different layout types.
Files-layout-driver code is in: fs/nfs/filelayout/.. directory
-Objects-layout-driver code is in: fs/nfs/objlayout/.. directory
Blocks-layout-driver code is in: fs/nfs/blocklayout/.. directory
Flexfiles-layout-driver code is in: fs/nfs/flexfilelayout/.. directory
-objects-layout setup
---------------------
-
-As part of the full STD implementation the objlayoutdriver.ko needs, at times,
-to automatically login to yet undiscovered iscsi/osd devices. For this the
-driver makes up-calles to a user-mode script called *osd_login*
-
-The path_name of the script to use is by default:
- /sbin/osd_login.
-This name can be overridden by the Kernel module parameter:
- objlayoutdriver.osd_login_prog
-
-If Kernel does not find the osd_login_prog path it will zero it out
-and will not attempt farther logins. An admin can then write new value
-to the objlayoutdriver.osd_login_prog Kernel parameter to re-enable it.
-
-The /sbin/osd_login is part of the nfs-utils package, and should usually
-be installed on distributions that support this Kernel version.
-
-The API to the login script is as follows:
- Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID>
- Options:
- -u target uri e.g. iscsi://<ip>:<port>
- (always exists)
- (More protocols can be defined in the future.
- The client does not interpret this string it is
- passed unchanged as received from the Server)
- -o osdname of the requested target OSD
- (Might be empty)
- (A string which denotes the OSD name, there is a
- limit of 64 chars on this string)
- -s systemid of the requested target OSD
- (Might be empty)
- (This string, if not empty is always an hex
- representation of the 20 bytes osd_system_id)
-
blocks-layout setup
-------------------
diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
index 634d03e20c2d..c9e884b52698 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.txt
@@ -21,12 +21,19 @@ from accessing the corresponding object from the original filesystem.
This is most obvious from the 'st_dev' field returned by stat(2).
While directories will report an st_dev from the overlay-filesystem,
-all non-directory objects will report an st_dev from the lower or
+non-directory objects may report an st_dev from the lower filesystem or
upper filesystem that is providing the object. Similarly st_ino will
only be unique when combined with st_dev, and both of these can change
over the lifetime of a non-directory object. Many applications and
tools ignore these values and will not be affected.
+In the special case of all overlay layers on the same underlying
+filesystem, all objects will report an st_dev from the overlay
+filesystem and st_ino from the underlying filesystem. This will
+make the overlay mount more compliant with filesystem scanners and
+overlay objects will be distinguishable from the corresponding
+objects in the original filesystem.
+
Upper and Lower
---------------
diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index eccb675a2852..1e9fcb4d0ec8 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -309,6 +309,7 @@ Code Seq#(hex) Include File Comments
0xA3 80-8F Port ACL in development:
<mailto:tlewis@mindspring.com>
0xA3 90-9F linux/dtlk.h
+0xA4 00-1F uapi/linux/tee.h Generic TEE subsystem
0xAA 00-3F linux/uapi/linux/userfaultfd.h
0xAB 00-1F linux/nbd.h
0xAC 00-1F linux/raw.h
diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index 9b9c4797fc55..659afd56ecdb 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
@@ -44,11 +44,11 @@ This document describes the Linux kernel Makefiles.
--- 6.11 Post-link pass
=== 7 Kbuild syntax for exported headers
- --- 7.1 header-y
+ --- 7.1 no-export-headers
--- 7.2 genhdr-y
- --- 7.3 destination-y
- --- 7.4 generic-y
- --- 7.5 generated-y
+ --- 7.3 generic-y
+ --- 7.4 generated-y
+ --- 7.5 mandatory-y
=== 8 Kbuild Variables
=== 9 Makefile language
@@ -1236,7 +1236,7 @@ When kbuild executes, the following steps are followed (roughly):
that may be shared between individual architectures.
The recommended approach how to use a generic header file is
to list the file in the Kbuild file.
- See "7.4 generic-y" for further info on syntax etc.
+ See "7.3 generic-y" for further info on syntax etc.
--- 6.11 Post-link pass
@@ -1263,53 +1263,32 @@ The pre-processing does:
- drop include of compiler.h
- drop all sections that are kernel internal (guarded by ifdef __KERNEL__)
-Each relevant directory contains a file name "Kbuild" which specifies the
-headers to be exported.
-See subsequent chapter for the syntax of the Kbuild file.
-
- --- 7.1 header-y
-
- header-y specifies header files to be exported.
-
- Example:
- #include/linux/Kbuild
- header-y += usb/
- header-y += aio_abi.h
+All headers under include/uapi/, include/generated/uapi/,
+arch/<arch>/include/uapi/ and arch/<arch>/include/generated/uapi/
+are exported.
- The convention is to list one file per line and
- preferably in alphabetic order.
+A Kbuild file may be defined under arch/<arch>/include/uapi/asm/ and
+arch/<arch>/include/asm/ to list asm files coming from asm-generic.
+See subsequent chapter for the syntax of the Kbuild file.
- header-y also specifies which subdirectories to visit.
- A subdirectory is identified by a trailing '/' which
- can be seen in the example above for the usb subdirectory.
+ --- 7.1 no-export-headers
- Subdirectories are visited before their parent directories.
+ no-export-headers is essentially used by include/uapi/linux/Kbuild to
+ avoid exporting specific headers (e.g. kvm.h) on architectures that do
+ not support it. It should be avoided as much as possible.
--- 7.2 genhdr-y
- genhdr-y specifies generated files to be exported.
- Generated files are special as they need to be looked
- up in another directory when doing 'make O=...' builds.
+ genhdr-y specifies asm files to be generated.
Example:
- #include/linux/Kbuild
- genhdr-y += version.h
+ #arch/x86/include/uapi/asm/Kbuild
+ genhdr-y += unistd_32.h
+ genhdr-y += unistd_64.h
+ genhdr-y += unistd_x32.h
- --- 7.3 destination-y
- When an architecture has a set of exported headers that needs to be
- exported to a different directory destination-y is used.
- destination-y specifies the destination directory for all exported
- headers in the file where it is present.
-
- Example:
- #arch/xtensa/platforms/s6105/include/platform/Kbuild
- destination-y := include/linux
-
- In the example above all exported headers in the Kbuild file
- will be located in the directory "include/linux" when exported.
-
- --- 7.4 generic-y
+ --- 7.3 generic-y
If an architecture uses a verbatim copy of a header from
include/asm-generic then this is listed in the file
@@ -1336,7 +1315,7 @@ See subsequent chapter for the syntax of the Kbuild file.
Example: termios.h
#include <asm-generic/termios.h>
- --- 7.5 generated-y
+ --- 7.4 generated-y
If an architecture generates other header files alongside generic-y
wrappers, and not included in genhdr-y, then generated-y specifies
@@ -1349,6 +1328,15 @@ See subsequent chapter for the syntax of the Kbuild file.
#arch/x86/include/asm/Kbuild
generated-y += syscalls_32.h
+ --- 7.5 mandatory-y
+
+ mandatory-y is essentially used by include/uapi/asm-generic/Kbuild.asm
+ to define the minimum set of headers that must be exported in
+ include/asm.
+
+ The convention is to list one subdir per line and
+ preferably in alphabetic order.
+
=== 8 Kbuild Variables
The top Makefile exports the following variables:
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index d323adcb7b88..732f10ea382e 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -768,7 +768,7 @@ equal to zero, in which case the compiler is within its rights to
transform the above code into the following:
q = READ_ONCE(a);
- WRITE_ONCE(b, 1);
+ WRITE_ONCE(b, 2);
do_something_else();
Given this transformation, the CPU is not required to respect the ordering
diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt
index d86adcdae420..eaa8f9a6fd5d 100644
--- a/Documentation/networking/dns_resolver.txt
+++ b/Documentation/networking/dns_resolver.txt
@@ -143,7 +143,7 @@ the key will be discarded and recreated when the data it holds has expired.
dns_query() returns a copy of the value attached to the key, or an error if
that is indicated instead.
-See <file:Documentation/security/keys-request-key.txt> for further
+See <file:Documentation/security/keys/request-key.rst> for further
information about request-key function.
diff --git a/Documentation/security/00-INDEX b/Documentation/security/00-INDEX
deleted file mode 100644
index 45c82fd3e9d3..000000000000
--- a/Documentation/security/00-INDEX
+++ /dev/null
@@ -1,26 +0,0 @@
-00-INDEX
- - this file.
-LSM.txt
- - description of the Linux Security Module framework.
-SELinux.txt
- - how to get started with the SELinux security enhancement.
-Smack.txt
- - documentation on the Smack Linux Security Module.
-Yama.txt
- - documentation on the Yama Linux Security Module.
-apparmor.txt
- - documentation on the AppArmor security extension.
-credentials.txt
- - documentation about credentials in Linux.
-keys-ecryptfs.txt
- - description of the encryption keys for the ecryptfs filesystem.
-keys-request-key.txt
- - description of the kernel key request service.
-keys-trusted-encrypted.txt
- - info on the Trusted and Encrypted keys in the kernel key ring service.
-keys.txt
- - description of the kernel key retention service.
-tomoyo.txt
- - documentation on the TOMOYO Linux Security Module.
-IMA-templates.txt
- - documentation on the template management mechanism for IMA.
diff --git a/Documentation/security/IMA-templates.txt b/Documentation/security/IMA-templates.rst
index 839b5dad9226..2cd0e273cc9a 100644
--- a/Documentation/security/IMA-templates.txt
+++ b/Documentation/security/IMA-templates.rst
@@ -1,9 +1,12 @@
- IMA Template Management Mechanism
+=================================
+IMA Template Management Mechanism
+=================================
-==== INTRODUCTION ====
+Introduction
+============
-The original 'ima' template is fixed length, containing the filedata hash
+The original ``ima`` template is fixed length, containing the filedata hash
and pathname. The filedata hash is limited to 20 bytes (md5/sha1).
The pathname is a null terminated string, limited to 255 characters.
To overcome these limitations and to add additional file metadata, it is
@@ -28,61 +31,64 @@ a new data type, developers define the field identifier and implement
two functions, init() and show(), respectively to generate and display
measurement entries. Defining a new template descriptor requires
specifying the template format (a string of field identifiers separated
-by the '|' character) through the 'ima_template_fmt' kernel command line
+by the ``|`` character) through the ``ima_template_fmt`` kernel command line
parameter. At boot time, IMA initializes the chosen template descriptor
by translating the format into an array of template fields structures taken
from the set of the supported ones.
-After the initialization step, IMA will call ima_alloc_init_template()
+After the initialization step, IMA will call ``ima_alloc_init_template()``
(new function defined within the patches for the new template management
mechanism) to generate a new measurement entry by using the template
descriptor chosen through the kernel configuration or through the newly
-introduced 'ima_template' and 'ima_template_fmt' kernel command line parameters.
+introduced ``ima_template`` and ``ima_template_fmt`` kernel command line parameters.
It is during this phase that the advantages of the new architecture are
clearly shown: the latter function will not contain specific code to handle
-a given template but, instead, it simply calls the init() method of the template
+a given template but, instead, it simply calls the ``init()`` method of the template
fields associated to the chosen template descriptor and store the result
(pointer to allocated data and data length) in the measurement entry structure.
The same mechanism is employed to display measurements entries.
-The functions ima[_ascii]_measurements_show() retrieve, for each entry,
+The functions ``ima[_ascii]_measurements_show()`` retrieve, for each entry,
the template descriptor used to produce that entry and call the show()
method for each item of the array of template fields structures.
-==== SUPPORTED TEMPLATE FIELDS AND DESCRIPTORS ====
+Supported Template Fields and Descriptors
+=========================================
In the following, there is the list of supported template fields
-('<identifier>': description), that can be used to define new template
+``('<identifier>': description)``, that can be used to define new template
descriptors by adding their identifier to the format string
(support for more data types will be added later):
- 'd': the digest of the event (i.e. the digest of a measured file),
- calculated with the SHA1 or MD5 hash algorithm;
+ calculated with the SHA1 or MD5 hash algorithm;
- 'n': the name of the event (i.e. the file name), with size up to 255 bytes;
- 'd-ng': the digest of the event, calculated with an arbitrary hash
- algorithm (field format: [<hash algo>:]digest, where the digest
- prefix is shown only if the hash algorithm is not SHA1 or MD5);
+ algorithm (field format: [<hash algo>:]digest, where the digest
+ prefix is shown only if the hash algorithm is not SHA1 or MD5);
- 'n-ng': the name of the event, without size limitations;
- 'sig': the file signature.
Below, there is the list of defined template descriptors:
- - "ima": its format is 'd|n';
- - "ima-ng" (default): its format is 'd-ng|n-ng';
- - "ima-sig": its format is 'd-ng|n-ng|sig'.
+ - "ima": its format is ``d|n``;
+ - "ima-ng" (default): its format is ``d-ng|n-ng``;
+ - "ima-sig": its format is ``d-ng|n-ng|sig``.
-==== USE ====
+
+Use
+===
To specify the template descriptor to be used to generate measurement entries,
currently the following methods are supported:
- select a template descriptor among those supported in the kernel
- configuration ('ima-ng' is the default choice);
+ configuration (``ima-ng`` is the default choice);
- specify a template descriptor name from the kernel command line through
- the 'ima_template=' parameter;
+ the ``ima_template=`` parameter;
- register a new template descriptor with custom format through the kernel
- command line parameter 'ima_template_fmt='.
+ command line parameter ``ima_template_fmt=``.
diff --git a/Documentation/security/LSM.rst b/Documentation/security/LSM.rst
new file mode 100644
index 000000000000..d75778b0fa10
--- /dev/null
+++ b/Documentation/security/LSM.rst
@@ -0,0 +1,14 @@
+=================================
+Linux Security Module Development
+=================================
+
+Based on https://lkml.org/lkml/2007/10/26/215,
+a new LSM is accepted into the kernel when its intent (a description of
+what it tries to protect against and in what cases one would expect to
+use it) has been appropriately documented in ``Documentation/security/LSM``.
+This allows an LSM's code to be easily compared to its goals, and so
+that end users and distros can make a more informed decision about which
+LSMs suit their requirements.
+
+For extensive documentation on the available LSM hook interfaces, please
+see ``include/linux/lsm_hooks.h``.
diff --git a/Documentation/security/conf.py b/Documentation/security/conf.py
deleted file mode 100644
index 472fc9a8eb67..000000000000
--- a/Documentation/security/conf.py
+++ /dev/null
@@ -1,8 +0,0 @@
-project = "The kernel security subsystem manual"
-
-tags.add("subproject")
-
-latex_documents = [
- ('index', 'security.tex', project,
- 'The kernel development community', 'manual'),
-]
diff --git a/Documentation/security/credentials.txt b/Documentation/security/credentials.rst
index 86257052e31a..038a7e19eff9 100644
--- a/Documentation/security/credentials.txt
+++ b/Documentation/security/credentials.rst
@@ -1,38 +1,18 @@
- ====================
- CREDENTIALS IN LINUX
- ====================
+====================
+Credentials in Linux
+====================
By: David Howells <dhowells@redhat.com>
-Contents:
-
- (*) Overview.
-
- (*) Types of credentials.
-
- (*) File markings.
-
- (*) Task credentials.
+.. contents:: :local:
- - Immutable credentials.
- - Accessing task credentials.
- - Accessing another task's credentials.
- - Altering credentials.
- - Managing credentials.
-
- (*) Open file credentials.
-
- (*) Overriding the VFS's use of credentials.
-
-
-========
-OVERVIEW
+Overview
========
There are several parts to the security check performed by Linux when one
object acts upon another:
- (1) Objects.
+ 1. Objects.
Objects are things in the system that may be acted upon directly by
userspace programs. Linux has a variety of actionable objects, including:
@@ -48,7 +28,7 @@ object acts upon another:
As a part of the description of all these objects there is a set of
credentials. What's in the set depends on the type of object.
- (2) Object ownership.
+ 2. Object ownership.
Amongst the credentials of most objects, there will be a subset that
indicates the ownership of that object. This is used for resource
@@ -57,7 +37,7 @@ object acts upon another:
In a standard UNIX filesystem, for instance, this will be defined by the
UID marked on the inode.
- (3) The objective context.
+ 3. The objective context.
Also amongst the credentials of those objects, there will be a subset that
indicates the 'objective context' of that object. This may or may not be
@@ -67,7 +47,7 @@ object acts upon another:
The objective context is used as part of the security calculation that is
carried out when an object is acted upon.
- (4) Subjects.
+ 4. Subjects.
A subject is an object that is acting upon another object.
@@ -77,10 +57,10 @@ object acts upon another:
Objects other than tasks may under some circumstances also be subjects.
For instance an open file may send SIGIO to a task using the UID and EUID
- given to it by a task that called fcntl(F_SETOWN) upon it. In this case,
+ given to it by a task that called ``fcntl(F_SETOWN)`` upon it. In this case,
the file struct will have a subjective context too.
- (5) The subjective context.
+ 5. The subjective context.
A subject has an additional interpretation of its credentials. A subset
of its credentials forms the 'subjective context'. The subjective context
@@ -92,7 +72,7 @@ object acts upon another:
from the real UID and GID that normally form the objective context of the
task.
- (6) Actions.
+ 6. Actions.
Linux has a number of actions available that a subject may perform upon an
object. The set of actions available depends on the nature of the subject
@@ -101,7 +81,7 @@ object acts upon another:
Actions include reading, writing, creating and deleting files; forking or
signalling and tracing tasks.
- (7) Rules, access control lists and security calculations.
+ 7. Rules, access control lists and security calculations.
When a subject acts upon an object, a security calculation is made. This
involves taking the subjective context, the objective context and the
@@ -111,7 +91,7 @@ object acts upon another:
There are two main sources of rules:
- (a) Discretionary access control (DAC):
+ a. Discretionary access control (DAC):
Sometimes the object will include sets of rules as part of its
description. This is an 'Access Control List' or 'ACL'. A Linux
@@ -127,7 +107,7 @@ object acts upon another:
A Linux file might also sport a POSIX ACL. This is a list of rules
that grants various permissions to arbitrary subjects.
- (b) Mandatory access control (MAC):
+ b. Mandatory access control (MAC):
The system as a whole may have one or more sets of rules that get
applied to all subjects and objects, regardless of their source.
@@ -139,65 +119,65 @@ object acts upon another:
that says that this action is either granted or denied.
-====================
-TYPES OF CREDENTIALS
+Types of Credentials
====================
The Linux kernel supports the following types of credentials:
- (1) Traditional UNIX credentials.
+ 1. Traditional UNIX credentials.
- Real User ID
- Real Group ID
+ - Real User ID
+ - Real Group ID
The UID and GID are carried by most, if not all, Linux objects, even if in
some cases it has to be invented (FAT or CIFS files for example, which are
derived from Windows). These (mostly) define the objective context of
that object, with tasks being slightly different in some cases.
- Effective, Saved and FS User ID
- Effective, Saved and FS Group ID
- Supplementary groups
+ - Effective, Saved and FS User ID
+ - Effective, Saved and FS Group ID
+ - Supplementary groups
These are additional credentials used by tasks only. Usually, an
EUID/EGID/GROUPS will be used as the subjective context, and real UID/GID
will be used as the objective. For tasks, it should be noted that this is
not always true.
- (2) Capabilities.
+ 2. Capabilities.
- Set of permitted capabilities
- Set of inheritable capabilities
- Set of effective capabilities
- Capability bounding set
+ - Set of permitted capabilities
+ - Set of inheritable capabilities
+ - Set of effective capabilities
+ - Capability bounding set
These are only carried by tasks. They indicate superior capabilities
granted piecemeal to a task that an ordinary task wouldn't otherwise have.
These are manipulated implicitly by changes to the traditional UNIX
- credentials, but can also be manipulated directly by the capset() system
- call.
+ credentials, but can also be manipulated directly by the ``capset()``
+ system call.
The permitted capabilities are those caps that the process might grant
- itself to its effective or permitted sets through capset(). This
+ itself to its effective or permitted sets through ``capset()``. This
inheritable set might also be so constrained.
The effective capabilities are the ones that a task is actually allowed to
make use of itself.
The inheritable capabilities are the ones that may get passed across
- execve().
+ ``execve()``.
The bounding set limits the capabilities that may be inherited across
- execve(), especially when a binary is executed that will execute as UID 0.
+ ``execve()``, especially when a binary is executed that will execute as
+ UID 0.
- (3) Secure management flags (securebits).
+ 3. Secure management flags (securebits).
These are only carried by tasks. These govern the way the above
credentials are manipulated and inherited over certain operations such as
execve(). They aren't used directly as objective or subjective
credentials.
- (4) Keys and keyrings.
+ 4. Keys and keyrings.
These are only carried by tasks. They carry and cache security tokens
that don't fit into the other standard UNIX credentials. They are for
@@ -218,7 +198,7 @@ The Linux kernel supports the following types of credentials:
For more information on using keys, see Documentation/security/keys.txt.
- (5) LSM
+ 5. LSM
The Linux Security Module allows extra controls to be placed over the
operations that a task may do. Currently Linux supports several LSM
@@ -228,7 +208,7 @@ The Linux kernel supports the following types of credentials:
rules (policies) that say what operations a task with one label may do to
an object with another label.
- (6) AF_KEY
+ 6. AF_KEY
This is a socket-based approach to credential management for networking
stacks [RFC 2367]. It isn't discussed by this document as it doesn't
@@ -244,25 +224,19 @@ network filesystem where the credentials of the opened file should be presented
to the server, regardless of who is actually doing a read or a write upon it.
-=============
-FILE MARKINGS
+File Markings
=============
Files on disk or obtained over the network may have annotations that form the
objective security context of that file. Depending on the type of filesystem,
this may include one or more of the following:
- (*) UNIX UID, GID, mode;
-
- (*) Windows user ID;
-
- (*) Access control list;
-
- (*) LSM security label;
-
- (*) UNIX exec privilege escalation bits (SUID/SGID);
-
- (*) File capabilities exec privilege escalation bits.
+ * UNIX UID, GID, mode;
+ * Windows user ID;
+ * Access control list;
+ * LSM security label;
+ * UNIX exec privilege escalation bits (SUID/SGID);
+ * File capabilities exec privilege escalation bits.
These are compared to the task's subjective security context, and certain
operations allowed or disallowed as a result. In the case of execve(), the
@@ -270,8 +244,7 @@ privilege escalation bits come into play, and may allow the resulting process
extra privileges, based on the annotations on the executable file.
-================
-TASK CREDENTIALS
+Task Credentials
================
In Linux, all of a task's credentials are held in (uid, gid) or through
@@ -282,20 +255,20 @@ task_struct.
Once a set of credentials has been prepared and committed, it may not be
changed, barring the following exceptions:
- (1) its reference count may be changed;
+ 1. its reference count may be changed;
- (2) the reference count on the group_info struct it points to may be changed;
+ 2. the reference count on the group_info struct it points to may be changed;
- (3) the reference count on the security data it points to may be changed;
+ 3. the reference count on the security data it points to may be changed;
- (4) the reference count on any keyrings it points to may be changed;
+ 4. the reference count on any keyrings it points to may be changed;
- (5) any keyrings it points to may be revoked, expired or have their security
- attributes changed; and
+ 5. any keyrings it points to may be revoked, expired or have their security
+ attributes changed; and
- (6) the contents of any keyrings to which it points may be changed (the whole
- point of keyrings being a shared set of credentials, modifiable by anyone
- with appropriate access).
+ 6. the contents of any keyrings to which it points may be changed (the whole
+ point of keyrings being a shared set of credentials, modifiable by anyone
+ with appropriate access).
To alter anything in the cred struct, the copy-and-replace principle must be
adhered to. First take a copy, then alter the copy and then use RCU to change
@@ -303,37 +276,37 @@ the task pointer to make it point to the new copy. There are wrappers to aid
with this (see below).
A task may only alter its _own_ credentials; it is no longer permitted for a
-task to alter another's credentials. This means the capset() system call is no
-longer permitted to take any PID other than the one of the current process.
-Also keyctl_instantiate() and keyctl_negate() functions no longer permit
-attachment to process-specific keyrings in the requesting process as the
-instantiating process may need to create them.
+task to alter another's credentials. This means the ``capset()`` system call
+is no longer permitted to take any PID other than the one of the current
+process. Also ``keyctl_instantiate()`` and ``keyctl_negate()`` functions no
+longer permit attachment to process-specific keyrings in the requesting
+process as the instantiating process may need to create them.
-IMMUTABLE CREDENTIALS
+Immutable Credentials
---------------------
-Once a set of credentials has been made public (by calling commit_creds() for
-example), it must be considered immutable, barring two exceptions:
+Once a set of credentials has been made public (by calling ``commit_creds()``
+for example), it must be considered immutable, barring two exceptions:
- (1) The reference count may be altered.
+ 1. The reference count may be altered.
- (2) Whilst the keyring subscriptions of a set of credentials may not be
- changed, the keyrings subscribed to may have their contents altered.
+ 2. Whilst the keyring subscriptions of a set of credentials may not be
+ changed, the keyrings subscribed to may have their contents altered.
To catch accidental credential alteration at compile time, struct task_struct
has _const_ pointers to its credential sets, as does struct file. Furthermore,
-certain functions such as get_cred() and put_cred() operate on const pointers,
-thus rendering casts unnecessary, but require to temporarily ditch the const
-qualification to be able to alter the reference count.
+certain functions such as ``get_cred()`` and ``put_cred()`` operate on const
+pointers, thus rendering casts unnecessary, but require to temporarily ditch
+the const qualification to be able to alter the reference count.
-ACCESSING TASK CREDENTIALS
+Accessing Task Credentials
--------------------------
A task being able to alter only its own credentials permits the current process
to read or replace its own credentials without the need for any form of locking
-- which simplifies things greatly. It can just call:
+-- which simplifies things greatly. It can just call::
const struct cred *current_cred()
@@ -341,7 +314,7 @@ to get a pointer to its credentials structure, and it doesn't have to release
it afterwards.
There are convenience wrappers for retrieving specific aspects of a task's
-credentials (the value is simply returned in each case):
+credentials (the value is simply returned in each case)::
uid_t current_uid(void) Current's real UID
gid_t current_gid(void) Current's real GID
@@ -354,7 +327,7 @@ credentials (the value is simply returned in each case):
struct user_struct *current_user(void) Current's user account
There are also convenience wrappers for retrieving specific associated pairs of
-a task's credentials:
+a task's credentials::
void current_uid_gid(uid_t *, gid_t *);
void current_euid_egid(uid_t *, gid_t *);
@@ -365,12 +338,12 @@ them from the current task's credentials.
In addition, there is a function for obtaining a reference on the current
-process's current set of credentials:
+process's current set of credentials::
const struct cred *get_current_cred(void);
and functions for getting references to one of the credentials that don't
-actually live in struct cred:
+actually live in struct cred::
struct user_struct *get_current_user(void);
struct group_info *get_current_groups(void);
@@ -378,22 +351,22 @@ actually live in struct cred:
which get references to the current process's user accounting structure and
supplementary groups list respectively.
-Once a reference has been obtained, it must be released with put_cred(),
-free_uid() or put_group_info() as appropriate.
+Once a reference has been obtained, it must be released with ``put_cred()``,
+``free_uid()`` or ``put_group_info()`` as appropriate.
-ACCESSING ANOTHER TASK'S CREDENTIALS
+Accessing Another Task's Credentials
------------------------------------
Whilst a task may access its own credentials without the need for locking, the
same is not true of a task wanting to access another task's credentials. It
-must use the RCU read lock and rcu_dereference().
+must use the RCU read lock and ``rcu_dereference()``.
-The rcu_dereference() is wrapped by:
+The ``rcu_dereference()`` is wrapped by::
const struct cred *__task_cred(struct task_struct *task);
-This should be used inside the RCU read lock, as in the following example:
+This should be used inside the RCU read lock, as in the following example::
void foo(struct task_struct *t, struct foo_data *f)
{
@@ -410,39 +383,40 @@ This should be used inside the RCU read lock, as in the following example:
Should it be necessary to hold another task's credentials for a long period of
time, and possibly to sleep whilst doing so, then the caller should get a
-reference on them using:
+reference on them using::
const struct cred *get_task_cred(struct task_struct *task);
This does all the RCU magic inside of it. The caller must call put_cred() on
the credentials so obtained when they're finished with.
- [*] Note: The result of __task_cred() should not be passed directly to
- get_cred() as this may race with commit_cred().
+.. note::
+ The result of ``__task_cred()`` should not be passed directly to
+ ``get_cred()`` as this may race with ``commit_cred()``.
There are a couple of convenience functions to access bits of another task's
-credentials, hiding the RCU magic from the caller:
+credentials, hiding the RCU magic from the caller::
uid_t task_uid(task) Task's real UID
uid_t task_euid(task) Task's effective UID
-If the caller is holding the RCU read lock at the time anyway, then:
+If the caller is holding the RCU read lock at the time anyway, then::
__task_cred(task)->uid
__task_cred(task)->euid
should be used instead. Similarly, if multiple aspects of a task's credentials
-need to be accessed, RCU read lock should be used, __task_cred() called, the
-result stored in a temporary pointer and then the credential aspects called
+need to be accessed, RCU read lock should be used, ``__task_cred()`` called,
+the result stored in a temporary pointer and then the credential aspects called
from that before dropping the lock. This prevents the potentially expensive
RCU magic from being invoked multiple times.
Should some other single aspect of another task's credentials need to be
-accessed, then this can be used:
+accessed, then this can be used::
task_cred_xxx(task, member)
-where 'member' is a non-pointer member of the cred struct. For instance:
+where 'member' is a non-pointer member of the cred struct. For instance::
uid_t task_cred_xxx(task, suid);
@@ -451,7 +425,7 @@ magic. This may not be used for pointer members as what they point to may
disappear the moment the RCU read lock is dropped.
-ALTERING CREDENTIALS
+Altering Credentials
--------------------
As previously mentioned, a task may only alter its own credentials, and may not
@@ -459,7 +433,7 @@ alter those of another task. This means that it doesn't need to use any
locking to alter its own credentials.
To alter the current process's credentials, a function should first prepare a
-new set of credentials by calling:
+new set of credentials by calling::
struct cred *prepare_creds(void);
@@ -467,9 +441,10 @@ this locks current->cred_replace_mutex and then allocates and constructs a
duplicate of the current process's credentials, returning with the mutex still
held if successful. It returns NULL if not successful (out of memory).
-The mutex prevents ptrace() from altering the ptrace state of a process whilst
-security checks on credentials construction and changing is taking place as
-the ptrace state may alter the outcome, particularly in the case of execve().
+The mutex prevents ``ptrace()`` from altering the ptrace state of a process
+whilst security checks on credentials construction and changing is taking place
+as the ptrace state may alter the outcome, particularly in the case of
+``execve()``.
The new credentials set should be altered appropriately, and any security
checks and hooks done. Both the current and the proposed sets of credentials
@@ -478,36 +453,37 @@ still at this point.
When the credential set is ready, it should be committed to the current process
-by calling:
+by calling::
int commit_creds(struct cred *new);
This will alter various aspects of the credentials and the process, giving the
-LSM a chance to do likewise, then it will use rcu_assign_pointer() to actually
-commit the new credentials to current->cred, it will release
-current->cred_replace_mutex to allow ptrace() to take place, and it will notify
-the scheduler and others of the changes.
+LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to
+actually commit the new credentials to ``current->cred``, it will release
+``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it
+will notify the scheduler and others of the changes.
This function is guaranteed to return 0, so that it can be tail-called at the
-end of such functions as sys_setresuid().
+end of such functions as ``sys_setresuid()``.
Note that this function consumes the caller's reference to the new credentials.
-The caller should _not_ call put_cred() on the new credentials afterwards.
+The caller should _not_ call ``put_cred()`` on the new credentials afterwards.
Furthermore, once this function has been called on a new set of credentials,
those credentials may _not_ be changed further.
-Should the security checks fail or some other error occur after prepare_creds()
-has been called, then the following function should be invoked:
+Should the security checks fail or some other error occur after
+``prepare_creds()`` has been called, then the following function should be
+invoked::
void abort_creds(struct cred *new);
-This releases the lock on current->cred_replace_mutex that prepare_creds() got
-and then releases the new credentials.
+This releases the lock on ``current->cred_replace_mutex`` that
+``prepare_creds()`` got and then releases the new credentials.
-A typical credentials alteration function would look something like this:
+A typical credentials alteration function would look something like this::
int alter_suid(uid_t suid)
{
@@ -529,53 +505,50 @@ A typical credentials alteration function would look something like this:
}
-MANAGING CREDENTIALS
+Managing Credentials
--------------------
There are some functions to help manage credentials:
- (*) void put_cred(const struct cred *cred);
+ - ``void put_cred(const struct cred *cred);``
This releases a reference to the given set of credentials. If the
reference count reaches zero, the credentials will be scheduled for
destruction by the RCU system.
- (*) const struct cred *get_cred(const struct cred *cred);
+ - ``const struct cred *get_cred(const struct cred *cred);``
This gets a reference on a live set of credentials, returning a pointer to
that set of credentials.
- (*) struct cred *get_new_cred(struct cred *cred);
+ - ``struct cred *get_new_cred(struct cred *cred);``
This gets a reference on a set of credentials that is under construction
and is thus still mutable, returning a pointer to that set of credentials.
-=====================
-OPEN FILE CREDENTIALS
+Open File Credentials
=====================
When a new file is opened, a reference is obtained on the opening task's
-credentials and this is attached to the file struct as 'f_cred' in place of
-'f_uid' and 'f_gid'. Code that used to access file->f_uid and file->f_gid
-should now access file->f_cred->fsuid and file->f_cred->fsgid.
+credentials and this is attached to the file struct as ``f_cred`` in place of
+``f_uid`` and ``f_gid``. Code that used to access ``file->f_uid`` and
+``file->f_gid`` should now access ``file->f_cred->fsuid`` and
+``file->f_cred->fsgid``.
-It is safe to access f_cred without the use of RCU or locking because the
+It is safe to access ``f_cred`` without the use of RCU or locking because the
pointer will not change over the lifetime of the file struct, and nor will the
contents of the cred struct pointed to, barring the exceptions listed above
(see the Task Credentials section).
-=======================================
-OVERRIDING THE VFS'S USE OF CREDENTIALS
+Overriding the VFS's Use of Credentials
=======================================
Under some circumstances it is desirable to override the credentials used by
-the VFS, and that can be done by calling into such as vfs_mkdir() with a
+the VFS, and that can be done by calling into such as ``vfs_mkdir()`` with a
different set of credentials. This is done in the following places:
- (*) sys_faccessat().
-
- (*) do_coredump().
-
- (*) nfs4recover.c.
+ * ``sys_faccessat()``.
+ * ``do_coredump()``.
+ * nfs4recover.c.
diff --git a/Documentation/security/index.rst b/Documentation/security/index.rst
index 9bae6bb20e7f..298a94a33f05 100644
--- a/Documentation/security/index.rst
+++ b/Documentation/security/index.rst
@@ -1,7 +1,13 @@
======================
-Security documentation
+Security Documentation
======================
.. toctree::
+ :maxdepth: 1
+ credentials
+ IMA-templates
+ keys/index
+ LSM
+ self-protection
tpm/index
diff --git a/Documentation/security/keys.txt b/Documentation/security/keys/core.rst
index cd5019934d7f..0d831a7afe4f 100644
--- a/Documentation/security/keys.txt
+++ b/Documentation/security/keys/core.rst
@@ -1,6 +1,6 @@
- ============================
- KERNEL KEY RETENTION SERVICE
- ============================
+============================
+Kernel Key Retention Service
+============================
This service allows cryptographic keys, authentication tokens, cross-domain
user mappings, and similar to be cached in the kernel for the use of
@@ -29,8 +29,7 @@ This document has the following sections:
- Garbage collection
-============
-KEY OVERVIEW
+Key Overview
============
In this context, keys represent units of cryptographic data, authentication
@@ -47,14 +46,14 @@ Each key has a number of attributes:
- State.
- (*) Each key is issued a serial number of type key_serial_t that is unique for
+ * Each key is issued a serial number of type key_serial_t that is unique for
the lifetime of that key. All serial numbers are positive non-zero 32-bit
integers.
Userspace programs can use a key's serial numbers as a way to gain access
to it, subject to permission checking.
- (*) Each key is of a defined "type". Types must be registered inside the
+ * Each key is of a defined "type". Types must be registered inside the
kernel by a kernel service (such as a filesystem) before keys of that type
can be added or used. Userspace programs cannot define new types directly.
@@ -64,18 +63,18 @@ Each key has a number of attributes:
Should a type be removed from the system, all the keys of that type will
be invalidated.
- (*) Each key has a description. This should be a printable string. The key
+ * Each key has a description. This should be a printable string. The key
type provides an operation to perform a match between the description on a
key and a criterion string.
- (*) Each key has an owner user ID, a group ID and a permissions mask. These
+ * Each key has an owner user ID, a group ID and a permissions mask. These
are used to control what a process may do to a key from userspace, and
whether a kernel service will be able to find the key.
- (*) Each key can be set to expire at a specific time by the key type's
+ * Each key can be set to expire at a specific time by the key type's
instantiation function. Keys can also be immortal.
- (*) Each key can have a payload. This is a quantity of data that represent the
+ * Each key can have a payload. This is a quantity of data that represent the
actual "key". In the case of a keyring, this is a list of keys to which
the keyring links; in the case of a user-defined key, it's an arbitrary
blob of data.
@@ -91,39 +90,38 @@ Each key has a number of attributes:
permitted, another key type operation will be called to convert the key's
attached payload back into a blob of data.
- (*) Each key can be in one of a number of basic states:
+ * Each key can be in one of a number of basic states:
- (*) Uninstantiated. The key exists, but does not have any data attached.
+ * Uninstantiated. The key exists, but does not have any data attached.
Keys being requested from userspace will be in this state.
- (*) Instantiated. This is the normal state. The key is fully formed, and
+ * Instantiated. This is the normal state. The key is fully formed, and
has data attached.
- (*) Negative. This is a relatively short-lived state. The key acts as a
+ * Negative. This is a relatively short-lived state. The key acts as a
note saying that a previous call out to userspace failed, and acts as
a throttle on key lookups. A negative key can be updated to a normal
state.
- (*) Expired. Keys can have lifetimes set. If their lifetime is exceeded,
+ * Expired. Keys can have lifetimes set. If their lifetime is exceeded,
they traverse to this state. An expired key can be updated back to a
normal state.
- (*) Revoked. A key is put in this state by userspace action. It can't be
+ * Revoked. A key is put in this state by userspace action. It can't be
found or operated upon (apart from by unlinking it).
- (*) Dead. The key's type was unregistered, and so the key is now useless.
+ * Dead. The key's type was unregistered, and so the key is now useless.
Keys in the last three states are subject to garbage collection. See the
section on "Garbage collection".
-====================
-KEY SERVICE OVERVIEW
+Key Service Overview
====================
The key service provides a number of features besides keys:
- (*) The key service defines three special key types:
+ * The key service defines three special key types:
(+) "keyring"
@@ -149,7 +147,7 @@ The key service provides a number of features besides keys:
be created and updated from userspace, but the payload is only
readable from kernel space.
- (*) Each process subscribes to three keyrings: a thread-specific keyring, a
+ * Each process subscribes to three keyrings: a thread-specific keyring, a
process-specific keyring, and a session-specific keyring.
The thread-specific keyring is discarded from the child when any sort of
@@ -170,7 +168,7 @@ The key service provides a number of features besides keys:
The ownership of the thread keyring changes when the real UID and GID of
the thread changes.
- (*) Each user ID resident in the system holds two special keyrings: a user
+ * Each user ID resident in the system holds two special keyrings: a user
specific keyring and a default user session keyring. The default session
keyring is initialised with a link to the user-specific keyring.
@@ -180,7 +178,7 @@ The key service provides a number of features besides keys:
If a process attempts to access its session key when it doesn't have one,
it will be subscribed to the default for its current UID.
- (*) Each user has two quotas against which the keys they own are tracked. One
+ * Each user has two quotas against which the keys they own are tracked. One
limits the total number of keys and keyrings, the other limits the total
amount of description and payload space that can be consumed.
@@ -194,54 +192,53 @@ The key service provides a number of features besides keys:
If a system call that modifies a key or keyring in some way would put the
user over quota, the operation is refused and error EDQUOT is returned.
- (*) There's a system call interface by which userspace programs can create and
+ * There's a system call interface by which userspace programs can create and
manipulate keys and keyrings.
- (*) There's a kernel interface by which services can register types and search
+ * There's a kernel interface by which services can register types and search
for keys.
- (*) There's a way for the a search done from the kernel to call back to
+ * There's a way for the a search done from the kernel to call back to
userspace to request a key that can't be found in a process's keyrings.
- (*) An optional filesystem is available through which the key database can be
+ * An optional filesystem is available through which the key database can be
viewed and manipulated.
-======================
-KEY ACCESS PERMISSIONS
+Key Access Permissions
======================
Keys have an owner user ID, a group access ID, and a permissions mask. The mask
has up to eight bits each for possessor, user, group and other access. Only
six of each set of eight bits are defined. These permissions granted are:
- (*) View
+ * View
This permits a key or keyring's attributes to be viewed - including key
type and description.
- (*) Read
+ * Read
This permits a key's payload to be viewed or a keyring's list of linked
keys.
- (*) Write
+ * Write
This permits a key's payload to be instantiated or updated, or it allows a
link to be added to or removed from a keyring.
- (*) Search
+ * Search
This permits keyrings to be searched and keys to be found. Searches can
only recurse into nested keyrings that have search permission set.
- (*) Link
+ * Link
This permits a key or keyring to be linked to. To create a link from a
keyring to a key, a process must have Write permission on the keyring and
Link permission on the key.
- (*) Set Attribute
+ * Set Attribute
This permits a key's UID, GID and permissions mask to be changed.
@@ -249,8 +246,7 @@ For changing the ownership, group ID or permissions mask, being the owner of
the key or having the sysadmin capability is sufficient.
-===============
-SELINUX SUPPORT
+SELinux Support
===============
The security class "key" has been added to SELinux so that mandatory access
@@ -282,14 +278,13 @@ their associated thread, and both session and process keyrings are handled
similarly.
-================
-NEW PROCFS FILES
+New ProcFS Files
================
Two files have been added to procfs by which an administrator can find out
about the status of the key service:
- (*) /proc/keys
+ * /proc/keys
This lists the keys that are currently viewable by the task reading the
file, giving information about their type, description and permissions.
@@ -301,7 +296,7 @@ about the status of the key service:
security checks are still performed, and may further filter out keys that
the current process is not authorised to view.
- The contents of the file look like this:
+ The contents of the file look like this::
SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY
00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4
@@ -314,7 +309,7 @@ about the status of the key service:
00000893 I--Q-N 1 35s 1f3f0000 0 0 user metal:silver: 0
00000894 I--Q-- 1 10h 003f0000 0 0 user metal:gold: 0
- The flags are:
+ The flags are::
I Instantiated
R Revoked
@@ -324,10 +319,10 @@ about the status of the key service:
N Negative key
- (*) /proc/key-users
+ * /proc/key-users
This file lists the tracking data for each user that has at least one key
- on the system. Such data includes quota information and statistics:
+ on the system. Such data includes quota information and statistics::
[root@andromeda root]# cat /proc/key-users
0: 46 45/45 1/100 13/10000
@@ -335,7 +330,8 @@ about the status of the key service:
32: 2 2/2 2/100 40/10000
38: 2 2/2 2/100 40/10000
- The format of each line is
+ The format of each line is::
+
<UID>: User ID to which this applies
<usage> Structure refcount
<inst>/<keys> Total number of keys and number instantiated
@@ -346,14 +342,14 @@ about the status of the key service:
Four new sysctl files have been added also for the purpose of controlling the
quota limits on keys:
- (*) /proc/sys/kernel/keys/root_maxkeys
+ * /proc/sys/kernel/keys/root_maxkeys
/proc/sys/kernel/keys/root_maxbytes
These files hold the maximum number of keys that root may have and the
maximum total number of bytes of data that root may have stored in those
keys.
- (*) /proc/sys/kernel/keys/maxkeys
+ * /proc/sys/kernel/keys/maxkeys
/proc/sys/kernel/keys/maxbytes
These files hold the maximum number of keys that each non-root user may
@@ -364,8 +360,7 @@ Root may alter these by writing each new limit as a decimal number string to
the appropriate file.
-===============================
-USERSPACE SYSTEM CALL INTERFACE
+Userspace System Call Interface
===============================
Userspace can manipulate keys directly through three new syscalls: add_key,
@@ -375,7 +370,7 @@ manipulating keys.
When referring to a key directly, userspace programs should use the key's
serial number (a positive 32-bit integer). However, there are some special
values available for referring to special keys and keyrings that relate to the
-process making the call:
+process making the call::
CONSTANT VALUE KEY REFERENCED
============================== ====== ===========================
@@ -391,8 +386,8 @@ process making the call:
The main syscalls are:
- (*) Create a new key of given type, description and payload and add it to the
- nominated keyring:
+ * Create a new key of given type, description and payload and add it to the
+ nominated keyring::
key_serial_t add_key(const char *type, const char *desc,
const void *payload, size_t plen,
@@ -432,8 +427,8 @@ The main syscalls are:
The ID of the new or updated key is returned if successful.
- (*) Search the process's keyrings for a key, potentially calling out to
- userspace to create it.
+ * Search the process's keyrings for a key, potentially calling out to
+ userspace to create it::
key_serial_t request_key(const char *type, const char *description,
const char *callout_info,
@@ -453,7 +448,7 @@ The main syscalls are:
The keyctl syscall functions are:
- (*) Map a special key ID to a real key ID for this process:
+ * Map a special key ID to a real key ID for this process::
key_serial_t keyctl(KEYCTL_GET_KEYRING_ID, key_serial_t id,
int create);
@@ -466,7 +461,7 @@ The keyctl syscall functions are:
non-zero; and the error ENOKEY will be returned if "create" is zero.
- (*) Replace the session keyring this process subscribes to with a new one:
+ * Replace the session keyring this process subscribes to with a new one::
key_serial_t keyctl(KEYCTL_JOIN_SESSION_KEYRING, const char *name);
@@ -484,7 +479,7 @@ The keyctl syscall functions are:
The ID of the new session keyring is returned if successful.
- (*) Update the specified key:
+ * Update the specified key::
long keyctl(KEYCTL_UPDATE, key_serial_t key, const void *payload,
size_t plen);
@@ -498,7 +493,7 @@ The keyctl syscall functions are:
add_key().
- (*) Revoke a key:
+ * Revoke a key::
long keyctl(KEYCTL_REVOKE, key_serial_t key);
@@ -507,7 +502,7 @@ The keyctl syscall functions are:
be findable.
- (*) Change the ownership of a key:
+ * Change the ownership of a key::
long keyctl(KEYCTL_CHOWN, key_serial_t key, uid_t uid, gid_t gid);
@@ -520,7 +515,7 @@ The keyctl syscall functions are:
its group list members.
- (*) Change the permissions mask on a key:
+ * Change the permissions mask on a key::
long keyctl(KEYCTL_SETPERM, key_serial_t key, key_perm_t perm);
@@ -531,7 +526,7 @@ The keyctl syscall functions are:
error EINVAL will be returned.
- (*) Describe a key:
+ * Describe a key::
long keyctl(KEYCTL_DESCRIBE, key_serial_t key, char *buffer,
size_t buflen);
@@ -547,7 +542,7 @@ The keyctl syscall functions are:
A process must have view permission on the key for this function to be
successful.
- If successful, a string is placed in the buffer in the following format:
+ If successful, a string is placed in the buffer in the following format::
<type>;<uid>;<gid>;<perm>;<description>
@@ -555,12 +550,12 @@ The keyctl syscall functions are:
is hexadecimal. A NUL character is included at the end of the string if
the buffer is sufficiently big.
- This can be parsed with
+ This can be parsed with::
sscanf(buffer, "%[^;];%d;%d;%o;%s", type, &uid, &gid, &mode, desc);
- (*) Clear out a keyring:
+ * Clear out a keyring::
long keyctl(KEYCTL_CLEAR, key_serial_t keyring);
@@ -573,7 +568,7 @@ The keyctl syscall functions are:
DNS resolver cache keyring is an example of this.
- (*) Link a key into a keyring:
+ * Link a key into a keyring::
long keyctl(KEYCTL_LINK, key_serial_t keyring, key_serial_t key);
@@ -592,7 +587,7 @@ The keyctl syscall functions are:
added.
- (*) Unlink a key or keyring from another keyring:
+ * Unlink a key or keyring from another keyring::
long keyctl(KEYCTL_UNLINK, key_serial_t keyring, key_serial_t key);
@@ -604,7 +599,7 @@ The keyctl syscall functions are:
is not present, error ENOENT will be the result.
- (*) Search a keyring tree for a key:
+ * Search a keyring tree for a key::
key_serial_t keyctl(KEYCTL_SEARCH, key_serial_t keyring,
const char *type, const char *description,
@@ -628,7 +623,7 @@ The keyctl syscall functions are:
fails. On success, the resulting key ID will be returned.
- (*) Read the payload data from a key:
+ * Read the payload data from a key::
long keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer,
size_t buflen);
@@ -650,7 +645,7 @@ The keyctl syscall functions are:
available rather than the amount copied.
- (*) Instantiate a partially constructed key.
+ * Instantiate a partially constructed key::
long keyctl(KEYCTL_INSTANTIATE, key_serial_t key,
const void *payload, size_t plen,
@@ -677,7 +672,7 @@ The keyctl syscall functions are:
array instead of a single buffer.
- (*) Negatively instantiate a partially constructed key.
+ * Negatively instantiate a partially constructed key::
long keyctl(KEYCTL_NEGATE, key_serial_t key,
unsigned timeout, key_serial_t keyring);
@@ -700,12 +695,12 @@ The keyctl syscall functions are:
as rejecting the key with ENOKEY as the error code.
- (*) Set the default request-key destination keyring.
+ * Set the default request-key destination keyring::
long keyctl(KEYCTL_SET_REQKEY_KEYRING, int reqkey_defl);
This sets the default keyring to which implicitly requested keys will be
- attached for this thread. reqkey_defl should be one of these constants:
+ attached for this thread. reqkey_defl should be one of these constants::
CONSTANT VALUE NEW DEFAULT KEYRING
====================================== ====== =======================
@@ -731,7 +726,7 @@ The keyctl syscall functions are:
there is one, otherwise the user default session keyring.
- (*) Set the timeout on a key.
+ * Set the timeout on a key::
long keyctl(KEYCTL_SET_TIMEOUT, key_serial_t key, unsigned timeout);
@@ -744,7 +739,7 @@ The keyctl syscall functions are:
or expired keys.
- (*) Assume the authority granted to instantiate a key
+ * Assume the authority granted to instantiate a key::
long keyctl(KEYCTL_ASSUME_AUTHORITY, key_serial_t key);
@@ -766,7 +761,7 @@ The keyctl syscall functions are:
The assumed authoritative key is inherited across fork and exec.
- (*) Get the LSM security context attached to a key.
+ * Get the LSM security context attached to a key::
long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer,
size_t buflen)
@@ -787,7 +782,7 @@ The keyctl syscall functions are:
successful.
- (*) Install the calling process's session keyring on its parent.
+ * Install the calling process's session keyring on its parent::
long keyctl(KEYCTL_SESSION_TO_PARENT);
@@ -807,7 +802,7 @@ The keyctl syscall functions are:
kernel and resumes executing userspace.
- (*) Invalidate a key.
+ * Invalidate a key::
long keyctl(KEYCTL_INVALIDATE, key_serial_t key);
@@ -823,20 +818,19 @@ The keyctl syscall functions are:
A process must have search permission on the key for this function to be
successful.
- (*) Compute a Diffie-Hellman shared secret or public key
+ * Compute a Diffie-Hellman shared secret or public key::
- long keyctl(KEYCTL_DH_COMPUTE, struct keyctl_dh_params *params,
- char *buffer, size_t buflen,
- struct keyctl_kdf_params *kdf);
+ long keyctl(KEYCTL_DH_COMPUTE, struct keyctl_dh_params *params,
+ char *buffer, size_t buflen, struct keyctl_kdf_params *kdf);
- The params struct contains serial numbers for three keys:
+ The params struct contains serial numbers for three keys::
- The prime, p, known to both parties
- The local private key
- The base integer, which is either a shared generator or the
remote public key
- The value computed is:
+ The value computed is::
result = base ^ private (mod prime)
@@ -858,12 +852,12 @@ The keyctl syscall functions are:
of the KDF is returned to the caller. The KDF is characterized with
struct keyctl_kdf_params as follows:
- - char *hashname specifies the NUL terminated string identifying
+ - ``char *hashname`` specifies the NUL terminated string identifying
the hash used from the kernel crypto API and applied for the KDF
operation. The KDF implemenation complies with SP800-56A as well
as with SP800-108 (the counter KDF).
- - char *otherinfo specifies the OtherInfo data as documented in
+ - ``char *otherinfo`` specifies the OtherInfo data as documented in
SP800-56A section 5.8.1.2. The length of the buffer is given with
otherinfolen. The format of OtherInfo is defined by the caller.
The otherinfo pointer may be NULL if no OtherInfo shall be used.
@@ -875,10 +869,10 @@ The keyctl syscall functions are:
and either the buffer length or the OtherInfo length exceeds the
allowed length.
- (*) Restrict keyring linkage
+ * Restrict keyring linkage::
- long keyctl(KEYCTL_RESTRICT_KEYRING, key_serial_t keyring,
- const char *type, const char *restriction);
+ long keyctl(KEYCTL_RESTRICT_KEYRING, key_serial_t keyring,
+ const char *type, const char *restriction);
An existing keyring can restrict linkage of additional keys by evaluating
the contents of the key according to a restriction scheme.
@@ -900,8 +894,7 @@ The keyctl syscall functions are:
To apply a keyring restriction the process must have Set Attribute
permission and the keyring must not be previously restricted.
-===============
-KERNEL SERVICES
+Kernel Services
===============
The kernel services for key management are fairly simple to deal with. They can
@@ -915,29 +908,29 @@ call, and the key released upon close. How to deal with conflicting keys due to
two different users opening the same file is left to the filesystem author to
solve.
-To access the key manager, the following header must be #included:
+To access the key manager, the following header must be #included::
<linux/key.h>
Specific key types should have a header file under include/keys/ that should be
-used to access that type. For keys of type "user", for example, that would be:
+used to access that type. For keys of type "user", for example, that would be::
<keys/user-type.h>
Note that there are two different types of pointers to keys that may be
encountered:
- (*) struct key *
+ * struct key *
This simply points to the key structure itself. Key structures will be at
least four-byte aligned.
- (*) key_ref_t
+ * key_ref_t
- This is equivalent to a struct key *, but the least significant bit is set
+ This is equivalent to a ``struct key *``, but the least significant bit is set
if the caller "possesses" the key. By "possession" it is meant that the
calling processes has a searchable link to the key from one of its
- keyrings. There are three functions for dealing with these:
+ keyrings. There are three functions for dealing with these::
key_ref_t make_key_ref(const struct key *key, bool possession);
@@ -955,7 +948,7 @@ When accessing a key's payload contents, certain precautions must be taken to
prevent access vs modification races. See the section "Notes on accessing
payload contents" for more information.
-(*) To search for a key, call:
+ * To search for a key, call::
struct key *request_key(const struct key_type *type,
const char *description,
@@ -977,7 +970,7 @@ payload contents" for more information.
See also Documentation/security/keys-request-key.txt.
-(*) To search for a key, passing auxiliary data to the upcaller, call:
+ * To search for a key, passing auxiliary data to the upcaller, call::
struct key *request_key_with_auxdata(const struct key_type *type,
const char *description,
@@ -990,14 +983,14 @@ payload contents" for more information.
is a blob of length callout_len, if given (the length may be 0).
-(*) A key can be requested asynchronously by calling one of:
+ * A key can be requested asynchronously by calling one of::
struct key *request_key_async(const struct key_type *type,
const char *description,
const void *callout_info,
size_t callout_len);
- or:
+ or::
struct key *request_key_async_with_auxdata(const struct key_type *type,
const char *description,
@@ -1010,7 +1003,7 @@ payload contents" for more information.
These two functions return with the key potentially still under
construction. To wait for construction completion, the following should be
- called:
+ called::
int wait_for_key_construction(struct key *key, bool intr);
@@ -1022,11 +1015,11 @@ payload contents" for more information.
case error ERESTARTSYS will be returned.
-(*) When it is no longer required, the key should be released using:
+ * When it is no longer required, the key should be released using::
void key_put(struct key *key);
- Or:
+ Or::
void key_ref_put(key_ref_t key_ref);
@@ -1034,8 +1027,8 @@ payload contents" for more information.
the argument will not be parsed.
-(*) Extra references can be made to a key by calling one of the following
- functions:
+ * Extra references can be made to a key by calling one of the following
+ functions::
struct key *__key_get(struct key *key);
struct key *key_get(struct key *key);
@@ -1047,7 +1040,7 @@ payload contents" for more information.
then the key will not be dereferenced and no increment will take place.
-(*) A key's serial number can be obtained by calling:
+ * A key's serial number can be obtained by calling::
key_serial_t key_serial(struct key *key);
@@ -1055,7 +1048,7 @@ payload contents" for more information.
latter case without parsing the argument).
-(*) If a keyring was found in the search, this can be further searched by:
+ * If a keyring was found in the search, this can be further searched by::
key_ref_t keyring_search(key_ref_t keyring_ref,
const struct key_type *type,
@@ -1070,7 +1063,7 @@ payload contents" for more information.
reference pointer if successful.
-(*) A keyring can be created by:
+ * A keyring can be created by::
struct key *keyring_alloc(const char *description, uid_t uid, gid_t gid,
const struct cred *cred,
@@ -1109,7 +1102,7 @@ payload contents" for more information.
-EPERM to in this case.
-(*) To check the validity of a key, this function can be called:
+ * To check the validity of a key, this function can be called::
int validate_key(struct key *key);
@@ -1119,7 +1112,7 @@ payload contents" for more information.
returned (in the latter case without parsing the argument).
-(*) To register a key type, the following function should be called:
+ * To register a key type, the following function should be called::
int register_key_type(struct key_type *type);
@@ -1127,13 +1120,13 @@ payload contents" for more information.
present.
-(*) To unregister a key type, call:
+ * To unregister a key type, call::
void unregister_key_type(struct key_type *type);
Under some circumstances, it may be desirable to deal with a bundle of keys.
-The facility provides access to the keyring type for managing such a bundle:
+The facility provides access to the keyring type for managing such a bundle::
struct key_type key_type_keyring;
@@ -1143,8 +1136,7 @@ with keyring_search(). Note that it is not possible to use request_key() to
search a specific keyring, so using keyrings in this way is of limited utility.
-===================================
-NOTES ON ACCESSING PAYLOAD CONTENTS
+Notes On Accessing Payload Contents
===================================
The simplest payload is just data stored in key->payload directly. In this
@@ -1154,31 +1146,31 @@ More complex payload contents must be allocated and pointers to them set in the
key->payload.data[] array. One of the following ways must be selected to
access the data:
- (1) Unmodifiable key type.
+ 1) Unmodifiable key type.
If the key type does not have a modify method, then the key's payload can
be accessed without any form of locking, provided that it's known to be
instantiated (uninstantiated keys cannot be "found").
- (2) The key's semaphore.
+ 2) The key's semaphore.
The semaphore could be used to govern access to the payload and to control
the payload pointer. It must be write-locked for modifications and would
have to be read-locked for general access. The disadvantage of doing this
is that the accessor may be required to sleep.
- (3) RCU.
+ 3) RCU.
RCU must be used when the semaphore isn't already held; if the semaphore
is held then the contents can't change under you unexpectedly as the
semaphore must still be used to serialise modifications to the key. The
key management code takes care of this for the key type.
- However, this means using:
+ However, this means using::
rcu_read_lock() ... rcu_dereference() ... rcu_read_unlock()
- to read the pointer, and:
+ to read the pointer, and::
rcu_dereference() ... rcu_assign_pointer() ... call_rcu()
@@ -1194,11 +1186,11 @@ access the data:
usage. This is called key->payload.rcu_data0. The following accessors
wrap the RCU calls to this element:
- (a) Set or change the first payload pointer:
+ a) Set or change the first payload pointer::
rcu_assign_keypointer(struct key *key, void *data);
- (b) Read the first payload pointer with the key semaphore held:
+ b) Read the first payload pointer with the key semaphore held::
[const] void *dereference_key_locked([const] struct key *key);
@@ -1206,39 +1198,38 @@ access the data:
parameter. Static analysis will give an error if it things the lock
isn't held.
- (c) Read the first payload pointer with the RCU read lock held:
+ c) Read the first payload pointer with the RCU read lock held::
const void *dereference_key_rcu(const struct key *key);
-===================
-DEFINING A KEY TYPE
+Defining a Key Type
===================
A kernel service may want to define its own key type. For instance, an AFS
filesystem might want to define a Kerberos 5 ticket key type. To do this, it
author fills in a key_type struct and registers it with the system.
-Source files that implement key types should include the following header file:
+Source files that implement key types should include the following header file::
<linux/key-type.h>
The structure has a number of fields, some of which are mandatory:
- (*) const char *name
+ * ``const char *name``
The name of the key type. This is used to translate a key type name
supplied by userspace into a pointer to the structure.
- (*) size_t def_datalen
+ * ``size_t def_datalen``
This is optional - it supplies the default payload data length as
contributed to the quota. If the key type's payload is always or almost
always the same size, then this is a more efficient way to do things.
The data length (and quota) on a particular key can always be changed
- during instantiation or update by calling:
+ during instantiation or update by calling::
int key_payload_reserve(struct key *key, size_t datalen);
@@ -1246,18 +1237,18 @@ The structure has a number of fields, some of which are mandatory:
viable.
- (*) int (*vet_description)(const char *description);
+ * ``int (*vet_description)(const char *description);``
This optional method is called to vet a key description. If the key type
doesn't approve of the key description, it may return an error, otherwise
it should return 0.
- (*) int (*preparse)(struct key_preparsed_payload *prep);
+ * ``int (*preparse)(struct key_preparsed_payload *prep);``
This optional method permits the key type to attempt to parse payload
before a key is created (add key) or the key semaphore is taken (update or
- instantiate key). The structure pointed to by prep looks like:
+ instantiate key). The structure pointed to by prep looks like::
struct key_preparsed_payload {
char *description;
@@ -1285,7 +1276,7 @@ The structure has a number of fields, some of which are mandatory:
otherwise.
- (*) void (*free_preparse)(struct key_preparsed_payload *prep);
+ * ``void (*free_preparse)(struct key_preparsed_payload *prep);``
This method is only required if the preparse() method is provided,
otherwise it is unused. It cleans up anything attached to the description
@@ -1294,7 +1285,7 @@ The structure has a number of fields, some of which are mandatory:
successfully, even if instantiate() or update() succeed.
- (*) int (*instantiate)(struct key *key, struct key_preparsed_payload *prep);
+ * ``int (*instantiate)(struct key *key, struct key_preparsed_payload *prep);``
This method is called to attach a payload to a key during construction.
The payload attached need not bear any relation to the data passed to this
@@ -1318,7 +1309,7 @@ The structure has a number of fields, some of which are mandatory:
free_preparse method doesn't release the data.
- (*) int (*update)(struct key *key, const void *data, size_t datalen);
+ * ``int (*update)(struct key *key, const void *data, size_t datalen);``
If this type of key can be updated, then this method should be provided.
It is called to update a key's payload from the blob of data provided.
@@ -1343,10 +1334,10 @@ The structure has a number of fields, some of which are mandatory:
It is safe to sleep in this method.
- (*) int (*match_preparse)(struct key_match_data *match_data);
+ * ``int (*match_preparse)(struct key_match_data *match_data);``
This method is optional. It is called when a key search is about to be
- performed. It is given the following structure:
+ performed. It is given the following structure::
struct key_match_data {
bool (*cmp)(const struct key *key,
@@ -1357,23 +1348,23 @@ The structure has a number of fields, some of which are mandatory:
};
On entry, raw_data will be pointing to the criteria to be used in matching
- a key by the caller and should not be modified. (*cmp)() will be pointing
+ a key by the caller and should not be modified. ``(*cmp)()`` will be pointing
to the default matcher function (which does an exact description match
against raw_data) and lookup_type will be set to indicate a direct lookup.
The following lookup_type values are available:
- [*] KEYRING_SEARCH_LOOKUP_DIRECT - A direct lookup hashes the type and
+ * KEYRING_SEARCH_LOOKUP_DIRECT - A direct lookup hashes the type and
description to narrow down the search to a small number of keys.
- [*] KEYRING_SEARCH_LOOKUP_ITERATE - An iterative lookup walks all the
+ * KEYRING_SEARCH_LOOKUP_ITERATE - An iterative lookup walks all the
keys in the keyring until one is matched. This must be used for any
search that's not doing a simple direct match on the key description.
The method may set cmp to point to a function of its choice that does some
other form of match, may set lookup_type to KEYRING_SEARCH_LOOKUP_ITERATE
- and may attach something to the preparsed pointer for use by (*cmp)().
- (*cmp)() should return true if a key matches and false otherwise.
+ and may attach something to the preparsed pointer for use by ``(*cmp)()``.
+ ``(*cmp)()`` should return true if a key matches and false otherwise.
If preparsed is set, it may be necessary to use the match_free() method to
clean it up.
@@ -1381,20 +1372,20 @@ The structure has a number of fields, some of which are mandatory:
The method should return 0 if successful or a negative error code
otherwise.
- It is permitted to sleep in this method, but (*cmp)() may not sleep as
+ It is permitted to sleep in this method, but ``(*cmp)()`` may not sleep as
locks will be held over it.
If match_preparse() is not provided, keys of this type will be matched
exactly by their description.
- (*) void (*match_free)(struct key_match_data *match_data);
+ * ``void (*match_free)(struct key_match_data *match_data);``
This method is optional. If given, it called to clean up
match_data->preparsed after a successful call to match_preparse().
- (*) void (*revoke)(struct key *key);
+ * ``void (*revoke)(struct key *key);``
This method is optional. It is called to discard part of the payload
data upon a key being revoked. The caller will have the key semaphore
@@ -1404,7 +1395,7 @@ The structure has a number of fields, some of which are mandatory:
a deadlock against the key semaphore.
- (*) void (*destroy)(struct key *key);
+ * ``void (*destroy)(struct key *key);``
This method is optional. It is called to discard the payload data on a key
when it is being destroyed.
@@ -1416,7 +1407,7 @@ The structure has a number of fields, some of which are mandatory:
It is not safe to sleep in this method; the caller may hold spinlocks.
- (*) void (*describe)(const struct key *key, struct seq_file *p);
+ * ``void (*describe)(const struct key *key, struct seq_file *p);``
This method is optional. It is called during /proc/keys reading to
summarise a key's description and payload in text form.
@@ -1432,7 +1423,7 @@ The structure has a number of fields, some of which are mandatory:
caller.
- (*) long (*read)(const struct key *key, char __user *buffer, size_t buflen);
+ * ``long (*read)(const struct key *key, char __user *buffer, size_t buflen);``
This method is optional. It is called by KEYCTL_READ to translate the
key's payload into something a blob of data for userspace to deal with.
@@ -1448,8 +1439,7 @@ The structure has a number of fields, some of which are mandatory:
as might happen when the userspace buffer is accessed.
- (*) int (*request_key)(struct key_construction *cons, const char *op,
- void *aux);
+ * ``int (*request_key)(struct key_construction *cons, const char *op, void *aux);``
This method is optional. If provided, request_key() and friends will
invoke this function rather than upcalling to /sbin/request-key to operate
@@ -1463,7 +1453,7 @@ The structure has a number of fields, some of which are mandatory:
This method is permitted to return before the upcall is complete, but the
following function must be called under all circumstances to complete the
instantiation process, whether or not it succeeds, whether or not there's
- an error:
+ an error::
void complete_request_key(struct key_construction *cons, int error);
@@ -1479,16 +1469,16 @@ The structure has a number of fields, some of which are mandatory:
The key under construction and the authorisation key can be found in the
key_construction struct pointed to by cons:
- (*) struct key *key;
+ * ``struct key *key;``
The key under construction.
- (*) struct key *authkey;
+ * ``struct key *authkey;``
The authorisation key.
- (*) struct key_restriction *(*lookup_restriction)(const char *params);
+ * ``struct key_restriction *(*lookup_restriction)(const char *params);``
This optional method is used to enable userspace configuration of keyring
restrictions. The restriction parameter string (not including the key type
@@ -1497,12 +1487,11 @@ The structure has a number of fields, some of which are mandatory:
attempted key link operation. If there is no match, -EINVAL is returned.
-============================
-REQUEST-KEY CALLBACK SERVICE
+Request-Key Callback Service
============================
To create a new key, the kernel will attempt to execute the following command
-line:
+line::
/sbin/request-key create <key> <uid> <gid> \
<threadring> <processring> <sessionring> <callout_info>
@@ -1511,10 +1500,10 @@ line:
keyrings from the process that caused the search to be issued. These are
included for two reasons:
- (1) There may be an authentication token in one of the keyrings that is
+ 1 There may be an authentication token in one of the keyrings that is
required to obtain the key, eg: a Kerberos Ticket-Granting Ticket.
- (2) The new key should probably be cached in one of these rings.
+ 2 The new key should probably be cached in one of these rings.
This program should set it UID and GID to those specified before attempting to
access any more keys. It may then look around for a user specific process to
@@ -1539,7 +1528,7 @@ instead.
Similarly, the kernel may attempt to update an expired or a soon to expire key
-by executing:
+by executing::
/sbin/request-key update <key> <uid> <gid> \
<threadring> <processring> <sessionring>
@@ -1548,8 +1537,7 @@ In this case, the program isn't required to actually attach the key to a ring;
the rings are provided for reference.
-==================
-GARBAGE COLLECTION
+Garbage Collection
==================
Dead keys (for which the type has been removed) will be automatically unlinked
@@ -1557,6 +1545,6 @@ from those keyrings that point to them and deleted as soon as possible by a
background garbage collector.
Similarly, revoked and expired keys will be garbage collected, but only after a
-certain amount of time has passed. This time is set as a number of seconds in:
+certain amount of time has passed. This time is set as a number of seconds in::
/proc/sys/kernel/keys/gc_delay
diff --git a/Documentation/security/keys-ecryptfs.txt b/Documentation/security/keys/ecryptfs.rst
index c3bbeba63562..4920f3a8ea75 100644
--- a/Documentation/security/keys-ecryptfs.txt
+++ b/Documentation/security/keys/ecryptfs.rst
@@ -1,4 +1,6 @@
- Encrypted keys for the eCryptfs filesystem
+==========================================
+Encrypted keys for the eCryptfs filesystem
+==========================================
ECryptfs is a stacked filesystem which transparently encrypts and decrypts each
file using a randomly generated File Encryption Key (FEK).
@@ -35,20 +37,23 @@ controlled environment. Another advantage is that the key is not exposed to
threats of malicious software, because it is available in clear form only at
kernel level.
-Usage:
+Usage::
+
keyctl add encrypted name "new ecryptfs key-type:master-key-name keylen" ring
keyctl add encrypted name "load hex_blob" ring
keyctl update keyid "update key-type:master-key-name"
-name:= '<16 hexadecimal characters>'
-key-type:= 'trusted' | 'user'
-keylen:= 64
+Where::
+
+ name:= '<16 hexadecimal characters>'
+ key-type:= 'trusted' | 'user'
+ keylen:= 64
Example of encrypted key usage with the eCryptfs filesystem:
Create an encrypted key "1000100010001000" of length 64 bytes with format
-'ecryptfs' and save it using a previously loaded user key "test":
+'ecryptfs' and save it using a previously loaded user key "test"::
$ keyctl add encrypted 1000100010001000 "new ecryptfs user:test 64" @u
19184530
@@ -62,7 +67,7 @@ Create an encrypted key "1000100010001000" of length 64 bytes with format
$ keyctl pipe 19184530 > ecryptfs.blob
Mount an eCryptfs filesystem using the created encrypted key "1000100010001000"
-into the '/secret' directory:
+into the '/secret' directory::
$ mount -i -t ecryptfs -oecryptfs_sig=1000100010001000,\
ecryptfs_cipher=aes,ecryptfs_key_bytes=32 /secret /secret
diff --git a/Documentation/security/keys/index.rst b/Documentation/security/keys/index.rst
new file mode 100644
index 000000000000..647d58f2588e
--- /dev/null
+++ b/Documentation/security/keys/index.rst
@@ -0,0 +1,11 @@
+===========
+Kernel Keys
+===========
+
+.. toctree::
+ :maxdepth: 1
+
+ core
+ ecryptfs
+ request-key
+ trusted-encrypted
diff --git a/Documentation/security/keys-request-key.txt b/Documentation/security/keys/request-key.rst
index 51987bfecfed..aba32784174c 100644
--- a/Documentation/security/keys-request-key.txt
+++ b/Documentation/security/keys/request-key.rst
@@ -1,19 +1,19 @@
- ===================
- KEY REQUEST SERVICE
- ===================
+===================
+Key Request Service
+===================
The key request service is part of the key retention service (refer to
Documentation/security/keys.txt). This document explains more fully how
the requesting algorithm works.
The process starts by either the kernel requesting a service by calling
-request_key*():
+``request_key*()``::
struct key *request_key(const struct key_type *type,
const char *description,
const char *callout_info);
-or:
+or::
struct key *request_key_with_auxdata(const struct key_type *type,
const char *description,
@@ -21,14 +21,14 @@ or:
size_t callout_len,
void *aux);
-or:
+or::
struct key *request_key_async(const struct key_type *type,
const char *description,
const char *callout_info,
size_t callout_len);
-or:
+or::
struct key *request_key_async_with_auxdata(const struct key_type *type,
const char *description,
@@ -36,7 +36,7 @@ or:
size_t callout_len,
void *aux);
-Or by userspace invoking the request_key system call:
+Or by userspace invoking the request_key system call::
key_serial_t request_key(const char *type,
const char *description,
@@ -67,38 +67,37 @@ own upcall mechanisms. If they do, then those should be substituted for the
forking and execution of /sbin/request-key.
-===========
-THE PROCESS
+The Process
===========
A request proceeds in the following manner:
- (1) Process A calls request_key() [the userspace syscall calls the kernel
+ 1) Process A calls request_key() [the userspace syscall calls the kernel
interface].
- (2) request_key() searches the process's subscribed keyrings to see if there's
+ 2) request_key() searches the process's subscribed keyrings to see if there's
a suitable key there. If there is, it returns the key. If there isn't,
and callout_info is not set, an error is returned. Otherwise the process
proceeds to the next step.
- (3) request_key() sees that A doesn't have the desired key yet, so it creates
+ 3) request_key() sees that A doesn't have the desired key yet, so it creates
two things:
- (a) An uninstantiated key U of requested type and description.
+ a) An uninstantiated key U of requested type and description.
- (b) An authorisation key V that refers to key U and notes that process A
+ b) An authorisation key V that refers to key U and notes that process A
is the context in which key U should be instantiated and secured, and
from which associated key requests may be satisfied.
- (4) request_key() then forks and executes /sbin/request-key with a new session
+ 4) request_key() then forks and executes /sbin/request-key with a new session
keyring that contains a link to auth key V.
- (5) /sbin/request-key assumes the authority associated with key U.
+ 5) /sbin/request-key assumes the authority associated with key U.
- (6) /sbin/request-key execs an appropriate program to perform the actual
+ 6) /sbin/request-key execs an appropriate program to perform the actual
instantiation.
- (7) The program may want to access another key from A's context (say a
+ 7) The program may want to access another key from A's context (say a
Kerberos TGT key). It just requests the appropriate key, and the keyring
search notes that the session keyring has auth key V in its bottom level.
@@ -106,15 +105,15 @@ A request proceeds in the following manner:
UID, GID, groups and security info of process A as if it was process A,
and come up with key W.
- (8) The program then does what it must to get the data with which to
+ 8) The program then does what it must to get the data with which to
instantiate key U, using key W as a reference (perhaps it contacts a
Kerberos server using the TGT) and then instantiates key U.
- (9) Upon instantiating key U, auth key V is automatically revoked so that it
+ 9) Upon instantiating key U, auth key V is automatically revoked so that it
may not be used again.
-(10) The program then exits 0 and request_key() deletes key V and returns key
- U to the caller.
+ 10) The program then exits 0 and request_key() deletes key V and returns key
+ U to the caller.
This also extends further. If key W (step 7 above) didn't exist, key W would
be created uninstantiated, another auth key (X) would be created (as per step
@@ -127,8 +126,7 @@ This is because process A's keyrings can't simply be attached to
of them, and (b) it requires the same UID/GID/Groups all the way through.
-====================================
-NEGATIVE INSTANTIATION AND REJECTION
+Negative Instantiation And Rejection
====================================
Rather than instantiating a key, it is possible for the possessor of an
@@ -145,23 +143,22 @@ signal, the key under construction will be automatically negatively
instantiated for a short amount of time.
-====================
-THE SEARCH ALGORITHM
+The Search Algorithm
====================
A search of any particular keyring proceeds in the following fashion:
- (1) When the key management code searches for a key (keyring_search_aux) it
+ 1) When the key management code searches for a key (keyring_search_aux) it
firstly calls key_permission(SEARCH) on the keyring it's starting with,
if this denies permission, it doesn't search further.
- (2) It considers all the non-keyring keys within that keyring and, if any key
+ 2) It considers all the non-keyring keys within that keyring and, if any key
matches the criteria specified, calls key_permission(SEARCH) on it to see
if the key is allowed to be found. If it is, that key is returned; if
not, the search continues, and the error code is retained if of higher
priority than the one currently set.
- (3) It then considers all the keyring-type keys in the keyring it's currently
+ 3) It then considers all the keyring-type keys in the keyring it's currently
searching. It calls key_permission(SEARCH) on each keyring, and if this
grants permission, it recurses, executing steps (2) and (3) on that
keyring.
@@ -173,20 +170,20 @@ returned.
When search_process_keyrings() is invoked, it performs the following searches
until one succeeds:
- (1) If extant, the process's thread keyring is searched.
+ 1) If extant, the process's thread keyring is searched.
- (2) If extant, the process's process keyring is searched.
+ 2) If extant, the process's process keyring is searched.
- (3) The process's session keyring is searched.
+ 3) The process's session keyring is searched.
- (4) If the process has assumed the authority associated with a request_key()
+ 4) If the process has assumed the authority associated with a request_key()
authorisation key then:
- (a) If extant, the calling process's thread keyring is searched.
+ a) If extant, the calling process's thread keyring is searched.
- (b) If extant, the calling process's process keyring is searched.
+ b) If extant, the calling process's process keyring is searched.
- (c) The calling process's session keyring is searched.
+ c) The calling process's session keyring is searched.
The moment one succeeds, all pending errors are discarded and the found key is
returned.
@@ -194,7 +191,7 @@ returned.
Only if all these fail does the whole thing fail with the highest priority
error. Note that several errors may have come from LSM.
-The error priority is:
+The error priority is::
EKEYREVOKED > EKEYEXPIRED > ENOKEY
diff --git a/Documentation/security/keys-trusted-encrypted.txt b/Documentation/security/keys/trusted-encrypted.rst
index b20a993a32af..7b503831bdea 100644
--- a/Documentation/security/keys-trusted-encrypted.txt
+++ b/Documentation/security/keys/trusted-encrypted.rst
@@ -1,4 +1,6 @@
- Trusted and Encrypted Keys
+==========================
+Trusted and Encrypted Keys
+==========================
Trusted and Encrypted Keys are two new key types added to the existing kernel
key ring service. Both of these new types are variable length symmetric keys,
@@ -20,7 +22,8 @@ By default, trusted keys are sealed under the SRK, which has the default
authorization value (20 zeros). This can be set at takeownership time with the
trouser's utility: "tpm_takeownership -u -z".
-Usage:
+Usage::
+
keyctl add trusted name "new keylen [options]" ring
keyctl add trusted name "load hex_blob [pcrlock=pcrnum]" ring
keyctl update key "update [options]"
@@ -64,19 +67,22 @@ The decrypted portion of encrypted keys can contain either a simple symmetric
key or a more complex structure. The format of the more complex structure is
application specific, which is identified by 'format'.
-Usage:
+Usage::
+
keyctl add encrypted name "new [format] key-type:master-key-name keylen"
ring
keyctl add encrypted name "load hex_blob" ring
keyctl update keyid "update key-type:master-key-name"
-format:= 'default | ecryptfs'
-key-type:= 'trusted' | 'user'
+Where::
+
+ format:= 'default | ecryptfs'
+ key-type:= 'trusted' | 'user'
Examples of trusted and encrypted key usage:
-Create and save a trusted key named "kmk" of length 32 bytes:
+Create and save a trusted key named "kmk" of length 32 bytes::
$ keyctl add trusted kmk "new 32" @u
440502848
@@ -99,7 +105,7 @@ Create and save a trusted key named "kmk" of length 32 bytes:
$ keyctl pipe 440502848 > kmk.blob
-Load a trusted key from the saved blob:
+Load a trusted key from the saved blob::
$ keyctl add trusted kmk "load `cat kmk.blob`" @u
268728824
@@ -114,7 +120,7 @@ Load a trusted key from the saved blob:
f1f8fff03ad0acb083725535636addb08d73dedb9832da198081e5deae84bfaf0409c22b
e4a8aea2b607ec96931e6f4d4fe563ba
-Reseal a trusted key under new pcr values:
+Reseal a trusted key under new pcr values::
$ keyctl update 268728824 "update pcrinfo=`cat pcr.blob`"
$ keyctl print 268728824
@@ -135,11 +141,13 @@ compromised by a user level problem, and when sealed to specific boot PCR
values, protects against boot and offline attacks. Create and save an
encrypted key "evm" using the above trusted key "kmk":
-option 1: omitting 'format'
+option 1: omitting 'format'::
+
$ keyctl add encrypted evm "new trusted:kmk 32" @u
159771175
-option 2: explicitly defining 'format' as 'default'
+option 2: explicitly defining 'format' as 'default'::
+
$ keyctl add encrypted evm "new default trusted:kmk 32" @u
159771175
@@ -150,7 +158,7 @@ option 2: explicitly defining 'format' as 'default'
$ keyctl pipe 159771175 > evm.blob
-Load an encrypted key "evm" from saved blob:
+Load an encrypted key "evm" from saved blob::
$ keyctl add encrypted evm "load `cat evm.blob`" @u
831684262
@@ -164,4 +172,4 @@ Other uses for trusted and encrypted keys, such as for disk and file encryption
are anticipated. In particular the new format 'ecryptfs' has been defined in
in order to use encrypted keys to mount an eCryptfs filesystem. More details
about the usage can be found in the file
-'Documentation/security/keys-ecryptfs.txt'.
+``Documentation/security/keys-ecryptfs.txt``.
diff --git a/Documentation/security/self-protection.txt b/Documentation/security/self-protection.rst
index 141acfebe6ef..60c8bd8b77bf 100644
--- a/Documentation/security/self-protection.txt
+++ b/Documentation/security/self-protection.rst
@@ -1,4 +1,6 @@
-# Kernel Self-Protection
+======================
+Kernel Self-Protection
+======================
Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
@@ -26,7 +28,8 @@ mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.
-## Attack Surface Reduction
+Attack Surface Reduction
+========================
The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
@@ -34,13 +37,15 @@ from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.
-### Strict kernel memory permissions
+Strict kernel memory permissions
+--------------------------------
When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.
-#### Executable code and read-only data must not be writable
+Executable code and read-only data must not be writable
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
@@ -51,18 +56,19 @@ kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)
-In support of this are CONFIG_STRICT_KERNEL_RWX and
-CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not
+In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
+``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.
Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
-a Kconfig prompt. CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines
+a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.
-#### Function pointers and sensitive variables must not be writable
+Function pointers and sensitive variables must not be writable
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
@@ -74,8 +80,8 @@ so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.
-For variables that are initialized once at __init time, these can
-be marked with the (new and under development) __ro_after_init
+For variables that are initialized once at ``__init`` time, these can
+be marked with the (new and under development) ``__ro_after_init``
attribute.
What remains are variables that are updated rarely (e.g. GDT). These
@@ -85,7 +91,8 @@ of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)
-#### Segregation of kernel memory from userspace memory
+Segregation of kernel memory from userspace memory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
@@ -95,10 +102,11 @@ By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.
-### Reduced access to syscalls
+Reduced access to syscalls
+--------------------------
One trivial way to eliminate many syscalls for 64-bit systems is building
-without CONFIG_COMPAT. However, this is rarely a feasible scenario.
+without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.
The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
@@ -112,7 +120,8 @@ to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of normally available to unprivileged
userspace.
-### Restricting access to kernel modules
+Restricting access to kernel modules
+------------------------------------
The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
@@ -127,11 +136,12 @@ for debate in some scenarios.)
To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
-CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
+``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.
-## Memory integrity
+Memory integrity
+================
There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack, By far the most commonly
@@ -139,16 +149,18 @@ understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.
-### Stack buffer overflow
+Stack buffer overflow
+---------------------
The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
-return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
+return address (``CONFIG_CC_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.
-### Stack depth overflow
+Stack depth overflow
+--------------------
A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
@@ -158,27 +170,31 @@ important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.
-### Heap memory integrity
+Heap memory integrity
+---------------------
The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
-### Counter integrity
+Counter integrity
+-----------------
Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
-### Size calculation overflow detection
+Size calculation overflow detection
+-----------------------------------
Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
-## Statistical defenses
+Probabilistic defenses
+======================
While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
@@ -186,7 +202,8 @@ defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.
-### Canaries, blinding, and other secrets
+Canaries, blinding, and other secrets
+-------------------------------------
It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
@@ -201,7 +218,8 @@ It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.
-### Kernel Address Space Layout Randomization (KASLR)
+Kernel Address Space Layout Randomization (KASLR)
+-------------------------------------------------
Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
@@ -209,22 +227,25 @@ raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)
-#### Text and module base
+Text and module base
+~~~~~~~~~~~~~~~~~~~~
By relocating the physical and virtual base address of the kernel at
-boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
+boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.
-#### Stack base
+Stack base
+~~~~~~~~~~
If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.
-#### Dynamic memory base
+Dynamic memory base
+~~~~~~~~~~~~~~~~~~~
Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
@@ -232,7 +253,8 @@ initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.
-#### Structure layout
+Structure layout
+~~~~~~~~~~~~~~~~
By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
@@ -240,26 +262,30 @@ enough kernel memory to determine structure layouts before manipulating
them.
-## Preventing Information Exposures
+Preventing Information Exposures
+================================
Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).
-### Unique identifiers
+Unique identifiers
+------------------
Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
-### Memory initialization
+Memory initialization
+---------------------
Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.
-### Memory poisoning
+Memory poisoning
+----------------
When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
@@ -267,9 +293,10 @@ rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.
-### Destination tracking
+Destination tracking
+--------------------
To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
-the buffer is destined for userspace (e.g. seq_file backed /proc files),
+the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.
diff --git a/Documentation/target/target-export-device b/Documentation/target/target-export-device
new file mode 100755
index 000000000000..b803f4f886b5
--- /dev/null
+++ b/Documentation/target/target-export-device
@@ -0,0 +1,80 @@
+#!/bin/sh
+#
+# This script illustrates the sequence of operations in configfs to
+# create a very simple LIO iSCSI target with a file or block device
+# backstore.
+#
+# (C) Copyright 2014 Christophe Vu-Brugier <cvubrugier@fastmail.fm>
+#
+
+print_usage() {
+ cat <<EOF
+Usage: $(basename $0) [-p PORTAL] DEVICE|FILE
+Export a block device or a file as an iSCSI target with a single LUN
+EOF
+}
+
+die() {
+ echo $1
+ exit 1
+}
+
+while getopts "hp:" arg; do
+ case $arg in
+ h) print_usage; exit 0;;
+ p) PORTAL=${OPTARG};;
+ esac
+done
+shift $(($OPTIND - 1))
+
+DEVICE=$1
+[ -n "$DEVICE" ] || die "Missing device or file argument"
+[ -b $DEVICE -o -f $DEVICE ] || die "Invalid device or file: ${DEVICE}"
+IQN="iqn.2003-01.org.linux-iscsi.$(hostname):$(basename $DEVICE)"
+[ -n "$PORTAL" ] || PORTAL="0.0.0.0:3260"
+
+CONFIGFS=/sys/kernel/config
+CORE_DIR=$CONFIGFS/target/core
+ISCSI_DIR=$CONFIGFS/target/iscsi
+
+# Load the target modules and mount the config file system
+lsmod | grep -q configfs || modprobe configfs
+lsmod | grep -q target_core_mod || modprobe target_core_mod
+mount | grep -q ^configfs || mount -t configfs none $CONFIGFS
+mkdir -p $ISCSI_DIR
+
+# Create a backstore
+if [ -b $DEVICE ]; then
+ BACKSTORE_DIR=$CORE_DIR/iblock_0/data
+ mkdir -p $BACKSTORE_DIR
+ echo "udev_path=${DEVICE}" > $BACKSTORE_DIR/control
+else
+ BACKSTORE_DIR=$CORE_DIR/fileio_0/data
+ mkdir -p $BACKSTORE_DIR
+ DEVICE_SIZE=$(du -b $DEVICE | cut -f1)
+ echo "fd_dev_name=${DEVICE}" > $BACKSTORE_DIR/control
+ echo "fd_dev_size=${DEVICE_SIZE}" > $BACKSTORE_DIR/control
+ echo 1 > $BACKSTORE_DIR/attrib/emulate_write_cache
+fi
+echo 1 > $BACKSTORE_DIR/enable
+
+# Create an iSCSI target and a target portal group (TPG)
+mkdir $ISCSI_DIR/$IQN
+mkdir $ISCSI_DIR/$IQN/tpgt_1/
+
+# Create a LUN
+mkdir $ISCSI_DIR/$IQN/tpgt_1/lun/lun_0
+ln -s $BACKSTORE_DIR $ISCSI_DIR/$IQN/tpgt_1/lun/lun_0/data
+echo 1 > $ISCSI_DIR/$IQN/tpgt_1/enable
+
+# Create a network portal
+mkdir $ISCSI_DIR/$IQN/tpgt_1/np/$PORTAL
+
+# Disable authentication
+echo 0 > $ISCSI_DIR/$IQN/tpgt_1/attrib/authentication
+echo 1 > $ISCSI_DIR/$IQN/tpgt_1/attrib/generate_node_acls
+
+# Allow write access for non authenticated initiators
+echo 0 > $ISCSI_DIR/$IQN/tpgt_1/attrib/demo_mode_write_protect
+
+echo "Target ${IQN}, portal ${PORTAL} has been created"
diff --git a/Documentation/tee.txt b/Documentation/tee.txt
new file mode 100644
index 000000000000..718599357596
--- /dev/null
+++ b/Documentation/tee.txt
@@ -0,0 +1,118 @@
+TEE subsystem
+This document describes the TEE subsystem in Linux.
+
+A TEE (Trusted Execution Environment) is a trusted OS running in some
+secure environment, for example, TrustZone on ARM CPUs, or a separate
+secure co-processor etc. A TEE driver handles the details needed to
+communicate with the TEE.
+
+This subsystem deals with:
+
+- Registration of TEE drivers
+
+- Managing shared memory between Linux and the TEE
+
+- Providing a generic API to the TEE
+
+The TEE interface
+=================
+
+include/uapi/linux/tee.h defines the generic interface to a TEE.
+
+User space (the client) connects to the driver by opening /dev/tee[0-9]* or
+/dev/teepriv[0-9]*.
+
+- TEE_IOC_SHM_ALLOC allocates shared memory and returns a file descriptor
+ which user space can mmap. When user space doesn't need the file
+ descriptor any more, it should be closed. When shared memory isn't needed
+ any longer it should be unmapped with munmap() to allow the reuse of
+ memory.
+
+- TEE_IOC_VERSION lets user space know which TEE this driver handles and
+ the its capabilities.
+
+- TEE_IOC_OPEN_SESSION opens a new session to a Trusted Application.
+
+- TEE_IOC_INVOKE invokes a function in a Trusted Application.
+
+- TEE_IOC_CANCEL may cancel an ongoing TEE_IOC_OPEN_SESSION or TEE_IOC_INVOKE.
+
+- TEE_IOC_CLOSE_SESSION closes a session to a Trusted Application.
+
+There are two classes of clients, normal clients and supplicants. The latter is
+a helper process for the TEE to access resources in Linux, for example file
+system access. A normal client opens /dev/tee[0-9]* and a supplicant opens
+/dev/teepriv[0-9].
+
+Much of the communication between clients and the TEE is opaque to the
+driver. The main job for the driver is to receive requests from the
+clients, forward them to the TEE and send back the results. In the case of
+supplicants the communication goes in the other direction, the TEE sends
+requests to the supplicant which then sends back the result.
+
+OP-TEE driver
+=============
+
+The OP-TEE driver handles OP-TEE [1] based TEEs. Currently it is only the ARM
+TrustZone based OP-TEE solution that is supported.
+
+Lowest level of communication with OP-TEE builds on ARM SMC Calling
+Convention (SMCCC) [2], which is the foundation for OP-TEE's SMC interface
+[3] used internally by the driver. Stacked on top of that is OP-TEE Message
+Protocol [4].
+
+OP-TEE SMC interface provides the basic functions required by SMCCC and some
+additional functions specific for OP-TEE. The most interesting functions are:
+
+- OPTEE_SMC_FUNCID_CALLS_UID (part of SMCCC) returns the version information
+ which is then returned by TEE_IOC_VERSION
+
+- OPTEE_SMC_CALL_GET_OS_UUID returns the particular OP-TEE implementation, used
+ to tell, for instance, a TrustZone OP-TEE apart from an OP-TEE running on a
+ separate secure co-processor.
+
+- OPTEE_SMC_CALL_WITH_ARG drives the OP-TEE message protocol
+
+- OPTEE_SMC_GET_SHM_CONFIG lets the driver and OP-TEE agree on which memory
+ range to used for shared memory between Linux and OP-TEE.
+
+The GlobalPlatform TEE Client API [5] is implemented on top of the generic
+TEE API.
+
+Picture of the relationship between the different components in the
+OP-TEE architecture.
+
+ User space Kernel Secure world
+ ~~~~~~~~~~ ~~~~~~ ~~~~~~~~~~~~
+ +--------+ +-------------+
+ | Client | | Trusted |
+ +--------+ | Application |
+ /\ +-------------+
+ || +----------+ /\
+ || |tee- | ||
+ || |supplicant| \/
+ || +----------+ +-------------+
+ \/ /\ | TEE Internal|
+ +-------+ || | API |
+ + TEE | || +--------+--------+ +-------------+
+ | Client| || | TEE | OP-TEE | | OP-TEE |
+ | API | \/ | subsys | driver | | Trusted OS |
+ +-------+----------------+----+-------+----+-----------+-------------+
+ | Generic TEE API | | OP-TEE MSG |
+ | IOCTL (TEE_IOC_*) | | SMCCC (OPTEE_SMC_CALL_*) |
+ +-----------------------------+ +------------------------------+
+
+RPC (Remote Procedure Call) are requests from secure world to kernel driver
+or tee-supplicant. An RPC is identified by a special range of SMCCC return
+values from OPTEE_SMC_CALL_WITH_ARG. RPC messages which are intended for the
+kernel are handled by the kernel driver. Other RPC messages will be forwarded to
+tee-supplicant without further involvement of the driver, except switching
+shared memory buffer representation.
+
+References:
+[1] https://github.com/OP-TEE/optee_os
+[2] http://infocenter.arm.com/help/topic/com.arm.doc.den0028a/index.html
+[3] drivers/tee/optee/optee_smc.h
+[4] drivers/tee/optee/optee_msg.h
+[5] http://www.globalplatform.org/specificationsdevice.asp look for
+ "TEE Client API Specification v1.0" and click download.
diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt
index ef473dc7f55e..bb9a0a53e76b 100644
--- a/Documentation/thermal/sysfs-api.txt
+++ b/Documentation/thermal/sysfs-api.txt
@@ -582,3 +582,24 @@ platform data is provided, this uses the step_wise throttling policy.
This function serves as an arbitrator to set the state of a cooling
device. It sets the cooling device to the deepest cooling state if
possible.
+
+6. thermal_emergency_poweroff:
+
+On an event of critical trip temperature crossing. Thermal framework
+allows the system to shutdown gracefully by calling orderly_poweroff().
+In the event of a failure of orderly_poweroff() to shut down the system
+we are in danger of keeping the system alive at undesirably high
+temperatures. To mitigate this high risk scenario we program a work
+queue to fire after a pre-determined number of seconds to start
+an emergency shutdown of the device using the kernel_power_off()
+function. In case kernel_power_off() fails then finally
+emergency_restart() is called in the worst case.
+
+The delay should be carefully profiled so as to give adequate time for
+orderly_poweroff(). In case of failure of an orderly_poweroff() the
+emergency poweroff kicks in after the delay has elapsed and shuts down
+the system.
+
+If set to 0 emergency poweroff will not be supported. So a carefully
+profiled non-zero positive value is a must for emergerncy poweroff to be
+triggered.
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index a9d01b44a659..7b2eb1b7d4ca 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -16,6 +16,8 @@ place where this information is gathered.
.. toctree::
:maxdepth: 2
+ no_new_privs
+ seccomp_filter
unshare
.. only:: subproject and html
diff --git a/Documentation/prctl/no_new_privs.txt b/Documentation/userspace-api/no_new_privs.rst
index f7be84fba910..d060ea217ea1 100644
--- a/Documentation/prctl/no_new_privs.txt
+++ b/Documentation/userspace-api/no_new_privs.rst
@@ -1,3 +1,7 @@
+======================
+No New Privileges Flag
+======================
+
The execve system call can grant a newly-started program privileges that
its parent did not have. The most obvious examples are setuid/setgid
programs and file capabilities. To prevent the parent program from
@@ -5,53 +9,55 @@ gaining these privileges as well, the kernel and user code must be
careful to prevent the parent from doing anything that could subvert the
child. For example:
- - The dynamic loader handles LD_* environment variables differently if
+ - The dynamic loader handles ``LD_*`` environment variables differently if
a program is setuid.
- chroot is disallowed to unprivileged processes, since it would allow
- /etc/passwd to be replaced from the point of view of a process that
+ ``/etc/passwd`` to be replaced from the point of view of a process that
inherited chroot.
- The exec code has special handling for ptrace.
-These are all ad-hoc fixes. The no_new_privs bit (since Linux 3.5) is a
+These are all ad-hoc fixes. The ``no_new_privs`` bit (since Linux 3.5) is a
new, generic mechanism to make it safe for a process to modify its
execution environment in a manner that persists across execve. Any task
-can set no_new_privs. Once the bit is set, it is inherited across fork,
-clone, and execve and cannot be unset. With no_new_privs set, execve
+can set ``no_new_privs``. Once the bit is set, it is inherited across fork,
+clone, and execve and cannot be unset. With ``no_new_privs`` set, ``execve()``
promises not to grant the privilege to do anything that could not have
been done without the execve call. For example, the setuid and setgid
bits will no longer change the uid or gid; file capabilities will not
add to the permitted set, and LSMs will not relax constraints after
execve.
-To set no_new_privs, use prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0).
+To set ``no_new_privs``, use::
+
+ prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
Be careful, though: LSMs might also not tighten constraints on exec
-in no_new_privs mode. (This means that setting up a general-purpose
-service launcher to set no_new_privs before execing daemons may
+in ``no_new_privs`` mode. (This means that setting up a general-purpose
+service launcher to set ``no_new_privs`` before execing daemons may
interfere with LSM-based sandboxing.)
-Note that no_new_privs does not prevent privilege changes that do not
-involve execve. An appropriately privileged task can still call
-setuid(2) and receive SCM_RIGHTS datagrams.
+Note that ``no_new_privs`` does not prevent privilege changes that do not
+involve ``execve()``. An appropriately privileged task can still call
+``setuid(2)`` and receive SCM_RIGHTS datagrams.
-There are two main use cases for no_new_privs so far:
+There are two main use cases for ``no_new_privs`` so far:
- Filters installed for the seccomp mode 2 sandbox persist across
execve and can change the behavior of newly-executed programs.
Unprivileged users are therefore only allowed to install such filters
- if no_new_privs is set.
+ if ``no_new_privs`` is set.
- - By itself, no_new_privs can be used to reduce the attack surface
+ - By itself, ``no_new_privs`` can be used to reduce the attack surface
available to an unprivileged user. If everything running with a
- given uid has no_new_privs set, then that uid will be unable to
+ given uid has ``no_new_privs`` set, then that uid will be unable to
escalate its privileges by directly attacking setuid, setgid, and
fcap-using binaries; it will need to compromise something without the
- no_new_privs bit set first.
+ ``no_new_privs`` bit set first.
In the future, other potentially dangerous kernel features could become
-available to unprivileged tasks if no_new_privs is set. In principle,
-several options to unshare(2) and clone(2) would be safe when
-no_new_privs is set, and no_new_privs + chroot is considerable less
+available to unprivileged tasks if ``no_new_privs`` is set. In principle,
+several options to ``unshare(2)`` and ``clone(2)`` would be safe when
+``no_new_privs`` is set, and ``no_new_privs`` + ``chroot`` is considerable less
dangerous than chroot by itself.
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/userspace-api/seccomp_filter.rst
index 1e469ef75778..f71eb5ef1f2d 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -1,8 +1,9 @@
- SECure COMPuting with filters
- =============================
+===========================================
+Seccomp BPF (SECure COMPuting with filters)
+===========================================
Introduction
-------------
+============
A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the process.
@@ -27,7 +28,7 @@ pointers which constrains all filters to solely evaluating the system
call arguments directly.
What it isn't
--------------
+=============
System call filtering isn't a sandbox. It provides a clearly defined
mechanism for minimizing the exposed kernel surface. It is meant to be
@@ -40,13 +41,13 @@ system calls in socketcall() is allowed, for instance) which could be
construed, incorrectly, as a more complete sandboxing solution.
Usage
------
+=====
An additional seccomp mode is added and is enabled using the same
prctl(2) call as the strict seccomp. If the architecture has
-CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
+``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below:
-PR_SET_SECCOMP:
+``PR_SET_SECCOMP``:
Now takes an additional argument which specifies a new filter
using a BPF program.
The BPF program will be executed over struct seccomp_data
@@ -55,24 +56,25 @@ PR_SET_SECCOMP:
acceptable values to inform the kernel which action should be
taken.
- Usage:
+ Usage::
+
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
The 'prog' argument is a pointer to a struct sock_fprog which
will contain the filter program. If the program is invalid, the
- call will return -1 and set errno to EINVAL.
+ call will return -1 and set errno to ``EINVAL``.
- If fork/clone and execve are allowed by @prog, any child
+ If ``fork``/``clone`` and ``execve`` are allowed by @prog, any child
processes will be constrained to the same filters and system
call ABI as the parent.
- Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
- run with CAP_SYS_ADMIN privileges in its namespace. If these are not
- true, -EACCES will be returned. This requirement ensures that filter
+ Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or
+ run with ``CAP_SYS_ADMIN`` privileges in its namespace. If these are not
+ true, ``-EACCES`` will be returned. This requirement ensures that filter
programs cannot be applied to child processes with greater privileges
than the task that installed them.
- Additionally, if prctl(2) is allowed by the attached filter,
+ Additionally, if ``prctl(2)`` is allowed by the attached filter,
additional filters may be layered on which will increase evaluation
time, but allow for further decreasing the attack surface during
execution of a process.
@@ -80,51 +82,52 @@ PR_SET_SECCOMP:
The above call returns 0 on success and non-zero on error.
Return values
--------------
+=============
+
A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest precedent value. (For example,
-SECCOMP_RET_KILL will always take precedence.)
+``SECCOMP_RET_KILL`` will always take precedence.)
In precedence order, they are:
-SECCOMP_RET_KILL:
+``SECCOMP_RET_KILL``:
Results in the task exiting immediately without executing the
- system call. The exit status of the task (status & 0x7f) will
- be SIGSYS, not SIGKILL.
+ system call. The exit status of the task (``status & 0x7f``) will
+ be ``SIGSYS``, not ``SIGKILL``.
-SECCOMP_RET_TRAP:
- Results in the kernel sending a SIGSYS signal to the triggering
- task without executing the system call. siginfo->si_call_addr
+``SECCOMP_RET_TRAP``:
+ Results in the kernel sending a ``SIGSYS`` signal to the triggering
+ task without executing the system call. ``siginfo->si_call_addr``
will show the address of the system call instruction, and
- siginfo->si_syscall and siginfo->si_arch will indicate which
+ ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which
syscall was attempted. The program counter will be as though
the syscall happened (i.e. it will not point to the syscall
instruction). The return value register will contain an arch-
dependent value -- if resuming execution, set it to something
sensible. (The architecture dependency is because replacing
- it with -ENOSYS could overwrite some useful information.)
+ it with ``-ENOSYS`` could overwrite some useful information.)
- The SECCOMP_RET_DATA portion of the return value will be passed
- as si_errno.
+ The ``SECCOMP_RET_DATA`` portion of the return value will be passed
+ as ``si_errno``.
- SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.
+ ``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``.
-SECCOMP_RET_ERRNO:
+``SECCOMP_RET_ERRNO``:
Results in the lower 16-bits of the return value being passed
to userland as the errno without executing the system call.
-SECCOMP_RET_TRACE:
+``SECCOMP_RET_TRACE``:
When returned, this value will cause the kernel to attempt to
- notify a ptrace()-based tracer prior to executing the system
- call. If there is no tracer present, -ENOSYS is returned to
+ notify a ``ptrace()``-based tracer prior to executing the system
+ call. If there is no tracer present, ``-ENOSYS`` is returned to
userland and the system call is not executed.
- A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
- using ptrace(PTRACE_SETOPTIONS). The tracer will be notified
- of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
+ A tracer will be notified if it requests ``PTRACE_O_TRACESECCOM``P
+ using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified
+ of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of
the BPF program return value will be available to the tracer
- via PTRACE_GETEVENTMSG.
+ via ``PTRACE_GETEVENTMSG``.
The tracer can skip the system call by changing the syscall number
to -1. Alternatively, the tracer can change the system call
@@ -138,19 +141,19 @@ SECCOMP_RET_TRACE:
allow use of ptrace, even of other sandboxed processes, without
extreme care; ptracers can use this mechanism to escape.)
-SECCOMP_RET_ALLOW:
+``SECCOMP_RET_ALLOW``:
Results in the system call being executed.
If multiple filters exist, the return value for the evaluation of a
given system call will always use the highest precedent value.
-Precedence is only determined using the SECCOMP_RET_ACTION mask. When
+Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When
multiple filters return values of the same precedence, only the
-SECCOMP_RET_DATA from the most recently installed filter will be
+``SECCOMP_RET_DATA`` from the most recently installed filter will be
returned.
Pitfalls
---------
+========
The biggest pitfall to avoid during use is filtering on system call
number without checking the architecture value. Why? On any
@@ -160,39 +163,40 @@ the numbers in the different calling conventions overlap, then checks in
the filters may be abused. Always check the arch value!
Example
--------
+=======
-The samples/seccomp/ directory contains both an x86-specific example
+The ``samples/seccomp/`` directory contains both an x86-specific example
and a more generic example of a higher level macro interface for BPF
program generation.
Adding architecture support
------------------------
+===========================
-See arch/Kconfig for the authoritative requirements. In general, if an
+See ``arch/Kconfig`` for the authoritative requirements. In general, if an
architecture supports both ptrace_event and seccomp, it will be able to
-support seccomp filter with minor fixup: SIGSYS support and seccomp return
-value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
+support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return
+value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``
to its arch-specific Kconfig.
Caveats
--------
+=======
The vDSO can cause some system calls to run entirely in userspace,
leading to surprises when you run programs on different machines that
fall back to real syscalls. To minimize these surprises on x86, make
sure you test with
-/sys/devices/system/clocksource/clocksource0/current_clocksource set to
-something like acpi_pm.
+``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to
+something like ``acpi_pm``.
On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
-legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities:
+legacy variants on vDSO calls.) Currently, emulated vsyscalls will
+honor seccomp, with a few oddities:
-- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
+- A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to
the vsyscall entry for the given call and not the address after the
'syscall' instruction. Any code which wants to restart the call
should be aware that (a) a ret instruction has been emulated and (b)
@@ -200,7 +204,7 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom
emulation security checks, making resuming the syscall mostly
pointless.
-- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
+- A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual,
but the syscall may not be changed to another system call using the
orig_rax register. It may only be changed to -1 order to skip the
currently emulated call. Any other change MAY terminate the process.
@@ -209,14 +213,14 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom
rip or rsp. (Do not rely on other changes terminating the process.
They might work. For example, on some kernels, choosing a syscall
that only exists in future kernels will be correctly emulated (by
- returning -ENOSYS).
+ returning ``-ENOSYS``).
-To detect this quirky behavior, check for addr & ~0x0C00 ==
-0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
-SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
+To detect this quirky behavior, check for ``addr & ~0x0C00 ==
+0xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use rip. For
+``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check any other
condition: future kernels may improve vsyscall emulation and current
kernels in vsyscall=native mode will behave differently, but the
-instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
+instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these
cases.
Note that modern systems are unlikely to use vsyscalls at all -- they
diff --git a/Documentation/userspace-api/unshare.rst b/Documentation/userspace-api/unshare.rst
index 737c192cf4e7..877e90a35238 100644
--- a/Documentation/userspace-api/unshare.rst
+++ b/Documentation/userspace-api/unshare.rst
@@ -107,7 +107,7 @@ the benefits of this new feature can exceed its cost.
unshare() reverses sharing that was done using clone(2) system call,
so unshare() should have a similar interface as clone(2). That is,
-since flags in clone(int flags, void *stack) specifies what should
+since flags in clone(int flags, void \*stack) specifies what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
diff --git a/Documentation/virtual/kvm/devices/arm-vgic-its.txt b/Documentation/virtual/kvm/devices/arm-vgic-its.txt
index 6081a5b7fc1e..eb06beb75960 100644
--- a/Documentation/virtual/kvm/devices/arm-vgic-its.txt
+++ b/Documentation/virtual/kvm/devices/arm-vgic-its.txt
@@ -32,7 +32,128 @@ Groups:
KVM_DEV_ARM_VGIC_CTRL_INIT
request the initialization of the ITS, no additional parameter in
kvm_device_attr.addr.
+
+ KVM_DEV_ARM_ITS_SAVE_TABLES
+ save the ITS table data into guest RAM, at the location provisioned
+ by the guest in corresponding registers/table entries.
+
+ The layout of the tables in guest memory defines an ABI. The entries
+ are laid out in little endian format as described in the last paragraph.
+
+ KVM_DEV_ARM_ITS_RESTORE_TABLES
+ restore the ITS tables from guest RAM to ITS internal structures.
+
+ The GICV3 must be restored before the ITS and all ITS registers but
+ the GITS_CTLR must be restored before restoring the ITS tables.
+
+ The GITS_IIDR read-only register must also be restored before
+ calling KVM_DEV_ARM_ITS_RESTORE_TABLES as the IIDR revision field
+ encodes the ABI revision.
+
+ The expected ordering when restoring the GICv3/ITS is described in section
+ "ITS Restore Sequence".
+
Errors:
-ENXIO: ITS not properly configured as required prior to setting
this attribute
-ENOMEM: Memory shortage when allocating ITS internal data
+ -EINVAL: Inconsistent restored data
+ -EFAULT: Invalid guest ram access
+ -EBUSY: One or more VCPUS are running
+
+ KVM_DEV_ARM_VGIC_GRP_ITS_REGS
+ Attributes:
+ The attr field of kvm_device_attr encodes the offset of the
+ ITS register, relative to the ITS control frame base address
+ (ITS_base).
+
+ kvm_device_attr.addr points to a __u64 value whatever the width
+ of the addressed register (32/64 bits). 64 bit registers can only
+ be accessed with full length.
+
+ Writes to read-only registers are ignored by the kernel except for:
+ - GITS_CREADR. It must be restored otherwise commands in the queue
+ will be re-executed after restoring CWRITER. GITS_CREADR must be
+ restored before restoring the GITS_CTLR which is likely to enable the
+ ITS. Also it must be restored after GITS_CBASER since a write to
+ GITS_CBASER resets GITS_CREADR.
+ - GITS_IIDR. The Revision field encodes the table layout ABI revision.
+ In the future we might implement direct injection of virtual LPIs.
+ This will require an upgrade of the table layout and an evolution of
+ the ABI. GITS_IIDR must be restored before calling
+ KVM_DEV_ARM_ITS_RESTORE_TABLES.
+
+ For other registers, getting or setting a register has the same
+ effect as reading/writing the register on real hardware.
+ Errors:
+ -ENXIO: Offset does not correspond to any supported register
+ -EFAULT: Invalid user pointer for attr->addr
+ -EINVAL: Offset is not 64-bit aligned
+ -EBUSY: one or more VCPUS are running
+
+ ITS Restore Sequence:
+ -------------------------
+
+The following ordering must be followed when restoring the GIC and the ITS:
+a) restore all guest memory and create vcpus
+b) restore all redistributors
+c) provide the its base address
+ (KVM_DEV_ARM_VGIC_GRP_ADDR)
+d) restore the ITS in the following order:
+ 1. Restore GITS_CBASER
+ 2. Restore all other GITS_ registers, except GITS_CTLR!
+ 3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES)
+ 4. Restore GITS_CTLR
+
+Then vcpus can be started.
+
+ ITS Table ABI REV0:
+ -------------------
+
+ Revision 0 of the ABI only supports the features of a virtual GICv3, and does
+ not support a virtual GICv4 with support for direct injection of virtual
+ interrupts for nested hypervisors.
+
+ The device table and ITT are indexed by the DeviceID and EventID,
+ respectively. The collection table is not indexed by CollectionID, and the
+ entries in the collection are listed in no particular order.
+ All entries are 8 bytes.
+
+ Device Table Entry (DTE):
+
+ bits: | 63| 62 ... 49 | 48 ... 5 | 4 ... 0 |
+ values: | V | next | ITT_addr | Size |
+
+ where;
+ - V indicates whether the entry is valid. If not, other fields
+ are not meaningful.
+ - next: equals to 0 if this entry is the last one; otherwise it
+ corresponds to the DeviceID offset to the next DTE, capped by
+ 2^14 -1.
+ - ITT_addr matches bits [51:8] of the ITT address (256 Byte aligned).
+ - Size specifies the supported number of bits for the EventID,
+ minus one
+
+ Collection Table Entry (CTE):
+
+ bits: | 63| 62 .. 52 | 51 ... 16 | 15 ... 0 |
+ values: | V | RES0 | RDBase | ICID |
+
+ where:
+ - V indicates whether the entry is valid. If not, other fields are
+ not meaningful.
+ - RES0: reserved field with Should-Be-Zero-or-Preserved behavior.
+ - RDBase is the PE number (GICR_TYPER.Processor_Number semantic),
+ - ICID is the collection ID
+
+ Interrupt Translation Entry (ITE):
+
+ bits: | 63 ... 48 | 47 ... 16 | 15 ... 0 |
+ values: | next | pINTID | ICID |
+
+ where:
+ - next: equals to 0 if this entry is the last one; otherwise it corresponds
+ to the EventID offset to the next ITE capped by 2^16 -1.
+ - pINTID is the physical LPI ID; if zero, it means the entry is not valid
+ and other fields are not meaningful.
+ - ICID is the collection ID
diff --git a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
index c1a24612c198..9293b45abdb9 100644
--- a/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
+++ b/Documentation/virtual/kvm/devices/arm-vgic-v3.txt
@@ -167,11 +167,17 @@ Groups:
KVM_DEV_ARM_VGIC_CTRL_INIT
request the initialization of the VGIC, no additional parameter in
kvm_device_attr.addr.
+ KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES
+ save all LPI pending bits into guest RAM pending tables.
+
+ The first kB of the pending table is not altered by this operation.
Errors:
-ENXIO: VGIC not properly configured as required prior to calling
this attribute
-ENODEV: no online VCPU
-ENOMEM: memory shortage when allocating vgic internal data
+ -EFAULT: Invalid guest ram access
+ -EBUSY: One or more VCPUS are running
KVM_DEV_ARM_VGIC_GRP_LEVEL_INFO
diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index 0f6d8477b66c..c491a1b82de2 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -295,7 +295,7 @@ kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
-# echo C0 > p0/cpus
+# echo F0 > p0/cpus
4) Locking between applications