Diffstat (limited to 'Documentation')
53 files changed, 2491 insertions, 503 deletions
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index e34cdeeeb9d4..a0ed87386639 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -28,6 +28,18 @@ Description: For more details refer Documentation/admin-guide/iostats.rst +What: /sys/block/<disk>/diskseq +Date: February 2021 +Contact: Matteo Croce <mcroce@microsoft.com> +Description: + The /sys/block/<disk>/diskseq files reports the disk + sequence number, which is a monotonically increasing + number assigned to every drive. + Some devices, like the loop device, refresh such number + every time the backing file is changed. + The value type is 64 bit unsigned. + + What: /sys/block/<disk>/<part>/stat Date: February 2008 Contact: Jerome Marchand <jmarchan@redhat.com> diff --git a/Documentation/ABI/testing/sysfs-block-device b/Documentation/ABI/testing/sysfs-block-device index aa0fb500e3c9..7ac7b19b2f72 100644 --- a/Documentation/ABI/testing/sysfs-block-device +++ b/Documentation/ABI/testing/sysfs-block-device @@ -55,6 +55,43 @@ Date: Oct, 2016 KernelVersion: v4.10 Contact: linux-ide@vger.kernel.org Description: - (RW) Write to the file to turn on or off the SATA ncq (native - command queueing) support. By default this feature is turned - off. + (RW) Write to the file to turn on or off the SATA NCQ (native + command queueing) priority support. By default this feature is + turned off. If the device does not support the SATA NCQ + priority feature, writing "1" to this file results in an error + (see ncq_prio_supported). + + +What: /sys/block/*/device/sas_ncq_prio_enable +Date: Oct, 2016 +KernelVersion: v4.10 +Contact: linux-ide@vger.kernel.org +Description: + (RW) This is the equivalent of the ncq_prio_enable attribute + file for SATA devices connected to a SAS host-bus-adapter + (HBA) implementing support for the SATA NCQ priority feature. 
+ This file does not exist if the HBA driver does not implement + support for the SATA NCQ priority feature, regardless of the + device support for this feature (see sas_ncq_prio_supported). + + +What: /sys/block/*/device/ncq_prio_supported +Date: Aug, 2021 +KernelVersion: v5.15 +Contact: linux-ide@vger.kernel.org +Description: + (RO) Indicates if the device supports the SATA NCQ (native + command queueing) priority feature. + + +What: /sys/block/*/device/sas_ncq_prio_supported +Date: Aug, 2021 +KernelVersion: v5.15 +Contact: linux-ide@vger.kernel.org +Description: + (RO) This is the equivalent of the ncq_prio_supported attribute + file for SATA devices connected to a SAS host-bus-adapter + (HBA) implementing support for the SATA NCQ priority feature. + This file does not exist if the HBA driver does not implement + support for the SATA NCQ priority feature, regardless of the + device support for this feature. diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-uncore b/Documentation/ABI/testing/sysfs-bus-event_source-devices-uncore new file mode 100644 index 000000000000..b56e8f019fd4 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-uncore @@ -0,0 +1,13 @@ +What: /sys/bus/event_source/devices/uncore_*/alias +Date: June 2021 +KernelVersion: 5.15 +Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org> +Description: Read-only. An attribute to describe the alias name of + the uncore PMU if an alias exists on some platforms. + The 'perf(1)' tool should treat both names the same. + They both can be used to access the uncore PMU. 
+ + Example: + + $ cat /sys/devices/uncore_cha_2/alias + uncore_type_0_2 diff --git a/Documentation/ABI/testing/sysfs-bus-platform b/Documentation/ABI/testing/sysfs-bus-platform index 194ca700e962..ff30728595ef 100644 --- a/Documentation/ABI/testing/sysfs-bus-platform +++ b/Documentation/ABI/testing/sysfs-bus-platform @@ -28,3 +28,17 @@ Description: value comes from an ACPI _PXM method or a similar firmware source. Initial users for this file would be devices like arm smmu which are populated by arm64 acpi_iort. + +What: /sys/bus/platform/devices/.../msi_irqs/ +Date: August 2021 +Contact: Barry Song <song.bao.hua@hisilicon.com> +Description: + The /sys/devices/.../msi_irqs directory contains a variable set + of files, with each file being named after a corresponding msi + irq vector allocated to that device. + +What: /sys/bus/platform/devices/.../msi_irqs/<N> +Date: August 2021 +Contact: Barry Song <song.bao.hua@hisilicon.com> +Description: + This attribute will show "msi" if <N> is a valid msi irq diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst index 11cdab037bff..eeb351296df1 100644 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst @@ -112,6 +112,35 @@ on PowerPC. The ``smp_mb__after_unlock_lock()`` invocations prevent this ``WARN_ON()`` from triggering. ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But the chain of rcu_node-structure lock acquisitions guarantees | +| that new readers will see all of the updater's pre-grace-period | +| accesses and also guarantees that the updater's post-grace-period | +| accesses will see all of the old reader's accesses. So why do we | +| need all of those calls to smp_mb__after_unlock_lock()? 
| ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because we must provide ordering for RCU's polling grace-period | +| primitives, for example, get_state_synchronize_rcu() and | +| poll_state_synchronize_rcu(). Consider this code:: | +| | +| CPU 0 CPU 1 | +| ---- ---- | +| WRITE_ONCE(X, 1) WRITE_ONCE(Y, 1) | +| g = get_state_synchronize_rcu() smp_mb() | +| while (!poll_state_synchronize_rcu(g)) r1 = READ_ONCE(X) | +| continue; | +| r0 = READ_ONCE(Y) | +| | +| RCU guarantees that the outcome r0 == 0 && r1 == 0 will not | +| happen, even if CPU 1 is in an RCU extended quiescent state | +| (idle or offline) and thus won't interact directly with the RCU | +| core processing at all. | ++-----------------------------------------------------------------------+ + This approach must be extended to include idle CPUs, which need RCU's grace-period memory ordering guarantee to extend to any RCU read-side critical sections preceding and following the current diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index 38a39476fc24..45278e2974c0 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -362,9 +362,8 @@ do_something_gp() uses rcu_dereference() to fetch from ``gp``: 12 } The rcu_dereference() uses volatile casts and (for DEC Alpha) memory -barriers in the Linux kernel. Should a `high-quality implementation of -C11 ``memory_order_consume`` -[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__ +barriers in the Linux kernel. Should a |high-quality implementation of +C11 memory_order_consume [PDF]|_ ever appear, then rcu_dereference() could be implemented as a ``memory_order_consume`` load. 
Regardless of the exact implementation, a pointer fetched by rcu_dereference() may not be used outside of the @@ -374,6 +373,9 @@ element has been passed from RCU to some other synchronization mechanism, most commonly locking or `reference counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__. +.. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF] +.. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf + In short, updaters use rcu_assign_pointer() and readers use rcu_dereference(), and these two RCU API elements work together to ensure that readers have a consistent view of newly added data elements. diff --git a/Documentation/RCU/checklist.rst b/Documentation/RCU/checklist.rst index 01cc21f17f7b..f4545b7c9a63 100644 --- a/Documentation/RCU/checklist.rst +++ b/Documentation/RCU/checklist.rst @@ -37,7 +37,7 @@ over a rather long period of time, but improvements are always welcome! 1. Does the update code have proper mutual exclusion? - RCU does allow -readers- to run (almost) naked, but -writers- must + RCU does allow *readers* to run (almost) naked, but *writers* must still use some sort of mutual exclusion, such as: a. locking, @@ -73,7 +73,7 @@ over a rather long period of time, but improvements are always welcome! critical section is every bit as bad as letting them leak out from under a lock. Unless, of course, you have arranged some other means of protection, such as a lock or a reference count - -before- letting them out of the RCU read-side critical section. + *before* letting them out of the RCU read-side critical section. 3. Does the update code tolerate concurrent accesses? @@ -101,7 +101,7 @@ over a rather long period of time, but improvements are always welcome! c. Make updates appear atomic to readers. 
For example, pointer updates to properly aligned fields will appear atomic, as will individual atomic primitives. - Sequences of operations performed under a lock will -not- + Sequences of operations performed under a lock will *not* appear to be atomic to RCU readers, nor will sequences of multiple atomic primitives. @@ -333,7 +333,7 @@ over a rather long period of time, but improvements are always welcome! for example) may be omitted. 10. Conversely, if you are in an RCU read-side critical section, - and you don't hold the appropriate update-side lock, you -must- + and you don't hold the appropriate update-side lock, you *must* use the "_rcu()" variants of the list macros. Failing to do so will break Alpha, cause aggressive compilers to generate bad code, and confuse people trying to read your code. @@ -359,12 +359,12 @@ over a rather long period of time, but improvements are always welcome! callback pending, then that RCU callback will execute on some surviving CPU. (If this was not the case, a self-spawning RCU callback would prevent the victim CPU from ever going offline.) - Furthermore, CPUs designated by rcu_nocbs= might well -always- + Furthermore, CPUs designated by rcu_nocbs= might well *always* have their RCU callbacks executed on some other CPUs, in fact, for some real-time workloads, this is the whole point of using the rcu_nocbs= kernel boot parameter. -13. Unlike other forms of RCU, it -is- permissible to block in an +13. Unlike other forms of RCU, it *is* permissible to block in an SRCU read-side critical section (demarked by srcu_read_lock() and srcu_read_unlock()), hence the "SRCU": "sleepable RCU". Please note that if you don't need to sleep in read-side critical @@ -411,16 +411,16 @@ over a rather long period of time, but improvements are always welcome! 14. The whole point of call_rcu(), synchronize_rcu(), and friends is to wait until all pre-existing readers have finished before carrying out some otherwise-destructive operation. 
It is - therefore critically important to -first- remove any path + therefore critically important to *first* remove any path that readers can follow that could be affected by the - destructive operation, and -only- -then- invoke call_rcu(), + destructive operation, and *only then* invoke call_rcu(), synchronize_rcu(), or friends. Because these primitives only wait for pre-existing readers, it is the caller's responsibility to guarantee that any subsequent readers will execute safely. -15. The various RCU read-side primitives do -not- necessarily contain +15. The various RCU read-side primitives do *not* necessarily contain memory barriers. You should therefore plan for the CPU and the compiler to freely reorder code into and out of RCU read-side critical sections. It is the responsibility of the @@ -459,8 +459,8 @@ over a rather long period of time, but improvements are always welcome! pass in a function defined within a loadable module, then it in necessary to wait for all pending callbacks to be invoked after the last invocation and before unloading that module. Note that - it is absolutely -not- sufficient to wait for a grace period! - The current (say) synchronize_rcu() implementation is -not- + it is absolutely *not* sufficient to wait for a grace period! + The current (say) synchronize_rcu() implementation is *not* guaranteed to wait for callbacks registered on other CPUs. Or even on the current CPU if that CPU recently went offline and came back online. @@ -470,7 +470,7 @@ over a rather long period of time, but improvements are always welcome! - call_rcu() -> rcu_barrier() - call_srcu() -> srcu_barrier() - However, these barrier functions are absolutely -not- guaranteed + However, these barrier functions are absolutely *not* guaranteed to wait for a grace period. In fact, if there are no call_rcu() callbacks waiting anywhere in the system, rcu_barrier() is within its rights to return immediately. 
diff --git a/Documentation/RCU/rcu_dereference.rst b/Documentation/RCU/rcu_dereference.rst index f3e587acb4de..0b418a5b243c 100644 --- a/Documentation/RCU/rcu_dereference.rst +++ b/Documentation/RCU/rcu_dereference.rst @@ -43,7 +43,7 @@ Follow these rules to keep your RCU code working properly: - Set bits and clear bits down in the must-be-zero low-order bits of that pointer. This clearly means that the pointer must have alignment constraints, for example, this does - -not- work in general for char* pointers. + *not* work in general for char* pointers. - XOR bits to translate pointers, as is done in some classic buddy-allocator algorithms. @@ -174,7 +174,7 @@ Follow these rules to keep your RCU code working properly: Please see the "CONTROL DEPENDENCIES" section of Documentation/memory-barriers.txt for more details. - - The pointers are not equal -and- the compiler does + - The pointers are not equal *and* the compiler does not have enough information to deduce the value of the pointer. Note that the volatile cast in rcu_dereference() will normally prevent the compiler from knowing too much. @@ -360,7 +360,7 @@ in turn destroying the ordering between this load and the loads of the return values. This can result in "p->b" returning pre-initialization garbage values. -In short, rcu_dereference() is -not- optional when you are going to +In short, rcu_dereference() is *not* optional when you are going to dereference the resulting pointer. diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst index 7148e9be08c3..5036df24ae61 100644 --- a/Documentation/RCU/stallwarn.rst +++ b/Documentation/RCU/stallwarn.rst @@ -32,7 +32,7 @@ warnings: - Booting Linux using a console connection that is too slow to keep up with the boot-time console-message rate. 
For example, - a 115Kbaud serial console can be -way- too slow to keep up + a 115Kbaud serial console can be *way* too slow to keep up with boot-time message rates, and will frequently result in RCU CPU stall warning messages. Especially if you have added debug printk()s. @@ -105,7 +105,7 @@ warnings: leading the realization that the CPU had failed. The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning. -Note that SRCU does -not- have CPU stall warnings. Please note that +Note that SRCU does *not* have CPU stall warnings. Please note that RCU only detects CPU stalls when there is a grace period in progress. No grace period, no CPU stall warnings. @@ -145,7 +145,7 @@ CONFIG_RCU_CPU_STALL_TIMEOUT this parameter is checked only at the beginning of a cycle. So if you are 10 seconds into a 40-second stall, setting this sysfs parameter to (say) five will shorten the timeout for the - -next- stall, or the following warning for the current stall + *next* stall, or the following warning for the current stall (assuming the stall lasts long enough). It will not affect the timing of the next warning for the current stall. @@ -189,8 +189,8 @@ rcupdate.rcu_task_stall_timeout Interpreting RCU's CPU Stall-Detector "Splats" ============================================== -For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling, -it will print a message similar to the following:: +For non-RCU-tasks flavors of RCU, when a CPU detects that some other +CPU is stalling, it will print a message similar to the following:: INFO: rcu_sched detected stalls on CPUs/tasks: 2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0 @@ -202,8 +202,10 @@ causing stalls, and that the stall was affecting RCU-sched. This message will normally be followed by stack dumps for each CPU. Please note that PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, and that the tasks will be indicated by PID, for example, "P3421". 
It is even -possible for an rcu_state stall to be caused by both CPUs -and- tasks, +possible for an rcu_state stall to be caused by both CPUs *and* tasks, in which case the offending CPUs and tasks will all be called out in the list. +In some cases, CPUs will detect themselves stalling, which will result +in a self-detected stall. CPU 2's "(3 GPs behind)" indicates that this CPU has not interacted with the RCU core for the past three grace periods. In contrast, CPU 16's "(0 @@ -224,7 +226,7 @@ is the number that had executed since boot at the time that this CPU last noted the beginning of a grace period, which might be the current (stalled) grace period, or it might be some earlier grace period (for example, if the CPU might have been in dyntick-idle mode for an extended -time period. The number after the "/" is the number that have executed +time period). The number after the "/" is the number that have executed since boot until the current time. If this latter number stays constant across repeated stall-warning messages, it is possible that RCU's softirq handlers are no longer able to execute on this CPU. This can happen if @@ -283,7 +285,8 @@ If the relevant grace-period kthread has been unable to run prior to the stall warning, as was the case in the "All QSes seen" line above, the following additional line is printed:: - kthread starved for 23807 jiffies! g7075 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 ->cpu=5 + rcu_sched kthread starved for 23807 jiffies! g7075 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 ->cpu=5 + Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. Starving the grace-period kthreads of CPU time can of course result in RCU CPU stall warnings even when all CPUs and tasks have passed @@ -313,15 +316,21 @@ is the current ``TIMER_SOFTIRQ`` count on cpu 4. If this value does not change on successive RCU CPU stall warnings, there is further reason to suspect a timer problem. 
+These messages are usually followed by stack dumps of the CPUs and tasks +involved in the stall. These stack traces can help you locate the cause +of the stall, keeping in mind that the CPU detecting the stall will have +an interrupt frame that is mainly devoted to detecting the stall. + Multiple Warnings From One Stall ================================ -If a stall lasts long enough, multiple stall-warning messages will be -printed for it. The second and subsequent messages are printed at +If a stall lasts long enough, multiple stall-warning messages will +be printed for it. The second and subsequent messages are printed at longer intervals, so that the time between (say) the first and second message will be about three times the interval between the beginning -of the stall and the first message. +of the stall and the first message. It can be helpful to compare the +stack dumps for the different messages for the same stalled grace period. Stall Warnings for Expedited Grace Periods diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index f12cda55538b..8cbc711cda93 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -16,3 +16,4 @@ are configurable at compile, boot or run time. multihit.rst special-register-buffer-data-sampling.rst core-scheduling.rst + l1d_flush.rst diff --git a/Documentation/admin-guide/hw-vuln/l1d_flush.rst b/Documentation/admin-guide/hw-vuln/l1d_flush.rst new file mode 100644 index 000000000000..210020bc3f56 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/l1d_flush.rst @@ -0,0 +1,69 @@ +L1D Flushing +============ + +With an increasing number of vulnerabilities being reported around data +leaks from the Level 1 Data cache (L1D) the kernel provides an opt-in +mechanism to flush the L1D cache on context switch. + +This mechanism can be used to address e.g. CVE-2020-0550. 
For applications +the mechanism keeps them safe from vulnerabilities, related to leaks +(snooping of) from the L1D cache. + + +Related CVEs +------------ +The following CVEs can be addressed by this +mechanism + + ============= ======================== ================== + CVE-2020-0550 Improper Data Forwarding OS related aspects + ============= ======================== ================== + +Usage Guidelines +---------------- + +Please see document: :ref:`Documentation/userspace-api/spec_ctrl.rst +<set_spec_ctrl>` for details. + +**NOTE**: The feature is disabled by default, applications need to +specifically opt into the feature to enable it. + +Mitigation +---------- + +When PR_SET_L1D_FLUSH is enabled for a task a flush of the L1D cache is +performed when the task is scheduled out and the incoming task belongs to a +different process and therefore to a different address space. + +If the underlying CPU supports L1D flushing in hardware, the hardware +mechanism is used, software fallback for the mitigation, is not supported. + +Mitigation control on the kernel command line +--------------------------------------------- + +The kernel command line allows to control the L1D flush mitigations at boot +time with the option "l1d_flush=". The valid arguments for this option are: + + ============ ============================================================= + on Enables the prctl interface, applications trying to use + the prctl() will fail with an error if l1d_flush is not + enabled + ============ ============================================================= + +By default the mechanism is disabled. + +Limitations +----------- + +The mechanism does not mitigate L1D data leaks between tasks belonging to +different processes which are concurrently executing on sibling threads of +a physical CPU core when SMT is enabled on the system. + +This can be addressed by controlled placement of processes on physical CPU +cores or by disabling SMT. 
See the relevant chapter in the L1TF mitigation +document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`. + +**NOTE** : The opt-in of a task for L1D flushing works only when the task's +affinity is limited to cores running in non-SMT mode. If a task which +requested L1D flushing is scheduled on a SMT-enabled core the kernel sends +a SIGBUS to the task. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bdb22006f713..56bd70ee82fa 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2421,6 +2421,23 @@ feature (tagged TLBs) on capable Intel chips. Default is 1 (enabled) + l1d_flush= [X86,INTEL] + Control mitigation for L1D based snooping vulnerability. + + Certain CPUs are vulnerable to an exploit against CPU + internal buffers which can forward information to a + disclosure gadget under certain conditions. + + In vulnerable processors, the speculatively + forwarded data can be used in a cache side channel + attack, to access data to which the attacker does + not have direct access. + + This parameter controls the mitigation. The + options are: + + on - enable the interface for the mitigation + l1tf= [X86] Control mitigation of the L1TF vulnerability on affected CPUs @@ -4777,7 +4794,7 @@ reboot= [KNL] Format (x86 or x86_64): - [w[arm] | c[old] | h[ard] | s[oft] | g[pio]] \ + [w[arm] | c[old] | h[ard] | s[oft] | g[pio]] | d[efault] \ [[,]s[mp]#### \ [[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \ [[,]f[orce] diff --git a/Documentation/atomic_t.txt b/Documentation/atomic_t.txt index 0f1fdedf36bb..0f1ffa03db09 100644 --- a/Documentation/atomic_t.txt +++ b/Documentation/atomic_t.txt @@ -271,3 +271,97 @@ WRITE_ONCE. Thus: SC *y, t; is allowed. 
+ + +CMPXCHG vs TRY_CMPXCHG +---------------------- + + int atomic_cmpxchg(atomic_t *ptr, int old, int new); + bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new); + +Both provide the same functionality, but try_cmpxchg() can lead to more +compact code. The functions relate like: + + bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new) + { + int ret, old = *oldp; + ret = atomic_cmpxchg(ptr, old, new); + if (ret != old) + *oldp = ret; + return ret == old; + } + +and: + + int atomic_cmpxchg(atomic_t *ptr, int old, int new) + { + (void)atomic_try_cmpxchg(ptr, &old, new); + return old; + } + +Usage: + + old = atomic_read(&v); old = atomic_read(&v); + for (;;) { do { + new = func(old); new = func(old); + tmp = atomic_cmpxchg(&v, old, new); } while (!atomic_try_cmpxchg(&v, &old, new)); + if (tmp == old) + break; + old = tmp; + } + +NB. try_cmpxchg() also generates better code on some platforms (notably x86) +where the function more closely matches the hardware instruction. + + +FORWARD PROGRESS +---------------- + +In general strong forward progress is expected of all unconditional atomic +operations -- those in the Arithmetic and Bitwise classes and xchg(). However +a fair amount of code also requires forward progress from the conditional +atomic operations. + +Specifically 'simple' cmpxchg() loops are expected to not starve one another +indefinitely. However, this is not evident on LL/SC architectures, because +while an LL/SC architecure 'can/should/must' provide forward progress +guarantees between competing LL/SC sections, such a guarantee does not +transfer to cmpxchg() implemented using LL/SC. 
Consider: + + old = atomic_read(&v); + do { + new = func(old); + } while (!atomic_try_cmpxchg(&v, &old, new)); + +which on LL/SC becomes something like: + + old = atomic_read(&v); + do { + new = func(old); + } while (!({ + volatile asm ("1: LL %[oldval], %[v]\n" + " CMP %[oldval], %[old]\n" + " BNE 2f\n" + " SC %[new], %[v]\n" + " BNE 1b\n" + "2:\n" + : [oldval] "=&r" (oldval), [v] "m" (v) + : [old] "r" (old), [new] "r" (new) + : "memory"); + success = (oldval == old); + if (!success) + old = oldval; + success; })); + +However, even the forward branch from the failed compare can cause the LL/SC +to fail on some architectures, let alone whatever the compiler makes of the C +loop body. As a result there is no guarantee what so ever the cacheline +containing @v will stay on the local CPU and progress is made. + +Even native CAS architectures can fail to provide forward progress for their +primitive (See Sparc64 for an example). + +Such implementations are strongly encouraged to add exponential backoff loops +to a failed CAS in order to ensure some progress. Affected architectures are +also strongly encouraged to inspect/audit the atomic fallbacks, refcount_t and +their locking primitives. diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index a2c96bec5ee8..1122cd3044c0 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -220,7 +220,7 @@ goes online (offline) and during initial setup (shutdown) of the driver. However each registration and removal function is also available with a ``_nocalls`` suffix which does not invoke the provided callbacks if the invocation of the callbacks is not desired. During the manual setup (or teardown) the functions -``get_online_cpus()`` and ``put_online_cpus()`` should be used to inhibit CPU +``cpus_read_lock()`` and ``cpus_read_unlock()`` should be used to inhibit CPU hotplug operations. 
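Note on the atomic_t.txt hunk above: the relationship between cmpxchg() and try_cmpxchg() described there can be tried out in userspace. What follows is a minimal sketch, not kernel code. It assumes C11 `<stdatomic.h>`, whose `atomic_compare_exchange_strong()` already has try_cmpxchg() semantics: on failure it writes the observed value back through the expected-value pointer. The names `try_cmpxchg`, `cmpxchg`, `func`, `update_with_cmpxchg` and `update_with_try_cmpxchg` are hypothetical stand-ins chosen for illustration, and `func` is an arbitrary update rule.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for atomic_try_cmpxchg().
 * C11 compare-exchange stores the value it observed into *oldp
 * when the comparison fails, which is the try_cmpxchg() contract.
 */
bool try_cmpxchg(atomic_int *ptr, int *oldp, int new_val)
{
	return atomic_compare_exchange_strong(ptr, oldp, new_val);
}

/*
 * Hypothetical stand-in for atomic_cmpxchg(), built on top of
 * try_cmpxchg() in the same way the hunk above describes:
 * return the value that was found in *ptr.
 */
int cmpxchg(atomic_int *ptr, int old, int new_val)
{
	(void)try_cmpxchg(ptr, &old, new_val);
	return old;
}

/* Arbitrary example update: double the value, saturating at 100. */
int func(int old)
{
	return old * 2 > 100 ? 100 : old * 2;
}

/* Update loop written with cmpxchg(): needs a temporary and a reload. */
int update_with_cmpxchg(atomic_int *v)
{
	int old = atomic_load(v);
	int new_val, tmp;

	for (;;) {
		new_val = func(old);
		tmp = cmpxchg(v, old, new_val);
		if (tmp == old)
			break;
		old = tmp;	/* retry with the value we actually saw */
	}
	return new_val;
}

/* Same loop with try_cmpxchg(): 'old' is refreshed on failure for us. */
int update_with_try_cmpxchg(atomic_int *v)
{
	int old = atomic_load(v);
	int new_val;

	do {
		new_val = func(old);
	} while (!try_cmpxchg(v, &old, new_val));
	return new_val;
}
```

Both loops compute the same result. The try_cmpxchg() form is shorter because the failed compare already hands back the current value. This sketch says nothing about forward progress under contention, which is the separate concern raised in the FORWARD PROGRESS text of the same hunk.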
diff --git a/Documentation/core-api/irq/irq-domain.rst b/Documentation/core-api/irq/irq-domain.rst index 53283b3729a1..6979b4af2c1f 100644 --- a/Documentation/core-api/irq/irq-domain.rst +++ b/Documentation/core-api/irq/irq-domain.rst @@ -55,8 +55,24 @@ exist then it will allocate a new Linux irq_desc, associate it with the hwirq, and call the .map() callback so the driver can perform any required hardware setup. -When an interrupt is received, irq_find_mapping() function should -be used to find the Linux IRQ number from the hwirq number. +Once a mapping has been established, it can be retrieved or used via a +variety of methods: + +- irq_resolve_mapping() returns a pointer to the irq_desc structure + for a given domain and hwirq number, and NULL if there was no + mapping. +- irq_find_mapping() returns a Linux IRQ number for a given domain and + hwirq number, and 0 if there was no mapping +- irq_linear_revmap() is now identical to irq_find_mapping(), and is + deprecated +- generic_handle_domain_irq() handles an interrupt described by a + domain and a hwirq number +- handle_domain_irq() does the same thing for root interrupt + controllers and deals with the set_irq_reg()/irq_enter() sequences + that most architecture requires + +Note that irq domain lookups must happen in contexts that are +compatible with a RCU read-side critical section. The irq_create_mapping() function must be called *atleast once* before any call to irq_find_mapping(), lest the descriptor will not @@ -137,7 +153,9 @@ required. Calling irq_create_direct_mapping() will allocate a Linux IRQ number and call the .map() callback so that driver can program the Linux IRQ number into the hardware. -Most drivers cannot use this mapping. +Most drivers cannot use this mapping, and it is now gated on the +CONFIG_IRQ_DOMAIN_NOMAP option. Please refrain from introducing new +users of this API. Legacy ------ @@ -157,6 +175,10 @@ for IRQ numbers that are passed to struct device registrations. 
In that case the Linux IRQ numbers cannot be dynamically assigned and the legacy mapping should be used. +As the name implies, the *_legacy() functions are deprecated and only +exist to ease the support of ancient platforms. No new users should be +added. + The legacy map assumes a contiguous range of IRQ numbers has already been allocated for the controller and that the IRQ number can be calculated by adding a fixed offset to the hwirq number, and diff --git a/Documentation/cpu-freq/cpu-drivers.rst b/Documentation/cpu-freq/cpu-drivers.rst index d84ededb66f9..3b32336a7803 100644 --- a/Documentation/cpu-freq/cpu-drivers.rst +++ b/Documentation/cpu-freq/cpu-drivers.rst @@ -75,9 +75,6 @@ And optionally .resume - A pointer to a per-policy resume function which is called with interrupts disabled and _before_ the governor is started again. - .ready - A pointer to a per-policy ready function which is called after - the policy is fully initialized. - .attr - A pointer to a NULL-terminated list of "struct freq_attr" which allow to export values to sysfs. diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek-hw.yaml b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek-hw.yaml new file mode 100644 index 000000000000..9cd42a64b13e --- /dev/null +++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek-hw.yaml @@ -0,0 +1,70 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/cpufreq/cpufreq-mediatek-hw.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: MediaTek's CPUFREQ Bindings + +maintainers: + - Hector Yuan <hector.yuan@mediatek.com> + +description: + CPUFREQ HW is a hardware engine used by MediaTek SoCs to + manage frequency in hardware. It is capable of controlling + frequency for multiple clusters. 
+ +properties: + compatible: + const: mediatek,cpufreq-hw + + reg: + minItems: 1 + maxItems: 2 + description: + Addresses and sizes for the memory of the HW bases in + each frequency domain. Each entry corresponds to + a register bank for each frequency domain present. + + "#performance-domain-cells": + description: + Number of cells in a performance domain specifier. + Set const to 1 here for nodes providing multiple + performance domains. + const: 1 + +required: + - compatible + - reg + - "#performance-domain-cells" + +additionalProperties: false + +examples: + - | + cpus { + #address-cells = <1>; + #size-cells = <0>; + + cpu0: cpu@0 { + device_type = "cpu"; + compatible = "arm,cortex-a55"; + enable-method = "psci"; + performance-domains = <&performance 0>; + reg = <0x000>; + }; + }; + + /* ... */ + + soc { + #address-cells = <2>; + #size-cells = <2>; + + performance: performance-controller@11bc00 { + compatible = "mediatek,cpufreq-hw"; + reg = <0 0x0011bc10 0 0x120>, <0 0x0011bd30 0 0x120>; + + #performance-domain-cells = <1>; + }; + }; diff --git a/Documentation/devicetree/bindings/fsi/ibm,fsi2spi.yaml b/Documentation/devicetree/bindings/fsi/ibm,fsi2spi.yaml index e425278653f5..e2ca0b000471 100644 --- a/Documentation/devicetree/bindings/fsi/ibm,fsi2spi.yaml +++ b/Documentation/devicetree/bindings/fsi/ibm,fsi2spi.yaml @@ -19,7 +19,6 @@ properties: compatible: enum: - ibm,fsi2spi - - ibm,fsi2spi-restricted reg: items: diff --git a/Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml b/Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml index d993e002cebe..0d62c28fb58d 100644 --- a/Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml +++ b/Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml @@ -22,7 +22,10 @@ properties: maxItems: 1 clocks: - maxItems: 1 + minItems: 1 + items: + - description: APB interface clock source + - description: GPIO debounce reference clock source gpio-controller: true diff --git 
a/Documentation/devicetree/bindings/power/supply/battery.yaml b/Documentation/devicetree/bindings/power/supply/battery.yaml index c3b4b7543591..d56ac484fec5 100644 --- a/Documentation/devicetree/bindings/power/supply/battery.yaml +++ b/Documentation/devicetree/bindings/power/supply/battery.yaml @@ -31,6 +31,20 @@ properties: compatible: const: simple-battery + device-chemistry: + description: This describes the chemical technology of the battery. + oneOf: + - const: nickel-cadmium + - const: nickel-metal-hydride + - const: lithium-ion + description: This is a blanket type for all lithium-ion batteries, + including those below. If possible, a precise compatible string + from below should be used, but sometimes it is unknown which specific + lithium ion battery is employed and this wide compatible can be used. + - const: lithium-ion-polymer + - const: lithium-ion-iron-phosphate + - const: lithium-ion-manganese-oxide + over-voltage-threshold-microvolt: description: battery over-voltage limit diff --git a/Documentation/devicetree/bindings/power/supply/maxim,max17042.yaml b/Documentation/devicetree/bindings/power/supply/maxim,max17042.yaml index c70f05ea6d27..971b53c58cc6 100644 --- a/Documentation/devicetree/bindings/power/supply/maxim,max17042.yaml +++ b/Documentation/devicetree/bindings/power/supply/maxim,max17042.yaml @@ -19,12 +19,15 @@ properties: - maxim,max17047 - maxim,max17050 - maxim,max17055 + - maxim,max77849-battery reg: maxItems: 1 interrupts: maxItems: 1 + description: | + The ALRT pin, an open-drain interrupt. 
maxim,rsns-microohm: $ref: /schemas/types.yaml#/definitions/uint32 diff --git a/Documentation/devicetree/bindings/power/supply/mt6360_charger.yaml b/Documentation/devicetree/bindings/power/supply/mt6360_charger.yaml new file mode 100644 index 000000000000..b89b15a5bfa4 --- /dev/null +++ b/Documentation/devicetree/bindings/power/supply/mt6360_charger.yaml @@ -0,0 +1,48 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/power/supply/mt6360_charger.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Battery charger driver for MT6360 PMIC from MediaTek Integrated. + +maintainers: + - Gene Chen <gene_chen@richtek.com> + +description: | + This module is part of the MT6360 MFD device. + Provides Battery Charger, Boost for OTG devices and BC1.2 detection. + +properties: + compatible: + const: mediatek,mt6360-chg + + richtek,vinovp-microvolt: + description: Maximum CHGIN regulation voltage in uV. + enum: [ 5500000, 6500000, 11000000, 14500000 ] + + + usb-otg-vbus-regulator: + type: object + description: OTG boost regulator. + $ref: /schemas/regulator/regulator.yaml# + +required: + - compatible + +additionalProperties: false + +examples: + - | + mt6360_charger: charger { + compatible = "mediatek,mt6360-chg"; + richtek,vinovp-microvolt = <14500000>; + + otg_vbus_regulator: usb-otg-vbus-regulator { + regulator-compatible = "usb-otg-vbus"; + regulator-name = "usb-otg-vbus"; + regulator-min-microvolt = <4425000>; + regulator-max-microvolt = <5825000>; + }; + }; +... 
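Reviewer note on the mt6360_charger.yaml binding above: the `richtek,vinovp-microvolt` property is restricted to four values by its `enum`. The sketch below mirrors that constraint in plain Python as a quick sanity check. The helper name `check_vinovp` is hypothetical and is not part of any kernel or dt-schema tooling; real validation is done by the devicetree schema checker.

```python
# Allowed CHGIN regulation voltages in microvolts, copied from the
# enum of richtek,vinovp-microvolt in the binding above.
VINOVP_MICROVOLT = (5500000, 6500000, 11000000, 14500000)


def check_vinovp(value_uv):
    """Return True if value_uv is one of the voltages the binding lists.

    Hypothetical helper for illustration only; it is an assumption of
    this note, not an existing API.
    """
    return value_uv in VINOVP_MICROVOLT


if __name__ == "__main__":
    # The binding's own example node uses 14500000.
    print(check_vinovp(14500000))
    # A value that is not in the enum.
    print(check_vinovp(12000000))
```

The example node in the binding sets `richtek,vinovp-microvolt = <14500000>`, which this check accepts; any value outside the four listed ones would be rejected.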
diff --git a/Documentation/devicetree/bindings/power/supply/summit,smb347-charger.yaml b/Documentation/devicetree/bindings/power/supply/summit,smb347-charger.yaml index 983fc215c1e5..20862cdfc116 100644 --- a/Documentation/devicetree/bindings/power/supply/summit,smb347-charger.yaml +++ b/Documentation/devicetree/bindings/power/supply/summit,smb347-charger.yaml @@ -73,6 +73,26 @@ properties: - 1 # SMB3XX_SOFT_TEMP_COMPENSATE_CURRENT Current compensation - 2 # SMB3XX_SOFT_TEMP_COMPENSATE_VOLTAGE Voltage compensation + summit,inok-polarity: + description: | + Polarity of INOK signal indicating presence of external power supply. + $ref: /schemas/types.yaml#/definitions/uint32 + enum: + - 0 # SMB3XX_SYSOK_INOK_ACTIVE_LOW + - 1 # SMB3XX_SYSOK_INOK_ACTIVE_HIGH + + usb-vbus: + $ref: "../../regulator/regulator.yaml#" + type: object + + properties: + summit,needs-inok-toggle: + type: boolean + description: INOK signal is fixed and polarity needs to be toggled + in order to enable/disable output mode. 
+ + unevaluatedProperties: false + allOf: - if: properties: @@ -134,6 +154,7 @@ examples: reg = <0x7f>; summit,enable-charge-control = <SMB3XX_CHG_ENABLE_PIN_ACTIVE_HIGH>; + summit,inok-polarity = <SMB3XX_SYSOK_INOK_ACTIVE_LOW>; summit,chip-temperature-threshold-celsius = <110>; summit,mains-current-limit-microamp = <2000000>; summit,usb-current-limit-microamp = <500000>; @@ -141,6 +162,15 @@ examples: summit,enable-mains-charging; monitored-battery = <&battery>; + + usb-vbus { + regulator-name = "usb_vbus"; + regulator-min-microvolt = <5000000>; + regulator-max-microvolt = <5000000>; + regulator-min-microamp = <750000>; + regulator-max-microamp = <750000>; + summit,needs-inok-toggle; + }; }; }; diff --git a/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-ac-power-supply.yaml b/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-ac-power-supply.yaml index dcda6660b8ed..de6a23aee977 100644 --- a/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-ac-power-supply.yaml +++ b/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-ac-power-supply.yaml @@ -21,10 +21,13 @@ allOf: properties: compatible: - enum: - - x-powers,axp202-ac-power-supply - - x-powers,axp221-ac-power-supply - - x-powers,axp813-ac-power-supply + oneOf: + - const: x-powers,axp202-ac-power-supply + - const: x-powers,axp221-ac-power-supply + - items: + - const: x-powers,axp803-ac-power-supply + - const: x-powers,axp813-ac-power-supply + - const: x-powers,axp813-ac-power-supply required: - compatible diff --git a/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-battery-power-supply.yaml b/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-battery-power-supply.yaml index 86e8a713d4e2..d055428ae39f 100644 --- a/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-battery-power-supply.yaml +++ b/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-battery-power-supply.yaml @@ -19,10 +19,14 @@ allOf: properties: 
compatible: - enum: - - x-powers,axp209-battery-power-supply - - x-powers,axp221-battery-power-supply - - x-powers,axp813-battery-power-supply + oneOf: + - const: x-powers,axp202-battery-power-supply + - const: x-powers,axp209-battery-power-supply + - const: x-powers,axp221-battery-power-supply + - items: + - const: x-powers,axp803-battery-power-supply + - const: x-powers,axp813-battery-power-supply + - const: x-powers,axp813-battery-power-supply required: - compatible diff --git a/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-usb-power-supply.yaml b/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-usb-power-supply.yaml index 61f1b320c157..0c371b55c9e1 100644 --- a/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-usb-power-supply.yaml +++ b/Documentation/devicetree/bindings/power/supply/x-powers,axp20x-usb-power-supply.yaml @@ -20,11 +20,15 @@ allOf: properties: compatible: - enum: - - x-powers,axp202-usb-power-supply - - x-powers,axp221-usb-power-supply - - x-powers,axp223-usb-power-supply - - x-powers,axp813-usb-power-supply + oneOf: + - enum: + - x-powers,axp202-usb-power-supply + - x-powers,axp221-usb-power-supply + - x-powers,axp223-usb-power-supply + - x-powers,axp813-usb-power-supply + - items: + - const: x-powers,axp803-usb-power-supply + - const: x-powers,axp813-usb-power-supply required: diff --git a/Documentation/devicetree/bindings/regulator/richtek,rtq2134-regulator.yaml b/Documentation/devicetree/bindings/regulator/richtek,rtq2134-regulator.yaml new file mode 100644 index 000000000000..3f47e8e6c4fd --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/richtek,rtq2134-regulator.yaml @@ -0,0 +1,106 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/regulator/richtek,rtq2134-regulator.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Richtek RTQ2134 SubPMIC Regulator + +maintainers: + - ChiYuan Huang 
<cy_huang@richtek.com> + +description: | + The RTQ2134 is a multi-phase, programmable power management IC that + integrates with four high efficient, synchronous step-down converter cores. + + Datasheet is available at + https://www.richtek.com/assets/product_file/RTQ2134-QA/DSQ2134-QA-01.pdf + +properties: + compatible: + enum: + - richtek,rtq2134 + + reg: + maxItems: 1 + + regulators: + type: object + + patternProperties: + "^buck[1-3]$": + type: object + $ref: regulator.yaml# + description: | + regulator description for buck[1-3]. + + properties: + richtek,use-vsel-dvs: + type: boolean + description: | + If specified, buck will listen to 'vsel' pin for dvs config. + Else, use dvs0 voltage by default. + + richtek,uv-shutdown: + type: boolean + description: | + If specified, use shutdown as UV action. Else, hiccup by default. + + unevaluatedProperties: false + + additionalProperties: false + +required: + - compatible + - reg + - regulators + +additionalProperties: false + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + rtq2134@18 { + compatible = "richtek,rtq2134"; + reg = <0x18>; + + regulators { + buck1 { + regulator-name = "rtq2134-buck1"; + regulator-min-microvolt = <300000>; + regulator-max-microvolt = <1850000>; + regulator-always-on; + richtek,use-vsel-dvs; + regulator-state-mem { + regulator-suspend-min-microvolt = <550000>; + regulator-suspend-max-microvolt = <550000>; + }; + }; + buck2 { + regulator-name = "rtq2134-buck2"; + regulator-min-microvolt = <1120000>; + regulator-max-microvolt = <1120000>; + regulator-always-on; + richtek,use-vsel-dvs; + regulator-state-mem { + regulator-suspend-min-microvolt = <1120000>; + regulator-suspend-max-microvolt = <1120000>; + }; + }; + buck3 { + regulator-name = "rtq2134-buck3"; + regulator-min-microvolt = <600000>; + regulator-max-microvolt = <600000>; + regulator-always-on; + richtek,use-vsel-dvs; + regulator-state-mem { + regulator-suspend-min-microvolt = <600000>; + 
regulator-suspend-max-microvolt = <600000>; + }; + }; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/regulator/richtek,rtq6752-regulator.yaml b/Documentation/devicetree/bindings/regulator/richtek,rtq6752-regulator.yaml new file mode 100644 index 000000000000..e6e5a9a7d940 --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/richtek,rtq6752-regulator.yaml @@ -0,0 +1,76 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/regulator/richtek,rtq6752-regulator.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Richtek RTQ6752 TFT LCD Voltage Regulator + +maintainers: + - ChiYuan Huang <cy_huang@richtek.com> + +description: | + The RTQ6752 is an I2C interface pgorammable power management IC. It includes + two synchronous boost converter for PAVDD, and one synchronous NAVDD + buck-boost. The device is suitable for automotive TFT-LCD panel. + +properties: + compatible: + enum: + - richtek,rtq6752 + + reg: + maxItems: 1 + + enable-gpios: + description: | + A connection of the chip 'enable' gpio line. If not provided, treat it as + external pull up. + maxItems: 1 + + regulators: + type: object + + patternProperties: + "^(p|n)avdd$": + type: object + $ref: regulator.yaml# + description: | + regulator description for pavdd and navdd. 
+ + additionalProperties: false + +required: + - compatible + - reg + - regulators + +additionalProperties: false + +examples: + - | + i2c { + #address-cells = <1>; + #size-cells = <0>; + + rtq6752@6b { + compatible = "richtek,rtq6752"; + reg = <0x6b>; + enable-gpios = <&gpio26 2 0>; + + regulators { + pavdd { + regulator-name = "rtq6752-pavdd"; + regulator-min-microvolt = <5000000>; + regulator-max-microvolt = <7300000>; + regulator-boot-on; + }; + navdd { + regulator-name = "rtq6752-navdd"; + regulator-min-microvolt = <5000000>; + regulator-max-microvolt = <7300000>; + regulator-boot-on; + }; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/regulator/socionext,uniphier-regulator.yaml b/Documentation/devicetree/bindings/regulator/socionext,uniphier-regulator.yaml new file mode 100644 index 000000000000..861d5f3c79e8 --- /dev/null +++ b/Documentation/devicetree/bindings/regulator/socionext,uniphier-regulator.yaml @@ -0,0 +1,85 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/regulator/socionext,uniphier-regulator.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Socionext UniPhier regulator controller + +description: | + This regulator controls VBUS and belongs to USB3 glue layer. Before using + the regulator, it is necessary to control the clocks and resets to enable + this layer. These clocks and resets should be described in each property. 
+ +maintainers: + - Kunihiko Hayashi <hayashi.kunihiko@socionext.com> + +allOf: + - $ref: "regulator.yaml#" + +# USB3 Controller + +properties: + compatible: + enum: + - socionext,uniphier-pro4-usb3-regulator + - socionext,uniphier-pro5-usb3-regulator + - socionext,uniphier-pxs2-usb3-regulator + - socionext,uniphier-ld20-usb3-regulator + - socionext,uniphier-pxs3-usb3-regulator + + reg: + maxItems: 1 + + clocks: + minItems: 1 + maxItems: 2 + + clock-names: + oneOf: + - items: # for Pro4, Pro5 + - const: gio + - const: link + - items: # for others + - const: link + + resets: + minItems: 1 + maxItems: 2 + + reset-names: + oneOf: + - items: # for Pro4, Pro5 + - const: gio + - const: link + - items: + - const: link + +additionalProperties: false + +required: + - compatible + - reg + - clocks + - clock-names + - resets + - reset-names + +examples: + - | + usb-glue@65b00000 { + compatible = "simple-mfd"; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0 0x65b00000 0x400>; + + usb_vbus0: regulators@100 { + compatible = "socionext,uniphier-ld20-usb3-regulator"; + reg = <0x100 0x10>; + clock-names = "link"; + clocks = <&sys_clk 14>; + reset-names = "link"; + resets = <&sys_rst 14>; + }; + }; + diff --git a/Documentation/devicetree/bindings/regulator/uniphier-regulator.txt b/Documentation/devicetree/bindings/regulator/uniphier-regulator.txt deleted file mode 100644 index 94fd38b0d163..000000000000 --- a/Documentation/devicetree/bindings/regulator/uniphier-regulator.txt +++ /dev/null @@ -1,58 +0,0 @@ -Socionext UniPhier Regulator Controller - -This describes the devicetree bindings for regulator controller implemented -on Socionext UniPhier SoCs. - -USB3 Controller ---------------- - -This regulator controls VBUS and belongs to USB3 glue layer. Before using -the regulator, it is necessary to control the clocks and resets to enable -this layer. These clocks and resets should be described in each property. 
- -Required properties: -- compatible: Should be - "socionext,uniphier-pro4-usb3-regulator" - for Pro4 SoC - "socionext,uniphier-pro5-usb3-regulator" - for Pro5 SoC - "socionext,uniphier-pxs2-usb3-regulator" - for PXs2 SoC - "socionext,uniphier-ld20-usb3-regulator" - for LD20 SoC - "socionext,uniphier-pxs3-usb3-regulator" - for PXs3 SoC -- reg: Specifies offset and length of the register set for the device. -- clocks: A list of phandles to the clock gate for USB3 glue layer. - According to the clock-names, appropriate clocks are required. -- clock-names: Should contain - "gio", "link" - for Pro4 and Pro5 SoCs - "link" - for others -- resets: A list of phandles to the reset control for USB3 glue layer. - According to the reset-names, appropriate resets are required. -- reset-names: Should contain - "gio", "link" - for Pro4 and Pro5 SoCs - "link" - for others - -See Documentation/devicetree/bindings/regulator/regulator.txt -for more details about the regulator properties. - -Example: - - usb-glue@65b00000 { - compatible = "socionext,uniphier-ld20-dwc3-glue", - "simple-mfd"; - #address-cells = <1>; - #size-cells = <1>; - ranges = <0 0x65b00000 0x400>; - - usb_vbus0: regulators@100 { - compatible = "socionext,uniphier-ld20-usb3-regulator"; - reg = <0x100 0x10>; - clock-names = "link"; - clocks = <&sys_clk 14>; - reset-names = "link"; - resets = <&sys_rst 14>; - }; - - phy { - ... - phy-supply = <&usb_vbus0>; - }; - ... - }; diff --git a/Documentation/devicetree/bindings/spi/omap-spi.txt b/Documentation/devicetree/bindings/spi/omap-spi.txt deleted file mode 100644 index 487208c256c0..000000000000 --- a/Documentation/devicetree/bindings/spi/omap-spi.txt +++ /dev/null @@ -1,48 +0,0 @@ -OMAP2+ McSPI device - -Required properties: -- compatible : - - "ti,am654-mcspi" for AM654. - - "ti,omap2-mcspi" for OMAP2 & OMAP3. - - "ti,omap4-mcspi" for OMAP4+. -- ti,spi-num-cs : Number of chipselect supported by the instance. 
-- ti,hwmods: Name of the hwmod associated to the McSPI -- ti,pindir-d0-out-d1-in: Select the D0 pin as output and D1 as - input. The default is D0 as input and - D1 as output. - -Optional properties: -- dmas: List of DMA specifiers with the controller specific format - as described in the generic DMA client binding. A tx and rx - specifier is required for each chip select. -- dma-names: List of DMA request names. These strings correspond - 1:1 with the DMA specifiers listed in dmas. The string naming - is to be "rxN" and "txN" for RX and TX requests, - respectively, where N equals the chip select number. - -Examples: - -[hwmod populated DMA resources] - -mcspi1: mcspi@1 { - #address-cells = <1>; - #size-cells = <0>; - compatible = "ti,omap4-mcspi"; - ti,hwmods = "mcspi1"; - ti,spi-num-cs = <4>; -}; - -[generic DMA request binding] - -mcspi1: mcspi@1 { - #address-cells = <1>; - #size-cells = <0>; - compatible = "ti,omap4-mcspi"; - ti,hwmods = "mcspi1"; - ti,spi-num-cs = <2>; - dmas = <&edma 42 - &edma 43 - &edma 44 - &edma 45>; - dma-names = "tx0", "rx0", "tx1", "rx1"; -}; diff --git a/Documentation/devicetree/bindings/spi/omap-spi.yaml b/Documentation/devicetree/bindings/spi/omap-spi.yaml new file mode 100644 index 000000000000..e55538186cf6 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/omap-spi.yaml @@ -0,0 +1,117 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/omap-spi.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: SPI controller bindings for OMAP and K3 SoCs + +maintainers: + - Aswath Govindraju <a-govindraju@ti.com> + +allOf: + - $ref: spi-controller.yaml# + +properties: + compatible: + oneOf: + - items: + - enum: + - ti,am654-mcspi + - ti,am4372-mcspi + - const: ti,omap4-mcspi + - items: + - enum: + - ti,omap2-mcspi + - ti,omap4-mcspi + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + maxItems: 1 + + power-domains: + maxItems: 1 + + 
ti,spi-num-cs: + $ref: /schemas/types.yaml#/definitions/uint32 + description: Number of chipselect supported by the instance. + minimum: 1 + maximum: 4 + + ti,hwmods: + $ref: /schemas/types.yaml#/definitions/string + description: + Must be "mcspi<n>", n being the instance number (1-based). + This property is applicable only on legacy platforms mainly omap2/3 + and ti81xx and should not be used on other platforms. + deprecated: true + + ti,pindir-d0-out-d1-in: + description: + Select the D0 pin as output and D1 as input. The default is D0 + as input and D1 as output. + type: boolean + + dmas: + description: + List of DMA specifiers with the controller specific format as + described in the generic DMA client binding. A tx and rx + specifier is required for each chip select. + minItems: 1 + maxItems: 8 + + dma-names: + description: + List of DMA request names. These strings correspond 1:1 with + the DMA sepecifiers listed in dmas. The string names is to be + "rxN" and "txN" for RX and TX requests, respectively. Where N + is the chip select number. 
+ minItems: 1 + maxItems: 8 + +required: + - compatible + - reg + - interrupts + +unevaluatedProperties: false + +if: + properties: + compatible: + oneOf: + - const: ti,omap2-mcspi + - const: ti,omap4-mcspi + +then: + properties: + ti,hwmods: + items: + - pattern: "^mcspi([1-9])$" + +else: + properties: + ti,hwmods: false + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/soc/ti,sci_pm_domain.h> + + spi@2100000 { + compatible = "ti,am654-mcspi","ti,omap4-mcspi"; + reg = <0x2100000 0x400>; + interrupts = <GIC_SPI 184 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&k3_clks 137 1>; + power-domains = <&k3_pds 137 TI_SCI_PD_EXCLUSIVE>; + #address-cells = <1>; + #size-cells = <0>; + dmas = <&main_udmap 0xc500>, <&main_udmap 0x4500>; + dma-names = "tx0", "rx0"; + }; diff --git a/Documentation/devicetree/bindings/spi/rockchip-sfc.yaml b/Documentation/devicetree/bindings/spi/rockchip-sfc.yaml new file mode 100644 index 000000000000..339fb39529f3 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/rockchip-sfc.yaml @@ -0,0 +1,91 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/rockchip-sfc.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Rockchip Serial Flash Controller (SFC) + +maintainers: + - Heiko Stuebner <heiko@sntech.de> + - Chris Morgan <macromorgan@hotmail.com> + +allOf: + - $ref: spi-controller.yaml# + +properties: + compatible: + const: rockchip,sfc + description: + The rockchip sfc controller is a standalone IP with version register, + and the driver can handle all the feature difference inside the IP + depending on the version register. 
+ + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + items: + - description: Bus Clock + - description: Module Clock + + clock-names: + items: + - const: clk_sfc + - const: hclk_sfc + + power-domains: + maxItems: 1 + + rockchip,sfc-no-dma: + description: Disable DMA and utilize FIFO mode only + type: boolean + +patternProperties: + "^flash@[0-3]$": + type: object + properties: + reg: + minimum: 0 + maximum: 3 + +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + +unevaluatedProperties: false + +examples: + - | + #include <dt-bindings/clock/px30-cru.h> + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/power/px30-power.h> + + sfc: spi@ff3a0000 { + compatible = "rockchip,sfc"; + reg = <0xff3a0000 0x4000>; + interrupts = <GIC_SPI 56 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cru SCLK_SFC>, <&cru HCLK_SFC>; + clock-names = "clk_sfc", "hclk_sfc"; + pinctrl-0 = <&sfc_clk &sfc_cs &sfc_bus2>; + pinctrl-names = "default"; + power-domains = <&power PX30_PD_MMC_NAND>; + #address-cells = <1>; + #size-cells = <0>; + + flash@0 { + compatible = "jedec,spi-nor"; + reg = <0>; + spi-max-frequency = <108000000>; + spi-rx-bus-width = <2>; + spi-tx-bus-width = <2>; + }; + }; + +... 
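Reviewer note on the rockchip-sfc.yaml binding above: child nodes must match the `patternProperties` key `^flash@[0-3]$`, and their `reg` value is limited to the range 0 to 3. The following is a minimal sketch of those two checks in plain Python. The function name `flash_node_ok` is hypothetical, and the sketch does not reproduce how the schema tooling evaluates the binding.

```python
import re

# Pattern copied from the patternProperties key in the binding above.
FLASH_NODE_RE = re.compile(r"^flash@[0-3]$")


def flash_node_ok(name, reg):
    """Check a child node name and reg value against the binding's limits.

    Hypothetical helper for illustration only: the name must match the
    flash@[0-3] pattern and reg must lie between 0 and 3 inclusive.
    """
    if FLASH_NODE_RE.fullmatch(name) is None:
        return False
    return 0 <= reg <= 3


if __name__ == "__main__":
    # The node from the binding's example: flash@0 with reg = <0>.
    print(flash_node_ok("flash@0", 0))
    # Chip select 4 is outside the range the binding allows.
    print(flash_node_ok("flash@4", 4))
```

The example node `flash@0` with `reg = <0>` passes both checks, while a name such as `flash@4` fails the pattern before the range is even considered.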
diff --git a/Documentation/devicetree/bindings/spi/spi-mt65xx.txt b/Documentation/devicetree/bindings/spi/spi-mt65xx.txt index 4d0e4c15c4ea..2a24969159cc 100644 --- a/Documentation/devicetree/bindings/spi/spi-mt65xx.txt +++ b/Documentation/devicetree/bindings/spi/spi-mt65xx.txt @@ -11,6 +11,7 @@ Required properties: - mediatek,mt8135-spi: for mt8135 platforms - mediatek,mt8173-spi: for mt8173 platforms - mediatek,mt8183-spi: for mt8183 platforms + - mediatek,mt6893-spi: for mt6893 platforms - "mediatek,mt8192-spi", "mediatek,mt6765-spi": for mt8192 platforms - "mediatek,mt8195-spi", "mediatek,mt6765-spi": for mt8195 platforms - "mediatek,mt8516-spi", "mediatek,mt2712-spi": for mt8516 platforms diff --git a/Documentation/devicetree/bindings/spi/spi-sprd-adi.txt b/Documentation/devicetree/bindings/spi/spi-sprd-adi.txt deleted file mode 100644 index 2567c829e2dc..000000000000 --- a/Documentation/devicetree/bindings/spi/spi-sprd-adi.txt +++ /dev/null @@ -1,63 +0,0 @@ -Spreadtrum ADI controller - -ADI is the abbreviation of Anolog-Digital interface, which is used to access -analog chip (such as PMIC) from digital chip. ADI controller follows the SPI -framework for its hardware implementation is alike to SPI bus and its timing -is compatile to SPI timing. - -ADI controller has 50 channels including 2 software read/write channels and -48 hardware channels to access analog chip. For 2 software read/write channels, -users should set ADI registers to access analog chip. For hardware channels, -we can configure them to allow other hardware components to use it independently, -which means we can just link one analog chip address to one hardware channel, -then users can access the mapped analog chip address by this hardware channel -triggered by hardware components instead of ADI software channels. 
- -Thus we introduce one property named "sprd,hw-channels" to configure hardware -channels, the first value specifies the hardware channel id which is used to -transfer data triggered by hardware automatically, and the second value specifies -the analog chip address where user want to access by hardware components. - -Since we have multi-subsystems will use unique ADI to access analog chip, when -one system is reading/writing data by ADI software channels, that should be under -one hardware spinlock protection to prevent other systems from reading/writing -data by ADI software channels at the same time, or two parallel routine of setting -ADI registers will make ADI controller registers chaos to lead incorrect results. -Then we need one hardware spinlock to synchronize between the multiple subsystems. - -The new version ADI controller supplies multiple master channels for different -subsystem accessing, that means no need to add hardware spinlock to synchronize, -thus change the hardware spinlock support to be optional to keep backward -compatibility. - -Required properties: -- compatible: Should be "sprd,sc9860-adi". -- reg: Offset and length of ADI-SPI controller register space. -- #address-cells: Number of cells required to define a chip select address - on the ADI-SPI bus. Should be set to 1. -- #size-cells: Size of cells required to define a chip select address size - on the ADI-SPI bus. Should be set to 0. - -Optional properties: -- hwlocks: Reference to a phandle of a hwlock provider node. -- hwlock-names: Reference to hwlock name strings defined in the same order - as the hwlocks, should be "adi". -- sprd,hw-channels: This is an array of channel values up to 49 channels. - The first value specifies the hardware channel id which is used to - transfer data triggered by hardware automatically, and the second - value specifies the analog chip address where user want to access - by hardware components. 
- -SPI slave nodes must be children of the SPI controller node and can contain -properties described in Documentation/devicetree/bindings/spi/spi-bus.txt. - -Example: - adi_bus: spi@40030000 { - compatible = "sprd,sc9860-adi"; - reg = <0 0x40030000 0 0x10000>; - hwlocks = <&hwlock1 0>; - hwlock-names = "adi"; - #address-cells = <1>; - #size-cells = <0>; - sprd,hw-channels = <30 0x8c20>; - }; diff --git a/Documentation/devicetree/bindings/spi/sprd,spi-adi.yaml b/Documentation/devicetree/bindings/spi/sprd,spi-adi.yaml new file mode 100644 index 000000000000..fe014020da69 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/sprd,spi-adi.yaml @@ -0,0 +1,104 @@ +# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +%YAML 1.2 +--- +$id: "http://devicetree.org/schemas/spi/sprd,spi-adi.yaml#" +$schema: "http://devicetree.org/meta-schemas/core.yaml#" + +title: Spreadtrum ADI controller + +maintainers: + - Orson Zhai <orsonzhai@gmail.com> + - Baolin Wang <baolin.wang7@gmail.com> + - Chunyan Zhang <zhang.lyra@gmail.com> + +description: | + ADI is the abbreviation of Anolog-Digital interface, which is used to access + analog chip (such as PMIC) from digital chip. ADI controller follows the SPI + framework for its hardware implementation is alike to SPI bus and its timing + is compatile to SPI timing. + + ADI controller has 50 channels including 2 software read/write channels and + 48 hardware channels to access analog chip. For 2 software read/write channels, + users should set ADI registers to access analog chip. For hardware channels, + we can configure them to allow other hardware components to use it independently, + which means we can just link one analog chip address to one hardware channel, + then users can access the mapped analog chip address by this hardware channel + triggered by hardware components instead of ADI software channels. 
+ + Thus we introduce one property named "sprd,hw-channels" to configure hardware + channels, the first value specifies the hardware channel id which is used to + transfer data triggered by hardware automatically, and the second value specifies + the analog chip address where user want to access by hardware components. + + Since we have multi-subsystems will use unique ADI to access analog chip, when + one system is reading/writing data by ADI software channels, that should be under + one hardware spinlock protection to prevent other systems from reading/writing + data by ADI software channels at the same time, or two parallel routine of setting + ADI registers will make ADI controller registers chaos to lead incorrect results. + Then we need one hardware spinlock to synchronize between the multiple subsystems. + + The new version ADI controller supplies multiple master channels for different + subsystem accessing, that means no need to add hardware spinlock to synchronize, + thus change the hardware spinlock support to be optional to keep backward + compatibility. + +allOf: + - $ref: /spi/spi-controller.yaml# + +properties: + compatible: + enum: + - sprd,sc9860-adi + - sprd,sc9863-adi + - sprd,ums512-adi + + reg: + maxItems: 1 + + hwlocks: + maxItems: 1 + + hwlock-names: + const: adi + + sprd,hw-channels: + $ref: /schemas/types.yaml#/definitions/uint32-matrix + description: A list of hardware channels + minItems: 1 + maxItems: 48 + items: + items: + - description: The hardware channel id which is used to transfer data + triggered by hardware automatically, channel id 0-1 are for software + use, 2-49 are hardware channels. + minimum: 2 + maximum: 49 + - description: The analog chip address where user want to access by + hardware components. 
+ +required: + - compatible + - reg + - '#address-cells' + - '#size-cells' + +unevaluatedProperties: false + +examples: + - | + aon { + #address-cells = <2>; + #size-cells = <2>; + + adi_bus: spi@40030000 { + compatible = "sprd,sc9860-adi"; + reg = <0 0x40030000 0 0x10000>; + hwlocks = <&hwlock1 0>; + hwlock-names = "adi"; + #address-cells = <1>; + #size-cells = <0>; + sprd,hw-channels = <30 0x8c20>; + }; + }; +... diff --git a/Documentation/devicetree/bindings/timer/rockchip,rk-timer.txt b/Documentation/devicetree/bindings/timer/rockchip,rk-timer.txt deleted file mode 100644 index d65fdce7c7f0..000000000000 --- a/Documentation/devicetree/bindings/timer/rockchip,rk-timer.txt +++ /dev/null @@ -1,27 +0,0 @@ -Rockchip rk timer - -Required properties: -- compatible: should be: - "rockchip,rv1108-timer", "rockchip,rk3288-timer": for Rockchip RV1108 - "rockchip,rk3036-timer", "rockchip,rk3288-timer": for Rockchip RK3036 - "rockchip,rk3066-timer", "rockchip,rk3288-timer": for Rockchip RK3066 - "rockchip,rk3188-timer", "rockchip,rk3288-timer": for Rockchip RK3188 - "rockchip,rk3228-timer", "rockchip,rk3288-timer": for Rockchip RK3228 - "rockchip,rk3229-timer", "rockchip,rk3288-timer": for Rockchip RK3229 - "rockchip,rk3288-timer": for Rockchip RK3288 - "rockchip,rk3368-timer", "rockchip,rk3288-timer": for Rockchip RK3368 - "rockchip,rk3399-timer": for Rockchip RK3399 -- reg: base address of the timer register starting with TIMERS CONTROL register -- interrupts: should contain the interrupts for Timer0 -- clocks : must contain an entry for each entry in clock-names -- clock-names : must include the following entries: - "timer", "pclk" - -Example: - timer: timer@ff810000 { - compatible = "rockchip,rk3288-timer"; - reg = <0xff810000 0x20>; - interrupts = <GIC_SPI 72 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&xin24m>, <&cru PCLK_TIMER>; - clock-names = "timer", "pclk"; - }; diff --git a/Documentation/devicetree/bindings/timer/rockchip,rk-timer.yaml 
b/Documentation/devicetree/bindings/timer/rockchip,rk-timer.yaml new file mode 100644 index 000000000000..e26ecb5893ae --- /dev/null +++ b/Documentation/devicetree/bindings/timer/rockchip,rk-timer.yaml @@ -0,0 +1,64 @@ +# SPDX-License-Identifier: GPL-2.0 +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/timer/rockchip,rk-timer.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Rockchip Timer Device Tree Bindings + +maintainers: + - Daniel Lezcano <daniel.lezcano@linaro.org> + +properties: + compatible: + oneOf: + - const: rockchip,rk3288-timer + - const: rockchip,rk3399-timer + - items: + - enum: + - rockchip,rv1108-timer + - rockchip,rk3036-timer + - rockchip,rk3066-timer + - rockchip,rk3188-timer + - rockchip,rk3228-timer + - rockchip,rk3229-timer + - rockchip,rk3288-timer + - rockchip,rk3368-timer + - rockchip,px30-timer + - const: rockchip,rk3288-timer + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + minItems: 2 + maxItems: 2 + + clock-names: + items: + - const: pclk + - const: timer + +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + +additionalProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/clock/rk3288-cru.h> + + timer: timer@ff810000 { + compatible = "rockchip,rk3288-timer"; + reg = <0xff810000 0x20>; + interrupts = <GIC_SPI 72 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&cru PCLK_TIMER>, <&xin24m>; + clock-names = "pclk", "timer"; + }; diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index f5a3207aa7fa..c57c609ad2eb 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -85,7 +85,6 @@ available subsections can be seen below. 
io-mapping io_ordering generic-counter - lightnvm-pblk memory-devices/index men-chameleon-bus ntb diff --git a/Documentation/driver-api/lightnvm-pblk.rst b/Documentation/driver-api/lightnvm-pblk.rst deleted file mode 100644 index 1040ed1cec81..000000000000 --- a/Documentation/driver-api/lightnvm-pblk.rst +++ /dev/null @@ -1,21 +0,0 @@ -pblk: Physical Block Device Target -================================== - -pblk implements a fully associative, host-based FTL that exposes a traditional -block I/O interface. Its primary responsibilities are: - - - Map logical addresses onto physical addresses (4KB granularity) in a - logical-to-physical (L2P) table. - - Maintain the integrity and consistency of the L2P table as well as its - recovery from normal tear down and power outage. - - Deal with controller- and media-specific constrains. - - Handle I/O errors. - - Implement garbage collection. - - Maintain consistency across the I/O stack during synchronization points. - -For more information please refer to: - - http://lightnvm.io - -which maintains updated FAQs, manual pages, technical documentation, tools, -contacts, etc. diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst index f47d05ed0d94..4a25c5eb6f07 100644 --- a/Documentation/fault-injection/fault-injection.rst +++ b/Documentation/fault-injection/fault-injection.rst @@ -24,6 +24,10 @@ Available fault injection capabilities injects futex deadlock and uaddr fault errors. +- fail_sunrpc + + injects kernel RPC client and server failures. + - fail_make_request injects disk IO errors on devices permitted by setting @@ -151,6 +155,20 @@ configuration of fault-injection capabilities. default is 'N', setting it to 'Y' will disable failure injections when dealing with private (address space) futexes. 
+- /sys/kernel/debug/fail_sunrpc/ignore-client-disconnect: + + Format: { 'Y' | 'N' } + + default is 'N', setting it to 'Y' will disable disconnect + injection on the RPC client. + +- /sys/kernel/debug/fail_sunrpc/ignore-server-disconnect: + + Format: { 'Y' | 'N' } + + default is 'N', setting it to 'Y' will disable disconnect + injection on the RPC server. + - /sys/kernel/debug/fail_function/inject: Format: { 'function-name' | '!function-name' | '' } diff --git a/Documentation/filesystems/cifs/index.rst b/Documentation/filesystems/cifs/index.rst new file mode 100644 index 000000000000..1c8597a679ab --- /dev/null +++ b/Documentation/filesystems/cifs/index.rst @@ -0,0 +1,10 @@ +=============================== +CIFS +=============================== + + +.. toctree:: + :maxdepth: 1 + + ksmbd + cifsroot diff --git a/Documentation/filesystems/cifs/ksmbd.rst b/Documentation/filesystems/cifs/ksmbd.rst new file mode 100644 index 000000000000..a1326157d53f --- /dev/null +++ b/Documentation/filesystems/cifs/ksmbd.rst @@ -0,0 +1,165 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +KSMBD - SMB3 Kernel Server +========================== + +KSMBD is a linux kernel server which implements SMB3 protocol in kernel space +for sharing files over network. + +KSMBD architecture +================== + +The subset of performance related operations belong in kernelspace and +the other subset which belong to operations which are not really related with +performance in userspace. So, DCE/RPC management that has historically resulted +into number of buffer overflow issues and dangerous security bugs and user +account management are implemented in user space as ksmbd.mountd. +File operations that are related with performance (open/read/write/close etc.) +in kernel space (ksmbd). This also allows for easier integration with VFS +interface for all file operations. 
+ +ksmbd (kernel daemon) +--------------------- + +When the server daemon is started, it starts up a forker thread +(ksmbd/interface name) at initialization time and opens a dedicated port 445 +for listening to SMB requests. Whenever a new client makes a request, the forker +thread will accept the client connection and fork a new thread for a dedicated +communication channel between the client and the server. It allows for parallel +processing of SMB requests (commands) from clients as well as allowing for new +clients to make new connections. Each instance is named ksmbd/1~n(port number) +to indicate connected clients. Depending on the SMB request types, each new +thread can decide to pass through the commands to the user space (ksmbd.mountd); +currently DCE/RPC commands are identified to be handled through the user space. +To further utilize the linux kernel, it has been chosen to process the commands +as workitems and to be executed in the handlers of the ksmbd-io kworker threads. +It allows for multiplexing of the handlers as the kernel takes care of initiating +extra worker threads if the load is increased and vice versa, if the load is +decreased it destroys the extra worker threads. So, after a connection is +established with a client, the dedicated ksmbd/1..n(port number) takes complete +ownership of receiving/parsing of SMB commands. Each received command is worked +in parallel, i.e., there can be multiple client commands which are worked in +parallel. After receiving each command a separate kernel workitem is prepared +for each command which is further queued to be handled by ksmbd-io kworkers. +So, each SMB workitem is queued to the kworkers. This allows the benefit of load +sharing to be managed optimally by the default kernel and optimizing client +performance by handling client commands in parallel. 
+ +ksmbd.mountd (user space daemon) +-------------------------------- + +ksmbd.mountd is userspace process to, transfer user account and password that +are registered using ksmbd.adduser(part of utils for user space). Further it +allows sharing information parameters that parsed from smb.conf to ksmbd in +kernel. For the execution part it has a daemon which is continuously running +and connected to the kernel interface using netlink socket, it waits for the +requests(dcerpc and share/user info). It handles RPC calls (at a minimum few +dozen) that are most important for file server from NetShareEnum and +NetServerGetInfo. Complete DCE/RPC response is prepared from the user space +and passed over to the associated kernel thread for the client. + + +KSMBD Feature Status +==================== + +============================== ================================================= +Feature name Status +============================== ================================================= +Dialects Supported. SMB2.1 SMB3.0, SMB3.1.1 dialects + (intentionally excludes security vulnerable SMB1 + dialect). +Auto Negotiation Supported. +Compound Request Supported. +Oplock Cache Mechanism Supported. +SMB2 leases(v1 lease) Supported. +Directory leases(v2 lease) Planned for future. +Multi-credits Supported. +NTLM/NTLMv2 Supported. +HMAC-SHA256 Signing Supported. +Secure negotiate Supported. +Signing Update Supported. +Pre-authentication integrity Supported. +SMB3 encryption(CCM, GCM) Supported. (CCM and GCM128 supported, GCM256 in + progress) +SMB direct(RDMA) Partially Supported. SMB3 Multi-channel is + required to connect to Windows client. +SMB3 Multi-channel Partially Supported. Planned to implement + replay/retry mechanisms for future. +SMB3.1.1 POSIX extension Supported. +ACLs Partially Supported. only DACLs available, SACLs + (auditing) is planned for the future. 
For + ownership (SIDs) ksmbd generates random subauth + values(then store it to disk) and use uid/gid + get from inode as RID for local domain SID. + The current acl implementation is limited to + standalone server, not a domain member. + Integration with Samba tools is being worked on + to allow future support for running as a domain + member. +Kerberos Supported. +Durable handle v1,v2 Planned for future. +Persistent handle Planned for future. +SMB2 notify Planned for future. +Sparse file support Supported. +DCE/RPC support Partially Supported. a few calls(NetShareEnumAll, + NetServerGetInfo, SAMR, LSARPC) that are needed + for file server handled via netlink interface + from ksmbd.mountd. Additional integration with + Samba tools and libraries via upcall is being + investigated to allow support for additional + DCE/RPC management calls (and future support + for Witness protocol e.g.) +ksmbd/nfsd interoperability Planned for future. The features that ksmbd + support are Leases, Notify, ACLs and Share modes. +============================== ================================================= + + +How to run +========== + +1. Download ksmbd-tools and compile them. + - https://github.com/cifsd-team/ksmbd-tools + +2. Create user/password for SMB share. + + # mkdir /etc/ksmbd/ + # ksmbd.adduser -a <Enter USERNAME for SMB share access> + +3. Create /etc/ksmbd/smb.conf file, add SMB share in smb.conf file + - Refer smb.conf.example and + https://github.com/cifsd-team/ksmbd-tools/blob/master/Documentation/configuration.txt + +4. Insert ksmbd.ko module + + # insmod ksmbd.ko + +5. Start ksmbd user space daemon + # ksmbd.mountd + +6. Access share from Windows or Linux using CIFS + +Shutdown KSMBD +============== + +1. kill user and kernel space daemon + # sudo ksmbd.control -s + +How to turn debug print on +========================== + +Each layer +/sys/class/ksmbd-control/debug + +1. Enable all component prints + # sudo ksmbd.control -d "all" + +2. 
Enable one of the components (smb, auth, vfs, oplock, ipc, conn, rdma) + # sudo ksmbd.control -d "smb" + +3. Show what prints are enabled. + # cat /sys/class/ksmbd-control/debug + [smb] auth vfs oplock ipc conn [rdma] + +4. Disable prints: + If you try the selected component once more, it is disabled without brackets. diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index 44b67ebd6e40..0eb799d9d05a 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -1063,11 +1063,6 @@ astute users may notice some differences in behavior: - DAX (Direct Access) is not supported on encrypted files. -- The st_size of an encrypted symlink will not necessarily give the - length of the symlink target as required by POSIX. It will actually - give the length of the ciphertext, which will be slightly longer - than the plaintext due to NUL-padding and an extra 2-byte overhead. - - The maximum length of an encrypted symlink is 2 bytes shorter than the maximum length of an unencrypted symlink. For example, on an EXT4 filesystem with a 4K block size, unencrypted symlinks can be up @@ -1235,12 +1230,12 @@ the user-supplied name to get the ciphertext. Lookups without the key are more complicated. The raw ciphertext may contain the ``\0`` and ``/`` characters, which are illegal in -filenames. Therefore, readdir() must base64-encode the ciphertext for -presentation. For most filenames, this works fine; on ->lookup(), the -filesystem just base64-decodes the user-supplied name to get back to -the raw ciphertext. +filenames. Therefore, readdir() must base64url-encode the ciphertext +for presentation. For most filenames, this works fine; on ->lookup(), +the filesystem just base64url-decodes the user-supplied name to get +back to the raw ciphertext. -However, for very long filenames, base64 encoding would cause the +However, for very long filenames, base64url encoding would cause the filename length to exceed NAME_MAX. 
To prevent this, readdir() actually presents long filenames in an abbreviated form which encodes a strong "hash" of the ciphertext filename, along with the optional diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst new file mode 100644 index 000000000000..1229a75ec75d --- /dev/null +++ b/Documentation/filesystems/idmappings.rst @@ -0,0 +1,1026 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Idmappings +========== + +Most filesystem developers will have encountered idmappings. They are used when +reading from or writing ownership to disk, reporting ownership to userspace, or +for permission checking. This document is aimed at filesystem developers that +want to know how idmappings work. + +Formal notes +------------ + +An idmapping is essentially a translation of a range of ids into another or the +same range of ids. The notational convention for idmappings that is widely used +in userspace is:: + + u:k:r + +``u`` indicates the first element in the upper idmapset ``U`` and ``k`` +indicates the first element in the lower idmapset ``K``. The ``r`` parameter +indicates the range of the idmapping, i.e. how many ids are mapped. From now +on, we will always prefix ids with ``u`` or ``k`` to make it clear whether +we're talking about an id in the upper or lower idmapset. + +To see what this looks like in practice, let's take the following idmapping:: + + u22:k10000:r3 + +and write down the mappings it will generate:: + + u22 -> k10000 + u23 -> k10001 + u24 -> k10002 + +From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an +idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are +order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of +the set of all possible ids useable on a given system. + +Looking at this mathematically briefly will help us highlight some properties +that make it easier to understand how we can translate between idmappings. 
For +example, we know that the inverse idmapping is an order isomorphism as well:: + + k10000 -> u22 + k10001 -> u23 + k10002 -> u24 + +Given that we are dealing with order isomorphisms plus the fact that we're +dealing with subsets we can embed idmappings into each other, i.e. we can +sensibly translate between different idmappings. For example, assume we've been +given the three idmappings:: + + 1. u0:k10000:r10000 + 2. u0:k20000:r10000 + 3. u0:k30000:r10000 + +and id ``k11000`` which has been generated by the first idmapping by mapping +``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset. + +Because we're dealing with order isomorphic subsets it is meaningful to ask +what id ``k11000`` corresponds to in the second or third idmapping. The +straightforward algorithm to use is to apply the inverse of the first idmapping, +mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using +either the second or the third idmapping. The second +idmapping would map ``u1000`` down to ``k21000``. The third idmapping would map +``u1000`` down to ``k31000``. + +If we were given the same task for the following three idmappings:: + + 1. u0:k10000:r10000 + 2. u0:k20000:r200 + 3. u0:k30000:r300 + +we would fail to translate as the sets aren't order isomorphic over the full +range of the first idmapping anymore (however, they are order isomorphic over +the full range of the second idmapping). Neither the second nor the third idmapping +contains ``u1000`` in the upper idmapset ``U``. This is equivalent to not having +an id mapped. We can simply say that ``u1000`` is unmapped in the second and +third idmapping. The kernel will report unmapped ids as the overflowuid +``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace. + +The algorithm to calculate what a given id maps to is pretty simple. First, we +need to verify that the range can contain our target id. We will skip this step +for simplicity. 
After that if we want to know what ``id`` maps to we can do +simple calculations: + +- If we want to map from left to right:: + + u:k:r + id - u + k = n + +- If we want to map from right to left:: + + u:k:r + id - k + u = n + +Instead of "left to right" we can also say "down" and instead of "right to +left" we can also say "up". Obviously mapping down and up invert each other. + +To see whether the simple formulas above work, consider the following two +idmappings:: + + 1. u0:k20000:r10000 + 2. u500:k30000:r10000 + +Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We +want to know what id this was mapped from in the upper idmapset of the first +idmapping. So we're mapping up in the first idmapping:: + + id - k + u = n + k21000 - k20000 + u0 = u1000 + +Now assume we are given the id ``u1100`` in the upper idmapset of the second +idmapping and we want to know what this id maps down to in the lower idmapset +of the second idmapping. This means we're mapping down in the second +idmapping:: + + id - u + k = n + u1100 - u500 + k30000 = k30600 + +General notes +------------- + +In the context of the kernel an idmapping can be interpreted as mapping a range +of userspace ids into a range of kernel ids:: + + userspace-id:kernel-id:range + +A userspace id is always an element in the upper idmapset of an idmapping of +type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower +idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on +"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t`` +types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``. + +The kernel is mostly concerned with kernel ids. They are used when performing +permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field. 
+A userspace id on the other hand is an id that is reported to userspace by the +kernel, or is passed by userspace to the kernel, or a raw device id that is +written or read from disk. + +Note that we are only concerned with idmappings as the kernel stores them not +how userspace would specify them. + +For the rest of this document we will prefix all userspace ids with ``u`` and +all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So +an idmapping will be written as ``u0:k10000:r10000``. + +For example, the id ``u1000`` is an id in the upper idmapset or "userspace +idmapset" starting with ``u0``. And it is mapped to ``k11000`` which is a +kernel id in the lower idmapset or "kernel idmapset" starting with ``k10000``. + +A kernel id is always created by an idmapping. Such idmappings are associated +with user namespaces. Since we mainly care about how idmappings work we're not +going to be concerned with how idmappings are created nor how they are used +outside of the filesystem context. This is best left to an explanation of user +namespaces. + +The initial user namespace is special. It always has an idmapping of the +following form:: + + u0:k0:r4294967295 + +which is an identity idmapping over the full range of ids available on this +system. + +Other user namespaces usually have non-identity idmappings such as:: + + u0:k10000:r10000 + +When a process creates or wants to change ownership of a file, or when the +ownership of a file is read from disk by a filesystem, the userspace id is +immediately translated into a kernel id according to the idmapping associated +with the relevant user namespace. + +For instance, consider a file that is stored on disk by a filesystem as being +owned by ``u1000``: + +- If a filesystem were to be mounted in the initial user namespace (as most + filesystems are) then the initial idmapping will be used. As we saw this is + simply the identity idmapping. 
This would mean id ``u1000`` read from disk + would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` field + would contain ``k1000``. + +- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000`` + then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's + ``i_uid`` and ``i_gid`` would contain ``k11000``. + +Translation algorithms +---------------------- + +We've already seen briefly that it is possible to translate between different +idmappings. We'll now take a closer look how that works. + +Crossmapping +~~~~~~~~~~~~ + +This translation algorithm is used by the kernel in quite a few places. For +example, it is used when reporting back the ownership of a file to userspace +via the ``stat()`` system call family. + +If we've been given ``k11000`` from one idmapping we can map that id up in +another idmapping. In order for this to work both idmappings need to contain +the same kernel id in their kernel idmapsets. For example, consider the +following idmappings:: + + 1. u0:k10000:r10000 + 2. u20000:k10000:r10000 + +and we are mapping ``u1000`` down to ``k11000`` in the first idmapping . We can +then translate ``k11000`` into a userspace id in the second idmapping using the +kernel idmapset of the second idmapping:: + + /* Map the kernel id up into a userspace id in the second idmapping. */ + from_kuid(u20000:k10000:r10000, k11000) = u21000 + +Note, how we can get back to the kernel id in the first idmapping by inverting +the algorithm:: + + /* Map the userspace id down into a kernel id in the second idmapping. */ + make_kuid(u20000:k10000:r10000, u21000) = k11000 + + /* Map the kernel id up into a userspace id in the first idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + +This algorithm allows us to answer the question what userspace id a given +kernel id corresponds to in a given idmapping. 
In order to be able to answer +this question both idmappings need to contain the same kernel id in their +respective kernel idmapsets. + +For example, when the kernel reads a raw userspace id from disk it maps it down +into a kernel id according to the idmapping associated with the filesystem. +Let's assume the filesystem was mounted with an idmapping of +``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This +means ``u1000`` will be mapped to ``k21000`` which is what will be stored in +the inode's ``i_uid`` and ``i_gid`` field. + +When someone in userspace calls ``stat()`` or a related function to get +ownership information about the file the kernel can't simply map the id back up +according to the filesystem's idmapping as this would give the wrong owner if +the caller is using an idmapping. + +So the kernel will map the id back up in the idmapping of the caller. Let's +assume the caller has the slightly unconventional idmapping +``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``. +Consequently the user would see that this file is owned by ``u4000``. + +Remapping +~~~~~~~~~ + +It is possible to translate a kernel id from one idmapping to another one via +the userspace idmapset of the two idmappings. This is equivalent to remapping +a kernel id. + +Let's look at an example. We are given the following two idmappings:: + + 1. u0:k10000:r10000 + 2. u0:k20000:r10000 + +and we are given ``k11000`` in the first idmapping. In order to translate this +kernel id in the first idmapping into a kernel id in the second idmapping we +need to perform two steps: + +1. Map the kernel id up into a userspace id in the first idmapping:: + + /* Map the kernel id up into a userspace id in the first idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + +2. Map the userspace id down into a kernel id in the second idmapping:: + + /* Map the userspace id down into a kernel id in the second idmapping. 
*/ + make_kuid(u0:k20000:r10000, u1000) = k21000 + +As you can see we used the userspace idmapset in both idmappings to translate +the kernel id in one idmapping to a kernel id in another idmapping. + +This allows us to answer the question what kernel id we would need to use to +get the same userspace id in another idmapping. In order to be able to answer +this question both idmappings need to contain the same userspace id in their +respective userspace idmapsets. + +Note how we can easily get back to the kernel id in the first idmapping by +inverting the algorithm: + +1. Map the kernel id up into a userspace id in the second idmapping:: + + /* Map the kernel id up into a userspace id in the second idmapping. */ + from_kuid(u0:k20000:r10000, k21000) = u1000 + +2. Map the userspace id down into a kernel id in the first idmapping:: + + /* Map the userspace id down into a kernel id in the first idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + +Another way to look at this translation is to treat it as inverting one +idmapping and applying another idmapping if both idmappings have the relevant +userspace id mapped. This will come in handy when working with idmapped mounts. + +Invalid translations +~~~~~~~~~~~~~~~~~~~~ + +It is never valid to use an id in the kernel idmapset of one idmapping as the +id in the userspace idmapset of another or the same idmapping. While the kernel +idmapset always indicates an idmapset in the kernel id space the userspace +idmapset indicates a userspace id. So the following translations are forbidden:: + + /* Map the userspace id down into a kernel id in the first idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + + /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */ + make_kuid(u10000:k20000:r10000, k11000) = k21000 + ~~~~~~ + +and equally wrong:: + + /* Map the kernel id up into a userspace id in the first idmapping. 
*/ + from_kuid(u0:k10000:r10000, k11000) = u1000 + + /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */ + from_kuid(u20000:k0:r10000, u1000) = k21000 + ~~~~~ + +Idmappings when creating filesystem objects +------------------------------------------- + +The concepts of mapping an id down or mapping an id up are expressed in the two +kernel functions filesystem developers are rather familiar with and which we've +already used in this document:: + + /* Map the userspace id down into a kernel id. */ + make_kuid(idmapping, uid) + + /* Map the kernel id up into a userspace id. */ + from_kuid(idmapping, kuid) + +We will take an abbreviated look into how idmappings figure into creating +filesystem objects. For simplicity we will only look at what happens when the +VFS has already completed path lookup right before it calls into the filesystem +itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is +called. We will also assume that the directory we're creating filesystem +objects in is readable and writable for everyone. + +When creating a filesystem object the caller will look at the caller's +filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids +but they are exclusively used when determining file ownership which is why they +are called "filesystem ids". They are usually identical to the uid and gid of +the caller but can differ. We will just assume they are always identical to not +get lost in too many details. + +When the caller enters the kernel two things happen: + +1. Map the caller's userspace ids down into kernel ids in the caller's + idmapping. + (To be precise, the kernel will simply look at the kernel ids stashed in the + credentials of the current task but for our education we'll pretend this + translation happens just in time.) +2. Verify that the caller's kernel ids can be mapped up to userspace ids in the + filesystem's idmapping. 
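The two steps above can be modelled in a few lines. The following is only an illustrative Python sketch of the arithmetic for a single ``u:k:r`` extent. The names ``Idmapping``, ``map_down``, ``map_up`` and ``can_create`` are hypothetical and are not kernel APIs: ``map_down`` stands in for ``make_kuid()``, ``map_up`` stands in for ``from_kuid()``, and ``None`` stands in for an unmapped id.

```python
from typing import NamedTuple, Optional


class Idmapping(NamedTuple):
    """One u:k:r extent, e.g. u0:k10000:r10000 (illustrative model only)."""
    u: int  # first id in the upper (userspace) idmapset
    k: int  # first id in the lower (kernel) idmapset
    r: int  # number of ids covered by the idmapping


def map_down(m: Idmapping, uid: int) -> Optional[int]:
    """Model of make_kuid(): userspace id -> kernel id, None if unmapped."""
    if m.u <= uid < m.u + m.r:
        return uid - m.u + m.k
    return None


def map_up(m: Idmapping, kid: int) -> Optional[int]:
    """Model of from_kuid(): kernel id -> userspace id, None if unmapped."""
    if m.k <= kid < m.k + m.r:
        return kid - m.k + m.u
    return None


def can_create(caller: Idmapping, fs: Idmapping, uid: int) -> bool:
    """Step 1: map the caller's id down in the caller's idmapping.
    Step 2: check that the result maps up in the filesystem's idmapping."""
    kid = map_down(caller, uid)
    return kid is not None and map_up(fs, kid) is not None
```

Under this model, a caller with ``u0:k10000:r10000`` creating a file as ``u1000`` passes the check on a filesystem with the identity idmapping, and is refused on a filesystem with ``u0:k20000:r10000`` because ``k11000`` has no userspace id there.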
+ +The second step is important as regular filesystems will ultimately need to map +the kernel id back up into a userspace id when writing to disk. +So with the second step the kernel guarantees that a valid userspace id can be +written to disk. If it can't the kernel will refuse the creation request to not +even remotely risk filesystem corruption. + +The astute reader will have realized that this is simply a variation of the +crossmapping algorithm we mentioned above in a previous section. First, the +kernel maps the caller's userspace id down into a kernel id according to the +caller's idmapping and then maps that kernel id up according to the +filesystem's idmapping. + +Example 1 +~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k0:r4294967295 + filesystem idmapping: u0:k0:r4294967295 + +Both the caller and the filesystem use the identity idmapping: + +1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Verify that the caller's kernel ids can be mapped to userspace ids in the + filesystem's idmapping. + + For this second step the kernel will call the function + ``fsuidgid_has_mapping()`` which ultimately boils down to calling + ``from_kuid()``:: + + from_kuid(u0:k0:r4294967295, k1000) = u1000 + +In this example both idmappings are the same so there's nothing exciting going +on. Ultimately the userspace id that lands on disk will be ``u1000``. + +Example 2 +~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + +1. Map the caller's userspace ids down into kernel ids in the caller's + idmapping:: + + make_kuid(u0:k10000:r10000, u1000) = k11000 + +2. 
Verify that the caller's kernel ids can be mapped up to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k20000:r10000, k11000) = u-1 + +It's immediately clear that while the caller's userspace id could be +successfully mapped down into kernel ids in the caller's idmapping the kernel +ids could not be mapped up according to the filesystem's idmapping. So the +kernel will deny this creation request. + +Note that while this example is less common, because most filesystems can't be +mounted with non-initial idmappings, this is a general problem as we can see in +the next examples. + +Example 3 +~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k0:r4294967295 + +1. Map the caller's userspace ids down into kernel ids in the caller's + idmapping:: + + make_kuid(u0:k10000:r10000, u1000) = k11000 + +2. Verify that the caller's kernel ids can be mapped up to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k0:r4294967295, k11000) = u11000 + +We can see that the translation always succeeds. The userspace id that the +filesystem will ultimately put to disk will always be identical to the value of +the kernel id that was created in the caller's idmapping. This has mainly two +consequences. + +First, that we can't allow a caller to ultimately write to disk with another +userspace id. We could only do this if we were to mount the whole filesystem +with the caller's or another idmapping. But that solution is limited to a few +filesystems and not very flexible. But this is a use-case that is pretty +important in containerized workloads. + +Second, the caller will usually not be able to create any files or access +directories that have stricter permissions because none of the filesystem's +kernel ids map up into valid userspace ids in the caller's idmapping: + +1. Map raw userspace ids down to kernel ids in the filesystem's idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2.
Map kernel ids up to userspace ids in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k1000) = u-1 + +Example 4 +~~~~~~~~~ + +:: + + file id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k0:r4294967295 + +In order to report ownership to userspace the kernel uses the crossmapping +algorithm introduced in a previous section: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k1000) = u-1 + +The crossmapping algorithm fails in this case because the kernel id in the +filesystem idmapping cannot be mapped up to a userspace id in the caller's +idmapping. Thus, the kernel will report the ownership of this file as the +overflowid. + +Example 5 +~~~~~~~~~ + +:: + + file id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + +In order to report ownership to userspace the kernel uses the crossmapping +algorithm introduced in a previous section: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k20000:r10000, u1000) = k21000 + +2. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k21000) = u-1 + +Again, the crossmapping algorithm fails in this case because the kernel id in +the filesystem idmapping cannot be mapped to a userspace id in the caller's +idmapping. Thus, the kernel will report the ownership of this file as the +overflowid. + +Note how in the last two examples things would be simple if the caller would be +using the initial idmapping. For a filesystem mounted with the initial +idmapping it would be trivial. So we only consider a filesystem with an +idmapping of ``u0:k20000:r10000``: + +1. 
Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k20000:r10000, u1000) = k21000 + +2. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k0:r4294967295, k21000) = u21000 + +Idmappings on idmapped mounts +----------------------------- + +The examples we've seen in the previous section where the caller's idmapping +and the filesystem's idmapping are incompatible cause various issues for +workloads. For a more complex but common example, consider two containers +started on the host. To completely prevent the two containers from affecting +each other, an administrator may often use different non-overlapping idmappings +for the two containers:: + + container1 idmapping: u0:k10000:r10000 + container2 idmapping: u0:k20000:r10000 + filesystem idmapping: u0:k30000:r10000 + +An administrator wanting to provide easy read-write access to the following set +of files:: + + dir id: u0 + dir/file1 id: u1000 + dir/file2 id: u2000 + +to both containers currently can't. + +Of course the administrator has the option to recursively change ownership via +``chown()``. For example, they could change ownership so that ``dir`` and all +files below it can be crossmapped from the filesystem's into the container's +idmapping. Let's assume they change ownership so it is compatible with the +first container's idmapping:: + + dir id: u10000 + dir/file1 id: u11000 + dir/file2 id: u12000 + +This would still leave ``dir`` rather useless to the second container. In fact, +``dir`` and all files below it would continue to appear owned by the overflowid +for the second container. + +Or consider another increasingly popular example. Some service managers such as +systemd implement a concept called "portable home directories". A user may want +to use their home directories on different machines where they are assigned +different login userspace ids.
Most users will have ``u1000`` as the login id +on their machine at home and all files in their home directory will usually be +owned by ``u1000``. At uni or at work they may have another login id such as +``u1125``. This makes it rather difficult to interact with their home directory +on their work machine. + +In both cases changing ownership recursively has grave implications. The most +obvious one is that ownership is changed globally and permanently. In the home +directory case this change in ownership would even need to happen every time the +user switches from their home to their work machine. For really large sets of +files this becomes increasingly costly. + +If the user is lucky, they are dealing with a filesystem that is mountable +inside user namespaces. But this would also change ownership globally and the +change in ownership is tied to the lifetime of the filesystem mount, i.e. the +superblock. The only way to change ownership is to completely unmount the +filesystem and mount it again in another user namespace. This is usually +impossible because it would mean that all users currently accessing the +filesystem can't anymore. And it means that ``dir`` still can't be shared +between two containers with different idmappings. +But usually the user doesn't even have this option since most filesystems +aren't mountable inside containers. And not having them mountable might be +desirable as it doesn't require the filesystem to deal with malicious +filesystem images. + +But the usecases mentioned above and more can be handled by idmapped mounts. +They make it possible to expose the same set of dentries with different ownership at +different mounts. This is achieved by marking the mounts with a user namespace +through the ``mount_setattr()`` system call. The idmapping associated with it +is then used to translate from the caller's idmapping to the filesystem's +idmapping and vice versa using the remapping algorithm we introduced above.
+ +Idmapped mounts make it possible to change ownership in a temporary and +localized way. The ownership changes are restricted to a specific mount and the +ownership changes are tied to the lifetime of the mount. All other users and +locations where the filesystem is exposed are unaffected. + +Filesystems that support idmapped mounts don't have any real reason to support +being mountable inside user namespaces. A filesystem could be exposed +completely under an idmapped mount to get the same effect. This has the +advantage that filesystems can leave the creation of the superblock to +privileged users in the initial user namespace. + +However, it is perfectly possible to combine idmapped mounts with filesystems +mountable inside user namespaces. We will touch on this further below. + +Remapping helpers +~~~~~~~~~~~~~~~~~ + +Idmapping functions were added that translate between idmappings. They make use +of the remapping algorithm we've introduced earlier. We're going to look at +two: + +- ``i_uid_into_mnt()`` and ``i_gid_into_mnt()`` + + The ``i_*id_into_mnt()`` functions translate the filesystem's kernel ids into + kernel ids in the mount's idmapping:: + + /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(filesystem, kid) = uid + + /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(mount, uid) = kuid + +- ``mapped_fsuid()`` and ``mapped_fsgid()`` + + The ``mapped_fs*id()`` functions translate the caller's kernel ids into + kernel ids in the filesystem's idmapping. This translation is achieved by + remapping the caller's kernel ids using the mount's idmapping:: + + /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(mount, kid) = uid + + /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(filesystem, uid) = kuid + +Note that these two functions invert each other.
Consider the following +idmappings:: + + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + mount idmapping: u0:k10000:r10000 + +Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id +to ``k21000`` according to its idmapping. This is what is stored in the +inode's ``i_uid`` and ``i_gid`` fields. + +When the caller queries the ownership of this file via ``stat()`` the kernel +would usually simply use the crossmapping algorithm and map the filesystem's +kernel id up to a userspace id in the caller's idmapping. + +But when the caller is accessing the file on an idmapped mount the kernel will +first call ``i_uid_into_mnt()`` thereby translating the filesystem's kernel id +into a kernel id in the mount's idmapping:: + + i_uid_into_mnt(k21000): + /* Map the filesystem's kernel id up into a userspace id. */ + from_kuid(u0:k20000:r10000, k21000) = u1000 + + /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + +Finally, when the kernel reports the owner to the caller it will turn the +kernel id in the mount's idmapping into a userspace id in the caller's +idmapping:: + + from_kuid(u0:k10000:r10000, k11000) = u1000 + +We can test whether this algorithm really works by verifying what happens when +we create a new file. Let's say the user is creating a file with ``u1000``. + +The kernel maps this to ``k11000`` in the caller's idmapping. Usually the +kernel would now apply the crossmapping, verifying that ``k11000`` can be +mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't +be mapped up in the filesystem's idmapping directly this creation request +fails.
+ +But when the caller is accessing the file on an idmapped mount the kernel will +first call ``mapped_fs*id()`` thereby translating the caller's kernel id into +a kernel id according to the mount's idmapping:: + + mapped_fsuid(k11000): + /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + + /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(u0:k20000:r10000, u1000) = k21000 + +When finally writing to disk the kernel will then map ``k21000`` up into a +userspace id in the filesystem's idmapping:: + + from_kuid(u0:k20000:r10000, k21000) = u1000 + +As we can see, we end up with an invertible and therefore information +preserving algorithm. A file created from ``u1000`` on an idmapped mount will +also be reported as being owned by ``u1000`` and vice versa. + +Let's now briefly reconsider the failing examples from earlier in the context +of idmapped mounts. + +Example 2 reconsidered +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + mount idmapping: u0:k10000:r10000 + +When the caller is using a non-initial idmapping the common case is to attach +the same idmapping to the mount. We now perform three steps: + +1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: + + make_kuid(u0:k10000:r10000, u1000) = k11000 + +2. Translate the caller's kernel id into a kernel id in the filesystem's + idmapping:: + + mapped_fsuid(k11000): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + + /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(u0:k20000:r10000, u1000) = k21000 + +3.
Verify that the caller's kernel ids can be mapped to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k20000:r10000, k21000) = u1000 + +So the ownership that lands on disk will be ``u1000``. + +Example 3 reconsidered +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k0:r4294967295 + mount idmapping: u0:k10000:r10000 + +The same translation algorithm works with the third example. + +1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: + + make_kuid(u0:k10000:r10000, u1000) = k11000 + +2. Translate the caller's kernel id into a kernel id in the filesystem's + idmapping:: + + mapped_fsuid(k11000): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + + /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +3. Verify that the caller's kernel ids can be mapped to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k0:r4294967295, k1000) = u1000 + +So the ownership that lands on disk will be ``u1000``. + +Example 4 reconsidered +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + file id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k0:r4294967295 + mount idmapping: u0:k10000:r10000 + +In order to report ownership to userspace the kernel now does three steps using +the translation algorithm we introduced earlier: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Translate the kernel id into a kernel id in the mount's idmapping:: + + i_uid_into_mnt(k1000): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(u0:k0:r4294967295, k1000) = u1000 + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + +3.
Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k11000) = u1000 + +Earlier, the caller's kernel id couldn't be crossmapped in the filesystem's +idmapping. With the idmapped mount in place it now can be crossmapped into the +filesystem's idmapping via the mount's idmapping. The file will now be created +with ``u1000`` according to the mount's idmapping. + +Example 5 reconsidered +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + file id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + mount idmapping: u0:k10000:r10000 + +Again, in order to report ownership to userspace the kernel now does three +steps using the translation algorithm we introduced earlier: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k20000:r10000, u1000) = k21000 + +2. Translate the kernel id into a kernel id in the mount's idmapping:: + + i_uid_into_mnt(k21000): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(u0:k20000:r10000, k21000) = u1000 + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + +3. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k11000) = u1000 + +Earlier, the file's kernel id couldn't be crossmapped in the filesystem's +idmapping. With the idmapped mount in place it now can be crossmapped into the +filesystem's idmapping via the mount's idmapping. The file is now owned by +``u1000`` according to the mount's idmapping. + +Changing ownership on a home directory +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We've seen above how idmapped mounts can be used to translate between +idmappings when either the caller, the filesystem or both uses a non-initial +idmapping. A wide range of usecases exist when the caller is using +a non-initial idmapping.
This mostly happens in the context of containerized +workloads. The consequence is, as we have seen, that for both filesystems +mounted with the initial idmapping and filesystems mounted with non-initial +idmappings, access to the filesystem isn't working because the kernel ids can't +be crossmapped between the caller's and the filesystem's idmapping. + +As we've seen above idmapped mounts provide a solution to this by remapping the +caller's or filesystem's idmapping according to the mount's idmapping. + +Aside from containerized workloads, idmapped mounts have the advantage that +they also work when both the caller and the filesystem use the initial +idmapping which means users on the host can change the ownership of directories +and files on a per-mount basis. + +Consider our previous example where a user has their home directory on portable +storage. At home they have id ``u1000`` and all files in their home directory +are owned by ``u1000`` whereas at uni or work they have login id ``u1125``. + +Taking their home directory with them becomes problematic. They can't easily +access their files, they might not be able to write to disk without applying +lax permissions or ACLs and even if they can, they will end up with an annoying +mix of files and directories owned by ``u1000`` and ``u1125``. + +Idmapped mounts make it possible to solve this problem. A user can create an idmapped +mount for their home directory on their work computer or their computer at home +depending on what ownership they would prefer to end up on the portable storage +itself. + +Let's assume they want all files on disk to belong to ``u1000``. When the user +plugs in their portable storage at their work station they can set up a job that +creates an idmapped mount with the minimal idmapping ``u1000:k1125:r1``.
So now +when they create a file the kernel performs the following steps we already know +from above:: + + caller id: u1125 + caller idmapping: u0:k0:r4294967295 + filesystem idmapping: u0:k0:r4294967295 + mount idmapping: u1000:k1125:r1 + +1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: + + make_kuid(u0:k0:r4294967295, u1125) = k1125 + +2. Translate the caller's kernel id into a kernel id in the filesystem's + idmapping:: + + mapped_fsuid(k1125): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(u1000:k1125:r1, k1125) = u1000 + + /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +3. Verify that the caller's kernel ids can be mapped to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k0:r4294967295, k1000) = u1000 + +So ultimately the file will be created with ``u1000`` on disk. + +Now let's briefly look at what ownership the caller with id ``u1125`` will see +on their work computer: + +:: + + file id: u1000 + caller idmapping: u0:k0:r4294967295 + filesystem idmapping: u0:k0:r4294967295 + mount idmapping: u1000:k1125:r1 + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Translate the kernel id into a kernel id in the mount's idmapping:: + + i_uid_into_mnt(k1000): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(u0:k0:r4294967295, k1000) = u1000 + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(u1000:k1125:r1, u1000) = k1125 + +3. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k0:r4294967295, k1125) = u1125 + +So ultimately the caller will be reported that the file belongs to ``u1125`` +which is the caller's userspace id on their workstation in our example.
+ +The raw userspace id that is put on disk is ``u1000`` so when the user takes +their home directory back to their home computer where they are assigned +``u1000`` using the initial idmapping and mount the filesystem with the initial +idmapping they will see all those files owned by ``u1000``. + +Shortcircuiting +--------------- + +Currently, the implementation of idmapped mounts enforces that the filesystem +is mounted with the initial idmapping. The reason is simply that none of the +filesystems that we targeted were mountable with a non-initial idmapping. But +that might change soon enough. As we've seen above, thanks to the properties of +idmappings the translation works for both filesystems mounted with the initial +idmapping and filesystems with non-initial idmappings. + +Based on this current restriction to filesystems mounted with the initial +idmapping two noticeable shortcuts have been taken: + +1. We always stash a reference to the initial user namespace in ``struct + vfsmount``. Idmapped mounts are thus mounts that have a non-initial user + namespace attached to them. + + In order to support idmapped mounts this needs to be changed. Instead of + stashing the initial user namespace the user namespace the filesystem was + mounted with must be stashed. An idmapped mount is then any mount that has + a different user namespace attached than the filesystem was mounted with. + This has no user-visible consequences. + +2. The translation algorithms in ``mapped_fs*id()`` and ``i_*id_into_mnt()`` + are simplified. + + Let's consider ``mapped_fs*id()`` first. This function translates the + caller's kernel id into a kernel id in the filesystem's idmapping via + a mount's idmapping. The full algorithm is:: + + mapped_fsuid(kid): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(mount-idmapping, kid) = uid + + /* Map the userspace id down into a kernel id in the filesystem's idmapping.
*/ + make_kuid(filesystem-idmapping, uid) = kuid + + We know that the filesystem is always mounted with the initial idmapping as + we enforce this in ``mount_setattr()``. So this can be shortened to:: + + mapped_fsuid(kid): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(mount-idmapping, kid) = uid + + /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ + KUIDT_INIT(uid) = kuid + + Similarly, for ``i_*id_into_mnt()`` which translates the filesystem's kernel + id into a mount's kernel id:: + + i_uid_into_mnt(kid): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(filesystem-idmapping, kid) = uid + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(mount-idmapping, uid) = kuid + + Again, we know that the filesystem is always mounted with the initial + idmapping as we enforce this in ``mount_setattr()``. So this can be + shortened to:: + + i_uid_into_mnt(kid): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + __kuid_val(kid) = uid + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(mount-idmapping, uid) = kuid + +Handling filesystems mounted with non-initial idmappings requires that the +translation functions be converted to their full form. They can still be +shortcircuited on non-idmapped mounts. This has no user-visible consequences. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 246af51b277a..1a2dd4d35717 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -34,6 +34,7 @@ algorithms work. quota seq_file sharedsubtree + idmappings automount-support @@ -72,7 +73,7 @@ Documentation for filesystem implementations.
befs bfs btrfs - cifs/cifsroot + cifs/index ceph coda configfs diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 2183fd8cc350..2a75dd5da7b5 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -271,19 +271,19 @@ prototypes:: locking rules: All except set_page_dirty and freepage may block -====================== ======================== ========= -ops PageLocked(page) i_rwsem -====================== ======================== ========= +====================== ======================== ========= =============== +ops PageLocked(page) i_rwsem invalidate_lock +====================== ======================== ========= =============== writepage: yes, unlocks (see below) -readpage: yes, unlocks +readpage: yes, unlocks shared writepages: set_page_dirty no -readahead: yes, unlocks -readpages: no +readahead: yes, unlocks shared +readpages: no shared write_begin: locks the page exclusive write_end: yes, unlocks exclusive bmap: -invalidatepage: yes +invalidatepage: yes exclusive releasepage: yes freepage: yes direct_IO: @@ -295,7 +295,7 @@ is_partially_uptodate: yes error_remove_page: yes swap_activate: no swap_deactivate: no -====================== ======================== ========= +====================== ======================== ========= =============== ->write_begin(), ->write_end() and ->readpage() may be called from the request handler (/dev/loop). @@ -378,7 +378,10 @@ keep it that way and don't breed new callers. ->invalidatepage() is called when the filesystem must attempt to drop some or all of the buffers from the page when it is being truncated. It returns zero on success. If ->invalidatepage is zero, the kernel uses -block_invalidatepage() instead. +block_invalidatepage() instead. 
The filesystem must exclusively acquire +invalidate_lock before invalidating page cache in truncate / hole punch path +(and thus calling into ->invalidatepage) to block races between page cache +invalidation and page cache filling functions (fault, read, ...). ->releasepage() is called when the kernel is about to try to drop the buffers from the page in preparation for freeing it. It returns zero to @@ -506,6 +509,7 @@ prototypes:: ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll) (struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); @@ -518,12 +522,6 @@ prototypes:: int (*fsync) (struct file *, loff_t start, loff_t end, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, - loff_t *); - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, - loff_t *); - ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, - void __user *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, @@ -536,6 +534,14 @@ prototypes:: size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **, void **); long (*fallocate)(struct file *, int, loff_t, loff_t); + void (*show_fdinfo)(struct seq_file *m, struct file *f); + unsigned (*mmap_capabilities)(struct file *); + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, + loff_t, size_t, unsigned int); + loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags); + int 
(*fadvise)(struct file *, loff_t, loff_t, int); locking rules: All may block. @@ -570,6 +576,25 @@ in sys_read() and friends. the lease within the individual filesystem to record the result of the operation +->fallocate implementation must be really careful to maintain page cache +consistency when punching holes or performing other operations that invalidate +page cache contents. Usually the filesystem needs to call +truncate_inode_pages_range() to invalidate relevant range of the page cache. +However the filesystem usually also needs to update its internal (and on disk) +view of file offset -> disk block mapping. Until this update is finished, the +filesystem needs to block page faults and reads from reloading now-stale page +cache contents from the disk. Since VFS acquires mapping->invalidate_lock in +shared mode when loading pages from disk (filemap_fault(), filemap_read(), +readahead paths), the fallocate implementation must take the invalidate_lock to +prevent reloading. + +->copy_file_range and ->remap_file_range implementations need to serialize +against modifications of file data while the operation is running. For +blocking changes through write(2) and similar operations inode->i_rwsem can be +used. To block changes to file contents via a memory mapping during the +operation, the filesystem must take mapping->invalidate_lock to coordinate +with ->page_mkwrite. + dquot_operations ================ @@ -627,11 +652,11 @@ pfn_mkwrite: yes access: yes ============= ========= =========================== -->fault() is called when a previously not present pte is about -to be faulted in. The filesystem must find and return the page associated -with the passed in "pgoff" in the vm_fault structure. If it is possible that -the page may be truncated and/or invalidated, then the filesystem must lock -the page, then ensure it is not already truncated (the page lock will block +->fault() is called when a previously not present pte is about to be faulted +in. 
The filesystem must find and return the page associated with the passed in +"pgoff" in the vm_fault structure. If it is possible that the page may be +truncated and/or invalidated, then the filesystem must lock invalidate_lock, +then ensure the page is not already truncated (invalidate_lock will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the page locked. The VM will unlock the page. @@ -644,12 +669,14 @@ page table entry. Pointer to entry associated with the page is passed in "pte" field in vm_fault structure. Pointers to entries for other offsets should be calculated relative to "pte". -->page_mkwrite() is called when a previously read-only pte is -about to become writeable. The filesystem again must ensure that there are -no truncate/invalidate races, and then return with the page locked. If -the page has been truncated, the filesystem should not look up a new page -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which -will cause the VM to retry the fault. +->page_mkwrite() is called when a previously read-only pte is about to become +writeable. The filesystem again must ensure that there are no +truncate/invalidate races or races with operations such as ->remap_file_range +or ->copy_file_range, and then return with the page locked. Usually +mapping->invalidate_lock is suitable for proper serialization. If the page has +been truncated, the filesystem should not look up a new page like the ->fault() +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to +retry the fault. ->pfn_mkwrite() is the same as page_mkwrite but when the pte is VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is diff --git a/Documentation/filesystems/mandatory-locking.rst b/Documentation/filesystems/mandatory-locking.rst deleted file mode 100644 index 9ce73544a8f0..000000000000 --- a/Documentation/filesystems/mandatory-locking.rst +++ /dev/null @@ -1,188 +0,0 @@ -.. 
SPDX-License-Identifier: GPL-2.0 - -===================================================== -Mandatory File Locking For The Linux Operating System -===================================================== - - Andy Walker <andy@lysaker.kvaerner.no> - - 15 April 1996 - - (Updated September 2007) - -0. Why you should avoid mandatory locking ------------------------------------------ - -The Linux implementation is prey to a number of difficult-to-fix race -conditions which in practice make it not dependable: - - - The write system call checks for a mandatory lock only once - at its start. It is therefore possible for a lock request to - be granted after this check but before the data is modified. - A process may then see file data change even while a mandatory - lock was held. - - Similarly, an exclusive lock may be granted on a file after - the kernel has decided to proceed with a read, but before the - read has actually completed, and the reading process may see - the file data in a state which should not have been visible - to it. - - Similar races make the claimed mutual exclusion between lock - and mmap similarly unreliable. - -1. What is mandatory locking? ------------------------------- - -Mandatory locking is kernel enforced file locking, as opposed to the more usual -cooperative file locking used to guarantee sequential access to files among -processes. File locks are applied using the flock() and fcntl() system calls -(and the lockf() library routine which is a wrapper around fcntl().) It is -normally a process' responsibility to check for locks on a file it wishes to -update, before applying its own lock, updating the file and unlocking it again. -The most commonly used example of this (and in the case of sendmail, the most -troublesome) is access to a user's mailbox. The mail user agent and the mail -transfer agent must guard against updating the mailbox at the same time, and -prevent reading the mailbox while it is being updated. 
- -In a perfect world all processes would use and honour a cooperative, or -"advisory" locking scheme. However, the world isn't perfect, and there's -a lot of poorly written code out there. - -In trying to address this problem, the designers of System V UNIX came up -with a "mandatory" locking scheme, whereby the operating system kernel would -block attempts by a process to write to a file that another process holds a -"read" -or- "shared" lock on, and block attempts to both read and write to a -file that a process holds a "write " -or- "exclusive" lock on. - -The System V mandatory locking scheme was intended to have as little impact as -possible on existing user code. The scheme is based on marking individual files -as candidates for mandatory locking, and using the existing fcntl()/lockf() -interface for applying locks just as if they were normal, advisory locks. - -.. Note:: - - 1. In saying "file" in the paragraphs above I am actually not telling - the whole truth. System V locking is based on fcntl(). The granularity of - fcntl() is such that it allows the locking of byte ranges in files, in - addition to entire files, so the mandatory locking rules also have byte - level granularity. - - 2. POSIX.1 does not specify any scheme for mandatory locking, despite - borrowing the fcntl() locking scheme from System V. The mandatory locking - scheme is defined by the System V Interface Definition (SVID) Version 3. - -2. Marking a file for mandatory locking ---------------------------------------- - -A file is marked as a candidate for mandatory locking by setting the group-id -bit in its file mode but removing the group-execute bit. This is an otherwise -meaningless combination, and was chosen by the System V implementors so as not -to break existing user programs. - -Note that the group-id bit is usually automatically cleared by the kernel when -a setgid file is written to. This is a security measure. 
The kernel has been -modified to recognize the special case of a mandatory lock candidate and to -refrain from clearing this bit. Similarly the kernel has been modified not -to run mandatory lock candidates with setgid privileges. - -3. Available implementations ----------------------------- - -I have considered the implementations of mandatory locking available with -SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. - -Generally I have tried to make the most sense out of the behaviour exhibited -by these three reference systems. There are many anomalies. - -All the reference systems reject all calls to open() for a file on which -another process has outstanding mandatory locks. This is in direct -contravention of SVID 3, which states that only calls to open() with the -O_TRUNC flag set should be rejected. The Linux implementation follows the SVID -definition, which is the "Right Thing", since only calls with O_TRUNC can -modify the contents of the file. - -HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not -just mandatory locks. That would appear to contravene POSIX.1. - -mmap() is another interesting case. All the operating systems mentioned -prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX -also disallows advisory locks for such a file. SVID actually specifies the -paranoid HP-UX behaviour. - -In my opinion only MAP_SHARED mappings should be immune from locking, and then -only from mandatory locks - that is what is currently implemented. - -SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for -mandatory locks, so reads and writes to locked files always block when they -should return EAGAIN. - -I'm afraid that this is such an esoteric area that the semantics described -below are just as valid as any others, so long as the main points seem to -agree. - -4. Semantics ------------- - -1. 
Mandatory locks can only be applied via the fcntl()/lockf() locking - interface - in other words the System V/POSIX interface. BSD style - locks using flock() never result in a mandatory lock. - -2. If a process has locked a region of a file with a mandatory read lock, then - other processes are permitted to read from that region. If any of these - processes attempts to write to the region it will block until the lock is - released, unless the process has opened the file with the O_NONBLOCK - flag in which case the system call will return immediately with the error - status EAGAIN. - -3. If a process has locked a region of a file with a mandatory write lock, all - attempts to read or write to that region block until the lock is released, - unless a process has opened the file with the O_NONBLOCK flag in which case - the system call will return immediately with the error status EAGAIN. - -4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has - any mandatory locks owned by other processes will be rejected with the - error status EAGAIN. - -5. Attempts to apply a mandatory lock to a file that is memory mapped and - shared (via mmap() with MAP_SHARED) will be rejected with the error status - EAGAIN. - -6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) - that has any mandatory locks in effect will be rejected with the error status - EAGAIN. - -5. Which system calls are affected? ------------------------------------ - -Those which modify a file's contents, not just the inode. That gives read(), -write(), readv(), writev(), open(), creat(), mmap(), truncate() and -ftruncate(). truncate() and ftruncate() are considered to be "write" actions -for the purposes of mandatory locking. - -The affected region is usually defined as stretching from the current position -for the total number of bytes read or written. 
For the truncate calls it is -defined as the bytes of a file removed or added (we must also consider bytes -added, as a lock can specify just "the whole file", rather than a specific -range of bytes.) - -Note 3: I may have overlooked some system calls that need mandatory lock -checking in my eagerness to get this code out the door. Please let me know, or -better still fix the system calls yourself and submit a patch to me or Linus. - -6. Warning! ------------ - -Not even root can override a mandatory lock, so runaway processes can wreak -havoc if they lock crucial files. The way around it is to change the file -permissions (remove the setgid bit) before trying to read or write to it. -Of course, that might be a bit tricky if the system is hung :-( - -7. The "mand" mount option --------------------------- -Mandatory locking is disabled on all filesystems by default, and must be -administratively enabled by mounting with "-o mand". That mount option -is only allowed if the mounting task has the CAP_SYS_ADMIN capability. - -Since kernel v4.5, it is possible to disable mandatory locking -altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel -with this disabled will reject attempts to mount filesystems with the -"mand" mount option with the error status EPERM. diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index cfc81e98e0b8..4e5b26f03d5b 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -2762,7 +2762,7 @@ listed in: put_prev_task_idle kmem_cache_create pick_next_task_rt - get_online_cpus + cpus_read_lock pick_next_task_fair mutex_lock [...] 
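Section 2 of the removed mandatory-locking.rst above states that a file is a mandatory-locking candidate when its group-id bit is set and its group-execute bit is cleared. The rule can be sketched as a mode-bit test. This is a minimal illustration of the documented rule only; the function name `is_mandatory_lock_candidate` is hypothetical and is not taken from the patch or from the kernel.

```python
import stat


def is_mandatory_lock_candidate(mode: int) -> bool:
    """Hypothetical helper: apply the rule from the removed document.

    A file is a candidate when the set-group-ID bit is set and the
    group-execute bit is cleared. Other mode bits are ignored.
    """
    setgid_set = bool(mode & stat.S_ISGID)
    group_exec_set = bool(mode & stat.S_IXGRP)
    return setgid_set and not group_exec_set
```

For a real file, the mode could be taken from `os.stat(path).st_mode`. As the document notes, the bit combination alone had no effect unless the filesystem was also mounted with "-o mand".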
diff --git a/Documentation/translations/zh_CN/cpu-freq/cpu-drivers.rst b/Documentation/translations/zh_CN/cpu-freq/cpu-drivers.rst index 5ae9cfa2ec55..334f30ae198b 100644 --- a/Documentation/translations/zh_CN/cpu-freq/cpu-drivers.rst +++ b/Documentation/translations/zh_CN/cpu-freq/cpu-drivers.rst @@ -80,8 +80,6 @@ CPUfreq核心层注册一个cpufreq_driver结构体。 .resume - 一个指向per-policy恢复函数的指针,该函数在关中断且在调节器再一次开始前被 调用。 - .ready - 一个指向per-policy准备函数的指针,该函数在策略完全初始化之后被调用。 - .attr - 一个指向NULL结尾的"struct freq_attr"列表的指针,该函数允许导出值到 sysfs。 diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 1409e40e6345..b7070d76f076 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -160,7 +160,6 @@ Code Seq# Include File Comments 'K' all linux/kd.h 'L' 00-1F linux/loop.h conflict! 'L' 10-1F drivers/scsi/mpt3sas/mpt3sas_ctl.h conflict! -'L' 20-2F linux/lightnvm.h 'L' E0-FF linux/ppdd.h encrypted disk device driver <http://linux01.gwdg.de/~alatham/ppdd.html> 'M' all linux/soundcard.h conflict! 
diff --git a/Documentation/userspace-api/spec_ctrl.rst b/Documentation/userspace-api/spec_ctrl.rst index 7ddd8f667459..5e8ed9eef9aa 100644 --- a/Documentation/userspace-api/spec_ctrl.rst +++ b/Documentation/userspace-api/spec_ctrl.rst @@ -106,3 +106,11 @@ Speculation misfeature controls * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0); * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0); * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0); + +- PR_SPEC_L1D_FLUSH: Flush L1D Cache on context switch out of the task + (works only when tasks run on non SMT cores) + + Invocations: + * prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_ENABLE, 0, 0); + * prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, PR_SPEC_DISABLE, 0, 0); diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst index 5f62b3b86357..ccb7e86bf8d9 100644 --- a/Documentation/x86/x86_64/boot-options.rst +++ b/Documentation/x86/x86_64/boot-options.rst @@ -126,7 +126,7 @@ Idle loop Rebooting ========= - reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old] + reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old] bios Use the CPU reboot vector for warm reset warm @@ -145,6 +145,8 @@ Rebooting Use efi reset_system runtime service. If EFI is not configured or the EFI reset does not work, the reboot path attempts the reset using the keyboard controller. + pci + Use a write to the PCI config space register 0xcf9 to trigger reboot. Using warm reset will be much faster especially on big memory systems because the BIOS will not go through the memory check. @@ -155,6 +157,13 @@ Rebooting Don't stop other CPUs on reboot. This can make reboot more reliable in some cases. 
+ reboot=default + There are some built-in platform specific "quirks" - you may see: + "reboot: <name> series board detected. Selecting <type> for reboots." + In the case where you think the quirk is in error (e.g. you have + newer BIOS, or newer board) using this option will ignore the built-in + quirk table, and use the generic default reboot actions. + Non Executable Mappings =======================
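The reboot= grammar in the boot-options.rst hunk above (a type selected by its first letter, an optional warm/cold mode after a comma, and the new "default" value) can be read as follows. This is a rough sketch of the documented syntax only, not the kernel's own parser; `parse_reboot`, `REBOOT_TYPES`, `REBOOT_MODES` and the result keys are hypothetical names chosen for illustration.

```python
# Hypothetical reader for a reboot= value, based only on the grammar quoted
# in boot-options.rst. It does not reproduce the kernel's parsing code.
REBOOT_TYPES = {
    "b": "bios",
    "t": "triple",
    "k": "kbd",
    "a": "acpi",
    "e": "efi",
    "p": "pci",
}
REBOOT_MODES = {"w": "warm", "c": "cold"}


def parse_reboot(value: str) -> dict:
    """Split a reboot= value on commas and classify each token."""
    result = {"type": None, "mode": None, "ignore_quirks": False, "other": []}
    for raw in value.split(","):
        token = raw.strip().lower()
        if not token:
            continue
        if token == "default":
            # Documented as ignoring the built-in quirk table.
            result["ignore_quirks"] = True
        elif token[0] in REBOOT_TYPES:
            # Only the first letter is significant, e.g. p[ci].
            result["type"] = REBOOT_TYPES[token[0]]
        elif token[0] in REBOOT_MODES:
            result["mode"] = REBOOT_MODES[token[0]]
        else:
            # Tokens outside the quoted grammar are kept, not interpreted.
            result["other"].append(token)
    return result
```

For example, `parse_reboot("pci,warm")` selects the new PCI method (a write to config space register 0xcf9) together with a warm reset. Tokens not covered by the quoted grammar are collected in `other` rather than guessed at.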