Diffstat (limited to 'Documentation/filesystems')
 Documentation/filesystems/bcachefs/casefolding.rst      |   18
 Documentation/filesystems/bcachefs/future/idle_work.rst |   78
 Documentation/filesystems/bcachefs/index.rst            |    7
 Documentation/filesystems/erofs.rst                     |    1
 Documentation/filesystems/ext4/atomic_writes.rst        |  225
 Documentation/filesystems/ext4/overview.rst             |    1
 Documentation/filesystems/ext4/super.rst                |   20
 Documentation/filesystems/fscrypt.rst                   |  189
 Documentation/filesystems/index.rst                     |    1
 Documentation/filesystems/iomap/design.rst              |   16
 Documentation/filesystems/locking.rst                   |   54
 Documentation/filesystems/mount_api.rst                 |   16
 Documentation/filesystems/netfs_library.rst             | 1016
 Documentation/filesystems/porting.rst                   |   40
 Documentation/filesystems/relay.rst                     |   26
 Documentation/filesystems/resctrl.rst                   | 1523
 Documentation/filesystems/vfs.rst                       |   39
 17 files changed, 2835 insertions, 435 deletions
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
index ba5de97d155f..871a38f557e8 100644
--- a/Documentation/filesystems/bcachefs/casefolding.rst
+++ b/Documentation/filesystems/bcachefs/casefolding.rst
@@ -88,3 +88,21 @@ This would fail if negative dentry's were cached.
This is slightly suboptimal, but could be fixed in future with some vfs work.
+
+References
+----------
+
+(from Peter Anvin, on the list)
+
+It is worth noting that Microsoft has basically declared their
+"recommended" case folding (upcase) table to be permanently frozen (for
+new filesystem instances in the case where they use an on-disk
+translation table created at format time.) As far as I know they have
+never supported anything other than 1:1 conversion of BMP code points,
+nor normalization.
+
+The exFAT specification enumerates the full recommended upcase table,
+although in a somewhat annoying format (basically a hex dump of
+compressed data):
+
+https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst
new file mode 100644
index 000000000000..59a332509dcd
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/future/idle_work.rst
@@ -0,0 +1,78 @@
+Idle/background work classes design doc:
+
+Right now, our behaviour at idle isn't ideal: it was designed for servers that
+would be under sustained load, to keep pending work at a "medium" level, to
+let work build up so we can process it in more efficient batches, while also
+giving headroom for bursts in load.
+
+But for desktops or mobile - scenarios where work is less sustained and power
+usage is more important - we want to operate differently, with a "rush to
+idle" so the system can go to sleep. We don't want to be dribbling out
+background work while the system should be idle.
+
+The complicating factor is that there are a number of background tasks, which
+form a hierarchy (or a digraph, depending on how you divide it up) - one
+background task may generate work for another.
+
+Thus proper idle detection needs to model this hierarchy.
+
+- Foreground writes
+- Page cache writeback
+- Copygc, rebalance
+- Journal reclaim
+
+When we implement idle detection and rush to idle, we need to be careful not
+to disturb too much the existing behaviour that works reasonably well when the
+system is under sustained load (or perhaps improve it in the case of
+rebalance, which currently does not actively attempt to let work batch up).
+
+SUSTAINED LOAD REGIME
+---------------------
+
+When the system is under continuous load, we want these jobs to run
+continuously - this is perhaps best modelled with a P/D controller, where
+they'll be trying to keep a target value (e.g. fragmented disk space,
+available journal space) roughly in the middle of some range.
+
+The goal under sustained load is to balance our ability to handle load spikes
+without running out of x resource (free disk space, free space in the
+journal), while also letting some work accumulate to be batched (or become
+unnecessary).
+
+For example, we don't want to run copygc too aggressively, because then it
+will be evacuating buckets that would have become empty (been overwritten or
+deleted) anyways, and we don't want to wait until we're almost out of free
+space because then the system will behave unpredictably - suddenly we're doing
+a lot more work to service each write and the system becomes much slower.
+
+IDLE REGIME
+-----------
+
+When the system becomes idle, we should start flushing our pending work
+quicker so the system can go to sleep.
+
+Note that the definition of "idle" depends on where in the hierarchy a task
+is - a task should start flushing work more quickly when the task above it has
+stopped generating new work.
+
+e.g. rebalance should start flushing more quickly when page cache writeback is
+idle, and journal reclaim should only start flushing more quickly when both
+copygc and rebalance are idle.
+
+It's important to let work accumulate when more work is still incoming and we
+still have room, because flushing is always more efficient if we let it batch
+up. New writes may overwrite data before rebalance moves it, and tasks may be
+generating more updates for the btree nodes that journal reclaim needs to flush.
+
+On idle, how much work we do at each interval should be proportional to the
+length of time we have been idle for. If we're idle only for a short duration,
+we shouldn't flush everything right away; the system might wake up and start
+generating new work soon, and flushing immediately might end up doing a lot of
+work that would have been unnecessary if we'd allowed things to batch more.
+
+To summarize, we will need:
+
+ - A list of classes for background tasks that generate work, which will
+ include one "foreground" class.
+ - Tracking for each class - "Am I doing work, or have I gone to sleep?"
+ - And each class should check the class above it when deciding how much work to issue (see the sketch below).
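+
+A minimal sketch of what that per-class tracking could look like (all names
+here are hypothetical; nothing like this exists in the tree yet)::
+
+    /* Tunables; values are illustrative. */
+    #define BASE_QUOTA  1024
+    #define RAMP_NS     (10 * NSEC_PER_MSEC)
+
+    struct bg_work_class {
+        struct bg_work_class *parent; /* class above us in the hierarchy */
+        u64     last_active;          /* when we last generated work */
+        bool    active;
+    };
+
+    /*
+     * Proportional response: issue more work per interval the longer the
+     * class above us (the source of our work) has been idle.
+     */
+    static u64 bg_work_quota(struct bg_work_class *c, u64 now)
+    {
+        u64 idle_ns = c->parent->active ? 0 : now - c->parent->last_active;
+
+        return BASE_QUOTA + idle_ns / RAMP_NS;
+    }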
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
index 3864d0ae89c1..e5c4c2120b93 100644
--- a/Documentation/filesystems/bcachefs/index.rst
+++ b/Documentation/filesystems/bcachefs/index.rst
@@ -29,3 +29,10 @@ At this moment, only a few of these are described here.
casefolding
errorcodes
+
+Future design
+-------------
+.. toctree::
+ :maxdepth: 1
+
+ future/idle_work
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index c293f8e37468..7ddb235aee9d 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -128,6 +128,7 @@ device=%s Specify a path to an extra device to be used together.
fsid=%s Specify a filesystem image ID for Fscache back-end.
domain_id=%s Specify a domain ID in fscache mode so that different images
with the same blobs under a given domain ID can share storage.
+fsoffset=%llu Specify block-aligned filesystem offset for the primary device.
=================== =========================================================
Sysfs Entries
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
new file mode 100644
index 000000000000..f65767df3620
--- /dev/null
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -0,0 +1,225 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _atomic_writes:
+
+Atomic Block Writes
+-------------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+   ext4 has supported atomic write operations on a single filesystem block
+   since v6.13. In this mode, the atomic write unit minimum and maximum sizes
+   are both set to the filesystem block size.
+   e.g. doing an atomic write of 16KB with a 16KB filesystem block size on a
+   64KB page size system is possible.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+ EXT4 now also supports atomic writes spanning multiple filesystem blocks
+ using a feature known as bigalloc. The atomic write unit's minimum and
+ maximum sizes are determined by the filesystem block size and cluster size,
+ based on the underlying device’s supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+ 1. A filesystem with appropriate block size (up to the page size)
+ 4. For multi-fsblock atomic writes:
+
+ 1. The bigalloc feature must be enabled
+ 2. The cluster size must be appropriately configured
+
+NOTE: EXT4 does not support software- or COW-based atomic writes, which means
+atomic writes on ext4 are only supported if the underlying storage device
+supports them.
+
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to allocate in units of multiple filesystem
+blocks, also known as clusters. With bigalloc, each bit within the block
+bitmap represents a cluster (a power-of-2 number of blocks) rather than an
+individual filesystem block.
+EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
+following constraints. The minimum atomic write size is the larger of the fs
+block size and the minimum hardware atomic write unit, and the maximum atomic
+write size is the smaller of the bigalloc cluster size and the maximum hardware
+atomic write unit. Bigalloc ensures that all allocations are aligned to the
+cluster size, which satisfies the LBA alignment requirements of the hardware
+device if the start of the partition/logical volume is itself aligned correctly.
+
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+ extents of appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+ mapped extents), ext4_map_blocks() is called in a loop with
+ EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+ mapped extent by writing zeroes to it and converting any unwritten extents to
+ written, if found within the range.
+
+Note: Writing to a single contiguous underlying extent, whether mapped or
+unwritten, is not inherently problematic. However, writing to a mixed mapping
+region (i.e. one containing a combination of mapped and unwritten extents)
+must be avoided when performing atomic writes.
+
+The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
+flag, require that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written. This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc`` via
+``ext4_map_blocks_atomic()``. EXT4 also force-commits the current journalling
+transaction if the allocation is done over a mixed mapping. This ensures any
+pending metadata updates (like unwritten to written extent conversion) in this
+range are in a consistent state with the file data blocks, before performing the
+actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
+any possible torn writes.
+Only after this step is the actual data write performed by iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happen within leaf
+blocks, except for the case where a two-level tree can be merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag,
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
+ could be a mixed mapping for the underlying requested range. If yes, then we
+ reserve credits of up to ``m_len``, assuming every alternate block can be
+ an unwritten extent followed by a hole.
+
+ 2. During the ``->end_io()`` call, we make sure a single transaction is
+    started for doing the unwritten-to-written conversion. The loop for
+    conversion is mainly required to handle a split extent across leaf blocks.
+
+How to
+------
+
+Creating Filesystems with Atomic Write Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First check the atomic write units supported by block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with a 16KB block size
+ # (requires page size >= 16KB)
+ mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with bigalloc and 64KB cluster size
+ mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+~~~~~~~~~~~~~~~~~~~~~
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+ pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
+
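+For illustration, a fuller sequence might look like the following sketch; the
+16KB size, the ``do_atomic_write()`` helper name and the minimal error
+handling are illustrative only, and ``RWF_ATOMIC`` requires Linux 6.11+ UAPI
+headers:
+
+.. code-block:: c
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/uio.h>
+
+    /* Write 16KB at offset 0 so that it is committed as one untorn unit. */
+    int do_atomic_write(const char *path)
+    {
+        int fd = open(path, O_WRONLY | O_DIRECT); /* atomic writes are DIO-only on ext4 */
+        struct iovec iov;
+        void *buf;
+
+        if (fd < 0)
+            return -1;
+        /* O_DIRECT needs an aligned buffer; align to the write size. */
+        if (posix_memalign(&buf, 16384, 16384))
+            return -1;
+        memset(buf, 0xab, 16384);
+        iov.iov_base = buf;
+        iov.iov_len = 16384;
+        /* Either all 16KB reaches the disk or none of it does. */
+        return pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) == 16384 ? 0 : -1;
+    }
+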
+The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
+following details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
+ separate memory buffers that can be gathered into a write operation
+ (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
+
+The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
+writes are supported.
+
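+A sketch of querying these limits follows; it assumes kernel and glibc headers
+recent enough to expose ``STATX_WRITE_ATOMIC`` and the ``stx_atomic_write_*``
+fields, and the function name is illustrative:
+
+.. code-block:: c
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <sys/stat.h>
+
+    int print_atomic_limits(const char *path)
+    {
+        struct statx stx;
+
+        if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx))
+            return -1;
+        if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
+            printf("atomic unit: min %u, max %u, max segments %u\n",
+                   stx.stx_atomic_write_unit_min,
+                   stx.stx_atomic_write_unit_max,
+                   stx.stx_atomic_write_segments_max);
+        return 0;
+    }
+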
+.. _atomic_write_bdev_support:
+
+Hardware Support
+----------------
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
+
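+For example (the device name is illustrative):
+
+.. code-block:: bash
+
+    # Nonzero values mean the device can do untorn writes
+    cat /sys/block/nvme0n1/queue/atomic_write_unit_min
+    cat /sys/block/nvme0n1/queue/atomic_write_unit_max
+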
+See Also
+--------
+
+* :doc:`bigalloc` - Documentation on the bigalloc feature
+* :doc:`allocators` - Documentation on block allocation in ext4
+* Support for atomic block writes in 6.13:
+ https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 0fad6eda6e15..9d4054c17ecb 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -25,3 +25,4 @@ order.
.. include:: inlinedata.rst
.. include:: eainode.rst
.. include:: verity.rst
+.. include:: atomic_writes.rst
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst
index a1eb4a11a1d0..1b240661bfa3 100644
--- a/Documentation/filesystems/ext4/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -328,9 +328,13 @@ The ext4 superblock is laid out as follows in
- s_checksum_type
- Metadata checksum algorithm type. The only valid value is 1 (crc32c).
* - 0x176
- - __le16
- - s_reserved_pad
- -
+ - \_\_u8
+ - s\_encryption\_level
+ - Versioning level for encryption.
+ * - 0x177
+ - \_\_u8
+ - s\_reserved\_pad
+ - Padding to next 32bits.
* - 0x178
- __le64
- s_kbytes_written
@@ -466,9 +470,13 @@ The ext4 superblock is laid out as follows in
- s_last_error_time_hi
- Upper 8 bits of the s_last_error_time field.
* - 0x27A
- - __u8
- - s_pad[2]
- - Zero padding.
+ - \_\_u8
+ - s\_first\_error\_errcode
+ -
+ * - 0x27B
+ - \_\_u8
+ - s\_last\_error\_errcode
+ -
* - 0x27C
- __le16
- s_encoding
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index e80329908549..29e84d125e02 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -70,7 +70,7 @@ Online attacks
--------------
fscrypt (and storage encryption in general) can only provide limited
-protection, if any at all, against online attacks. In detail:
+protection against online attacks. In detail:
Side-channel attacks
~~~~~~~~~~~~~~~~~~~~
@@ -99,16 +99,23 @@ Therefore, any encryption-specific access control checks would merely
be enforced by kernel *code* and therefore would be largely redundant
with the wide variety of access control mechanisms already available.)
-Kernel memory compromise
-~~~~~~~~~~~~~~~~~~~~~~~~
+Read-only kernel memory compromise
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unless `hardware-wrapped keys`_ are used, an attacker who gains the
+ability to read from arbitrary kernel memory, e.g. by mounting a
+physical attack or by exploiting a kernel security vulnerability, can
+compromise all fscrypt keys that are currently in-use. This also
+extends to cold boot attacks; if the system is suddenly powered off,
+keys the system was using may remain in memory for a short time.
-An attacker who compromises the system enough to read from arbitrary
-memory, e.g. by mounting a physical attack or by exploiting a kernel
-security vulnerability, can compromise all encryption keys that are
-currently in use.
+However, if hardware-wrapped keys are used, then the fscrypt master
+keys and file contents encryption keys (but not other types of fscrypt
+subkeys such as filenames encryption keys) are protected from
+compromises of arbitrary kernel memory.
-However, fscrypt allows encryption keys to be removed from the kernel,
-which may protect them from later compromise.
+In addition, fscrypt allows encryption keys to be removed from the
+kernel, which may protect them from later compromise.
In more detail, the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (or the
FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl) can wipe a master
@@ -144,6 +151,24 @@ However, these ioctls have some limitations:
accelerator hardware (if used by the crypto API to implement any of
the algorithms), or in other places not explicitly considered here.
+Full system compromise
+~~~~~~~~~~~~~~~~~~~~~~
+
+An attacker who gains "root" access and/or the ability to execute
+arbitrary kernel code can freely exfiltrate data that is protected by
+any in-use fscrypt keys. Thus, usually fscrypt provides no meaningful
+protection in this scenario. (Data that is protected by a key that is
+absent throughout the entire attack remains protected, modulo the
+limitations of key removal mentioned above in the case where the key
+was removed prior to the attack.)
+
+However, if `hardware-wrapped keys`_ are used, such attackers will be
+unable to exfiltrate the master keys or file contents keys in a form
+that will be usable after the system is powered off. This may be
+useful if the attacker is significantly time-limited and/or
+bandwidth-limited, so they can only exfiltrate some data and need to
+rely on a later offline attack to exfiltrate the rest of it.
+
Limitations of v1 policies
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -170,6 +195,10 @@ policies on all new encrypted directories.
Key hierarchy
=============
+Note: this section assumes the use of raw keys rather than
+hardware-wrapped keys. The use of hardware-wrapped keys modifies the
+key hierarchy slightly. For details, see `Hardware-wrapped keys`_.
+
Master Keys
-----------
@@ -832,7 +861,9 @@ a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_key_specifier key_spec;
__u32 raw_size;
__u32 key_id;
- __u32 __reserved[8];
+ #define FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED 0x00000001
+ __u32 flags;
+ __u32 __reserved[7];
__u8 raw[];
};
@@ -851,7 +882,7 @@ a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_provisioning_key_payload {
__u32 type;
- __u32 __reserved;
+ __u32 flags;
__u8 raw[];
};
@@ -879,24 +910,32 @@ as follows:
Alternatively, if ``key_id`` is nonzero, this field must be 0, since
in that case the size is implied by the specified Linux keyring key.
-- ``key_id`` is 0 if the raw key is given directly in the ``raw``
- field. Otherwise ``key_id`` is the ID of a Linux keyring key of
- type "fscrypt-provisioning" whose payload is
- struct fscrypt_provisioning_key_payload whose ``raw`` field contains
- the raw key and whose ``type`` field matches ``key_spec.type``.
- Since ``raw`` is variable-length, the total size of this key's
- payload must be ``sizeof(struct fscrypt_provisioning_key_payload)``
- plus the raw key size. The process must have Search permission on
- this key.
-
- Most users should leave this 0 and specify the raw key directly.
- The support for specifying a Linux keyring key is intended mainly to
+- ``key_id`` is 0 if the key is given directly in the ``raw`` field.
+ Otherwise ``key_id`` is the ID of a Linux keyring key of type
+ "fscrypt-provisioning" whose payload is struct
+ fscrypt_provisioning_key_payload whose ``raw`` field contains the
+ key, whose ``type`` field matches ``key_spec.type``, and whose
+ ``flags`` field matches ``flags``. Since ``raw`` is
+ variable-length, the total size of this key's payload must be
+ ``sizeof(struct fscrypt_provisioning_key_payload)`` plus the number
+ of key bytes. The process must have Search permission on this key.
+
+ Most users should leave this 0 and specify the key directly. The
+ support for specifying a Linux keyring key is intended mainly to
allow re-adding keys after a filesystem is unmounted and re-mounted,
- without having to store the raw keys in userspace memory.
+ without having to store the keys in userspace memory.
+
+- ``flags`` contains optional flags from ``<linux/fscrypt.h>``:
+
+ - FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED: This denotes that the key is a
+ hardware-wrapped key. See `Hardware-wrapped keys`_. This flag
+ can't be used if FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR is used.
- ``raw`` is a variable-length field which must contain the actual
key, ``raw_size`` bytes long. Alternatively, if ``key_id`` is
- nonzero, then this field is unused.
+ nonzero, then this field is unused. Note that despite being named
+ ``raw``, if FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED is specified then it
+ will contain a wrapped key, not a raw key.
For v2 policy keys, the kernel keeps track of which user (identified
by effective user ID) added the key, and only allows the key to be
@@ -908,8 +947,8 @@ prevent that other user from unexpectedly removing it. Therefore,
FS_IOC_ADD_ENCRYPTION_KEY may also be used to add a v2 policy key
*again*, even if it's already added by other user(s). In this case,
FS_IOC_ADD_ENCRYPTION_KEY will just install a claim to the key for the
-current user, rather than actually add the key again (but the raw key
-must still be provided, as a proof of knowledge).
+current user, rather than actually add the key again (but the key must
+still be provided, as a proof of knowledge).
FS_IOC_ADD_ENCRYPTION_KEY returns 0 if either the key or a claim to
the key was either added or already exists.
@@ -918,20 +957,23 @@ FS_IOC_ADD_ENCRYPTION_KEY can fail with the following errors:
- ``EACCES``: FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR was specified, but the
caller does not have the CAP_SYS_ADMIN capability in the initial
- user namespace; or the raw key was specified by Linux key ID but the
+ user namespace; or the key was specified by Linux key ID but the
process lacks Search permission on the key.
+- ``EBADMSG``: invalid hardware-wrapped key
- ``EDQUOT``: the key quota for this user would be exceeded by adding
the key
- ``EINVAL``: invalid key size or key specifier type, or reserved bits
were set
-- ``EKEYREJECTED``: the raw key was specified by Linux key ID, but the
- key has the wrong type
-- ``ENOKEY``: the raw key was specified by Linux key ID, but no key
- exists with that ID
+- ``EKEYREJECTED``: the key was specified by Linux key ID, but the key
+ has the wrong type
+- ``ENOKEY``: the key was specified by Linux key ID, but no key exists
+ with that ID
- ``ENOTTY``: this type of filesystem does not implement encryption
- ``EOPNOTSUPP``: the kernel was not configured with encryption
support for this filesystem, or the filesystem superblock has not
- had encryption enabled on it
+  had encryption enabled on it; or a hardware-wrapped key was specified
+ but the filesystem does not support inline encryption or the hardware
+ does not support hardware-wrapped keys
Legacy method
~~~~~~~~~~~~~
@@ -994,9 +1036,8 @@ or removed by non-root users.
These ioctls don't work on keys that were added via the legacy
process-subscribed keyrings mechanism.
-Before using these ioctls, read the `Kernel memory compromise`_
-section for a discussion of the security goals and limitations of
-these ioctls.
+Before using these ioctls, read the `Online attacks`_ section for a
+discussion of the security goals and limitations of these ioctls.
FS_IOC_REMOVE_ENCRYPTION_KEY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1316,7 +1357,8 @@ inline encryption hardware doesn't have the needed crypto capabilities
(e.g. support for the needed encryption algorithm and data unit size)
and where blk-crypto-fallback is unusable. (For blk-crypto-fallback
to be usable, it must be enabled in the kernel configuration with
-CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y.)
+CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y, and the file must be
+protected by a raw key rather than a hardware-wrapped key.)
Currently fscrypt always uses the filesystem block size (which is
usually 4096 bytes) as the data unit size. Therefore, it can only use
@@ -1324,7 +1366,76 @@ inline encryption hardware that supports that data unit size.
Inline encryption doesn't affect the ciphertext or other aspects of
the on-disk format, so users may freely switch back and forth between
-using "inlinecrypt" and not using "inlinecrypt".
+using "inlinecrypt" and not using "inlinecrypt". An exception is that
+files that are protected by a hardware-wrapped key can only be
+encrypted/decrypted by the inline encryption hardware and therefore
+can only be accessed when the "inlinecrypt" mount option is used. For
+more information about hardware-wrapped keys, see below.
+
+Hardware-wrapped keys
+---------------------
+
+fscrypt supports using *hardware-wrapped keys* when the inline
+encryption hardware supports it. Such keys are only present in kernel
+memory in wrapped (encrypted) form; they can only be unwrapped
+(decrypted) by the inline encryption hardware and are temporally bound
+to the current boot. This prevents the keys from being compromised if
+kernel memory is leaked. This is done without limiting the number of
+keys that can be used and while still allowing the execution of
+cryptographic tasks that are tied to the same key but can't use inline
+encryption hardware, e.g. filenames encryption.
+
+Note that hardware-wrapped keys aren't specific to fscrypt; they are a
+block layer feature (part of *blk-crypto*). For more details about
+hardware-wrapped keys, see the block layer documentation at
+:ref:`Documentation/block/inline-encryption.rst
+<hardware_wrapped_keys>`. The rest of this section just focuses on
+the details of how fscrypt can use hardware-wrapped keys.
+
+fscrypt supports hardware-wrapped keys by allowing the fscrypt master
+keys to be hardware-wrapped keys as an alternative to raw keys. To
+add a hardware-wrapped key with `FS_IOC_ADD_ENCRYPTION_KEY`_,
+userspace must specify FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED in the
+``flags`` field of struct fscrypt_add_key_arg and also in the
+``flags`` field of struct fscrypt_provisioning_key_payload when
+applicable. The key must be in ephemerally-wrapped form, not
+long-term wrapped form.
+
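+A sketch of the ioctl call follows; the helper name and the 128-byte buffer
+size are illustrative, and the wrapped key blob is assumed to come from the
+hardware or firmware in ephemerally-wrapped form::
+
+    #include <linux/fscrypt.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+
+    int add_hw_wrapped_key(int mnt_fd, const void *blob, __u32 blob_size)
+    {
+            struct {
+                    struct fscrypt_add_key_arg arg;
+                    __u8 raw[128];  /* space for arg.raw[] */
+            } buf = {};
+
+            buf.arg.key_spec.type = FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER;
+            buf.arg.raw_size = blob_size;
+            buf.arg.flags = FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED;
+            memcpy(buf.arg.raw, blob, blob_size);
+
+            return ioctl(mnt_fd, FS_IOC_ADD_ENCRYPTION_KEY, &buf.arg);
+    }
+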
+Some limitations apply. First, files protected by a hardware-wrapped
+key are tied to the system's inline encryption hardware. Therefore
+they can only be accessed when the "inlinecrypt" mount option is used,
+and they can't be included in portable filesystem images. Second,
+currently the hardware-wrapped key support is only compatible with
+`IV_INO_LBLK_64 policies`_ and `IV_INO_LBLK_32 policies`_, as it
+assumes that there is just one file contents encryption key per
+fscrypt master key rather than one per file. Future work may address
+this limitation by passing per-file nonces down the storage stack to
+allow the hardware to derive per-file keys.
+
+Implementation-wise, to encrypt/decrypt the contents of files that are
+protected by a hardware-wrapped key, fscrypt uses blk-crypto,
+attaching the hardware-wrapped key to the bio crypt contexts. As is
+the case with raw keys, the block layer will program the key into a
+keyslot when it isn't already in one. However, when programming a
+hardware-wrapped key, the hardware doesn't program the given key
+directly into a keyslot but rather unwraps it (using the hardware's
+ephemeral wrapping key) and derives the inline encryption key from it.
+The inline encryption key is the key that actually gets programmed
+into a keyslot, and it is never exposed to software.
+
+However, fscrypt doesn't just do file contents encryption; it also
+uses its master keys to derive filenames encryption keys, key
+identifiers, and sometimes some more obscure types of subkeys such as
+dirhash keys. So even with file contents encryption out of the
+picture, fscrypt still needs a raw key to work with. To get such a
+key from a hardware-wrapped key, fscrypt asks the inline encryption
+hardware to derive a cryptographically isolated "software secret" from
+the hardware-wrapped key. fscrypt uses this "software secret" to key
+its KDF to derive all subkeys other than file contents keys.
+
+Note that this implies that the hardware-wrapped key feature only
+protects the file contents encryption keys. It doesn't protect other
+fscrypt subkeys such as filenames encryption keys.
Direct I/O support
==================
@@ -1409,7 +1520,7 @@ read the ciphertext into the page cache and decrypt it in-place. The
folio lock must be held until decryption has finished, to prevent the
folio from becoming visible to userspace prematurely.
-For the write path (->writepage()) of regular files, filesystems
+For the write path (->writepages()) of regular files, filesystems
cannot encrypt data in-place in the page cache, since the cached
plaintext must be preserved. Instead, filesystems must encrypt into a
temporary buffer or "bounce page", then write out the temporary
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index a9cf8e950b15..32618512a965 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -113,6 +113,7 @@ Documentation for filesystem implementations.
qnx6
ramfs-rootfs-initramfs
relay
+ resctrl
romfs
smb/index
spufs/index
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
index e29651a42eec..f2df9b6df988 100644
--- a/Documentation/filesystems/iomap/design.rst
+++ b/Documentation/filesystems/iomap/design.rst
@@ -243,13 +243,25 @@ The fields are as follows:
regular file data.
This is only useful for FIEMAP.
- * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
- be set by the filesystem for its own purposes.
+ * **IOMAP_F_BOUNDARY**: This indicates that the I/O and its completion must not
+   be merged with any other I/O or completion. Filesystems must use this when
+ submitting I/O to devices that cannot handle I/O crossing certain LBAs
+ (e.g. ZNS devices). This flag applies only to buffered I/O writeback; all
+ other functions ignore it.
+
+ * **IOMAP_F_PRIVATE**: This flag is reserved for filesystem private use.
* **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target
block assigned to it yet and the file system will do that in the bio
submission handler, splitting the I/O as needed.
+ * **IOMAP_F_ATOMIC_BIO**: This indicates write I/O must be submitted with the
+ ``REQ_ATOMIC`` flag set in the bio. Filesystems need to set this flag to
+ inform iomap that the write I/O operation requires torn-write protection
+   inform iomap that the write I/O operation requires torn-write protection
+   based on a HW-offload mechanism. They must also ensure that mapping updates
+   upon completion of the I/O are performed in a single metadata update.
+
These flags can be set by iomap itself during file operations.
The filesystem should supply an ``->iomap_end`` function if it needs
to observe these flags:
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 0ec0bb6eb0fb..2e567e341c3b 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -249,7 +249,6 @@ address_space_operations
========================
prototypes::
- int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *folio);
@@ -280,7 +279,6 @@ locking rules:
====================== ======================== ========= ===============
ops folio locked i_rwsem invalidate_lock
====================== ======================== ========= ===============
-writepage: yes, unlocks (see below)
read_folio: yes, unlocks shared
writepages:
dirty_folio: maybe
@@ -309,54 +307,6 @@ completion.
->readahead() unlocks the folios that I/O is attempted on like ->read_folio().
-->writepage() is used for two purposes: for "memory cleansing" and for
-"sync". These are quite different operations and the behaviour may differ
-depending upon the mode.
-
-If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
-it *must* start I/O against the page, even if that would involve
-blocking on in-progress I/O.
-
-If writepage is called for memory cleansing (sync_mode ==
-WBC_SYNC_NONE) then its role is to get as much writeout underway as
-possible. So writepage should try to avoid blocking against
-currently-in-progress I/O.
-
-If the filesystem is not called for "sync" and it determines that it
-would need to block against in-progress I/O to be able to start new I/O
-against the page the filesystem should redirty the page with
-redirty_page_for_writepage(), then unlock the page and return zero.
-This may also be done to avoid internal deadlocks, but rarely.
-
-If the filesystem is called for sync then it must wait on any
-in-progress I/O and then start new I/O.
-
-The filesystem should unlock the page synchronously, before returning to the
-caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
-value. WRITEPAGE_ACTIVATE means that page cannot really be written out
-currently, and VM should stop calling ->writepage() on this page for some
-time. VM does this by moving page to the head of the active list, hence the
-name.
-
-Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
-and return zero, writepage *must* run set_page_writeback() against the page,
-followed by unlocking it. Once set_page_writeback() has been run against the
-page, write I/O can be submitted and the write I/O completion handler must run
-end_page_writeback() once the I/O is complete. If no I/O is submitted, the
-filesystem must run end_page_writeback() against the page before returning from
-writepage.
-
-That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
-if the filesystem needs the page to be locked during writeout, that is ok, too,
-the page is allowed to be unlocked at any point in time between the calls to
-set_page_writeback() and end_page_writeback().
-
-Note, failure to run either redirty_page_for_writepage() or the combination of
-set_page_writeback()/end_page_writeback() on a page submitted to writepage
-will leave the page itself marked clean but it will be tagged as dirty in the
-radix tree. This incoherency can lead to all sorts of hard-to-debug problems
-in the filesystem like having dirty inodes at umount and losing written data.
-
->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
``*nr_to_write`` pages. ``*nr_to_write`` must be decremented for each page
@@ -364,8 +314,8 @@ which is written. The address_space implementation may write more (or less)
pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
If nr_to_write is NULL, all dirty pages must be written.
-writepages should _only_ write pages which are present on
-mapping->io_pages.
+writepages should _only_ write pages which are present in
+mapping->i_pages.
->dirty_folio() is called from various places in the kernel when
the target folio is marked as needing writeback. The folio cannot be
diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst
index d92c276f1575..e149b89118c8 100644
--- a/Documentation/filesystems/mount_api.rst
+++ b/Documentation/filesystems/mount_api.rst
@@ -671,7 +671,6 @@ The members are as follows:
fsparam_bool() fs_param_is_bool
fsparam_u32() fs_param_is_u32
fsparam_u32oct() fs_param_is_u32_octal
- fsparam_u32hex() fs_param_is_u32_hex
fsparam_s32() fs_param_is_s32
fsparam_u64() fs_param_is_u64
fsparam_enum() fs_param_is_enum
@@ -755,21 +754,6 @@ process the parameters it is given.
* ::
- bool validate_constant_table(const struct constant_table *tbl,
- size_t tbl_size,
- int low, int high, int special);
-
- Validate a constant table. Checks that all the elements are appropriately
- ordered, that there are no duplicates and that the values are between low
- and high inclusive, though provision is made for one allowable special
- value outside of that range. If no special value is required, special
- should just be set to lie inside the low-to-high range.
-
- If all is good, true is returned. If the table is invalid, errors are
- logged to the kernel log buffer and false is returned.
-
- * ::
-
bool fs_validate_description(const char *name,
const struct fs_parameter_description *desc);
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 3886c14f89f4..939b4b624fad 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -1,33 +1,187 @@
.. SPDX-License-Identifier: GPL-2.0
-=================================
-Network Filesystem Helper Library
-=================================
+===================================
+Network Filesystem Services Library
+===================================
.. Contents:
- Overview.
+ - Requests and streams.
+ - Subrequests.
+ - Result collection and retry.
+ - Local caching.
+ - Content encryption (fscrypt).
- Per-inode context.
- Inode context helper functions.
- - Buffered read helpers.
- - Read helper functions.
- - Read helper structures.
- - Read helper operations.
- - Read helper procedure.
- - Read helper cache API.
+ - Inode locking.
+ - Inode writeback.
+ - High-level VFS API.
+ - Unlocked read/write iter.
+ - Pre-locked read/write iter.
+ - Monolithic files API.
+ - Memory-mapped I/O API.
+ - High-level VM API.
+ - Deprecated PG_private2 API.
+ - I/O request API.
+ - Request structure.
+ - Stream structure.
+ - Subrequest structure.
+ - Filesystem methods.
+ - Terminating a subrequest.
+ - Local cache API.
+ - API function reference.
Overview
========
-The network filesystem helper library is a set of functions designed to aid a
-network filesystem in implementing VM/VFS operations. For the moment, that
-just includes turning various VM buffered read operations into requests to read
-from the server. The helper library, however, can also interpose other
-services, such as local caching or local data encryption.
+The network filesystem services library, netfslib, is a set of functions
+designed to aid a network filesystem in implementing VM/VFS API operations. It
+takes over the normal buffered read, readahead, write and writeback and also
+handles unbuffered and direct I/O.
-Note that the library module doesn't link against local caching directly, so
-access must be provided by the netfs.
+The library provides support for (re-)negotiation of I/O sizes and retrying
+failed I/O as well as local caching and will, in the future, provide content
+encryption.
+
+It insulates the filesystem from VM interface changes as much as possible and
+handles VM features such as large multipage folios. The filesystem basically
+just has to provide a way to perform read and write RPC calls.
+
+I/O inside netfslib is organised around a number of objects:
+
+ * A *request*. A request is used to track the progress of the I/O overall and
+ to hold on to resources. The collection of results is done at the request
+ level. The I/O within a request is divided into a number of parallel
+ streams of subrequests.
+
+ * A *stream*. A non-overlapping series of subrequests. The subrequests
+ within a stream do not have to be contiguous.
+
+ * A *subrequest*. This is the basic unit of I/O. It represents a single RPC
+ call or a single cache I/O operation. The library passes these to the
+ filesystem and the cache to perform.
+
+Requests and Streams
+--------------------
+
+When actually performing I/O (as opposed to just copying into the pagecache),
+netfslib will create one or more requests to track the progress of the I/O and
+to hold resources.
+
+A read operation will have a single stream and the subrequests within that
+stream may be of mixed origins, for instance mixing RPC subrequests and cache
+subrequests.
+
+On the other hand, a write operation may have multiple streams, where each
+stream targets a different destination. For instance, there may be one stream
+writing to the local cache and one to the server. Currently, only two streams
+are allowed, but this could be increased if parallel writes to multiple servers
+are desired.
+
+The subrequests within a write stream do not need to match alignment or size
+with the subrequests in another write stream and netfslib performs the tiling
+of subrequests in each stream over the source buffer independently. Further,
+each stream may contain holes that don't correspond to holes in the other
+stream.
+
+In addition, the subrequests do not need to correspond to the boundaries of the
+folios or vectors in the source/destination buffer. The library handles the
+collection of results and the wrangling of folio flags and references.
+
+Subrequests
+-----------
+
+Subrequests are at the heart of the interaction between netfslib and the
+filesystem using it. Each subrequest is expected to correspond to a single
+read or write RPC or cache operation. The library will stitch together the
+results from a set of subrequests to provide a higher level operation.
+
+Netfslib has two interactions with the filesystem or the cache when setting up
+a subrequest. First, there's an optional preparatory step that allows the
+filesystem to negotiate the limits on the subrequest, both in terms of maximum
+number of bytes and maximum number of vectors (e.g. for RDMA). This may
+involve negotiating with the server (e.g. cifs needing to acquire credits).
+
+And, secondly, there's the issuing step in which the subrequest is handed off
+to the filesystem to perform.
+
+Note that these two steps are done slightly differently between read and write:
+
+ * For reads, the VM/VFS tells us how much is being requested up front, so the
+   library can preset maximum values that the cache and then the filesystem
+   can reduce. The cache also gets consulted first on whether it wants to do
+ a read before the filesystem is consulted.
+
+ * For writeback, it is unknown how much there will be to write until the
+ pagecache is walked, so no limit is set by the library.
+
+Once a subrequest is completed, the filesystem or cache informs the library of
+the completion and then collection is invoked. Depending on whether the
+request is synchronous or asynchronous, the collection of results will be done
+in either the application thread or in a work queue.
+
+Result Collection and Retry
+---------------------------
+
+As subrequests complete, the results are collected and collated by the library
+and folio unlocking is performed progressively (if appropriate). Once the
+request is complete, async completion will be invoked (again, if appropriate).
+It is possible for the filesystem to provide interim progress reports to the
+library to cause folio unlocking to happen earlier if possible.
+
+If any subrequests fail, netfslib can retry them. It will wait until all
+subrequests are completed, offer the filesystem the opportunity to fiddle with
+the resources/state held by the request and poke at the subrequests before
+re-preparing and re-issuing the subrequests.
+
+This allows the tiling of contiguous sets of failed subrequests within a stream
+to be changed, adding more subrequests or ditching excess as necessary (for
+instance, if the network sizes change or the server decides it wants smaller
+chunks).
+
+Further, if one or more contiguous cache-read subrequests fail, the library
+will pass them to the filesystem to perform instead, renegotiating and retiling
+them as necessary to fit with the filesystem's parameters rather than those of
+the cache.
+
+Local Caching
+-------------
+
+One of the services netfslib provides, via ``fscache``, is the option to cache
+on local disk a copy of the data obtained from/written to a network filesystem.
+The library will manage the storing, retrieval and some invalidation of data
+automatically on behalf of the filesystem if a cookie is attached to the
+``netfs_inode``.
+
+Note that local caching used to use the PG_private_2 flag (aliased as
+PG_fscache) to keep track of a page that was being written to the cache, but
+this is now deprecated as PG_private_2 will be removed.
+
+Instead, folios that are read from the server for which there was no data in
+the cache will be marked as dirty and will have ``folio->private`` set to a
+special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write.
+If the folio is modified before that happens, the special value will be
+cleared and the folio will become normally dirty.
+
+When writeback occurs, folios that are so marked will only be written to the
+cache and not to the server. Writeback handles mixed cache-only writes and
+server-and-cache writes by using two streams, sending one to the cache and one
+to the server. The server stream will have gaps in it corresponding to those
+folios.
+
+Content Encryption (fscrypt)
+----------------------------
+
+Though it does not do so yet, at some point netfslib will acquire the ability
+to do client-side content encryption on behalf of the network filesystem (Ceph,
+for example). fscrypt can be used for this if appropriate (it may not be
+appropriate for cifs, for example).
+
+The data will be stored encrypted in the local cache using the same manner of
+encryption as the data written to the server and the library will impose bounce
+buffering and RMW cycles as necessary.
Per-Inode Context
@@ -40,10 +194,13 @@ structure is defined::
struct netfs_inode {
struct inode inode;
const struct netfs_request_ops *ops;
- struct fscache_cookie *cache;
+ struct fscache_cookie * cache;
+ loff_t remote_i_size;
+ unsigned long flags;
+ ...
};
-A network filesystem that wants to use netfs lib must place one of these in its
+A network filesystem that wants to use netfslib must place one of these in its
inode wrapper struct instead of the VFS ``struct inode``. This can be done in
a way similar to the following::
@@ -56,7 +213,8 @@ This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed to
directly by the VFS/VM operation tables.
-The structure contains the following fields:
+The structure contains the following fields that are of interest to the
+filesystem:
* ``inode``
@@ -71,6 +229,37 @@ The structure contains the following fields:
Local caching cookie, or NULL if no caching is enabled. This field does not
exist if fscache is disabled.
+ * ``remote_i_size``
+
+ The size of the file on the server. This differs from inode->i_size if
+ local modifications have been made but not yet written back.
+
+ * ``flags``
+
+ A set of flags, some of which the filesystem might be interested in:
+
+ * ``NETFS_ICTX_MODIFIED_ATTR``
+
+ Set if netfslib modifies mtime/ctime. The filesystem is free to ignore
+ this or clear it.
+
+ * ``NETFS_ICTX_UNBUFFERED``
+
+ Do unbuffered I/O upon the file. Like direct I/O but without the
+ alignment limitations. RMW will be performed if necessary. The pagecache
+ will not be used unless mmap() is also used.
+
+ * ``NETFS_ICTX_WRITETHROUGH``
+
+ Do writethrough caching upon the file. I/O will be set up and dispatched
+ as buffered writes are made to the page cache. mmap() does the normal
+ writeback thing.
+
+ * ``NETFS_ICTX_SINGLE_NO_UPLOAD``
+
+ Set if the file has a monolithic content that must be read entirely in a
+ single go and must not be written back to the server, though it can be
+ cached (e.g. AFS directories).
Inode Context Helper Functions
------------------------------
@@ -84,117 +273,250 @@ set the operations table pointer::
then a function to cast from the VFS inode structure to the netfs context::
- struct netfs_inode *netfs_node(struct inode *inode);
+ struct netfs_inode *netfs_inode(struct inode *inode);
and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::
struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);
+Inode Locking
+-------------
+
+A number of functions are provided to manage the locking of i_rwsem for I/O and
+to effectively extend it to provide more separate classes of exclusion::
+
+ int netfs_start_io_read(struct inode *inode);
+ void netfs_end_io_read(struct inode *inode);
+ int netfs_start_io_write(struct inode *inode);
+ void netfs_end_io_write(struct inode *inode);
+ int netfs_start_io_direct(struct inode *inode);
+ void netfs_end_io_direct(struct inode *inode);
+
+The exclusion breaks down into four separate classes:
+
+ 1) Buffered reads and writes.
+
+    Buffered reads can run concurrently with each other and with buffered writes,
+ but buffered writes cannot run concurrently with each other.
+
+ 2) Direct reads and writes.
+
+ Direct (and unbuffered) reads and writes can run concurrently since they do
+ not share local buffering (i.e. the pagecache) and, in a network
+ filesystem, are expected to have exclusion managed on the server (though
+ this may not be the case for, say, Ceph).
+
+ 3) Other major inode modifying operations (e.g. truncate, fallocate).
+
+ These should just access i_rwsem directly.
+
+ 4) mmap().
+
+ mmap'd accesses might operate concurrently with any of the other classes.
+ They might form the buffer for an intra-file loopback DIO read/write. They
+ might be permitted on unbuffered files.
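+
+As an illustration, a buffered read path could bracket ``filemap_read()`` with
+the read helpers like this (the ``myfs_*`` name is hypothetical)::
+
+    static ssize_t myfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+    {
+            struct inode *inode = file_inode(iocb->ki_filp);
+            ssize_t ret;
+
+            ret = netfs_start_io_read(inode);
+            if (ret < 0)
+                    return ret;
+            ret = filemap_read(iocb, iter, 0);
+            netfs_end_io_read(inode);
+            return ret;
+    }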
+
+Inode Writeback
+---------------
+
+Netfslib will pin resources on an inode for future writeback (such as pinning
+use of an fscache cookie) when an inode is dirtied. However, this pinning
+needs careful management. To manage the pinning, the following sequence
+occurs:
+
+ 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the
+ pinning begins (when a folio is dirtied, for example) if the cache is
+ active to stop the cache structures from being discarded and the cache
+ space from being culled. This also prevents re-getting of cache resources
+ if the flag is already set.
+
+ 2) This flag is then cleared inside the inode lock during inode writeback in
+    the VM - and the fact that it was set is transferred to
+    ``->unpinned_netfs_wb`` in ``struct writeback_control``.
+
+ 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced.
+
+ 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup.
+
+ 5) The filesystem invokes netfs to do its cleanup.
+
+To do the cleanup, netfslib provides a function to do the resource unpinning::
+
+ int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);
+
+If the filesystem doesn't need to do anything else, this may be set as its
+``.write_inode`` method.
+
+Further, if an inode is deleted, the filesystem's write_inode method may not
+get called, so::
+
+ void netfs_clear_inode_writeback(struct inode *inode, const void *aux);
-Buffered Read Helpers
-=====================
+must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.
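+
+Putting these together, a minimal hookup might look like the following sketch
+(``myfs_*`` names are hypothetical; the aux pointer and any extra teardown are
+filesystem-specific)::
+
+    static void myfs_evict_inode(struct inode *inode)
+    {
+            truncate_inode_pages_final(&inode->i_data);
+            netfs_clear_inode_writeback(inode, NULL);
+            clear_inode(inode);
+    }
+
+    static const struct super_operations myfs_super_ops = {
+            .write_inode    = netfs_unpin_writeback,
+            .evict_inode    = myfs_evict_inode,
+    };
+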
-The library provides a set of read helpers that handle the ->read_folio(),
-->readahead() and much of the ->write_begin() VM operations and translate them
-into a common call framework.
-The following services are provided:
+High-Level VFS API
+==================
- * Handle folios that span multiple pages.
+Netfslib provides a number of sets of API calls for the filesystem to delegate
+VFS operations to. Netfslib, in turn, will call out to the filesystem and the
+cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene
+at various times.
- * Insulate the netfs from VM interface changes.
+Unlocked Read/Write Iter
+------------------------
- * Allow the netfs to arbitrarily split reads up into pieces, even ones that
- don't match folio sizes or folio alignments and that may cross folios.
+The first API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS read/write_iter methods::
- * Allow the netfs to expand a readahead request in both directions to meet its
- needs.
+ ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);
- * Allow the netfs to partially fulfil a read, which will then be resubmitted.
+They can be assigned directly to ``.read_iter`` and ``.write_iter``. They
+perform the inode locking themselves and the first two will switch between
+buffered I/O and DIO as appropriate.
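+
+For instance, a filesystem delegating these operations wholesale might use
+something like the following sketch (``myfs`` is hypothetical; other methods
+are omitted)::
+
+	const struct file_operations myfs_file_ops = {
+		.llseek		= generic_file_llseek,
+		.read_iter	= netfs_file_read_iter,
+		.write_iter	= netfs_file_write_iter,
+	};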
- * Handle local caching, allowing cached data and server-read data to be
- interleaved for a single request.
+Pre-Locked Read/Write Iter
+--------------------------
- * Handle clearing of bufferage that isn't on the server.
+The second API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS methods, but needs to do some
+other work before or after calling netfslib whilst still inside the locked
+section (e.g. Ceph negotiating caps).  The unbuffered read function is::
- * Handle retrying of reads that failed, switching reads from the cache to the
- server as necessary.
+ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);
- * In the future, this is a place that other services can be performed, such as
- local encryption of data to be stored remotely or in the cache.
+This must not be assigned directly to ``.read_iter`` and the filesystem is
+responsible for performing the inode locking before calling it. In the case of
+buffered read, the filesystem should use ``filemap_read()``.
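+
+As an illustrative sketch (hypothetical ``myfs`` names; the server negotiation
+that would justify using this API is elided), such a method might look like::
+
+	static ssize_t myfs_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+	{
+		struct inode *inode = file_inode(iocb->ki_filp);
+		ssize_t ret;
+
+		if (!(iocb->ki_flags & IOCB_DIRECT))
+			return filemap_read(iocb, iter, 0);
+
+		inode_lock_shared(inode);
+		/* ... negotiate caps/leases with the server here ... */
+		ret = netfs_unbuffered_read_iter_locked(iocb, iter);
+		inode_unlock_shared(inode);
+		return ret;
+	}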
-From the network filesystem, the helpers require a table of operations. This
-includes a mandatory method to issue a read operation along with a number of
-optional methods.
+There are three functions for writes::
+ ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
+ struct netfs_group *netfs_group);
+ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
+ struct netfs_group *netfs_group);
+ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
+ struct netfs_group *netfs_group);
-Read Helper Functions
+These must not be assigned directly to ``.write_iter`` and the filesystem is
+responsible for performing the inode locking before calling them.
+
+The first two functions are for buffered writes; the first just adds some
+standard write checks and jumps to the second, but if the filesystem wants to
+do the checks itself, it can use the second directly. The third function is
+for unbuffered or DIO writes.
+
+On all three write functions, there is a writeback group pointer (which should
+be NULL if the filesystem doesn't use this). Writeback groups are set on
+folios when they're modified. If a folio to-be-modified is already marked with
+a different group, it is flushed first. The writeback API allows writing back
+of a specific group.
+
+Memory-Mapped I/O API
---------------------
-Three read helpers are provided::
+An API for support of mmap()'d I/O is provided::
+
+ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
+
+This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The
+filesystem should not take the inode lock before calling it, but, as with the
+locked write functions above, this does take a writeback group pointer. If the
+page to be made writable is in a different group, it will be flushed first.
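+
+A minimal sketch of such a delegation (hypothetical ``myfs`` names, no
+writeback groups in use) might be::
+
+	static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
+	{
+		/* No writeback groups are used, so pass NULL. */
+		return netfs_page_mkwrite(vmf, NULL);
+	}
+
+	static const struct vm_operations_struct myfs_vm_ops = {
+		.fault		= filemap_fault,
+		.map_pages	= filemap_map_pages,
+		.page_mkwrite	= myfs_page_mkwrite,
+	};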
+
+Monolithic Files API
+--------------------
+
+There is also a special API set for files for which the content must be read in
+a single RPC (and not written back) and is maintained as a monolithic blob
+(e.g. an AFS directory), though it can be stored and updated in the local cache::
+
+ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
+ void netfs_single_mark_inode_dirty(struct inode *inode);
+ int netfs_writeback_single(struct address_space *mapping,
+ struct writeback_control *wbc,
+ struct iov_iter *iter);
+
+The first function reads from a file into the given buffer, reading from the
+cache in preference if the data is cached there; the second function allows the
+inode to be marked dirty, causing a later writeback; and the third function can
+be called from the writeback code to write the data to the cache, if there is
+one.
- void netfs_readahead(struct readahead_control *ractl);
- int netfs_read_folio(struct file *file,
- struct folio *folio);
- int netfs_write_begin(struct netfs_inode *ctx,
- struct file *file,
- struct address_space *mapping,
- loff_t pos,
- unsigned int len,
- struct folio **_folio,
- void **_fsdata);
+The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be
+used. The writeback function requires the buffer to be of ITER_FOLIOQ type.
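+
+A minimal sketch of a read (assuming the filesystem supplies ``fq``, a folio
+queue backing the blob, and ``size``, the amount to read; the ``myfs`` names
+are hypothetical)::
+
+	static ssize_t myfs_read_blob(struct inode *inode, struct file *file,
+				      struct folio_queue *fq, size_t size)
+	{
+		struct iov_iter iter;
+
+		/* NETFS_ICTX_SINGLE_NO_UPLOAD is assumed to have been set in
+		 * netfs_inode(inode)->flags at inode initialisation time.
+		 */
+		iov_iter_folio_queue(&iter, ITER_DEST, fq, 0, 0, size);
+		return netfs_read_single(inode, file, &iter);
+	}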
-Each corresponds to a VM address space operation. These operations use the
-state in the per-inode context.
+High-Level VM API
+=================
-For ->readahead() and ->read_folio(), the network filesystem just point directly
-at the corresponding read helper; whereas for ->write_begin(), it may be a
-little more complicated as the network filesystem might want to flush
-conflicting writes or track dirty data and needs to put the acquired folio if
-an error occurs after calling the helper.
+Netfslib also provides a number of sets of API calls for the filesystem to
+delegate VM operations to. Again, netfslib, in turn, will call out to the
+filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places
+for it to intervene at various times::
-The helpers manage the read request, calling back into the network filesystem
-through the supplied table of operations. Waits will be performed as
-necessary before returning for helpers that are meant to be synchronous.
+ void netfs_readahead(struct readahead_control *);
+ int netfs_read_folio(struct file *, struct folio *);
+ int netfs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc);
+ bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
+ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
+ bool netfs_release_folio(struct folio *folio, gfp_t gfp);
-If an error occurs, the ->free_request() will be called to clean up the
-netfs_io_request struct allocated. If some parts of the request are in
-progress when an error occurs, the request will get partially completed if
-sufficient data is read.
+These are ``address_space_operations`` methods and can be set directly in the
+operations table.
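+
+For example, a filesystem might fill in its operations table like this (a
+sketch; ``myfs`` is hypothetical)::
+
+	const struct address_space_operations myfs_aops = {
+		.read_folio	= netfs_read_folio,
+		.readahead	= netfs_readahead,
+		.writepages	= netfs_writepages,
+		.dirty_folio	= netfs_dirty_folio,
+		.release_folio	= netfs_release_folio,
+		.invalidate_folio = netfs_invalidate_folio,
+	};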
-Additionally, there is::
+Deprecated PG_private_2 API
+---------------------------
- * void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
- ssize_t transferred_or_error,
- bool was_async);
+There is also a deprecated function for filesystems that still use the
+``->write_begin`` method::
-which should be called to complete a read subrequest. This is given the number
-of bytes transferred or a negative error code, plus a flag indicating whether
-the operation was asynchronous (ie. whether the follow-on processing can be
-done in the current context, given this may involve sleeping).
+ int netfs_write_begin(struct netfs_inode *inode, struct file *file,
+ struct address_space *mapping, loff_t pos, unsigned int len,
+ struct folio **_folio, void **_fsdata);
+It uses the deprecated PG_private_2 flag and so should not be used.
-Read Helper Structures
-----------------------
-The read helpers make use of a couple of structures to maintain the state of
-the read. The first is a structure that manages a read request as a whole::
+I/O Request API
+===============
+
+The I/O request API comprises a number of structures and a number of functions
+that the filesystem may need to use.
+
+Request Structure
+-----------------
+
+The request structure manages the request as a whole, holding some resources
+and state on behalf of the filesystem and tracking the collection of results::
struct netfs_io_request {
+ enum netfs_io_origin origin;
struct inode *inode;
struct address_space *mapping;
- struct netfs_cache_resources cache_resources;
+ struct netfs_group *group;
+ struct netfs_io_stream io_streams[];
void *netfs_priv;
- loff_t start;
- size_t len;
- loff_t i_size;
- const struct netfs_request_ops *netfs_ops;
+ void *netfs_priv2;
+ unsigned long long start;
+ unsigned long long len;
+ unsigned long long i_size;
unsigned int debug_id;
+ unsigned long flags;
...
};
-The above fields are the ones the netfs can use. They are:
+Many of the fields are for internal use, but the fields shown here are of
+interest to the filesystem:
+
+ * ``origin``
+
+ The origin of the request (readahead, read_folio, DIO read, writeback, ...).
* ``inode``
* ``mapping``
@@ -202,11 +524,19 @@ The above fields are the ones the netfs can use. They are:
The inode and the address space of the file being read from. The mapping
may or may not point to inode->i_data.
- * ``cache_resources``
+ * ``group``
+
+ The writeback group this request is dealing with, or NULL. This holds a ref
+ on the group.
+
+ * ``io_streams``
- Resources for the local cache to use, if present.
+ The parallel streams of subrequests available to the request. Currently two
+ are available, but this may be made extensible in future. ``NR_IO_STREAMS``
+ indicates the size of the array.
* ``netfs_priv``
+ * ``netfs_priv2``
The network filesystem's private data. The value for this can be passed in
to the helper functions or set during the request.
@@ -221,37 +551,121 @@ The above fields are the ones the netfs can use. They are:
The size of the file at the start of the request.
- * ``netfs_ops``
-
- A pointer to the operation table. The value for this is passed into the
- helper functions.
-
* ``debug_id``
A number allocated to this operation that can be displayed in trace lines
for reference.
+ * ``flags``
+
+ Flags for managing and controlling the operation of the request. Some of
+ these may be of interest to the filesystem:
+
+ * ``NETFS_RREQ_RETRYING``
+
+ Netfslib sets this when generating retries.
+
+ * ``NETFS_RREQ_PAUSE``
+
+ The filesystem can set this to request that the library's subrequest
+ issuing loop be paused - but care needs to be taken as netfslib may also
+ set it.
+
+ * ``NETFS_RREQ_NONBLOCK``
+ * ``NETFS_RREQ_BLOCKED``
+
+ Netfslib sets the first to indicate that non-blocking mode was set by the
+ caller and the filesystem can set the second to indicate that it would
+ have had to block.
+
+ * ``NETFS_RREQ_USE_PGPRIV2``
+
+ The filesystem can set this if it wants to use PG_private_2 to track
+ whether a folio is being written to the cache. This is deprecated as
+ PG_private_2 is going to go away.
+
+If the filesystem wants more private data than is afforded by this structure,
+then it should wrap it and provide its own allocator.
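+
+A sketch of such wrapping (hypothetical ``myfs`` names; the minimum pool size
+of 100 is arbitrary, and it is assumed the netfs struct must be placed first
+so that the two structures share an address)::
+
+	struct myfs_io_request {
+		struct netfs_io_request netfs;	/* assumed to be first */
+		/* ... filesystem-private state ... */
+	};
+
+	static struct kmem_cache *myfs_request_slab;
+	static mempool_t myfs_request_pool;
+
+	static int __init myfs_init_request_pool(void)
+	{
+		myfs_request_slab = kmem_cache_create("myfs_request",
+						      sizeof(struct myfs_io_request),
+						      0, SLAB_HWCACHE_ALIGN, NULL);
+		if (!myfs_request_slab)
+			return -ENOMEM;
+		return mempool_init_slab_pool(&myfs_request_pool, 100,
+					      myfs_request_slab);
+	}
+
+The pool would then be supplied through the ``request_pool`` pointer in the
+filesystem's operations table (see "Filesystem Methods" below).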
+
+Stream Structure
+----------------
+
+A request is composed of one or more parallel streams and each stream may be
+aimed at a different target.
+
+For read requests, only stream 0 is used. This can contain a mixture of
+subrequests aimed at different sources. For write requests, stream 0 is used
+for the server and stream 1 is used for the cache. For buffered writeback,
+stream 0 is not enabled unless a normal dirty folio is encountered, at which
+point ->begin_writeback() will be invoked and the filesystem can mark the
+stream available.
+
+The stream struct looks like::
+
+ struct netfs_io_stream {
+ unsigned char stream_nr;
+ bool avail;
+ size_t sreq_max_len;
+ unsigned int sreq_max_segs;
+ unsigned int submit_extendable_to;
+ ...
+ };
+
+A number of members are available for access/use by the filesystem:
+
+ * ``stream_nr``
+
+ The number of the stream within the request.
+
+ * ``avail``
+
+ True if the stream is available for use. The filesystem should set this on
+ stream zero in its ->begin_writeback() implementation.
+
+ * ``sreq_max_len``
+ * ``sreq_max_segs``
+
+ These are set by the filesystem or the cache in ->prepare_read() or
+ ->prepare_write() for each subrequest to indicate the maximum number of
+ bytes and, optionally, the maximum number of segments (if not 0) that the
+ subrequest can support.
+
+ * ``submit_extendable_to``
-The second structure is used to manage individual slices of the overall read
-request::
+ The size that a subrequest can be rounded up to beyond the EOF, given the
+ available buffer. This allows the cache to work out if it can do a DIO read
+ or write that straddles the EOF marker.
+
+Subrequest Structure
+--------------------
+
+Individual units of I/O are managed by the subrequest structure. These
+represent slices of the overall request and run independently::
struct netfs_io_subrequest {
struct netfs_io_request *rreq;
- loff_t start;
+ struct iov_iter io_iter;
+ unsigned long long start;
size_t len;
size_t transferred;
unsigned long flags;
+ short error;
unsigned short debug_index;
+ unsigned char stream_nr;
...
};
-Each subrequest is expected to access a single source, though the helpers will
+Each subrequest is expected to access a single source, though the library will
handle falling back from one source type to another. The members are:
* ``rreq``
A pointer to the read request.
+ * ``io_iter``
+
+ An I/O iterator representing a slice of the buffer to be read into or
+ written from.
+
* ``start``
* ``len``
@@ -260,241 +674,300 @@ handle falling back from one source type to another. The members are:
* ``transferred``
- The amount of data transferred so far of the length of this slice. The
- network filesystem or cache should start the operation this far into the
- slice. If a short read occurs, the helpers will call again, having updated
- this to reflect the amount read so far.
+ The amount of data transferred so far for this subrequest. This should be
+ added to with the length of the transfer made by this issuance of the
+ subrequest. If this is less than ``len`` then the subrequest may be
+ reissued to continue.
* ``flags``
- Flags pertaining to the read. There are two of interest to the filesystem
- or cache:
+ Flags for managing the subrequest. A number of them are of interest to the
+ filesystem or cache:
+
+ * ``NETFS_SREQ_MADE_PROGRESS``
+
+ Set by the filesystem to indicate that at least one byte of data was read
+ or written.
+
+ * ``NETFS_SREQ_HIT_EOF``
+
+ The filesystem should set this if a read hit the EOF on the file (in which
+ case ``transferred`` should stop at the EOF). Netfslib may expand the
+ subrequest out to the size of the folio containing the EOF on the off
+ chance that a third party change happened or a DIO read may have asked for
+ more than is available. The library will clear any excess pagecache.
* ``NETFS_SREQ_CLEAR_TAIL``
- This can be set to indicate that the remainder of the slice, from
- transferred to len, should be cleared.
+ The filesystem can set this to indicate that the remainder of the slice,
+ from transferred to len, should be cleared. Do not set if HIT_EOF is set.
+
+ * ``NETFS_SREQ_NEED_RETRY``
+
+ The filesystem can set this to tell netfslib to retry the subrequest.
+
+ * ``NETFS_SREQ_BOUNDARY``
+
+ This can be set by the filesystem on a subrequest to indicate that it ends
+ at a boundary with the filesystem structure (e.g. at the end of a Ceph
+ object). It tells netfslib not to retile (merge or re-split) subrequests
+ across it.
* ``NETFS_SREQ_SEEK_DATA_READ``
- This is a hint to the cache that it might want to try skipping ahead to
- the next data (ie. using SEEK_DATA).
+ This is a hint from netfslib to the cache that it might want to try
+ skipping ahead to the next data (ie. using SEEK_DATA).
+
+ * ``error``
+
+ This is for the filesystem to store the result of the subrequest. It
+ should be set to 0 if successful and a negative error code otherwise.
* ``debug_index``
+ * ``stream_nr``
A number allocated to this slice that can be displayed in trace lines for
- reference.
+ reference and the number of the request stream that it belongs to.
+If necessary, the filesystem can get and put extra refs on the subrequest it is
+given::
-Read Helper Operations
-----------------------
+ void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
+ enum netfs_sreq_ref_trace what);
+ void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
+ enum netfs_sreq_ref_trace what);
-The network filesystem must provide the read helpers with a table of operations
-through which it can issue requests and negotiate::
+using netfs trace codes to indicate the reason. Care must be taken, however,
+as once control of the subrequest is returned to netfslib, the same subrequest
+can be reissued/retried.
+
+Filesystem Methods
+------------------
+
+The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
+use::
struct netfs_request_ops {
- void (*init_request)(struct netfs_io_request *rreq, struct file *file);
+ mempool_t *request_pool;
+ mempool_t *subrequest_pool;
+ int (*init_request)(struct netfs_io_request *rreq, struct file *file);
void (*free_request)(struct netfs_io_request *rreq);
+ void (*free_subrequest)(struct netfs_io_subrequest *rreq);
void (*expand_readahead)(struct netfs_io_request *rreq);
- bool (*clamp_length)(struct netfs_io_subrequest *subreq);
+ int (*prepare_read)(struct netfs_io_subrequest *subreq);
void (*issue_read)(struct netfs_io_subrequest *subreq);
- bool (*is_still_valid)(struct netfs_io_request *rreq);
- int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
- struct folio **foliop, void **_fsdata);
void (*done)(struct netfs_io_request *rreq);
+ void (*update_i_size)(struct inode *inode, loff_t i_size);
+ void (*post_modify)(struct inode *inode);
+ void (*begin_writeback)(struct netfs_io_request *wreq);
+ void (*prepare_write)(struct netfs_io_subrequest *subreq);
+ void (*issue_write)(struct netfs_io_subrequest *subreq);
+ void (*retry_request)(struct netfs_io_request *wreq,
+ struct netfs_io_stream *stream);
+ void (*invalidate_cache)(struct netfs_io_request *wreq);
};
-The operations are as follows:
-
- * ``init_request()``
+The table starts with a pair of optional pointers to memory pools from which
+requests and subrequests can be allocated. If these are not given, netfslib
+has default pools that it will use instead. If the filesystem wraps the netfs
+structs in its own larger structs, then it will need to use its own pools.
+Netfslib will allocate directly from the pools.
- [Optional] This is called to initialise the request structure. It is given
- the file for reference.
+The methods defined in the table are:
+ * ``init_request()``
* ``free_request()``
+ * ``free_subrequest()``
- [Optional] This is called as the request is being deallocated so that the
- filesystem can clean up any state it has attached there.
+ [Optional] A filesystem may implement these to initialise or clean up any
+ resources that it attaches to the request or subrequest.
* ``expand_readahead()``
[Optional] This is called to allow the filesystem to expand the size of a
- readahead read request. The filesystem gets to expand the request in both
- directions, though it's not permitted to reduce it as the numbers may
- represent an allocation already made. If local caching is enabled, it gets
- to expand the request first.
+ readahead request. The filesystem gets to expand the request in both
+ directions, though it must retain the initial region as that may represent
+ an allocation already made. If local caching is enabled, it gets to expand
+ the request first.
Expansion is communicated by changing ->start and ->len in the request
structure. Note that if any change is made, ->len must be increased by at
least as much as ->start is reduced.
- * ``clamp_length()``
-
- [Optional] This is called to allow the filesystem to reduce the size of a
- subrequest. The filesystem can use this, for example, to chop up a request
- that has to be split across multiple servers or to put multiple reads in
- flight.
-
- This should return 0 on success and an error code on error.
-
- * ``issue_read()``
+ * ``prepare_read()``
- [Required] The helpers use this to dispatch a subrequest to the server for
- reading. In the subrequest, ->start, ->len and ->transferred indicate what
- data should be read from the server.
+ [Optional] This is called to allow the filesystem to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, such as is required by RDMA. This information should be set on
+ stream zero in::
- There is no return value; the netfs_subreq_terminated() function should be
- called to indicate whether or not the operation succeeded and how much data
- it transferred. The filesystem also should not deal with setting folios
- uptodate, unlocking them or dropping their refs - the helpers need to deal
- with this as they have to coordinate with copying to the local cache.
+ rreq->io_streams[0].sreq_max_len
+ rreq->io_streams[0].sreq_max_segs
- Note that the helpers have the folios locked, but not pinned. It is
- possible to use the ITER_XARRAY iov iterator to refer to the range of the
- inode that is being operated upon without the need to allocate large bvec
- tables.
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple reads in flight.
- * ``is_still_valid()``
+ Zero should be returned on success and an error code otherwise. A sketch
+ of this method and of ``issue_read()`` is given after this list.
- [Optional] This is called to find out if the data just read from the local
- cache is still valid. It should return true if it is still valid and false
- if not. If it's not still valid, it will be reread from the server.
+ * ``issue_read()``
- * ``check_write_begin()``
+ [Required] Netfslib calls this to dispatch a subrequest to the server for
+ reading. In the subrequest, ->start, ->len and ->transferred indicate what
+ data should be read from the server and ->io_iter indicates the buffer to be
+ used.
- [Optional] This is called from the netfs_write_begin() helper once it has
- allocated/grabbed the folio to be modified to allow the filesystem to flush
- conflicting state before allowing it to be modified.
+ There is no return value; the ``netfs_read_subreq_terminated()`` function
+ should be called to indicate that the subrequest completed either way.
+ ->error, ->transferred and ->flags should be updated before completing. The
+ termination can be done asynchronously.
- It may unlock and discard the folio it was given and set the caller's folio
- pointer to NULL. It should return 0 if everything is now fine (``*foliop``
- left set) or the op should be retried (``*foliop`` cleared) and any other
- error code to abort the operation.
+ Note: the filesystem must not deal with setting folios uptodate, unlocking
+ them or dropping their refs - the library deals with this as it may have to
+ stitch together the results of multiple subrequests that variously overlap
+ the set of folios.
- * ``done``
+ * ``done()``
- [Optional] This is called after the folios in the request have all been
+ [Optional] This is called after the folios in a read request have all been
unlocked (and marked uptodate if applicable).
+ * ``update_i_size()``
+
+ [Optional] This is invoked by netfslib at various points during the write
+ paths to ask the filesystem to update its idea of the file size. If not
+ given, netfslib will set i_size and i_blocks and update the local cache
+ cookie.
+
+ * ``post_modify()``
+
+ [Optional] This is called after netfslib writes to the pagecache or when it
+ allows an mmap'd page to be marked as writable.
+
+ * ``begin_writeback()``
+
+ [Optional] Netfslib calls this when processing a writeback request if it
+ finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
+ indicating it must be written to the server. This allows the filesystem to
+ only set up writeback resources when it knows it's going to have to perform
+ a write.
+
+ * ``prepare_write()``
+
+ [Optional] This is called to allow the filesystem to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, such as is required by RDMA. This information should be set on
+ the stream to which the subrequest belongs::
-Read Helper Procedure
----------------------
-
-The read helpers work by the following general procedure:
-
- * Set up the request.
-
- * For readahead, allow the local cache and then the network filesystem to
- propose expansions to the read request. This is then proposed to the VM.
- If the VM cannot fully perform the expansion, a partially expanded read will
- be performed, though this may not get written to the cache in its entirety.
-
- * Loop around slicing chunks off of the request to form subrequests:
-
- * If a local cache is present, it gets to do the slicing, otherwise the
- helpers just try to generate maximal slices.
-
- * The network filesystem gets to clamp the size of each slice if it is to be
- the source. This allows rsize and chunking to be implemented.
+ rreq->io_streams[subreq->stream_nr].sreq_max_len
+ rreq->io_streams[subreq->stream_nr].sreq_max_segs
- * The helpers issue a read from the cache or a read from the server or just
- clears the slice as appropriate.
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple writes in flight.
- * The next slice begins at the end of the last one.
+ This is not permitted to return an error. Instead, in the event of failure,
+ ``netfs_prepare_write_failed()`` must be called.
- * As slices finish being read, they terminate.
+ * ``issue_write()``
- * When all the subrequests have terminated, the subrequests are assessed and
- any that are short or have failed are reissued:
+ [Required] This is used to dispatch a subrequest to the server for writing.
+ In the subrequest, ->start, ->len and ->transferred indicate what data
+ should be written to the server and ->io_iter indicates the buffer to be
+ used.
- * Failed cache requests are issued against the server instead.
+ There is no return value; the ``netfs_write_subrequest_terminated()``
+ function should be called to indicate that the subrequest completed either
+ way. ->error, ->transferred and ->flags should be updated before
+ completing. The termination can be done asynchronously.
- * Failed server requests just fail.
+ Note: the filesystem must not deal with removing the dirty or writeback
+ marks on folios involved in the operation and should not take refs or pins
+ on them, but should leave retention to netfslib.
- * Short reads against either source will be reissued against that source
- provided they have transferred some more data:
+ * ``retry_request()``
- * The cache may need to skip holes that it can't do DIO from.
+ [Optional] Netfslib calls this at the beginning of a retry cycle. This
+ allows the filesystem to examine the state of the request, the subrequests
+ in the indicated stream and of its own data and make adjustments or
+ renegotiate resources.
+
+ * ``invalidate_cache()``
- * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
- end of the slice instead of reissuing.
+ [Optional] This is called by netfslib to invalidate data stored in the local
+ cache in the event that writing to the local cache fails, providing updated
+ coherency data that netfs can't provide.
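+
+As a rough sketch of the read-side methods (hypothetical ``myfs`` names;
+``myfs_send_read_rpc()`` stands in for the filesystem's transport)::
+
+	static int myfs_prepare_read(struct netfs_io_subrequest *subreq)
+	{
+		/* Cap each read RPC at a hypothetical 1MiB rsize. */
+		subreq->rreq->io_streams[0].sreq_max_len = 1024 * 1024;
+		return 0;
+	}
+
+	static void myfs_issue_read(struct netfs_io_subrequest *subreq)
+	{
+		/* Hypothetical transport: reads subreq->len bytes at
+		 * subreq->start into subreq->io_iter and, on completion,
+		 * updates subreq->error, subreq->transferred and
+		 * subreq->flags, then calls netfs_read_subreq_terminated().
+		 */
+		myfs_send_read_rpc(subreq);
+	}
+
+	static const struct netfs_request_ops myfs_req_ops = {
+		.prepare_read	= myfs_prepare_read,
+		.issue_read	= myfs_issue_read,
+	};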
- * Once the data is read, the folios that have been fully read/cleared:
+Terminating a subrequest
+------------------------
- * Will be marked uptodate.
+When a subrequest completes, there are a number of functions that the
+filesystem or cache can call to inform netfslib of the status change. One
+function is provided to terminate a write subrequest at the preparation stage
+and acts synchronously:
- * If a cache is present, will be marked with PG_fscache.
+ * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``
- * Unlocked
+ Indicate that the ->prepare_write() call failed. The ``error`` field should
+ have been updated.
- * Any folios that need writing to the cache will then have DIO writes issued.
+Note that ->prepare_read() can return an error as a read can simply be aborted.
+Dealing with writeback failure is trickier.
- * Synchronous operations will wait for reading to be complete.
+The other functions are used for subrequests that got as far as being issued:
- * Writes to the cache will proceed asynchronously and the folios will have the
- PG_fscache mark removed when that completes.
+ * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``
- * The request structures will be cleaned up when everything has completed.
+ Tell netfslib that a read subrequest has terminated. The ``error``,
+ ``flags`` and ``transferred`` fields should have been updated.
+ * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``
-Read Helper Cache API
----------------------
+ Tell netfslib that a write subrequest has terminated. Either the amount of
+ data processed or the negative error code can be passed in. This can be
+ used as a kiocb completion function.
-When implementing a local cache to be used by the read helpers, two things are
-required: some way for the network filesystem to initialise the caching for a
-read request and a table of operations for the helpers to call.
+ * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``
-To begin a cache operation on an fscache object, the following function is
-called::
+ This is provided to optionally update netfslib on the incremental progress
+ of a read, allowing some folios to be unlocked early. It does not actually
+ terminate the subrequest. The ``transferred`` field should have been
+ updated.
- int fscache_begin_read_operation(struct netfs_io_request *rreq,
- struct fscache_cookie *cookie);
+Local Cache API
+---------------
-passing in the request pointer and the cookie corresponding to the file. This
-fills in the cache resources mentioned below.
+Netfslib provides a separate API for a local cache to implement, though it
+provides routines somewhat similar to those of the filesystem request API.
-The netfs_io_request object contains a place for the cache to hang its
+Firstly, the netfs_io_request object contains a place for the cache to hang its
state::
struct netfs_cache_resources {
const struct netfs_cache_ops *ops;
void *cache_priv;
void *cache_priv2;
+ unsigned int debug_id;
+ unsigned int inval_counter;
};
-This contains an operations table pointer and two private pointers. The
-operation table looks like the following::
+This contains an operations table pointer and two private pointers, plus the
+debug ID of the fscache cookie for tracing purposes and an invalidation counter
+that is incremented by calls to ``fscache_invalidate()``, allowing cache
+subrequests to be invalidated after completion.
+
+The cache operation table looks like the following::
struct netfs_cache_ops {
void (*end_operation)(struct netfs_cache_resources *cres);
-
void (*expand_readahead)(struct netfs_cache_resources *cres,
loff_t *_start, size_t *_len, loff_t i_size);
-
enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
- loff_t i_size);
-
+ loff_t i_size);
int (*read)(struct netfs_cache_resources *cres,
loff_t start_pos,
struct iov_iter *iter,
bool seek_data,
netfs_io_terminated_t term_func,
void *term_func_priv);
-
- int (*prepare_write)(struct netfs_cache_resources *cres,
- loff_t *_start, size_t *_len, loff_t i_size,
- bool no_space_allocated_yet);
-
- int (*write)(struct netfs_cache_resources *cres,
- loff_t start_pos,
- struct iov_iter *iter,
- netfs_io_terminated_t term_func,
- void *term_func_priv);
-
- int (*query_occupancy)(struct netfs_cache_resources *cres,
- loff_t start, size_t len, size_t granularity,
- loff_t *_data_start, size_t *_data_len);
+ void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
+ void (*issue_write)(struct netfs_io_subrequest *subreq);
};
With a termination handler function pointer::
@@ -511,10 +984,16 @@ The methods defined in the table are:
* ``expand_readahead()``
- [Optional] Called at the beginning of a netfs_readahead() operation to allow
- the cache to expand a request in either direction. This allows the cache to
+ [Optional] Called at the beginning of a readahead operation to allow the
+ cache to expand a request in either direction. This allows the cache to
size the request appropriately for the cache granularity.
+ * ``prepare_read()``
+
+ [Required] Called to configure the next slice of a request. ->start and
+ ->len in the subrequest indicate where and how big the next slice can be;
+ the cache gets to reduce the length to match its granularity requirements.
+
The function is passed pointers to the start and length in its parameters,
plus the size of the file for reference, and adjusts the start and length
appropriately. It should return one of:
@@ -528,12 +1007,6 @@ The methods defined in the table are:
downloaded from the server or read from the cache - or whether slicing
should be given up at the current point.
- * ``prepare_read()``
-
- [Required] Called to configure the next slice of a request. ->start and
- ->len in the subrequest indicate where and how big the next slice can be;
- the cache gets to reduce the length to match its granularity requirements.
-
* ``read()``
[Required] Called to read from the cache. The start file offset is given
@@ -547,44 +1020,33 @@ The methods defined in the table are:
indicating whether the termination is definitely happening in the caller's
context.
- * ``prepare_write()``
+ * ``prepare_write_subreq()``
- [Required] Called to prepare a write to the cache to take place. This
- involves checking to see whether the cache has sufficient space to honour
- the write. ``*_start`` and ``*_len`` indicate the region to be written; the
- region can be shrunk or it can be expanded to a page boundary either way as
- necessary to align for direct I/O. i_size holds the size of the object and
- is provided for reference. no_space_allocated_yet is set to true if the
- caller is certain that no data has been written to that region - for example
- if it tried to do a read from there already.
+ [Required] This is called to allow the cache to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, such as is required by DIO/DMA. This information should be set on
+ the stream to which the subrequest belongs::
- * ``write()``
+ rreq->io_streams[subreq->stream_nr].sreq_max_len
+ rreq->io_streams[subreq->stream_nr].sreq_max_segs
- [Required] Called to write to the cache. The start file offset is given
- along with an iterator to write from, which gives the length also.
-
- Also provided is a pointer to a termination handler function and private
- data to pass to that function. The termination function should be called
- with the number of bytes transferred or an error code, plus a flag
- indicating whether the termination is definitely happening in the caller's
- context.
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple writes in flight.
- * ``query_occupancy()``
+ This is not permitted to return an error. In the event of failure,
+ ``netfs_prepare_write_failed()`` must be called.
- [Required] Called to find out where the next piece of data is within a
- particular region of the cache. The start and length of the region to be
- queried are passed in, along with the granularity to which the answer needs
- to be aligned. The function passes back the start and length of the data,
- if any, available within that region. Note that there may be a hole at the
- front.
+ * ``issue_write()``
- It returns 0 if some data was found, -ENODATA if there was no usable data
- within the region or -ENOBUFS if there is no caching on this file.
+ [Required] This is used to dispatch a subrequest to the cache for writing.
+ In the subrequest, ->start, ->len and ->transferred indicate what data
+ should be written to the cache and ->io_iter indicates the buffer to be
+ used.
-Note that these methods are passed a pointer to the cache resource structure,
-not the read request structure as they could be used in other situations where
-there isn't a read request structure as well, such as writing dirty data to the
-cache.
+ There is no return value; the ``netfs_write_subrequest_terminated()``
+ function should be called to indicate that the subrequest completed either
+ way. ->error, ->transferred and ->flags should be updated before
+ completing. The termination can be done asynchronously.
API Function Reference
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 767b2927c762..3111ef5592f3 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1203,3 +1203,43 @@ should use d_drop();d_splice_alias() and return the result of the latter.
If a positive dentry cannot be returned for some reason, in-kernel
clients such as cachefiles, nfsd, smb/server may not perform ideally but
will fail-safe.
+
+---
+
+**mandatory**
+
+lookup_one(), lookup_one_unlocked(), lookup_one_positive_unlocked() now
+take a qstr instead of a name and len.  These, not the "one_len"
+versions, should be used whenever accessing a filesystem from outside
+that filesystem, through a mount point - which will have a mnt_idmap.
+
+---
+
+**mandatory**
+
+Functions try_lookup_one_len(), lookup_one_len(),
+lookup_one_len_unlocked() and lookup_positive_unlocked() have been
+renamed to try_lookup_noperm(), lookup_noperm(),
+lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now
+take a qstr instead of separate name and length. QSTR() can be used
+when strlen() is needed for the length.
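+
+For example (a sketch; the names are illustrative)::
+
+	struct dentry *child;
+
+	child = lookup_noperm(&QSTR("somefile"), parent);
+	if (IS_ERR(child))
+		return PTR_ERR(child);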
+
+For try_lookup_noperm() a reference to the qstr is passed in case the
+hash might subsequently be needed.
+
+These functions no longer do any permission checking - they previously
+checked that the caller has 'X' permission on the parent.  They must
+ONLY be used internally by a filesystem on itself when it knows that
+permissions are irrelevant or in a context where permission checks have
+already been performed, such as after vfs_path_parent_lookup().
+
+---
+
+**mandatory**
+
+d_hash_and_lookup() is no longer exported or available outside the VFS.
+Use try_lookup_noperm() instead. This adds name validation and takes
+arguments in the opposite order but is otherwise identical.
+
+Using try_lookup_noperm() will require linux/namei.h to be included.
+
diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst
index 04ad083cfe62..46447dbc75ad 100644
--- a/Documentation/filesystems/relay.rst
+++ b/Documentation/filesystems/relay.rst
@@ -32,7 +32,7 @@ functions in the relay interface code - please see that for details.
Semantics
=========
-Each relay channel has one buffer per CPU, each buffer has one or more
+Each relay channel has one buffer per CPU; each buffer has one or more
sub-buffers. Messages are written to the first sub-buffer until it is
too full to contain a new message, in which case it is written to
the next (if available). Messages are never split across sub-buffers.
@@ -40,7 +40,7 @@ At this point, userspace can be notified so it empties the first
sub-buffer, while the kernel continues writing to the next.
When notified that a sub-buffer is full, the kernel knows how many
-bytes of it are padding i.e. unused space occurring because a complete
+bytes of it are padding, i.e., unused space occurring because a complete
message couldn't fit into a sub-buffer. Userspace can use this
knowledge to copy only valid data.
@@ -71,7 +71,7 @@ klog and relay-apps example code
================================
The relay interface itself is ready to use, but to make things easier,
-a couple simple utility functions and a set of examples are provided.
+a couple of simple utility functions and a set of examples are provided.
The relay-apps example tarball, available on the relay sourceforge
site, contains a set of self-contained examples, each consisting of a
@@ -91,7 +91,7 @@ registered will data actually be logged (see the klog and kleak
examples for details).
It is of course possible to use the relay interface from scratch,
-i.e. without using any of the relay-apps example code or klog, but
+i.e., without using any of the relay-apps example code or klog, but
you'll have to implement communication between userspace and kernel,
allowing both to convey the state of buffers (full, empty, amount of
padding). The read() interface both removes padding and internally
@@ -119,7 +119,7 @@ mmap() results in channel buffer being mapped into the caller's
must map the entire file, which is NRBUF * SUBBUFSIZE.
read() read the contents of a channel buffer. The bytes read are
- 'consumed' by the reader, i.e. they won't be available
+ 'consumed' by the reader, i.e., they won't be available
again to subsequent reads. If the channel is being used
in no-overwrite mode (the default), it can be read at any
time even if there's an active kernel writer. If the
@@ -138,7 +138,7 @@ poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.
close() decrements the channel buffer's refcount. When the refcount
- reaches 0, i.e. when no process or kernel client has the
+ reaches 0, i.e., when no process or kernel client has the
buffer open, the channel buffer is freed.
=========== ============================================================
@@ -149,7 +149,7 @@ host filesystem must be mounted. For example::
.. Note::
- the host filesystem doesn't need to be mounted for kernel
+ The host filesystem doesn't need to be mounted for kernel
clients to create or use channels - it only needs to be
mounted when user space applications need access to the buffer
data.
@@ -325,7 +325,7 @@ section, as it pertains mainly to mmap() implementations.
In 'overwrite' mode, also known as 'flight recorder' mode, writes
continuously cycle around the buffer and will never fail, but will
unconditionally overwrite old data regardless of whether it's actually
-been consumed. In no-overwrite mode, writes will fail, i.e. data will
+been consumed. In no-overwrite mode, writes will fail, i.e., data will
be lost, if the number of unconsumed sub-buffers equals the total
number of sub-buffers in the channel. It should be clear that if
there is no consumer or if the consumer can't consume sub-buffers fast
@@ -344,7 +344,7 @@ initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually move on to the next sub-buffer.
-To implement 'no-overwrite' mode, the userspace client would provide
+To implement 'no-overwrite' mode, the userspace client provides
an implementation of the subbuf_start() callback something like the
following::
@@ -364,9 +364,9 @@ following::
return 1;
}
-If the current buffer is full, i.e. all sub-buffers remain unconsumed,
+If the current buffer is full, i.e., all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
-occur yet, i.e. until the consumer has had a chance to read the
+occur yet, i.e., until the consumer has had a chance to read the
current set of ready sub-buffers. For the relay_buf_full() function
to make sense, the consumer is responsible for notifying the relay
interface when sub-buffers have been consumed via
@@ -400,7 +400,7 @@ consulted.
The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
-implements the simplest possible 'no-overwrite' mode, i.e. it does
+implements the simplest possible 'no-overwrite' mode, i.e., it does
nothing but return 0.
Header information can be reserved at the beginning of each sub-buffer
@@ -467,7 +467,7 @@ rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings. It should however only be called when it's safe to
-do so, i.e. when the channel isn't currently being written to.
+do so, i.e., when the channel isn't currently being written to.
Finally, there are a couple of utility callbacks that can be used for
different purposes. buf_mapped() is called whenever a channel buffer
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
new file mode 100644
index 000000000000..c7949dd44f2f
--- /dev/null
+++ b/Documentation/filesystems/resctrl.rst
@@ -0,0 +1,1523 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=====================================================
+User Interface for Resource Control feature (resctrl)
+=====================================================
+
+:Copyright: |copy| 2016 Intel Corporation
+:Authors: - Fenghua Yu <fenghua.yu@intel.com>
+ - Tony Luck <tony.luck@intel.com>
+ - Vikas Shivappa <vikas.shivappa@intel.com>
+
+
+Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
+AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).
+
+This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
+flag bits:
+
+=============================================== ================================
+RDT (Resource Director Technology) Allocation "rdt_a"
+CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
+CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
+CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
+MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
+MBA (Memory Bandwidth Allocation) "mba"
+SMBA (Slow Memory Bandwidth Allocation) ""
+BMEC (Bandwidth Monitoring Event Configuration) ""
+=============================================== ================================
+
+Historically, new features were made visible by default in /proc/cpuinfo. This
+resulted in the feature flags becoming hard to parse by humans. Adding a new
+flag to /proc/cpuinfo should be avoided if user space can obtain information
+about the feature from resctrl's info directory.
+
+To use the feature, mount the file system::
+
+ # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
+
+mount options are:
+
+"cdp":
+ Enable code/data prioritization in L3 cache allocations.
+"cdpl2":
+ Enable code/data prioritization in L2 cache allocations.
+"mba_MBps":
+ Enable the MBA Software Controller (mba_sc) to specify MBA
+ bandwidth in MiBps.
+"debug":
+ Make debug files accessible. Available debug files are annotated with
+ "Available only with debug option".
+
+L2 and L3 CDP are controlled separately.
+
+RDT features are orthogonal. A particular system may support only
+monitoring, only control, or both monitoring and control. Cache
+pseudo-locking is a unique way of using cache control to "pin" or
+"lock" data in the cache. Details can be found in
+"Cache Pseudo-Locking".
+
+
+The mount succeeds if either allocation or monitoring is present, but
+only those files and directories supported by the system will be created.
+For more details on the behavior of the interface during monitoring
+and allocation, see the "Resource alloc and monitor groups" section.
+
+Info directory
+==============
+
+The 'info' directory contains information about the enabled
+resources. Each resource has its own subdirectory. The subdirectory
+names reflect the resource names.
+
+Each subdirectory contains files that describe the allocation properties
+of that resource.
+
+The cache resource (L3/L2) subdirectories contain the following files
+related to allocation:
+
+"num_closids":
+ The number of CLOSIDs which are valid for this
+ resource. The kernel uses the smallest number of
+ CLOSIDs of all enabled resources as limit.
+"cbm_mask":
+ The bitmask which is valid for this resource.
+ This mask is equivalent to 100%.
+"min_cbm_bits":
+ The minimum number of consecutive bits which
+ must be set when writing a mask.
+
+"shareable_bits":
+ Bitmask of resource regions that may be shared with
+ other executing entities (e.g. I/O). Users can take
+ this into account when setting up exclusive cache
+ partitions. Note that some platforms support devices
+ that have their own settings for cache use which can
+ override these bits.
+"bit_usage":
+ Annotated capacity bitmasks showing how all
+ instances of the resource are used. The legend is:
+
+ "0":
+ Corresponding region is unused. When the system's
+ resources have been allocated and a "0" is found
+ in "bit_usage" it is a sign that resources are
+ wasted.
+
+ "H":
+ Corresponding region is used by hardware only
+ but available for software use. If a resource
+ has bits set in "shareable_bits" but not all
+ of these bits appear in the resource groups'
+ schematas, then the bits that appear in
+ "shareable_bits" but in no resource group will
+ be marked as "H".
+ "X":
+ Corresponding region is available for sharing and
+ used by hardware and software. These are the
+ bits that appear in "shareable_bits" as
+ well as a resource group's allocation.
+ "S":
+ Corresponding region is used by software
+ and available for sharing.
+ "E":
+ Corresponding region is used exclusively by
+ one resource group. No sharing allowed.
+ "P":
+ Corresponding region is pseudo-locked. No
+ sharing allowed.
+"sparse_masks":
+ Indicates if non-contiguous 1s values in the CBM are supported.
+
+ "0":
+ Only contiguous 1s values in the CBM are supported.
+ "1":
+ Non-contiguous 1s values in the CBM are supported.
+
+The memory bandwidth (MB) subdirectory contains the following files
+with respect to allocation:
+
+"min_bandwidth":
+ The minimum memory bandwidth percentage which
+ user can request.
+
+"bandwidth_gran":
+ The granularity in which the memory bandwidth
+ percentage is allocated. The allocated
+ b/w percentage is rounded off to the next
+ control step available on the hardware. The
+ available bandwidth control steps are:
+ min_bandwidth + N * bandwidth_gran.
+
+"delay_linear":
+ Indicates if the delay scale is linear or
+ non-linear. This field is purely
+ informational.
+
+"thread_throttle_mode":
+ Indicator on Intel systems of how tasks running on threads
+ of a physical core are throttled in cases where they
+ request different memory bandwidth percentages:
+
+ "max":
+ the smallest percentage is applied
+ to all threads
+ "per-thread":
+ bandwidth percentages are directly applied to
+ the threads running on the core
+
+If RDT monitoring is available there will be an "L3_MON" directory
+with the following files:
+
+"num_rmids":
+ The number of RMIDs available. This is the
+ upper bound for how many "CTRL_MON" + "MON"
+ groups can be created.
+
+"mon_features":
+ Lists the monitoring events if
+ monitoring is enabled for the resource.
+ Example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_local_bytes
+
+ If the system supports Bandwidth Monitoring Event
+ Configuration (BMEC), then the bandwidth events will
+ be configurable. The output will be::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_total_bytes_config
+ mbm_local_bytes
+ mbm_local_bytes_config
+
+"mbm_total_bytes_config", "mbm_local_bytes_config":
+ Read/write files containing the configuration for the mbm_total_bytes
+ and mbm_local_bytes events, respectively, when the Bandwidth
+ Monitoring Event Configuration (BMEC) feature is supported.
+ The event configuration settings are domain specific and affect
+ all the CPUs in the domain. When either event configuration is
+ changed, the bandwidth counters for all RMIDs of both events
+ (mbm_total_bytes as well as mbm_local_bytes) are cleared for that
+ domain. The next read for every RMID will report "Unavailable"
+ and subsequent reads will report the valid value.
+
+ Following are the types of events supported:
+
+ ==== ========================================================
+ Bits Description
+ ==== ========================================================
+ 6 Dirty Victims from the QOS domain to all types of memory
+ 5 Reads to slow memory in the non-local NUMA domain
+ 4 Reads to slow memory in the local NUMA domain
+ 3 Non-temporal writes to non-local NUMA domain
+ 2 Non-temporal writes to local NUMA domain
+ 1 Reads to memory in the non-local NUMA domain
+ 0 Reads to memory in the local NUMA domain
+ ==== ========================================================
+
+ By default, the mbm_total_bytes configuration is set to 0x7f to count
+ all the event types and the mbm_local_bytes configuration is set to
+ 0x15 to count all the local memory events.
+
+ Examples:
+
+ * To view the current configuration::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x7f;1=0x7f;2=0x7f;3=0x7f
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x15;1=0x15;3=0x15;4=0x15
+
+ * To change the mbm_total_bytes to count only reads on domain 0,
+ the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
+ (in hexadecimal 0x33):
+ ::
+
+ # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x33;1=0x7f;2=0x7f;3=0x7f
+
+ * To change the mbm_local_bytes to count all the slow memory reads on
+ domains 0 and 1, the bits 4 and 5 need to be set, which is 110000b
+ in binary (in hexadecimal 0x30):
+ ::
+
+ # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x30;1=0x30;3=0x15;4=0x15
+
+"max_threshold_occupancy":
+ Read/write file provides the largest value (in
+ bytes) at which a previously used LLC_occupancy
+ counter can be considered for re-use.
+
+Finally, in the top level of the "info" directory there is a file
+named "last_cmd_status". This is reset with every "command" issued
+via the file system (making new directories or writing to any of the
+control files). If the command was successful, it will read as "ok".
+If the command failed, it will provide more information that can be
+conveyed in the error returns from file operations. E.g.
+::
+
+ # echo L3:0=f7 > schemata
+ bash: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ mask f7 has non-consecutive 1-bits
+
+Resource alloc and monitor groups
+=================================
+
+Resource groups are represented as directories in the resctrl file
+system. The default group is the root directory which, immediately
+after mounting, owns all the tasks and cpus in the system and can make
+full use of all resources.
+
+On a system with RDT control features additional directories can be
+created in the root directory that specify different amounts of each
+resource (see "schemata" below). The root and these additional top level
+directories are referred to as "CTRL_MON" groups below.
+
+On a system with RDT monitoring the root directory and other top level
+directories contain a directory named "mon_groups" in which additional
+directories can be created to monitor subsets of tasks in the CTRL_MON
+group that is their ancestor. These are called "MON" groups in the rest
+of this document.
+
+Removing a directory will move all tasks and cpus owned by the group it
+represents to the parent. Removing one of the created CTRL_MON groups
+will automatically remove all MON groups below it.
+
+Moving MON group directories to a new parent CTRL_MON group is supported
+for the purpose of changing the resource allocations of a MON group
+without impacting its monitoring data or assigned tasks. This operation
+ is not allowed for MON groups which monitor CPUs. No move operation
+ other than simply renaming a CTRL_MON or MON group is currently
+ allowed.
+
+All groups contain the following files:
+
+"tasks":
+ Reading this file shows the list of all tasks that belong to
+ this group. Writing a task id to the file will add a task to the
+ group. Multiple tasks can be added by separating the task ids
+ with commas. Tasks will be assigned sequentially. Multiple
+ failures are not supported. A single failure encountered while
+ attempting to assign a task will cause the operation to abort; any
+ tasks already added before the failure will remain in the group.
+ Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.
+
+ If the group is a CTRL_MON group the task is removed from
+ whichever previous CTRL_MON group owned the task and also from
+ any MON group that owned the task. If the group is a MON group,
+ then the task must already belong to the CTRL_MON parent of this
+ group. The task is removed from any previous MON group.
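+
+	For example, to move the task with PID 1234 into a group (the
+	group name "p0" is arbitrary)::
+
+		# mkdir /sys/fs/resctrl/p0
+		# echo 1234 > /sys/fs/resctrl/p0/tasks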
+
+
+"cpus":
+ Reading this file shows a bitmask of the logical CPUs owned by
+ this group. Writing a mask to this file will add and remove
+ CPUs to/from this group. As with the tasks file a hierarchy is
+ maintained where MON groups may only include CPUs owned by the
+ parent CTRL_MON group.
+ When the resource group is in pseudo-locked mode this file will
+ only be readable, reflecting the CPUs associated with the
+ pseudo-locked region.
+
+
+"cpus_list":
+ Just like "cpus", only using ranges of CPUs instead of bitmasks.
+
+
+When control is enabled all CTRL_MON groups will also contain:
+
+"schemata":
+ A list of all the resources available to this group.
+ Each resource has its own line and format - see below for details.
+
+"size":
+ Mirrors the display of the "schemata" file to display the size in
+ bytes of each allocation instead of the bits representing the
+ allocation.
+
+"mode":
+ The "mode" of the resource group dictates the sharing of its
+ allocations. A "shareable" resource group allows sharing of its
+ allocations while an "exclusive" resource group does not. A
+ cache pseudo-locked region is created by first writing
+ "pseudo-locksetup" to the "mode" file before writing the cache
+ pseudo-locked region's schemata to the resource group's "schemata"
+ file. On successful pseudo-locked region creation the mode will
+ automatically change to "pseudo-locked".
+
+"ctrl_hw_id":
+ Available only with debug option. The identifier used by hardware
+ for the control group. On x86 this is the CLOSID.
+
+When monitoring is enabled all MON groups will also contain:
+
+"mon_data":
+ This contains a set of files organized by L3 domain and by
+ RDT event. E.g. on a system with two L3 domains there will
+	be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+	directories has one file per event (e.g. "llc_occupancy",
+ "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
+ files provide a read out of the current value of the event for
+ all tasks in the group. In CTRL_MON groups these files provide
+ the sum for all tasks in the CTRL_MON group and all tasks in
+	MON groups. Please see the example section for more details on usage.
+ On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.
+
+"mon_hw_id":
+ Available only with debug option. The identifier used by hardware
+ for the monitor group. On x86 this is the RMID.
+
+When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
+
+"mba_MBps_event":
+ Reading this file shows which memory bandwidth event is used
+ as input to the software feedback loop that keeps memory bandwidth
+ below the value specified in the schemata file. Writing the
+ name of one of the supported memory bandwidth events found in
+ /sys/fs/resctrl/info/L3_MON/mon_features changes the input
+ event.
+
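+For example (a hypothetical session; the event names are those listed
+in "mon_features")::
+
+  # cat /sys/fs/resctrl/p0/mba_MBps_event
+  mbm_local_bytes
+  # echo mbm_total_bytes > /sys/fs/resctrl/p0/mba_MBps_event
+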
+Resource allocation rules
+-------------------------
+
+When a task is running the following rules define which resources are
+available to it:
+
+1) If the task is a member of a non-default group, then the schemata
+ for that group is used.
+
+2) Else if the task belongs to the default group, but is running on a
+ CPU that is assigned to some specific group, then the schemata for the
+ CPU's group is used.
+
+3) Otherwise the schemata for the default group is used.
+
+Resource monitoring rules
+-------------------------
+1) If a task is a member of a MON group, or non-default CTRL_MON group,
+   then RDT events for the task will be reported in that group.
+
+2) If a task is a member of the default CTRL_MON group, but is running
+ on a CPU that is assigned to some specific group, then the RDT events
+ for the task will be reported in that group.
+
+3) Otherwise RDT events for the task will be reported in the root level
+ "mon_data" group.
+
+
+Notes on cache occupancy monitoring and control
+===============================================
+When moving a task from one group to another you should remember that
+this only affects *new* cache allocations by the task. E.g. you may have
+a task in a monitor group showing 3 MB of cache occupancy. If you move
+it to a new group and immediately check the occupancy of the old and new
+groups you will likely see that the old group is still showing 3 MB and
+the new group zero. When the task accesses locations still in cache from
+before the move, the h/w does not update any counters. On a busy system
+you will likely see the occupancy in the old group go down as cache lines
+are evicted and re-used while the occupancy in the new group rises as
+the task accesses memory and loads into the cache are counted based on
+membership in the new group.
+
+The same applies to cache allocation control. Moving a task to a group
+with a smaller cache partition will not evict any cache lines. The
+process may continue to use them from the old partition.
+
+Hardware uses a CLOSID (Class Of Service ID) and an RMID (Resource
+Monitoring ID) to identify a control group and a monitoring group
+respectively. Each resource group is mapped to these IDs based on the
+kind of group. The number of CLOSIDs and RMIDs is limited by the
+hardware, hence the creation of a "CTRL_MON" directory may fail if we
+run out of either CLOSIDs or RMIDs, and the creation of a "MON" group
+may fail if we run out of RMIDs.
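+
+A failed creation due to running out of IDs might look like this
+(hypothetical session; the exact errno reported can vary)::
+
+  # mkdir /sys/fs/resctrl/p2
+  mkdir: cannot create directory '/sys/fs/resctrl/p2': No space left on device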
+
+max_threshold_occupancy - generic concepts
+------------------------------------------
+
+Note that an RMID, once freed, may not be immediately available for use
+as the RMID may still be tagged to cache lines of its previous user.
+Hence such RMIDs are placed on a limbo list and periodically checked to
+see whether the cache occupancy has gone down. If the system has many
+limbo RMIDs that are not yet ready to be reused, the user may see an
+-EBUSY during mkdir.
+
+max_threshold_occupancy is a user configurable value to determine the
+occupancy at which an RMID can be freed.
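+
+For example, the threshold could be raised so that limbo RMIDs are
+reclaimed sooner (the value, in bytes, is illustrative)::
+
+  # echo 32768 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy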
+
+The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in
+bytes for a subset of the RMIDs that are not immediately available for
+allocation. This can't be relied on to produce output every second; it
+may be necessary to attempt to create an empty monitor group to force
+an update. Output may only be produced if creation of a control or
+monitor group fails.
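+
+A sketch of watching this tracepoint (paths assume the usual tracefs
+mount; the mkdir only serves to force an update and may itself fail)::
+
+  # echo 1 > /sys/kernel/tracing/events/resctrl/mon_llc_occupancy_limbo/enable
+  # mkdir /sys/fs/resctrl/mon_groups/probe
+  # cat /sys/kernel/tracing/trace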
+
+Schemata files - general concepts
+---------------------------------
+Each line in the file describes one resource. The line starts with
+the name of the resource, followed by specific values to be applied
+in each of the instances of that resource on the system.
+
+Cache IDs
+---------
+On current generation systems there is one L3 cache per socket and L2
+caches are generally just shared by the hyperthreads on a core, but this
+isn't an architectural requirement. There could be multiple separate L3
+caches on a socket, or multiple cores could share an L2 cache. So instead
+of using "socket" or "core" to define the set of logical cpus sharing
+a resource we use a "Cache ID". At a given cache level this will be a
+unique number across the whole system (but it isn't guaranteed to be a
+contiguous sequence, there may be gaps). To find the ID for each logical
+CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
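+
+For example, to find the L3 cache ID (typically "index3") for CPU 0::
+
+  # cat /sys/devices/system/cpu/cpu0/cache/index3/id
+  0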
+
+Cache Bit Masks (CBM)
+---------------------
+For cache resources we describe the portion of the cache that is available
+for allocation using a bitmask. The maximum value of the mask is defined
+by each cpu model (and may be different for different cache levels). It
+is found using CPUID, but is also provided in the "info" directory of
+the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
+requires that these masks have all the '1' bits in a contiguous block. So
+0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
+and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
+to see whether masks with non-contiguous '1' bits are supported. On a
+system with a 20-bit mask each bit represents 5% of the capacity of the
+cache. You could partition
+the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
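+
+Both properties can be read directly, e.g. for the L3 resource (values
+shown are illustrative)::
+
+  # cat /sys/fs/resctrl/info/L3/cbm_mask
+  fffff
+  # cat /sys/fs/resctrl/info/L3/sparse_masks
+  0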
+
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
+Memory bandwidth allocation is still performed at the L3 cache
+level. I.e. throttling controls are applied to all SNC nodes.
+
+L3 cache allocation bitmaps also apply to all SNC nodes. But note that
+the amount of L3 cache represented by each bit is divided by the number
+of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
+allocation masks each bit normally represents 10MB. With SNC mode enabled
+with two SNC nodes per L3 cache, each bit only represents 5MB.
+
+Memory bandwidth Allocation and monitoring
+==========================================
+
+For Memory bandwidth resource, by default the user controls the resource
+by indicating the percentage of total memory bandwidth.
+
+The minimum bandwidth percentage value for each cpu model is predefined
+and can be looked up through "info/MB/min_bandwidth". The bandwidth
+granularity that is allocated is also dependent on the cpu model and can
+be looked up at "info/MB/bandwidth_gran". The available bandwidth
+control steps are: min_bw + N * bw_gran. Intermediate values are rounded
+to the next control step available on the hardware.
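+
+A worked example with hypothetical values: if min_bandwidth is 10 and
+bandwidth_gran is 10, the valid control steps are 10, 20, 30, ... and a
+request such as "MB:0=35" would land on a neighbouring step (e.g. 40)
+rather than being applied exactly.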
+
+The bandwidth throttling is a core specific mechanism on some of Intel
+SKUs. Using a high bandwidth and a low bandwidth setting on two threads
+sharing a core may result in both threads being throttled to use the
+low bandwidth (see "thread_throttle_mode").
+
+The fact that Memory Bandwidth Allocation (MBA) may be a core-specific
+mechanism whereas Memory Bandwidth Monitoring (MBM) is done at the
+package level may lead to confusion when users try to apply control
+via MBA and then monitor the bandwidth to see if the controls are
+effective. Below are such scenarios:
+
+1. Users may *not* see an increase in actual bandwidth when percentage
+   values are increased:
+
+This can occur when aggregate L2 external bandwidth is more than L3
+external bandwidth. Consider an SKL SKU with 24 cores on a package and
+where L2 external is 10GBps (hence aggregate L2 external bandwidth is
+240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
+threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
+bandwidth of 100GBps although the percentage value specified is only 50%
+<< 100%. Hence increasing the bandwidth percentage will not yield any
+more bandwidth. This is because although the L2 external bandwidth still
+has capacity, the L3 external bandwidth is fully used. Also note that
+this would be dependent on the number of cores the benchmark is run on.
+
+2. The same bandwidth percentage may mean a different actual bandwidth
+   depending on the number of threads:
+
+For the same SKU as in #1, a 'single thread, with 10% bandwidth' and a
+'4 thread, with 10% bandwidth' workload can consume up to 10GBps and
+40GBps respectively, although both have the same percentage bandwidth of
+10%. This is simply because as threads start using more cores in an
+rdtgroup, the actual bandwidth may increase or vary although the user
+specified bandwidth percentage is the same.
+
+In order to mitigate this and make the interface more user friendly,
+resctrl added support for specifying the bandwidth in MiBps as well. The
+kernel underneath would use a software feedback mechanism or a "Software
+Controller (mba_sc)" which reads the actual bandwidth using MBM counters
+and adjusts the memory bandwidth percentages to ensure::
+
+ "actual bandwidth < user specified bandwidth".
+
+By default, the schemata would take the bandwidth percentage values,
+whereas the user can switch to the "MBA software controller" mode using
+the mount option 'mba_MBps'. The schemata format is specified in the
+sections below.
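+
+For example, remounting with the software controller enabled::
+
+  # umount /sys/fs/resctrl
+  # mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl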
+
+L3 schemata file details (code and data prioritization disabled)
+----------------------------------------------------------------
+With CDP disabled the L3 schemata format is::
+
+ L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L3 schemata file details (CDP enabled via mount option to resctrl)
+------------------------------------------------------------------
+When CDP is enabled L3 control is split into two separate resources
+so you can specify independent masks for code and data like this::
+
+ L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+ L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
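+
+CDP at L3 is enabled with the 'cdp' mount option, e.g.::
+
+  # mount -t resctrl -o cdp resctrl /sys/fs/resctrl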
+
+L2 schemata file details
+------------------------
+CDP is supported at L2 using the 'cdpl2' mount option. The schemata
+format is either::
+
+ L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+or
+
+ L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+ L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+
+Memory bandwidth Allocation (default mode)
+------------------------------------------
+
+Memory b/w domain is L3 cache.
+::
+
+ MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Memory bandwidth Allocation specified in MiBps
+----------------------------------------------
+
+Memory bandwidth domain is L3 cache.
+::
+
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
+
+Slow Memory Bandwidth Allocation (SMBA)
+---------------------------------------
+AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
+CXL.memory is the only supported "slow" memory device. With the
+support of SMBA, the hardware enables bandwidth allocation on
+the slow memory devices. If there are multiple such devices in
+the system, the throttling logic groups all the slow sources
+together and applies the limit on them as a whole.
+
+The presence of SMBA (with CXL.memory) is independent of the presence
+of slow memory devices. If there are no such devices on the system,
+then configuring SMBA will have no impact on the performance of the
+system.
+
+The bandwidth domain for slow memory is L3 cache. Its schemata file
+is formatted as:
+::
+
+ SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Reading/writing the schemata file
+---------------------------------
+Reading the schemata file will show the state of all resources
+on all domains. When writing you only need to specify those values
+which you wish to change. E.g.
+::
+
+ # cat schemata
+ L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
+ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+ # echo "L3DATA:2=3c0;" > schemata
+ # cat schemata
+ L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
+ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+
+Reading/writing the schemata file (on AMD systems)
+--------------------------------------------------
+Reading the schemata file will show the current bandwidth limit on all
+domains. The allocated resources are in multiples of one eighth GB/s.
+When writing to the file, you need to specify the cache id for which
+you wish to configure the bandwidth limit.
+
+For example, to allocate 2GB/s limit on the first cache id:
+
+::
+
+ # cat schemata
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "MB:1=16" > schemata
+ # cat schemata
+ MB:0=2048;1= 16;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Reading/writing the schemata file (on AMD systems) with SMBA feature
+--------------------------------------------------------------------
+Reading and writing the schemata file is the same as described in the
+section above for systems without SMBA.
+
+For example, to allocate 8GB/s limit on the first cache id:
+
+::
+
+ # cat schemata
+ SMBA:0=2048;1=2048;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "SMBA:1=64" > schemata
+ # cat schemata
+ SMBA:0=2048;1= 64;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Cache Pseudo-Locking
+====================
+CAT enables a user to specify the amount of cache space that an
+application can fill. Cache pseudo-locking builds on the fact that a
+CPU can still read and write data pre-allocated outside its current
+allocated area on a cache hit. With cache pseudo-locking, data can be
+preloaded into a reserved portion of cache that no application can
+fill, and from that point on will only serve cache hits. The cache
+pseudo-locked memory is made accessible to user space where an
+application can map it into its virtual address space and thus have
+a region of memory with reduced average read latency.
+
+The creation of a cache pseudo-locked region is triggered by a request
+from the user to do so, accompanied by a schemata of the region to be
+pseudo-locked. The cache pseudo-locked region is created as follows:
+
+- Create a CAT allocation CLOSNEW with a CBM matching the schemata
+ from the user of the cache region that will contain the pseudo-locked
+ memory. This region must not overlap with any current CAT allocation/CLOS
+ on the system and no future overlap with this cache region is allowed
+ while the pseudo-locked region exists.
+- Create a contiguous region of memory of the same size as the cache
+ region.
+- Flush the cache, disable hardware prefetchers, disable preemption.
+- Make CLOSNEW the active CLOS and touch the allocated memory to load
+ it into the cache.
+- Set the previous CLOS as active.
+- At this point the closid CLOSNEW can be released - the cache
+ pseudo-locked region is protected as long as its CBM does not appear in
+ any CAT allocation. Even though the cache pseudo-locked region will from
+ this point on not appear in any CBM of any CLOS an application running with
+ any CLOS will be able to access the memory in the pseudo-locked region since
+ the region continues to serve cache hits.
+- The contiguous region of memory loaded into the cache is exposed to
+ user-space as a character device.
+
+Cache pseudo-locking increases the probability that data will remain
+in the cache via carefully configuring the CAT feature and controlling
+application behavior. There is no guarantee that data is placed in
+cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
+“locked” data from cache. Power management C-states may shrink or
+power off cache. Deeper C-states will automatically be restricted on
+pseudo-locked region creation.
+
+It is required that an application using a pseudo-locked region runs
+with affinity to the cores (or a subset of the cores) associated
+with the cache on which the pseudo-locked region resides. A sanity check
+within the code will not allow an application to map pseudo-locked memory
+unless it runs with affinity to cores associated with the cache on which the
+pseudo-locked region resides. The sanity check is only done during the
+initial mmap() handling; there is no enforcement afterwards and the
+application itself needs to ensure it remains affine to the correct cores.
+
+Pseudo-locking is accomplished in two stages:
+
+1) During the first stage the system administrator allocates a portion
+ of cache that should be dedicated to pseudo-locking. At this time an
+ equivalent portion of memory is allocated, loaded into allocated
+ cache portion, and exposed as a character device.
+2) During the second stage a user-space application maps (mmap()) the
+ pseudo-locked memory into its address space.
+
+Cache Pseudo-Locking Interface
+------------------------------
+A pseudo-locked region is created using the resctrl interface as follows:
+
+1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
+2) Change the new resource group's mode to "pseudo-locksetup" by writing
+ "pseudo-locksetup" to the "mode" file.
+3) Write the schemata of the pseudo-locked region to the "schemata" file. All
+ bits within the schemata should be "unused" according to the "bit_usage"
+ file.
+
+On successful pseudo-locked region creation the "mode" file will contain
+"pseudo-locked" and a new character device with the same name as the resource
+group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
+by user space in order to obtain access to the pseudo-locked memory region.
+
+An example of cache pseudo-locked region creation and usage can be found below.
+
+Cache Pseudo-Locking Debugging Interface
+----------------------------------------
+The pseudo-locking debugging interface is enabled by default (if
+CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
+
+There is no explicit way for the kernel to test if a provided memory
+location is present in the cache. The pseudo-locking debugging interface uses
+the tracing infrastructure to provide two ways to measure cache residency of
+the pseudo-locked region:
+
+1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
+ from these measurements are best visualized using a hist trigger (see
+ example below). In this test the pseudo-locked region is traversed at
+ a stride of 32 bytes while hardware prefetchers and preemption
+ are disabled. This also provides a substitute visualization of cache
+ hits and misses.
+2) Cache hit and miss measurements using model specific precision counters if
+ available. Depending on the levels of cache on the system the pseudo_lock_l2
+ and pseudo_lock_l3 tracepoints are available.
+
+When a pseudo-locked region is created a new directory is created for
+it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
+write-only file, pseudo_lock_measure, is present in this directory. The
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+
+1:
+ writing "1" to the pseudo_lock_measure file will trigger the latency
+ measurement captured in the pseudo_lock_mem_latency tracepoint. See
+ example below.
+2:
+ writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l2 tracepoint. See example below.
+3:
+ writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
+
+Example of latency debugging interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created. Here is
+how we can measure the latency in cycles of reading from this region and
+visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
+is set::
+
+ # :> /sys/kernel/tracing/trace
+ # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
+
+ # event histogram
+ #
+ # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
+ #
+
+ { latency: 456 } hitcount: 1
+ { latency: 50 } hitcount: 83
+ { latency: 36 } hitcount: 96
+ { latency: 44 } hitcount: 174
+ { latency: 48 } hitcount: 195
+ { latency: 46 } hitcount: 262
+ { latency: 42 } hitcount: 693
+ { latency: 40 } hitcount: 3204
+ { latency: 38 } hitcount: 3484
+
+ Totals:
+ Hits: 8192
+ Entries: 9
+ Dropped: 0
+
+Example of cache hits/misses debugging
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created on the L2
+cache of a platform. Here is how we can obtain details of the cache hits
+and misses using the platform's precision counters.
+::
+
+ # :> /sys/kernel/tracing/trace
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # cat /sys/kernel/tracing/trace
+
+ # tracer: nop
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / delay
+ # TASK-PID CPU# |||| TIMESTAMP FUNCTION
+ # | | | |||| | |
+ pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0
+
+
+Examples for RDT allocation usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1) Example 1
+
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks, a minimum b/w of 10% and a memory bandwidth
+granularity of 10%.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
+  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Similarly, tasks that are under the control of group "p0" may use a
+maximum memory b/w of 50% on socket0 and 50% on socket 1.
+Tasks in group "p1" may also use 50% memory b/w on both sockets.
+Note that unlike cache masks, memory b/w cannot specify whether these
+allocations can overlap or not. The allocation specifies the maximum
+b/w that the group may be able to use and the system admin can configure
+the b/w accordingly.
+
+If resctrl is using the software controller (mba_sc) then the user can
+enter the max b/w in MiBps rather than the percentage values.
+::
+
+  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
+  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
+
+In the above example the tasks in "p1" and "p0" on socket 0 would use a
+max b/w of 1024MiBps whereas on socket 1 they would use 500MiBps.
+
+2) Example 2
+
+Again two sockets, but this time with a more realistic 20-bit mask.
+
+Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
+processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
+neighbors, each of the two real-time tasks exclusively occupies one quarter
+of L3 cache on socket 0.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
+ordinary tasks::
+
+  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
+
+Next we make a resource group for our first real time task and give
+it access to the "top" 25% of the cache on socket 0.
+::
+
+ # mkdir p0
+ # echo "L3:0=f8000;1=fffff" > p0/schemata
+
+Finally we move our first real time task into this resource group. We
+also use taskset(1) to ensure the task always runs on a dedicated CPU
+on socket 0. Most uses of resource groups will also constrain which
+processors tasks run on.
+::
+
+ # echo 1234 > p0/tasks
+ # taskset -cp 1 1234
+
+Ditto for the second real time task (with the remaining 25% of cache)::
+
+ # mkdir p1
+ # echo "L3:0=7c00;1=fffff" > p1/schemata
+ # echo 5678 > p1/tasks
+ # taskset -cp 2 5678
+
+For the same 2 socket system with memory b/w resource and CAT L3 the
+schemata would look like this (assuming min_bandwidth is 10 and
+bandwidth_gran is 10):
+
+For our first real time task this would request 20% memory b/w on socket 0.
+::
+
+ # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
+
+For our second real time task this would request another 20% memory b/w
+on socket 0.
+::
+
+  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
+
+3) Example 3
+
+A single socket system which has real-time tasks running on cores 4-7
+and a non real-time workload assigned to cores 0-3. The real-time tasks
+share text and data, so a per task association is not required and due
+to interaction with the kernel it's desired that the kernel on these
+cores shares L3 with the tasks.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
+cannot be used by ordinary tasks::
+
+  # echo -e "L3:0=3ff\nMB:0=50" > schemata
+
+Next we make a resource group for our real time cores and give it access
+to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
+socket 0.
+::
+
+ # mkdir p0
+ # echo "L3:0=ffc00\nMB:0=50" > p0/schemata
+
+Finally we move cores 4-7 over to the new group and make sure that the
+kernel and the tasks running there get 50% of the cache. They should
+also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
+siblings and only the real time threads are scheduled on the cores 4-7.
+::
+
+ # echo F0 > p0/cpus
+
+4) Example 4
+
+The resource groups in previous examples were all in the default "shareable"
+mode allowing sharing of their cache allocations. If one resource group
+configures a cache allocation then nothing prevents another resource group
+from overlapping with that allocation.
+
+In this example a new exclusive resource group will be created on a L2 CAT
+system with two L2 cache instances that can be configured with an 8-bit
+capacity bitmask. The new exclusive resource group will be configured to use
+25% of each cache instance.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+ # cd /sys/fs/resctrl
+
+First, we observe that the default group is configured to allocate to all L2
+cache::
+
+ # cat schemata
+ L2:0=ff;1=ff
+
+We could attempt to create the new resource group at this point, but it will
+fail because of the overlap with the schemata of the default group::
+
+ # mkdir p0
+ # echo 'L2:0=0x3;1=0x3' > p0/schemata
+ # cat p0/mode
+ shareable
+ # echo exclusive > p0/mode
+ -sh: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ schemata overlaps
+
+To ensure that there is no overlap with another resource group the default
+resource group's schemata has to change, making it possible for the new
+resource group to become exclusive.
+::
+
+ # echo 'L2:0=0xfc;1=0xfc' > schemata
+ # echo exclusive > p0/mode
+ # grep . p0/*
+ p0/cpus:0
+ p0/mode:exclusive
+ p0/schemata:L2:0=03;1=03
+ p0/size:L2:0=262144;1=262144
+
+A newly created resource group will not overlap with an exclusive
+resource group::
+
+ # mkdir p1
+ # grep . p1/*
+ p1/cpus:0
+ p1/mode:shareable
+ p1/schemata:L2:0=fc;1=fc
+ p1/size:L2:0=786432;1=786432
+
+The bit_usage will reflect how the cache is used::
+
+ # cat info/L2/bit_usage
+ 0=SSSSSSEE;1=SSSSSSEE
+
+A resource group cannot be forced to overlap with an exclusive resource group::
+
+ # echo 'L2:0=0x1;1=0x1' > p1/schemata
+ -sh: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ overlaps with exclusive group
+
+Example of Cache Pseudo-Locking
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The
+pseudo-locked region is exposed at /dev/pseudo_lock/newlock and can be
+provided to an application as the argument to mmap().
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+ # cd /sys/fs/resctrl
+
+Ensure that there are bits available that can be pseudo-locked. Since
+only unused bits can be pseudo-locked, the bits to be pseudo-locked need
+to be removed from the default resource group's schemata::
+
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSSSS
+ # echo 'L2:1=0xfc' > schemata
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSS00
+
+Create a new resource group that will be associated with the pseudo-locked
+region, indicate that it will be used for a pseudo-locked region, and
+configure the requested pseudo-locked region capacity bitmask::
+
+ # mkdir newlock
+ # echo pseudo-locksetup > newlock/mode
+ # echo 'L2:1=0x3' > newlock/schemata
+
+On success the resource group's mode will change to pseudo-locked, the
+bit_usage will reflect the pseudo-locked region, and the character device
+exposing the pseudo-locked region will exist::
+
+ # cat newlock/mode
+ pseudo-locked
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSSPP
+ # ls -l /dev/pseudo_lock/newlock
+ crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock
+
+::
+
+ /*
+ * Example code to access one page of pseudo-locked cache region
+ * from user space.
+ */
+ #define _GNU_SOURCE
+ #include <fcntl.h>
+ #include <sched.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <unistd.h>
+ #include <sys/mman.h>
+
+ /*
+ * It is required that the application runs with affinity to only
+ * cores associated with the pseudo-locked region. Here the cpu
+ * is hardcoded for convenience of example.
+ */
+ static int cpuid = 2;
+
+ int main(int argc, char *argv[])
+ {
+ cpu_set_t cpuset;
+ long page_size;
+ void *mapping;
+ int dev_fd;
+ int ret;
+
+ page_size = sysconf(_SC_PAGESIZE);
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpuid, &cpuset);
+ ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+ if (ret < 0) {
+ perror("sched_setaffinity");
+ exit(EXIT_FAILURE);
+ }
+
+ dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
+ if (dev_fd < 0) {
+ perror("open");
+ exit(EXIT_FAILURE);
+ }
+
+ mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+ dev_fd, 0);
+ if (mapping == MAP_FAILED) {
+ perror("mmap");
+ close(dev_fd);
+ exit(EXIT_FAILURE);
+ }
+
+ /* Application interacts with pseudo-locked memory @mapping */
+
+ ret = munmap(mapping, page_size);
+ if (ret < 0) {
+ perror("munmap");
+ close(dev_fd);
+ exit(EXIT_FAILURE);
+ }
+
+ close(dev_fd);
+ exit(EXIT_SUCCESS);
+ }
+
+Locking between applications
+----------------------------
+
+Certain operations on the resctrl filesystem, composed of read/writes
+to/from multiple files, must be atomic.
+
+As an example, the allocation of an exclusive reservation of L3 cache
+involves:
+
+ 1. Read the cbmmasks from each directory or the per-resource "bit_usage"
+ 2. Find a contiguous set of bits in the global CBM bitmask that is clear
+ in any of the directory cbmmasks
+ 3. Create a new directory
+ 4. Set the bits found in step 2 to the new directory "schemata" file
+
+If two applications attempt to allocate space concurrently then they can
+end up allocating the same bits so the reservations are shared instead of
+exclusive.
+
+To coordinate atomic operations on the resctrlfs and to avoid the problem
+above, the following locking procedure is recommended:
+
+Locking is based on flock, which is available in libc and also as a
+shell command.
+
+Write lock:
+
+ A) Take flock(LOCK_EX) on /sys/fs/resctrl
+ B) Read/write the directory structure.
+ C) Release the lock with flock(LOCK_UN)
+
+Read lock:
+
+ A) Take flock(LOCK_SH) on /sys/fs/resctrl
+ B) If successful, read the directory structure.
+ C) Release the lock with flock(LOCK_UN)
+
+Example with bash::
+
+ # Atomically read directory structure
+ $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
+
+ # Read directory contents and create new subdirectory
+
+ $ cat create-dir.sh
+ find /sys/fs/resctrl/ > output.txt
+  mask=$(function-of output.txt)  # placeholder for the mask computation
+  mkdir /sys/fs/resctrl/newres/
+  echo "$mask" > /sys/fs/resctrl/newres/schemata
+
+ $ flock /sys/fs/resctrl/ ./create-dir.sh
+
+Example with C::
+
+  /*
+   * Example code to take advisory locks
+   * before accessing the resctrl filesystem
+   */
+  #include <sys/file.h>
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <stdlib.h>
+
+ void resctrl_take_shared_lock(int fd)
+ {
+ int ret;
+
+ /* take shared lock on resctrl filesystem */
+ ret = flock(fd, LOCK_SH);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+ void resctrl_take_exclusive_lock(int fd)
+ {
+ int ret;
+
+  	/* take exclusive lock on resctrl filesystem */
+ ret = flock(fd, LOCK_EX);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+ void resctrl_release_lock(int fd)
+ {
+ int ret;
+
+  	/* release lock on resctrl filesystem */
+ ret = flock(fd, LOCK_UN);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+  int main(void)
+  {
+  	int fd;
+
+ fd = open("/sys/fs/resctrl", O_DIRECTORY);
+ if (fd == -1) {
+ perror("open");
+ exit(-1);
+ }
+ resctrl_take_shared_lock(fd);
+ /* code to read directory contents */
+ resctrl_release_lock(fd);
+
+ resctrl_take_exclusive_lock(fd);
+ /* code to read and write directory contents */
+ resctrl_release_lock(fd);
+  	return 0;
+  }
+
+Examples for RDT Monitoring along with allocation usage
+=======================================================
+Reading monitored data
+----------------------
+Reading an event file (for example mon_data/mon_L3_00/llc_occupancy)
+will show the current snapshot of LLC occupancy of the corresponding
+MON group or CTRL_MON group.
+
+
+Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
+------------------------------------------------------------------------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+ # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+ # echo 5678 > p1/tasks
+ # echo 5679 > p1/tasks
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Create monitor groups and assign a subset of tasks to each monitor group.
+::
+
+ # cd /sys/fs/resctrl/p1/mon_groups
+ # mkdir m11 m12
+ # echo 5678 > m11/tasks
+ # echo 5679 > m12/tasks
+
+Fetch the data (data shown in bytes)
+::
+
+ # cat m11/mon_data/mon_L3_00/llc_occupancy
+ 16234000
+ # cat m11/mon_data/mon_L3_01/llc_occupancy
+ 14789000
+ # cat m12/mon_data/mon_L3_00/llc_occupancy
+ 16789000
+
+The parent CTRL_MON group shows the aggregated data.
+::
+
+  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 31234000
+
+Example 2 (Monitor a task from its creation)
+--------------------------------------------
+On a two socket machine (one L3 cache per socket)::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+
+An RMID is allocated to the group once it is created and hence the <cmd>
+below is monitored from its creation.
+::
+
+ # echo $$ > /sys/fs/resctrl/p1/tasks
+ # <cmd>
+
+Fetch the data::
+
+  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 31789000
+
+Example 3 (Monitor without CAT support or before creating CAT groups)
+---------------------------------------------------------------------
+
+Assume a system like HSW that has only CQM and no CAT support. In this
+case resctrl will still mount but cannot create CTRL_MON directories.
+However, the user can create different MON groups within the root group
+and is thereby able to monitor all tasks including kernel threads.
+
+This can also be used to profile jobs' cache size footprints before
+they are allocated to different allocation groups.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir mon_groups/m01
+ # mkdir mon_groups/m02
+
+ # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
+ # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
+
+Monitor the groups separately and also get per domain data. From the
+output below it is apparent that the tasks are mostly doing work on
+domain (socket) 0.
+::
+
+  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
+  31234000
+  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
+  34555
+  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
+  31234000
+  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
+  32789
+
+
+Example 4 (Monitor real time tasks)
+-----------------------------------
+
+A single socket system which has real time tasks running on cores 4-7
+and non real time tasks on other cpus. We want to monitor the cache
+occupancy of the real time threads on these cores.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p1
+
+Move the cpus 4-7 over to p1::
+
+ # echo f0 > p1/cpus
+
+View the llc occupancy snapshot::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 11234000
+
+Intel RDT Errata
+================
+
+Intel MBM Counters May Report System Memory Bandwidth Incorrectly
+-----------------------------------------------------------------
+
+Errata SKX99 for Skylake server and BDF102 for Broadwell server.
+
+Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
+according to the assigned Resource Monitor ID (RMID) for that logical
+core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
+metrics, may report incorrect system bandwidth for certain RMID values.
+
+Implication: Due to the errata, system memory bandwidth may not match
+what is reported.
+
+Workaround: MBM total and local readings are corrected according to the
+following correction factor table:
+
++---------------+---------------+---------------+-----------------+
+|core count |rmid count |rmid threshold |correction factor|
++---------------+---------------+---------------+-----------------+
+|1 |8 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|2 |16 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|3 |24 |15 |0.969650 |
++---------------+---------------+---------------+-----------------+
+|4 |32 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|6 |48 |31 |0.969650 |
++---------------+---------------+---------------+-----------------+
+|7 |56 |47 |1.142857 |
++---------------+---------------+---------------+-----------------+
+|8 |64 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|9 |72 |63 |1.185115 |
++---------------+---------------+---------------+-----------------+
+|10 |80 |63 |1.066553 |
++---------------+---------------+---------------+-----------------+
+|11 |88 |79 |1.454545 |
++---------------+---------------+---------------+-----------------+
+|12 |96 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|13 |104 |95 |1.230769 |
++---------------+---------------+---------------+-----------------+
+|14 |112 |95 |1.142857 |
++---------------+---------------+---------------+-----------------+
+|15 |120 |95 |1.066667 |
++---------------+---------------+---------------+-----------------+
+|16 |128 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|17 |136 |127 |1.254863 |
++---------------+---------------+---------------+-----------------+
+|18 |144 |127 |1.185255 |
++---------------+---------------+---------------+-----------------+
+|19 |152 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|20 |160 |127 |1.066667 |
++---------------+---------------+---------------+-----------------+
+|21 |168 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|22 |176 |159 |1.454334 |
++---------------+---------------+---------------+-----------------+
+|23 |184 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|24 |192 |127 |0.969744 |
++---------------+---------------+---------------+-----------------+
+|25 |200 |191 |1.280246 |
++---------------+---------------+---------------+-----------------+
+|26 |208 |191 |1.230921 |
++---------------+---------------+---------------+-----------------+
+|27 |216 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|28 |224 |191 |1.143118 |
++---------------+---------------+---------------+-----------------+
+
+If rmid > rmid threshold, MBM total and local values should be multiplied
+by the correction factor.
+
+See:
+
+1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
+http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
+
+2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
+http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
+
+3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
+https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html
+
+for further information.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index ae79c30b6c0c..bf051c7da6b8 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -716,9 +716,8 @@ page lookup by address, and keeping track of pages tagged as Dirty or
Writeback.
The first can be used independently to the others. The VM can try to
-either write dirty pages in order to clean them, or release clean pages
-in order to reuse them. To do this it can call the ->writepage method
-on dirty pages, and ->release_folio on clean folios with the private
+release clean pages in order to reuse them. To do this it can call
+->release_folio on clean folios with the private
flag set. Clean pages without PagePrivate and with no external references
will be released without notice being given to the address_space.
@@ -731,8 +730,8 @@ maintains information about the PG_Dirty and PG_Writeback status of each
page, so that pages with either of these flags can be found quickly.
The Dirty tag is primarily used by mpage_writepages - the default
-->writepages method. It uses the tag to find dirty pages to call
-->writepage on. If mpage_writepages is not used (i.e. the address
+->writepages method. It uses the tag to find dirty pages to
+write back. If mpage_writepages is not used (i.e. the address
provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is almost
unused. write_inode_now and sync_inode do use it (through
__sync_single_inode) to check if ->writepages has been successful in
@@ -756,23 +755,23 @@ pages, however the address_space has finer control of write sizes.
The read process essentially only requires 'read_folio'. The write
process is more complicated and uses write_begin/write_end or
-dirty_folio to write data into the address_space, and writepage and
+dirty_folio to write data into the address_space, and
writepages to writeback data to storage.
Adding and removing pages to/from an address_space is protected by the
inode's i_mutex.
When data is written to a page, the PG_Dirty flag should be set. It
-typically remains set until writepage asks for it to be written. This
+typically remains set until writepages asks for it to be written. This
should clear PG_Dirty and set PG_Writeback. It can be actually written
at any point after PG_Dirty is clear. Once it is known to be safe,
PG_Writeback is cleared.
Writeback makes use of a writeback_control structure to direct the
-operations. This gives the writepage and writepages operations some
+operations. This gives the writepages operation some
information about the nature of and reason for the writeback request,
and the constraints under which it is being done. It is also used to
-return information back to the caller about the result of a writepage or
+return information back to the caller about the result of a
writepages request.
@@ -819,7 +818,6 @@ cache in your filesystem. The following members are defined:
.. code-block:: c
struct address_space_operations {
- int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *);
@@ -848,25 +846,6 @@ cache in your filesystem. The following members are defined:
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};
-``writepage``
- called by the VM to write a dirty page to backing store. This
- may happen for data integrity reasons (i.e. 'sync'), or to free
- up memory (flush). The difference can be seen in
- wbc->sync_mode. The PG_Dirty flag has been cleared and
- PageLocked is true. writepage should start writeout, should set
- PG_Writeback, and should make sure the page is unlocked, either
- synchronously or asynchronously when the write operation
- completes.
-
- If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
- try too hard if there are problems, and may choose to write out
- other pages from the mapping if that is easier (e.g. due to
- internal dependencies). If it chooses not to start writeout, it
- should return AOP_WRITEPAGE_ACTIVATE so that the VM will not
- keep calling ->writepage on that page.
-
- See the file "Locking" for more details.
-
``read_folio``
Called by the page cache to read a folio from the backing store.
The 'file' argument supplies authentication information to network
@@ -909,7 +888,7 @@ cache in your filesystem. The following members are defined:
given and that many pages should be written if possible. If no
->writepages is given, then mpage_writepages is used instead.
This will choose pages from the address space that are tagged as
- DIRTY and will pass them to ->writepage.
+ DIRTY and will write them back.
``dirty_folio``
called by the VM to mark a folio as dirty. This is particularly