summaryrefslogtreecommitdiff
path: root/Documentation/filesystems/iomap
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems/iomap')
-rw-r--r--Documentation/filesystems/iomap/design.rst28
-rw-r--r--Documentation/filesystems/iomap/operations.rst89
2 files changed, 75 insertions, 42 deletions
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
index b0d0188a095e..0f7672676c0b 100644
--- a/Documentation/filesystems/iomap/design.rst
+++ b/Documentation/filesystems/iomap/design.rst
@@ -167,7 +167,6 @@ structure below:
struct dax_device *dax_dev;
void *inline_data;
void *private;
- const struct iomap_folio_ops *folio_ops;
u64 validity_cookie;
};
@@ -243,8 +242,24 @@ The fields are as follows:
regular file data.
This is only useful for FIEMAP.
- * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
- be set by the filesystem for its own purposes.
+ * **IOMAP_F_BOUNDARY**: This indicates I/O and its completion must not be
+ merged with any other I/O or completion. Filesystems must use this when
+ submitting I/O to devices that cannot handle I/O crossing certain LBAs
+ (e.g. ZNS devices). This flag applies only to buffered I/O writeback; all
+ other functions ignore it.
+
+ * **IOMAP_F_PRIVATE**: This flag is reserved for filesystem private use.
+
+ * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target
+ block assigned to it yet and the file system will do that in the bio
+ submission handler, splitting the I/O as needed.
+
+ * **IOMAP_F_ATOMIC_BIO**: This indicates write I/O must be submitted with the
+ ``REQ_ATOMIC`` flag set in the bio. Filesystems need to set this flag to
+ inform iomap that the write I/O operation requires torn-write protection
+ based on HW-offload mechanism. They must also ensure that mapping updates
+ upon the completion of the I/O must be performed in a single metadata
+ update.
These flags can be set by iomap itself during file operations.
The filesystem should supply an ``->iomap_end`` function if it needs
@@ -276,8 +291,6 @@ The fields are as follows:
<https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
This value will be passed unchanged to ``->iomap_end``.
- * ``folio_ops`` will be covered in the section on pagecache operations.
-
* ``validity_cookie`` is a magic freshness value set by the filesystem
that should be used to detect stale mappings.
For pagecache operations this is critical for correct operation
@@ -352,6 +365,11 @@ operations:
``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
``RWF_NOWAIT``.
+ * ``IOMAP_DONTCACHE`` is set when the caller wishes to perform a
+ buffered file I/O and would like the kernel to drop the pagecache
+ after the I/O completes, if it isn't already being used by another
+ thread.
+
If it is necessary to read existing file contents from a `different
<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
device or address range on a device, the filesystem should return that
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 2c7f5df9d8b0..067ed8e14ef3 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -57,21 +57,19 @@ The following address space operations can be wrapped easily:
* ``bmap``
* ``swap_activate``
-``struct iomap_folio_ops``
+``struct iomap_write_ops``
--------------------------
-The ``->iomap_begin`` function for pagecache operations may set the
-``struct iomap::folio_ops`` field to an ops structure to override
-default behaviors of iomap:
-
.. code-block:: c
- struct iomap_folio_ops {
+ struct iomap_write_ops {
struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
unsigned len);
void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
struct folio *folio);
bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len);
};
iomap calls these functions:
@@ -127,10 +125,16 @@ iomap calls these functions:
``->iomap_valid``, then the iomap should considered stale and the
validation failed.
+ - ``read_folio_range``: Called to synchronously read in the range that will
+ be written to. If this function is not provided, iomap will default to
+ submitting a bio read request.
+
These ``struct kiocb`` flags are significant for buffered I/O with iomap:
* ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+ * ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
+
Internal per-Folio State
------------------------
@@ -269,7 +273,7 @@ writeback.
It does not lock ``i_rwsem`` or ``invalidate_lock``.
The dirty bit will be cleared for all folios run through the
-``->map_blocks`` machinery described below even if the writeback fails.
+``->writeback_range`` machinery described below even if the writeback fails.
This is to prevent dirty folio clots when storage devices fail; an
``-EIO`` is recorded for userspace to collect via ``fsync``.
@@ -281,15 +285,14 @@ The ``ops`` structure must be specified and is as follows:
.. code-block:: c
struct iomap_writeback_ops {
- int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
- loff_t offset, unsigned len);
- int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
- void (*discard_folio)(struct folio *folio, loff_t pos);
+ int (*writeback_range)(struct iomap_writepage_ctx *wpc,
+ struct folio *folio, u64 pos, unsigned int len, u64 end_pos);
+ int (*writeback_submit)(struct iomap_writepage_ctx *wpc, int error);
};
The fields are as follows:
- - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
+ - ``writeback_range``: Sets ``wpc->iomap`` to the space mapping of the file
range (in bytes) given by ``offset`` and ``len``.
iomap calls this function for each dirty fs block in each dirty folio,
though it will `reuse mappings
@@ -304,28 +307,26 @@ The fields are as follows:
This revalidation must be open-coded by the filesystem; it is
unclear if ``iomap::validity_cookie`` can be reused for this
purpose.
- This function must be supplied by the filesystem.
-
- - ``prepare_ioend``: Enables filesystems to transform the writeback
- ioend or perform any other preparatory work before the writeback I/O
- is submitted.
- This might include pre-write space accounting updates, or installing
- a custom ``->bi_end_io`` function for internal purposes, such as
- deferring the ioend completion to a workqueue to run metadata update
- transactions from process context.
- This function is optional.
- - ``discard_folio``: iomap calls this function after ``->map_blocks``
- fails to schedule I/O for any part of a dirty folio.
- The function should throw away any reservations that may have been
- made for the write.
+ If this methods fails to schedule I/O for any part of a dirty folio, it
+ should throw away any reservations that may have been made for the write.
The folio will be marked clean and an ``-EIO`` recorded in the
pagecache.
Filesystems can use this callback to `remove
<https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
delalloc reservations to avoid having delalloc reservations for
clean pagecache.
- This function is optional.
+ This function must be supplied by the filesystem.
+
+ - ``writeback_submit``: Submit the previous built writeback context.
+ Block based file systems should use the iomap_ioend_writeback_submit
+ helper, other file system can implement their own.
+ File systems can optionall to hook into writeback bio submission.
+ This might include pre-write space accounting updates, or installing
+ a custom ``->bi_end_io`` function for internal purposes, such as
+ deferring the ioend completion to a workqueue to run metadata update
+ transactions from process context before submitting the bio.
+ This function must be supplied by the filesystem.
Pagecache Writeback Completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -339,10 +340,9 @@ If the write failed, it will also set the error bits on the folios and
the address space.
This can happen in interrupt or process context, depending on the
storage device.
-
Filesystems that need to update internal bookkeeping (e.g. unwritten
-extent conversions) should provide a ``->prepare_ioend`` function to
-set ``struct iomap_end::bio::bi_end_io`` to its own function.
+extent conversions) should set their own bi_end_io on the bios
+submitted by ``->submit_writeback``
This function should call ``iomap_finish_ioends`` after finishing its
own work (e.g. unwritten extent conversion).
@@ -515,18 +515,33 @@ IOMAP_WRITE`` with any combination of the following enhancements:
* ``IOMAP_ATOMIC``: This write is being issued with torn-write
protection.
- Only a single bio can be created for the write, and the write must
- not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
- set.
+ Torn-write protection may be provided based on HW-offload or by a
+ software mechanism provided by the filesystem.
+
+ For HW-offload based support, only a single bio can be created for the
+ write, and the write must not be split into multiple I/O requests, i.e.
+ flag REQ_ATOMIC must be set.
The file range to write must be aligned to satisfy the requirements
of both the filesystem and the underlying block device's atomic
commit capabilities.
If filesystem metadata updates are required (e.g. unwritten extent
- conversion or copy on write), all updates for the entire file range
+ conversion or copy-on-write), all updates for the entire file range
must be committed atomically as well.
- Only one space mapping is allowed per untorn write.
- Untorn writes must be aligned to, and must not be longer than, a
- single file block.
+ Untorn-writes may be longer than a single file block. In all cases,
+ the mapping start disk block must have at least the same alignment as
+ the write offset.
+ The filesystems must set IOMAP_F_ATOMIC_BIO to inform iomap core of an
+ untorn-write based on HW-offload.
+
+ For untorn-writes based on a software mechanism provided by the
+ filesystem, all the disk block alignment and single bio restrictions
+ which apply for HW-offload based untorn-writes do not apply.
+ The mechanism would typically be used as a fallback for when
+ HW-offload based untorn-writes may not be issued, e.g. the range of the
+ write covers multiple extents, meaning that it is not possible to issue
+ a single bio.
+ All filesystem metadata updates for the entire file range must be
+ committed atomically as well.
Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
calling this function.