summaryrefslogtreecommitdiff
path: root/include
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2022-05-25 04:42:04 +0300
committerLinus Torvalds <torvalds@linux-foundation.org>2022-05-25 04:42:04 +0300
commit65965d9530b0c320759cd18a9a5975fb2e098462 (patch)
tree150e4b53ada8c9c5a9d4db50feb113418103e9b5 /include
parent850f6033cd2bf3b1fcbf9a20d078edab7e7c67b4 (diff)
parentba73eadd23d1c2dc5c8dc0c0ae2eeca2b9b709a7 (diff)
downloadlinux-65965d9530b0c320759cd18a9a5975fb2e098462.tar.xz
Merge tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs (and fscache) updates from Gao Xiang: "After working on it on the mailing list for more than half a year, we finally form 'erofs over fscache' feature into shape. Hopefully it could bring more possibility to the communities. The story mainly started from a new project what we called "RAFS v6" [1] for Nydus image service almost a year ago, which enhances EROFS to be a new form of one bootstrap (which includes metadata representing the whole fs tree) + several data-deduplicated content addressable blobs (actually treated as multiple devices). Each blob can represent one container image layer but not quite exactly since all new data can be fully existed in the previous blobs so no need to introduce another new blob. It is actually not a new idea (at least on my side it's much like a simpilied casync [2] for now) and has many benefits over per-file blobs or some other exist ways since typically each RAFS v6 image only has dozens of device blobs instead of thousands of per-file blobs. It's easy to be signed with user keys as a golden image, transfered untouchedly with minimal overhead over the network, kept in some type of storage conveniently, and run with (optional) runtime verification but without involving too many irrelevant features crossing the system beyond EROFS itself. At least it's our final goal and we're keeping working on it. There was also a good summary of this approach from the casync author [3]. Regardless further optimizations, this work is almost done in the previous Linux release cycles. In this round, we'd like to introduce on-demand load for EROFS with the fscache/cachefiles infrastructure, considering the following advantages: - Introduce new file-based backend to EROFS. Although each image only contains dozens of blobs but in densely-deployed runC host for example, there could still be massive blobs on a machine, which is messy if each blob is treated as a device. In contrast, fscache and cachefiles are really great interfaces for us to make them work. - Introduce on-demand load to fscache and EROFS. Previously, fscache is mainly used to caching network-likewise filesystems, now it can support on-demand downloading for local fses too with the exact localfs on-disk format. It has many advantages which we're been described in the latest patchset cover letter [4]. In addition to that, most importantly, the cached data is still stored in the original local fs on-disk format so that it's still the one signed with private keys but only could be partially available. Users can fully trust it during running. Later, users can also back up cachefiles easily to another machine. - More reliable on-demand approach in principle. After data is all available locally, user daemon can be no longer online in some use cases, which helps daemon crash recovery (filesystems can still in service) and hot-upgrade (user daemon can be upgraded more frequently due to new features or protocols introduced.) - Other format can also be converted to EROFS filesystem format over the internet on the fly with the new on-demand load feature and mounted. That is entirely possible with on-demand load feature as long as such archive format metadata can be fetched in advance like stargz. In addition, although currently our target user is Nydus image service [5], but laterly, it can be used for other use cases like on-demand system booting, etc. As for the fscache on-demand load feature itself, strictly it can be used for other local fses too. Laterly we could promote most code to the iomap infrastructure and also enhance it in the read-write way if other local fses are interested. Thanks David Howells for taking so much time and patience on this these months, many thanks with great respect here again! Thanks Jeffle for working on this feature and Xin Yin from Bytedance for asynchronous I/O implementation as well as Zichen Tian, Jia Zhu, and Yan Song for testing, much appeciated. We're also exploring more possibly over fscache cache management over FSDAX for secure containers and working on more improvements and useful features for fscache, cachefiles, and on-demand load. In addition to "erofs over fscache", NFS export and idmapped mount are also completed in this cycle for container use cases as well. Summary: - Add erofs on-demand load support over fscache - Support NFS export for erofs - Support idmapped mounts for erofs - Don't prompt for risk any more when using big pcluster - Fix buffer copy overflow of ztailpacking feature - Several minor cleanups" [1] https://lore.kernel.org/r/20210730194625.93856-1-hsiangkao@linux.alibaba.com [2] https://github.com/systemd/casync [3] http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html [4] https://lore.kernel.org/r/20220509074028.74954-1-jefflexu@linux.alibaba.com [5] https://github.com/dragonflyoss/image-service * tag 'erofs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (29 commits) erofs: scan devices from device table erofs: change to use asynchronous io for fscache readpage/readahead erofs: add 'fsid' mount option erofs: implement fscache-based data readahead erofs: implement fscache-based data read for inline layout erofs: implement fscache-based data read for non-inline layout erofs: implement fscache-based metadata read erofs: register fscache context for extra data blobs erofs: register fscache context for primary data blob erofs: add erofs_fscache_read_folios() helper erofs: add anonymous inode caching metadata for data blobs erofs: add fscache context helper functions erofs: register fscache volume erofs: add fscache mode check helper erofs: make erofs_map_blocks() generally available cachefiles: document on-demand read mode cachefiles: add tracepoints for on-demand read mode cachefiles: enable on-demand read mode cachefiles: implement on-demand read cachefiles: notify the user daemon when withdrawing cookie ...
Diffstat (limited to 'include')
-rw-r--r--include/linux/fscache.h1
-rw-r--r--include/linux/netfs.h1
-rw-r--r--include/trace/events/cachefiles.h176
-rw-r--r--include/uapi/linux/cachefiles.h68
4 files changed, 246 insertions, 0 deletions
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index e25539072463..72585c9729a2 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -39,6 +39,7 @@ struct fscache_cookie;
#define FSCACHE_ADV_SINGLE_CHUNK 0x01 /* The object is a single chunk of data */
#define FSCACHE_ADV_WRITE_CACHE 0x00 /* Do cache if written to locally */
#define FSCACHE_ADV_WRITE_NOCACHE 0x02 /* Don't cache if written to locally */
+#define FSCACHE_ADV_WANT_CACHE_SIZE 0x04 /* Retrieve cache size at runtime */
#define FSCACHE_INVAL_DIO_WRITE 0x01 /* Invalidate due to DIO write */
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 0c33b715cbfd..80b7728277b1 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -159,6 +159,7 @@ struct netfs_io_subrequest {
#define NETFS_SREQ_SHORT_IO 2 /* Set if the I/O was short */
#define NETFS_SREQ_SEEK_DATA_READ 3 /* Set if ->read() should SEEK_DATA first */
#define NETFS_SREQ_NO_PROGRESS 4 /* Set if we didn't manage to read any data */
+#define NETFS_SREQ_ONDEMAND 5 /* Set if it's from on-demand read mode */
};
enum netfs_io_origin {
diff --git a/include/trace/events/cachefiles.h b/include/trace/events/cachefiles.h
index 311c14a20e70..d8d4d73fe7b6 100644
--- a/include/trace/events/cachefiles.h
+++ b/include/trace/events/cachefiles.h
@@ -31,6 +31,8 @@ enum cachefiles_obj_ref_trace {
cachefiles_obj_see_lookup_failed,
cachefiles_obj_see_withdraw_cookie,
cachefiles_obj_see_withdrawal,
+ cachefiles_obj_get_ondemand_fd,
+ cachefiles_obj_put_ondemand_fd,
};
enum fscache_why_object_killed {
@@ -671,6 +673,180 @@ TRACE_EVENT(cachefiles_io_error,
__entry->error)
);
+TRACE_EVENT(cachefiles_ondemand_open,
+ TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg,
+ struct cachefiles_open *load),
+
+ TP_ARGS(obj, msg, load),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ __field(unsigned int, object_id )
+ __field(unsigned int, fd )
+ __field(unsigned int, flags )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg->msg_id;
+ __entry->object_id = msg->object_id;
+ __entry->fd = load->fd;
+ __entry->flags = load->flags;
+ ),
+
+ TP_printk("o=%08x mid=%x oid=%x fd=%d f=%x",
+ __entry->obj,
+ __entry->msg_id,
+ __entry->object_id,
+ __entry->fd,
+ __entry->flags)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_copen,
+ TP_PROTO(struct cachefiles_object *obj, unsigned int msg_id,
+ long len),
+
+ TP_ARGS(obj, msg_id, len),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ __field(long, len )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg_id;
+ __entry->len = len;
+ ),
+
+ TP_printk("o=%08x mid=%x l=%lx",
+ __entry->obj,
+ __entry->msg_id,
+ __entry->len)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_close,
+ TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg),
+
+ TP_ARGS(obj, msg),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ __field(unsigned int, object_id )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg->msg_id;
+ __entry->object_id = msg->object_id;
+ ),
+
+ TP_printk("o=%08x mid=%x oid=%x",
+ __entry->obj,
+ __entry->msg_id,
+ __entry->object_id)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_read,
+ TP_PROTO(struct cachefiles_object *obj, struct cachefiles_msg *msg,
+ struct cachefiles_read *load),
+
+ TP_ARGS(obj, msg, load),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ __field(unsigned int, object_id )
+ __field(loff_t, start )
+ __field(size_t, len )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg->msg_id;
+ __entry->object_id = msg->object_id;
+ __entry->start = load->off;
+ __entry->len = load->len;
+ ),
+
+ TP_printk("o=%08x mid=%x oid=%x s=%llx l=%zx",
+ __entry->obj,
+ __entry->msg_id,
+ __entry->object_id,
+ __entry->start,
+ __entry->len)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_cread,
+ TP_PROTO(struct cachefiles_object *obj, unsigned int msg_id),
+
+ TP_ARGS(obj, msg_id),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, msg_id )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->msg_id = msg_id;
+ ),
+
+ TP_printk("o=%08x mid=%x",
+ __entry->obj,
+ __entry->msg_id)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_fd_write,
+ TP_PROTO(struct cachefiles_object *obj, struct inode *backer,
+ loff_t start, size_t len),
+
+ TP_ARGS(obj, backer, start, len),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, backer )
+ __field(loff_t, start )
+ __field(size_t, len )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->backer = backer->i_ino;
+ __entry->start = start;
+ __entry->len = len;
+ ),
+
+ TP_printk("o=%08x iB=%x s=%llx l=%zx",
+ __entry->obj,
+ __entry->backer,
+ __entry->start,
+ __entry->len)
+ );
+
+TRACE_EVENT(cachefiles_ondemand_fd_release,
+ TP_PROTO(struct cachefiles_object *obj, int object_id),
+
+ TP_ARGS(obj, object_id),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, obj )
+ __field(unsigned int, object_id )
+ ),
+
+ TP_fast_assign(
+ __entry->obj = obj ? obj->debug_id : 0;
+ __entry->object_id = object_id;
+ ),
+
+ TP_printk("o=%08x oid=%x",
+ __entry->obj,
+ __entry->object_id)
+ );
+
#endif /* _TRACE_CACHEFILES_H */
/* This part must be outside protection */
diff --git a/include/uapi/linux/cachefiles.h b/include/uapi/linux/cachefiles.h
new file mode 100644
index 000000000000..78caa73e5343
--- /dev/null
+++ b/include/uapi/linux/cachefiles.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_CACHEFILES_H
+#define _LINUX_CACHEFILES_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Fscache ensures that the maximum length of cookie key is 255. The volume key
+ * is controlled by netfs, and generally no bigger than 255.
+ */
+#define CACHEFILES_MSG_MAX_SIZE 1024
+
+enum cachefiles_opcode {
+ CACHEFILES_OP_OPEN,
+ CACHEFILES_OP_CLOSE,
+ CACHEFILES_OP_READ,
+};
+
+/*
+ * Message Header
+ *
+ * @msg_id a unique ID identifying this message
+ * @opcode message type, CACHEFILE_OP_*
+ * @len message length, including message header and following data
+ * @object_id a unique ID identifying a cache file
+ * @data message type specific payload
+ */
+struct cachefiles_msg {
+ __u32 msg_id;
+ __u32 opcode;
+ __u32 len;
+ __u32 object_id;
+ __u8 data[];
+};
+
+/*
+ * @data contains the volume_key followed directly by the cookie_key. volume_key
+ * is a NUL-terminated string; @volume_key_size indicates the size of the volume
+ * key in bytes. cookie_key is binary data, which is netfs specific;
+ * @cookie_key_size indicates the size of the cookie key in bytes.
+ *
+ * @fd identifies an anon_fd referring to the cache file.
+ */
+struct cachefiles_open {
+ __u32 volume_key_size;
+ __u32 cookie_key_size;
+ __u32 fd;
+ __u32 flags;
+ __u8 data[];
+};
+
+/*
+ * @off indicates the starting offset of the requested file range
+ * @len indicates the length of the requested file range
+ */
+struct cachefiles_read {
+ __u64 off;
+ __u64 len;
+};
+
+/*
+ * Reply for READ request
+ * @arg for this ioctl is the @id field of READ request.
+ */
+#define CACHEFILES_IOC_READ_COMPLETE _IOW(0x98, 1, int)
+
+#endif