Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/cifs/index.rst | 10
-rw-r--r-- | Documentation/filesystems/cifs/ksmbd.rst | 165
-rw-r--r-- | Documentation/filesystems/fscrypt.rst | 15
-rw-r--r-- | Documentation/filesystems/idmappings.rst | 1026
-rw-r--r-- | Documentation/filesystems/index.rst | 3
-rw-r--r-- | Documentation/filesystems/locking.rst | 79
-rw-r--r-- | Documentation/filesystems/mandatory-locking.rst | 188
-rw-r--r-- | Documentation/filesystems/ramfs-rootfs-initramfs.rst | 2
8 files changed, 1262 insertions, 226 deletions
diff --git a/Documentation/filesystems/cifs/index.rst b/Documentation/filesystems/cifs/index.rst new file mode 100644 index 000000000000..1c8597a679ab --- /dev/null +++ b/Documentation/filesystems/cifs/index.rst @@ -0,0 +1,10 @@ +=============================== +CIFS +=============================== + + +.. toctree:: + :maxdepth: 1 + + ksmbd + cifsroot diff --git a/Documentation/filesystems/cifs/ksmbd.rst b/Documentation/filesystems/cifs/ksmbd.rst new file mode 100644 index 000000000000..a1326157d53f --- /dev/null +++ b/Documentation/filesystems/cifs/ksmbd.rst @@ -0,0 +1,165 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +KSMBD - SMB3 Kernel Server +========================== + +KSMBD is a linux kernel server which implements SMB3 protocol in kernel space +for sharing files over network. + +KSMBD architecture +================== + +The subset of performance related operations belong in kernelspace and +the other subset which belong to operations which are not really related with +performance in userspace. So, DCE/RPC management that has historically resulted +into number of buffer overflow issues and dangerous security bugs and user +account management are implemented in user space as ksmbd.mountd. +File operations that are related with performance (open/read/write/close etc.) +in kernel space (ksmbd). This also allows for easier integration with VFS +interface for all file operations. + +ksmbd (kernel daemon) +--------------------- + +When the server daemon is started, It starts up a forker thread +(ksmbd/interface name) at initialization time and open a dedicated port 445 +for listening to SMB requests. Whenever new clients make request, Forker +thread will accept the client connection and fork a new thread for dedicated +communication channel between the client and the server. It allows for parallel +processing of SMB requests(commands) from clients as well as allowing for new +clients to make new connections. 
Each instance is named ksmbd/1~n(port number) +to indicate connected clients. Depending on the SMB request types, each new +thread can decide to pass through the commands to the user space (ksmbd.mountd), +currently DCE/RPC commands are identified to be handled through the user space. +To further utilize the linux kernel, it has been chosen to process the commands +as workitems and to be executed in the handlers of the ksmbd-io kworker threads. +It allows for multiplexing of the handlers as the kernel take care of initiating +extra worker threads if the load is increased and vice versa, if the load is +decreased it destroys the extra worker threads. So, after connection is +established with client. Dedicated ksmbd/1..n(port number) takes complete +ownership of receiving/parsing of SMB commands. Each received command is worked +in parallel i.e., There can be multiple clients commands which are worked in +parallel. After receiving each command a separated kernel workitem is prepared +for each command which is further queued to be handled by ksmbd-io kworkers. +So, each SMB workitem is queued to the kworkers. This allows the benefit of load +sharing to be managed optimally by the default kernel and optimizing client +performance by handling client commands in parallel. + +ksmbd.mountd (user space daemon) +-------------------------------- + +ksmbd.mountd is userspace process to, transfer user account and password that +are registered using ksmbd.adduser(part of utils for user space). Further it +allows sharing information parameters that parsed from smb.conf to ksmbd in +kernel. For the execution part it has a daemon which is continuously running +and connected to the kernel interface using netlink socket, it waits for the +requests(dcerpc and share/user info). It handles RPC calls (at a minimum few +dozen) that are most important for file server from NetShareEnum and +NetServerGetInfo. 
Complete DCE/RPC response is prepared from the user space +and passed over to the associated kernel thread for the client. + + +KSMBD Feature Status +==================== + +============================== ================================================= +Feature name Status +============================== ================================================= +Dialects Supported. SMB2.1 SMB3.0, SMB3.1.1 dialects + (intentionally excludes security vulnerable SMB1 + dialect). +Auto Negotiation Supported. +Compound Request Supported. +Oplock Cache Mechanism Supported. +SMB2 leases(v1 lease) Supported. +Directory leases(v2 lease) Planned for future. +Multi-credits Supported. +NTLM/NTLMv2 Supported. +HMAC-SHA256 Signing Supported. +Secure negotiate Supported. +Signing Update Supported. +Pre-authentication integrity Supported. +SMB3 encryption(CCM, GCM) Supported. (CCM and GCM128 supported, GCM256 in + progress) +SMB direct(RDMA) Partially Supported. SMB3 Multi-channel is + required to connect to Windows client. +SMB3 Multi-channel Partially Supported. Planned to implement + replay/retry mechanisms for future. +SMB3.1.1 POSIX extension Supported. +ACLs Partially Supported. only DACLs available, SACLs + (auditing) is planned for the future. For + ownership (SIDs) ksmbd generates random subauth + values(then store it to disk) and use uid/gid + get from inode as RID for local domain SID. + The current acl implementation is limited to + standalone server, not a domain member. + Integration with Samba tools is being worked on + to allow future support for running as a domain + member. +Kerberos Supported. +Durable handle v1,v2 Planned for future. +Persistent handle Planned for future. +SMB2 notify Planned for future. +Sparse file support Supported. +DCE/RPC support Partially Supported. a few calls(NetShareEnumAll, + NetServerGetInfo, SAMR, LSARPC) that are needed + for file server handled via netlink interface + from ksmbd.mountd. 
Additional integration with + Samba tools and libraries via upcall is being + investigated to allow support for additional + DCE/RPC management calls (and future support + for Witness protocol e.g.) +ksmbd/nfsd interoperability Planned for future. The features that ksmbd + support are Leases, Notify, ACLs and Share modes. +============================== ================================================= + + +How to run +========== + +1. Download ksmbd-tools and compile them. + - https://github.com/cifsd-team/ksmbd-tools + +2. Create user/password for SMB share. + + # mkdir /etc/ksmbd/ + # ksmbd.adduser -a <Enter USERNAME for SMB share access> + +3. Create /etc/ksmbd/smb.conf file, add SMB share in smb.conf file + - Refer to smb.conf.example and + https://github.com/cifsd-team/ksmbd-tools/blob/master/Documentation/configuration.txt + +4. Insert ksmbd.ko module + + # insmod ksmbd.ko + +5. Start ksmbd user space daemon + # ksmbd.mountd + +6. Access share from Windows or Linux using CIFS + +Shutdown KSMBD +============== + +1. Kill user and kernel space daemon + # sudo ksmbd.control -s + +How to turn debug print on +========================== + +Each layer +/sys/class/ksmbd-control/debug + +1. Enable all component prints + # sudo ksmbd.control -d "all" + +2. Enable one of the components (smb, auth, vfs, oplock, ipc, conn, rdma) + # sudo ksmbd.control -d "smb" + +3. Show which prints are enabled. + # cat /sys/class/ksmbd-control/debug + [smb] auth vfs oplock ipc conn [rdma] + +4. Disable prints: + If you select an enabled component once more, it is disabled and shown without brackets. diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index 44b67ebd6e40..0eb799d9d05a 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -1063,11 +1063,6 @@ astute users may notice some differences in behavior: - DAX (Direct Access) is not supported on encrypted files. 
-- The st_size of an encrypted symlink will not necessarily give the - length of the symlink target as required by POSIX. It will actually - give the length of the ciphertext, which will be slightly longer - than the plaintext due to NUL-padding and an extra 2-byte overhead. - - The maximum length of an encrypted symlink is 2 bytes shorter than the maximum length of an unencrypted symlink. For example, on an EXT4 filesystem with a 4K block size, unencrypted symlinks can be up @@ -1235,12 +1230,12 @@ the user-supplied name to get the ciphertext. Lookups without the key are more complicated. The raw ciphertext may contain the ``\0`` and ``/`` characters, which are illegal in -filenames. Therefore, readdir() must base64-encode the ciphertext for -presentation. For most filenames, this works fine; on ->lookup(), the -filesystem just base64-decodes the user-supplied name to get back to -the raw ciphertext. +filenames. Therefore, readdir() must base64url-encode the ciphertext +for presentation. For most filenames, this works fine; on ->lookup(), +the filesystem just base64url-decodes the user-supplied name to get +back to the raw ciphertext. -However, for very long filenames, base64 encoding would cause the +However, for very long filenames, base64url encoding would cause the filename length to exceed NAME_MAX. To prevent this, readdir() actually presents long filenames in an abbreviated form which encodes a strong "hash" of the ciphertext filename, along with the optional diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst new file mode 100644 index 000000000000..1229a75ec75d --- /dev/null +++ b/Documentation/filesystems/idmappings.rst @@ -0,0 +1,1026 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Idmappings +========== + +Most filesystem developers will have encountered idmappings. They are used when +reading from or writing ownership to disk, reporting ownership to userspace, or +for permission checking. 
This document is aimed at filesystem developers that +want to know how idmappings work. + +Formal notes +------------ + +An idmapping is essentially a translation of a range of ids into another or the +same range of ids. The notational convention for idmappings that is widely used +in userspace is:: + + u:k:r + +``u`` indicates the first element in the upper idmapset ``U`` and ``k`` +indicates the first element in the lower idmapset ``K``. The ``r`` parameter +indicates the range of the idmapping, i.e. how many ids are mapped. From now +on, we will always prefix ids with ``u`` or ``k`` to make it clear whether +we're talking about an id in the upper or lower idmapset. + +To see what this looks like in practice, let's take the following idmapping:: + + u22:k10000:r3 + +and write down the mappings it will generate:: + + u22 -> k10000 + u23 -> k10001 + u24 -> k10002 + +From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an +idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are +order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of +the set of all possible ids useable on a given system. + +Looking at this mathematically briefly will help us highlight some properties +that make it easier to understand how we can translate between idmappings. For +example, we know that the inverse idmapping is an order isomorphism as well:: + + k10000 -> u22 + k10001 -> u23 + k10002 -> u24 + +Given that we are dealing with order isomorphisms plus the fact that we're +dealing with subsets we can embedd idmappings into each other, i.e. we can +sensibly translate between different idmappings. For example, assume we've been +given the three idmappings:: + + 1. u0:k10000:r10000 + 2. u0:k20000:r10000 + 3. u0:k30000:r10000 + +and id ``k11000`` which has been generated by the first idmapping by mapping +``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset. 
+ +Because we're dealing with order isomorphic subsets it is meaningful to ask +what id ``k11000`` corresponds to in the second or third idmapping. The +straightforward algorithm to use is to apply the inverse of the first idmapping, +mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using +either the second or the third idmapping. The second +idmapping would map ``u1000`` down to ``k21000``. The third idmapping would map +``u1000`` down to ``k31000``. + +If we were given the same task for the following three idmappings:: + + 1. u0:k10000:r10000 + 2. u0:k20000:r200 + 3. u0:k30000:r300 + +we would fail to translate as the sets aren't order isomorphic over the full +range of the first idmapping anymore (however, they are order isomorphic over +the full range of the second idmapping). Neither the second nor the third idmapping +contains ``u1000`` in the upper idmapset ``U``. This is equivalent to not having +an id mapped. We can simply say that ``u1000`` is unmapped in the second and +third idmapping. The kernel will report unmapped ids as the overflowuid +``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace. + +The algorithm to calculate what a given id maps to is pretty simple. First, we +need to verify that the range can contain our target id. We will skip this step +for simplicity. After that if we want to know what ``id`` maps to we can do +simple calculations: + +- If we want to map from left to right:: + + u:k:r + id - u + k = n + +- If we want to map from right to left:: + + u:k:r + id - k + u = n + +Instead of "left to right" we can also say "down" and instead of "right to +left" we can also say "up". Obviously mapping down and up invert each other. + +To see whether the simple formulas above work, consider the following two +idmappings:: + + 1. u0:k20000:r10000 + 2. u500:k30000:r10000 + +Assume we are given ``k21000`` in the lower idmapset of the first idmapping. 
We +want to know what id this was mapped from in the upper idmapset of the first +idmapping. So we're mapping up in the first idmapping:: + + id - k + u = n + k21000 - k20000 + u0 = u1000 + +Now assume we are given the id ``u1100`` in the upper idmapset of the second +idmapping and we want to know what this id maps down to in the lower idmapset +of the second idmapping. This means we're mapping down in the second +idmapping:: + + id - u + k = n + u1100 - u500 + k30000 = k30600 + +General notes +------------- + +In the context of the kernel an idmapping can be interpreted as mapping a range +of userspace ids into a range of kernel ids:: + + userspace-id:kernel-id:range + +A userspace id is always an element in the upper idmapset of an idmapping of +type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower +idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on +"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t`` +types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``. + +The kernel is mostly concerned with kernel ids. They are used when performing +permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field. +A userspace id on the other hand is an id that is reported to userspace by the +kernel, or is passed by userspace to the kernel, or a raw device id that is +written or read from disk. + +Note that we are only concerned with idmappings as the kernel stores them, not +how userspace would specify them. + +For the rest of this document we will prefix all userspace ids with ``u`` and +all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So +an idmapping will be written as ``u0:k10000:r10000``. + +For example, within this idmapping, the id ``u1000`` is an id in the upper idmapset or "userspace +idmapset" starting with ``u0``. And it is mapped to ``k11000`` which is a +kernel id in the lower idmapset or "kernel idmapset" starting with ``k10000``. 
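The two formulas from the formal notes, including the range check that the text skips for simplicity, can be sketched as a small userspace model. This is only an illustration of the arithmetic, not kernel code: the type ``Idmapping`` and the helpers ``map_down()`` and ``map_up()`` are hypothetical names, and an unmapped id is represented as ``None``.

```python
from typing import NamedTuple, Optional


class Idmapping(NamedTuple):
    """Userspace model of one u:k:r idmapping (hypothetical helper type)."""
    u: int  # first id in the upper (userspace) idmapset
    k: int  # first id in the lower (kernel) idmapset
    r: int  # range: how many ids are mapped


def map_down(m: Idmapping, uid: int) -> Optional[int]:
    """Map left to right (id - u + k). None means the id is unmapped."""
    if not m.u <= uid < m.u + m.r:
        return None
    return uid - m.u + m.k


def map_up(m: Idmapping, kid: int) -> Optional[int]:
    """Map right to left (id - k + u). None means the id is unmapped."""
    if not m.k <= kid < m.k + m.r:
        return None
    return kid - m.k + m.u


first = Idmapping(u=0, k=20000, r=10000)
second = Idmapping(u=500, k=30000, r=10000)
print(map_up(first, 21000))    # k21000 - k20000 + u0 = u1000
print(map_down(second, 1100))  # u1100 - u500 + k30000 = k30600
```

Mapping down and then up with the same idmapping returns the original id, which is the "invert each other" property stated above.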
+ +A kernel id is always created by an idmapping. Such idmappings are associated +with user namespaces. Since we mainly care about how idmappings work we're not +going to be concerned with how idmappings are created nor how they are used +outside of the filesystem context. This is best left to an explanation of user +namespaces. + +The initial user namespace is special. It always has an idmapping of the +following form:: + + u0:k0:r4294967295 + +which is an identity idmapping over the full range of ids available on this +system. + +Other user namespaces usually have non-identity idmappings such as:: + + u0:k10000:r10000 + +When a process creates or wants to change ownership of a file, or when the +ownership of a file is read from disk by a filesystem, the userspace id is +immediately translated into a kernel id according to the idmapping associated +with the relevant user namespace. + +For instance, consider a file that is stored on disk by a filesystem as being +owned by ``u1000``: + +- If a filesystem were to be mounted in the initial user namespaces (as most + filesystems are) then the initial idmapping will be used. As we saw this is + simply the identity idmapping. This would mean id ``u1000`` read from disk + would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` field + would contain ``k1000``. + +- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000`` + then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's + ``i_uid`` and ``i_gid`` would contain ``k11000``. + +Translation algorithms +---------------------- + +We've already seen briefly that it is possible to translate between different +idmappings. We'll now take a closer look how that works. + +Crossmapping +~~~~~~~~~~~~ + +This translation algorithm is used by the kernel in quite a few places. For +example, it is used when reporting back the ownership of a file to userspace +via the ``stat()`` system call family. 
+ +If we've been given ``k11000`` from one idmapping we can map that id up in +another idmapping. In order for this to work both idmappings need to contain +the same kernel id in their kernel idmapsets. For example, consider the +following idmappings:: + + 1. u0:k10000:r10000 + 2. u20000:k10000:r10000 + +and we are mapping ``u1000`` down to ``k11000`` in the first idmapping . We can +then translate ``k11000`` into a userspace id in the second idmapping using the +kernel idmapset of the second idmapping:: + + /* Map the kernel id up into a userspace id in the second idmapping. */ + from_kuid(u20000:k10000:r10000, k11000) = u21000 + +Note, how we can get back to the kernel id in the first idmapping by inverting +the algorithm:: + + /* Map the userspace id down into a kernel id in the second idmapping. */ + make_kuid(u20000:k10000:r10000, u21000) = k11000 + + /* Map the kernel id up into a userspace id in the first idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + +This algorithm allows us to answer the question what userspace id a given +kernel id corresponds to in a given idmapping. In order to be able to answer +this question both idmappings need to contain the same kernel id in their +respective kernel idmapsets. + +For example, when the kernel reads a raw userspace id from disk it maps it down +into a kernel id according to the idmapping associated with the filesystem. +Let's assume the filesystem was mounted with an idmapping of +``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This +means ``u1000`` will be mapped to ``k21000`` which is what will be stored in +the inode's ``i_uid`` and ``i_gid`` field. + +When someone in userspace calls ``stat()`` or a related function to get +ownership information about the file the kernel can't simply map the id back up +according to the filesystem's idmapping as this would give the wrong owner if +the caller is using an idmapping. 
+ +So the kernel will map the id back up in the idmapping of the caller. Let's +assume the caller has the slightly unconventional idmapping +``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``. +Consequently the user would see that this file is owned by ``u4000``. + +Remapping +~~~~~~~~~ + +It is possible to translate a kernel id from one idmapping to another one via +the userspace idmapset of the two idmappings. This is equivalent to remapping +a kernel id. + +Let's look at an example. We are given the following two idmappings:: + + 1. u0:k10000:r10000 + 2. u0:k20000:r10000 + +and we are given ``k11000`` in the first idmapping. In order to translate this +kernel id in the first idmapping into a kernel id in the second idmapping we +need to perform two steps: + +1. Map the kernel id up into a userspace id in the first idmapping:: + + /* Map the kernel id up into a userspace id in the first idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + +2. Map the userspace id down into a kernel id in the second idmapping:: + + /* Map the userspace id down into a kernel id in the second idmapping. */ + make_kuid(u0:k20000:r10000, u1000) = k21000 + +As you can see we used the userspace idmapset in both idmappings to translate +the kernel id in one idmapping to a kernel id in another idmapping. + +This allows us to answer the question what kernel id we would need to use to +get the same userspace id in another idmapping. In order to be able to answer +this question both idmappings need to contain the same userspace id in their +respective userspace idmapsets. + +Note how we can easily get back to the kernel id in the first idmapping by +inverting the algorithm: + +1. Map the kernel id up into a userspace id in the second idmapping:: + + /* Map the kernel id up into a userspace id in the second idmapping. */ + from_kuid(u0:k20000:r10000, k21000) = u1000 + +2. 
Map the userspace id down into a kernel id in the first idmapping:: + + /* Map the userspace id down into a kernel id in the first idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + +Another way to look at this translation is to treat it as inverting one +idmapping and applying another idmapping if both idmappings have the relevant +userspace id mapped. This will come in handy when working with idmapped mounts. + +Invalid translations +~~~~~~~~~~~~~~~~~~~~ + +It is never valid to use an id in the kernel idmapset of one idmapping as the +id in the userspace idmapset of another or the same idmapping. While the kernel +idmapset always indicates an idmapset in the kernel id space the userspace +idmapset indicates a userspace id. So the following translations are forbidden:: + + /* Map the userspace id down into a kernel id in the first idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + + /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */ + make_kuid(u10000:k20000:r10000, k11000) = k21000 + ~~~~~~ + +and equally wrong:: + + /* Map the kernel id up into a userspace id in the first idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + + /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */ + from_kuid(u20000:k0:r10000, u1000) = k21000 + ~~~~~ + +Idmappings when creating filesystem objects +------------------------------------------- + +The concepts of mapping an id down or mapping an id up are expressed in the two +kernel functions filesystem developers are rather familiar with and which we've +already used in this document:: + + /* Map the userspace id down into a kernel id. */ + make_kuid(idmapping, uid) + + /* Map the kernel id up into a userspace id. */ + from_kuid(idmapping, kuid) + +We will take an abbreviated look into how idmappings figure into creating +filesystem objects. 
For simplicity we will only look at what happens when the +VFS has already completed path lookup right before it calls into the filesystem +itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is +called. We will also assume that the directory we're creating filesystem +objects in is readable and writable for everyone. + +When creating a filesystem object the kernel will look at the caller's +filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids +but they are exclusively used when determining file ownership which is why they +are called "filesystem ids". They are usually identical to the uid and gid of +the caller but can differ. We will just assume they are always identical to not +get lost in too many details. + +When the caller enters the kernel two things happen: + +1. Map the caller's userspace ids down into kernel ids in the caller's + idmapping. + (To be precise, the kernel will simply look at the kernel ids stashed in the + credentials of the current task but for our education we'll pretend this + translation happens just in time.) +2. Verify that the caller's kernel ids can be mapped up to userspace ids in the + filesystem's idmapping. + +The second step is important as a regular filesystem will ultimately need to map +the kernel id back up into a userspace id when writing to disk. +So with the second step the kernel guarantees that a valid userspace id can be +written to disk. If it can't the kernel will refuse the creation request to not +even remotely risk filesystem corruption. + +The astute reader will have realized that this is simply a variation of the +crossmapping algorithm introduced in a previous section. First, the +kernel maps the caller's userspace id down into a kernel id according to the +caller's idmapping and then maps that kernel id up according to the +filesystem's idmapping. 
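The two steps above can be sketched as a userspace model. This is a rough illustration under stated assumptions, not the kernel's implementation: ``make_kuid()`` and ``from_kuid()`` here are plain Python stand-ins for the notation used in this document, an idmapping is a ``(u, k, r)`` tuple, ``disk_id_for_new_object()`` is a hypothetical name, and ``None`` stands for "unmapped, so the creation request is refused".

```python
from typing import Optional, Tuple

Idmapping = Tuple[int, int, int]  # (u, k, r), mirroring the u:k:r notation

IDENTITY: Idmapping = (0, 0, 4294967295)  # idmapping of the initial user namespace


def make_kuid(m: Idmapping, uid: int) -> Optional[int]:
    """Model only: map a userspace id down into a kernel id."""
    u, k, r = m
    return uid - u + k if u <= uid < u + r else None


def from_kuid(m: Idmapping, kid: int) -> Optional[int]:
    """Model only: map a kernel id up into a userspace id."""
    u, k, r = m
    return kid - k + u if k <= kid < k + r else None


def disk_id_for_new_object(caller: Idmapping, filesystem: Idmapping,
                           fsuid: int) -> Optional[int]:
    """Return the userspace id that would land on disk, or None if refused."""
    # Step 1: map the caller's userspace id down in the caller's idmapping.
    kid = make_kuid(caller, fsuid)
    if kid is None:
        return None
    # Step 2: verify the kernel id maps up in the filesystem's idmapping.
    return from_kuid(filesystem, kid)


# Caller and filesystem both use the identity idmapping: u1000 lands on disk.
print(disk_id_for_new_object(IDENTITY, IDENTITY, 1000))
# Caller u0:k10000:r10000, filesystem u0:k20000:r10000: k11000 is unmapped.
print(disk_id_for_new_object((0, 10000, 10000), (0, 20000, 10000), 1000))
```

The second call returns ``None`` because ``k11000`` lies outside the filesystem's kernel idmapset ``k20000`` to ``k29999``.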
+ +Example 1 +~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k0:r4294967295 + filesystem idmapping: u0:k0:r4294967295 + +Both the caller and the filesystem use the identity idmapping: + +1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Verify that the caller's kernel ids can be mapped to userspace ids in the + filesystem's idmapping. + + For this second step the kernel will call the function + ``fsuidgid_has_mapping()`` which ultimately boils down to calling + ``from_kuid()``:: + + from_kuid(u0:k0:r4294967295, k1000) = u1000 + +In this example both idmappings are the same so there's nothing exciting going +on. Ultimately the userspace id that lands on disk will be ``u1000``. + +Example 2 +~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + +1. Map the caller's userspace ids down into kernel ids in the caller's + idmapping:: + + make_kuid(u0:k10000:r10000, u1000) = k11000 + +2. Verify that the caller's kernel ids can be mapped up to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k20000:r10000, k11000) = u-1 + +It's immediately clear that while the caller's userspace id could be +successfully mapped down into kernel ids in the caller's idmapping the kernel +ids could not be mapped up according to the filesystem's idmapping. So the +kernel will deny this creation request. + +Note that while this example is less common, because most filesystem can't be +mounted with non-initial idmappings this is a general problem as we can see in +the next examples. + +Example 3 +~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k0:r4294967295 + +1. Map the caller's userspace ids down into kernel ids in the caller's + idmapping:: + + make_kuid(u0:k10000:r10000, u1000) = k11000 + +2. 
Verify that the caller's kernel ids can be mapped up to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k0:r4294967295, k11000) = u11000 + +We can see that the translation always succeeds. The userspace id that the +filesystem will ultimately put to disk will always be identical to the value of +the kernel id that was created in the caller's idmapping. This has two main +consequences. + +First, we can't allow a caller to ultimately write to disk with another +userspace id. We could only do this if we were to mount the whole filesystem +with the caller's or another idmapping. But that solution is limited to a few +filesystems and not very flexible. But this is a use-case that is pretty +important in containerized workloads. + +Second, the caller will usually not be able to create any files or access +directories that have stricter permissions because none of the filesystem's +kernel ids map up into valid userspace ids in the caller's idmapping: + +1. Map raw userspace ids down to kernel ids in the filesystem's idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Map kernel ids up to userspace ids in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k1000) = u-1 + +Example 4 +~~~~~~~~~ + +:: + + file id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k0:r4294967295 + +In order to report ownership to userspace the kernel uses the crossmapping +algorithm introduced in a previous section: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k1000) = u-1 + +The crossmapping algorithm fails in this case because the kernel id in the +filesystem idmapping cannot be mapped up to a userspace id in the caller's +idmapping. Thus, the kernel will report the ownership of this file as the +overflowid. 
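The ownership reporting path can be modelled the same way. The sketch below is only an illustration: ``make_kuid()`` and ``from_kuid()`` are Python stand-ins for the notation of this document, ``reported_owner()`` is a hypothetical name, and a result of ``None`` corresponds to the ``u-1`` case in which the kernel reports the overflowid.

```python
from typing import Optional, Tuple

Idmapping = Tuple[int, int, int]  # (u, k, r), mirroring the u:k:r notation

IDENTITY: Idmapping = (0, 0, 4294967295)  # idmapping of the initial user namespace


def make_kuid(m: Idmapping, uid: int) -> Optional[int]:
    """Model only: map a userspace id down into a kernel id."""
    u, k, r = m
    return uid - u + k if u <= uid < u + r else None


def from_kuid(m: Idmapping, kid: int) -> Optional[int]:
    """Model only: map a kernel id up into a userspace id."""
    u, k, r = m
    return kid - k + u if k <= kid < k + r else None


def reported_owner(filesystem: Idmapping, caller: Idmapping,
                   disk_uid: int) -> Optional[int]:
    """Crossmap an on-disk id into the caller's idmapping (None = overflowid)."""
    # Step 1: map the on-disk userspace id down in the filesystem's idmapping.
    kid = make_kuid(filesystem, disk_uid)
    if kid is None:
        return None
    # Step 2: map the kernel id up in the caller's idmapping.
    return from_kuid(caller, kid)


# Filesystem with the identity idmapping, caller u0:k10000:r10000: unmapped.
print(reported_owner(IDENTITY, (0, 10000, 10000), 1000))
# Filesystem u0:k20000:r10000, caller u3000:k20000:r10000: reported as u4000.
print(reported_owner((0, 20000, 10000), (3000, 20000, 10000), 1000))
```

Crossmapping only succeeds when the kernel id produced by the filesystem's idmapping is also present in the caller's kernel idmapset.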
+ +Example 5 +~~~~~~~~~ + +:: + + file id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + +In order to report ownership to userspace the kernel uses the crossmapping +algorithm introduced in a previous section: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k20000:r10000, u1000) = k21000 + +2. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k21000) = u-1 + +Again, the crossmapping algorithm fails in this case because the kernel id in +the filesystem idmapping cannot be mapped to a userspace id in the caller's +idmapping. Thus, the kernel will report the ownership of this file as the +overflowid. + +Note how in the last two examples things would be simple if the caller would be +using the initial idmapping. For a filesystem mounted with the initial +idmapping it would be trivial. So we only consider a filesystem with an +idmapping of ``u0:k20000:r10000``: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k20000:r10000, u1000) = k21000 + +2. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k0:r4294967295, k21000) = u21000 + +Idmappings on idmapped mounts +----------------------------- + +The examples we've seen in the previous section where the caller's idmapping +and the filesystem's idmapping are incompatible causes various issues for +workloads. For a more complex but common example, consider two containers +started on the host. 
To completely prevent the two containers from affecting +each other, an administrator may often use different non-overlapping idmappings +for the two containers:: + + container1 idmapping: u0:k10000:r10000 + container2 idmapping: u0:k20000:r10000 + filesystem idmapping: u0:k30000:r10000 + +An administrator wanting to provide easy read-write access to the following set +of files:: + + dir id: u0 + dir/file1 id: u1000 + dir/file2 id: u2000 + +to both containers currently can't. + +Of course the administrator has the option to recursively change ownership via +``chown()``. For example, they could change ownership so that ``dir`` and all +files below it can be crossmapped from the filesystem's into the container's +idmapping. Let's assume they change ownership so it is compatible with the +first container's idmapping:: + + dir id: u10000 + dir/file1 id: u11000 + dir/file2 id: u12000 + +This would still leave ``dir`` rather useless to the second container. In fact, +``dir`` and all files below it would continue to appear owned by the overflowid +for the second container. + +Or consider another increasingly popular example. Some service managers such as +systemd implement a concept called "portable home directories". A user may want +to use their home directories on different machines where they are assigned +different login userspace ids. Most users will have ``u1000`` as the login id +on their machine at home and all files in their home directory will usually be +owned by ``u1000``. At uni or at work they may have another login id such as +``u1125``. This makes it rather difficult to interact with their home directory +on their work machine. + +In both cases changing ownership recursively has grave implications. The most +obvious one is that ownership is changed globally and permanently. In the home +directory case this change in ownership would even need to happen everytime the +user switches from their home to their work machine. 
For really large sets of +files this becomes increasingly costly. + +If the user is lucky, they are dealing with a filesystem that is mountable +inside user namespaces. But this would also change ownership globally and the +change in ownership is tied to the lifetime of the filesystem mount, i.e. the +superblock. The only way to change ownership is to completely unmount the +filesystem and mount it again in another user namespace. This is usually +impossible because it would mean that all users currently accessing the +filesystem can't anymore. And it means that ``dir`` still can't be shared +between two containers with different idmappings. +But usually the user doesn't even have this option since most filesystems +aren't mountable inside containers. And not having them mountable might be +desirable as it doesn't require the filesystem to deal with malicious +filesystem images. + +But the use cases mentioned above and more can be handled by idmapped mounts. +They allow exposing the same set of dentries with different ownership at +different mounts. This is achieved by marking the mounts with a user namespace +through the ``mount_setattr()`` system call. The idmapping associated with it +is then used to translate from the caller's idmapping to the filesystem's +idmapping and vice versa using the remapping algorithm we introduced above. + +Idmapped mounts make it possible to change ownership in a temporary and +localized way. The ownership changes are restricted to a specific mount and the +ownership changes are tied to the lifetime of the mount. All other users and +locations where the filesystem is exposed are unaffected. + +Filesystems that support idmapped mounts don't have any real reason to support +being mountable inside user namespaces. A filesystem could be exposed +completely under an idmapped mount to get the same effect. This has the +advantage that filesystems can leave the creation of the superblock to +privileged users in the initial user namespace.
+ +However, it is perfectly possible to combine idmapped mounts with filesystems +mountable inside user namespaces. We will touch on this further below. + +Remapping helpers +~~~~~~~~~~~~~~~~~ + +Idmapping functions were added that translate between idmappings. They make use +of the remapping algorithm we've introduced earlier. We're going to look at +two: + +- ``i_uid_into_mnt()`` and ``i_gid_into_mnt()`` + + The ``i_*id_into_mnt()`` functions translate the filesystem's kernel ids into + kernel ids in the mount's idmapping:: + + /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(filesystem, kid) = uid + + /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(mount, uid) = kuid + +- ``mapped_fsuid()`` and ``mapped_fsgid()`` + + The ``mapped_fs*id()`` functions translate the caller's kernel ids into + kernel ids in the filesystem's idmapping. This translation is achieved by + remapping the caller's kernel ids using the mount's idmapping:: + + /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(mount, kid) = uid + + /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(filesystem, uid) = kuid + +Note that these two functions invert each other. Consider the following +idmappings:: + + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + mount idmapping: u0:k10000:r10000 + +Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id +to ``k21000`` according to its idmapping. This is what is stored in the +inode's ``i_uid`` and ``i_gid`` fields. + +When the caller queries the ownership of this file via ``stat()`` the kernel +would usually simply use the crossmapping algorithm and map the filesystem's +kernel id up to a userspace id in the caller's idmapping.
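The two remapping helpers described above can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not kernel code: the `Idmapping` tuple, the `OVERFLOW` stand-in for an unmappable id, and the Python signatures are assumptions made for illustration. The function names merely mirror the kernel helpers, and each idmapping is assumed to consist of a single extent written `u<first_uid>:k<first_kid>:r<count>`.

```python
from typing import NamedTuple

OVERFLOW = -1  # illustrative stand-in for "cannot be mapped" (u-1 in the text)


class Idmapping(NamedTuple):
    """A single extent, written u<first_uid>:k<first_kid>:r<count> in the text."""
    first_uid: int
    first_kid: int
    count: int


def make_kuid(m: Idmapping, uid: int) -> int:
    """Map a userspace id down into a kernel id in idmapping m."""
    if m.first_uid <= uid < m.first_uid + m.count:
        return m.first_kid + (uid - m.first_uid)
    return OVERFLOW


def from_kuid(m: Idmapping, kid: int) -> int:
    """Map a kernel id up into a userspace id in idmapping m."""
    if m.first_kid <= kid < m.first_kid + m.count:
        return m.first_uid + (kid - m.first_kid)
    return OVERFLOW


def i_uid_into_mnt(mount: Idmapping, filesystem: Idmapping, kid: int) -> int:
    """Model: filesystem kernel id -> kernel id in the mount's idmapping."""
    uid = from_kuid(filesystem, kid)
    return OVERFLOW if uid == OVERFLOW else make_kuid(mount, uid)


def mapped_fsuid(mount: Idmapping, filesystem: Idmapping, kid: int) -> int:
    """Model: caller kernel id -> kernel id in the filesystem's idmapping."""
    uid = from_kuid(mount, kid)
    return OVERFLOW if uid == OVERFLOW else make_kuid(filesystem, uid)


# The idmappings from the text.
caller = Idmapping(0, 10000, 10000)
filesystem = Idmapping(0, 20000, 10000)
mount = Idmapping(0, 10000, 10000)

on_disk_kid = make_kuid(filesystem, 1000)                 # k21000
print(from_kuid(caller, on_disk_kid))                     # -1: plain crossmapping fails
print(i_uid_into_mnt(mount, filesystem, on_disk_kid))     # 11000
print(mapped_fsuid(mount, filesystem, 11000))             # 21000, the inverse direction
```

Under these assumptions the two helpers undo each other for every id covered by both extents, which is the inversion property the text relies on.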
+ +But when the caller is accessing the file on an idmapped mount the kernel will +first call ``i_uid_into_mnt()`` thereby translating the filesystem's kernel id +into a kernel id in the mount's idmapping:: + + i_uid_into_mnt(k21000): + /* Map the filesystem's kernel id up into a userspace id. */ + from_kuid(u0:k20000:r10000, k21000) = u1000 + + /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + +Finally, when the kernel reports the owner to the caller it will turn the +kernel id in the mount's idmapping into a userspace id in the caller's +idmapping:: + + from_kuid(u0:k10000:r10000, k11000) = u1000 + +We can test whether this algorithm really works by verifying what happens when +we create a new file. Let's say the user is creating a file with ``u1000``. + +The kernel maps this to ``k11000`` in the caller's idmapping. Usually the +kernel would now apply the crossmapping, verifying that ``k11000`` can be +mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't +be mapped up in the filesystem's idmapping directly this creation request +fails. + +But when the caller is accessing the file on an idmapped mount the kernel will +first call ``mapped_fs*id()`` thereby translating the caller's kernel id into +a kernel id according to the mount's idmapping:: + + mapped_fsuid(k11000): + /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + + /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(u0:k20000:r10000, u1000) = k21000 + +When finally writing to disk the kernel will then map ``k21000`` up into a +userspace id in the filesystem's idmapping:: + + from_kuid(u0:k20000:r10000, k21000) = u1000 + +As we can see, we end up with an invertible and therefore information +preserving algorithm.
A file created from ``u1000`` on an idmapped mount will +also be reported as being owned by ``u1000`` and vice versa. + +Let's now briefly reconsider the failing examples from earlier in the context +of idmapped mounts. + +Example 2 reconsidered +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + mount idmapping: u0:k10000:r10000 + +When the caller is using a non-initial idmapping the common case is to attach +the same idmapping to the mount. We now perform three steps: + +1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: + + make_kuid(u0:k10000:r10000, u1000) = k11000 + +2. Translate the caller's kernel id into a kernel id in the filesystem's + idmapping:: + + mapped_fsuid(k11000): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + + /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(u0:k20000:r10000, u1000) = k21000 + +3. Verify that the caller's kernel ids can be mapped to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k20000:r10000, k21000) = u1000 + +So the ownership that lands on disk will be ``u1000``. + +Example 3 reconsidered +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + caller id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k0:r4294967295 + mount idmapping: u0:k10000:r10000 + +The same translation algorithm works with the third example. + +1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: + + make_kuid(u0:k10000:r10000, u1000) = k11000 + +2. Translate the caller's kernel id into a kernel id in the filesystem's + idmapping:: + + mapped_fsuid(k11000): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(u0:k10000:r10000, k11000) = u1000 + + /* Map the userspace id down into a kernel id in the filesystem's idmapping.
*/ + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +3. Verify that the caller's kernel ids can be mapped to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k0:r4294967295, k1000) = u1000 + +So the ownership that lands on disk will be ``u1000``. + +Example 4 reconsidered +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + file id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k0:r4294967295 + mount idmapping: u0:k10000:r10000 + +In order to report ownership to userspace the kernel now does three steps using +the translation algorithm we introduced earlier: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Translate the kernel id into a kernel id in the mount's idmapping:: + + i_uid_into_mnt(k1000): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(u0:k0:r4294967295, k1000) = u1000 + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + +3. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k11000) = u1000 + +Earlier, the caller's kernel id couldn't be crossmapped in the filesystem's +idmapping. With the idmapped mount in place it now can be crossmapped into the +filesystem's idmapping via the mount's idmapping. The file will now be created +with ``u1000`` according to the mount's idmapping. + +Example 5 reconsidered +~~~~~~~~~~~~~~~~~~~~~~ + +:: + + file id: u1000 + caller idmapping: u0:k10000:r10000 + filesystem idmapping: u0:k20000:r10000 + mount idmapping: u0:k10000:r10000 + +Again, in order to report ownership to userspace the kernel now does three +steps using the translation algorithm we introduced earlier: + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k20000:r10000, u1000) = k21000 + +2.
Translate the kernel id into a kernel id in the mount's idmapping:: + + i_uid_into_mnt(k21000): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(u0:k20000:r10000, k21000) = u1000 + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(u0:k10000:r10000, u1000) = k11000 + +3. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k10000:r10000, k11000) = u1000 + +Earlier, the file's kernel id couldn't be crossmapped in the filesystem's +idmapping. With the idmapped mount in place it now can be crossmapped into the +filesystem's idmapping via the mount's idmapping. The file is now owned by +``u1000`` according to the mount's idmapping. + +Changing ownership on a home directory +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We've seen above how idmapped mounts can be used to translate between +idmappings when either the caller, the filesystem or both use a non-initial +idmapping. A wide range of use cases exist when the caller is using +a non-initial idmapping. This mostly happens in the context of containerized +workloads. The consequence, as we have seen, is that for both filesystems +mounted with the initial idmapping and filesystems mounted with non-initial +idmappings, access to the filesystem isn't working because the kernel ids can't +be crossmapped between the caller's and the filesystem's idmapping. + +As we've seen above idmapped mounts provide a solution to this by remapping the +caller's or filesystem's idmapping according to the mount's idmapping. + +Aside from containerized workloads, idmapped mounts have the advantage that +they also work when both the caller and the filesystem use the initial +idmapping which means users on the host can change the ownership of directories +and files on a per-mount basis. + +Consider our previous example where a user has their home directory on portable +storage.
At home they have id ``u1000`` and all files in their home directory +are owned by ``u1000`` whereas at uni or work they have login id ``u1125``. + +Taking their home directory with them becomes problematic. They can't easily +access their files, they might not be able to write to disk without applying +lax permissions or ACLs and even if they can, they will end up with an annoying +mix of files and directories owned by ``u1000`` and ``u1125``. + +Idmapped mounts solve this problem. A user can create an idmapped +mount for their home directory on their work computer or their computer at home +depending on what ownership they would prefer to end up on the portable storage +itself. + +Let's assume they want all files on disk to belong to ``u1000``. When the user +plugs in their portable storage at their work station they can set up a job that +creates an idmapped mount with the minimal idmapping ``u1000:k1125:r1``. So now +when they create a file the kernel performs the following steps we already know +from above:: + + caller id: u1125 + caller idmapping: u0:k0:r4294967295 + filesystem idmapping: u0:k0:r4294967295 + mount idmapping: u1000:k1125:r1 + +1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: + + make_kuid(u0:k0:r4294967295, u1125) = k1125 + +2. Translate the caller's kernel id into a kernel id in the filesystem's + idmapping:: + + mapped_fsuid(k1125): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(u1000:k1125:r1, k1125) = u1000 + + /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +3. Verify that the caller's kernel ids can be mapped to userspace ids in the + filesystem's idmapping:: + + from_kuid(u0:k0:r4294967295, k1000) = u1000 + +So ultimately the file will be created with ``u1000`` on disk.
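The creation path just walked through can be replayed with a small Python model. It is a sketch under stated assumptions, not the kernel implementation: idmappings are plain `(first_uid, first_kid, count)` tuples, `OVERFLOW` is an illustrative stand-in for an unmappable id, and `owner_on_disk()` is a hypothetical helper that chains the three steps listed above.

```python
OVERFLOW = -1  # illustrative stand-in for an id that cannot be mapped


def make_kuid(idmap, uid):
    """Map a userspace id down into a kernel id in the given extent."""
    first_uid, first_kid, count = idmap
    if first_uid <= uid < first_uid + count:
        return first_kid + (uid - first_uid)
    return OVERFLOW


def from_kuid(idmap, kid):
    """Map a kernel id up into a userspace id in the given extent."""
    first_uid, first_kid, count = idmap
    if first_kid <= kid < first_kid + count:
        return first_uid + (kid - first_kid)
    return OVERFLOW


INITIAL = (0, 0, 4294967295)   # u0:k0:r4294967295 (caller and filesystem)
MOUNT = (1000, 1125, 1)        # u1000:k1125:r1, the minimal mount idmapping


def owner_on_disk(caller_uid):
    """Hypothetical helper: which userspace id ends up on disk for a new file."""
    kid = make_kuid(INITIAL, caller_uid)   # step 1: caller's idmapping
    uid = from_kuid(MOUNT, kid)            # step 2a: up through the mount
    if uid == OVERFLOW:
        return OVERFLOW                    # not covered by the mount's idmapping
    fs_kid = make_kuid(INITIAL, uid)       # step 2b: down into the filesystem
    return from_kuid(INITIAL, fs_kid)      # step 3: id written to disk


print(owner_on_disk(1125))  # 1000
print(owner_on_disk(1000))  # -1: only k1125 is covered by the one-element range
```

The second call shows a consequence of choosing a range of one: in this model any caller other than `u1125` has no mapping through the mount, so the single-extent idmapping only serves the one user it was created for.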
+ +Now let's briefly look at what ownership the caller with id ``u1125`` will see +on their work computer: + +:: + + file id: u1000 + caller idmapping: u0:k0:r4294967295 + filesystem idmapping: u0:k0:r4294967295 + mount idmapping: u1000:k1125:r1 + +1. Map the userspace id on disk down into a kernel id in the filesystem's + idmapping:: + + make_kuid(u0:k0:r4294967295, u1000) = k1000 + +2. Translate the kernel id into a kernel id in the mount's idmapping:: + + i_uid_into_mnt(k1000): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(u0:k0:r4294967295, k1000) = u1000 + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(u1000:k1125:r1, u1000) = k1125 + +3. Map the kernel id up into a userspace id in the caller's idmapping:: + + from_kuid(u0:k0:r4294967295, k1125) = u1125 + +So ultimately the kernel will report to the caller that the file belongs to ``u1125`` +which is the caller's userspace id on their workstation in our example. + +The raw userspace id that is put on disk is ``u1000`` so when the user takes +their home directory back to their home computer where they are assigned +``u1000`` using the initial idmapping and mount the filesystem with the initial +idmapping they will see all those files owned by ``u1000``. + +Shortcircuiting +--------------- + +Currently, the implementation of idmapped mounts enforces that the filesystem +is mounted with the initial idmapping. The reason is simply that none of the +filesystems that we targeted were mountable with a non-initial idmapping. But +that might change soon enough. As we've seen above, thanks to the properties of +idmappings the translation works for both filesystems mounted with the initial +idmapping and filesystems with non-initial idmappings. + +Based on this current restriction to filesystems mounted with the initial +idmapping two noticeable shortcuts have been taken: + +1.
We always stash a reference to the initial user namespace in ``struct + vfsmount``. Idmapped mounts are thus mounts that have a non-initial user + namespace attached to them. + + In order to support idmapped mounts this needs to be changed. Instead of + stashing the initial user namespace, the user namespace the filesystem was + mounted with must be stashed. An idmapped mount is then any mount that has + a different user namespace attached than the filesystem was mounted with. + This has no user-visible consequences. + +2. The translation algorithms in ``mapped_fs*id()`` and ``i_*id_into_mnt()`` + are simplified. + + Let's consider ``mapped_fs*id()`` first. This function translates the + caller's kernel id into a kernel id in the filesystem's idmapping via + a mount's idmapping. The full algorithm is:: + + mapped_fsuid(kid): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(mount-idmapping, kid) = uid + + /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ + make_kuid(filesystem-idmapping, uid) = kuid + + We know that the filesystem is always mounted with the initial idmapping as + we enforce this in ``mount_setattr()``. So this can be shortened to:: + + mapped_fsuid(kid): + /* Map the kernel id up into a userspace id in the mount's idmapping. */ + from_kuid(mount-idmapping, kid) = uid + + /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ + KUIDT_INIT(uid) = kuid + + Similarly, for ``i_*id_into_mnt()`` which translates the filesystem's kernel + id into a mount's kernel id:: + + i_uid_into_mnt(kid): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + from_kuid(filesystem-idmapping, kid) = uid + + /* Map the userspace id down into a kernel id in the mount's idmapping. */ + make_kuid(mount-idmapping, uid) = kuid + + Again, we know that the filesystem is always mounted with the initial + idmapping as we enforce this in ``mount_setattr()``.
So this can be + shortened to:: + + i_uid_into_mnt(kid): + /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ + __kuid_val(kid) = uid + + /* Map the userspace id down into a kernel id in the mounts's idmapping. */ + make_kuid(mount-idmapping, uid) = kuid + +Handling filesystems mounted with non-initial idmappings requires that the +translation functions be converted to their full form. They can still be +shortcircuited on non-idmapped mounts. This has no user-visible consequences. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 246af51b277a..1a2dd4d35717 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -34,6 +34,7 @@ algorithms work. quota seq_file sharedsubtree + idmappings automount-support @@ -72,7 +73,7 @@ Documentation for filesystem implementations. befs bfs btrfs - cifs/cifsroot + cifs/index ceph coda configfs diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 2183fd8cc350..2a75dd5da7b5 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -271,19 +271,19 @@ prototypes:: locking rules: All except set_page_dirty and freepage may block -====================== ======================== ========= -ops PageLocked(page) i_rwsem -====================== ======================== ========= +====================== ======================== ========= =============== +ops PageLocked(page) i_rwsem invalidate_lock +====================== ======================== ========= =============== writepage: yes, unlocks (see below) -readpage: yes, unlocks +readpage: yes, unlocks shared writepages: set_page_dirty no -readahead: yes, unlocks -readpages: no +readahead: yes, unlocks shared +readpages: no shared write_begin: locks the page exclusive write_end: yes, unlocks exclusive bmap: -invalidatepage: yes +invalidatepage: yes exclusive releasepage: yes freepage: yes direct_IO: @@ 
-295,7 +295,7 @@ is_partially_uptodate: yes error_remove_page: yes swap_activate: no swap_deactivate: no -====================== ======================== ========= +====================== ======================== ========= =============== ->write_begin(), ->write_end() and ->readpage() may be called from the request handler (/dev/loop). @@ -378,7 +378,10 @@ keep it that way and don't breed new callers. ->invalidatepage() is called when the filesystem must attempt to drop some or all of the buffers from the page when it is being truncated. It returns zero on success. If ->invalidatepage is zero, the kernel uses -block_invalidatepage() instead. +block_invalidatepage() instead. The filesystem must exclusively acquire +invalidate_lock before invalidating page cache in truncate / hole punch path +(and thus calling into ->invalidatepage) to block races between page cache +invalidation and page cache filling functions (fault, read, ...). ->releasepage() is called when the kernel is about to try to drop the buffers from the page in preparation for freeing it. 
It returns zero to @@ -506,6 +509,7 @@ prototypes:: ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll) (struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); @@ -518,12 +522,6 @@ prototypes:: int (*fsync) (struct file *, loff_t start, loff_t end, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, - loff_t *); - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, - loff_t *); - ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, - void __user *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, @@ -536,6 +534,14 @@ prototypes:: size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **, void **); long (*fallocate)(struct file *, int, loff_t, loff_t); + void (*show_fdinfo)(struct seq_file *m, struct file *f); + unsigned (*mmap_capabilities)(struct file *); + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, + loff_t, size_t, unsigned int); + loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags); + int (*fadvise)(struct file *, loff_t, loff_t, int); locking rules: All may block. @@ -570,6 +576,25 @@ in sys_read() and friends. the lease within the individual filesystem to record the result of the operation +->fallocate implementation must be really careful to maintain page cache +consistency when punching holes or performing other operations that invalidate +page cache contents. 
Usually the filesystem needs to call +truncate_inode_pages_range() to invalidate relevant range of the page cache. +However the filesystem usually also needs to update its internal (and on disk) +view of file offset -> disk block mapping. Until this update is finished, the +filesystem needs to block page faults and reads from reloading now-stale page +cache contents from the disk. Since VFS acquires mapping->invalidate_lock in +shared mode when loading pages from disk (filemap_fault(), filemap_read(), +readahead paths), the fallocate implementation must take the invalidate_lock to +prevent reloading. + +->copy_file_range and ->remap_file_range implementations need to serialize +against modifications of file data while the operation is running. For +blocking changes through write(2) and similar operations inode->i_rwsem can be +used. To block changes to file contents via a memory mapping during the +operation, the filesystem must take mapping->invalidate_lock to coordinate +with ->page_mkwrite. + dquot_operations ================ @@ -627,11 +652,11 @@ pfn_mkwrite: yes access: yes ============= ========= =========================== -->fault() is called when a previously not present pte is about -to be faulted in. The filesystem must find and return the page associated -with the passed in "pgoff" in the vm_fault structure. If it is possible that -the page may be truncated and/or invalidated, then the filesystem must lock -the page, then ensure it is not already truncated (the page lock will block +->fault() is called when a previously not present pte is about to be faulted +in. The filesystem must find and return the page associated with the passed in +"pgoff" in the vm_fault structure. If it is possible that the page may be +truncated and/or invalidated, then the filesystem must lock invalidate_lock, +then ensure the page is not already truncated (invalidate_lock will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the page locked. 
The VM will unlock the page. @@ -644,12 +669,14 @@ page table entry. Pointer to entry associated with the page is passed in "pte" field in vm_fault structure. Pointers to entries for other offsets should be calculated relative to "pte". -->page_mkwrite() is called when a previously read-only pte is -about to become writeable. The filesystem again must ensure that there are -no truncate/invalidate races, and then return with the page locked. If -the page has been truncated, the filesystem should not look up a new page -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which -will cause the VM to retry the fault. +->page_mkwrite() is called when a previously read-only pte is about to become +writeable. The filesystem again must ensure that there are no +truncate/invalidate races or races with operations such as ->remap_file_range +or ->copy_file_range, and then return with the page locked. Usually +mapping->invalidate_lock is suitable for proper serialization. If the page has +been truncated, the filesystem should not look up a new page like the ->fault() +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to +retry the fault. ->pfn_mkwrite() is the same as page_mkwrite but when the pte is VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is diff --git a/Documentation/filesystems/mandatory-locking.rst b/Documentation/filesystems/mandatory-locking.rst deleted file mode 100644 index 9ce73544a8f0..000000000000 --- a/Documentation/filesystems/mandatory-locking.rst +++ /dev/null @@ -1,188 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -===================================================== -Mandatory File Locking For The Linux Operating System -===================================================== - - Andy Walker <andy@lysaker.kvaerner.no> - - 15 April 1996 - - (Updated September 2007) - -0. 
Why you should avoid mandatory locking ------------------------------------------ - -The Linux implementation is prey to a number of difficult-to-fix race -conditions which in practice make it not dependable: - - - The write system call checks for a mandatory lock only once - at its start. It is therefore possible for a lock request to - be granted after this check but before the data is modified. - A process may then see file data change even while a mandatory - lock was held. - - Similarly, an exclusive lock may be granted on a file after - the kernel has decided to proceed with a read, but before the - read has actually completed, and the reading process may see - the file data in a state which should not have been visible - to it. - - Similar races make the claimed mutual exclusion between lock - and mmap similarly unreliable. - -1. What is mandatory locking? ------------------------------- - -Mandatory locking is kernel enforced file locking, as opposed to the more usual -cooperative file locking used to guarantee sequential access to files among -processes. File locks are applied using the flock() and fcntl() system calls -(and the lockf() library routine which is a wrapper around fcntl().) It is -normally a process' responsibility to check for locks on a file it wishes to -update, before applying its own lock, updating the file and unlocking it again. -The most commonly used example of this (and in the case of sendmail, the most -troublesome) is access to a user's mailbox. The mail user agent and the mail -transfer agent must guard against updating the mailbox at the same time, and -prevent reading the mailbox while it is being updated. - -In a perfect world all processes would use and honour a cooperative, or -"advisory" locking scheme. However, the world isn't perfect, and there's -a lot of poorly written code out there. 
-
-In trying to address this problem, the designers of System V UNIX came up
-with a "mandatory" locking scheme, whereby the operating system kernel would
-block attempts by a process to write to a file that another process holds a
-"read" -or- "shared" lock on, and block attempts to both read and write to a
-file that a process holds a "write " -or- "exclusive" lock on.
-
-The System V mandatory locking scheme was intended to have as little impact as
-possible on existing user code. The scheme is based on marking individual files
-as candidates for mandatory locking, and using the existing fcntl()/lockf()
-interface for applying locks just as if they were normal, advisory locks.
-
-.. Note::
-
-   1. In saying "file" in the paragraphs above I am actually not telling
-      the whole truth. System V locking is based on fcntl(). The granularity of
-      fcntl() is such that it allows the locking of byte ranges in files, in
-      addition to entire files, so the mandatory locking rules also have byte
-      level granularity.
-
-   2. POSIX.1 does not specify any scheme for mandatory locking, despite
-      borrowing the fcntl() locking scheme from System V. The mandatory locking
-      scheme is defined by the System V Interface Definition (SVID) Version 3.
-
-2. Marking a file for mandatory locking
----------------------------------------
-
-A file is marked as a candidate for mandatory locking by setting the group-id
-bit in its file mode but removing the group-execute bit. This is an otherwise
-meaningless combination, and was chosen by the System V implementors so as not
-to break existing user programs.
-
-Note that the group-id bit is usually automatically cleared by the kernel when
-a setgid file is written to. This is a security measure. The kernel has been
-modified to recognize the special case of a mandatory lock candidate and to
-refrain from clearing this bit. Similarly the kernel has been modified not
-to run mandatory lock candidates with setgid privileges.
-
-3. Available implementations
-----------------------------
-
-I have considered the implementations of mandatory locking available with
-SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.
-
-Generally I have tried to make the most sense out of the behaviour exhibited
-by these three reference systems. There are many anomalies.
-
-All the reference systems reject all calls to open() for a file on which
-another process has outstanding mandatory locks. This is in direct
-contravention of SVID 3, which states that only calls to open() with the
-O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
-definition, which is the "Right Thing", since only calls with O_TRUNC can
-modify the contents of the file.
-
-HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
-just mandatory locks. That would appear to contravene POSIX.1.
-
-mmap() is another interesting case. All the operating systems mentioned
-prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
-also disallows advisory locks for such a file. SVID actually specifies the
-paranoid HP-UX behaviour.
-
-In my opinion only MAP_SHARED mappings should be immune from locking, and then
-only from mandatory locks - that is what is currently implemented.
-
-SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
-mandatory locks, so reads and writes to locked files always block when they
-should return EAGAIN.
-
-I'm afraid that this is such an esoteric area that the semantics described
-below are just as valid as any others, so long as the main points seem to
-agree.
-
-4. Semantics
-------------
-
-1. Mandatory locks can only be applied via the fcntl()/lockf() locking
-   interface - in other words the System V/POSIX interface. BSD style
-   locks using flock() never result in a mandatory lock.
-
-2. If a process has locked a region of a file with a mandatory read lock, then
-   other processes are permitted to read from that region. If any of these
-   processes attempts to write to the region it will block until the lock is
-   released, unless the process has opened the file with the O_NONBLOCK
-   flag in which case the system call will return immediately with the error
-   status EAGAIN.
-
-3. If a process has locked a region of a file with a mandatory write lock, all
-   attempts to read or write to that region block until the lock is released,
-   unless a process has opened the file with the O_NONBLOCK flag in which case
-   the system call will return immediately with the error status EAGAIN.
-
-4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has
-   any mandatory locks owned by other processes will be rejected with the
-   error status EAGAIN.
-
-5. Attempts to apply a mandatory lock to a file that is memory mapped and
-   shared (via mmap() with MAP_SHARED) will be rejected with the error status
-   EAGAIN.
-
-6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
-   that has any mandatory locks in effect will be rejected with the error status
-   EAGAIN.
-
-5. Which system calls are affected?
------------------------------------
-
-Those which modify a file's contents, not just the inode. That gives read(),
-write(), readv(), writev(), open(), creat(), mmap(), truncate() and
-ftruncate(). truncate() and ftruncate() are considered to be "write" actions
-for the purposes of mandatory locking.
-
-The affected region is usually defined as stretching from the current position
-for the total number of bytes read or written. For the truncate calls it is
-defined as the bytes of a file removed or added (we must also consider bytes
-added, as a lock can specify just "the whole file", rather than a specific
-range of bytes.)
-
-Note 3: I may have overlooked some system calls that need mandatory lock
-checking in my eagerness to get this code out the door. Please let me know, or
-better still fix the system calls yourself and submit a patch to me or Linus.
-
-6. Warning!
------------
-
-Not even root can override a mandatory lock, so runaway processes can wreak
-havoc if they lock crucial files. The way around it is to change the file
-permissions (remove the setgid bit) before trying to read or write to it.
-Of course, that might be a bit tricky if the system is hung :-(
-
-7. The "mand" mount option
---------------------------
-Mandatory locking is disabled on all filesystems by default, and must be
-administratively enabled by mounting with "-o mand". That mount option
-is only allowed if the mounting task has the CAP_SYS_ADMIN capability.
-
-Since kernel v4.5, it is possible to disable mandatory locking
-altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
-with this disabled will reject attempts to mount filesystems with the
-"mand" mount option with the error status EPERM.
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
index 4598b0d90b60..164960631925 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.rst
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
@@ -170,7 +170,7 @@ Documentation/driver-api/early-userspace/early_userspace_support.rst for more de
 The kernel does not depend on external cpio tools. If you specify a
 directory instead of a configuration file, the kernel's build infrastructure
 creates a configuration file from that directory (usr/Makefile calls
-usr/gen_initramfs_list.sh), and proceeds to package up that directory
+usr/gen_initramfs.sh), and proceeds to package up that directory
 using the config file (by feeding it to usr/gen_init_cpio, which is created
 from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
 entirely self-contained, and the kernel's boot-time extractor is also