summaryrefslogtreecommitdiff
path: root/fs/overlayfs
AgeCommit message (Collapse)AuthorFilesLines
2021-01-28ovl: perform vfs_getxattr() with mounter credsMiklos Szeredi1-0/+2
The vfs_getxattr() in ovl_xattr_set() is used to check whether an xattr exist on a lower layer file that is to be removed. If the xattr does not exist, then no need to copy up the file. This call of vfs_getxattr() wasn't wrapped in credential override, and this is probably okay. But for consitency wrap this instance as well. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-28ovl: add warning on user_ns mismatchMiklos Szeredi1-0/+4
Currently there's no way to create an overlay filesystem outside of the current user namespace. Make sure that if this assumption changes it doesn't go unnoticed. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-01-24overlayfs: do not mount on top of idmapped mountsChristian Brauner1-0/+4
Prevent overlayfs from being mounted on top of idmapped mounts. Stacking filesystems need to be prevented from being mounted on top of idmapped mounts until they have have been converted to handle this. Link: https://lore.kernel.org/r/20210121131959.646623-29-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24fs: make helpers idmap mount awareChristian Brauner4-19/+24
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: prepare for idmapped mountsChristian Brauner2-11/+13
The various vfs_*() helpers are called by filesystems or by the vfs itself to perform core operations such as create, link, mkdir, mknod, rename, rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace and pass it down. Afterwards the checks and operations are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: introduce struct renamedataChristian Brauner1-1/+8
In order to handle idmapped mounts we will extend the vfs rename helper to take two new arguments in follow up patches. Since this operations already takes a bunch of arguments add a simple struct renamedata and make the current helper use it before we extend it. Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24xattr: handle idmapped mountsTycho Andersen5-17/+20
When interacting with extended attributes the vfs verifies that the caller is privileged over the inode with which the extended attribute is associated. For posix access and posix default extended attributes a uid or gid can be stored on-disk. Let the functions handle posix extended attributes on idmapped mounts. If the inode is accessed through an idmapped mount we need to map it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. This has no effect for e.g. security xattrs since they don't store uids or gids and don't perform permission checks on them like posix acls do. Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24acl: handle idmapped mountsChristian Brauner1-0/+3
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24attr: handle idmapped mountsChristian Brauner4-8/+8
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24inode: make init and permission helpers idmapped mount awareChristian Brauner4-5/+5
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: make permission helpers idmapped mount awareChristian Brauner3-4/+4
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24capability: handle idmapped mountsChristian Brauner1-1/+1
In order to determine whether a caller holds privilege over a given inode the capability framework exposes the two helpers privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former verifies that the inode has a mapping in the caller's user namespace and the latter additionally verifies that the caller has the requested capability in their current user namespace. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. If the initial user namespace is passed all operations are a nop so non-idmapped mounts will not see a change in behavior. Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-14ovl: unprivieged mountsMiklos Szeredi1-0/+1
Enable unprivileged user namespace mounts of overlayfs. Overlayfs's permission model (*) ensures that the mounter itself cannot gain additional privileges by the act of creating an overlayfs mount. This feature request is coming from the "rootless" container crowd. (*) Documentation/filesystems/overlayfs.txt#Permission model Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14ovl: do not get metacopy for userxattrMiklos Szeredi1-0/+7
When looking up an inode on the lower layer for which the mounter lacks read permisison the metacopy check will fail. This causes the lookup to fail as well, even though the directory is readable. So ignore EACCES for the "userxattr" case and assume no metacopy for the unreadable file. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14ovl: do not fail because of O_NOATIMEMiklos Szeredi1-8/+3
In case the file cannot be opened with O_NOATIME because of lack of capabilities, then clear O_NOATIME instead of failing. Remove WARN_ON(), since it would now trigger if O_NOATIME was cleared. Noticed by Amir Goldstein. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14ovl: do not fail when setting origin xattrMiklos Szeredi1-1/+2
Comment above call already says this, but only EOPNOTSUPP is ignored, other failures are not. For example setting "user.*" will fail with EPERM on symlink/special. Ignore this error as well. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14ovl: user xattrMiklos Szeredi5-16/+71
Optionally allow using "user.overlay." namespace instead of "trusted.overlay." This is necessary for overlayfs to be able to be mounted in an unprivileged namepsace. Make the option explicit, since it makes the filesystem format be incompatible. Disable redirect_dir and metacopy options, because these would allow privilege escalation through direct manipulation of the "user.overlay.redirect" or "user.overlay.metacopy" xattrs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
2020-12-14ovl: simplify file spliceMiklos Szeredi1-44/+2
generic_file_splice_read() and iter_file_splice_write() will call back into f_op->iter_read() and f_op->iter_write() respectively. These already do the real file lookup and cred override. So the code in ovl_splice_read() and ovl_splice_write() is redundant. In addition the ovl_file_accessed() call in ovl_splice_write() is incorrect, though probably harmless. Fix by calling generic_file_splice_read() and iter_file_splice_write() directly. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-12-14ovl: make ioctl() safeMiklos Szeredi1-71/+16
ovl_ioctl_set_flags() does a capability check using flags, but then the real ioctl double-fetches flags and uses potentially different value. The "Check the capability before cred override" comment misleading: user can skip this check by presenting benign flags first and then overwriting them to non-benign flags. Just remove the cred override for now, hoping this doesn't cause a regression. The proper solution is to create a new setxflags i_op (patches are in the works). Xfstests don't show a regression. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Fixes: dab5ca8fd9dd ("ovl: add lsattr/chattr support") Cc: <stable@vger.kernel.org> # v4.19
2020-12-14ovl: check privs before decoding file handleMiklos Szeredi2-0/+6
CAP_DAC_READ_SEARCH is required by open_by_handle_at(2) so check it in ovl_decode_real_fh() as well to prevent privilege escalation for unprivileged overlay mounts. [Amir] If the mounter is not capable in init ns, ovl_check_origin() and ovl_verify_index() will not function as expected and this will break index and nfs export features. So check capability in ovl_can_decode_fh(), to auto disable those features. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-12ovl: fix incorrect extent info in metacopy caseChengguang Xu1-1/+1
In metacopy case, we should use ovl_inode_realdata() instead of ovl_inode_real() to get real inode which has data, so that we can get correct information of extentes in ->fiemap operation. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-12ovl: expand warning in ovl_d_real()Miklos Szeredi1-5/+8
There was a syzbot report with this warning but insufficient information... Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-12ovl: warn about orphan metacopyKevin Locke1-0/+2
When the lower file of a metacopy is inaccessible, -EIO is returned. For users not familiar with overlayfs internals, such as myself, the meaning of this error may not be apparent or easy to determine, since the (metacopy) file is present and open/stat succeed when accessed outside of the overlay. Add a rate-limited warning for orphan metacopy to give users a hint when investigating such errors. Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxi23Zsmfb4rCed1n=On0NNA5KZD74jjjeyz+et32sk-gg@mail.gmail.com/ Signed-off-by: Kevin Locke <kevin@kevinlocke.name> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-12ovl: introduce new "uuid=off" option for inodes index featurePavel Tikhomirov4-2/+26
This replaces uuid with null in overlayfs file handles and thus relaxes uuid checks for overlay index feature. It is only possible in case there is only one filesystem for all the work/upper/lower directories and bare file handles from this backing filesystem are unique. In other case when we have multiple filesystems lets just fallback to "uuid=on" which is and equivalent of how it worked before with all uuid checks. This is needed when overlayfs is/was mounted in a container with index enabled (e.g.: to be able to resolve inotify watch file handles on it to paths in CRIU), and this container is copied and started alongside with the original one. This way the "copy" container can't have the same uuid on the superblock and mounting the overlayfs from it later would fail. That is an example of the problem on top of loop+ext4: dd if=/dev/zero of=loopbackfile.img bs=100M count=10 losetup -fP loopbackfile.img losetup -a #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img) mkfs.ext4 loopbackfile.img mkdir loop-mp mount -o loop /dev/loop0 loop-mp mkdir loop-mp/{lower,upper,work,merged} mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged umount loop-mp/merged umount loop-mp e2fsck -f /dev/loop0 tune2fs -U random /dev/loop0 mount -o loop /dev/loop0 loop-mp mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\ upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged #mount: /loop-test/loop-mp/merged: #mount(2) system call failed: Stale file handle. If you just change the uuid of the backing filesystem, overlay is not mounting any more. In Virtuozzo we copy container disks (ploops) when create the copy of container and we require fs uuid to be unique for a new container. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-12ovl: propagate ovl_fs to ovl_decode_real_fh and ovl_encode_real_fhPavel Tikhomirov5-30/+38
This will be used in next patch to be able to change uuid checks and add uuid nullification based on ofs->config.index for a new "uuid=off" mode. Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-06ovl: use generic vfs_ioc_setflags_prepare() helperAmir Goldstein1-32/+30
Canonalize to ioctl FS_* flags instead of inode S_* flags. Note that we do not call the helper vfs_ioc_fssetxattr_check() for FS_IOC_FSSETXATTR ioctl. The reason is that underlying filesystem will perform all the checks. We only need to perform the capability check before overriding credentials. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-06ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directoriesAmir Goldstein3-12/+51
[S|G]ETFLAGS and FS[S|G]ETXATTR ioctls are applicable to both files and directories, so add ioctl operations to dir as well. We teach ovl_real_fdget() to get the realfile of directories which use a different type of file->private_data. Ifdef away compat ioctl implementation to conform to standard practice. With this change, xfstest generic/079 which tests these ioctls on files and directories passes. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: rearrange ovl_can_list()Miklos Szeredi1-3/+6
ovl_can_list() should return false for overlay private xattrs. Since currently these use the "trusted.overlay." prefix, they will always match the "trusted." prefix as well, hence the test for being non-trusted will not trigger. Prepare for using the "user.overlay." namespace by moving the test for private xattr before the test for non-trusted. This patch doesn't change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: enumerate private xattrsMiklos Szeredi3-26/+59
Instead of passing the xattr name down to the ovl_do_*xattr() accessor functions, pass an enumerated value. The enum can use the same names as the the previous #define for each xattr name. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: pass ovl_fs down to functions accessing private xattrsMiklos Szeredi9-86/+107
This paves the way for optionally using the "user.overlay." xattr namespace. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: drop flags argument from ovl_do_setxattr()Miklos Szeredi6-9/+9
All callers pass zero flags to ovl_do_setxattr(). So drop this argument. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrsMiklos Szeredi2-4/+4
Call ovl_do_*xattr() when accessing an overlay private xattr, vfs_*xattr() otherwise. This has an effect on debug output, which is made more consistent by this patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: use ovl_do_getxattr() for private xattrMiklos Szeredi4-8/+15
Use the convention of calling ovl_do_foo() for operations which are overlay specific. This patch is a no-op, and will have significance for supporting "user.overlay." xattr namespace. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: fold ovl_getxattr() into ovl_get_redirect_xattr()Miklos Szeredi1-36/+17
This is a partial revert (with some cleanups) of commit 993a0b2aec52 ("ovl: Do not lose security.capability xattr over metadata file copy-up"), which introduced ovl_getxattr() in the first place. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: clean up ovl_getxattr() in copy_up.cMiklos Szeredi1-21/+11
Lose the padding and the failure message (in line with other parts of the copy up process). Return zero for both nonexistent or empty xattr. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02duplicate ovl_getxattr()Miklos Szeredi3-4/+35
ovl_getattr() returns the value of an xattr in a kmalloced buffer. There are two callers: ovl_copy_up_meta_inode_data() (copy_up.c) ovl_get_redirect_xattr() (util.c) This patch just copies ovl_getxattr() to copy_up.c, the following patches will deal with the differences in idividual callers. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: provide a mount option "volatile"Vivek Goyal5-8/+97
Container folks are complaining that dnf/yum issues too many sync while installing packages and this slows down the image build. Build requirement is such that they don't care if a node goes down while build was still going on. In that case, they will simply throw away unfinished layer and start new build. So they don't care about syncing intermediate state to the disk and hence don't want to pay the price associated with sync. So they are asking for mount options where they can disable sync on overlay mount point. They primarily seem to have two use cases. - For building images, they will mount overlay with nosync and then sync upper layer after unmounting overlay and reuse upper as lower for next layer. - For running containers, they don't seem to care about syncing upper layer because if node goes down, they will simply throw away upper layer and create a fresh one. So this patch provides a mount option "volatile" which disables all forms of sync. Now it is caller's responsibility to throw away upper if system crashes or shuts down and start fresh. With "volatile", I am seeing roughly 20% speed up in my VM where I am just installing emacs in an image. Installation time drops from 31 seconds to 25 seconds when nosync option is used. This is for the case of building on top of an image where all packages are already cached. That way I take out the network operations latency out of the measurement. Giuseppe is also looking to cut down on number of iops done on the disk. He is complaining that often in cloud their VMs are throttled if they cross the limit. This option can help them where they reduce number of iops (by cutting down on frequent sync and writebacks). Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-02ovl: check for incompatible features in work dirAmir Goldstein2-11/+46
An incompatible feature is marked by a non-empty directory nested 2 levels deep under "work" dir, e.g.: workdir/work/incompat/volatile. This commit checks for marked incompat features, warns about them and fails to mount the overlay, for example: overlayfs: overlay with incompat feature 'volatile' cannot be mounted Very old kernels (i.e. v3.18) will fail to remove a non-empty "work" dir and fail the mount. Newer kernels will fail to remove a "work" dir with entries nested 3 levels and fall back to read-only mount. User mounting with old kernel will see a warning like these in dmesg: overlayfs: cleanup of 'incompat/...' failed (-39) overlayfs: cleanup of 'work/incompat' failed (-39) overlayfs: cleanup of 'ovl-work/work' failed (-39) overlayfs: failed to create directory /vdf/ovl-work/work (errno: 17); mounting read-only These warnings should give the hint to the user that: 1. mount failure is caused by backward incompatible features 2. mount failure can be resolved by manually removing the "work" directory There is nothing preventing users on old kernels from manually removing workdir entirely or mounting overlay with a new workdir, so this is in no way a full proof backward compatibility enforcement, but only a best effort. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-08-04Merge tag 'uninit-macro-v5.9-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()
2020-07-16treewide: Remove uninitialized_var() usageKees Cook1-1/+1
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16ovl: fix lookup of indexed hardlinks with metacopyAmir Goldstein1-0/+4
We recently moved setting inode flag OVL_UPPERDATA to ovl_lookup(). When looking up an overlay dentry, upperdentry may be found by index and not by name. In that case, we fail to read the metacopy xattr and falsly set the OVL_UPPERDATA on the overlay inode. This caused a regression in xfstest overlay/033 when run with OVERLAY_MOUNT_OPTIONS="-o metacopy=on". Fixes: 28166ab3c875 ("ovl: initialize OVL_UPPERDATA in ovl_lookup()") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: fix unneeded call to ovl_change_flags()Amir Goldstein1-4/+6
The check if user has changed the overlay file was wrong, causing unneeded call to ovl_change_flags() including taking f_lock on every file access. Fixes: d989903058a8 ("ovl: do not generate duplicate fsnotify events for "fake" path") Cc: <stable@vger.kernel.org> # v4.19+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: fix mount option checks for nfs_export with no upperdirAmir Goldstein1-13/+18
Without upperdir mount option, there is no index dir and the dependency checks nfs_export => index for mount options parsing are incorrect. Allow the combination nfs_export=on,index=off with no upperdir and move the check for dependency redirect_dir=nofollow for non-upper mount case to mount options parsing. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: force read-only sb on failure to create index dirAmir Goldstein1-5/+6
With index feature enabled, on failure to create index dir, overlay is being mounted read-only. However, we do not forbid user to remount overlay read-write. Fix that by setting ofs->workdir to NULL, which prevents remount read-write. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: fix regression with re-formatted lower squashfsAmir Goldstein1-0/+12
Commit 9df085f3c9a2 ("ovl: relax requirement for non null uuid of lower fs") relaxed the requirement for non null uuid with single lower layer to allow enabling index and nfs_export features with single lower squashfs. Fabian reported a regression in a setup when overlay re-uses an existing upper layer and re-formats the lower squashfs image. Because squashfs has no uuid, the origin xattr in upper layer are decoded from the new lower layer where they may resolve to a wrong origin file and user may get an ESTALE or EIO error on lookup. To avoid the reported regression while still allowing the new features with single lower squashfs, do not allow decoding origin with lower null uuid unless user opted-in to one of the new features that require following the lower inode of non-dir upper (index, xino, metacopy). Reported-by: Fabian <godi.beat@gmx.net> Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/ Fixes: 9df085f3c9a2 ("ovl: relax requirement for non null uuid of lower fs") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: fix oops in ovl_indexdir_cleanup() with nfs_export=onAmir Goldstein1-9/+7
Mounting with nfs_export=on, xfstests overlay/031 triggers a kernel panic since v5.8-rc1 overlayfs updates. overlayfs: orphan index entry (index/00fb1..., ftype=4000, nlink=2) BUG: kernel NULL pointer dereference, address: 0000000000000030 RIP: 0010:ovl_cleanup_and_whiteout+0x28/0x220 [overlay] Bisect point at commit c21c839b8448 ("ovl: whiteout inode sharing") Minimal reproducer: -------------------------------------------------- rm -rf l u w m mkdir -p l u w m mkdir -p l/testdir touch l/testdir/testfile mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m echo 1 > m/testdir/testfile umount m rm -rf u/testdir mount -t overlay -o lowerdir=l,upperdir=u,workdir=w,nfs_export=on overlay m umount m -------------------------------------------------- When mount with nfs_export=on, and fail to verify an orphan index, we're cleaning this index from indexdir by calling ovl_cleanup_and_whiteout(). This dereferences ofs->workdir, that was earlier set to NULL. The design was that ovl->workdir will point at ovl->indexdir, but we are assigning ofs->indexdir to ofs->workdir only after ovl_indexdir_cleanup(). There is no reason not to do it sooner, because once we get success from ofs->indexdir = ovl_workdir_create(... there is no turning back. Reported-and-tested-by: Murphy Zhou <jencce.kernel@gmail.com> Fixes: c21c839b8448 ("ovl: whiteout inode sharing") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: relax WARN_ON() when decoding lower directory file handleAmir Goldstein1-1/+1
Decoding a lower directory file handle to overlay path with cold inode/dentry cache may go as follows: 1. Decode real lower file handle to lower dir path 2. Check if lower dir is indexed (was copied up) 3. If indexed, get the upper dir path from index 4. Lookup upper dir path in overlay 5. If overlay path found, verify that overlay lower is the lower dir from step 1 On failure to verify step 5 above, user will get an ESTALE error and a WARN_ON will be printed. A mismatch in step 5 could be a result of lower directory that was renamed while overlay was offline, after that lower directory has been copied up and indexed. This is a scripted reproducer based on xfstest overlay/052: # Create lower subdir create_dirs create_test_files $lower/lowertestdir/subdir mount_dirs # Copy up lower dir and encode lower subdir file handle touch $SCRATCH_MNT/lowertestdir test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle # Rename lower dir offline unmount_dirs mv $lower/lowertestdir $lower/lowertestdir.new/ mount_dirs # Attempt to decode lower subdir file handle test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle Since this WARN_ON() can be triggered by user we need to relax it. Fixes: 4b91c30a5a19 ("ovl: lookup connected ancestor of dir in inode cache") Cc: <stable@vger.kernel.org> # v4.16+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: remove not used argument in ovl_check_originyoungjun1-9/+2
ovl_check_origin outparam 'ctrp' argument not used by caller. So remove this argument. Signed-off-by: youngjun <her0gyugyu@gmail.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: change ovl_copy_up_flags staticyoungjun2-2/+1
"ovl_copy_up_flags" is used in copy_up.c. so, change it static. Signed-off-by: youngjun <her0gyugyu@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-16ovl: inode reference leak in ovl_is_inuse true case.youngjun1-1/+10
When "ovl_is_inuse" true case, trap inode reference not put. plus adding the comment explaining sequence of ovl_is_inuse after ovl_setup_trap. Fixes: 0be0bfd2de9d ("ovl: fix regression caused by overlapping layers detection") Cc: <stable@vger.kernel.org> # v4.19+ Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: youngjun <her0gyugyu@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>