path: root/fs
Age | Commit message | Author | Files | Lines
2017-02-21 | proc/sysctl: Don't grab i_lock under sysctl_lock. | Eric W. Biederman | 1 | -13/+18

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> This patch has locking problem. I've got lockdep splat under LTP.
>
> [ 6633.115456] ======================================================
> [ 6633.115502] [ INFO: possible circular locking dependency detected ]
> [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L
> [ 6633.115584] -------------------------------------------------------
> [ 6633.115627] ksm02/284980 is trying to acquire lock:
> [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
> [ 6633.115834] but task is already holding lock:
> [ 6633.115882] (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
> [ 6633.116026] which lock already depends on the new lock.
> [ 6633.116026]
> [ 6633.116080]
> [ 6633.116080] the existing dependency chain (in reverse order) is:
> [ 6633.116117]
> -> #2 (sysctl_lock){+.+...}:
> -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
> -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
>
> d_lock nests inside i_lock
> sysctl_lock nests inside d_lock in d_compare
>
> This patch adds i_lock nesting inside sysctl_lock.

Al Viro <viro@ZenIV.linux.org.uk> replied:
> Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd
> try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
> drop sysctl_lock() before it and regain after. Make sure that no inodes
> are added to the list once ->unregistering has been set and use RCU list
> primitives for modifying the inode list, with sysctl_lock still used to
> serialize its modifications.
>
> Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
> igrab() is safe there. Since we don't drop inode reference until after we'd
> passed beyond it in the list, list_for_each_entry_rcu() should be fine.

I agree with Al Viro's analysis of the situation.

Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering")
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-02-14 | vfs: Use upper filesystem inode in bprm_fill_uid() | Vivek Goyal | 1 | -1/+1

Right now bprm_fill_uid() uses the inode fetched from file_inode(bprm->file). In a stacked filesystem setup this returns the inode of the lower filesystem.

I was playing with modified patches of shiftfs posted by James Bottomley and realized that through shiftfs the setuid bit does not take effect. The reason is that we fetch uid/gid from the inode of the lower fs (and not from the shiftfs inode), and that results in the following checks failing.

    /* We ignore suid/sgid if there are no mappings for them in the ns */
    if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
        !kgid_has_mapping(bprm->cred->user_ns, gid))
            return;

The uid/gid fetched from the lower fs inode might not be mapped inside the user namespace of the container. So we need to look at the uid/gid fetched from the upper filesystem (shiftfs in this particular case); these should be mapped, and the setuid bit can then take effect.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-13 | proc/sysctl: prune stale dentries during unregistering | Konstantin Khlebnikov | 3 | -19/+50

Currently unregistering a sysctl table does not prune its dentries. Stale dentries can slow down sysctl operations significantly.

For example, the command:

 # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done

creates millions of stale dentries around the sysctls of the loopback interface:

 # sysctl fs.dentry-state
 fs.dentry-state = 25812579 24724135 45 0 0 0

All of them have matching names, so a lookup has to scan through the whole hash chain and call d_compare (proc_sys_compare), which checks them under a system-wide spinlock (sysctl_lock).

 # time sysctl -a > /dev/null
 real 1m12.806s
 user 0m0.016s
 sys 1m12.400s

Currently only the memory reclaimer can remove this garbage, but without significant memory pressure this never happens.

This patch collects sysctl inodes into a list on the sysctl table header and prunes all their dentries once that table unregisters.

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> On 10.02.2017 10:47, Al Viro wrote:
>> how about
>> the matching stats *after* that patch?
>
> dcache size doesn't grow endlessly, so stats are fine
>
> # sysctl fs.dentry-state
> fs.dentry-state = 92712 58376 45 0 0 0
>
> # time sysctl -a &>/dev/null
>
> real 0m0.013s
> user 0m0.004s
> sys 0m0.008s

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-03 | mnt: Tuck mounts under others instead of creating shadow/side mounts. | Eric W. Biederman | 4 | -63/+111

Ever since mount propagation was introduced, in cases where a mount is propagated to a parent mount and mountpoint pair that is already in use, the code has placed the new mount behind the old mount in the mount hash table.

This implementation detail is problematic as it allows creating arbitrary length mount hash chains.

Furthermore it invalidates the constraint maintained elsewhere in the mount code that a parent mount and a mountpoint pair will have exactly one mount upon them. This makes it hard to deal with and to talk about this special case in the mount code.

Modify mount propagation to notice when there is already a mount at the parent mount and mountpoint where a new mount is propagating to and place that preexisting mount on top of the new mount.

Modify unmount propagation to notice when a mount that is being unmounted has another mount on top of it (and no other children), and to replace the unmounted mount with the mount on top of it.

Move the MNT_UMOUNT test from __lookup_mnt_last into __propagate_umount as that is the only call of __lookup_mnt_last where MNT_UMOUNT may be set on any mount visible in the mount hash table.

These modifications allow:
 - __lookup_mnt_last to be removed.
 - attach_shadows to be renamed __attach_mnt and its shadow handling to be removed.
 - commit_tree to be simplified
 - copy_tree to be simplified

The result is an easier to understand tree of mounts that does not allow creation of arbitrary length hash chains in the mount hash table.

The result is also a very slight userspace visible difference in semantics. The following two cases now behave identically, where before order mattered:

case 1: (explicit user action)
    B is a slave of A
    mount something on A/a, it will propagate to B/a
    and then mount something on B/a

case 2: (tucked mount)
    B is a slave of A
    mount something on B/a
    and then mount something on A/a

Historically umount A/a would fail in case 1 and succeed in case 2. Now umount A/a succeeds in both configurations.

This very small change in semantics appears, if anything, to be a bug fix to me, and my survey of userspace leads me to believe that no programs will notice or care about this subtle semantic change.

v2: Updated mnt_change_mountpoint to not call dput or mntput and instead to decrement the counts directly. It is guaranteed that there will be other references when mnt_change_mountpoint is called so this is safe.

v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt, as the locking in fs/namespace.c changed between v2 and v3.

v4: Reworked the logic in propagate_mount_busy and __propagate_umount that detects when a mount completely covers another mount.

v5: Removed unnecessary tests whose result is always true in find_topper and attach_recursive_mnt.

v6: Document the user space visible semantic difference.

Cc: stable@vger.kernel.org
Fixes: b90fa9ae8f51 ("[PATCH] shared mount handling: bind and rbind")
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-02-03 | Merge branch 'nsfs-discovery' | Eric W. Biederman | 1 | -0/+13

Michael Kerrisk <mtk.manpages@gmail.com> writes:

I would like to write code that discovers the namespace setup on a live system. The NS_GET_PARENT and NS_GET_USERNS ioctl() operations added in Linux 4.9 provide much of what I want, but there are still a couple of small pieces missing. Those pieces are added with this patch series.

Here's an example program that makes use of the new ioctl() operations.

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
/* ns_capable.c

   (C) 2016 Michael Kerrisk, <mtk.manpages@gmail.com>

   Licensed under the GNU General Public License v2 or later.

   Test whether a process (identified by PID) might (subject to LSM checks)
   have capabilities in a namespace (identified by a /proc/PID/ns/xxx file).
*/
/* The preprocessor lines are missing from this copy of the message (only
   the tail of a macro survives). The includes, fallback ioctl definitions
   and errExit() macro below are a reconstruction, not the author's
   original lines. Build with -lcap. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <limits.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/capability.h>

#ifndef NS_GET_USERNS
#define NSIO    0xb7
#define NS_GET_USERNS           _IO(NSIO, 0x1)
#define NS_GET_PARENT           _IO(NSIO, 0x2)
#endif
#ifndef NS_GET_NSTYPE
#define NS_GET_NSTYPE           _IO(NSIO, 0x3)
#endif
#ifndef NS_GET_OWNER_UID
#define NS_GET_OWNER_UID        _IO(NSIO, 0x4)
#endif

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

/* Display capabilities sets of process with specified PID */

static void
show_cap(pid_t pid)
{
    cap_t caps;
    char *cap_string;

    caps = cap_get_pid(pid);
    if (caps == NULL)
        errExit("cap_get_pid");

    cap_string = cap_to_text(caps, NULL);
    if (cap_string == NULL)
        errExit("cap_to_text");

    printf("Capabilities: %s\n", cap_string);

    cap_free(cap_string);
    cap_free(caps);
}

/* Obtain the effective UID of the process 'pid' by
   scanning its /proc/PID/status file */

static uid_t
get_euid_of_process(pid_t pid)
{
    char path[PATH_MAX];
    char line[1024];
    int uid;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long) pid);

    fp = fopen(path, "r");
    if (fp == NULL)
        errExit("fopen-/proc/PID/status");

    for (;;) {
        if (fgets(line, sizeof(line), fp) == NULL) {

            /* Should never happen... */

            fprintf(stderr, "Failure scanning %s\n", path);
            exit(EXIT_FAILURE);
        }

        /* The second field of the "Uid:" line is the effective UID */

        if (strstr(line, "Uid:") == line) {
            sscanf(line, "Uid: %*d %d %*d %*d", &uid);
            fclose(fp);
            return uid;
        }
    }
}

int
main(int argc, char *argv[])
{
    int ns_fd, userns_fd, pid_userns_fd;
    int nstype;
    int next_fd;
    struct stat pid_stat;
    struct stat target_stat;
    char *pid_str;
    pid_t pid;
    char path[PATH_MAX];

    if (argc < 2) {
        fprintf(stderr, "Usage: %s PID [ns-file]\n", argv[0]);
        fprintf(stderr, "\t'ns-file' is a /proc/PID/ns/xxxx file; "
                        "if omitted, use the namespace\n"
                        "\treferred to by standard input "
                        "(file descriptor 0)\n");
        exit(EXIT_FAILURE);
    }

    pid_str = argv[1];
    pid = atoi(pid_str);

    if (argc <= 2) {
        ns_fd = STDIN_FILENO;
    } else {
        ns_fd = open(argv[2], O_RDONLY);
        if (ns_fd == -1)
            errExit("open-ns-file");
    }

    /* Get the relevant user namespace FD, which is 'ns_fd' if 'ns_fd' refers
       to a user namespace, otherwise the user namespace that owns 'ns_fd' */

    nstype = ioctl(ns_fd, NS_GET_NSTYPE);
    if (nstype == -1)
        errExit("ioctl-NS_GET_NSTYPE");

    if (nstype == CLONE_NEWUSER) {
        userns_fd = ns_fd;
    } else {
        userns_fd = ioctl(ns_fd, NS_GET_USERNS);
        if (userns_fd == -1)
            errExit("ioctl-NS_GET_USERNS");
    }

    /* Obtain 'stat' info for the user namespace of the specified PID */

    snprintf(path, sizeof(path), "/proc/%s/ns/user", pid_str);

    pid_userns_fd = open(path, O_RDONLY);
    if (pid_userns_fd == -1)
        errExit("open-PID");

    if (fstat(pid_userns_fd, &pid_stat) == -1)
        errExit("fstat-PID");

    /* Get 'stat' info for the target user namespace */

    if (fstat(userns_fd, &target_stat) == -1)
        errExit("fstat-target");

    /* If the PID is in the target user namespace, then it has
       whatever capabilities are in its sets. */

    if (pid_stat.st_dev == target_stat.st_dev &&
            pid_stat.st_ino == target_stat.st_ino) {
        printf("PID is in target namespace\n");
        printf("Subject to LSM checks, it has the following capabilities\n");

        show_cap(pid);

        exit(EXIT_SUCCESS);
    }

    /* Otherwise, we need to walk through the ancestors of the target
       user namespace to see if PID is in an ancestor namespace */

    for (;;) {
        int f;

        next_fd = ioctl(userns_fd, NS_GET_PARENT);

        if (next_fd == -1) {

            /* The error here should be EPERM... */

            if (errno != EPERM)
                errExit("ioctl-NS_GET_PARENT");

            printf("PID is not in an ancestor namespace\n");
            printf("It has no capabilities in the target namespace\n");

            exit(EXIT_SUCCESS);
        }

        if (fstat(next_fd, &target_stat) == -1)
            errExit("fstat-parent");

        /* If the 'stat' info for the PID's user namespace matches the
           'stat' info for 'next_fd', then the PID is in an ancestor
           namespace */

        if (pid_stat.st_dev == target_stat.st_dev &&
                pid_stat.st_ino == target_stat.st_ino)
            break;

        /* Next time round, get the next parent */

        f = userns_fd;
        userns_fd = next_fd;
        close(f);
    }

    /* At this point, we found that PID is in an ancestor of the target
       user namespace, and 'userns_fd' refers to the immediate descendant
       user namespace of PID in the chain of user namespaces from PID to the
       target user namespace. If the effective UID of PID matches the owner
       UID of descendant user namespace, then PID has all capabilities in
       the descendant namespace(s); otherwise, it just has the capabilities
       that are in its sets. */

    uid_t owner_uid, uid;

    if (ioctl(userns_fd, NS_GET_OWNER_UID, &owner_uid) == -1) {
        perror("ioctl-NS_GET_OWNER_UID");
        exit(EXIT_FAILURE);
    }

    uid = get_euid_of_process(pid);

    printf("PID is in an ancestor namespace\n");
    if (owner_uid == uid) {
        printf("And its effective UID matches the owner "
                "of the namespace\n");
        printf("Subject to LSM checks, PID has all capabilities in "
                "that namespace!\n");
    } else {
        printf("But its effective UID does not match the owner "
                "of the namespace\n");
        printf("Subject to LSM checks, it has the following capabilities\n");
        show_cap(pid);
    }

    exit(EXIT_SUCCESS);
}
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

Michael Kerrisk (2):
  nsfs: Add an ioctl() to return the namespace type
  nsfs: Add an ioctl() to return owner UID of a userns

 fs/nsfs.c                 | 13 +++++++++++++
 include/uapi/linux/nsfs.h |  9 +++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)
2017-02-03 | nsfs: Add an ioctl() to return owner UID of a userns | Michael Kerrisk (man-pages) | 1 | -0/+11

I'd like to write code that discovers the user namespace hierarchy on a running system, and also shows who owns the various user namespaces.

Currently, there is no way of getting the owner UID of a user namespace. Therefore, this patch adds a new NS_GET_OWNER_UID ioctl() that fetches the UID (as seen in the user namespace of the caller) of the creator of the user namespace referred to by the specified file descriptor.

If the supplied file descriptor does not refer to a user namespace, the operation fails with the error EINVAL. If the owner UID does not have a mapping in the caller's user namespace, return the overflow UID, as that appears easier to deal with in practice in user-space applications.

-- EWB Changed the handling of unmapped UIDs from -EOVERFLOW back to the overflow uid. Per conversation with Michael Kerrisk after examining his test code.

Acked-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
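The overflow-UID behaviour described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct uid_extent` and `uid_as_seen_by_caller()` are hypothetical names that mimic the three columns of a `uid_map` line, not kernel code, and 65534 is assumed as the overflow UID (the usual default of /proc/sys/kernel/overflowuid).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One uid_map line: `count` IDs starting at `first` inside the namespace
 * correspond to IDs starting at `lower_first` outside it. */
struct uid_extent {
    uint32_t first;
    uint32_t lower_first;
    uint32_t count;
};

#define OVERFLOW_UID 65534u /* assumed default overflow UID */

/* Hypothetical helper: translate an outside (kernel-side) UID into the
 * caller's namespace. When no extent covers it, report the overflow UID
 * instead of failing, which is the behaviour the commit settles on. */
uint32_t uid_as_seen_by_caller(const struct uid_extent *map, size_t n,
                               uint32_t kuid)
{
    for (size_t i = 0; i < n; i++) {
        assert(map[i].count > 0);
        if (kuid >= map[i].lower_first &&
            kuid - map[i].lower_first < map[i].count)
            return map[i].first + (kuid - map[i].lower_first);
    }
    return OVERFLOW_UID;
}
```

A caller that only wants to print an owner can therefore display the result unconditionally, without a separate error path for unmapped owners.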
2017-02-01 | fs: Better permission checking for submounts | Eric W. Biederman | 9 | -19/+39

To support unprivileged users mounting filesystems two permission checks have to be performed: a test to see if the user is allowed to create a mount in the mount namespace, and a test to see if the user is allowed to access the specified filesystem.

The automount case is special in that mounting the original filesystem grants permission to mount the sub-filesystems to any user who happens to stumble across their mountpoint and satisfies the ordinary filesystem permission checks.

Attempting to handle the automount case by using override_creds almost works. It preserves the idea that permission to mount the original filesystem is permission to mount the sub-filesystem. Unfortunately using override_creds messes up the filesystem's ordinary permission checks.

Solve this by being explicit that a mount is a submount by introducing vfs_submount, and using it where appropriate.

vfs_submount uses a new internal mount flag, MS_SUBMOUNT, to let sget and friends know that a mount is a submount so they can take appropriate action.

sget and sget_userns are modified to not perform any permission checks on submounts.

follow_automount is modified to stop using override_creds as that has proven problematic.

do_mount is modified to always remove the new MS_SUBMOUNT flag so that we know userspace will never be able to specify it.

autofs4 is modified to stop using current_real_cred that was put in there to handle the previous version of submount permission checking.

cifs is modified to pass the mountpoint all of the way down to vfs_submount.

debugfs is modified to pass the mountpoint all of the way down to trace_automount by adding a new parameter. To make this change easier a new typedef debugfs_automount_t is introduced to capture the type of the debugfs automount function.

Cc: stable@vger.kernel.org
Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds")
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-02-01 | vfs: open() with O_CREAT should not create inodes with unknown ids | Seth Forshee | 1 | -0/+6

may_create() rejects creation of inodes with ids which lack a mapping into s_user_ns. However for O_CREAT may_o_create() is used instead. Add a similar check there.

Fixes: 036d523641c6 ("vfs: Don't create inodes with a uid or gid unknown to the vfs")
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-01-25 | nsfs: Add an ioctl() to return the namespace type | Michael Kerrisk (man-pages) | 1 | -0/+2

Linux 4.9 added two ioctl() operations that can be used to discover:

* the parental relationships for hierarchical namespaces (user and PID) [NS_GET_PARENT]
* the user namespace that owns a specified non-user-namespace [NS_GET_USERNS]

For no good reason that I can glean, NS_GET_USERNS was made synonymous with NS_GET_PARENT for user namespaces. It might have been better if NS_GET_USERNS had returned an error if the supplied file descriptor referred to a user namespace, since it suggests that the caller may be confused. More particularly, if it had generated an error, then I wouldn't need the new ioctl() operation proposed here. (On the other hand, what I propose here may be more generally useful.)

I would like to write code that discovers namespace relationships for the purpose of understanding the namespace setup on a running system. In particular, given a file descriptor (or pathname) for a namespace, N, I'd like to obtain the corresponding user namespace. Namespace N might be a user namespace (in which case my code would just use N) or a non-user namespace (in which case my code will use NS_GET_USERNS to get the user namespace associated with N). The problem is that there is no way to tell the difference by looking at the file descriptor (and if I try to use NS_GET_USERNS on an N that is a user namespace, I get the parent user namespace of N, which is not what I want).

This patch therefore adds a new ioctl(), NS_GET_NSTYPE, which, given a file descriptor that refers to a namespace, returns the namespace type (one of the CLONE_NEW* constants).

Signed-off-by: Michael Kerrisk <mtk-manpages@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
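The "use N itself or ask for its owner" decision described above might look roughly like the following. It is a sketch, not code from the patch: `nstype_name()` and `userns_fd_for()` are hypothetical helper names, and the ioctl numbers are defined as a fallback on the assumption that the installed headers may predate them (the values match include/uapi/linux/nsfs.h).

```c
#include <linux/sched.h> /* CLONE_NEW* constants */
#include <sys/ioctl.h>   /* ioctl(), _IO() */

/* Fallback definitions, only used when the headers lack them. */
#ifndef NSIO
#define NSIO 0xb7
#endif
#ifndef NS_GET_USERNS
#define NS_GET_USERNS _IO(NSIO, 0x1)
#endif
#ifndef NS_GET_NSTYPE
#define NS_GET_NSTYPE _IO(NSIO, 0x3)
#endif

/* Map a value returned by NS_GET_NSTYPE to the name used under
 * /proc/PID/ns/. */
const char *nstype_name(int nstype)
{
    switch (nstype) {
    case CLONE_NEWNS:   return "mnt";
    case CLONE_NEWUTS:  return "uts";
    case CLONE_NEWIPC:  return "ipc";
    case CLONE_NEWUSER: return "user";
    case CLONE_NEWPID:  return "pid";
    case CLONE_NEWNET:  return "net";
#ifdef CLONE_NEWCGROUP
    case CLONE_NEWCGROUP: return "cgroup";
#endif
    default:            return "unknown";
    }
}

/* Return an fd for the user namespace relevant to ns_fd: ns_fd itself
 * when it already is a user namespace, otherwise a new fd for the owning
 * user namespace. Returns -1 with errno set on failure. */
int userns_fd_for(int ns_fd)
{
    int nstype = ioctl(ns_fd, NS_GET_NSTYPE);

    if (nstype == -1)
        return -1;
    if (nstype == CLONE_NEWUSER)
        return ns_fd;
    return ioctl(ns_fd, NS_GET_USERNS);
}
```

Note that in the first case the caller gets back the descriptor it passed in, so it should take care not to close it twice.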
2017-01-24 | proc: Better ownership of files for non-dumpable tasks in user namespaces | Eric W. Biederman | 3 | -69/+61

Instead of making the files owned by GLOBAL_ROOT_USER, make non-dumpable files whose mm has always lived in a user namespace owned by the user namespace root. This allows the container root to have things work as expected in a container.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-01-24 | exec: Remove LSM_UNSAFE_PTRACE_CAP | Eric W. Biederman | 1 | -6/+2

With previous changes every location that tests for LSM_UNSAFE_PTRACE_CAP also tests for LSM_UNSAFE_PTRACE making the LSM_UNSAFE_PTRACE_CAP redundant, so remove it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-01-24 | inotify: Convert to using per-namespace limits | Nikolay Borisov | 3 | -21/+36

This patchset converts inotify to using the newly introduced per-userns sysctl infrastructure.

Currently the inotify instances/watches are being accounted in the user_struct structure. This means that in setups where multiple users in unprivileged containers map to the same underlying real user (i.e. pointing to the same user_struct) the inotify limits are going to be shared as well, allowing one user (or application) to exhaust all others' limits.

Fix this by switching the inotify sysctls to using the per-namespace/per-user limits. This will allow the server admin to set sensible global limits, which can further be tuned inside every individual user namespace. Additionally, in order to preserve the sysctl ABI make the existing inotify instances/watches sysctls modify the values of the initial user namespace.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
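The idea of a global limit that can be tightened further inside each namespace can be modelled as a chain of counters, one per namespace level. The sketch below is a single-threaded toy with hypothetical names (`struct ns_counter`, `charge()`, `uncharge()`); the kernel's real accounting is done by the ucount infrastructure with atomic counters, which this does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One accounting level: a namespace with its own limit, chained to the
 * namespace it was created in (NULL for the initial namespace). */
struct ns_counter {
    struct ns_counter *parent;
    int max;
    int count;
};

/* Charge one object (say, an inotify watch) against c and every ancestor.
 * If any level is already at its limit, undo the partial charges and
 * report failure. */
bool charge(struct ns_counter *c)
{
    struct ns_counter *iter, *undo;

    for (iter = c; iter != NULL; iter = iter->parent) {
        if (iter->count >= iter->max)
            goto fail;
        iter->count++;
    }
    return true;

fail:
    for (undo = c; undo != iter; undo = undo->parent)
        undo->count--;
    return false;
}

/* Release one previously charged object at every level. */
void uncharge(struct ns_counter *c)
{
    for (; c != NULL; c = c->parent) {
        assert(c->count > 0);
        c->count--;
    }
}
```

With two child namespaces under one parent, a child that hits its own limit fails locally without consuming any more of the parent's budget, which is the isolation property the commit is after.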
2017-01-10 | sysctl: Drop reference added by grab_header in proc_sys_readdir | Zhou Chengming | 1 | -1/+2

Fixes CVE-2016-9191: proc_sys_readdir doesn't drop the reference added by grab_header when it returns from the !dir_emit_dots path. This can cause any path that calls unregister_sysctl_table to wait forever.

The calltrace of CVE-2016-9191:

[ 5535.960522] Call Trace:
[ 5535.963265] [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.968817] [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
[ 5535.975346] [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
[ 5535.982256] [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
[ 5535.988972] [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
[ 5535.994804] [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
[ 5536.001227] [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
[ 5536.007648] [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654] [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657] [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344] [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
[ 5536.036447] [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844] [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336] [<ffffffff8116789d>] update_flag+0x11d/0x210
[ 5536.057373] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186] [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899] [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234] [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
[ 5536.091049] [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
[ 5536.097571] [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
[ 5536.104207] [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5536.110736] [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5536.117461] [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5536.123697] [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5536.130426] [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5536.135991] [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5536.142041] [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230

One cgroup maintainer mentioned that "cgroup is trying to offline a cpuset css, which takes place under cgroup_mutex. The offlining ends up trying to drain active usages of a sysctl table which apparently is not happening."

The real reason is that proc_sys_readdir doesn't drop the reference added by grab_header when it returns from the !dir_emit_dots path. So this cpuset offline path will wait here forever.

See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13

Fixes: f0c3b5093add ("[readdir] convert procfs")
Cc: stable@vger.kernel.org
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: Yang Shukui <yangshukui@huawei.com>
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
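The class of bug fixed here, an early return that skips the matching "put" of a reference, can be shown with a toy model. Everything below is hypothetical and user-space only: `struct table_header`, `finish_header()` and `readdir_sketch()` merely stand in for the kernel's sysctl header use count and readdir path, and the boolean parameter stands in for the result of dir_emit_dots().

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a sysctl table header with a use count. An unregister
 * would have to wait until `used` drops back to zero. */
struct table_header {
    int used;
};

void grab_header(struct table_header *h)
{
    h->used++;
}

void finish_header(struct table_header *h)
{
    assert(h->used > 0);
    h->used--;
}

/* Sketch of a readdir-like function. Every exit path funnels through
 * `out`, so the reference taken at the top is always dropped. The buggy
 * shape would be a plain `return 0;` in place of the `goto out;`. */
int readdir_sketch(struct table_header *h, bool emit_dots_ok)
{
    grab_header(h);

    if (!emit_dots_ok)
        goto out;

    /* ... emit the remaining directory entries here ... */

out:
    finish_header(h);
    return 0;
}
```

Routing all exits through a single label keeps the grab/finish pairing visible in one place, which is why the one-line `goto` is enough to stop the waiter from blocking forever.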
2017-01-10 | libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount | Eric W. Biederman | 1 | -1/+2

Add MS_KERNMOUNT to the flags that are passed. Use sget_userns and force &init_user_ns instead of calling sget so that even if called from a weird context the internal filesystem will be considered to be in the initial user namespace.

Luis Ressel reported that the failure to pass MS_KERNMOUNT into mount_pseudo broke his in-development graphics driver that uses the generic drm infrastructure. I am not certain the driver was bug free in its usage of that infrastructure, but since mount_pseudo_xattr can never be triggered by userspace it is clearer, less error prone, and less problematic for the code to be explicit.

Reported-by: Luis Ressel <aranea@aixah.de>
Tested-by: Luis Ressel <aranea@aixah.de>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-01-10 | mnt: Protect the mountpoint hashtable with mount_lock | Eric W. Biederman | 2 | -21/+50

Protecting the mountpoint hashtable with namespace_sem was sufficient until a call to umount_mnt was added to mntput_no_expire. At that point it became possible for multiple calls of put_mountpoint on the same hash chain to happen at the same time.

Krister Johansen <kjlx@templeofstupid.com> reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint. This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock. Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisoned by another thread.

Al Viro observed that the simple fix is to switch from using the namespace_sem to the mount_lock to protect the mountpoint hash table.

I have taken Al's suggested patch, moved put_mountpoint in pivot_root (instead of taking mount_lock an additional time), and replaced new_mountpoint with get_mountpoint, a function that does the hash table lookup and addition under the mount_lock. The introduction of get_mountpoint ensures that only the mount_lock is needed to manipulate the mountpoint hashtable.

d_set_mounted is modified to only set DCACHE_MOUNTED if it is not already set. This allows get_mountpoint to use the setting of DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry happens exactly once.

Cc: stable@vger.kernel.org
Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-12-27 | ext4: Simplify DAX fault path | Jan Kara | 1 | -38/+10

Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we can use transaction starting in ext4_iomap_begin() and thus simplify ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-27 | dax: Call ->iomap_begin without entry lock during dax fault | Jan Kara | 1 | -55/+66

Currently the ->iomap_begin() handler is called with the entry lock held. If the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such as ext4, which will want to hold a transaction open), this would cause lock inversion with iomap_apply() from the standard IO path, which first calls ->iomap_begin() and only then calls the ->actor() callback, which grabs entry locks for DAX (if it faults when copying from/to user provided buffers).

Fix the problem by nesting the grabbing of the entry lock inside the ->iomap_begin() - ->iomap_end() pair.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-27 | dax: Finish fault completely when loading holes | Jan Kara | 1 | -9/+18

The only case when we do not finish the page fault completely is when we are loading hole pages into a radix tree. Avoid this special case and finish the fault in that case as well inside the DAX fault handler. This will make iomap handling easier.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-27 | dax: Avoid page invalidation races and unnecessary radix tree traversals | Jan Kara | 1 | -17/+11

Currently dax_iomap_rw() takes care of invalidating page tables and evicting hole pages from the radix tree when write(2) to the file happens. This invalidation is only necessary when there is some block allocation resulting from write(2). Furthermore, in its current place the invalidation is racy with respect to a page fault instantiating a hole page just after we have invalidated it.

So perform the page invalidation inside dax_iomap_actor(), where we can do it only when really necessary and after blocks have been allocated, so nobody will be instantiating new hole pages anymore.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-27 | mm: Invalidate DAX radix tree entries only if appropriate | Jan Kara | 1 | -10/+61

Currently invalidate_inode_pages2_range() and invalidate_mapping_pages() just delete all exceptional radix tree entries they find. For DAX this is not desirable as we track cache dirtiness in these entries and when they are evicted, we may not flush caches although it is necessary. This can for example manifest when we write to the same block both via mmap and via write(2) (to different offsets) and fsync(2) then does not properly flush CPU caches when modification via write(2) was the last one.

Create appropriate DAX functions to handle invalidation of DAX entries for invalidate_inode_pages2_range() and invalidate_mapping_pages() and wire them up into the corresponding mm functions.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-27 | ext2: Return BH_New buffers for zeroed blocks | Jan Kara | 1 | -2/+1

So far we did not return BH_New buffers from ext2_get_blocks() when we allocated and zeroed-out a block for DAX inode to avoid racy zeroing in DAX code. This zeroing is gone these days so we can remove the workaround.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-25 | ktime: Get rid of ktime_equal() | Thomas Gleixner | 2 | -4/+3

No point in going through loops and hoops instead of just comparing the values.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 | ktime: Cleanup ktime_set() usage | Thomas Gleixner | 3 | -3/+3

ktime_set(S,N) was required for the timespec storage type and is still useful for situations where the Seconds and Nanoseconds parts of a time value need to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
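Why a zero Seconds argument makes the call pointless is easy to see once the time value is a plain signed 64-bit nanosecond count. The following is a simplified user-space model with hypothetical `_sketch` names; the kernel's real ktime_set() additionally handles very large second values, which is left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a scalar time value: signed nanoseconds in 64 bits. */
typedef int64_t ktime_sketch_t;

#define NSEC_PER_SEC_SKETCH 1000000000LL

/* Combine seconds and nanoseconds into one scalar. Overflow for huge
 * `secs` is deliberately not handled in this sketch. */
ktime_sketch_t ktime_set_sketch(int64_t secs, unsigned long nsecs)
{
    assert((int64_t)nsecs < NSEC_PER_SEC_SKETCH);
    return secs * NSEC_PER_SEC_SKETCH + (int64_t)nsecs;
}
```

With `secs == 0` the multiplication contributes nothing, so `t = ktime_set_sketch(0, n);` computes the same value as `t = n;`, and two such values can be compared with a plain `==`.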
2016-12-25 | ktime: Get rid of the union | Thomas Gleixner | 4 | -18/+17

ktime is a union because the initial implementation stored the time in scalar nanoseconds on 64-bit machines and in an endianness-optimized timespec variant on 32-bit machines. The Y2038 cleanup removed the timespec variant and switched everything to scalar nanoseconds. The union remained, but became completely pointless.

Get rid of the union and just keep ktime_t as a simple typedef of type s64.

The conversion was done with coccinelle and some manual mopping up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds96-96/+96
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds14-90/+177
Pull cifs fixes from Steve French: "This includes various cifs/smb3 bug fixes, mostly for stable as well. In the next week I expect that Germano will have some reconnection fixes, and also I expect to have the remaining pieces of the snapshot enablement and SMB3 ACLs, but wanted to get this set of bug fixes in" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs_get_root shouldn't use path with tree name Fix default behaviour for empty domains and add domainauto option cifs: use %16phN for formatting md5 sum cifs: Fix smbencrypt() to stop pointing a scatterlist at the stack CIFS: Fix a possible double locking of mutex during reconnect CIFS: Fix a possible memory corruption during reconnect CIFS: Fix a possible memory corruption in push locks CIFS: Fix missing nls unload in smb2_reconnect() CIFS: Decrease verbosity of ioctl call SMB3: parsing for new snapshot timestamp mount parm
2016-12-23Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds10-133/+162
Pull final vfs updates from Al Viro: "Assorted cleanups and fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: sg_write()/bsg_write() is not fit to be called under KERNEL_DS ufs: fix function declaration for ufs_truncate_blocks fs: exec: apply CLOEXEC before changing dumpable task flags seq_file: reset iterator to first record for zero offset vfs: fix isize/pos/len checks for reflink & dedupe [iov_iter] fix iterate_all_kinds() on empty iterators move aio compat to fs/aio.c reorganize do_make_slave() clone_private_mount() doesn't need to touch namespace_sem remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()
2016-12-23Merge tag 'befs-v4.10-rc1' of git://github.com/luisbg/linux-befsLinus Torvalds13-116/+145
Pull befs updates from Luis de Bethencourt: "A series of small fixes and adding NFS export support" * tag 'befs-v4.10-rc1' of git://github.com/luisbg/linux-befs: befs: add NFS export support befs: remove trailing whitespaces befs: remove signatures from comments befs: fix style issues in header files befs: fix style issues in linuxvfs.c befs: fix typos in linuxvfs.c befs: fix style issues in io.c befs: fix style issues in inode.c befs: fix style issues in debug.c
2016-12-23Merge branch 'work.namespace' into for-linusAl Viro2-44/+38
2016-12-23ufs: fix function declaration for ufs_truncate_blocksJeff Layton1-1/+1
sparse says: fs/ufs/inode.c:1195:6: warning: symbol 'ufs_truncate_blocks' was not declared. Should it be static? Note that the forward declaration in the file is already marked static. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-23fs: exec: apply CLOEXEC before changing dumpable task flagsAleksa Sarai1-2/+8
If you have a process that has set itself to be non-dumpable, and it then undergoes exec(2), any CLOEXEC file descriptors it has open are "exposed" during a race window between the dumpable flags of the process being reset for exec(2) and CLOEXEC being applied to the file descriptors. This can be exploited by a process by attempting to access /proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE. The race in question is after set_dumpable has been called (for get_link, though the trace is basically the same for readlink): [vfs] -> proc_pid_link_inode_operations.get_link -> proc_pid_get_link -> proc_fd_access_allowed -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS); which will return 0 during the race window, and CLOEXEC file descriptors will still be open during this window because do_close_on_exec has not been called yet. As a result, the ordering of these calls should be reversed to avoid this race window. This is of particular concern to container runtimes, where joining a PID namespace with file descriptors referring to the host filesystem can result in security issues (since PR_SET_DUMPABLE doesn't protect against access of CLOEXEC file descriptors -- file descriptors which may reference filesystem objects the container shouldn't have access to). Cc: dev@opencontainers.org Cc: <stable@vger.kernel.org> # v3.2+ Reported-by: Michael Crosby <crosbymichael@gmail.com> Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-23seq_file: reset iterator to first record for zero offsetTomasz Majchrzak1-0/+7
If a kernfs file is empty on the first read, successive read operations using the same file descriptor will return no data, even when data is available. The default kernfs 'seq_next' implementation advances the iterator position even when the next object is not there. Kernfs 'seq_start' for the following requests will not return an iterator, as the position is already on the second object. This defect makes it impossible to monitor badblocks sysfs files from MD raid. They are initially empty, but if data appears at some stage, userspace is not able to read it. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
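The failure mode can be illustrated with a toy userspace model of a single-record file. This is a simplification under stated assumptions, not the seq_file code: `struct toy_seq`, `toy_seq_read()` and the `reset_on_zero` switch are all hypothetical, and the model assumes the fix amounts to restarting the iterator whenever a read begins at offset zero.

```c
#include <stddef.h>
#include <string.h>

/* Toy single-record file; none of these names are kernel APIs. */
struct toy_seq {
	const char *content;	/* current attribute text, may be "" */
	size_t index;		/* iterator position kept between reads */
	int reset_on_zero;	/* 1 = model the fixed behaviour */
};

/*
 * Model one read at file offset pos.
 * Returns the number of bytes copied into buf.
 */
static size_t toy_seq_read(struct toy_seq *m, size_t pos, char *buf, size_t len)
{
	size_t n = 0;

	/* The fix: a read from offset zero restarts at the first record. */
	if (m->reset_on_zero && pos == 0)
		m->index = 0;

	/* "start": only index 0 maps to a record. */
	if (m->index == 0) {
		n = strlen(m->content);
		if (n > len)
			n = len;
		memcpy(buf, m->content, n);	/* "show" */
	}

	/* "next": advances even though no further record exists. */
	m->index++;
	return n;
}
```

Without the reset, a first read of an empty file leaves `index` at 1, so a later read at offset zero finds no record even after `content` has been filled in. With the reset, the second read returns the new data.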
2016-12-23vfs: fix isize/pos/len checks for reflink & dedupeDarrick J. Wong3-9/+13
Strengthen the checking of pos/len vs. i_size, clarify the return values for the clone prep function, and remove pointless code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-23move aio compat to fs/aio.cAl Viro2-77/+95
... and fix the minor buglet in compat io_submit() - native one kills ioctx as cleanup when put_user() fails. Get rid of bogus compat_... in !CONFIG_AIO case, while we are at it - they should simply fail with ENOSYS, same as for native counterparts. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-22befs: add NFS export supportLuis de Bethencourt1-0/+38
Implement mandatory export_operations, so it is possible to export befs via nfs. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-12-22befs: remove trailing whitespacesLuis de Bethencourt7-48/+47
Removing all trailing whitespaces in befs. I was skeptical about tainting the history with this, but whitespace changes can be ignored by using 'git blame -w' and 'git log -w'. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-12-22befs: remove signatures from commentsLuis de Bethencourt3-7/+1
No idea why some comments have signatures. These predate git. Removing them since they add noise and no information. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-12-22befs: fix style issues in header filesLuis de Bethencourt6-17/+12
Fixing checkpatch.pl issues in befs header files: WARNING: Missing a blank line after declarations + befs_inode_addr iaddr; + iaddr.allocation_group = blockno >> BEFS_SB(sb)->ag_shift; WARNING: space prohibited between function name and open parenthesis '(' + return BEFS_SB(sb)->block_size / sizeof (befs_disk_inode_addr); ERROR: "foo * bar" should be "foo *bar" + const char *key, befs_off_t * value); ERROR: Macros with complex values should be enclosed in parentheses +#define PACKED __attribute__ ((__packed__)) Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-12-22befs: fix style issues in linuxvfs.cLuis de Bethencourt1-23/+26
Fix the following type of checkpatch.pl issues: WARNING: line over 80 characters +static struct dentry *befs_lookup(struct inode *, struct dentry *, unsigned int); ERROR: code indent should use tabs where possible + if (!bi)$ WARNING: please, no spaces at the start of a line + if (!bi)$ WARNING: labels should not be indented + unacquire_bh: WARNING: space prohibited between function name and open parenthesis '(' + sizeof (struct befs_inode_info), WARNING: braces {} are not necessary for single statement blocks + if (!*out) { + return -ENOMEM; + } WARNING: Block comments use a trailing */ on a separate line + * in special cases */ WARNING: Missing a blank line after declarations + int token; + if (!*p) ERROR: do not use assignment in if condition + if (!(bh = sb_bread(sb, sb_block))) { ERROR: space prohibited after that open parenthesis '(' + if( befs_sb->num_blocks > ~((sector_t)0) ) { ERROR: space prohibited before that close parenthesis ')' + if( befs_sb->num_blocks > ~((sector_t)0) ) { ERROR: space required before the open parenthesis '(' + if( befs_sb->num_blocks > ~((sector_t)0) ) { Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
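As an illustration of what several of these warnings ask for, here is a small function written in the style checkpatch.pl accepts. It is a hypothetical helper, not befs code: `dup_name()` exists only to show the assignment moved out of the `if` condition, the brace-free single-statement branch, the blank line after declarations, and the `char **out` pointer spacing.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not befs code: duplicate src into *out. */
static int dup_name(const char *src, char **out)
{
	size_t len = strlen(src) + 1;

	/* Assignment kept out of the if condition. */
	*out = malloc(len);
	if (!*out)
		return -ENOMEM;	/* single statement, so no braces */

	memcpy(*out, src, len);
	return 0;
}
```

The caller owns the returned buffer and releases it with `free()`.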
2016-12-22befs: fix typos in linuxvfs.cLuis de Bethencourt1-8/+6
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-12-22befs: fix style issues in io.cLuis de Bethencourt1-2/+2
Fixing the two following checkpatch.pl issues: ERROR: trailing whitespace + * Based on portions of file.c and inode.c $ WARNING: labels should not be indented + error: Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-12-22befs: fix style issues in inode.cLuis de Bethencourt1-6/+6
Fixing the following checkpatch.pl errors and warning: ERROR: trailing whitespace + * $ WARNING: Block comments use * on subsequent lines +/* + Validates the correctness of the befs inode ERROR: "foo * bar" should be "foo *bar" +befs_check_inode(struct super_block *sb, befs_inode * raw_inode, Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-12-22befs: fix style issues in debug.cLuis de Bethencourt1-6/+8
Fix all checkpatch.pl errors and warnings in debug.c: ERROR: trailing whitespace + * $ WARNING: Missing a blank line after declarations + va_list args; + va_start(args, fmt); ERROR: "foo * bar" should be "foo *bar" +befs_dump_inode(const struct super_block *sb, befs_inode * inode) ERROR: "foo * bar" should be "foo *bar" +befs_dump_super_block(const struct super_block *sb, befs_super_block * sup) ERROR: "foo * bar" should be "foo *bar" +befs_dump_small_data(const struct super_block *sb, befs_small_data * sd) WARNING: line over 80 characters +befs_dump_index_entry(const struct super_block *sb, befs_disk_btree_super * super) ERROR: "foo * bar" should be "foo *bar" +befs_dump_index_entry(const struct super_block *sb, befs_disk_btree_super * super) ERROR: "foo * bar" should be "foo *bar" +befs_dump_index_node(const struct super_block *sb, befs_btree_nodehead * node) Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
2016-12-21splice: reinstate SIGPIPE/EPIPE handlingLinus Torvalds1-2/+7
Commit 8924feff66f3 ("splice: lift pipe_lock out of splice_to_pipe()") caused a regression when there were no more readers left on a pipe that was being spliced into: rather than the expected SIGPIPE and -EPIPE return value, the writer would end up waiting forever for space to free up (which obviously was not going to happen with no readers around). Fixes: 8924feff66f3 ("splice: lift pipe_lock out of splice_to_pipe()") Reported-and-tested-by: Andreas Schwab <schwab@linux-m68k.org> Debugged-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org # v4.9 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-21Merge tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds11-109/+166
Pull more NFS client updates from Trond Myklebust: "Highlights include: - further attribute cache improvements to make revalidation more fine grained - NFSv4 locking improvements Bugfixes: - nfs4_fl_prepare_ds must be careful about reporting success in files layout - pNFS/flexfiles: Instead of marking a device inactive, remove it from the cache" * tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Retry the DELEGRETURN if the embedded GETATTR is rejected with EACCES NFS: Retry the CLOSE if the embedded GETATTR is rejected with EACCES NFSv4: Place the GETATTR operation before the CLOSE NFSv4: Also ask for attributes when downgrading to a READ-only state NFS: Don't abuse NFS_INO_REVAL_FORCED in nfs_post_op_update_inode_locked() pNFS: Return RW layouts on OPEN_DOWNGRADE NFSv4: Add encode/decode of the layoutreturn op in OPEN_DOWNGRADE NFS: Don't disconnect open-owner on NFS4ERR_BAD_SEQID NFSv4: ensure __nfs4_find_lock_state returns consistent result. NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success. pNFS/flexfiles: delete deviceid, don't mark inactive NFS: Clean up nfs_attribute_timeout() NFS: Remove unused function nfs_revalidate_inode_rcu() NFS: Fix and clean up the access cache validity checking NFS: Only look at the change attribute cache state in nfs_weak_revalidate() NFS: Clean up cache validity checking NFS: Don't revalidate the file on close if we hold a delegation NFSv4: Don't discard the attributes returned by asynchronous DELEGRETURN NFSv4: Update the attribute cache info in update_changeattr
2016-12-20NFSv4: Retry the DELEGRETURN if the embedded GETATTR is rejected with EACCESTrond Myklebust2-4/+15
If our DELEGRETURN RPC call is rejected with an EACCES error, then we should remove the GETATTR call from the compound RPC and retry. This could potentially happen when there is a conflict between an ACL denying attribute reads and our use of SP4_MACH_CRED. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-20NFS: Retry the CLOSE if the embedded GETATTR is rejected with EACCESTrond Myklebust1-0/+10
If our CLOSE RPC call is rejected with an EACCES error, then we should remove the GETATTR call from the compound RPC and retry. This could potentially happen when there is a conflict between an ACL denying attribute reads and our use of SP4_MACH_CRED. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-20NFSv4: Place the GETATTR operation before the CLOSETrond Myklebust2-12/+12
In order to benefit from the DENY share lock protection, we should put the GETATTR operation before the CLOSE. Otherwise, we might race with a Windows machine that thinks it is now safe to modify the file. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-20NFSv4: Also ask for attributes when downgrading to a READ-only stateTrond Myklebust1-1/+2
If we're downgrading from a READ+WRITE mode to a READ-only mode, then ask for cache consistency attributes so that we avoid the revalidation in nfs_close_context() Fixes: 3947b74d0f9d ("NFSv4: Don't request a GETATTR on open_downgrade.") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-20NFS: Don't abuse NFS_INO_REVAL_FORCED in nfs_post_op_update_inode_locked()Trond Myklebust1-7/+0
The NFS_INO_REVAL_FORCED flag now really only has meaning for the case when we've just been handed a delegation for a file that was already cached, and we're unsure about that cache. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>