BMC/Intel-BMC/linux.git/kernel/cgroup.c, branch dev-4.7

cgroup: fix invalid controller enable rejections with cgroup namespace

2016-10-07T13:21:16+00:00

commit 9157056da8f8c4a6305f15619e269f164b63a6de upstream. On the v2 hierarchy, "cgroup.subtree_control" rejects controller enables if the cgroup has processes in it. The enforcement of this logic assumes that the cgroup wouldn't have any css_sets associated with it if there are no tasks in the cgroup, which is no longer true since a79a908fd2b0 ("cgroup: introduce cgroup namespaces"). When a cgroup namespace is created, it pins the css_set of the creating task to use it as the root css_set of the namespace. This extra reference stays as long as the namespace is around and makes "cgroup.subtree_control" think that the namespace root cgroup is not empty even when it is and thus reject controller enables. Fix it by making cgroup_subtree_control() walk and test emptiness of each css_set instead of testing whether the list_head is empty. While at it, update the comment of cgroup_task_count() to indicate that the returned value may be higher than the number of tasks, which has always been true due to temporary references and doesn't break anything. Signed-off-by: Tejun Heo Reported-by: Evgeny Vereshchagin Cc: Serge E. Hallyn Cc: Aditya Kali Cc: Eric W. Biederman Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541 Signed-off-by: Greg Kroah-Hartman

cgroup: duplicate cgroup reference when cloning sockets

2016-09-30T08:12:44+00:00

commit d979a39d7242e0601bf9b60e89628fb8ac577179 upstream. When a socket is cloned, the associated sock_cgroup_data is duplicated but not its reference on the cgroup. As a result, the cgroup reference count will underflow when both sockets are destroyed later on. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Tejun Heo Cc: Michal Hocko Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

cgroupns: Only allow creation of hierarchies in the initial cgroup namespace

2016-08-20T16:10:58+00:00

commit 726a4994b05ff5b6f83d64b5b43c3251217366ce upstream. Unprivileged users can't use hierarchies if they create them as they do not have privilieges to the root directory. Which means the only thing a hiearchy created by an unprivileged user is good for is expanding the number of cgroup links in every css_set, which is a DOS attack. We could allow hierarchies to be created in namespaces in the initial user namespace. Unfortunately there is only a single namespace for the names of heirarchies, so that is likely to create more confusion than not. So do the simple thing and restrict hiearchy creation to the initial cgroup namespace. Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns

2016-08-20T16:10:58+00:00

commit eedd0f4cbf5f3b81e82649832091e1d9d53f0709 upstream. In most code paths involving cgroup migration cgroup_threadgroup_rwsem is taken. There are two exceptions: - remove_tasks_in_empty_cpuset calls cgroup_transfer_tasks - vhost_attach_cgroups_work calls cgroup_attach_task_all With cgroup_threadgroup_rwsem held it is guaranteed that cgroup_post_fork and copy_cgroup_ns will reference the same css_set from the process calling fork. Without such an interlock there process after fork could reference one css_set from it's new cgroup namespace and another css_set from task->cgroups, which semantically is nonsensical. Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

cgroupns: Fix the locking in copy_cgroup_ns

2016-08-20T16:10:58+00:00

commit 7bd8830875bfa380c68f390efbad893293749324 upstream. If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep valid splat. In __cgroup_proc_write the lock ordering is: cgroup_mutex -- through cgroup_kn_lock_live cgroup_threadgroup_rwsem In copy_process the guts of clone the lock ordering is: cgroup_threadgroup_rwsem -- through threadgroup_change_begin cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns lockdep reports some a different call chains for the first ordering of cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace. This is most definitely deadlock potential under the right circumstances. Fix this by by skipping the cgroup_mutex and making the locking in copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs during fork under the cgroup_threadgroup_rwsem. Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

cgroup: Disable IRQs while holding css_set_lock

2016-06-23T21:23:12+00:00

While testing the deadline scheduler + cgroup setup I hit this warning. [ 132.612935] ------------[ cut here ]------------ [ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80 [ 132.612952] Modules linked in: (a ton of modules...) [ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2 [ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 [ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e [ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b [ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00 [ 132.612986] Call Trace: [ 132.612987] [] dump_stack+0x63/0x85 [ 132.612994] [] __warn+0xcb/0xf0 [ 132.612997] [] ? push_dl_task.part.32+0x170/0x170 [ 132.612999] [] warn_slowpath_null+0x1d/0x20 [ 132.613000] [] __local_bh_enable_ip+0x6b/0x80 [ 132.613008] [] _raw_write_unlock_bh+0x1a/0x20 [ 132.613010] [] _raw_spin_unlock_bh+0xe/0x10 [ 132.613015] [] put_css_set+0x5c/0x60 [ 132.613016] [] cgroup_free+0x7f/0xa0 [ 132.613017] [] __put_task_struct+0x42/0x140 [ 132.613018] [] dl_task_timer+0xca/0x250 [ 132.613027] [] ? push_dl_task.part.32+0x170/0x170 [ 132.613030] [] __hrtimer_run_queues+0xee/0x270 [ 132.613031] [] hrtimer_interrupt+0xa8/0x190 [ 132.613034] [] local_apic_timer_interrupt+0x38/0x60 [ 132.613035] [] smp_apic_timer_interrupt+0x3d/0x50 [ 132.613037] [] apic_timer_interrupt+0x8c/0xa0 [ 132.613038] [] ? native_safe_halt+0x6/0x10 [ 132.613043] [] default_idle+0x1e/0xd0 [ 132.613044] [] arch_cpu_idle+0xf/0x20 [ 132.613046] [] default_idle_call+0x2a/0x40 [ 132.613047] [] cpu_startup_entry+0x2e7/0x340 [ 132.613048] [] start_secondary+0x155/0x190 [ 132.613049] ---[ end trace f91934d162ce9977 ]--- The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid this problem - and other problems of sharing a spinlock with an interrupt. Cc: Tejun Heo Cc: Li Zefan Cc: Johannes Weiner Cc: Juri Lelli Cc: Steven Rostedt Cc: cgroups@vger.kernel.org Cc: stable@vger.kernel.org # 4.5+ Cc: linux-kernel@vger.kernel.org Reviewed-by: Rik van Riel Reviewed-by: "Luis Claudio R. Goncalves" Signed-off-by: Daniel Bristot de Oliveira Acked-by: Zefan Li Signed-off-by: Tejun Heo

cgroup: set css->id to -1 during init

2016-06-16T21:59:35+00:00

If percpu_ref initialization fails during css_create(), the free path can end up trying to free css->id of zero. As ID 0 is unused, it doesn't cause a critical breakage but it does trigger a warning message. Fix it by setting css->id to -1 from init_and_link_css(). Signed-off-by: Tejun Heo Cc: Wenwei Tao Fixes: 01e586598b22 ("cgroup: release css->id after css_free") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Tejun Heo

cgroup: remove redundant cleanup in css_create

2016-05-26T19:09:23+00:00

When create css failed, before call css_free_rcu_fn, we remove the css id and exit the percpu_ref, but we will do these again in css_free_work_fn, so they are redundant. Especially the css id, that would cause problem if we remove it twice, since it may be assigned to another css after the first remove. tj: This was broken by two commits updating the free path without synchronizing the creation failure path. This can be easily triggered by trying to create more than 64k memory cgroups. Signed-off-by: Wenwei Tao Signed-off-by: Tejun Heo Cc: Vladimir Davydov Fixes: 9a1049da9bd2 ("percpu-refcount: require percpu_ref to be exited explicitly") Fixes: 01e586598b22 ("cgroup: release css->id after css_free") Cc: stable@vger.kernel.org # v3.17+

cgroup: fix compile warning

2016-05-12T15:05:27+00:00

commit 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") added the following compile warning: kernel/cgroup.c: In function ‘cgroup_show_path’: kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable] int len = 0, ret = 0; ^ fix it. Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") Signed-off-by: Felipe Balbi Signed-off-by: Tejun Heo

cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces

2016-05-09T16:15:03+00:00

Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)" cat /proc/self/mountinfo | grep freezer | awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: Serge Hallyn Tested-by: Michael Kerrisk Acked-by: Michael Kerrisk Signed-off-by: Tejun Heo