summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2013-01-07cpuset: make CPU / memory hotplug propagation asynchronousTejun Heo1-6/+48
cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug() directly to propagate hotplug updates to !root cpusets; however, this has the following problems. * cpuset locking is scheduled to be decoupled from cgroup_mutex, cgroup_mutex will be unexported, and cgroup_attach_task() will do cgroup locking internally, so propagation can't synchronously move tasks to a parent cgroup while walking the hierarchy. * We can't use cgroup generic tree iterator because propagation to each cpuset may sleep. With propagation done asynchronously, we can lose the rather ugly cpuset specific iteration. Convert cpuset_propagate_hotplug() to cpuset_propagate_hotplug_workfn() and execute it from newly added cpuset->hotplug_work. The work items are run on an ordered workqueue, so the propagation order is preserved. cpuset_hotplug_workfn() schedules all propagations while holding cgroup_mutex and waits for completion without cgroup_mutex. Each in-flight propagation holds a reference to the cpuset->css. This patch doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cpuset: drop async_rebuild_sched_domains()Tejun Heo1-60/+16
In general, we want to make cgroup_mutex one of the outermost locks and be able to use get_online_cpus() and friends from cgroup methods. With cpuset hotplug made async, get_online_cpus() can now be nested inside cgroup_mutex. Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex by bouncing sched_domain rebuilding to a work item. As such nesting is allowed now, remove the workqueue bouncing code and always rebuild sched_domains synchronously. This also nests sched_domains_mutex inside cgroup_mutex, which is intended and should be okay. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cpuset: don't nest cgroup_mutex inside get_online_cpus()Tejun Heo1-4/+35
CPU / memory hotplug path currently grabs cgroup_mutex from hotplug event notifications. We want to separate cpuset locking from cgroup core and make cgroup_mutex outer to hotplug synchronization so that, among other things, mechanisms which depend on get_online_cpus() can be used from cgroup callbacks. In general, we want to keep cgroup_mutex the outermost lock to minimize locking interactions among different controllers. Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and schedule it from the hotplug notifications. As the function can already handle multiple mixed events without any input, converting it to a work function is mostly trivial; however, one complication is that cpuset_update_active_cpus() needs to update sched domains synchronously to reflect an offlined cpu to avoid confusing the scheduler. This is worked around by falling back to the the default single sched domain synchronously before scheduling the actual hotplug work. This makes sched domain rebuilt twice per CPU hotplug event but the operation isn't that heavy and a lot of the second operation would be noop for systems w/ single sched domain, which is the common case. This decouples cpuset hotplug handling from the notification callbacks and there can be an arbitrary delay between the actual event and updates to cpusets. Scheduler and mm can handle it fine but moving tasks out of an empty cpuset may race against writes to the cpuset restoring execution resources which can lead to confusing behavior. Flush hotplug work item from cpuset_write_resmask() to avoid such confusions. v2: Synchronous sched domain rebuilding using the fallback sched domain added. This fixes various issues caused by confused scheduler putting tasks on a dead CPU, including the one reported by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cpuset: reorganize CPU / memory hotplug handlingTejun Heo1-117/+104
Reorganize hotplug path to prepare for async hotplug handling. * Both CPU and memory hotplug handlings are collected into a single function - cpuset_handle_hotplug(). It doesn't take any argument but compares the current setttings of top_cpuset against what's actually available to determine what happened. This function directly updates top_cpuset. If there are CPUs or memory nodes which are taken down, cpuset_propagate_hotplug() in invoked on all !root cpusets. * cpuset_propagate_hotplug() is responsible for updating the specified cpuset so that it doesn't include any resource which isn't available to top_cpuset. If no CPU or memory is left after update, all tasks are moved to the nearest ancestor with both resources. * update_tasks_cpumask() and update_tasks_nodemask() are now always called after cpus or mems masks are updated even if the cpuset doesn't have any task. This is for brevity and not expected to have any measureable effect. * cpu_active_mask and N_HIGH_MEMORY are read exactly once per cpuset_handle_hotplug() invocation, all cpusets share the same view of what resources are available, and cpuset_handle_hotplug() can handle multiple resources going up and down. These properties will allow async operation. The reorganization, while drastic, is equivalent and shouldn't cause any behavior difference. This will enable making hotplug handling async and remove get_online_cpus() -> cgroup_mutex nesting. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cpuset: cleanup cpuset[_can]_attach()Tejun Heo1-17/+18
cpuset_can_attach() prepare global variables cpus_attach and cpuset_attach_nodemask_{to|from} which are used by cpuset_attach(). There is no reason to prepare in cpuset_can_attach(). The same information can be accessed from cpuset_attach(). Move the prepartion logic from cpuset_can_attach() to cpuset_attach() and make the global variables static ones inside cpuset_attach(). With this change, there's no reason to keep cpuset_attach_nodemask_{from|to} global. Move them inside cpuset_attach(). Unfortunately, we need to keep cpus_attach global as it can't be allocated from cpuset_attach(). v2: cpus_attach not converted to cpumask_t as per Li Zefan and Rusty Russell. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Rusty Russell <rusty@rustcorp.com.au>
2013-01-07cpuset: introduce cpuset_for_each_child()Tejun Heo1-31/+54
Instead of iterating cgroup->children directly, introduce and use cpuset_for_each_child() which wraps cgroup_for_each_child() and performs online check. As it uses the generic iterator, it requires RCU read locking too. As cpuset is currently protected by cgroup_mutex, non-online cpusets aren't visible to all the iterations and this patch currently doesn't make any functional difference. This will be used to de-couple cpuset locking from cgroup core. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cpuset: introduce CS_ONLINETejun Heo1-1/+10
Add CS_ONLINE which is set from css_online() and cleared from css_offline(). This will enable using generic cgroup iterator while allowing decoupling cpuset from cgroup internal locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cpuset: introduce ->css_on/offline()Tejun Heo1-22/+44
Add cpuset_css_on/offline() and rearrange css init/exit such that, * Allocation and clearing to the default values happen in css_alloc(). Allocation now uses kzalloc(). * Config inheritance and registration happen in css_online(). * css_offline() undoes what css_online() did. * css_free() frees. This doesn't introduce any visible behavior changes. This will help cleaning up locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cpuset: remove fast exit path from remove_tasks_in_empty_cpuset()Tejun Heo1-8/+0
The function isn't that hot, the overhead of missing the fast exit is low, the test itself depends heavily on cgroup internals, and it's gonna be a hindrance when trying to decouple cpuset locking from cgroup core. Remove the fast exit path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cpuset: remove unused cpuset_unlock()Tejun Heo1-11/+0
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07cgroup: implement cgroup_rightmost_descendant()Tejun Heo1-0/+26
Implement cgroup_rightmost_descendant() which returns the right most descendant of the specified cgroup. This can be used to skip the cgroup's subtree while iterating with cgroup_for_each_descendant_pre(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com>
2013-01-07Merge branch 'akpm' (fixes from Andrew)Linus Torvalds1-2/+3
Merge emailed fixes from Andrew Morton: "Bunch of fixes: - delayed IPC updates. I held back on this because of some possible outstanding bug reports, but they appear to have been addressed in later versions - A bunch of MAINTAINERS updates - Yet Another RTC driver. I'd held this back while a couple of little issues were being worked out. I'm expecting an intrusive-but-simple patchset from Joe Perches which splits up printk.c into kernel/printk/*. That will be a pig to maintain for two months so if it passes testing I'd like to get it upstream after a week or so." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits) printk: fix incorrect length from print_time() when seconds > 99999 drivers/rtc/rtc-vt8500.c: fix handling of data passed in struct rtc_time drivers/rtc/rtc-vt8500.c: correct handling of CR_24H bitfield rtc: add RTC driver for TPS6586x MAINTAINERS: fix drivers/staging/sm7xx/ MAINTAINERS: remove include/linux/of_pwm.h MAINTAINERS: remove arch/*/lib/perf_event*.c MAINTAINERS: remove drivers/mmc/host/imxmmc.* MAINTAINERS: fix Documentation/mei/ MAINTAINERS: remove arch/x86/platform/mrst/pmu.* MAINTAINERS: remove firmware/isci/ MAINTAINERS: fix drivers/ieee802154/ MAINTAINERS: fix .../plat-mxc/include/mach/imxfb.h MAINTAINERS: remove drivers/video/epson1355fb.c MAINTAINERS: fix drivers/media/usb/dvb-usb/cxusb* MAINTAINERS: adjust for UAPI MAINTAINERS: fix drivers/media/platform/atmel-isi.c MAINTAINERS: fix arch/arm/mach-at91/include/mach/at_hdmac.h MAINTAINERS: fix drivers/rtc/rtc-vt8500.c MAINTAINERS: remove arch/arm/plat-s5p/ ...
2013-01-06signals: set_current_blocked() can use __set_current_blocked()Oleg Nesterov1-6/+2
Cleanup. And I think we need more cleanups, in particular __set_current_blocked() and sigprocmask() should die. Nobody should ever block SIGKILL or SIGSTOP. - Change set_current_blocked() to use __set_current_blocked() - Change sys_sigprocmask() to use set_current_blocked(), this way it should not worry about SIGKILL/SIGSTOP. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-06signals: sys_ssetmask() uses uninitialized newmaskOleg Nesterov1-0/+1
Commit 77097ae503b1 ("most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set") removed the initialization of newmask by accident, causing ltp to complain like this: ssetmask01 1 TFAIL : sgetmask() failed: TEST_ERRNO=???(0): Success Restore the proper initialization. Reported-and-tested-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org # v3.5+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-05printk: fix incorrect length from print_time() when seconds > 99999Roland Dreier1-2/+3
print_prefix() passes a NULL buf to print_time() to get the length of the time prefix; when printk times are enabled, the current code just returns the constant 15, which matches the format "[%5lu.%06lu] " used to print the time value. However, this is obviously incorrect when the whole seconds part of the time gets beyond 5 digits (100000 seconds is a bit more than a day of uptime). The simple fix is to use snprintf(NULL, 0, ...) to calculate the actual length of the time prefix. This could be micro-optimized but it seems better to have simpler, more readable code here. The bug leads to the syslog system call miscomputing which messages fit into the userspace buffer. If there are enough messages to fill log_buf_len and some have a timestamp >= 100000, dmesg may fail with: # dmesg klogctl: Bad address When this happens, strace shows that the failure is indeed EFAULT due to the kernel mistakenly accessing past the end of dmesg's buffer, since dmesg asks the kernel how big a buffer it needs, allocates a bit more, and then gets an error when it asks the kernel to fill it: syslog(0xa, 0, 0) = 1048576 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa4d25d2000 syslog(0x3, 0x7fa4d25d2010, 0x100008) = -1 EFAULT (Bad address) As far as I can see, the bug has been there as long as print_time(), which comes from commit 084681d14e42 ("printk: flush continuation lines immediately to console") in 3.5-rc5. Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joe Perches <joe@perches.com> Cc: Sylvain Munaut <s.munaut@whatever-company.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-03module: prevent warning when finit_module a 0 sized fileSasha Levin1-0/+7
If we try to finit_module on a file sized 0 bytes vmalloc will scream and spit out a warning. Since modules have to be bigger than 0 bytes anyways we can just check that beforehand and avoid the warning. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-12-27userns: Allow unprivileged rebootLi Zefan1-2/+3
In a container with its own pid namespace and user namespace, rebooting the system won't reboot the host, but terminate all the processes in it and thus have the container shutdown, so it's safe. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-12-26x32: fix sigtimedwaitAl Viro1-1/+1
It needs 64bit timespec. As it is, we end up truncating the timeout to whole seconds; usually it doesn't matter, but for having all sub-second timeouts truncated to one jiffy is visibly wrong. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-26x32: fix waitid()Al Viro1-1/+5
It needs 64bit rusage and 32bit siginfo. glibc never calls it with non-NULL rusage pointer, or we would've seen breakage already... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-26switch compat_sys_wait4() and compat_sys_waitid() to COMPAT_SYSCALL_DEFINEAl Viro1-6/+9
Strictly speaking, ppc64 needs it for C ABI compliance. Realistically I would be very surprised if e.g. passing 0xffffffff as 'options' argument to waitid() from 32bit task would cause problems, but yes, it puts us into undefined behaviour territory. ppc64 expects int argument to be passed in 64bit register with bits 31..63 containing the same value. SYSCALL_DEFINE on ppc provides a wrapper that normalizes the value passed from userland; so does COMPAT_SYSCALL_DEFINE. Plain declaration of compat_sys_something() with an int argument obviously doesn't. Again, for wait4 and waitid I would be extremely surprised if gcc started to produce code depending on that value having been properly sign-extended - the argument(s) in question end up passed blindly to sys_wait4 and sys_waitid resp. and normalization for native syscalls takes care of their use there. Still, better to use COMPAT_SYSCALL_DEFINE here than worry about nasal daemons... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-26switch compat_sys_sigaltstack() to COMPAT_SYSCALL_DEFINEAl Viro1-2/+3
Makes sigaltstack conversion easier to split into per-architecture parts. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-26pidns: Stop pid allocation when init diesEric W. Biederman2-3/+16
Oleg pointed out that in a pid namespace the sequence. - pid 1 becomes a zombie - setns(thepidns), fork,... - reaping pid 1. - The injected processes exiting. Can lead to processes attempting access their child reaper and instead following a stale pointer. That waitpid for init can return before all of the processes in the pid namespace have exited is also unfortunate. Avoid these problems by disabling the allocation of new pids in a pid namespace when init dies, instead of when the last process in a pid namespace is reaped. Pointed-out-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25pidns: Outlaw thread creation after unshare(CLONE_NEWPID)Eric W. Biederman1-0/+8
The sequence: unshare(CLONE_NEWPID) clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM) Creates a new process in the new pid namespace without setting pid_ns->child_reaper. After forking this results in a NULL pointer dereference. Avoid this and other nonsense scenarios that can show up after creating a new pid namespace with unshare by adding a new check in copy_prodcess. Pointed-out-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-24time: convert arch_gettimeoffset to a pointerStephen Warren1-6/+20
Currently, whenever CONFIG_ARCH_USES_GETTIMEOFFSET is enabled, each arch core provides a single implementation of arch_gettimeoffset(). In many cases, different sub-architectures, different machines, or different timer providers exist, and so the arch ends up implementing arch_gettimeoffset() as a call-through-pointer anyway. Examples are ARM, Cris, M68K, and it's arguable that the remaining architectures, M32R and Blackfin, should be doing this anyway. Modify arch_gettimeoffset so that it itself is a function pointer, which the arch initializes. This will allow later changes to move the initialization of this function into individual machine support or timer drivers. This is particularly useful for code in drivers/clocksource which should rely on an arch-independant mechanism to register their implementation of arch_gettimeoffset(). This patch also converts the Cris architecture to set arch_gettimeoffset directly to the final implementation in time_init(), because Cris already had separate time_init() functions per sub-architecture. M68K and ARM are converted to set arch_gettimeoffset to the final implementation in later patches, because they already have function pointers in place for this purpose. Cc: Russell King <linux@arm.linux.org.uk> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Mikael Starvik <starvik@axis.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Stephen Warren <swarren@nvidia.com>
2012-12-21Merge branch 'for-next' of git://git.infradead.org/users/eparis/notifyLinus Torvalds2-7/+7
Pull filesystem notification updates from Eric Paris: "This pull mostly is about locking changes in the fsnotify system. By switching the group lock from a spin_lock() to a mutex() we can now hold the lock across things like iput(). This fixes a problem involving unmounting a fs and having inodes be busy, first pointed out by FAT, but reproducible with tmpfs. This also restores signal driven I/O for inotify, which has been broken since about 2.6.32." Ugh. I *hate* the timing of this. It was rebased after the merge window opened, and then left to sit with the pull request coming the day before the merge window closes. That's just crap. But apparently the patches themselves have been around for over a year, just gathering dust, so now it's suddenly critical. Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen Rothwell's fixes from -next. * 'for-next' of git://git.infradead.org/users/eparis/notify: inotify: automatically restart syscalls inotify: dont skip removal of watch descriptor if creation of ignored event failed fanotify: dont merge permission events fsnotify: make fasync generic for both inotify and fanotify fsnotify: change locking order fsnotify: dont put marks on temporary list when clearing marks by group fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark() fsnotify: pass group to fsnotify_destroy_mark() fsnotify: use a mutex instead of a spinlock to protect a groups mark list fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed fsnotify: take groups mark_lock before mark lock fsnotify: use reference counting for groups fsnotify: introduce fsnotify_get_group() inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()
2012-12-21Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds2-9/+7
Merge the rest of Andrew's patches for -rc1: "A bunch of fixes and misc missed-out-on things. That'll do for -rc1. I still have a batch of IPC patches which still have a possible bug report which I'm chasing down." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits) keys: use keyring_alloc() to create module signing keyring keys: fix unreachable code sendfile: allows bypassing of notifier events SGI-XP: handle non-fatal traps fat: fix incorrect function comment Documentation: ABI: remove testing/sysfs-devices-node proc: fix inconsistent lock state linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors memcg: don't register hotcpu notifier from ->css_alloc() checkpatch: warn on uapi #includes that #include <uapi/... revert "rtc: recycle id when unloading a rtc driver" mm: clean up transparent hugepage sysfs error messages hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method hfsplus: rework processing of hfs_btree_write() returned error hfsplus: rework processing errors in hfsplus_free_extents() hfsplus: avoid crash on failed block map free kcmp: include linux/ptrace.h drivers/rtc/rtc-imxdi.c: must include <linux/spinlock.h> mm: cma: WARN if freed memory is still in use exec: do not leave bprm->interp on stack ...
2012-12-21Merge branch 'for-linus' of ↵Linus Torvalds3-5/+77
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull signal handling cleanups from Al Viro: "sigaltstack infrastructure + conversion for x86, alpha and um, COMPAT_SYSCALL_DEFINE infrastructure. Note that there are several conflicts between "unify SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline; resolution is trivial - just remove definitions of SS_ONSTACK and SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and include/uapi/linux/signal.h contains the unified variant." Fixed up conflicts as per Al. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: alpha: switch to generic sigaltstack new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those generic compat_sys_sigaltstack() introduce generic sys_sigaltstack(), switch x86 and um to it new helper: compat_user_stack_pointer() new helper: restore_altstack() unify SS_ONSTACK/SS_DISABLE definitions new helper: current_user_stack_pointer() missing user_stack_pointer() instances Bury the conditionals from kernel_thread/kernel_execve series COMPAT_SYSCALL_DEFINE: infrastructure
2012-12-21keys: use keyring_alloc() to create module signing keyringDavid Howells1-9/+6
Use keyring_alloc() to create special keyrings now that it has a permissions parameter rather than using key_alloc() + key_instantiate_and_link(). Signed-off-by: David Howells <dhowells@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21kcmp: include linux/ptrace.hCyrill Gorcunov1-0/+1
This makes it compile on s390. After all the ptrace_may_access (which we use this file) is declared exactly in linux/ptrace.h. This is preparatory work to wire this syscall up on all archs. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Alexander Kartashov <alekskartashov@parallels.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20sched: numa: ksm: fix oops in task_numa_placment()Hugh Dickins1-1/+4
task_numa_placement() oopsed on NULL p->mm when task_numa_fault() got called in the handling of break_ksm() for ksmd. That might be a peculiar case, which perhaps KSM could takes steps to avoid? but it's more robust if task_numa_placement() allows for such a possibility. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20Merge tag 'random_for_linus' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random Pull random updates from Ted Ts'o: "A few /dev/random improvements for the v3.8 merge window." * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: Mix cputime from each thread that exits to the pool random: prime last_data value per fips requirements random: fix debug format strings random: make it possible to enable debugging without rebuild
2012-12-20new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to thoseAl Viro1-0/+16
note that they are relying on access_ok() already checked by caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20generic compat_sys_sigaltstack()Al Viro1-0/+45
Again, conditional on CONFIG_GENERIC_SIGALTSTACK Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20introduce generic sys_sigaltstack(), switch x86 and um to itAl Viro1-0/+6
Conditional on CONFIG_GENERIC_SIGALTSTACK; architectures that do not select it are completely unaffected Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20new helper: restore_altstack()Al Viro1-0/+7
to be used by rt_sigreturn instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20Bury the conditionals from kernel_thread/kernel_execve seriesAl Viro2-5/+3
All architectures have CONFIG_GENERIC_KERNEL_THREAD CONFIG_GENERIC_KERNEL_EXECVE __ARCH_WANT_SYS_EXECVE None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers of kernel_execve() (which is a trivial wrapper for do_execve() now) left. Kill the conditionals and make both callers use do_execve(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20watchdog: Fix disable/enable regressionBjørn Mork1-7/+4
Commit 8d4516904b39 ("watchdog: Fix CPU hotplug regression") causes an oops or hard lockup when doing echo 0 > /proc/sys/kernel/nmi_watchdog echo 1 > /proc/sys/kernel/nmi_watchdog and the kernel is booted with nmi_watchdog=1 (default) Running laptop-mode-tools and disconnecting/connecting AC power will cause this to trigger, making it a common failure scenario on laptops. Instead of bailing out of watchdog_disable() when !watchdog_enabled we can initialize the hrtimer regardless of watchdog_enabled status. This makes it safe to call watchdog_disable() in the nmi_watchdog=0 case, without the negative effect on the enabled => disabled => enabled case. All these tests pass with this patch: - nmi_watchdog=1 echo 0 > /proc/sys/kernel/nmi_watchdog echo 1 > /proc/sys/kernel/nmi_watchdog - nmi_watchdog=0 echo 0 > /sys/devices/system/cpu/cpu1/online - nmi_watchdog=0 echo mem > /sys/power/state Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661 Cc: <stable@vger.kernel.org> # v3.7 Cc: Norbert Warmuth <nwarmuth@t-online.de> Cc: Joseph Salisbury <joseph.salisbury@canonical.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19workqueue: fix find_worker_executing_work() brekage from hashtable conversionTejun Heo1-1/+1
42f8570f43 ("workqueue: use new hashtable implementation") incorrectly made busy workers hashed by the pointer value of worker instead of work. This broke find_worker_executing_work() which in turn broke a lot of fundamental operations of workqueue - non-reentrancy and flushing among others. The flush malfunction triggered warning in disk event code in Fengguang's automated test. write_dev_root_ (3265) used greatest stack depth: 2704 bytes left ------------[ cut here ]------------ WARNING: at /c/kernel-tests/src/stable/block/genhd.c:1574 disk_clear_events+0x\ cf/0x108() Hardware name: Bochs Modules linked in: Pid: 3328, comm: ata_id Not tainted 3.7.0-01930-gbff6343 #1167 Call Trace: [<ffffffff810997c4>] warn_slowpath_common+0x83/0x9c [<ffffffff810997f7>] warn_slowpath_null+0x1a/0x1c [<ffffffff816aea77>] disk_clear_events+0xcf/0x108 [<ffffffff811bd8be>] check_disk_change+0x27/0x59 [<ffffffff822e48e2>] cdrom_open+0x49/0x68b [<ffffffff81ab0291>] idecd_open+0x88/0xb7 [<ffffffff811be58f>] __blkdev_get+0x102/0x3ec [<ffffffff811bea08>] blkdev_get+0x18f/0x30f [<ffffffff811bebfd>] blkdev_open+0x75/0x80 [<ffffffff8118f510>] do_dentry_open+0x1ea/0x295 [<ffffffff8118f5f0>] finish_open+0x35/0x41 [<ffffffff8119c720>] do_last+0x878/0xa25 [<ffffffff8119c993>] path_openat+0xc6/0x333 [<ffffffff8119cf37>] do_filp_open+0x38/0x86 [<ffffffff81190170>] do_sys_open+0x6c/0xf9 [<ffffffff8119021e>] sys_open+0x21/0x23 [<ffffffff82c1c3d9>] system_call_fastpath+0x16/0x1b Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Sasha Levin <sasha.levin@oracle.com>
2012-12-19Merge tag 'modules-next-for-linus' of ↵Linus Torvalds5-183/+294
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module update from Rusty Russell: "Nothing all that exciting; a new module-from-fd syscall for those who want to verify the source of the module (ChromeOS) and/or use standard IMA on it or other security hooks." * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: MODSIGN: Fix kbuild output when using default extra_certificates MODSIGN: Avoid using .incbin in C source modules: don't hand 0 to vmalloc. module: Remove a extra null character at the top of module->strtab. ASN.1: Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants ASN.1: Define indefinite length marker constant moduleparam: use __UNIQUE_ID() __UNIQUE_ID() MODSIGN: Add modules_sign make target powerpc: add finit_module syscall. ima: support new kernel module syscall add finit_module syscall to asm-generic ARM: add finit_module syscall to ARM security: introduce kernel_module_from_file hook module: add flags arg to sys_finit_module() module: add syscall to load module from fd
2012-12-19Merge branch 'akpm' (more patches from Andrew)Linus Torvalds3-10/+16
Merge patches from Andrew Morton: "Most of the rest of MM, plus a few dribs and drabs. I still have quite a few irritating patches left around: ones with dubious testing results, lack of review, ones which should have gone via maintainer trees but the maintainers are slack, etc. I need to be more activist in getting these things wrapped up outside the merge window, but they're such a PITA." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (48 commits) mm/vmscan.c: avoid possible deadlock caused by too_many_isolated() vmscan: comment too_many_isolated() mm/kmemleak.c: remove obsolete simple_strtoul mm/memory_hotplug.c: improve comments mm/hugetlb: create hugetlb cgroup file in hugetlb_init mm/mprotect.c: coding-style cleanups Documentation: ABI: /sys/devices/system/node/ slub: drop mutex before deleting sysfs entry memcg: add comments clarifying aspects of cache attribute propagation kmem: add slab-specific documentation about the kmem controller slub: slub-specific propagation changes slab: propagate tunable values memcg: aggregate memcg cache values in slabinfo memcg/sl[au]b: shrink dead caches memcg/sl[au]b: track all the memcg children of a kmem_cache memcg: destroy memcg caches sl[au]b: allocate objects from memcg cache sl[au]b: always get the cache from its page in kmem_cache_free() memcg: skip memcg kmem allocations in specified code regions memcg: infrastructure to match an allocation to the right cache ...
2012-12-19fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombsGlauber Costa1-2/+2
Because those architectures will draw their stacks directly from the page allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG flag, and issue the corresponding free_pages. This code path is taken when the architecture doesn't define CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining architectures fall in this category. This will guarantee that every stack page is accounted to the memcg the process currently lives on, and will have the allocations to fail if they go over limit. For the time being, I am defining a new variant of THREADINFO_GFP, not to mess with the other path. Once the slab is also tracked by memcg, we can get rid of that flag. Tested to successfully protect against :(){ :|:& };: Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Frederic Weisbecker <fweisbec@redhat.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19res_counter: return amount of charges after res_counter_uncharge()Glauber Costa1-7/+13
It is useful to know how many charges are still left after a call to res_counter_uncharge. While it is possible to issue a res_counter_read after uncharge, this can be racy. If we need, for instance, to take some action when the counters drop down to 0, only one of the callers should see it. This is the same semantics as the atomic variables in the kernel. Since the current return value is void, we don't need to worry about anything breaking due to this change: nobody relied on that, and only users appearing from now on will be checking this value. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19irq: tsk->comm is an arrayAlan Cox1-1/+1
The array check is useless so remove it. [akpm@linux-foundation.org: remove comment, per David] Signed-off-by: Alan Cox <alan@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19Merge branch 'tip/perf/core-2' of ↵Linus Torvalds2-33/+31
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull minor tracing updates and fixes from Steven Rostedt: "It seems that one of my old pull requests have slipped through. The changes are contained to just the files that I maintain, and are changes from others that I told I would get into this merge window. They have already been in linux-next for several weeks, and should be well tested." * 'tip/perf/core-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Remove unnecessary WARN_ONCE's from tracing_buffers_splice_read tracing: Remove unneeded checks from the stack tracer tracing: Add a resize function to make one buffer equivalent to another buffer
2012-12-18workqueue: consider work function when searching for busy work itemsTejun Heo1-8/+31
To avoid executing the same work item concurrenlty, workqueue hashes currently busy workers according to their current work items and looks up the the table when it wants to execute a new work item. If there already is a worker which is executing the new work item, the new item is queued to the found worker so that it gets executed only after the current execution finishes. Unfortunately, a work item may be freed while being executed and thus recycled for different purposes. If it gets recycled for a different work item and queued while the previous execution is still in progress, workqueue may make the new work item wait for the old one although the two aren't really related in any way. In extreme cases, this false dependency may lead to deadlock although it's extremely unlikely given that there aren't too many self-freeing work item users and they usually don't wait for other work items. To alleviate the problem, record the current work function in each busy worker and match it together with the work item address in find_worker_executing_work(). While this isn't complete, it ensures that unrelated work items don't interact with each other and in the very unlikely case where a twisted wq user triggers it, it's always onto itself making the culprit easy to spot. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andrey Isakov <andy51@gmx.ru> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51701 Cc: stable@vger.kernel.org
2012-12-18Merge branch 'for-linus' of ↵Linus Torvalds4-4/+31
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull (again) user namespace infrastructure changes from Eric Biederman: "Those bugs, those darn embarrasing bugs just want don't want to get fixed. Linus I just updated my mirror of your kernel.org tree and it appears you successfully pulled everything except the last 4 commits that fix those embarrasing bugs. When you get a chance can you please repull my branch" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Fix typo in description of the limitation of userns_install userns: Add a more complete capability subset test to commit_creds userns: Require CAP_SYS_ADMIN for most uses of setns. Fix cap_capable to only allow owners in the parent user namespace to have caps.
2012-12-18workqueue: use new hashtable implementationSasha Levin1-71/+15
Switch workqueues to use the new hashtable implementation. This reduces the amount of generic unrelated code in the workqueues. This patch depends on d9b482c ("hashtable: introduce a small and naive hashtable") which was merged in v3.6. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-12-18Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds9-49/+55
Merge misc patches from Andrew Morton: "Incoming: - lots of misc stuff - backlight tree updates - lib/ updates - Oleg's percpu-rwsem changes - checkpatch - rtc - aoe - more checkpoint/restart support I still have a pile of MM stuff pending - Pekka should be merging later today after which that is good to go. A number of other things are twiddling thumbs awaiting maintainer merges." * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits) scatterlist: don't BUG when we can trivially return a proper error. docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output fs, fanotify: add @mflags field to fanotify output docs: add documentation about /proc/<pid>/fdinfo/<fd> output fs, notify: add procfs fdinfo helper fs, exportfs: add exportfs_encode_inode_fh() helper fs, exportfs: escape nil dereference if no s_export_op present fs, epoll: add procfs fdinfo helper fs, eventfd: add procfs fdinfo helper procfs: add ability to plug in auxiliary fdinfo providers tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test breakpoint selftests: print failure status instead of cause make error kcmp selftests: print fail status instead of cause make error kcmp selftests: make run_tests fix mem-hotplug selftests: print failure status instead of cause make error cpu-hotplug selftests: print failure status instead of cause make error mqueue selftests: print failure status instead of cause make error vm selftests: print failure status instead of cause make error ubifs: use prandom_bytes mtd: nandsim: use prandom_bytes ...
2012-12-18pidns: remove unused is_container_init()Gao feng1-15/+0
Since commit 1cdcbec1a337 ("CRED: Neuter sys_capset()") is_container_init() has no callers. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Cc: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: James Morris <jmorris@namei.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-18ptrace: introduce PTRACE_O_EXITKILLOleg Nesterov1-0/+3
Ptrace jailers want to be sure that the tracee can never escape from the control. However if the tracer dies unexpectedly the tracee continues to run in potentially unsafe mode. Add the new ptrace option PTRACE_O_EXITKILL. If the tracer exits it sends SIGKILL to every tracee which has this bit set. Note that the new option is not equal to the last-option << 1. Because currently all options have an event, and the new one starts the eventless group. It uses the random 20 bit, so we have the room for 12 more events, but we can also add the new eventless options below this one. Suggested by Amnon Shiloh. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Amnon Shiloh <u3557@miso.sublimeip.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Chris Evans <scarybeasts@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>