summaryrefslogtreecommitdiff
path: root/ipc
AgeCommit message (Collapse)AuthorFilesLines
2013-05-27ipc/sem.c: Fix missing wakeups in do_smart_update_queue()Manfred Spraul1-5/+22
do_smart_update_queue() is called when an operation (semop, semctl(SETVAL), semctl(SETALL), ...) modified the array. It must check which of the sleeping tasks can proceed. do_smart_update_queue() missed a few wakeups: - if a sleeping complex op was completed, then all per-semaphore queues must be scanned - not only those that were modified by *sops - if a sleeping simple op proceeded, then the global queue must be scanned again And: - the test for "|sops == NULL) before scanning the global queue is not required: If the global queue is empty, then it doesn't need to be scanned - regardless of the reason for calling do_smart_update_queue() The patch is not optimized, i.e. even completing a wait-for-zero operation causes a rescan. This is done to keep the patch as simple as possible. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-10shm: fix null pointer deref when userspace specifies invalid hugepage sizeLi Zefan1-1/+7
Dave reported an oops triggered by trinity: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: newseg+0x10d/0x390 PGD cf8c1067 PUD cf8c2067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67 ... Call Trace: ipcget+0x182/0x380 SyS_shmget+0x5a/0x60 tracesys+0xdd/0xe2 This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap failure in unaligned size request"). Reported-by: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Li Zefan <lizfan@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-10ipc,sem: fix semctl(..., GETNCNT)Rik van Riel1-0/+7
The semctl GETNCNT returns the number of semops waiting for the specified semaphore to become nonzero. After commit 9f1bc2c9022c ("ipc,sem: have only one list in struct sem_queue"), the semops waiting on just one semaphore are waiting on that semaphore's list. In order to return the correct count, we have to walk that list too, in addition to the sem_array's list for complex operations. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-10ipc,sem: fix semctl(..., GETZCNT)Rik van Riel1-0/+7
The semctl GETZCNT returns the number of semops waiting for the specified semaphore to become zero. After commit 9f1bc2c9022c ("ipc,sem: have only one list in struct sem_queue"), the semops waiting on just one semaphore are waiting on that semaphore's list. In order to return the correct count, we have to walk that list too, in addition to the sem_array's list for complex operations. This bug broke dbench; it works again with this patch applied. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Kent Overstreet <koverstreet@google.com> Tested-by: Kent Overstreet <koverstreet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-08hugetlbfs: fix mmap failure in unaligned size requestNaoya Horiguchi1-1/+5
The current kernel returns -EINVAL unless a given mmap length is "almost" hugepage aligned. This is because in sys_mmap_pgoff() the given length is passed to vm_mmap_pgoff() as it is without being aligned with hugepage boundary. This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix alignment of huge page requests"), where alignment code is pushed into hugetlb_file_setup() and the variable len in caller side is not changed. To fix this, this patch partially reverts that commit, and adds alignment code in caller side. And it also introduces hstate_sizelog() in order to get proper hstate to specified hugepage size. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881 [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: <iceman_dvd@yahoo.com> Cc: Steven Truelove <steven.truelove@utoronto.ca> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-05ipc: simplify rcu_read_lock() in semctl_nolock()Linus Torvalds1-2/+1
This trivially combines two rcu_read_lock() calls in both sides of a if-statement into one single one in front of the if-statement. Split out as an independent cleanup from the previous commit. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-05ipc: simplify semtimedop/semctl_main() common error path handlingLinus Torvalds1-27/+14
With various straight RCU lock/unlock movements, one common exit path pattern had become rcu_read_unlock(); goto out_wakeup; and in fact there were no cases where we wanted to exit to out_wakeup _without_ releasing the RCU read lock. So replace that pattern with "goto out_rcu_wakeup", and remove the old out_wakeup. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-05ipc: move sem_obtain_lock() rcu locking into the only callerLinus Torvalds1-9/+7
sem_obtain_lock() was another of those functions that returned with the RCU lock held for reading in the success case. Move the RCU locking to the caller (semtimedop()), making it more obvious. We already did RCU locking elsewhere in that function. Side note: why does semtimedop() re-do the semphore lookup after the sleep, rather than just getting a reference to the semaphore it already looked up originally? Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-05ipc: fix double sem unlock in semctl error pathLinus Torvalds1-1/+1
Fix another ipc locking buglet introduced by the scalability patches: when semctl_down() was changed to delay the semaphore locking, one error path for security_sem_semctl() went through the semaphore unlock logic even though the semaphore had never been locked. Introduced by commit 16df3674efe3 ("ipc,sem: do not hold ipc lock more than necessary") Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-05ipc: move the rcu_read_lock() from sem_lock_and_putref() into callersLinus Torvalds1-2/+3
This is another ipc semaphore locking cleanup, trying to make the locking more straightforward. We move the rcu read locking into the callers of sem_lock_and_putref(), which in general means that we now mostly do the rcu_read_lock() and rcu_read_unlock() in the same function. Mostly. We still have the ipc_addid/newary/freeary mess, and things like ipcctl_pre_down_nolock(). Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: sem_putref() does not need the semaphore lock any moreLinus Torvalds1-3/+1
ipc_rcu_putref() uses atomics for the refcount, and the games to lock and unlock the semaphore just to try to keep the reference counting working are no longer useful. Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04ipc: move rcu_read_unlock() out of sem_unlock() and into callersLinus Torvalds1-2/+17
The IPC locking is a mess, and sem_unlock() unlocks not only the semaphore spinlock, it also drops the rcu read lock. Unlike sem_lock(), which just gets the spin-lock, and expects the caller to get the rcu read lock. This all makes things very hard to follow, and it's very confusing when you take the rcu read lock in one function, and then release it in another. And it has caused actual bugs: the sem_obtain_lock() function ended up dropping the RCU read lock twice in one error path, because it first did the sem_unlock(), and then did a rcu_read_unlock() to match the rcu_read_lock() it had done. This is just a totally mindless "remove rcu_read_unlock() from sem_unlock() and add it immediately after each caller" (except for the aforementioned bug where we did too many rcu_read_unlock(), and in find_alloc_undo() where we just got the rcu_read_lock() to correct for the fact that sem_unlock would immediately drop it again). We can (and should) clean things up further, but this fixes the bug with the minimal amount of subtlety. Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-03ipc: fix GETALL/IPC_RM race for sysv semaphoresAl Viro1-21/+8
We can step on WARN_ON_ONCE() in sem_getref() if a semaphore is removed just as we are about to call sem_getref() from semctl_main(); results are not pretty. We should fail with -EIDRM, same as if IPC_RM happened while we'd been doing allocation there. This also expands sem_getref() at its only callsite (and fixed there), while sem_getref_and_unlock() is simply killed off - it has no callers at all. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-02ipc_schedule_free() can do vfree() directly nowAl Viro1-87/+16
Commit 32fcfd40715e ("make vfree() safe to call from interrupt contexts") made it safe to do vfree directly from the RCU callback, which allows us to simplify ipc/util.c a lot by getting rid of the differences between vmalloc/kmalloc memory. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-02Merge branch 'for-linus' of ↵Linus Torvalds3-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...
2013-05-02proc: Split the namespace stuff out into linux/proc_ns.hDavid Howells2-2/+2
Split the proc namespace stuff out into linux/proc_ns.h. Signed-off-by: David Howells <dhowells@redhat.com> cc: netdev@vger.kernel.org cc: Serge E. Hallyn <serge.hallyn@ubuntu.com> cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01ipc: sysv shared memory limited to 8TiBRobin Holt1-1/+1
Trying to run an application which was trying to put data into half of memory using shmget(), we found that having a shmall value below 8EiB-8TiB would prevent us from using anything more than 8TiB. By setting kernel.shmall greater than 8EiB-8TiB would make the job work. In the newseg() function, ns->shm_tot which, at 8TiB is INT_MAX. ipc/shm.c: 458 static int newseg(struct ipc_namespace *ns, struct ipc_params *params) 459 { ... 465 int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT; ... 474 if (ns->shm_tot + numpages > ns->shm_ctlall) 475 return -ENOSPC; [akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int] Signed-off-by: Robin Holt <holt@sgi.com> Reported-by: Alex Thorlton <athorlton@sgi.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc/msg.c: use list_for_each_entry_[safe] for list traversingNikola Pajkovsky1-27/+8
The ipc/msg.c code does its list operations by hand and it open-codes the accesses, instead of using for_each_entry_[safe]. Signed-off-by: Nikola Pajkovsky <npajkovs@redhat.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc,sem: fine grained locking for semtimedopRik van Riel4-125/+203
Introduce finer grained locking for semtimedop, to handle the common case of a program wanting to manipulate one semaphore from an array with multiple semaphores. If the call is a semop manipulating just one semaphore in an array with multiple semaphores, only take the lock for that semaphore itself. If the call needs to manipulate multiple semaphores, or another caller is in a transaction that manipulates multiple semaphores, the sem_array lock is taken, as well as all the locks for the individual semaphores. On a 24 CPU system, performance numbers with the semop-multi test with N threads and N semaphores, look like this: vanilla Davidlohr's Davidlohr's + Davidlohr's + threads patches rwlock patches v3 patches 10 610652 726325 1783589 2142206 20 341570 365699 1520453 1977878 30 288102 307037 1498167 2037995 40 290714 305955 1612665 2256484 50 288620 312890 1733453 2650292 60 289987 306043 1649360 2388008 70 291298 306347 1723167 2717486 80 290948 305662 1729545 2763582 90 290996 306680 1736021 2757524 100 292243 306700 1773700 3059159 [davidlohr.bueso@hp.com: do not call sem_lock when bogus sma] [davidlohr.bueso@hp.com: make refcounter atomic] Signed-off-by: Rik van Riel <riel@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Jason Low <jason.low2@hp.com> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Emmanuel Benisty <benisty.e@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc,sem: have only one list in struct sem_queueRik van Riel1-31/+34
Having only one list in struct sem_queue, and only queueing simple semaphore operations on the list for the semaphore involved, allows us to introduce finer grained locking for semtimedop. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc,sem: open code and rename sem_lockRik van Riel1-6/+23
Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock() later that only locks the sem_array and does nothing else. Open code the locking from ipc_lock() in sem_obtain_lock() so we can introduce finer grained locking for the sem_array in the next patch. [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()] Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc,sem: do not hold ipc lock more than necessaryDavidlohr Bueso2-48/+118
Instead of holding the ipc lock for permissions and security checks, among others, only acquire it when necessary. Some numbers.... 1) With Rik's semop-multi.c microbenchmark we can see the following results: Baseline (3.9-rc1): cpus 4, threads: 256, semaphores: 128, test duration: 30 secs total operations: 151452270, ops/sec 5048409 + 59.40% a.out [kernel.kallsyms] [k] _raw_spin_lock + 6.14% a.out [kernel.kallsyms] [k] sys_semtimedop + 3.84% a.out [kernel.kallsyms] [k] avc_has_perm_flags + 3.64% a.out [kernel.kallsyms] [k] __audit_syscall_exit + 2.06% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string + 1.86% a.out [kernel.kallsyms] [k] ipc_lock With this patchset: cpus 4, threads: 256, semaphores: 128, test duration: 30 secs total operations: 273156400, ops/sec 9105213 + 18.54% a.out [kernel.kallsyms] [k] _raw_spin_lock + 11.72% a.out [kernel.kallsyms] [k] sys_semtimedop + 7.70% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21 + 6.58% a.out [kernel.kallsyms] [k] avc_has_perm_flags + 6.54% a.out [kernel.kallsyms] [k] __audit_syscall_exit + 4.71% a.out [kernel.kallsyms] [k] ipc_obtain_object_check 2) While on an Oracle swingbench DSS (data mining) workload the improvements are not as exciting as with Rik's benchmark, we can see some positive numbers. For an 8 socket machine the following are the percentages of %sys time incurred in the ipc lock: Baseline (3.9-rc1): 100 swingbench users: 8,74% 400 swingbench users: 21,86% 800 swingbench users: 84,35% With this patchset: 100 swingbench users: 8,11% 400 swingbench users: 19,93% 800 swingbench users: 77,69% [riel@redhat.com: fix two locking bugs] [sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Chegu Vinod <chegu_vinod@hp.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jason Low <jason.low2@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: introduce lockless pre_down ipcctlDavidlohr Bueso2-5/+29
Various forms of ipc use ipcctl_pre_down() to retrieve an ipc object and check permissions, mostly for IPC_RMID and IPC_SET commands. Introduce ipcctl_pre_down_nolock(), a lockless version of this function. The locking version is retained, yet modified to call the nolock version without affecting its semantics, thus transparent to all ipc callers. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: introduce obtaining a lockless ipc objectDavidlohr Bueso2-14/+59
Through ipc_lock() and therefore ipc_lock_check() we currently return the locked ipc object. This is not necessary for all situations and can, therefore, cause unnecessary ipc lock contention. Introduce analogous ipc_obtain_object() and ipc_obtain_object_check() functions that only lookup and return the ipc object. Both these functions must be called within the RCU read critical section. [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()] Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Chegu Vinod <chegu_vinod@hp.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: remove bogus lock comment for ipc_checkidDavidlohr Bueso1-6/+1
This series makes the sysv semaphore code more scalable, by reducing the time the semaphore lock is held, and making the locking more scalable for semaphore arrays with multiple semaphores. The first four patches were written by Davidlohr Buesso, and reduce the hold time of the semaphore lock. The last three patches change the sysv semaphore code locking to be more fine grained, providing a performance boost when multiple semaphores in a semaphore array are being manipulated simultaneously. On a 24 CPU system, performance numbers with the semop-multi test with N threads and N semaphores, look like this: vanilla Davidlohr's Davidlohr's + Davidlohr's + threads patches rwlock patches v3 patches 10 610652 726325 1783589 2142206 20 341570 365699 1520453 1977878 30 288102 307037 1498167 2037995 40 290714 305955 1612665 2256484 50 288620 312890 1733453 2650292 60 289987 306043 1649360 2388008 70 291298 306347 1723167 2717486 80 290948 305662 1729545 2763582 90 290996 306680 1736021 2757524 100 292243 306700 1773700 3059159 This patch: There is no reason to be holding the ipc lock while reading ipcp->seq, hence remove misleading comment. Also simplify the return value for the function. Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Emmanuel Benisty <benisty.e@gmail.com> Cc: Jason Low <jason.low2@hp.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc/msgutil.c: use linux/uaccess.hHoSung Jung1-2/+2
Signed-off-by: HoSung Jung <rain6557@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: refactor msg list search into separate functionPeter Hurley1-22/+26
[fengguang.wu@intel.com: find_msg can be static] Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: simplify msg list searchPeter Hurley1-6/+2
Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: implement MSG_COPY as a new receive modePeter Hurley1-13/+9
Teach the helper routines about MSG_COPY so that msgtyp is preserved as the message number to copy. The security functions affected by this change were audited and no additional changes are necessary. Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: remove msg handling from queue scanPeter Hurley1-10/+4
In preparation for refactoring the queue scan into a separate function, relocate msg copying. Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: set EFAULT as default error in load_msg()Peter Hurley1-7/+3
Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: tighten msg copy loopsPeter Hurley1-21/+11
Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: separate msg allocation from userspace copyPeter Hurley1-14/+38
Separating msg allocation enables single-block vmalloc allocation instead. Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01ipc: clamp with min()Peter Hurley1-22/+8
Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-01Merge branch 'for-linus' of ↵Linus Torvalds3-126/+177
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull compat cleanup from Al Viro: "Mostly about syscall wrappers this time; there will be another pile with patches in the same general area from various people, but I'd rather push those after both that and vfs.git pile are in." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: syscalls.h: slightly reduce the jungles of macros get rid of union semop in sys_semctl(2) arguments make do_mremap() static sparc: no need to sign-extend in sync_file_range() wrapper ppc compat wrappers for add_key(2) and request_key(2) are pointless x86: trim sys_ia32.h x86: sys32_kill and sys32_mprotect are pointless get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC merge compat sys_ipc instances consolidate compat lookup_dcookie() convert vmsplice to COMPAT_SYSCALL_DEFINE switch getrusage() to COMPAT_SYSCALL_DEFINE switch epoll_pwait to COMPAT_SYSCALL_DEFINE convert sendfile{,64} to COMPAT_SYSCALL_DEFINE switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect make HAVE_SYSCALL_WRAPPERS unconditional consolidate cond_syscall and SYSCALL_ALIAS declarations teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long get rid of duplicate logics in __SC_....[1-6] definitions
2013-04-30ipc/util.c: use register_hotmemory_notifier()Andrew Morton1-7/+8
Squishes a statement-with-no-effect warning, removes some ifdefs and shrinks .text by one byte! Note that this code fails to check for blocking_notifier_chain_register() failures. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-09procfs: new helper - PDE_DATA(inode)Al Viro1-1/+1
The only part of proc_dir_entry the code outside of fs/proc really cares about is PDE(inode)->data. Provide a helper for that; static inline for now, eventually will be moved to fs/proc, along with the knowledge of struct proc_dir_entry layout. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-02ipc: set msg back to -EAGAIN if copy wasn't performedStanislav Kinsbursky1-0/+1
Make sure that msg pointer is set back to error value in case of MSG_COPY flag is set and desired message to copy wasn't found. This garantees that msg is either a error pointer or a copy address. Otherwise the last message in queue will be freed without unlinking from the queue (which leads to memory corruption) and the dummy allocated copy won't be released. Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-29Merge branch 'for-linus' of ↵Linus Torvalds1-2/+10
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns fixes from Eric W Biederman: "The bulk of the changes are fixing the worst consequences of the user namespace design oversight in not considering what happens when one namespace starts off as a clone of another namespace, as happens with the mount namespace. The rest of the changes are just plain bug fixes. Many thanks to Andy Lutomirski for pointing out many of these issues." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Restrict when proc and sysfs can be mounted ipc: Restrict mounting the mqueue filesystem vfs: Carefully propogate mounts across user namespaces vfs: Add a mount flag to lock read only bind mounts userns: Don't allow creation if the user is chrooted yama: Better permission check for ptraceme pid: Handle the exit of a multi-threaded init. scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.
2013-03-27ipc: Restrict mounting the mqueue filesystemEric W. Biederman1-2/+10
Only allow mounting the mqueue filesystem if the caller has CAP_SYS_ADMIN rights over the ipc namespace. The principle here is if you create or have capabilities over it you can mount it, otherwise you get to live with what other people have mounted. This information is not particularly sensitive and mqueue essentially only reports which posix messages queues exist. Still when creating a restricted environment for an application to live any extra information may be of use to someone with sufficient creativity. The historical if imperfect way this information has been restricted has been not to allow mounts and restricting this to ipc namespace creators maintains the spirit of the historical restriction. Cc: stable@vger.kernel.org Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-23mqueue: sys_mq_open: do not call mnt_drop_write() if read-onlyVladimir Davydov1-1/+2
mnt_drop_write() must be called only if mnt_want_write() succeeded, otherwise the mnt_writers counter will diverge. mnt_writers counters are used to check if remounting FS as read-only is OK, so after an extra mnt_drop_write() call, it would be impossible to remount mqueue FS as read-only. Besides, on umount a warning would be printed like this one: ===================================== [ BUG: bad unlock balance detected! ] 3.9.0-rc3 #5 Not tainted ------------------------------------- a.out/12486 is trying to release lock (sb_writers) at: mnt_drop_write+0x1f/0x30 but there are no more locks to release! Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Doug Ledford <dledford@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-09ipc: don't allocate a copy larger than maxPeter Hurley1-2/+4
When MSG_COPY is set, a duplicate message must be allocated for the copy before locking the queue. However, the copy could not be larger than was sent which is limited to msg_ctlmax. Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-09ipc: fix potential oops when src msg > 4k w/ MSG_COPYPeter Hurley1-3/+0
If the src msg is > 4k, then dest->next points to the next allocated segment; resetting it just prior to dereferencing is bad. Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-06get rid of union semop in sys_semctl(2) argumentsAl Viro3-53/+88
just have the bugger take unsigned long and deal with SETVAL case (when we use an int member in the union) explicitly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-04get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPCAl Viro1-92/+66
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-04merge compat sys_ipc instancesAl Viro1-0/+44
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-04make HAVE_SYSCALL_WRAPPERS unconditionalAl Viro1-2/+0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-28ipc: convert to idr_alloc()Tejun Heo1-21/+9
Convert to the much saner new idr interface. The new interface doesn't directly translate to the way idr_pre_get() was used around ipc_addid() as preloading disables preemption. From my cursory reading, it seems like we should be able to do all allocation from ipc_addid(), so I moved it there. Can you please check whether this would be okay? If this is wrong and ipc_addid() should be allowed to be called from non-sleepable context, I'd suggest allocating id itself in the outer functions and later install the pointer using idr_replace(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27Merge branch 'for-linus' of ↵Linus Torvalds2-13/+14
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-26Merge branch 'for-linus' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace and namespace infrastructure changes from Eric W Biederman: "This set of changes starts with a few small enhnacements to the user namespace. reboot support, allowing more arbitrary mappings, and support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the user namespace root. I do my best to document that if you care about limiting your unprivileged users that when you have the user namespace support enabled you will need to enable memory control groups. There is a minor bug fix to prevent overflowing the stack if someone creates way too many user namespaces. The bulk of the changes are a continuation of the kuid/kgid push down work through the filesystems. These changes make using uids and gids typesafe which ensures that these filesystems are safe to use when multiple user namespaces are in use. The filesystems converted for 3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The changes for these filesystems were a little more involved so I split the changes into smaller hopefully obviously correct changes. XFS is the only filesystem that remains. I was hoping I could get that in this release so that user namespace support would be enabled with an allyesconfig or an allmodconfig but it looks like the xfs changes need another couple of days before it they are ready." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits) cifs: Enable building with user namespaces enabled. cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t cifs: Convert struct cifs_sb_info to use kuids and kgids cifs: Modify struct smb_vol to use kuids and kgids cifs: Convert struct cifsFileInfo to use a kuid cifs: Convert struct cifs_fattr to use kuid and kgids cifs: Convert struct tcon_link to use a kuid. cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t cifs: Convert from a kuid before printing current_fsuid cifs: Use kuids and kgids SID to uid/gid mapping cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc cifs: Use BUILD_BUG_ON to validate uids and gids are the same size cifs: Override unmappable incoming uids and gids nfsd: Enable building with user namespaces enabled. nfsd: Properly compare and initialize kuids and kgids nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids nfsd: Modify nfsd4_cb_sec to use kuids and kgids nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion nfsd: Convert nfsxdr to use kuids and kgids nfsd: Convert nfs3xdr to use kuids and kgids ...