kernel/linux.git/include, branch v3.0.1

NFS: Fix spurious readdir cookie loop messages

2011-08-05T04:58:40+00:00

commit 0c0308066ca53fdf1423895f3a42838b67b3a5a8 upstream. If the directory contents change, then we have to accept that the file->f_pos value may shrink if we do a 'search-by-cookie'. In that case, we should turn off the loop detection and let the NFS client try to recover. The patch also fixes a second loop detection bug by ensuring that after turning on the ctx->duped flag, we read at least one new cookie into ctx->dir_cookie before attempting to match with ctx->dup_cookie. Reported-by: Petr Vandrovec Signed-off-by: Trond Myklebust Signed-off-by: Greg Kroah-Hartman

mm/futex: fix futex writes on archs with SW tracking of dirty & young

2011-08-05T04:58:38+00:00

commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 upstream. I haven't reproduced it myself but the fail scenario is that on such machines (notably ARM and some embedded powerpc), if you manage to hit that futex path on a writable page whose dirty bit has gone from the PTE, you'll livelock inside the kernel from what I can tell. It will go in a loop of trying the atomic access, failing, trying gup to "fix it up", getting succcess from gup, go back to the atomic access, failing again because dirty wasn't fixed etc... So I think you essentially hang in the kernel. The scenario is probably rare'ish because affected architecture are embedded and tend to not swap much (if at all) so we probably rarely hit the case where dirty is missing or young is missing, but I think Shan has a piece of SW that can reliably reproduce it using a shared writable mapping & fork or something like that. On archs who use SW tracking of dirty & young, a page without dirty is effectively mapped read-only and a page without young unaccessible in the PTE. Additionally, some architectures might lazily flush the TLB when relaxing write protection (by doing only a local flush), and expect a fault to invalidate the stale entry if it's still present on another processor. The futex code assumes that if the "in_atomic()" access -EFAULT's, it can "fix it up" by causing get_user_pages() which would then be equivalent to taking the fault. However that isn't the case. get_user_pages() will not call handle_mm_fault() in the case where the PTE seems to have the right permissions, regardless of the dirty and young state. It will eventually update those bits ... in the struct page, but not in the PTE. Additionally, it will not handle the lazy TLB flushing that can be required by some architectures in the fault case. Basically, gup is the wrong interface for the job. The patch provides a more appropriate one which boils down to just calling handle_mm_fault() since what we are trying to do is simulate a real page fault. The futex code currently attempts to write to user memory within a pagefault disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB, and will not be updated by gup(). In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well. I decided that adding those "features" to gup() would be too much for this already too complex function, and instead added a new simpler fixup_user_fault() which is essentially a wrapper around handle_mm_fault() which the futex code can call. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout] Signed-off-by: Benjamin Herrenschmidt Reported-by: Shan Hai Tested-by: Shan Hai Cc: David Laight Acked-by: Peter Zijlstra Cc: Darren Hart Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

pnfs: let layoutcommit handle a list of lseg

2011-08-05T04:58:37+00:00

commit a9bae5666d0510ad69bdb437371c9a3e6b770705 upstream. There can be multiple lseg per file, so layoutcommit should be able to handle it. [Needed in v3.0] Signed-off-by: Peng Tao Signed-off-by: Boaz Harrosh Signed-off-by: Jim Rees Signed-off-by: Trond Myklebust Signed-off-by: Greg Kroah-Hartman

firewire: cdev: prevent race between first get_info ioctl and bus reset event queuing

2011-08-05T04:58:34+00:00

commit 93b37905f70083d6143f5f4dba0a45cc64379a62 upstream. Between open(2) of a /dev/fw* and the first FW_CDEV_IOC_GET_INFO ioctl(2) on it, the kernel already queues FW_CDEV_EVENT_BUS_RESET events to be read(2) by the client. The get_info ioctl is practically always issued right away after open, hence this condition only occurs if the client opens during a bus reset, especially during a rapid series of bus resets. The problem with this condition is twofold: - These bus reset events carry the (as yet undocumented) @closure value of 0. But it is not the kernel's place to choose closures; they are privat to the client. E.g., this 0 value forced from the kernel makes it unsafe for clients to dereference it as a pointer to a closure object without NULL pointer check. - It is impossible for clients to determine the relative order of bus reset events from get_info ioctl(2) versus those from read(2), except in one way: By comparison of closure values. Again, such a procedure imposes complexity on clients and reduces freedom in use of the bus reset closure. So, change the ABI to suppress queuing of bus reset events before the first FW_CDEV_IOC_GET_INFO ioctl was issued by the client. Note, this ABI change cannot be version-controlled. The kernel cannot distinguish old from new clients before the first FW_CDEV_IOC_GET_INFO ioctl. We will try to back-merge this change into currently maintained stable/ longterm series, and we only document the new behaviour. The old behavior is now considered a kernel bug, which it basically is. Signed-off-by: Stefan Richter Cc:

gro: Only reset frag0 when skb can be pulled

2011-08-05T04:58:31+00:00

commit 17dd759c67f21e34f2156abcf415e1f60605a188 upstream. Currently skb_gro_header_slow unconditionally resets frag0 and frag0_len. However, when we can't pull on the skb this leaves the GRO fields in an inconsistent state. This patch fixes this by only resetting those fields after the pskb_may_pull test. Signed-off-by: Herbert Xu Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman

Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

2011-07-20T22:56:25+00:00

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: signal: align __lock_task_sighand() irq disabling and RCU softirq,rcu: Inform RCU of irq_exit() activity sched: Add irq_{enter,exit}() to scheduler_ipi() rcu: protect __rcu_read_unlock() against scheduler-using irq handlers rcu: Streamline code produced by __rcu_read_unlock() rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special rcu: decrease rcu_report_exp_rnp coupling with scheduler

Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent

2011-07-20T18:59:26+00:00

sched: Allow for overlapping sched_domain spans

2011-07-20T16:32:41+00:00

Allow for sched_domain spans that overlap by giving such domains their own sched_group list instead of sharing the sched_groups amongst each-other. This is needed for machines with more than 16 nodes, because sched_domain_node_span() will generate a node mask from the 16 nearest nodes without regard if these masks have any overlap. Currently sched_domains have a sched_group that maps to their child sched_domain span, and since there is no overlap we share the sched_group between the sched_domains of the various CPUs. If however there is overlap, we would need to link the sched_group list in different ways for each cpu, and hence sharing isn't possible. In order to solve this, allocate private sched_groups for each CPU's sched_domain but have the sched_groups share a sched_group_power structure such that we can uniquely track the power. Reported-and-tested-by: Anton Blanchard Signed-off-by: Peter Zijlstra Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org Signed-off-by: Ingo Molnar

sched: Break out cpu_power from the sched_group structure

2011-07-20T16:32:40+00:00

In order to prepare for non-unique sched_groups per domain, we need to carry the cpu_power elsewhere, so put a level of indirection in. Reported-and-tested-by: Anton Blanchard Signed-off-by: Peter Zijlstra Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org Signed-off-by: Ingo Molnar

rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special

2011-07-20T04:38:52+00:00

The RCU_BOOST commits for TREE_PREEMPT_RCU introduced an other-task write to a new RCU_READ_UNLOCK_BOOSTED bit in the task_struct structure's ->rcu_read_unlock_special field, but, as noted by Steven Rostedt, without correctly synchronizing all accesses to ->rcu_read_unlock_special. This could result in bits in ->rcu_read_unlock_special being spuriously set and cleared due to conflicting accesses, which in turn could result in deadlocks between the rcu_node structure's ->lock and the scheduler's rq and pi locks. These deadlocks would result from RCU incorrectly believing that the just-ended RCU read-side critical section had been preempted and/or boosted. If that RCU read-side critical section was executed with either rq or pi locks held, RCU's ensuing (incorrect) calls to the scheduler would cause the scheduler to attempt to once again acquire the rq and pi locks, resulting in deadlock. More complex deadlock cycles are also possible, involving multiple rq and pi locks as well as locks from multiple rcu_node structures. This commit fixes synchronization by creating ->rcu_boosted field in task_struct that is accessed and modified only when holding the ->lock in the rcu_node structure on which the task is queued (on that rcu_node structure's ->blkd_tasks list). This results in tasks accessing only their own current->rcu_read_unlock_special fields, making unsynchronized access once again legal, and keeping the rcu_read_unlock() fastpath free of atomic instructions and memory barriers. The reason that the rcu_read_unlock() fastpath does not need to access the new current->rcu_boosted field is that this new field cannot be non-zero unless the RCU_READ_UNLOCK_BLOCKED bit is set in the current->rcu_read_unlock_special field. Therefore, rcu_read_unlock() need only test current->rcu_read_unlock_special: if that is zero, then current->rcu_boosted must also be zero. This bug does not affect TINY_PREEMPT_RCU because this implementation of RCU accesses current->rcu_read_unlock_special with irqs disabled, thus preventing races on the !SMP systems that TINY_PREEMPT_RCU runs on. Maybe-reported-by: Dave Jones Maybe-reported-by: Sergey Senozhatsky Reported-by: Steven Rostedt Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney Reviewed-by: Steven Rostedt