summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2017-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller11-37/+119
2017-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2-5/+7
Pull networking fixes from David Miller: 1) Fix IGMP handling wrt VRF, from David Ahern. 2) Fix timer access to freed object in dccp, from Eric Dumazet. 3) Use kmalloc_array() in ptr_ring to avoid overflow cases which are triggerable by userspace. Also from Eric Dumazet. 4) Fix infinite loop in unmapping cleanup of nfp driver, from Colin Ian King. 5) Correct datagram peek handling of empty SKBs, from Matthew Dawson. 6) Fix use after free in TIPC, from Eric Dumazet. 7) When replacing a route in ipv6 we need to reset the round robin pointer, from Wei Wang. 8) Fix bug in pci_find_pcie_root_port() which was unearthed by the relaxed ordering changes, from Thierry Redding. I made sure to get an explicit ACK from Bjorn this time around :-) * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits) ipv6: repair fib6 tree in failure case net_sched: fix order of queue length updates in qdisc_replace() tools lib bpf: improve warning switchdev: documentation: minor typo fixes bpf, doc: also add s390x as arch to sysctl description net: sched: fix NULL pointer dereference when action calls some targets rxrpc: Fix oops when discarding a preallocated service call irda: do not leak initialized list.dev to userspace net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled PCI: Allow PCI express root ports to find themselves tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set ipv6: reset fn->rr_ptr when replacing route sctp: fully initialize the IPv6 address in sctp_v6_to_addr() tipc: fix use-after-free tun: handle register_netdevice() failures properly datagram: When peeking datagrams with offset < 0 don't skip empty skbs bpf, doc: improve sysctl knob description netxen: fix incorrect loop counter decrement nfp: fix infinite loop on umapping cleanup ...
2017-08-21pids: make task_tgid_nr_ns() safeOleg Nesterov2-25/+30
This was reported many times, and this was even mentioned in commit 52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is not safe because task->group_leader points to nowhere after the exiting task passes exit_notify(), rcu_read_lock() can not help. We really need to change __unhash_process() to nullify group_leader, parent, and real_parent, but this needs some cleanups. Until then we can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and fix the problem. Reported-by: Troy Kensinger <tkensinger@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-20Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two fixes for the perf subsystem: - Fix an inconsistency of RDPMC mm struct tagging across exec() which causes RDPMC to fault. - Correct the timestamp mechanics across IOC_DISABLE/ENABLE which causes incorrect timestamps and total time calculations" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix time on IOC_ENABLE perf/x86: Fix RDPMC vs. mm_struct tracking
2017-08-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull watchdog fix from Thomas Gleixner: "A fix for the hardlockup watchdog to prevent false positives with extreme Turbo-Modes which make the perf/NMI watchdog fire faster than the hrtimer which is used to verify. Slightly larger than the minimal fix, which just would increase the hrtimer frequency, but comes with extra overhead of more watchdog timer interrupts and thread wakeups for all users. With this change we restrict the overhead to the extreme Turbo-Mode systems" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/watchdog: Prevent false positives with turbo modes
2017-08-20net/mlx5e: Add outbound PCI buffer overflow counterEran Ben Elisha1-2/+4
Add outbound_pci_buffer_overflow to ethtool output for monitoring the number of packets that were dropped due to lack of PCIe buffers on receive path from NIC port toward the host(s). This counter is valid only in case that tx_overflow_buffer_pkt is supported in MCAM enhanced features. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-20net/mlx5: Add RX buffer fullness counters infrastructureGal Pressman1-2/+13
Add capability bit in PCAM register and counters to PPCNT register. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-20net/mlx5: Add PCIe outbound stalls counters infrastructureGal Pressman1-3/+14
Add capability bit in MCAM register and counters to MPCNT register. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-20bpf: linux/bpf.h needs linux/numa.hDavid S. Miller1-0/+1
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-20bpf: Allow selecting numa node during map creationMartin KaFai Lau1-1/+9
The current map creation API does not allow to provide the numa-node preference. The memory usually comes from where the map-creation-process is running. The performance is not ideal if the bpf_prog is known to always run in a numa node different from the map-creation-process. One of the use case is sharding on CPU to different LRU maps (i.e. an array of LRU maps). Here is the test result of map_perf_test on the INNER_LRU_HASH_PREALLOC test if we force the lru map used by CPU0 to be allocated from a remote numa node: [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ] ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000 5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec 4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec 3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec 6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec 2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec 1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec 7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec 0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec #<<< After specifying numa node: ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000 5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec 3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec 1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec 6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec 2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec 4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec 7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec 0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<< This patch adds one field, numa_node, to the bpf_attr. Since numa node 0 is a valid node, a new flag BPF_F_NUMA_NODE is also added. The numa_node field is honored if and only if the BPF_F_NUMA_NODE flag is set. Numa node selection is not supported for percpu map. This patch does not change all the kmalloc. F.e. 'htab = kzalloc()' is not changed since the object is small enough to stay in the cache. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19net: drop unused attribute argument from sysfs queue funcsstephen hemminger1-3/+2
The show and store functions don't need/use the attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19net: constify net_ns_type_operationsstephen hemminger1-1/+1
This can be const. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19net: constify netdev_class_filestephen hemminger1-4/+4
These functions are wrapper arount class_create_file which can take a const attribute. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19net: ethtool: Add macro to clear a link mode settingLendacky, Thomas1-0/+11
There are currently macros to set and test an ETHTOOL_LINK_MODE_ setting, but not to clear one. Add a macro to clear an ETHTOOL_LINK_MODE_ setting. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19mm, oom: fix potential data corruption when oom_reaper races with writerMichal Hocko1-0/+22
Wenwei Tao has noticed that our current assumption that the oom victim is dying and never doing any visible changes after it dies, and so the oom_reaper can tear it down, is not entirely true. __task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set but do_group_exit sends SIGKILL to all threads _after_ the flag is set. So there is a race window when some threads won't have fatal_signal_pending while the oom_reaper could start unmapping the address space. Moreover some paths might not check for fatal signals before each PF/g-u-p/copy_from_user. We already have a protection for oom_reaper vs. PF races by checking MMF_UNSTABLE. This has been, however, checked only for kernel threads (use_mm users) which can outlive the oom victim. A simple fix would be to extend the current check in handle_mm_fault for all tasks but that wouldn't be sufficient because the current check assumes that a kernel thread would bail out after EFAULT from get_user*/copy_from_user and never re-read the same address which would succeed because the PF path has established page tables already. This seems to be the case for the only existing use_mm user currently (virtio driver) but it is rather fragile in general. This is even more fragile in general for more complex paths such as generic_perform_write which can re-read the same address more times (e.g. iov_iter_copy_from_user_atomic to fail and then iov_iter_fault_in_readable on retry). Therefore we have to implement MMF_UNSTABLE protection in a robust way and never make a potentially corrupted content visible. That requires to hook deeper into the PF path and check for the flag _every time_ before a pte for anonymous memory is established (that means all !VM_SHARED mappings). The corruption can be triggered artificially (http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp) but there doesn't seem to be any real life bug report. The race window should be quite tight to trigger most of the time. Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org Fixes: aac453635549 ("mm, oom: introduce oom reaper") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-19mm: discard memblock data laterPavel Tatashin1-2/+4
There is existing use after free bug when deferred struct pages are enabled: The memblock_add() allocates memory for the memory array if more than 128 entries are needed. See comment in e820__memblock_setup(): * The bootstrap memblock region count maximum is 128 entries * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries * than that - so allow memblock resizing. This memblock memory is freed here: free_low_memory_core_early() We access the freed memblock.memory later in boot when deferred pages are initialized in this path: deferred_init_memmap() for_each_mem_pfn_range() __next_mem_pfn_range() type = &memblock.memory; One possible explanation for why this use-after-free hasn't been hit before is that the limit of INIT_MEMBLOCK_REGIONS has never been exceeded at least on systems where deferred struct pages were enabled. Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128, and verifying in qemu that this code is getting excuted and that the freed pages are sane. Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com Fixes: 7e18adb4f80b ("mm: meminit: initialise remaining struct pages in parallel with kswapd") Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-19wait: add wait_event_killable_timeout()Luis R. Rodriguez1-0/+37
These are the few pending fixes I have queued up for v4.13-final. One is a a generic regression fix for recursive loops on kmod and the other one is a trivial print out correction. During the v4.13 development we assumed that recursive kmod loops were no longer possible. Clearly that is not true. The regression fix makes use of a new killable wait. We use a killable wait to be paranoid in how signals might be sent to modprobe and only accept a proper SIGKILL. The signal will only be available to userspace to issue *iff* a thread has already entered a wait state, and that happens only if we've already throttled after 50 kmod threads have been hit. Note that although it may seem excessive to trigger a failure afer 5 seconds if all kmod thread remain busy, prior to the series of changes that went into v4.13 we would actually *always* fatally fail any request which came in if the limit was already reached. The new waiting implemented in v4.13 actually gives us *more* breathing room -- the wait for 5 seconds is a wait for *any* kmod thread to finish. We give up and fail *iff* no kmod thread has finished and they're *all* running straight for 5 consecutive seconds. If 50 kmod threads are running consecutively for 5 seconds something else must be really bad. Recursive loops with kmod are bad but they're also hard to implement properly as a selftest without currently fooling current userspace tools like kmod [1]. For instance kmod will complain when you run depmod if it finds a recursive loop with symbol dependency between modules as such this type of recursive loop cannot go upstream as the modules_install target will fail after running depmod. These tests already exist on userspace kmod upstream though (refer to the testsuite/module-playground/mod-loop-*.c files). The same is not true if request_module() is used though, or worst if aliases are used. Likewise the issue with 64-bit kernels booting 32-bit userspace without a binfmt handler built-in is also currently not detected and proactively avoided by userspace kmod tools, or kconfig for all architectures. Although we could complain in the kernel when some of these individual recursive issues creep up, proactively avoiding these situations in userspace at build time is what we should keep striving for. Lastly, since recursive loops could happen with kmod it may mean recursive loops may also be possible with other kernel usermode helpers, this should be investigated and long term if we can come up with a more sensible generic solution even better! [0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20170809-kmod-for-v4.13-final [1] https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git This patch (of 3): This wait is similar to wait_event_interruptible_timeout() but only accepts SIGKILL interrupt signal. Other signals are ignored. Link: http://lkml.kernel.org/r/20170809234635.13443-2-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Matt Redfearn <matt.redfearn@imgtec.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Daniel Mentz <danielmentz@google.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Matt Redfearn <matt.redfearn@imgetc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-19mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback()Johannes Weiner1-2/+8
Jaegeuk and Brad report a NULL pointer crash when writeback ending tries to update the memcg stats: BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0 IP: test_clear_page_writeback+0x12e/0x2c0 [...] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0 Call Trace: <IRQ> end_page_writeback+0x47/0x70 f2fs_write_end_io+0x76/0x180 [f2fs] bio_endio+0x9f/0x120 blk_update_request+0xa8/0x2f0 scsi_end_request+0x39/0x1d0 scsi_io_completion+0x211/0x690 scsi_finish_command+0xd9/0x120 scsi_softirq_done+0x127/0x150 __blk_mq_complete_request_remote+0x13/0x20 flush_smp_call_function_queue+0x56/0x110 generic_smp_call_function_single_interrupt+0x13/0x30 smp_call_function_single_interrupt+0x27/0x40 call_function_single_interrupt+0x89/0x90 RIP: 0010:native_safe_halt+0x6/0x10 (gdb) l *(test_clear_page_writeback+0x12e) 0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619). 614 mod_node_page_state(page_pgdat(page), idx, val); 615 if (mem_cgroup_disabled() || !page->mem_cgroup) 616 return; 617 mod_memcg_state(page->mem_cgroup, idx, val); 618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)]; 619 this_cpu_add(pn->lruvec_stat->count[idx], val); 620 } 621 622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, 623 gfp_t gfp_mask, The issue is that writeback doesn't hold a page reference and the page might get freed after PG_writeback is cleared (and the mapping is unlocked) in test_clear_page_writeback(). The stat functions looking up the page's node or zone are safe, as those attributes are static across allocation and free cycles. But page->mem_cgroup is not, and it will get cleared if we race with truncation or migration. It appears this race window has been around for a while, but less likely to trigger when the memcg stats were updated first thing after PG_writeback is cleared. Recent changes reshuffled this code to update the global node stats before the memcg ones, though, stretching the race window out to an extent where people can reproduce the problem. Update test_clear_page_writeback() to look up and pin page->mem_cgroup before clearing PG_writeback, then not use that pointer afterward. It is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()") but leaves the pageref-holding callsites that aren't affected alone. Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Jaegeuk Kim <jaegeuk@kernel.org> Tested-by: Jaegeuk Kim <jaegeuk@kernel.org> Reported-by: Bradley Bolen <bradleybolen@gmail.com> Tested-by: Brad Bolen <bradleybolen@gmail.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> [4.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18kernel/watchdog: Prevent false positives with turbo modesThomas Gleixner1-0/+8
The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-17pty: fix the cached path of the pty slave file descriptor in the masterLinus Torvalds1-1/+1
Christian Brauner reported that if you use the TIOCGPTPEER ioctl() to get a slave pty file descriptor, the resulting file descriptor doesn't look right in /proc/<pid>/fd/<fd>. In particular, he wanted to use readlink() on /proc/self/fd/<fd> to get the pathname of the slave pty (basically implementing "ptsname{_r}()"). The reason for that was that we had generated the wrong 'struct path' when we create the pty in ptmx_open(). In particular, the dentry was correct, but the vfsmount pointed to the mount of the ptmx node. That _can_ be correct - in case you use "/dev/pts/ptmx" to open the master - but usually is not. The normal case is to use /dev/ptmx, which then looks up the pts/ directory, and then the vfsmount of the ptmx node is obviously the /dev directory, not the /dev/pts/ directory. We actually did have the right vfsmount available, but in the wrong place (it gets looked up in 'devpts_acquire()' when we get a reference to the pts filesystem), and so ptmx_open() used the wrong mnt pointer. The end result of this confusion was that the pty worked fine, but when if you did TIOCGPTPEER to get the slave side of the pty, end end result would also work, but have that dodgy 'struct path'. And then when doing "d_path()" on to get the pathname, the vfsmount would not match the root of the pts directory, and d_path() would return an empty pathname thinking that the entry had escaped a bind mount into another mount. This fixes the problem by making devpts_acquire() return the vfsmount for the pts filesystem, allowing ptmx_open() to trivially just use the right mount for the pts dentry, and create the proper 'struct path'. Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17ptr_ring: use kmalloc_array()Eric Dumazet2-5/+7
As found by syzkaller, malicious users can set whatever tx_queue_len on a tun device and eventually crash the kernel. Lets remove the ALIGN(XXX, SMP_CACHE_BYTES) thing since a small ring buffer is not fast anyway. Fixes: 2e0ab8ca83c1 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-17vmbus: remove unused vmbus_sendpacket_ctlstephen hemminger1-7/+0
The only usage of vmbus_sendpacket_ctl was by vmbus_sendpacket. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-17vmbus: remove unused vmubs_sendpacket_pagebuffer_ctlstephen hemminger1-8/+0
The function vmbus_sendpacket_pagebuffer_ctl was never used directly. Just have vmbus_send_pagebuffer Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-17vmbus: remove unused vmbus_sendpacket_multipagebufferstephen hemminger1-6/+0
This function is not used anywhere in current code. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-17bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSERJohn Fastabend2-1/+11
Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER net/core/filter.c: In function ‘do_sk_redirect_map’: net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration] sk = __sock_map_lookup_elem(ri->map, ri->ifindex); ^ net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default] sk = __sock_map_lookup_elem(ri->map, ri->ifindex); Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf: sockmap with sk redirect supportJohn Fastabend3-2/+8
Recently we added a new map type called dev map used to forward XDP packets between ports (6093ec2dc313). This patches introduces a similar notion for sockets. A sockmap allows users to add participating sockets to a map. When sockets are added to the map enough context is stored with the map entry to use the entry with a new helper bpf_sk_redirect_map(map, key, flags) This helper (analogous to bpf_redirect_map in XDP) is given the map and an entry in the map. When called from a sockmap program, discussed below, the skb will be sent on the socket using skb_send_sock(). With the above we need a bpf program to call the helper from that will then implement the send logic. The initial site implemented in this series is the recv_sock hook. For this to work we implemented a map attach command to add attributes to a map. In sockmap we add two programs a parse program and a verdict program. The parse program uses strparser to build messages and pass them to the verdict program. The parse programs use the normal strparser semantics. The verdict program is of type SK_SKB. The verdict program returns a verdict SK_DROP, or SK_REDIRECT for now. Additional actions may be added later. When SK_REDIRECT is returned, expected when bpf program uses bpf_sk_redirect_map(), the sockmap logic will consult per cpu variables set by the helper routine and pull the sock entry out of the sock map. This pattern follows the existing redirect logic in cls and xdp programs. This gives the flow, recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock \ -> kfree_skb As an example use case a message based load balancer may use specific logic in the verdict program to select the sock to send on. Sample programs are provided in future patches that hopefully illustrate the user interfaces. Also selftests are in follow-on patches. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf: export bpf_prog_inc_not_zeroJohn Fastabend1-0/+7
bpf_prog_inc_not_zero will be used by upcoming sockmap patches this patch simply exports it so we can pull it in. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf: introduce new program type for skbs on socketsJohn Fastabend1-0/+1
A class of programs, run from strparser and soon from a new map type called sock map, are used with skb as the context but on established sockets. By creating a specific program type for these we can use bpf helpers that expect full sockets and get the verifier to ensure these helpers are not used out of context. The new type is BPF_PROG_TYPE_SK_SKB. This patch introduces the infrastructure and type. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller8-28/+62
2017-08-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2-1/+4
Pull networking fixes from David Miller: 1) Fix TCP checksum offload handling in iwlwifi driver, from Emmanuel Grumbach. 2) In ksz DSA tagging code, free SKB if skb_put_padto() fails. From Vivien Didelot. 3) Fix two regressions with bonding on wireless, from Andreas Born. 4) Fix build when busypoll is disabled, from Daniel Borkmann. 5) Fix copy_linear_skb() wrt. SO_PEEK_OFF, from Eric Dumazet. 6) Set SKB cached route properly in inet_rtm_getroute(), from Florian Westphal. 7) Fix PCI-E relaxed ordering handling in cxgb4 driver, from Ding Tianhong. 8) Fix module refcnt leak in ULP code, from Sabrina Dubroca. 9) Fix use of GFP_KERNEL in atomic contexts in AF_KEY code, from Eric Dumazet. 10) Need to purge socket write queue in dccp_destroy_sock(), also from Eric Dumazet. 11) Make bpf_trace_printk() work properly on 32-bit architectures, from Daniel Borkmann. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits) bpf: fix bpf_trace_printk on 32 bit archs PCI: fix oops when try to find Root Port for a PCI device sfc: don't try and read ef10 data on non-ef10 NIC net_sched: remove warning from qdisc_hash_add net_sched/sfq: update hierarchical backlog when drop packet net_sched: reset pointers to tcf blocks in classful qdiscs' destructors ipv4: fix NULL dereference in free_fib_info_rcu() net: Fix a typo in comment about sock flags. ipv6: fix NULL dereference in ip6_route_dev_notify() tcp: fix possible deadlock in TCP stack vs BPF filter dccp: purge write queue in dccp_destroy_sock() udp: fix linear skb reception with PEEK_OFF ipv6: release rt6->rt6i_idev properly during ifdown af_key: do not use GFP_KERNEL in atomic contexts tcp: ulp: avoid module refcnt leak in tcp_set_ulp net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag PCI: Disable Relaxed Ordering Attributes for AMD A1100 PCI: Disable Relaxed Ordering for some Intel processors PCI: Disable PCIe Relaxed Ordering if unsupported ...
2017-08-16net: Fix a typo in comment about sock flags.Tonghao Zhang1-1/+1
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf/verifier: track liveness for pruningEdward Cree1-1/+10
State of a register doesn't matter if it wasn't read in reaching an exit; a write screens off all reads downstream of it from all explored_states upstream of it. This allows us to prune many more branches; here are some processed insn counts for some Cilium programs: Program before after bpf_lb_opt_-DLB_L3.o 6515 3361 bpf_lb_opt_-DLB_L4.o 8976 5176 bpf_lb_opt_-DUNKNOWN.o 2960 1137 bpf_lxc_opt_-DDROP_ALL.o 95412 48537 bpf_lxc_opt_-DUNKNOWN.o 141706 78718 bpf_netdev.o 24251 17995 bpf_overlay.o 10999 9385 The runtime is also improved; here are 'time' results in ms: Program before after bpf_lb_opt_-DLB_L3.o 24 6 bpf_lb_opt_-DLB_L4.o 26 11 bpf_lb_opt_-DUNKNOWN.o 11 2 bpf_lxc_opt_-DDROP_ALL.o 1288 139 bpf_lxc_opt_-DUNKNOWN.o 1768 234 bpf_netdev.o 62 31 bpf_overlay.o 15 13 Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15PCI: Disable PCIe Relaxed Ordering if unsupporteddingtianhong1-0/+3
When bit4 is set in the PCIe Device Control register, it indicates whether the device is permitted to use relaxed ordering. On some platforms using relaxed ordering can have performance issues or due to erratum can cause data-corruption. In such cases devices must avoid using relaxed ordering. The patch adds a new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING to indicate that Relaxed Ordering (RO) attribute should not be used for Transaction Layer Packets (TLP) targeted towards these affected root complexes. This patch checks if there is any node in the hierarchy that indicates that using relaxed ordering is not safe. In such cases the patch turns off the relaxed ordering by clearing the capability for this device. Signed-off-by: Casey Leedom <leedom@chelsio.com> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Acked-by: Ashok Raj <ashok.raj@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Casey Leedom <leedom@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-14net: export some generic xdp helpersJason Wang1-0/+2
This patch tries to export some generic xdp helpers to drivers. This can let driver to do XDP for a specific skb. This is useful for the case when the packet is hard to be processed at page level directly (e.g jumbo/GSO frame). With this patch, there's no need for driver to forbid the XDP set when configuration is not suitable. Instead, it can defer the XDP for packets that is hard to be processed directly after skb is created. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-13Merge tag 'tty-4.13-rc5' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial fixes from Greg KH: "Here are two tty serial driver fixes for 4.13-rc5. One is a revert of a -rc1 patch that turned out to not be a good idea, and the other is a fix for the pl011 serial driver. Both have been in linux-next with no reported issues" * tag 'tty-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: Revert "serial: Delete dead code for CIR serial ports" tty: pl011: fix initialization order of QDF2400 E44
2017-08-13Merge tag 'staging-4.13-rc5' of ↵Linus Torvalds2-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging/iio fixes from Greg KH: "Here are some Staging and IIO driver fixes for 4.13-rc5. Nothing major, just a number of small fixes for reported issues. All of these have been in linux-next for a while now with no reported issues. Full details are in the shortlog" * tag 'staging-4.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: comedi: comedi_fops: do not call blocking ops when !TASK_RUNNING iio: aspeed-adc: wait for initial sequence. iio: accel: bmc150: Always restore device to normal mode after suspend-resume staging:iio:resolver:ad2s1210 fix negative IIO_ANGL_VEL read iio: adc: axp288: Fix the GPADC pin reading often wrongly returning 0 iio: adc: vf610_adc: Fix VALT selection value for REFSEL bits iio: accel: st_accel: add SPI-3wire support iio: adc: Revert "axp288: Drop bogus AXP288_ADC_TS_PIN_CTRL register modifications" iio: adc: sun4i-gpadc-iio: fix unbalanced irq enable/disable iio: pressure: st_pressure_core: disable multiread by default for LPS22HB iio: light: tsl2563: use correct event code
2017-08-11Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+7
Pull block fixes from Jens Axboe: "A set of fixes that should go into this series. This contains: - Fix from Bart for blk-mq requeue queue running, preventing a continued loop of run/restart. - Fix for a bio/blk-integrity issue, in two parts. One from Christoph, fixing where verification happens, and one from Milan, for a NULL profile. - NVMe pull request, most of the changes being for nvme-fc, but also a few trivial core/pci fixes" * 'for-linus' of git://git.kernel.dk/linux-block: nvme: fix directive command numd calculation nvme: fix nvme reset command timeout handling nvme-pci: fix CMB sysfs file removal in reset path lpfc: support nvmet_fc defer_rcv callback nvmet_fc: add defer_req callback for deferment of cmd buffer return nvme: strip trailing 0-bytes in wwid_show block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time bio-integrity: only verify integrity on the lowest stacked driver bio-integrity: Fix regression if profile verify_fn is NULL
2017-08-11Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-linusJens Axboe1-0/+7
Pull NVMe fixes from Christoph: "A few more small fixes - the fc/lpfc update is the biggest by far."
2017-08-11Merge tag 'drm-fixes-for-v4.13-rc5' of ↵Linus Torvalds1-1/+2
git://people.freedesktop.org/~airlied/linux Pull drm fixes from Dave Airlie: "Nothing too earth shattering here, it just seems like lots of little things all over the place. msm has probably the larger amount of changes, but they all seem fine, otherwise, some rockchip, i915, etnaviv and exynos fixes, along with one nouveau regression fix for some older GPUs" * tag 'drm-fixes-for-v4.13-rc5' of git://people.freedesktop.org/~airlied/linux: (35 commits) drm/nouveau/disp/nv04: avoid creation of output paths drm: make DRM_STM default n drm/exynos: forbid creating framebuffers from too small GEM buffers drm/etnaviv: Fix off-by-one error in reloc checking drm/i915: fix backlight invert for non-zero minimum brightness drm/i915/shrinker: Wrap need_resched() inside preempt-disable drm/i915/perf: fix flex eu registers programming drm/i915: Fix out-of-bounds array access in bdw_load_gamma_lut drm/i915/gvt: Change the max length of mmio_reg_rw from 4 to 8 drm/i915/gvt: Initialize MMIO Block with HW state drm/rockchip: vop: report error when check resource error drm/rockchip: vop: round_up pitches to word align drm/rockchip: vop: fix NV12 video display error drm/rockchip: vop: fix iommu page fault when resume drm/i915/gvt: clean workload queue if error happened drm/i915/gvt: change resetting to resetting_eng drm/msm: gpu: don't abuse dma_alloc for non-DMA allocations drm/msm: gpu: call qcom_mdt interfaces only for ARCH_QCOM drm/msm/adreno: Prevent unclocked access when retrieving timestamps drm/msm: Remove __user from __u64 data types ...
2017-08-11Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-26/+38
Merge misc fixes from Andrew Morton: "21 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits) userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage zram: rework copy of compressor name in comp_algorithm_store() rmap: do not call mmu_notifier_invalidate_page() under ptl mm: fix list corruptions on shmem shrinklist mm/balloon_compaction.c: don't zero ballooned pages MAINTAINERS: copy virtio on balloon_compaction.c mm: fix KSM data corruption mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem mm: make tlb_flush_pending global mm: refactor TLB gathering API Revert "mm: numa: defer TLB flush for THP migration as long as possible" mm: migrate: fix barriers around tlb_flush_pending mm: migrate: prevent racy access to tlb_flush_pending fault-inject: fix wrong should_fail() decision in task context test_kmod: fix small memory leak on filesystem tests test_kmod: fix the lock in register_test_dev_kmod() test_kmod: fix bug which allows negative values on two config options test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case mm: ratelimit PFNs busy info message ...
2017-08-11mm: fix MADV_[FREE|DONTNEED] TLB flush miss problemMinchan Kim1-0/+8
Nadav reported parallel MADV_DONTNEED on same range has a stale TLB problem and Mel fixed it[1] and found same problem on MADV_FREE[2]. Quote from Mel Gorman: "The race in question is CPU 0 running madv_free and updating some PTEs while CPU 1 is also running madv_free and looking at the same PTEs. CPU 1 may have writable TLB entries for a page but fail the pte_dirty check (because CPU 0 has updated it already) and potentially fail to flush. Hence, when madv_free on CPU 1 returns, there are still potentially writable TLB entries and the underlying PTE is still present so that a subsequent write does not necessarily propagate the dirty bit to the underlying PTE any more. Reclaim at some unknown time at the future may then see that the PTE is still clean and discard the page even though a write has happened in the meantime. I think this is possible but I could have missed some protection in madv_free that prevents it happening." This patch aims for solving both problems all at once and is ready for other problem with KSM, MADV_FREE and soft-dirty story[3]. TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can catch there are parallel threads going on. In that case, forcefully, flush TLB to prevent for user to access memory via stale TLB entry although it fail to gather page table entry. I confirmed this patch works with [4] test program Nadav gave so this patch supersedes "mm: Always flush VMA ranges affected by zap_page_range v2" in current mmotm. NOTE: This patch modifies arch-specific TLB gathering interface(x86, ia64, s390, sh, um). It seems most of architecture are straightforward but s390 need to be careful because tlb_flush_mmu works only if mm->context.flush_mm is set to non-zero which happens only a pte entry really is cleared by ptep_get_and_clear and friends. However, this problem never changes the pte entries but need to flush to prevent memory access from stale tlb. [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com [4] https://patchwork.kernel.org/patch/9861621/ [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu] Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: Nadav Amit <namit@vmware.com> Reported-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Jeff Dike <jdike@addtoit.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-11mm: make tlb_flush_pending globalMinchan Kim1-21/+0
Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING| COMPACTION] but upcoming patches to solve subtle TLB flush batching problem will use it regardless of compaction/NUMA so this patch doesn't remove the dependency. [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement] Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-11mm: refactor TLB gathering APIMinchan Kim1-0/+6
This patch is a preparatory patch for solving race problems caused by TLB batch. For that, we will increase/decrease TLB flush pending count of mm_struct whenever tlb_[gather|finish]_mmu is called. Before making it simple, this patch separates architecture specific part and rename it to arch_tlb_[gather|finish]_mmu and generic part just calls it. It shouldn't change any behavior. Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Jeff Dike <jdike@addtoit.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-11mm: migrate: fix barriers around tlb_flush_pendingNadav Amit1-4/+10
Reading tlb_flush_pending while the page-table lock is taken does not require a barrier, since the lock/unlock already acts as a barrier. Removing the barrier in mm_tlb_flush_pending() to address this issue. However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending() while the page-table lock is already released, which may present a problem on architectures with weak memory model (PPC). To deal with this case, a new parameter is added to mm_tlb_flush_pending() to indicate if it is read without the page-table lock taken, and calling smp_mb__after_unlock_lock() in this case. Link: http://lkml.kernel.org/r/20170802000818.4760-3-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-11mm: migrate: prevent racy access to tlb_flush_pendingNadav Amit1-9/+22
Patch series "fixes of TLB batching races", v6. It turns out that Linux TLB batching mechanism suffers from various races. Races that are caused due to batching during reclamation were recently handled by Mel and this patch-set deals with others. The more fundamental issue is that concurrent updates of the page-tables allow for TLB flushes to be batched on one core, while another core changes the page-tables. This other core may assume a PTE change does not require a flush based on the updated PTE value, while it is unaware that TLB flushes are still pending. This behavior affects KSM (which may result in memory corruption) and MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED. Memory corruption in KSM is harder to produce in practice, but was observed by hacking the kernel and adding a delay before flushing and replacing the KSM page. Finally, there is also one memory barrier missing, which may affect architectures with weak memory model. This patch (of 7): Setting and clearing mm->tlb_flush_pending can be performed by multiple threads, since mmap_sem may only be acquired for read in task_numa_work(). If this happens, tlb_flush_pending might be cleared while one of the threads still changes PTEs and batches TLB flushes. This can lead to the same race between migration and change_protection_range() that led to the introduction of tlb_flush_pending. The result of this race was data corruption, which means that this patch also addresses a theoretically possible data corruption. An actual data corruption was not observed, yet the race was was confirmed by adding assertion to check tlb_flush_pending is not set by two threads, adding artificial latency in change_protection_range() and using sysctl to reduce kernel.numa_balancing_scan_delay_ms. Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range") Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Russell King <linux@armlinux.org.uk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-11Merge tag 'pci-v4.13-fixes-2' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fix from Bjorn Helgaas: "Work around Renesas uPD72020x 32-bit DMA issue" * tag 'pci-v4.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: xhci: Reset Renesas uPD72020x USB controller for 32-bit DMA issue PCI: Add pci_reset_function_locked()
2017-08-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-3/+3
Mainline had UFO fixes, but UFO is removed in net-next so we take the HEAD hunks. Minor context conflict in bcmsysport statistics bug fix. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10perf/x86: Fix RDPMC vs. mm_struct trackingPeter Zijlstra1-2/+2
Vince reported the following rdpmc() testcase failure: > Failing test case: > > fd=perf_event_open(); > addr=mmap(fd); > exec() // without closing or unmapping the event > fd=perf_event_open(); > addr=mmap(fd); > rdpmc() // GPFs due to rdpmc being disabled The problem is of course that exec() plays tricks with what is current->mm, only destroying the old mappings after having installed the new mm. Fix this confusion by passing along vma->vm_mm instead of relying on current->mm. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: 1e0fb9ec679c ("perf: Add pmu callbacks to track event mapping and unmapping") Link: http://lkml.kernel.org/r/20170802173930.cstykcqefmqt7jau@hirez.programming.kicks-ass.net [ Minor cleanups. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10nvmet_fc: add defer_req callback for deferment of cmd buffer returnJames Smart1-0/+7
At queue creation, the transport allocates a local job struct (struct nvmet_fc_fcp_iod) for each possible element of the queue. When a new CMD is received from the wire, a jobs struct is allocated from the queue and then used for the duration of the command. The job struct contains buffer space for the wire command iu. Thus, upon allocation of the job struct, the cmd iu buffer is copied to the job struct and the LLDD may immediately free/reuse the CMD IU buffer passed in the call. However, in some circumstances, due to the packetized nature of FC and the api of the FC LLDD which may issue a hw command to send the wire response, but the LLDD may not get the hw completion for the command and upcall the nvmet_fc layer before a new command may be asynchronously received on the wire. In other words, its possible for the initiator to get the response from the wire, thus believe a command slot free, and send a new command iu. The new command iu may be received by the LLDD and passed to the transport before the LLDD had serviced the hw completion and made the teardown calls for the original job struct. As such, there is no available job struct available for the new io. E.g. it appears like the host sent more queue elements than the queue size. It didn't based on it's understanding. Rather than treat this as a hard connection failure queue the new request until the job struct does free up. As the buffer isn't copied as there's no job struct, a special return value must be returned to the LLDD to signify to hold off on recycling the cmd iu buffer. And later, when a job struct is allocated and the buffer copied, a new LLDD callback is introduced to notify the LLDD and allow it to recycle it's command iu buffer. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-10Merge tag 'drm-misc-fixes-2017-08-08' of ↵Dave Airlie1-1/+2
git://anongit.freedesktop.org/git/drm-misc into drm-fixes Core Changes: - dma-buf: Allow multiple sync_files to wrap a single dma-fence (Chris) Driver Changes: - rockchip: misc fixes to vop driver from the downstream rockchip tree (Mark) - Error path cleanups to tc358767 & host1x (Lucas & Paul, respectively) * tag 'drm-misc-fixes-2017-08-08' of git://anongit.freedesktop.org/git/drm-misc: drm/rockchip: vop: report error when check resource error drm/rockchip: vop: round_up pitches to word align drm/rockchip: vop: fix NV12 video display error drm/rockchip: vop: fix iommu page fault when resume dma-buf/sync_file: Allow multiple sync_files to wrap a single dma-fence drm/bridge: tc358767: fix probe without attached output node