summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2018-05-17bpf: sockmap, on update propagate errors back to userspaceJohn Fastabend1-1/+1
When an error happens in the update sockmap element logic also pass the err up to the user. Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17bpf: fix sock hashmap kmalloc warningYonghong Song1-0/+6
syzbot reported a kernel warning below: WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 panic+0x22f/0x4de kernel/panic.c:184 __warn.cold.8+0x163/0x1b3 kernel/panic.c:536 report_bug+0x252/0x2d0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996 RSP: 0018:ffff8801d907fc58 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8801aeecb280 RCX: ffffffff8185ebd7 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffe1 RBP: ffff8801d907fc58 R08: ffff8801adb5e1c0 R09: ffffed0035a84700 R10: ffffed0035a84700 R11: ffff8801ad423803 R12: ffff8801aeecb280 R13: 00000000fffffff4 R14: ffff8801ad891a00 R15: 00000000014200c0 __do_kmalloc mm/slab.c:3713 [inline] __kmalloc+0x25/0x760 mm/slab.c:3727 kmalloc include/linux/slab.h:517 [inline] map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858 __do_sys_bpf kernel/bpf/syscall.c:2131 [inline] __se_sys_bpf kernel/bpf/syscall.c:2096 [inline] __x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe The test case is against sock hashmap with a key size 0xffffffe1. Such a large key size will cause the below code in function sock_hash_alloc() overflowing and produces a smaller elem_size, hence map creation will be successful. htab->elem_size = sizeof(struct htab_elem) + round_up(htab->map.key_size, 8); Later, when map_get_next_key is called and kernel tries to allocate the key unsuccessfully, it will issue the above warning. Similar to hashtab, ensure the key size is at most MAX_BPF_STACK for a successful map creation. Fixes: 81110384441a ("bpf: sockmap, add hash map support") Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-16locking/percpu-rwsem: Annotate rwsem ownership transfer by setting ↵Waiman Long1-0/+2
RWSEM_OWNER_UNKNOWN The filesystem freezing code needs to transfer ownership of a rwsem embedded in a percpu-rwsem from the task that does the freezing to another one that does the thawing by calling percpu_rwsem_release() after freezing and percpu_rwsem_acquire() before thawing. However, the new rwsem debug code runs afoul with this scheme by warning that the task that releases the rwsem isn't the one that acquires it, as reported by Amir Goldstein: DEBUG_LOCKS_WARN_ON(sem->owner != get_current()) WARNING: CPU: 1 PID: 1401 at /home/amir/build/src/linux/kernel/locking/rwsem.c:133 up_write+0x59/0x79 Call Trace: percpu_up_write+0x1f/0x28 thaw_super_locked+0xdf/0x120 do_vfs_ioctl+0x270/0x5f1 ksys_ioctl+0x52/0x71 __x64_sys_ioctl+0x16/0x19 do_syscall_64+0x5d/0x167 entry_SYSCALL_64_after_hwframe+0x49/0xbe To work properly with the rwsem debug code, we need to annotate that the rwsem ownership is unknown during the tranfer period until a brave soul comes forward to acquire the ownership. During that period, optimistic spinning will be disabled. Reported-by: Amir Goldstein <amir73il@gmail.com> Tested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Theodore Y. Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-fsdevel@vger.kernel.org Link: http://lkml.kernel.org/r/1526420991-21213-3-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-16locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flagWaiman Long3-21/+28
There are use cases where a rwsem can be acquired by one task, but released by another task. In thess cases, optimistic spinning may need to be disabled. One example will be the filesystem freeze/thaw code where the task that freezes the filesystem will acquire a write lock on a rwsem and then un-owns it before returning to userspace. Later on, another task will come along, acquire the ownership, thaw the filesystem and release the rwsem. Bit 0 of the owner field was used to designate that it is a reader owned rwsem. It is now repurposed to mean that the owner of the rwsem is not known. If only bit 0 is set, the rwsem is reader owned. If bit 0 and other bits are set, it is writer owned with an unknown owner. One such value for the latter case is (-1L). So we can set owner to 1 for reader-owned, -1 for writer-owned. The owner is unknown in both cases. To handle transfer of rwsem ownership, the higher level code should set the owner field to -1 to indicate a write-locked rwsem with unknown owner. Optimistic spinning will be disabled in this case. Once the higher level code figures who the new owner is, it can then set the owner field accordingly. Tested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Theodore Y. Ts'o <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-fsdevel@vger.kernel.org Link: http://lkml.kernel.org/r/1526420991-21213-2-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-15tick/broadcast: Use for_each_cpu() specially on UP kernelsDexuan Cui1-0/+8
for_each_cpu() unintuitively reports CPU0 as set independent of the actual cpumask content on UP kernels. This causes an unexpected PIT interrupt storm on a UP kernel running in an SMP virtual machine on Hyper-V, and as a result, the virtual machine can suffer from a strange random delay of 1~20 minutes during boot-up, and sometimes it can hang forever. Protect if by checking whether the cpumask is empty before entering the for_each_cpu() loop. [ tglx: Use !IS_ENABLED(CONFIG_SMP) instead of #ifdeffery ] Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poulson <jopoulso@microsoft.com> Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org Cc: Rakib Mullick <rakib.mullick@gmail.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: KY Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Link: https://lkml.kernel.org/r/KL1P15301MB000678289FE55BA365B3279ABF990@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM Link: https://lkml.kernel.org/r/KL1P15301MB0006FA63BC22BEB64902EAA0BF930@KL1P15301MB0006.APCP153.PROD.OUTLOOK.COM
2018-05-15bpf: sockmap, add hash map supportJohn Fastabend3-17/+492
Sockmap is currently backed by an array and enforces keys to be four bytes. This works well for many use cases and was originally modeled after devmap which also uses four bytes keys. However, this has become limiting in larger use cases where a hash would be more appropriate. For example users may want to use the 5-tuple of the socket as the lookup key. To support this add hash support. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-15bpf: sockmap, refactor sockmap routines to work with hashmapJohn Fastabend1-60/+88
This patch only refactors the existing sockmap code. This will allow much of the psock initialization code path and bpf helper codes to work for both sockmap bpf map types that are backed by an array, the currently supported type, and the new hash backed bpf map type sockhash. Most the fallout comes from three changes, - Pushing bpf programs into an independent structure so we can use it from the htab struct in the next patch. - Generalizing helpers to use void *key instead of the hardcoded u32. - Instead of passing map/key through the metadata we now do the lookup inline. This avoids storing the key in the metadata which will be useful when keys can be longer than 4 bytes. We rename the sk pointers to sk_redir at this point as well to avoid any confusion between the current sk pointer and the redirect pointer sk_redir. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-15bpf: enable stackmap with build_id in nmi contextSong Liu1-6/+53
Currently, we cannot parse build_id in nmi context because of up_read(&current->mm->mmap_sem), this makes stackmap with build_id less useful. This patch enables parsing build_id in nmi by putting the up_read() call in irq_work. To avoid memory allocation in nmi context, we use per cpu variable for the irq_work. As a result, only one irq_work per cpu is allowed. If the irq_work is in-use, we fallback to only report ips. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-13Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds7-67/+91
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/pti updates from Thomas Gleixner: "A mixed bag of fixes and updates for the ghosts which are hunting us. The scheduler fixes have been pulled into that branch to avoid conflicts. - A set of fixes to address a khread_parkme() race which caused lost wakeups and loss of state. - A deadlock fix for stop_machine() solved by moving the wakeups outside of the stopper_lock held region. - A set of Spectre V1 array access restrictions. The possible problematic spots were discuvered by Dan Carpenters new checks in smatch. - Removal of an unused file which was forgotten when the rest of that functionality was removed" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Remove unused file perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map() perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_* perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[] sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[] sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[] sched/core: Introduce set_special_state() kthread, sched/wait: Fix kthread_parkme() completion issue kthread, sched/wait: Fix kthread_parkme() wait-loop sched/fair: Fix the update of blocked load when newly idle stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
2018-05-13Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-56/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Thomas Gleixner: "Revert the new NUMA aware placement approach which turned out to create more problems than it solved" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
2018-05-12Revert "sched/numa: Delay retrying placement for automatic NUMA balance ↵Mel Gorman1-56/+1
after wake_affine()" This reverts commit 7347fc87dfe6b7315e74310ee1243dc222c68086. Srikar Dronamra pointed out that while the commit in question did show a performance improvement on ppc64, it did so at the cost of disabling active CPU migration by automatic NUMA balancing which was not the intent. The issue was that a serious flaw in the logic failed to ever active balance if SD_WAKE_AFFINE was disabled on scheduler domains. Even when it's enabled, the logic is still bizarre and against the original intent. Investigation showed that fixing the patch in either the way he suggested, using the correct comparison for jiffies values or introducing a new numa_migrate_deferred variable in task_struct all perform similarly to a revert with a mix of gains and losses depending on the workload, machine and socket count. The original intent of the commit was to handle a problem whereby wake_affine, idle balancing and automatic NUMA balancing disagree on the appropriate placement for a task. This was particularly true for cases where a single task was a massive waker of tasks but where wake_wide logic did not apply. This was particularly noticeable when a futex (a barrier) woke all worker threads and tried pulling the wakees to the waker nodes. In that specific case, it could be handled by tuning MPI or openMP appropriately, but the behavior is not illogical and was worth attempting to fix. However, the approach was wrong. Given that we're at rc4 and a fix is not obvious, it's better to play safe, revert this commit and retry later. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: ggherdovich@suse.cz Cc: hpa@zytor.com Cc: matt@codeblueprint.co.uk Cc: mpe@ellerman.id.au Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+5
Merge misc fixes from Andrew Morton: "13 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: rbtree: include rcu.h scripts/faddr2line: fix error when addr2line output contains discriminator ocfs2: take inode cluster lock before moving reflinked inode from orphan dir mm, oom: fix concurrent munlock and oom reaper unmap, v3 mm: migrate: fix double call of radix_tree_replace_slot() proc/kcore: don't bounds check against address 0 mm: don't show nr_indirectly_reclaimable in /proc/vmstat mm: sections are not offlined during memory hotremove z3fold: fix reclaim lock-ups init: fix false positives in W+X checking lib/find_bit_benchmark.c: avoid soft lockup in test_find_first_bit() KASAN: prohibit KASAN+STRUCTLEAK combination MAINTAINERS: update Shuah's email address
2018-05-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller7-41/+67
The bpf syscall and selftests conflicts were trivial overlapping changes. The r8169 change involved moving the added mdelay from 'net' into a different function. A TLS close bug fix overlapped with the splitting of the TLS state into separate TX and RX parts. I just expanded the tests in the bug fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf == X". Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-12init: fix false positives in W+X checkingJeffrey Hugo1-0/+5
load_module() creates W+X mappings via __vmalloc_node_range() (from layout_and_allocate()->move_module()->module_alloc()) by using PAGE_KERNEL_EXEC. These mappings are later cleaned up via "call_rcu_sched(&freeinit->rcu, do_free_init)" from do_init_module(). This is a problem because call_rcu_sched() queues work, which can be run after debug_checkwx() is run, resulting in a race condition. If hit, the race results in a nasty splat about insecure W+X mappings, which results in a poor user experience as these are not the mappings that debug_checkwx() is intended to catch. This issue is observed on multiple arm64 platforms, and has been artificially triggered on an x86 platform. Address the race by flushing the queued work before running the arch-defined mark_rodata_ro() which then calls debug_checkwx(). Link: http://lkml.kernel.org/r/1525103946-29526-1-git-send-email-jhugo@codeaurora.org Fixes: e1a58320a38d ("x86/mm: Warn on W^X mappings") Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org> Reported-by: Timur Tabi <timur@codeaurora.org> Reported-by: Jan Glauber <jan.glauber@caviumnetworks.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-5/+14
Pull networking fixes from David Miller: 1) Verify lengths of keys provided by the user is AF_KEY, from Kevin Easton. 2) Add device ID for BCM89610 PHY. Thanks to Bhadram Varka. 3) Add Spectre guards to some ATM code, courtesy of Gustavo A. R. Silva. 4) Fix infinite loop in NSH protocol code. To Eric Dumazet we are most grateful for this fix. 5) Line up /proc/net/netlink headers properly. This fix from YU Bo, we do appreciate. 6) Use after free in TLS code. Once again we are blessed by the honorable Eric Dumazet with this fix. 7) Fix regression in TLS code causing stalls on partial TLS records. This fix is bestowed upon us by Andrew Tomt. 8) Deal with too small MTUs properly in LLC code, another great gift from Eric Dumazet. 9) Handle cached route flushing properly wrt. MTU locking in ipv4, to Hangbin Liu we give thanks for this. 10) Fix regression in SO_BINDTODEVIC handling wrt. UDP socket demux. Paolo Abeni, he gave us this. 11) Range check coalescing parameters in mlx4 driver, thank you Moshe Shemesh. 12) Some ipv6 ICMP error handling fixes in rxrpc, from our good brother David Howells. 13) Fix kexec on mlx5 by freeing IRQs in shutdown path. Daniel Juergens, you're the best! 14) Don't send bonding RLB updates to invalid MAC addresses. Debabrata Benerjee saved us! 15) Uh oh, we were leaking in udp_sendmsg and ping_v4_sendmsg. The ship is now water tight, thanks to Andrey Ignatov. 16) IPSEC memory leak in ixgbe from Colin Ian King, man we've got holes everywhere! 17) Fix error path in tcf_proto_create, Jiri Pirko what would we do without you! * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (92 commits) net sched actions: fix refcnt leak in skbmod net: sched: fix error path in tcf_proto_create() when modules are not configured net sched actions: fix invalid pointer dereferencing if skbedit flags missing ixgbe: fix memory leak on ipsec allocation ixgbevf: fix ixgbevf_xmit_frame()'s return type ixgbe: return error on unsupported SFP module when resetting ice: Set rq_last_status when cleaning rq ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg mlxsw: core: Fix an error handling path in 'mlxsw_core_bus_device_register()' bonding: send learning packets for vlans on slave bonding: do not allow rlb updates to invalid mac net/mlx5e: Err if asked to offload TC match on frag being first net/mlx5: E-Switch, Include VF RDMA stats in vport statistics net/mlx5: Free IRQs in shutdown path rxrpc: Trace UDP transmission failure rxrpc: Add a tracepoint to log ICMP/ICMP6 and error messages rxrpc: Fix the min security level for kernel calls rxrpc: Fix error reception on AF_INET6 sockets rxrpc: Fix missing start of call timeout qed: fix spelling mistake: "taskelt" -> "tasklet" ...
2018-05-11Merge tag 'trace-v4.17-rc4' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Working on some new updates to trace filtering, I noticed that the regex_match_front() test was updated to be limited to the size of the pattern instead of the full test string. But as the test string is not guaranteed to be nul terminated, it still needs to consider the size of the test string" * tag 'trace-v4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix regex_match_front() to not over compare the test string
2018-05-11Merge tag 'pm-4.17-rc5' of ↵Linus Torvalds1-14/+2
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two PCI power management regressions from the 4.13 cycle and one cpufreq schedutil governor bug introduced during the 4.12 cycle, drop a stale comment from the schedutil code and fix two mistakes in docs. Specifics: - Restore device_may_wakeup() check in pci_enable_wake() removed inadvertently during the 4.13 cycle to prevent systems from drawing excessive power when suspended or off, among other things (Rafael Wysocki). - Fix pci_dev_run_wake() to properly handle devices that only can signal PME# when in the D3cold power state (Kai Heng Feng). - Fix the schedutil cpufreq governor to avoid using UINT_MAX as the new CPU frequency in some cases due to a missing check (Rafael Wysocki). - Remove a stale comment regarding worker kthreads from the schedutil cpufreq governor (Juri Lelli). - Fix a copy-paste mistake in the intel_pstate driver documentation (Juri Lelli). - Fix a typo in the system sleep states documentation (Jonathan Neuschäfer)" * tag 'pm-4.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PCI / PM: Check device_may_wakeup() in pci_enable_wake() PCI / PM: Always check PME wakeup capability for runtime wakeup support cpufreq: schedutil: Avoid using invalid next_freq cpufreq: schedutil: remove stale comment PM: docs: intel_pstate: fix Active Mode w/o HWP paragraph PM: docs: sleep-states: Fix a typo ("includig")
2018-05-11tracing: Fix regex_match_front() to not over compare the test stringSteven Rostedt (VMware)1-0/+3
The regex match function regex_match_front() in the tracing filter logic, was fixed to test just the pattern length from testing the entire test string. That is, it went from strncmp(str, r->pattern, len) to strcmp(str, r->pattern, r->len). The issue is that str is not guaranteed to be nul terminated, and if r->len is greater than the length of str, it can access more memory than is allocated. The solution is to add a simple test if (len < r->len) return 0. Cc: stable@vger.kernel.org Fixes: 285caad415f45 ("tracing/filters: Fix MATCH_FRONT_ONLY filter matching") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-11compat: fix 4-byte infoleak via uninitialized struct fieldJann Horn1-0/+1
Commit 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts") removed the memset() in compat_get_timex(). Since then, the compat adjtimex syscall can invoke do_adjtimex() with an uninitialized ->tai. If do_adjtimex() doesn't write to ->tai (e.g. because the arguments are invalid), compat_put_timex() then copies the uninitialized ->tai field to userspace. Fix it by adding the memset() back. Fixes: 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-09bpf: xdp: allow offloads to store into rx_queue_indexJakub Kicinski1-1/+1
It's fairly easy for offloaded XDP programs to select the RX queue packets go to. We need a way of expressing this in the software. Allow write to the rx_queue_index field of struct xdp_md for device-bound programs. Skip convert_ctx_access callback entirely for offloads. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09bpf: btf: Add struct bpf_btf_infoMartin KaFai Lau2-6/+37
During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's info.info is directly filled with the BTF binary data. It is not extensible. In this case, we want to add BTF ID. This patch adds "struct bpf_btf_info" which has the BTF ID as one of its member. The BTF binary data itself is exposed through the "btf" and "btf_size" members. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09bpf: btf: Introduce BTF IDMartin KaFai Lau2-11/+121
This patch gives an ID to each loaded BTF. The ID is allocated by the idr like the existing prog-id and map-id. The bpf_put(map->btf) is moved to __bpf_map_put() so that the userspace can stop seeing the BTF ID ASAP when the last BTF refcnt is gone. It also makes BTF accessible from userspace through the 1. new BPF_BTF_GET_FD_BY_ID command. It is limited to CAP_SYS_ADMIN which is inline with the BPF_BTF_LOAD cmd and the existing BPF_[MAP|PROG]_GET_FD_BY_ID cmd. 2. new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info" Once the BTF ID handler is accessible from userspace, freeing a BTF object has to go through a rcu period. The BPF_BTF_GET_FD_BY_ID cmd can then be done under a rcu_read_lock() instead of taking spin_lock. [Note: A similar rcu usage can be done to the existing bpf_prog_get_fd_by_id() in a follow up patch] When processing the BPF_BTF_GET_FD_BY_ID cmd, refcount_inc_not_zero() is needed because the BTF object could be already in the rcu dead row . btf_get() is removed since its usage is currently limited to btf.c alone. refcount_inc() is used directly instead. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=yMartin KaFai Lau1-1/+1
If CONFIG_REFCOUNT_FULL=y, refcount_inc() WARN when refcount is 0. When creating a new btf, the initial btf->refcnt is 0 and triggered the following: [ 34.855452] refcount_t: increment on 0; use-after-free. [ 34.856252] WARNING: CPU: 6 PID: 1857 at lib/refcount.c:153 refcount_inc+0x26/0x30 .... [ 34.868809] Call Trace: [ 34.869168] btf_new_fd+0x1af6/0x24d0 [ 34.869645] ? btf_type_seq_show+0x200/0x200 [ 34.870212] ? lock_acquire+0x3b0/0x3b0 [ 34.870726] ? security_capable+0x54/0x90 [ 34.871247] __x64_sys_bpf+0x1b2/0x310 [ 34.871761] ? __ia32_sys_bpf+0x310/0x310 [ 34.872285] ? bad_area_access_error+0x310/0x310 [ 34.872894] do_syscall_64+0x95/0x3f0 This patch uses refcount_set() instead. Reported-by: Yonghong Song <yhs@fb.com> Tested-by: Yonghong Song <yhs@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09cpufreq: schedutil: Avoid using invalid next_freqRafael J. Wysocki1-1/+2
If the next_freq field of struct sugov_policy is set to UINT_MAX, it shouldn't be used for updating the CPU frequency (this is a special "invalid" value), but after commit b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely) it may be passed as the new frequency to sugov_update_commit() in sugov_update_single(). Fix that by adding an extra check for the special UINT_MAX value of next_freq to sugov_update_single(). Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely) Reported-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 4.12+ <stable@vger.kernel.org> # 4.12+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-09cpufreq: schedutil: remove stale commentJuri Lelli1-13/+0
After commit 794a56ebd9a57 (sched/cpufreq: Change the worker kthread to SCHED_DEADLINE) schedutil kthreads are "ignored" for a clock frequency selection point of view, so the potential corner case for RT tasks is not possible at all now. Remove the stale comment mentioning it. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller10-214/+566
Minor conflict, a CHECK was placed into an if() statement in net-next, whilst a newline was added to that CHECK call in 'net'. Thanks to Daniel for the merge resolution. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-06Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-19/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull clocksource fixes from Thomas Gleixner: "The recent addition of the early TSC clocksource breaks on machines which have an unstable TSC because in case that TSC is disabled, then the clocksource selection logic falls back to the early TSC which is obviously bogus. That also unearthed a few robustness issues in the clocksource derating code which are addressed as well" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Rework stale comment clocksource: Consistent de-rate when marking unstable x86/tsc: Fix mark_tsc_unstable() clocksource: Initialize cs->wd_list clocksource: Allow clocksource_mark_unstable() on unregistered clocksources x86/tsc: Always unregister clocksource_tsc_early
2018-05-05Merge tag 'trace-v4.17-rc1-3' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Some of the files in the tracing directory show file mode 0444 when they are writable by root. To fix the confusion, they should be 0644. Note, either case root can still write to them. Zhengyuan asked why I never applied that patch (the first one is from 2014!). I simply forgot about it. /me lowers head in shame" * tag 'trace-v4.17-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix the file mode of stack tracer ftrace: Have set_graph_* files have normal file modes
2018-05-05perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]Peter Zijlstra1-2/+5
> kernel/events/ring_buffer.c:871 perf_mmap_to_page() warn: potential spectre issue 'rb->aux_pages' Userspace controls @pgoff through the fault address. Sanitize the array index before doing the array dereference. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]Peter Zijlstra1-2/+5
> kernel/sched/autogroup.c:230 proc_sched_autogroup_set_nice() warn: potential spectre issue 'sched_prio_to_weight' Userspace controls @nice, sanitize the array index. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]Peter Zijlstra1-1/+6
> kernel/sched/core.c:6921 cpu_weight_nice_write_s64() warn: potential spectre issue 'sched_prio_to_weight' Userspace controls @nice, so sanitize the value before using it to index an array. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-5/+14
Daniel Borkmann says: ==================== pull-request: bpf 2018-05-05 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Sanitize attr->{prog,map}_type from bpf(2) since used as an array index to retrieve prog/map specific ops such that we prevent potential out of bounds value under speculation, from Mark and Daniel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-05seccomp: Move speculation migitation control to arch codeThomas Gleixner1-13/+2
The migitation control is simpler to implement in architecture code as it avoids the extra function call to check the mode. Aside of that having an explicit seccomp enabled mode in the architecture mitigations would require even more workarounds. Move it into architecture code and provide a weak function in the seccomp code. Remove the 'which' argument as this allows the architecture to decide which mitigations are relevant for seccomp. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-05seccomp: Add filter flag to opt-out of SSB mitigationKees Cook1-8/+11
If a seccomp user is not interested in Speculative Store Bypass mitigation by default, it can set the new SECCOMP_FILTER_FLAG_SPEC_ALLOW flag when adding filters. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-05seccomp: Use PR_SPEC_FORCE_DISABLEThomas Gleixner1-1/+1
Use PR_SPEC_FORCE_DISABLE in seccomp() because seccomp does not allow to widen restrictions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-05bpf, xskmap: fix crash in xsk_map_alloc error path handlingDaniel Borkmann1-0/+2
If bpf_map_precharge_memlock() did not fail, then we set err to zero. However, any subsequent failure from either alloc_percpu() or the bpf_map_area_alloc() will return ERR_PTR(0) which in find_and_alloc_map() will cause NULL pointer deref. In devmap we have the convention that we return -EINVAL on page count overflow, so keep the same logic here and just set err to -ENOMEM after successful bpf_map_precharge_memlock(). Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Björn Töpel <bjorn.topel@intel.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-05bpf: fix references to free_bpf_prog_info() in commentsJakub Kicinski1-2/+2
Comments in the verifier refer to free_bpf_prog_info() which seems to have never existed in tree. Replace it with free_used_maps(). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-05bpf: replace map pointer loads before calling into offloadsJakub Kicinski1-5/+5
Offloads may find host map pointers more useful than map fds. Map pointers can be used to identify the map, while fds are only valid within the context of loading process. Jump to skip_full_check on error in case verifier log overflow has to be handled (replace_map_fd_with_map_ptr() prints to the log, driver prep may do that too in the future). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-05bpf: export bpf_event_output()Jakub Kicinski1-0/+1
bpf_event_output() is useful for offloads to add events to BPF event rings, export it. Note that export is placed near the stub since tracing is optional and kernel/bpf/core.c is always going to be built. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-05nfp: bpf: record offload neutral maps in the driverJakub Kicinski1-0/+2
For asynchronous events originating from the device, like perf event output, we need to be able to make sure that objects being referred to by the FW message are valid on the host. FW events can get queued and reordered. Even if we had a FW message "barrier" we should still protect ourselves from bogus FW output. Add a reverse-mapping hash table and record in it all raw map pointers FW may refer to. Only record neutral maps, i.e. perf event arrays. These are currently the only objects FW can refer to. Use RCU protection on the read side, update side is under RTNL. Since program vs map destruction order is slightly painful for offload simply take an extra reference on all the recorded maps to make sure they don't disappear. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-05bpf: offload: allow offloaded programs to use perf event arraysJakub Kicinski1-2/+4
BPF_MAP_TYPE_PERF_EVENT_ARRAY is special as far as offload goes. The map only holds glue to perf ring, not actual data. Allow non-offloaded perf event arrays to be used in offloaded programs. Offload driver can extract the events from HW and put them in the map for user space to retrieve. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller19-168/+144
Overlapping changes in selftests Makefile. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04bpf: add faked "ending" subprogJiong Wang1-20/+14
There are quite a few code snippet like the following in verifier: subprog_start = 0; if (env->subprog_cnt == cur_subprog + 1) subprog_end = insn_cnt; else subprog_end = env->subprog_info[cur_subprog + 1].start; The reason is there is no marker in subprog_info array to tell the end of it. We could resolve this issue by introducing a faked "ending" subprog. The special "ending" subprog is with "insn_cnt" as start offset, so it is serving as the end mark whenever we iterate over all subprogs. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: centre subprog information fieldsJiong Wang1-30/+32
It is better to centre all subprog information fields into one structure. This structure could later serve as function node in call graph. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04bpf: unify main prog and subprogJiong Wang1-26/+31
Currently, verifier treat main prog and subprog differently. All subprogs detected are kept in env->subprog_starts while main prog is not kept there. Instead, main prog is implicitly defined as the prog start at 0. There is actually no difference between main prog and subprog, it is better to unify them, and register all progs detected into env->subprog_starts. This could also help simplifying some code logic. Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-04sched/core: Introduce set_special_state()Peter Zijlstra2-18/+16
Gaurav reported a perceived problem with TASK_PARKED, which turned out to be a broken wait-loop pattern in __kthread_parkme(), but the reported issue can (and does) in fact happen for states that do not do condition based sleeps. When the 'current->state = TASK_RUNNING' store of a previous (concurrent) try_to_wake_up() collides with the setting of a 'special' sleep state, we can loose the sleep state. Normal condition based wait-loops are immune to this problem, but for sleep states that are not condition based are subject to this problem. There already is a fix for TASK_DEAD. Abstract that and also apply it to TASK_STOPPED and TASK_TRACED, both of which are also without condition based wait-loop. Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds3-29/+77
Pull networking fixes from David Miller: 1) Various sockmap fixes from John Fastabend (pinned map handling, blocking in recvmsg, double page put, error handling during redirect failures, etc.) 2) Fix dead code handling in x86-64 JIT, from Gianluca Borello. 3) Missing device put in RDS IB code, from Dag Moxnes. 4) Don't process fast open during repair mode in TCP< from Yuchung Cheng. 5) Move address/port comparison fixes in SCTP, from Xin Long. 6) Handle add a bond slave's master into a bridge properly, from Hangbin Liu. 7) IPv6 multipath code can operate on unitialized memory due to an assumption that the icmp header is in the linear SKB area. Fix from Eric Dumazet. 8) Don't invoke do_tcp_sendpages() recursively via TLS, from Dave Watson. 9) Fix memory leaks in x86-64 JIT, from Daniel Borkmann. 10) RDS leaks kernel memory to userspace, from Eric Dumazet. 11) DCCP can invoke a tasklet on a freed socket, take a refcount. Also from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (78 commits) dccp: fix tasklet usage smc: fix sendpage() call net/smc: handle unregistered buffers net/smc: call consolidation qed: fix spelling mistake: "offloded" -> "offloaded" net/mlx5e: fix spelling mistake: "loobpack" -> "loopback" tcp: restore autocorking rds: do not leak kernel memory to user land qmi_wwan: do not steal interfaces from class drivers ipv4: fix fnhe usage by non-cached routes bpf: sockmap, fix error handling in redirect failures bpf: sockmap, zero sg_size on error when buffer is released bpf: sockmap, fix scatterlist update on error path in send with apply net_sched: fq: take care of throttled flows before reuse ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6" bpf, x64: fix memleak when not converging on calls bpf, x64: fix memleak when not converging after image net/smc: restrict non-blocking connect finish 8139too: Use disable_irq_nosync() in rtl8139_poll_controller() sctp: fix the issue that the cookie-ack with auth can't get processed ...
2018-05-04bpf: use array_index_nospec in find_prog_typeDaniel Borkmann1-2/+8
Commit 9ef09e35e521 ("bpf: fix possible spectre-v1 in find_and_alloc_map()") converted find_and_alloc_map() over to use array_index_nospec() to sanitize map type that user space passes on map creation, and this patch does an analogous conversion for progs in find_prog_type() as it's also passed from user space when loading progs as attr->prog_type. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-04bpf: implement ld_abs/ld_ind in native bpfDaniel Borkmann2-88/+32
The main part of this work is to finally allow removal of LD_ABS and LD_IND from the BPF core by reimplementing them through native eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and keeping them around in native eBPF caused way more trouble than actually worth it. To just list some of the security issues in the past: * fdfaf64e7539 ("x86: bpf_jit: support negative offsets") * 35607b02dbef ("sparc: bpf_jit: fix loads from negative offsets") * e0ee9c12157d ("x86: bpf_jit: fix two bugs in eBPF JIT compiler") * 07aee9439454 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call") * 6d59b7dbf72e ("bpf, s390x: do not reload skb pointers in non-skb context") * 87338c8e2cbb ("bpf, ppc64: do not reload skb pointers in non-skb context") For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy these days due to their limitations and more efficient/flexible alternatives that have been developed over time such as direct packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a register, the load happens in host endianness and its exception handling can yield unexpected behavior. The latter is explained in depth in f6b1b3bf0d5f ("bpf: fix subprog verifier bypass by div/mod by 0 exception") with similar cases of exceptions we had. In native eBPF more recent program types will disable LD_ABS/LD_IND altogether through may_access_skb() in verifier, and given the limitations in terms of exception handling, it's also disabled in programs that use BPF to BPF calls. In terms of cBPF, the LD_ABS/LD_IND is used in networking programs to access packet data. It is not used in seccomp-BPF but programs that use it for socket filtering or reuseport for demuxing with cBPF. This is mostly relevant for applications that have not yet migrated to native eBPF. The main complexity and source of bugs in LD_ABS/LD_IND is coming from their implementation in the various JITs. Most of them keep the model around from cBPF times by implementing a fastpath written in asm. They use typically two from the BPF program hidden CPU registers for caching the skb's headlen (skb->len - skb->data_len) and skb->data. Throughout the JIT phase this requires to keep track whether LD_ABS/LD_IND are used and if so, the two registers need to be recached each time a BPF helper would change the underlying packet data in native eBPF case. At least in eBPF case, available CPU registers are rare and the additional exit path out of the asm written JIT helper makes it also inflexible since not all parts of the JITer are in control from plain C. A LD_ABS/LD_IND implementation in eBPF therefore allows to significantly reduce the complexity in JITs with comparable performance results for them, e.g.: test_bpf tcpdump port 22 tcpdump complex x64 - before 15 21 10 14 19 18 - after 7 10 10 7 10 15 arm64 - before 40 91 92 40 91 151 - after 51 64 73 51 62 113 For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter() and cache the skb's headlen and data in the cBPF prologue. The BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just used as a local temporary variable. This allows to shrink the image on x86_64 also for seccomp programs slightly since mapping to %rsi is not an ereg. In callee-saved R8 and R9 we now track skb data and headlen, respectively. For normal prologue emission in the JITs this does not add any extra instructions since R8, R9 are pushed to stack in any case from eBPF side. cBPF uses the convert_bpf_ld_abs() emitter which probes the fast path inline already and falls back to bpf_skb_load_helper_{8,16,32}() helper relying on the cached skb data and headlen as well. R8 and R9 never need to be reloaded due to bpf_helper_changes_pkt_data() since all skb access in cBPF is read-only. Then, for the case of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally, does neither cache skb data and headlen nor has an inlined fast path. The reason for the latter is that native eBPF does not have any extra registers available anyway, but even if there were, it avoids any reload of skb data and headlen in the first place. Additionally, for the negative offsets, we provide an alternative bpf_skb_load_bytes_relative() helper in eBPF which operates similarly as bpf_skb_load_bytes() and allows for more flexibility. Tested myself on x64, arm64, s390x, from Sandipan on ppc64. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-04bpf: fix possible spectre-v1 in find_and_alloc_map()Mark Rutland1-3/+6
It's possible for userspace to control attr->map_type. Sanitize it when using it as an array index to prevent an out-of-bounds value being used under speculation. Found by smatch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: netdev@vger.kernel.org Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>