summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2016-04-28drm/vmwgfx: Enable SVGA_3D_CMD_DX_SET_PREDICATIONCharmaine Lee1-1/+1
Fixes piglit tests nv_conditional_render-* crashes. Signed-off-by: Charmaine Lee <charmainel@vmware.com> Reviewed-by: Brian Paul <brianp@vmware.com> Reviewed-by: Sinclair Yeh <syeh@vmware.com>
2016-04-28IB/security: Restrict use of the write() interfaceJason Gunthorpe7-1/+40
The drivers/infiniband stack uses write() as a replacement for bi-directional ioctl(). This is not safe. There are ways to trigger write calls that result in the return structure that is normally written to user space being shunted off to user specified kernel memory instead. For the immediate repair, detect and deny suspicious accesses to the write API. For long term, update the user space libraries and the kernel API to something that doesn't present the same security vulnerabilities (likely a structured ioctl() interface). The impacted uAPI interfaces are generally only available if hardware from drivers/infiniband is installed in the system. Reported-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> [ Expanded check to all known write() entry points ] Cc: stable@vger.kernel.org Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/hfi1: Use kernel default llseek for ui deviceDean Luick1-23/+2
The ui device llseek had a mistake with SEEK_END and did not fully follow seek semantics. Correct all this by using a kernel supplied function for fixed size devices. Cc: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/hfi1: Don't attempt to free resources if initialization failedMitko Haralanov2-33/+29
Attempting to free resources which have not been allocated and initialized properly led to the following kernel backtrace: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffffa09658fe>] unlock_exp_tids.isra.8+0x2e/0x120 [hfi1] PGD 852a43067 PUD 85d4a6067 PMD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 2831 Comm: osu_bw Tainted: G IO 3.12.18-wfr+ #1 task: ffff88085b15b540 ti: ffff8808588fe000 task.ti: ffff8808588fe000 RIP: 0010:[<ffffffffa09658fe>] [<ffffffffa09658fe>] unlock_exp_tids.isra.8+0x2e/0x120 [hfi1] RSP: 0018:ffff8808588ffde0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff880858a31800 RCX: 0000000000000000 RDX: ffff88085d971bc0 RSI: ffff880858a318f8 RDI: ffff880858a318c0 RBP: ffff8808588ffe20 R08: 0000000000000000 R09: 0000000000000000 R10: ffff88087ffd6f40 R11: 0000000001100348 R12: ffff880852900000 R13: ffff880858a318c0 R14: 0000000000000000 R15: ffff88085d971be8 FS: 00007f4674e83740(0000) GS:ffff88087f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000085c377000 CR4: 00000000001407f0 Stack: ffffffffa0941a71 ffff880858a318f8 ffff88085d971bc0 ffff880858a31800 ffff880852900000 ffff880858a31800 00000000003ffff7 ffff88085d971bc0 ffff8808588ffe60 ffffffffa09663fc ffff8808588ffe60 ffff880858a31800 Call Trace: [<ffffffffa0941a71>] ? find_mmu_handler+0x51/0x70 [hfi1] [<ffffffffa09663fc>] hfi1_user_exp_rcv_free+0x6c/0x120 [hfi1] [<ffffffffa0932809>] hfi1_file_close+0x1a9/0x340 [hfi1] [<ffffffff8116c189>] __fput+0xe9/0x270 [<ffffffff8116c35e>] ____fput+0xe/0x10 [<ffffffff81065707>] task_work_run+0xa7/0xe0 [<ffffffff81002969>] do_notify_resume+0x59/0x80 [<ffffffff814ffc1a>] int_signal+0x12/0x17 This commit re-arranges the context initialization code in a way that would allow for context event flags to be used to determine whether the context has been successfully initialized. In turn, this can be used to skip the resource de-allocation if they were never allocated in the first place. Fixes: 3abb33ac6521 ("staging/hfi1: Add TID cache receive init and free funcs") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com. Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/hfi1: Fix missing lock/unlock in verbs drain callbackMike Marciniszyn1-0/+2
The iowait_sdma_drained() callback lacked locking to protect the qp s_flags field. This causes the s_flags to be out of sync on multiple CPUs, potentially corrupting the s_flags. Fixes: a545f5308b6c ("staging/rdma/hfi: fix CQ completion order issue") Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/rdmavt: Fix send schedulingJubin John1-2/+2
call_send is used to determine whether to send immediately or schedule a send for later. The current logic in rdmavt is inverted and has a negative impact on the latency of the hfi1 and qib drivers. Fix this regression by correctly calling send immediately when call_send is set. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/hfi1: Prevent unpinning of wrong pagesMitko Haralanov1-5/+8
The routine used by the SDMA cache to handle already cached nodes can extend an already existing node. In its error handling code, the routine will unpin pages when not all pages of the buffer extension were pinned. There was a bug in that part of the routine, which would mistakenly unpin pages from the original set rather than the newly pinned pages. This commit fixes that bug by offsetting the page array to the proper place pointing at the beginning of the newly pinned pages. Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/hfi1: Fix deadlock caused by locking with wrong scopeMitko Haralanov1-5/+11
The locking around the interval RB tree is designed to prevent access to the tree while it's being modified. The locking in its current form is too overzealous, which is causing a deadlock in certain cases with the following backtrace: Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0 CPU: 0 PID: 5836 Comm: IMB-MPI1 Tainted: G O 3.12.18-wfr+ #1 0000000000000000 ffff88087f206c50 ffffffff814f1caa ffffffff817b53f0 ffff88087f206cc8 ffffffff814ecd56 0000000000000010 ffff88087f206cd8 ffff88087f206c78 0000000000000000 0000000000000000 0000000000001662 Call Trace: <NMI> [<ffffffff814f1caa>] dump_stack+0x45/0x56 [<ffffffff814ecd56>] panic+0xc2/0x1cb [<ffffffff810d4370>] ? restart_watchdog_hrtimer+0x50/0x50 [<ffffffff810d4432>] watchdog_overflow_callback+0xc2/0xd0 [<ffffffff81109b4e>] __perf_event_overflow+0x8e/0x2b0 [<ffffffff8110a714>] perf_event_overflow+0x14/0x20 [<ffffffff8101c906>] intel_pmu_handle_irq+0x1b6/0x390 [<ffffffff814f927b>] perf_event_nmi_handler+0x2b/0x50 [<ffffffff814f8ad8>] nmi_handle.isra.3+0x88/0x180 [<ffffffff814f8d39>] do_nmi+0x169/0x310 [<ffffffff814f8177>] end_repeat_nmi+0x1e/0x2e [<ffffffff81272600>] ? unmap_single+0x30/0x30 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40 <<EOE>> <IRQ> [<ffffffffa056c4a8>] hfi1_mmu_rb_search+0x38/0x70 [hfi1] [<ffffffffa05919cb>] user_sdma_free_request+0xcb/0x120 [hfi1] [<ffffffffa0593393>] user_sdma_txreq_cb+0x263/0x350 [hfi1] [<ffffffffa057fad7>] ? sdma_txclean+0x27/0x1c0 [hfi1] [<ffffffffa0593130>] ? user_sdma_send_pkts+0x1710/0x1710 [hfi1] [<ffffffffa057fdd6>] sdma_make_progress+0x166/0x480 [hfi1] [<ffffffff810762c9>] ? ttwu_do_wakeup+0x19/0xd0 [<ffffffffa0581c7e>] sdma_engine_interrupt+0x8e/0x100 [hfi1] [<ffffffffa0546bdd>] sdma_interrupt+0x5d/0xa0 [hfi1] [<ffffffff81097e57>] handle_irq_event_percpu+0x47/0x1d0 [<ffffffff81098017>] handle_irq_event+0x37/0x60 [<ffffffff8109aa5f>] handle_edge_irq+0x6f/0x120 [<ffffffff810044af>] handle_irq+0xbf/0x150 [<ffffffff8104c9b7>] ? irq_enter+0x17/0x80 [<ffffffff8150168d>] do_IRQ+0x4d/0xc0 [<ffffffff814f7c6a>] common_interrupt+0x6a/0x6a <EOI> [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0 [<ffffffff814f56c6>] __schedule+0x3b6/0x7e0 [<ffffffff810763a6>] __cond_resched+0x26/0x30 [<ffffffff814f5eda>] _cond_resched+0x3a/0x50 [<ffffffff814f4f82>] down_write+0x12/0x30 [<ffffffffa0591619>] hfi1_release_user_pages+0x69/0x90 [hfi1] [<ffffffffa059173a>] sdma_rb_remove+0x9a/0xc0 [hfi1] [<ffffffffa056c00d>] __mmu_rb_remove.isra.5+0x5d/0x70 [hfi1] [<ffffffffa056c536>] hfi1_mmu_rb_remove+0x56/0x70 [hfi1] [<ffffffffa059427b>] hfi1_user_sdma_process_request+0x74b/0x1160 [hfi1] [<ffffffffa055c763>] hfi1_aio_write+0xc3/0x100 [hfi1] [<ffffffff8116a14c>] do_sync_readv_writev+0x4c/0x80 [<ffffffff8116b58b>] do_readv_writev+0xbb/0x230 [<ffffffff811a9da1>] ? fsnotify+0x241/0x320 [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0 [<ffffffff8116b795>] vfs_writev+0x35/0x60 [<ffffffff8116b8c9>] SyS_writev+0x49/0xc0 [<ffffffff810cd876>] ? __audit_syscall_exit+0x1f6/0x2a0 [<ffffffff814ff992>] system_call_fastpath+0x16/0x1b As evident from the backtrace above, the process was being put to sleep while holding the lock. Limiting the scope of the lock only to the RB tree operation fixes the above error allowing for proper locking and the process being put to sleep when needed. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/hfi1: Prevent NULL pointer deferences in caching codeMitko Haralanov4-23/+37
There is a potential kernel crash when the MMU notifier calls the invalidation routines in the hfi1 pinned page caching code for sdma. The invalidation routine could call the remove callback for the node, which in turn ends up dereferencing the current task_struct to get a pointer to the mm_struct. However, the mm_struct pointer could be NULL resulting in the following backtrace: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 IP: [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1] 15 task: ffff88085e66e080 ti: ffff88085c244000 task.ti: ffff88085c244000 RIP: 0010:[<ffffffffa041f75a>] [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1] RSP: 0000:ffff88085c245878 EFLAGS: 00010002 RAX: 0000000000000000 RBX: ffff88105b9bbd40 RCX: ffffea003931a830 RDX: 0000000000000004 RSI: ffff88105754a9c0 RDI: ffff88105754a9c0 RBP: ffff88085c245890 R08: ffff88105b9bbd70 R09: 00000000fffffffb R10: ffff88105b9bbd58 R11: 0000000000000013 R12: ffff88105754a9c0 R13: 0000000000000001 R14: 0000000000000001 R15: ffff88105b9bbd40 FS: 0000000000000000(0000) GS:ffff88107ef40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000a8 CR3: 0000000001a0b000 CR4: 00000000001407e0 Stack: ffff88105b9bbd40 ffff88080ec481a8 ffff88080ec481b8 ffff88085c2458c0 ffffffffa03fa00e ffff88080ec48190 ffff88080ed9cd00 0000000001024000 0000000000000000 ffff88085c245920 ffffffffa03fa0e7 0000000000000282 Call Trace: [<ffffffffa03fa00e>] __mmu_rb_remove.isra.5+0x5e/0x70 [hfi1] [<ffffffffa03fa0e7>] mmu_notifier_mem_invalidate+0xc7/0xf0 [hfi1] [<ffffffffa03fa143>] mmu_notifier_page+0x13/0x20 [hfi1] [<ffffffff81156dd0>] __mmu_notifier_invalidate_page+0x50/0x70 [<ffffffff81140bbb>] try_to_unmap_one+0x20b/0x470 [<ffffffff81141ee7>] try_to_unmap_anon+0xa7/0x120 [<ffffffff81141fad>] try_to_unmap+0x4d/0x60 [<ffffffff8111fd7b>] shrink_page_list+0x2eb/0x9d0 [<ffffffff81120ab3>] shrink_inactive_list+0x243/0x490 [<ffffffff81121491>] shrink_lruvec+0x4c1/0x640 [<ffffffff81121641>] shrink_zone+0x31/0x100 [<ffffffff81121b0f>] kswapd_shrink_zone.constprop.62+0xef/0x1c0 [<ffffffff811229e3>] kswapd+0x403/0x7e0 [<ffffffff811225e0>] ? shrink_all_memory+0xf0/0xf0 [<ffffffff81068ac0>] kthread+0xc0/0xd0 [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40 [<ffffffff814ff8ec>] ret_from_fork+0x7c/0xb0 [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40 To correct this, the mm_struct passed to us by the MMU notifier is used (which is what should have been done to begin with). This avoids the broken derefences and ensures that the correct mm_struct is used. Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28net: snmp: fix 64bit stats on 32bit archesEric Dumazet1-4/+4
I accidentally replaced BH disabling by preemption disabling in SNMP_ADD_STATS64() and SNMP_UPD_PO_STATS64() on 32bit builds. For 64bit stats on 32bit arch, we really need to disable BH, since the "struct u64_stats_sync syncp" might be manipulated both from process and BH contexts. Fixes: 6aef70a851ac ("net: snmp: kill various STATS_USER() helpers") Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28MAINTAINERS: Update iser/isert maintainer contact infoSagi Grimberg1-2/+2
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28IB/mlx5: Expose correct max_sge_rd limitSagi Grimberg2-1/+12
mlx5 devices (Connect-IB, ConnectX-4, ConnectX-4-LX) has a limitation where rdma read work queue entries cannot exceed 512 bytes. A rdma_read wqe needs to fit in 512 bytes: - wqe control segment (16 bytes) - rdma segment (16 bytes) - scatter elements (16 bytes each) So max_sge_rd should be: (512 - 16 - 16) / 16 = 30. Cc: linux-stable@vger.kernel.org Reported-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagig@grimberg.me> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-04-28mmc: sunxi: Disable eMMC HS-DDR (MMC_CAP_1_8V_DDR) for Allwinner A80Chen-Yu Tsai1-0/+5
eMMC HS-DDR no longer works on the A80, despite it working when support for this developed. Disable it for now. Signed-off-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2016-04-28perf/x86/intel: Fix incorrect lbr_sel_mask valueKan Liang1-2/+4
This patch fixes a bug which was introduced by: b16a5b52eb90 ("perf/x86: Add option to disable reading branch flags/cycles") In this patch, lbr_sel_mask is used to mask the lbr_select. But LBR_SEL_MASK doesn't include the bit for LBR_CALL_STACK. So LBR call stack will never be set in lbr_select. This patch corrects the LBR_SEL_MASK by including all valid bits in LBR_SELECT. Also, the LBR_CALL_STACK bit is different as other bit in LBR_SELECT. It does not operate in suppress mode, so it needs to be specially handled in intel_pmu_setup_hw_lbr_filter. Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1461231010-4399-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28perf/x86/intel/pt: Don't die on VMXONAlexander Shishkin4-11/+75
Some versions of Intel PT do not support tracing across VMXON, more specifically, VMXON will clear TraceEn control bit and any attempt to set it before VMXOFF will throw a #GP, which in the current state of things will crash the kernel. Namely: $ perf record -e intel_pt// kvm -nographic on such a machine will kill it. To avoid this, notify the intel_pt driver before VMXON and after VMXOFF so that it knows when not to enable itself. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Gleb Natapov <gleb@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/87oa9dwrfk.fsf@ashishki-desk.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28perf/core: Fix perf_event_open() vs. execve() racePeter Zijlstra1-16/+36
Jann reported that the ptrace_may_access() check in find_lively_task_by_vpid() is racy against exec(). Specifically: perf_event_open() execve() ptrace_may_access() commit_creds() ... if (get_dumpable() != SUID_DUMP_USER) perf_event_exit_task(); perf_install_in_context() would result in installing a counter across the creds boundary. Fix this by wrapping lots of perf_event_open() in cred_guard_mutex. This should be fine as perf_event_exit_task() is already called with cred_guard_mutex held, so all perf locks already nest inside it. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28perf/x86/amd: Set the size of event map array to PERF_COUNT_HW_MAXAdam Borowski1-1/+1
The entry for PERF_COUNT_HW_REF_CPU_CYCLES is not used on AMD, but is referenced by filter_events() which expects undefined events to have a value of 0. Found via KASAN: UBSAN: Undefined behaviour in arch/x86/events/amd/core.c:132:30 index 9 is out of range for type 'u64 [9]' UBSAN: Undefined behaviour in arch/x86/events/amd/core.c:132:9 load of address ffffffff81c021c8 with insufficient space for an object of type 'const u64' Signed-off-by: Adam Borowski <kilobyte@angband.pl> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1461749731-30979-1-git-send-email-kilobyte@angband.pl Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28rbd: report unsupported features to syslogIlya Dryomov1-3/+6
... instead of just returning an error. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2016-04-28rbd: fix rbd map vs notify racesIlya Dryomov1-24/+19
A while ago, commit 9875201e1049 ("rbd: fix use-after free of rbd_dev->disk") fixed rbd unmap vs notify race by introducing an exported wrapper for flushing notifies and sticking it into do_rbd_remove(). A similar problem exists on the rbd map path, though: the watch is registered in rbd_dev_image_probe(), while the disk is set up quite a few steps later, in rbd_dev_device_setup(). Nothing prevents a notify from coming in and crashing on a NULL rbd_dev->disk: BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 Call Trace: [<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd] [<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph] [<ffffffff8109d5db>] process_one_work+0x17b/0x470 [<ffffffff8109e3ab>] worker_thread+0x11b/0x400 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400 [<ffffffff810a5acf>] kthread+0xcf/0xe0 [<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 [<ffffffff81645dd8>] ret_from_fork+0x58/0x90 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 RIP [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd] If an error occurs during rbd map, we have to error out, potentially tearing down a watch. Just like on rbd unmap, notifies have to be flushed, otherwise rbd_watch_cb() may end up trying to read in the image header after rbd_dev_image_release() has run: Assertion failure in rbd_dev_header_info() at line 4722: rbd_assert(rbd_image_format_valid(rbd_dev->image_format)); Call Trace: [<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150 [<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390 [<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290 [<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0 [<ffffffff81107799>] process_one_work+0x689/0x1a80 [<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80 [<ffffffff81132065>] ? finish_task_switch+0x225/0x640 [<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0 [<ffffffff81108c69>] worker_thread+0xd9/0x1320 [<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80 [<ffffffff8111b02d>] kthread+0x21d/0x2e0 [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550 [<ffffffff82022802>] ret_from_fork+0x22/0x40 [<ffffffff8111ae10>] ? kthread_stop+0x550/0x550 RIP [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30 To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling revalidate_disk(), b) move ceph_osdc_flush_notifies() call into rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn header read-in into a critical section. The latter also happens to take care of rbd map foo@bar vs rbd snap rm foo@bar race. Fixes: http://tracker.ceph.com/issues/15490 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
2016-04-28x86/apic: Handle zero vector gracefully in clear_vector_irq()Keith Busch1-1/+2
If x86_vector_alloc_irq() fails x86_vector_free_irqs() is invoked to cleanup the already allocated vectors. This subsequently calls clear_vector_irq(). The failed irq has no vector assigned, which triggers the BUG_ON(!vector) in clear_vector_irq(). We cannot suppress the call to x86_vector_free_irqs() for the failed interrupt, because the other data related to this irq must be cleaned up as well. So calling clear_vector_irq() with vector == 0 is legitimate. Remove the BUG_ON and return if vector is zero, [ tglx: Massaged changelog ] Fixes: b5dc8e6c21e7 "x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors" Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-04-28Merge branch 'socket-space-optimizations'David S. Miller1-0/+8
Eric Dumazet says: ==================== net: avoid some atomic ops when FASYNC is not used We can avoid some atomic operations on sockets not using FASYNC ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: SOCKWQ_ASYNC_WAITDATA optimizationsEric Dumazet1-2/+4
SOCKWQ_ASYNC_WAITDATA is set/cleared in sk_wait_data() and equivalent functions, so that sock_wake_async() can send a SIGIO only when necessary. Since these atomic operations are really not needed unless socket expressed interest in FASYNC, we can omit them in most cases. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: SOCKWQ_ASYNC_NOSPACE optimizationsEric Dumazet1-0/+6
SOCKWQ_ASYNC_NOSPACE is tested in sock_wake_async() so that a SIGIO signal is sent when needed. tcp_sendmsg() clears the bit. tcp_poll() sets the bit when stream is not writeable. We can avoid two atomic operations by first checking if socket is actually interested in the FASYNC business (most sockets in real applications do not use AIO, but select()/poll()/epoll()) This also removes one cache line miss to access sk->sk_wq->flags in tcp_sendmsg() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28Merge branch 'snmp-stats-update'David S. Miller54-531/+512
Eric Dumazet says: ==================== net: snmp: update SNMP methods In the old days (before linux-3.0), SNMP counters were duplicated, one set for user context, and anther one for BH context. After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%") we have a single copy, and what really matters is preemption being enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc() respectively. This patch series kills the obsolete STATS_USER() helpers, and rename all XXX_BH() helpers to __XXX() ones, to more closely match conventions used to update per cpu variables. This is probably going to hurt maintainers job for a while, since cherry-picks will not be clean, but this had to be cleaned at one point. I am so sorry guys. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: snmp: kill STATS_BH macrosEric Dumazet8-45/+45
There is nothing related to BH in SNMP counters anymore, since linux-3.0. Rename helpers to use __ prefix instead of _BH prefix, for contexts where preemption is disabled. This more closely matches convention used to update percpu variables. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28ipv6: kill ICMP6MSGIN_INC_STATS_BH()Eric Dumazet2-4/+2
IPv6 ICMP stats are atomics anyway. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28ipv6: rename IP6_UPD_PO_STATS_BH()Eric Dumazet2-3/+3
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28ipv6: rename IP6_INC_STATS_BH()Eric Dumazet7-91/+91
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS() and IP6_ADD_STATS_BH() to __IP6_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: rename NET_{ADD|INC}_STATS_BH()Eric Dumazet25-149/+153
Rename NET_INC_STATS_BH() to __NET_INC_STATS() and NET_ADD_STATS_BH() to __NET_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: rename IP_UPD_PO_STATS_BH()Eric Dumazet2-4/+4
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: rename IP_ADD_STATS_BH()Eric Dumazet3-5/+5
Rename IP_ADD_STATS_BH() to __IP_ADD_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: rename ICMP6_INC_STATS_BH()Eric Dumazet6-15/+15
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: rename IP_INC_STATS_BH()Eric Dumazet8-29/+29
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to better express this is used in non preemptible context. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: sctp: rename SCTP_INC_STATS_BH()Eric Dumazet2-7/+7
Rename SCTP_INC_STATS_BH() to __SCTP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: icmp: rename ICMPMSGIN_INC_STATS_BH()Eric Dumazet2-2/+2
Remove misleading _BH suffix. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: tcp: rename TCP_INC_STATS_BHEric Dumazet7-25/+25
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: xfrm: kill XFRM_INC_STATS_BH()Eric Dumazet1-2/+0
Not used anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: udp: rename UDP_INC_STATS_BH()Eric Dumazet5-52/+52
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(), and UDP6_INC_STATS_BH() to __UDP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: rename ICMP_INC_STATS_BH()Eric Dumazet6-14/+14
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28dccp: rename DCCP_INC_STATS_BH()Eric Dumazet7-16/+16
Rename DCCP_INC_STATS_BH() to __DCCP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: snmp: kill various STATS_USER() helpersEric Dumazet10-78/+59
In the old days (before linux-3.0), SNMP counters were duplicated, one for user context, and one for BH context. After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%") we have a single copy, and what really matters is preemption being enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc() respectively. We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(), NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(), SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(), UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER() Following patches will rename __BH helpers to make clear their usage is not tied to BH being disabled. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28Merge branch '40GbE' of ↵David S. Miller21-407/+947
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2016-04-27 This series contains updates to i40e and i40evf. Alex Duyck cleans up the feature flags since they are becoming pretty "massive", the primary change being that we now build our features list around hw_encap_features. Added support for IPIP and SIT offloads, which should improvement in throughput for IPIP and SIT tunnels with the offload enabled. Mitch adds support for configuring RSS on behalf of the VFs, which removes the burden of dealing with different hardware interfaces from the VF drivers and improves future compatibility. Fix to ensure that we do not panic by checking that the vsi_res pointer is valid before dereferencing it, after which we can drink beer and eat peanuts. Shannon does come housekeeping in i40e_add_fdir_ethtool() in preparation for more cloud filter work. Added flexibility to the nvmupdate facility by adding the ability to specify an AQ event opcode to wait on after Exec_AQ request. Michal adds device capability which defines if an update is available and if a security check is needed during the update process. Kamil just adds a device id to support X722 QSFP+ device. Greg fixes an issue where a mirror rule ID may be zero, so do not return invalid parameter when the user passes in a zero for a rule ID. Adds support to steer packets to VSIs by VLAN tag alone while being in promiscuous mode for multicast and unicast MAC addresses. Jesse fixes the driver from offloading the VLAN tag into the skb any time there was a VLAN tag and the hardware stripping was enabled, to making sure it is enabled before put_tag. v2: Dropped patch 8 ("i40e: Allow user to change input set mask for flow director") while Kiran reworks a more generalized solution based on feedback from David Miller. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net-rfs: fix false sharing accessing sd->input_queue_headEric Dumazet1-2/+6
sd->input_queue_head is incremented for each processed packet in process_backlog(), and read from other cpus performing Out Of Order avoidance in get_rps_cpu() Moving this field in a separate cache line keeps it mostly hot for the cpu in process_backlog(), as other cpus will only read it. In a stress test, process_backlog() was consuming 6.80 % of cpu cycles, and the patch reduced the cost to 0.65 % Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28net: w5100: support W5500Akinobu Mita4-85/+365
This adds support for W5500 chip. W5500 has similar register and memory organization with W5100 and W5200. There are a few important differences listed below but it is still possible to share common code with W5100 and W5200. * W5500 register and memory are organized by multiple blocks. Each one is selected by 16bits offset address and 5bits block select bits. But the existing register access operations take u16 address. This change extends the addess by u32 address and put offset address to lower 16bits and block select bits to upper 16bits. This change also adds the offset addresses for socket register and TX/RX memory blocks to the driver private data structure in order to reduce conditional switches for each chip. * W5500 has the different register offset for socket interrupt mask register. Newly added internal functions w5100_enable_intr() and w5100_disable_intr() take care of the diffrence. * W5500 has the different register offset for retry time-value register. But this register is only used to verify that the reset value is correctly read at initialization. So move the verification to w5100_hw_reset() which already does different things for different chips. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Mike Sinkovsky <msink@permonline.ru> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28thermal: use %d to print S32 parametersLeo Yan1-1/+1
Power allocator's parameters are S32 type, so use %d to print them. Acked-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
2016-04-28thermal: hisilicon: increase temperature resolutionLeo Yan1-2/+2
When calculate temperature, old code firstly do division and then convert to "millicelsius" unit. This will lose resolution and only can read back temperature with "Celsius" unit. So firstly scale step value to "millicelsius" and then do division, so finally we can increase resolution for temperature value. Also refine the calculation from temperature value to step value. Signed-off-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
2016-04-28sparc64: Fix bootup regressions on some Kconfig combinations.David S. Miller8-55/+34
The system call tracing bug fix mentioned in the Fixes tag below increased the amount of assembler code in the sequence of assembler files included by head_64.S This caused to total set of code to exceed 0x4000 bytes in size, which overflows the expression in head_64.S that works to place swapper_tsb at address 0x408000. When this is violated, the TSB is not properly aligned, and also the trap table is not aligned properly either. All of this together results in failed boots. So, do two things: 1) Simplify some code by using ba,a instead of ba/nop to get those bytes back. 2) Add a linker script assertion to make sure that if this happens again the build will fail. Fixes: 1a40b95374f6 ("sparc: Fix system call tracing register handling.") Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Joerg Abraham <joerg.abraham@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27Merge branch 'bnxt_en-fixes'David S. Miller2-14/+52
Michael Chan says: ==================== bnxt_en: Bug fixes for net. Only use MSIX on VF, and fix rx page buffers on architectures with PAGE_SIZE >= 64K. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27bnxt_en: Divide a page into 32K buffers for the aggregation ring if necessary.Michael Chan2-5/+34
If PAGE_SIZE is bigger than BNXT_RX_PAGE_SIZE, that means the native CPU page is bigger than the maximum length of the RX BD. Divide the page into multiple 32K buffers for the aggregation ring. Add an offset field in the bnxt_sw_rx_agg_bd struct to keep track of the page offset of each buffer. Since each page can be referenced by multiple buffer entries, call get_page() as needed to get the proper reference count. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27bnxt_en: Limit RX BD pages to be no bigger than 32K.Michael Chan2-9/+18
The RX BD length field of this device is 16-bit, so the largest buffer size is 65535. For LRO and GRO, we allocate native CPU pages for the aggregation ring buffers. It won't work if the native CPU page size is 64K or bigger. We fix this by defining BNXT_RX_PAGE_SIZE to be native CPU page size up to 32K. Replace PAGE_SIZE with BNXT_RX_PAGE_SIZE in all appropriate places related to the rx aggregation ring logic. The next patch will add additional logic to divide the page into 32K chunks for aggrgation ring buffers if PAGE_SIZE is bigger than BNXT_RX_PAGE_SIZE. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>