summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig95-682/+622
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17block: make /sys/block/<dev>/queue/discard_max_bytes writeableJens Axboe4-2/+53
Lots of devices support huge discard sizes these days. Depending on how the device handles them internally, huge discards can introduce massive latencies (hundreds of msec) on the device side. We have a sysfs file, discard_max_bytes, that advertises the max hardware supported discard size. Make this writeable, and split the settings into a soft and hard limit. This can be set from 'discard_granularity' and up to the hardware limit. Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw set limit. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17block: have drivers use blk_queue_max_discard_sectors()Jens Axboe12-15/+15
Some drivers use it now, others just set the limits field manually. But in preparation for splitting this into a hard and soft limit, ensure that they all call the proper function for setting the hw limit for discards. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17block: partition: convert percpu refMing Lei3-15/+27
Percpu refcount is the perfect match for partition's case, and the conversion is quite straight. With the convertion, one pair of atomic inc/dec can be saved for accounting block I/O, which is run in hot path of block I/O. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17block: partition: introduce hd_free_part()Ming Lei3-4/+8
So the helper can be used in both generic partition case and part0 case. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17Merge tag 'pm+acpi-4.2-rc3' of ↵Linus Torvalds4-20/+32
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management and ACPI fixes from Rafael Wysocki: "These fix two bugs in the cpufreq core (including one recent regression), fix a 4.0 PCI regression related to the ACPI resources management and quieten an RCU-related lockdep complaint about a tracepoint in the suspend-to-idle code. Specifics: - Fix a recently introduced issue in the cpufreq policy object reinitialization that leads to CPU offline/online breakage (Viresh Kumar) - Make it possible to access frequency tables of offline CPUs which is needed by thermal management code among other things (Viresh Kumar) - Fix an ACPI resource management regression introduced during the 4.0 cycle that may cause incorrect resource validation results to appear in 32-bit x86 kernels due to silent truncation of 64-bit values to 32-bit (Jiang Liu) - Fix up an RCU-related lockdep complaint about suspicious RCU usage in idle caused by using a suspend tracepoint in the core suspend- to-idle code (Rafael J Wysocki)" * tag 'pm+acpi-4.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / PCI: Fix regressions caused by resource_size_t overflow with 32-bit kernel cpufreq: Allow freq_table to be obtained for offline CPUs cpufreq: Initialize the governor again while restoring policy suspend-to-idle: Prevent RCU from complaining about tick_freeze()
2015-07-17Merge tag 'platform-drivers-x86-v4.2-3' of ↵Linus Torvalds4-111/+176
git://git.infradead.org/users/dvhart/linux-platform-drivers-x86 Pull x86 platform driver fixes from Darren Hart: "Fix SMBIOS call handling and hwswitch state coherency in the dell-laptop driver. Cleanups for intel_*_ipc drivers. Details: dell-laptop: - Do not cache hwswitch state - Check return value of each SMBIOS call - Clear buffer before each SMBIOS call intel_scu_ipc: - Move local memory initialization out of a mutex intel_pmc_ipc: - Update kerneldoc formatting - Fix compiler casting warnings" * tag 'platform-drivers-x86-v4.2-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: intel_scu_ipc: move local memory initialization out of a mutex intel_pmc_ipc: Update kerneldoc formatting dell-laptop: Do not cache hwswitch state dell-laptop: Check return value of each SMBIOS call dell-laptop: Clear buffer before each SMBIOS call intel_pmc_ipc: Fix compiler casting warnings
2015-07-17Merge branch 'for-next' of ↵Linus Torvalds10-128/+45
git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu Pull m68knommu/coldfire fixes from Greg Ungerer: "Contains build fixes and updates for the ColdFire defconfigs. Specifically there is a couple of fixes that address problems building allnoconfig. Also fix for enabling PCI bus on the M54xx family of ColdFire" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: m68k: enable PCI support for m5475evb defconfig m68k: fix io functions for ColdFire/MMU/PCI case m68knommu: update defconfig for ColdFire m5475evb m68knommu: update defconfig for ColdFire m5407c3 m68knommu: update defconfig for ColdFire m5307c3 m68knommu: update defconfig for ColdFire m5275evb m68knommu: update defconfig for ColdFire m5272c3 m68knommu: update defconfig for ColdFire m5249evb m68knommu: update defconfig for m5208evb m68knommu: make ColdFire SoC selection a choice m68knommu: improve the clock configuration defaults m68knommu: force setting of CONFIG_CLOCK_FREQ for ColdFire
2015-07-17Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds10-79/+113
Pull block fixes from Jens Axboe: "A collection of fixes from the last few weeks that should go into the current series. This contains: - Various fixes for the per-blkcg policy data, fixing regressions since 4.1. From Arianna and Tejun - Code cleanup for bcache closure macros from me. Really just flushing this out, it's been sitting in another branch for months - FIELD_SIZEOF cleanup from Maninder Singh - bio integrity oops fix from Mike - Timeout regression fix for blk-mq from Ming Lei" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: set default timeout as 30 seconds NVMe: Reread partitions on metadata formats bcache: don't embed 'return' statements in closure macros blkcg: fix blkcg_policy_data allocation bug blkcg: implement all_blkcgs list blkcg: blkcg_css_alloc() should grab blkcg_pol_mutex while iterating blkcg_policy[] blkcg: allow blkcg_pol_mutex to be grabbed from cgroup [file] methods block/blk-cgroup.c: free per-blkcg data when freeing the blkcg block: use FIELD_SIZEOF to calculate size of a field bio integrity: do not assume bio_integrity_pool exists if bioset exists
2015-07-17Merge tag 'jfs-4.2' of git://github.com/kleikamp/linux-shaggyLinus Torvalds3-17/+16
Pull jfs fixes from David Kleikamp: "A couple trivial fixes and an error path fix" * tag 'jfs-4.2' of git://github.com/kleikamp/linux-shaggy: jfs: clean up jfs_rename and fix out of order unlock jfs: fix indentation on if statement jfs: removed a prohibited space after opening parenthesis
2015-07-17Merge branches 'pm-cpuidle', 'pm-cpufreq' and 'acpi-resources'Rafael J. Wysocki4-20/+32
* pm-cpuidle: suspend-to-idle: Prevent RCU from complaining about tick_freeze() * pm-cpufreq: cpufreq: Allow freq_table to be obtained for offline CPUs cpufreq: Initialize the governor again while restoring policy * acpi-resources: ACPI / PCI: Fix regressions caused by resource_size_t overflow with 32-bit kernel
2015-07-16blk-mq: set default timeout as 30 secondsMing Lei1-1/+1
It is reasonable to set default timeout of request as 30 seconds instead of 30000 ticks, which may be 300 seconds if HZ is 100, for example, some arm64 based systems may choose 100 HZ. Signed-off-by: Ming Lei <ming.lei@canonical.com> Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()" Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-16Merge branch 'for-linus' of ↵Linus Torvalds2-1/+10
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull TPM bugfixes from James Morris. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: tpm, tpm_crb: fail when TPM2 ACPI table contents look corrupted tpm: Fix initialization of the cdev
2015-07-16Merge tag 'for-linus' of ↵Linus Torvalds36-269/+351
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma fixes from Doug Ledford: "Mainly fix-ups for the various 4.2 items" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (24 commits) IB/core: Destroy ocrdma_dev_id IDR on module exit IB/core: Destroy multcast_idr on module exit IB/mlx4: Optimize do_slave_init IB/mlx4: Fix memory leak in do_slave_init IB/mlx4: Optimize freeing of items on error unwind IB/mlx4: Fix use of flow-counters for process_mad IB/ipath: Convert use of __constant_<foo> to <foo> IB/ipoib: Set MTU to max allowed by mode when mode changes IB/ipoib: Scatter-Gather support in connected mode IB/ucm: Fix bitmap wrap when devnum > IB_UCM_MAX_DEVICES IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush IB/ucma: Fix lockdep warning in ucma_lock_files rds: rds_ib_device.refcount overflow RDMA/nes: Fix for incorrect recording of the MAC address RDMA/nes: Fix for resolving the neigh RDMA/core: Fixes for port mapper client registration IB/IPoIB: Fix bad error flow in ipoib_add_port() IB/mlx4: Do not attemp to report HCA clock offset on VFs IB/cm: Do not queue work to a device that's going away IB/srp: Avoid using uninitialized variable ...
2015-07-16NVMe: Reread partitions on metadata formatsKeith Busch1-2/+11
This patch has the driver automatically reread partitions if a namespace has a separate metadata format. Previously revalidating a disk was sufficient to get the correct capacity set on such formatted drives, but partitions that may exist would not have been surfaced. Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Tested-by: Paul Grabinar <paul.grabinar@ranbarg.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-15Merge tag 'locks-v4.2-1' of git://git.samba.org/jlayton/linuxLinus Torvalds3-40/+46
Pull file locking updates from Jeff Layton: "I had thought that I was going to get away without a pull request this cycle. There was a NFSv4 file locking problem that cropped up that I tried to fix in the NFSv4 code alone, but that fix has turned out to be problematic. These patches fix this in the correct way. Note that this touches some NFSv4 code as well. Ordinarily I'd wait for Trond to ACK this, but he's on holiday right now and the bug is rather nasty. So I suggest we merge this and if he raises issues with it we can sort it out when he gets back" Acked-by: Bruce Fields <bfields@fieldses.org> Acked-by: Dan Williams <dan.j.williams@intel.com> [ +1 to this series fixing a 100% reproducible slab corruption + general protection fault in my nfs-root test environment. - Dan ] Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> * tag 'locks-v4.2-1' of git://git.samba.org/jlayton/linux: locks: inline posix_lock_file_wait and flock_lock_file_wait nfs4: have do_vfs_lock take an inode pointer locks: new helpers - flock_lock_inode_wait and posix_lock_inode_wait locks: have flock_lock_file take an inode pointer instead of a filp Revert "nfs: take extra reference to fl->fl_file when running a LOCKU operation"
2015-07-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds10-21/+165
Pull KVM fixes from Paolo Bonzini: - Fix FPU refactoring ("kvm: x86: fix load xsave feature warning") - Fix eager FPU mode (Cc stable) - AMD bits of MTRR virtualization * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: x86: fix load xsave feature warning KVM: x86: apply guest MTRR virtualization on host reserved pages KVM: SVM: Sync g_pat with guest-written PAT value KVM: SVM: use NPT page attributes KVM: count number of assigned devices KVM: VMX: fix vmwrite to invalid VMCS KVM: x86: reintroduce kvm_is_mmio_pfn x86: hyperv: add CPUID bit for crash handlers
2015-07-15Merge tag 'arc-v4.2-rc3-fixes' of ↵Linus Torvalds16-57/+112
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fixes from Vineet Gupta: - Makefile changes (top-level+ARC) reinstates -O3 builds (regression since 3.16) - IDU intc related fixes, IRQ affinity - patch to make bitops safer for ARC - perf fix from Alexey to remove signed PC braino - Futex backend gets llock/scond support * tag 'arc-v4.2-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARCv2: support HS38 releases ARC: make sure instruction_pointer() returns unsigned value ARC: slightly refactor macros for boot logging ARC: Add llock/scond to futex backend arc:irqchip: prepare for drivers/irqchip/irqchip.h removal ARC: Make ARC bitops "safer" (add anti-optimization) ARCv2: [axs103] bump CPU frequency from 75 to 90 MHZ ARCv2: intc: IDU: Fix potential race in installing a chained IRQ handler ARCv2: intc: IDU: support irq affinity ARC: fix unused var wanring ARC: Don't memzero twice in dma_alloc_coherent for __GFP_ZERO ARC: Override toplevel default -O2 with -O3 kbuild: Allow arch Makefiles to override {cpp,ld,c}flags ARCv2: guard SLC DMA ops with spinlock ARC: Kconfig: better way to disable ARC_HAS_LLSC for ARC_CPU_750D
2015-07-15Merge branch 'for-linus' of ↵Linus Torvalds10-31/+87
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: "One improvement for the zcrypt driver, the quality attribute for the hwrng device has been missing. Without it the kernel entropy seeding will not happen automatically. And six bug fixes, the most important one is the fix for the vector register corruption due to machine checks" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/nmi: fix vector register corruption s390/process: fix sfpc inline assembly s390/dasd: fix kernel panic when alias is set offline s390/sclp: clear upper register halves in _sclp_print_early s390/oprofile: fix compile error s390/sclp: fix compile error s390/zcrypt: enable s390 hwrng to seed kernel entropy
2015-07-15jfs: clean up jfs_rename and fix out of order unlockDave Kleikamp1-14/+13
The end of jfs_rename(), which is also used by the error paths, included a call to IWRITE_UNLOCK(new_ip) after labels out1, out2 and out3. If we come in through these labels, IWRITE_LOCK() has not been called yet. In moving that call to the correct spot, I also moved some exceptional truncate code earlier as well, since the early error paths don't need to deal with it, and I renamed out4: to out_tx: so a future patch by Jan Kara doesn't need to deal with renumbering or confusing out-of-order labels. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2015-07-15Merge tag 'module-final-v4.2-rc1' of ↵Linus Torvalds2-78/+84
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull final init.h/module.h code relocation from Paul Gortmaker: "With the release of 4.2-rc2 done, we should not be seeing any new code added that gets upset by this small code move, and we've banked yet another complete week of testing with this move in place on top of 4.2-rc1 via linux-next to ensure that remained true. Given that, I'd like to put it in now so that people formulating new work for 4.3-rc1 will be exposed to the ever so slightly stricter (but sensible) requirements wrt. whether they are needing init.h vs. module.h macros, even if they are not using linux-next. The diffstat of the move is slightly asymmetrical due to needing to leave behind a couple #ifdef in the old location and add the same ones to the new location, but other than that, it is a 1:1 move, complete with the module_init/exit trailing semicolon that we can't fix. That is, until/unless someone does a tree-wide sed fix of all the approximately 800 currently in tree users relying on it" * tag 'module-final-v4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: module: relocate module_init from init.h to module.h
2015-07-15Merge tag 'trace-v4.2-rc1-fix' of ↵Linus Torvalds2-7/+11
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Fengguang Wu discovered a crash that happened to be because of the branch tracer (traces unlikely and likely branches) when enabled with certain debug options. What happened was that various debug options like lockdep and DEBUG_PREEMPT can cause parts of the branch tracer to recurse outside its recursion protection. In fact, part of its recursion protection used these features that caused the lockup. This cleans up the code a little and makes the recursion protection a bit more robust" * tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Have branch tracer use recursive field of task struct
2015-07-15Merge tag 'tpm-fixes-for-4.2-rc2' of ↵James Morris2-1/+10
https://github.com/PeterHuewe/linux-tpmdd into for-linus
2015-07-14intel_scu_ipc: move local memory initialization out of a mutexChristophe JAILLET1-3/+3
'{ }' and memset will both reset the cbuf buffer. Only once is enough and this can be done outside fo the mutex. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Darren Hart <dvhart@linux.intel.com>
2015-07-14IB/core: Destroy ocrdma_dev_id IDR on module exitJohannes Thumshirn1-0/+1
Destroy ocrdma_dev_id IDR on module exit, reclaiming the allocated memory. This was detected by the following semantic patch (written by Luis Rodriguez <mcgrof@suse.com>) <SmPL> @ defines_module_init @ declarer name module_init, module_exit; declarer name DEFINE_IDR; identifier init; @@ module_init(init); @ defines_module_exit @ identifier exit; @@ module_exit(exit); @ declares_idr depends on defines_module_init && defines_module_exit @ identifier idr; @@ DEFINE_IDR(idr); @ on_exit_calls_destroy depends on declares_idr && defines_module_exit @ identifier declares_idr.idr, defines_module_exit.exit; @@ exit(void) { ... idr_destroy(&idr); ... } @ missing_module_idr_destroy depends on declares_idr && defines_module_exit && !on_exit_calls_destroy @ identifier declares_idr.idr, defines_module_exit.exit; @@ exit(void) { ... +idr_destroy(&idr); } </SmPL> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/core: Destroy multcast_idr on module exitJohannes Thumshirn1-0/+1
Destroy multcast_idr on module exit, reclaiming the allocated memory. This was detected by the following semantic patch (written by Luis Rodriguez <mcgrof@suse.com>) <SmPL> @ defines_module_init @ declarer name module_init, module_exit; declarer name DEFINE_IDR; identifier init; @@ module_init(init); @ defines_module_exit @ identifier exit; @@ module_exit(exit); @ declares_idr depends on defines_module_init && defines_module_exit @ identifier idr; @@ DEFINE_IDR(idr); @ on_exit_calls_destroy depends on declares_idr && defines_module_exit @ identifier declares_idr.idr, defines_module_exit.exit; @@ exit(void) { ... idr_destroy(&idr); ... } @ missing_module_idr_destroy depends on declares_idr && defines_module_exit && !on_exit_calls_destroy @ identifier declares_idr.idr, defines_module_exit.exit; @@ exit(void) { ... +idr_destroy(&idr); } </SmPL> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/mlx4: Optimize do_slave_initDoug Ledford1-7/+9
There is little chance our memory allocation will fail, so we can combine initializing the work structs with allocating them instead of looping through all of them once to allocate and again to initialize. Then when we need to actually find out if our device is up or in the process of going down, have all of our work structs batched up, take the spin_lock once and only once, and do all of the batch under the one spin_lock invocation instead of incurring all of the locked memory cycles we would otherwise incur to take/release the spin_lock over and over again. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/mlx4: Fix memory leak in do_slave_initDoug Ledford1-0/+2
We create a number of work structs to be queued up to a workqueue, and on completion of the workqueue handler, the workqueue handler frees the allocated memory. If, however, we don't queue the work struct because the device is going down, then we need to free the memory ourselves. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/mlx4: Optimize freeing of items on error unwindManinder Singh1-5/+3
On failure, we loop through all possible pointers and test them before calling kfree. But really, why even attempt to free items we didn't allocate when we can easily loop through exactly and only the devices for which the original memory allocation succeeded and free just those. Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/mlx4: Fix use of flow-counters for process_madOr Gerlitz1-10/+19
For IB links, reading HCA flow counters through iboe_process_mad() should be used when mlx4_ib_process_mad() is invoked only for VFs PMA queries and exactly nothing else. Fixes: 7193a141eb74 ('IB/mlx4: Set VF to read from QP counters') Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/ipath: Convert use of __constant_<foo> to <foo>Vaishali Thakkar1-2/+2
In little endian cases, the macros be16_to_cpu and cpu_to_be64 unfolds to __swab{16,64} which provides special case for constants. In big endian cases, __constant_be16_to_cpu and be16_to_cpu expand directly to the same expression. The same applies for __constant_cpu_to_be64 and cpu_to_be64. So, replace __constant_be16_to_cpu with be16_to_cpu and __constant_cpu_to_be64 with cpu_to_be64, with the goal of getting rid of the definition of __constant_be16_to_cpu and __constant_cpu_to_be64 completely. Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/ipoib: Set MTU to max allowed by mode when mode changesErez Shitrit1-0/+1
When switching between modes (datagram / connected) change the MTU accordingly. datagram mode up to 4K, connected mode up to (64K - 0x10). Signed-off-by: ELi Cohen <eli@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/ipoib: Scatter-Gather support in connected modeYuval Shaia4-46/+54
By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance. This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages (order 5), which have to be contiguous. When the system memory under pressure, it was observed that allocating 128k contiguous physical memory is difficult and causes serious errors (such as system becomes unusable). This enhancement resolve the issue by removing the physically contiguous memory requirement using Scatter/Gather feature that exists in Linux stack. With this fix Scatter-Gather will be supported also in connected mode. This change reverts some of the change made in commit e112373fd6aa ("IPoIB/cm: Reduce connected mode TX object size"). The ability to use SG in IPoIB CM is possible because the coupling between NETIF_F_SG and NETIF_F_CSUM was removed in commit ec5f06156423 ("net: Kill link between CSUM and SG features.") Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Acked-by: Christian Marie <christian@ponies.io> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/ucm: Fix bitmap wrap when devnum > IB_UCM_MAX_DEVICESCarol L Soto1-2/+2
ib_ucm_release_dev clears the wrong bit if devnum is greater than IB_UCM_MAX_DEVICES. Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flushHaggai Eran1-6/+7
__ipoib_ib_dev_flush calls itself recursively on child devices, and lockdep complains about locking vlan_rwsem twice (see below). Use down_read_nested instead of down_read to prevent the warning. ============================================= [ INFO: possible recursive locking detected ] 4.1.0-rc4+ #36 Tainted: G O --------------------------------------------- kworker/u20:2/261 is trying to acquire lock: (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib] but task is already holding lock: (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&priv->vlan_rwsem); lock(&priv->vlan_rwsem); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by kworker/u20:2/261: #0: ("%s""ipoib_flush"){.+.+..}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760 #1: ((&priv->flush_heavy)){+.+...}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760 #2: (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib] stack backtrace: CPU: 3 PID: 261 Comm: kworker/u20:2 Tainted: G O 4.1.0-rc4+ #36 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 Workqueue: ipoib_flush ipoib_ib_dev_flush_heavy [ib_ipoib] ffff8801c6c54790 ffff8801c9927af8 ffffffff81665238 0000000000000001 ffffffff825b5b30 ffff8801c9927bd8 ffffffff810bba51 ffff880100000000 ffffffff00000001 ffff880100000001 ffff8801c6c55428 ffff8801c6c54790 Call Trace: [<ffffffff81665238>] dump_stack+0x4f/0x6f [<ffffffff810bba51>] __lock_acquire+0x741/0x1820 [<ffffffff810bcbf8>] lock_acquire+0xc8/0x240 [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib] [<ffffffff81669d2c>] down_read+0x4c/0x70 [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib] [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib] [<ffffffffa0791e4a>] __ipoib_ib_dev_flush+0x5a/0x2b0 [ib_ipoib] [<ffffffffa07920ba>] ipoib_ib_dev_flush_heavy+0x1a/0x20 [ib_ipoib] [<ffffffff81082871>] process_one_work+0x201/0x760 [<ffffffff810827cc>] ? process_one_work+0x15c/0x760 [<ffffffff81082ef0>] worker_thread+0x120/0x4d0 [<ffffffff81082dd0>] ? process_one_work+0x760/0x760 [<ffffffff81082dd0>] ? process_one_work+0x760/0x760 [<ffffffff81088b7e>] kthread+0xfe/0x120 [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70 [<ffffffff8166c6e2>] ret_from_fork+0x42/0x70 [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70 Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/ucma: Fix lockdep warning in ucma_lock_filesHaggai Eran1-2/+2
The ucma_lock_files() locks the mut mutex on two files, e.g. for migrating an ID. Use mutex_lock_nested() to prevent the warning below. ============================================= [ INFO: possible recursive locking detected ] 4.1.0-rc6-hmm+ #40 Tainted: G O --------------------------------------------- pingpong_rpc_se/10260 is trying to acquire lock: (&file->mut){+.+.+.}, at: [<ffffffffa047ac55>] ucma_migrate_id+0xc5/0x248 [rdma_ucm] but task is already holding lock: (&file->mut){+.+.+.}, at: [<ffffffffa047ac4b>] ucma_migrate_id+0xbb/0x248 [rdma_ucm] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&file->mut); lock(&file->mut); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by pingpong_rpc_se/10260: #0: (&file->mut){+.+.+.}, at: [<ffffffffa047ac4b>] ucma_migrate_id+0xbb/0x248 [rdma_ucm] stack backtrace: CPU: 0 PID: 10260 Comm: pingpong_rpc_se Tainted: G O 4.1.0-rc6-hmm+ #40 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 ffff8801f85b63d0 ffff880195677b58 ffffffff81668f49 0000000000000001 ffffffff825cbbe0 ffff880195677c38 ffffffff810bb991 ffff880100000000 ffff880100000000 ffff880100000001 ffff8801f85b7010 ffffffff8121bee9 Call Trace: [<ffffffff81668f49>] dump_stack+0x4f/0x6e [<ffffffff810bb991>] __lock_acquire+0x741/0x1820 [<ffffffff8121bee9>] ? dput+0x29/0x320 [<ffffffff810bcb38>] lock_acquire+0xc8/0x240 [<ffffffffa047ac55>] ? ucma_migrate_id+0xc5/0x248 [rdma_ucm] [<ffffffff8166b901>] ? mutex_lock_nested+0x291/0x3e0 [<ffffffff8166b6d5>] mutex_lock_nested+0x65/0x3e0 [<ffffffffa047ac55>] ? ucma_migrate_id+0xc5/0x248 [rdma_ucm] [<ffffffff810baeed>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8166b66e>] ? mutex_unlock+0xe/0x10 [<ffffffffa047ac55>] ucma_migrate_id+0xc5/0x248 [rdma_ucm] [<ffffffffa0478474>] ucma_write+0xa4/0xb0 [rdma_ucm] [<ffffffff81200674>] __vfs_write+0x34/0x100 [<ffffffff8112427c>] ? __audit_syscall_entry+0xac/0x110 [<ffffffff810ec055>] ? current_kernel_time+0xc5/0xe0 [<ffffffff812aa4d3>] ? security_file_permission+0x23/0x90 [<ffffffff8120088d>] ? rw_verify_area+0x5d/0xe0 [<ffffffff812009bb>] vfs_write+0xab/0x120 [<ffffffff81201519>] SyS_write+0x59/0xd0 [<ffffffff8112427c>] ? __audit_syscall_entry+0xac/0x110 [<ffffffff8166ffee>] system_call_fastpath+0x12/0x76 Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14rds: rds_ib_device.refcount overflowWengang Wang1-1/+3
Fixes: 3e0249f9c05c ("RDS/IB: add refcount tracking to struct rds_ib_device") There lacks a dropping on rds_ib_device.refcount in case rds_ib_alloc_fmr failed(mr pool running out). this lead to the refcount overflow. A complain in line 117(see following) is seen. From vmcore: s_ib_rdma_mr_pool_depleted is 2147485544 and rds_ibdev->refcount is -2147475448. That is the evidence the mr pool is used up. so rds_ib_alloc_fmr is very likely to return ERR_PTR(-EAGAIN). 115 void rds_ib_dev_put(struct rds_ib_device *rds_ibdev) 116 { 117 BUG_ON(atomic_read(&rds_ibdev->refcount) <= 0); 118 if (atomic_dec_and_test(&rds_ibdev->refcount)) 119 queue_work(rds_wq, &rds_ibdev->free_work); 120 } fix is to drop refcount when rds_ib_alloc_fmr failed. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14RDMA/nes: Fix for incorrect recording of the MAC addressTatyana Nikolova1-1/+1
Fix for incorrect recording of the MAC address Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14RDMA/nes: Fix for resolving the neighTatyana Nikolova1-2/+3
Neighbor resolution doesn't work without this fix Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14RDMA/core: Fixes for port mapper client registrationTatyana Nikolova3-25/+48
Fixes to allow clients to make remove mapping requests, after they have provided the user space service with the mapping information, they are using when the service is restarted. 1) Adding IWPM_REG_VALID, IWPM_REG_INCOMPL and IWPM_REG_UNDEF registration types for the port mapper clients and functions to set/check the registration type. 2) If the port mapper user space service is not available to register the client, then its registration stays IWPM_REG_UNDEF and the registration isn't checked until the service becomes available (no mappings are possible, if the user space service isn't running). 3) After the service is restarted, the user space port mapper pid is set to valid and the client registration is set to IWPM_REG_INCOMPL to allow the client to make remove mapping requests. Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/IPoIB: Fix bad error flow in ipoib_add_port()Amir Vadai1-2/+4
Error values of ib_query_port() and ib_query_device() weren't propagated correctly. Because of that, ipoib_add_port() could return NULL value, which escaped the IS_ERR() check in ipoib_add_one() and we crashed. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/mlx4: Do not attemp to report HCA clock offset on VFsMatan Barak1-5/+6
mlx4 VFs can provide CQE raw time-stamping services, but they don't have the hca core clock mapped to their PCI bars. As such, we should not attempt to query and report the clock offset to user space for VFs. Doing so causes query_device over VFs to fail with -ENOSUPP. Fixes: 4b664c4355b2 ('IB/mlx4: Add support for CQ time-stamping') Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/cm: Do not queue work to a device that's going awayErez Shitrit1-6/+55
Whenever ib_cm gets remove_one call, like when there is a hot-unplug event, the driver should mark itself as going_down and confirm that no new works are going to be queued for that device. so, the order of the actions are: 1. mark the going_down bit. 2. flush the wq. 3. [make sure no new works for that device.] 4. unregister mad agent. otherwise, works that are already queued can be scheduled after the mad agent was freed. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/srp: Avoid using uninitialized variableSagi Grimberg3-8/+7
We might return res which is not initialized. Also reduce code duplication by exporting srp_parse_tmo so srp_tmo_set can reuse it. Detected by Coverity. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jenny Falkovich <jennyf@mellanox.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/srpt: Convert use of __constant_cpu_to_beXX to cpu_to_beXXVaishali Thakkar1-36/+35
In little endian cases, the macro cpu_to_be{16,32,64} unfolds to __swab{16,32,64} which provides special case for constants. In big endian cases, __constant_cpu_to_be{16,32,64} and cpu_to_be{16,32,64} expand directly to the same expression. So, replace __constant_cpu_to_be{16,32,64} with cpu_to_be{16,32,64} with the goal of getting rid of the definitions of __constant_cpu_to_be{16,32,64} completely. The Coccinelle semantic patch that performs this transformation is as follows: @@expression x;@@ ( - __constant_cpu_to_be16(x) + cpu_to_be16(x) | - __constant_cpu_to_be32(x) + cpu_to_be32(x) | - __constant_cpu_to_be64(x) + cpu_to_be64(x) ) Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/mad: Remove improper use of BUG_ONIra Weiny7-14/+21
We recently added BUG_ON's which were inappropriate for a condition which should never happen. Change these to be WARN_ON_ONCE as a debugging aid. Fixes: 4cd7c9479aff ('IB/mad: Add support for additional MAD info to/from drivers') Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB/mad: Fix compare between big endian and cpu endianIra Weiny1-1/+1
The define OPA_LID_PERMISSIVE is big endian and was compared to the cpu endian variable opa_drslid. Problem caught by 0-day build infrastructure. Fixes: 8e4349d13f33 (IB/mad: Add final OPA MAD processing) Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: John, Jubin <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14IB: Add rdma_cap_ib_switch helper and use where appropriateHal Rosenstock11-90/+66
Persuant to Liran's comments on node_type on linux-rdma mailing list: In an effort to reform the RDMA core and ULPs to minimize use of node_type in struct ib_device, an additional bit is added to struct ib_device for is_switch (IB switch). This is needed to be initialized by any IB switch device driver. This is a NEW requirement on such device drivers which are all "out of tree". In addition, an ib_switch helper was added to ib_verbs.h based on the is_switch device bit rather than node_type (although those should be consistent). The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs) as well as (IPoIB and SRP) ULPs are updated where appropriate to use this new helper. In some cases, the helper is now used under the covers of using rdma_[start end]_port rather than the open coding previously used. Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Hal Rosenstock <hal@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14tpm, tpm_crb: fail when TPM2 ACPI table contents look corruptedJarkko Sakkinen1-0/+8
At least some versions of AMI BIOS have corrupted contents in the TPM2 ACPI table and namely the physical address of the control area is set to zero. This patch changes the driver to fail gracefully when we observe a zero address instead of continuing to ioremap. Cc: <stable@vger.kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Reviewed-by: Peter Huewe <peterhuewe@gmx.de> Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
2015-07-14tpm: Fix initialization of the cdevJason Gunthorpe1-1/+2
When a cdev is contained in a dynamic structure the cdev parent kobj should be set to the kobj that controls the lifetime of the enclosing structure. In TPM's case this is the embedded struct device. Also, cdev_init 0's the whole structure, so all sets must be after, not before. This fixes module ref counting and cdev. Cc: <stable@vger.kernel.org> Fixes: 313d21eeab92 ("tpm: device class for tpm") Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Peter Huewe <peterhuewe@gmx.de>