Age | Commit message (Collapse) | Author | Files | Lines |
|
The mkey_table xarray is touched by the reg_mr_callback() function which
is called from a hard irq. Thus all other uses of xa_lock must use the
_irq variants.
WARNING: inconsistent lock state
5.4.0-rc1 #12 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
python3/343 [HC0[0]:SC0[0]:HE1:SE1] takes:
ffff888182be1d40 (&(&xa->xa_lock)->rlock#3){?.-.}, at: xa_erase+0x12/0x30
{IN-HARDIRQ-W} state was registered at:
lock_acquire+0xe1/0x200
_raw_spin_lock_irqsave+0x35/0x50
reg_mr_callback+0x2dd/0x450 [mlx5_ib]
mlx5_cmd_exec_cb_handler+0x2c/0x70 [mlx5_core]
mlx5_cmd_comp_handler+0x355/0x840 [mlx5_core]
[..]
Possible unsafe locking scenario:
CPU0
----
lock(&(&xa->xa_lock)->rlock#3);
<Interrupt>
lock(&(&xa->xa_lock)->rlock#3);
*** DEADLOCK ***
2 locks held by python3/343:
#0: ffff88818eb4bd38 (&uverbs_dev->disassociate_srcu){....}, at: ib_uverbs_ioctl+0xe5/0x1e0 [ib_uverbs]
#1: ffff888176c76d38 (&file->hw_destroy_rwsem){++++}, at: uobj_destroy+0x2d/0x90 [ib_uverbs]
stack backtrace:
CPU: 3 PID: 343 Comm: python3 Not tainted 5.4.0-rc1 #12
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x86/0xca
print_usage_bug.cold.50+0x2e5/0x355
mark_lock+0x871/0xb50
? match_held_lock+0x20/0x250
? check_usage_forwards+0x240/0x240
__lock_acquire+0x7de/0x23a0
? __kasan_check_read+0x11/0x20
? mark_lock+0xae/0xb50
? mark_held_locks+0xb0/0xb0
? find_held_lock+0xca/0xf0
lock_acquire+0xe1/0x200
? xa_erase+0x12/0x30
_raw_spin_lock+0x2a/0x40
? xa_erase+0x12/0x30
xa_erase+0x12/0x30
mlx5_ib_dealloc_mw+0x55/0xa0 [mlx5_ib]
uverbs_dealloc_mw+0x3c/0x70 [ib_uverbs]
uverbs_free_mw+0x1a/0x20 [ib_uverbs]
destroy_hw_idr_uobject+0x49/0xa0 [ib_uverbs]
[..]
Fixes: 0417791536ae ("RDMA/mlx5: Add missing synchronize_srcu() for MW cases")
Link: https://lore.kernel.org/r/20191024234910.GA9038@ziepe.ca
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Before QP is closed it changes to ERROR state, when this happens
the QP was left with old rate limit that was already removed from
the table.
Fixes: 7d29f349a4b9 ("IB/mlx5: Properly adjust rate limit on QP state transitions")
Signed-off-by: Rafi Wiener <rafiw@mellanox.com>
Signed-off-by: Oleg Kuporosov <olegk@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20191002120243.16971-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
While MR uses live as the SRCU 'update', the MW case uses the xarray
directly, xa_erase() causes the MW to become inaccessible to the pagefault
thread.
Thus whenever a MW is removed from the xarray we must synchronize_srcu()
before freeing it.
This must be done before freeing the mkey as re-use of the mkey while the
pagefault thread is using the stale mkey is undesirable.
Add the missing synchronizes to MW and DEVX indirect mkey and delete the
bogus protection against double destroy in mlx5_core_destroy_mkey()
Fixes: 534fd7aac56a ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Fixes: 6aec21f6a832 ("IB/mlx5: Page faults handling infrastructure")
Link: https://lore.kernel.org/r/20191001153821.23621-7-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
live is used to signal to the pagefault thread that the MR is initialized
and ready for use. It should be after the umem is assigned and all other
setup is completed. This prevents races (at least) of the form:
CPU0 CPU1
mlx5_ib_alloc_implicit_mr()
implicit_mr_alloc()
live = 1
imr->umem = umem
num_pending_prefetch_inc()
if (live)
atomic_inc(num_pending_prefetch)
atomic_set(num_pending_prefetch,0) // Overwrites other thread's store
Further, live is being used with SRCU as the 'update' in an
acquire/release fashion, so it can not be read and written raw.
Move all live = 1's to after MR initialization is completed and use
smp_store_release/smp_load_acquire() for manipulating it.
Add a missing live = 0 when an implicit MR child is deleted, before
queuing work to do synchronize_srcu().
The barriers in update_odp_mr() were some broken attempt to create a
acquire/release, but were not even applied consistently and missed the
point, delete it as well.
Fixes: 6aec21f6a832 ("IB/mlx5: Page faults handling infrastructure")
Link: https://lore.kernel.org/r/20191001153821.23621-6-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
During destroy setting live = 0 and then synchronize_srcu() prevents
num_pending_prefetch from incrementing, and also, ensures that all work
holding that count is queued on the WQ. Testing before causes races of the
form:
CPU0 CPU1
dereg_mr()
mlx5_ib_advise_mr_prefetch()
srcu_read_lock()
num_pending_prefetch_inc()
if (!live)
live = 0
atomic_read() == 0
// skip flush_workqueue()
atomic_inc()
queue_work();
srcu_read_unlock()
WARN_ON(atomic_read()) // Fails
Swap the order so that the synchronize_srcu() prevents this.
Fixes: a6bc3875f176 ("IB/mlx5: Protect against prefetch of invalid MR")
Link: https://lore.kernel.org/r/20191001153821.23621-5-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This fixes a race of the form:
CPU0 CPU1
mlx5_ib_invalidate_range() mlx5_ib_invalidate_range()
// This one actually makes npages == 0
ib_umem_odp_unmap_dma_pages()
if (npages == 0 && !dying)
// This one does nothing
ib_umem_odp_unmap_dma_pages()
if (npages == 0 && !dying)
dying = 1;
dying = 1;
schedule_work(&umem_odp->work);
// Double schedule of the same work
schedule_work(&umem_odp->work); // BOOM
npages and dying must be read and written under the umem_mutex lock.
Since whenever ib_umem_odp_unmap_dma_pages() is called mlx5 must also call
mlx5_ib_update_xlt, and both need to be done in the same locking region,
hoist the lock out of unmap.
This avoids an expensive double critical section in
mlx5_ib_invalidate_range().
Fixes: 81713d3788d2 ("IB/mlx5: Add implicit MR support")
Link: https://lore.kernel.org/r/20191001153821.23621-4-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
mlx5_ib_update_xlt() must be protected against parallel free of the MR it
is accessing, also it must be called single threaded while updating the
HW. Otherwise we can have races of the form:
CPU0 CPU1
mlx5_ib_update_xlt()
mlx5_odp_populate_klm()
odp_lookup() == NULL
pklm = ZAP
implicit_mr_get_data()
implicit_mr_alloc()
<update interval tree>
mlx5_ib_update_xlt
mlx5_odp_populate_klm()
odp_lookup() != NULL
pklm = VALID
mlx5_ib_post_send_wait()
mlx5_ib_post_send_wait() // Replaces VALID with ZAP
This can be solved by putting both the SRCU and the umem_mutex lock around
every call to mlx5_ib_update_xlt(). This ensures that the content of the
interval tree relavent to mlx5_odp_populate_klm() (ie mr->parent == mr)
will not change while it is running, and thus the posted WRs to update the
KLM will always reflect the correct information.
The race above will resolve by either having CPU1 wait till CPU0 completes
the ZAP or CPU0 will run after the add and instead store VALID.
The pagefault path adding children already holds the umem_mutex and SRCU,
so the only missed lock is during MR destruction.
Fixes: 81713d3788d2 ("IB/mlx5: Add implicit MR support")
Link: https://lore.kernel.org/r/20191001153821.23621-3-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This code is completely broken, the umem of a ODP MR simply cannot be
discarded without a lot more locking, nor can an ODP mkey be blithely
destroyed via destroy_mkey().
Fixes: 6aec21f6a832 ("IB/mlx5: Page faults handling infrastructure")
Link: https://lore.kernel.org/r/20191001153821.23621-2-jgg@ziepe.ca
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Pull RDMA subsystem updates from Jason Gunthorpe:
"This cycle mainly saw lots of bug fixes and clean up code across the
core code and several drivers, few new functional changes were made.
- Many cleanup and bug fixes for hns
- Various small bug fixes and cleanups in hfi1, mlx5, usnic, qed,
bnxt_re, efa
- Share the query_port code between all the iWarp drivers
- General rework and cleanup of the ODP MR umem code to fit better
with the mmu notifier get/put scheme
- Support rdma netlink in non init_net name spaces
- mlx5 support for XRC devx and DC ODP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits)
RDMA: Fix double-free in srq creation error flow
RDMA/efa: Fix incorrect error print
IB/mlx5: Free mpi in mp_slave mode
IB/mlx5: Use the original address for the page during free_pages
RDMA/bnxt_re: Fix spelling mistake "missin_resp" -> "missing_resp"
RDMA/hns: Package operations of rq inline buffer into separate functions
RDMA/hns: Optimize cmd init and mode selection for hip08
IB/hfi1: Define variables as unsigned long to fix KASAN warning
IB/{rdmavt, hfi1, qib}: Add a counter for credit waits
IB/hfi1: Add traces for TID RDMA READ
RDMA/siw: Relax from kmap_atomic() use in TX path
IB/iser: Support up to 16MB data transfer in a single command
RDMA/siw: Fix page address mapping in TX path
RDMA: Fix goto target to release the allocated memory
RDMA/usnic: Avoid overly large buffers on stack
RDMA/odp: Add missing cast for 32 bit
RDMA/hns: Use devm_platform_ioremap_resource() to simplify code
Documentation/infiniband: update name of some functions
RDMA/cma: Fix false error message
RDMA/hns: Fix wrong assignment of qp_access_flags
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull hmm updates from Jason Gunthorpe:
"This is more cleanup and consolidation of the hmm APIs and the very
strongly related mmu_notifier interfaces. Many places across the tree
using these interfaces are touched in the process. Beyond that a
cleanup to the page walker API and a few memremap related changes
round out the series:
- General improvement of hmm_range_fault() and related APIs, more
documentation, bug fixes from testing, API simplification &
consolidation, and unused API removal
- Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
and make them internal kconfig selects
- Hoist a lot of code related to mmu notifier attachment out of
drivers by using a refcount get/put attachment idiom and remove the
convoluted mmu_notifier_unregister_no_release() and related APIs.
- General API improvement for the migrate_vma API and revision of its
only user in nouveau
- Annotate mmu_notifiers with lockdep and sleeping region debugging
Two series unrelated to HMM or mmu_notifiers came along due to
dependencies:
- Allow pagemap's memremap_pages family of APIs to work without
providing a struct device
- Make walk_page_range() and related use a constant structure for
function pointers"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
libnvdimm: Enable unit test infrastructure compile checks
mm, notifier: Catch sleeping/blocking for !blockable
kernel.h: Add non_block_start/end()
drm/radeon: guard against calling an unpaired radeon_mn_unregister()
csky: add missing brackets in a macro for tlb.h
pagewalk: use lockdep_assert_held for locking validation
pagewalk: separate function pointers from iterator data
mm: split out a new pagewalk.h header from mm.h
mm/mmu_notifiers: annotate with might_sleep()
mm/mmu_notifiers: prime lockdep
mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
mm/hmm: hmm_range_fault() infinite loop
mm/hmm: hmm_range_fault() NULL pointer bug
mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
mm/mmu_notifiers: remove unregister_no_release
RDMA/odp: remove ib_ucontext from ib_umem
RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
...
|
|
ib_add_slave_port() allocates a multiport struct but never frees it.
Don't leak memory, free the allocated mpi struct during driver unload.
Cc: <stable@vger.kernel.org>
Fixes: 32f69e4be269 ("{net, IB}/mlx5: Manage port association for multiport RoCE")
Link: https://lore.kernel.org/r/20190916064818.19823-3-leon@kernel.org
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The removal of 'buffer' in the patch below caused free_page() to use a
value that had been offset since the wqe pointer is adjusted while the
routine runs.
The current implementation of free_pages() rounds down to a pfn,
discarding the adjustment, but this is not the right way to use the
API. Preserve the initial value and use it for free_page().
Fixes: 0f51427bd097 ("RDMA/mlx5: Cleanup WQE page fault handler")
Link: https://lore.kernel.org/r/20190916064818.19823-2-leon@kernel.org
Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
To resolve dependencies in following patches
mlx5_ib.h conflict resolved by keeing both hunks
Linux 5.3-rc8
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add flow steering actions: modify header and packet reformat
to the fs_cmd shim layer. This allows each namespace to define
possibly different functionality for alloc/dealloc action commands.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Merge mlx5-next patches needed for upcoming mlx5 software steering.
1) Alex adds HW bits and definitions required for SW steering
2) Ariel moves device memory management to mlx5_core (From mlx5_ib)
3) Maor, Cleanups and fixups for eswitch mode and RoCE
4) Mark, Set only stag for match untagged packets
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
Move the device memory allocation and deallocation commands
SW ICM memory to mlx5_core to expose this API for all
mlx5_core users.
This comes as preparation for supporting SW steering in kernel
where it will be required to allocate and register device
memory for direct rule insertion.
In addition, an API to register this device memory for future
remote access operations is introduced using the create_mkey
commands.
Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
mlx5 HW spec and bits updates:
1) Aya exposes IP-in-IP capability in mlx5_core.
2) Maxim exposes lag tx port affinity capabilities.
3) Moshe adds VNIC_ENV internal rq counter bits.
4) ODP capabilities for DC transport
Misc updates:
5) Saeed, two compiler warnings cleanups
6) Add XRQ legacy commands opcodes
7) Use refcount_t for refcount
8) fix a -Wstringop-truncation warning
|
|
Michael Guralnik says:
====================
The series adds support for on-demand paging for DC transport.
As DC is a mlx-only transport, the capabilities are exposed to the user
using DEVX objects and later on through mlx5dv_query_device.
====================
Based on the mlx5-next branch from
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux for
dependencies
* branch 'mlx5-odp-dc':
IB/mlx5: Add page fault handler for DC initiator WQE
IB/mlx5: Remove check of FW capabilities in ODP page fault handling
net/mlx5: Set ODP capabilities for DC transport to max
|
|
Parsing DC initiator WQEs upon page fault requires skipping an address
vector segment, as in UD WQEs.
Link: https://lore.kernel.org/r/20190819120815.21225-4-leon@kernel.org
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
As page fault handling is initiated by FW, there is no need to check that
the ODP supports the operation and transport.
Link: https://lore.kernel.org/r/20190819120815.21225-3-leon@kernel.org
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Currently user applications can only steer TCP/IP(NIC RX/RX) traffic.
This patch adds RDMA_RX as a new flow type to allow the user to insert
steering rules to control RDMA traffic.
Two destinations are supported(but not set at the same time): devx
flow table object and QP.
Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190819113626.20284-4-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
This is a significant simplification, no extra list is kept per FD, and
the interval tree is now shared between all the ucontexts, reducing
overhead if there are multiple ucontexts active.
Link: https://lore.kernel.org/r/20190806231548.25242-7-jgg@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Jason Gunthorpe says:
====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================
The branch is based on v5.3-rc5 due to dependencies
* odp_fixes:
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
RDMA/core: Make invalidate_range a device operation
RDMA/odp: Use kvcalloc for the dma_list and page_list
RDMA/odp: Check for overflow when computing the umem_odp end
RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
RDMA/odp: Split creating a umem_odp from ib_umem_get
RDMA/odp: Make the three ways to create a umem_odp clear
RMDA/odp: Consolidate umem_odp initialization
RDMA/odp: Make it clearer when a umem is an implicit ODP umem
RDMA/odp: Iterate over the whole rbtree directly
RDMA/odp: Use the common interval tree library instead of generic
RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
These are the same thing since mr always comes from odp->private. It is
confusing to reference the same memory via two names.
Link: https://lore.kernel.org/r/20190819111710.18440-13-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
These are subtly different, the address is the original VA requested
during umem_get, while ib_umem_start() is the version that is rounded to
the proper page size, ie is the true start of the umem's dma map.
Link: https://lore.kernel.org/r/20190819111710.18440-12-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The callback function 'invalidate_range' is implemented in a driver so the
place for it is in the ib_device_ops structure and not in ib_ucontext.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Link: https://lore.kernel.org/r/20190819111710.18440-11-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Now that there are allocator APIs that return the ib_umem_odp directly
it should be freed through a umem_odp free'er as well.
Link: https://lore.kernel.org/r/20190819111710.18440-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This is the last creation API that is overloaded for both, there is very
little code sharing and a driver has to be specifically ready for a
umem_odp to be created to use the odp version.
Link: https://lore.kernel.org/r/20190819111710.18440-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The three paths to build the umem_odps are kind of muddled, they are:
- As a normal ib_mr umem
- As a child in an implicit ODP umem tree
- As the root of an implicit ODP umem tree
Only the first two are actually umem's, the last is an abuse.
The implicit case can only be triggered by explicit driver request, it
should never be co-mingled with the normal case. While we are here, make
sensible function names and add some comments to make this clearer.
Link: https://lore.kernel.org/r/20190819111710.18440-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Implicit ODP umems are special, they don't have any page lists, they don't
exist in the interval tree and they are never DMA mapped.
Instead of trying to guess this based on a zero length use an explicit
flag.
Further, do not allow non-implicit umems to be 0 size.
Link: https://lore.kernel.org/r/20190819111710.18440-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Instead of intersecting a full interval, just iterate over every element
directly. This is faster and clearer.
Link: https://lore.kernel.org/r/20190819111710.18440-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Check conditions that are mandatory to post_send UMR WQEs.
1. Modifying page size.
2. Modifying remote atomic permissions if atomic access is required.
If either condition is not fulfilled then fail to post_send() flow.
Fixes: c8d75a980fab ("IB/mlx5: Respect new UMR capabilities")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-9-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The UMR WQE in the MR re-registration flow requires that
modify_atomic and modify_entity_size capabilities are enabled.
Therefore, check that the these capabilities are present before going to
umr flow and go through slow path if not.
Fixes: c8d75a980fab ("IB/mlx5: Respect new UMR capabilities")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-8-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
ODP depends on the several device capabilities, among them is the ability
to send UMR WQEs with that modify atomic and entity size of the MR.
Therefore, only if all conditions to send such a UMR WQE are met then
driver can report that ODP is supported. Use this check of conditions
in all places where driver needs to know about ODP support.
Also, implicit ODP support depends on ability of driver to send UMR WQEs
for an indirect mkey. Therefore, verify that all conditions to do so are
met when reporting support.
Fixes: c8d75a980fab ("IB/mlx5: Respect new UMR capabilities")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-7-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Introduce helper function to unify various use_umr checks.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-6-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
When ODP is enabled with IB_ACCESS_HUGETLB then the required pages
should be calculated based on the extent of the MR, which is rounded
to the nearest huge page alignment.
Fixes: d2183c6f1958 ("RDMA/umem: Move page_shift from ib_umem to ib_odp_umem")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-5-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
NULL-ing notifier_call is performed under protection
of mlx5_ib_multiport_mutex lock. Such protection is
not easily spotted and better to be guarded by lockdep
annotation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190813102814.22350-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Expose missing XRQ legacy commands over the DEVX interface.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190808084358.29517-5-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Add two events that were defined in the device specification but were
not exposed in the driver list.
Post this patch those events can be read over the DEVX events interface
once be reported by the firmware.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Edward Srouji <edwards@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190808084358.29517-4-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Merging tip of mlx5-next in order to get changes related to adding
XRQ support to the DEVX interface needed prior to the following two
patches.
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Call to uverbs_close_fd() releases file pointer to 'ev_file' and
mlx5_ib_dev is going to be inaccessible. Cache pointer prior cleaning
resources to solve the KASAN warning below.
BUG: KASAN: use-after-free in devx_async_event_close+0x391/0x480 [mlx5_ib]
Read of size 8 at addr ffff888301e3cec0 by task devx_direct_tes/4631
CPU: 1 PID: 4631 Comm: devx_direct_tes Tainted: G OE 5.3.0-rc1-for-upstream-dbg-2019-07-26_01-19-56-93 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0x1e2/0x400
? devx_async_event_close+0x391/0x480 [mlx5_ib]
__kasan_report+0x15c/0x1df
? devx_async_event_close+0x391/0x480 [mlx5_ib]
kasan_report+0xe/0x20
devx_async_event_close+0x391/0x480 [mlx5_ib]
__fput+0x26a/0x7b0
task_work_run+0x10d/0x180
exit_to_usermode_loop+0x137/0x160
do_syscall_64+0x3c7/0x490
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f5df907d664
Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f
80 00 00 00 00 8b 05 6a cd 20 00 48 63 ff 85 c0 75 13 b8
03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 44 f3 c3 66 90
48 83 ec 18 48 89 7c 24 08 e8
RSP: 002b:00007ffd353cb958 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 000056017a88c348 RCX: 00007f5df907d664
RDX: 00007f5df969d400 RSI: 00007f5de8f1ec90 RDI: 0000000000000006
RBP: 00007f5df9681dc0 R08: 00007f5de8736410 R09: 000056017a9d2dd0
R10: 000000000000000b R11: 0000000000000246 R12: 00007f5de899d7d0
R13: 00007f5df96c4248 R14: 00007f5de8f1ecb0 R15: 000056017ae41308
Allocated by task 4631:
save_stack+0x19/0x80
kasan_kmalloc.constprop.3+0xa0/0xd0
alloc_uobj+0x71/0x230 [ib_uverbs]
alloc_begin_fd_uobject+0x2e/0xc0 [ib_uverbs]
rdma_alloc_begin_uobject+0x96/0x140 [ib_uverbs]
ib_uverbs_run_method+0xdf0/0x1940 [ib_uverbs]
ib_uverbs_cmd_verbs+0x57e/0xdb0 [ib_uverbs]
ib_uverbs_ioctl+0x177/0x260 [ib_uverbs]
do_vfs_ioctl+0x18f/0x1010
ksys_ioctl+0x70/0x80
__x64_sys_ioctl+0x6f/0xb0
do_syscall_64+0x95/0x490
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 4631:
save_stack+0x19/0x80
__kasan_slab_free+0x11d/0x160
slab_free_freelist_hook+0x67/0x1a0
kfree+0xb9/0x2a0
uverbs_close_fd+0x118/0x1c0 [ib_uverbs]
devx_async_event_close+0x28a/0x480 [mlx5_ib]
__fput+0x26a/0x7b0
task_work_run+0x10d/0x180
exit_to_usermode_loop+0x137/0x160
do_syscall_64+0x3c7/0x490
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff888301e3cda8
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 280 bytes inside of 512-byte region
[ffff888301e3cda8, ffff888301e3cfa8)
The buggy address belongs to the page:
page:ffffea000c078e00 refcount:1 mapcount:0
mapping:ffff888352811300 index:0x0 compound_mapcount: 0
flags: 0x2fffff80010200(slab|head)
raw: 002fffff80010200 ffffea000d152608 ffffea000c077808 ffff888352811300
raw: 0000000000000000 0000000000250025 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888301e3cd80: fc fc fc fc fc fb fb fb fb fb fb fb fb fb fb fb
ffff888301e3ce00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888301e3ce80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888301e3cf00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888301e3cf80: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
Disabling lock debugging due to kernel taint
Cc: <stable@vger.kernel.org> # 5.2
Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Link: https://lore.kernel.org/r/20190808081538.28772-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
In order to improve readability, add ib_port_phys_state enum to replace
the use of magic numbers.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Andrew Boyer <aboyer@tobark.org>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20190807103138.17219-2-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The code accidentally checks "event_sub" instead of "event_sub->eventfd".
Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190807123236.GA11452@mwanda
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Once implicit MR is being called to be released by
ib_umem_notifier_release() its leaves were marked as "dying".
However, when dereg_mr()->mlx5_ib_free_implicit_mr()->mr_leaf_free() is
called, it skips running the mr_leaf_free_action (i.e. umem_odp->work)
when those leaves were marked as "dying".
As such ib_umem_release() for the leaves won't be called and their MRs
will be leaked as well.
When an application exits/killed without calling dereg_mr we might hit the
above flow.
This fatal scenario is reported by WARN_ON() upon
mlx5_ib_dealloc_ucontext() as ibcontext->per_mm_list is not empty, the
call trace can be seen below.
Originally the "dying" mark as part of ib_umem_notifier_release() was
introduced to prevent pagefault_mr() from returning a success response
once this happened. However, we already have today the completion
mechanism so no need for that in those flows any more. Even in case a
success response will be returned the firmware will not find the pages and
an error will be returned in the following call as a released mm will
cause ib_umem_odp_map_dma_pages() to permanently fail mmget_not_zero().
Fix the above issue by dropping the "dying" from the above flows. The
other flows that are using "dying" are still needed it for their
synchronization purposes.
WARNING: CPU: 1 PID: 7218 at
drivers/infiniband/hw/mlx5/main.c:2004
mlx5_ib_dealloc_ucontext+0x84/0x90 [mlx5_ib]
CPU: 1 PID: 7218 Comm: ibv_rc_pingpong Tainted: G E
5.2.0-rc6+ #13
Call Trace:
uverbs_destroy_ufile_hw+0xb5/0x120 [ib_uverbs]
ib_uverbs_close+0x1f/0x80 [ib_uverbs]
__fput+0xbe/0x250
task_work_run+0x88/0xa0
do_exit+0x2cb/0xc30
? __fput+0x14b/0x250
do_group_exit+0x39/0xb0
get_signal+0x191/0x920
? _raw_spin_unlock_bh+0xa/0x20
? inet_csk_accept+0x229/0x2f0
do_signal+0x36/0x5e0
? put_unused_fd+0x5b/0x70
? __sys_accept4+0x1a6/0x1e0
? inet_hash+0x35/0x40
? release_sock+0x43/0x90
? _raw_spin_unlock_bh+0xa/0x20
? inet_listen+0x9f/0x120
exit_to_usermode_loop+0x5c/0xc6
do_syscall_64+0x182/0x1b0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 81713d3788d2 ("IB/mlx5: Add implicit MR support")
Link: https://lore.kernel.org/r/20190805083010.21777-1-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Reference counters are preferred to use refcount_t instead of
atomic_t.
This is because the implementation of refcount_t can prevent
overflows and detect possible use-after-free.
So convert atomic_t ref counters to refcount_t.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
The "MLX5_CMD_OP_QUERY_LAG" is one of the DEVX general commands, add it.
Fixes: 8aa8c95ce4cc ("IB/mlx5: Add support for DEVX general command")
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
|
The below kernel panic was observed when created bond mode LACP
with GRE tunnel on top. The reason to it was not released spinlock
during mlx5 notify unregsiter sequence.
[ 234.562007] BUG: scheduling while atomic: sh/10900/0x00000002
[ 234.563005] Preemption disabled at:
[ 234.566864] ------------[ cut here ]------------
[ 234.567120] DEBUG_LOCKS_WARN_ON(val > preempt_count())
[ 234.567139] WARNING: CPU: 16 PID: 10900 at kernel/sched/core.c:3203 preempt_count_sub+0xca/0x170
[ 234.569550] CPU: 16 PID: 10900 Comm: sh Tainted: G W 5.2.0-rc1-for-linust-dbg-2019-05-25_04-57-33-60 #1
[ 234.569886] Hardware name: Dell Inc. PowerEdge R720/0X3D66, BIOS 2.6.1 02/12/2018
[ 234.570183] RIP: 0010:preempt_count_sub+0xca/0x170
[ 234.570404] Code: 03 38
d0 7c 08 84 d2 0f 85 b0 00 00 00 8b 15 dd 02 03 04 85 d2 75 ba 48 c7 c6
00 e1 88 83 48 c7 c7 40 e1 88 83 e8 76 11 f7 ff <0f> 0b 5b c3 65 8b 05
d3 1f d8 7e 84 c0 75 82 e8 62 c3 c3 00 85 c0
[ 234.570911] RSP: 0018:ffff888b94477b08 EFLAGS: 00010286
[ 234.571133] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 234.571391] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000246
[ 234.571648] RBP: ffff888ba5560000 R08: fffffbfff08962d5 R09: fffffbfff08962d5
[ 234.571902] R10: 0000000000000001 R11: fffffbfff08962d4 R12: ffff888bac6e9548
[ 234.572157] R13: ffff888babfaf728 R14: ffff888bac6e9568 R15: ffff888babfaf750
[ 234.572412] FS: 00007fcafa59b740(0000) GS:ffff888bed200000(0000) knlGS:0000000000000000
[ 234.572686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 234.572914] CR2: 00007f984f16b140 CR3: 0000000b2bf0a001 CR4: 00000000001606e0
[ 234.573172] Call Trace:
[ 234.573336] _raw_spin_unlock+0x2e/0x50
[ 234.573542] mlx5_ib_unbind_slave_port+0x1bc/0x690 [mlx5_ib]
[ 234.573793] mlx5_ib_cleanup_multiport_master+0x1d3/0x660 [mlx5_ib]
[ 234.574039] mlx5_ib_stage_init_cleanup+0x4c/0x360 [mlx5_ib]
[ 234.574271] ? kfree+0xf5/0x2f0
[ 234.574465] __mlx5_ib_remove+0x61/0xd0 [mlx5_ib]
[ 234.574688] ? __mlx5_ib_remove+0xd0/0xd0 [mlx5_ib]
[ 234.574951] mlx5_remove_device+0x234/0x300 [mlx5_core]
[ 234.575224] mlx5_unregister_device+0x4d/0x1e0 [mlx5_core]
[ 234.575493] remove_one+0x4f/0x160 [mlx5_core]
[ 234.575704] pci_device_remove+0xef/0x2a0
[ 234.581407] ? pcibios_free_irq+0x10/0x10
[ 234.587143] ? up_read+0xc1/0x260
[ 234.592785] device_release_driver_internal+0x1ab/0x430
[ 234.598442] unbind_store+0x152/0x200
[ 234.604064] ? sysfs_kf_write+0x3b/0x180
[ 234.609441] ? sysfs_file_ops+0x160/0x160
[ 234.615021] kernfs_fop_write+0x277/0x440
[ 234.620288] ? __sb_start_write+0x1ef/0x2c0
[ 234.625512] vfs_write+0x15e/0x460
[ 234.630786] ksys_write+0x156/0x1e0
[ 234.635988] ? __ia32_sys_read+0xb0/0xb0
[ 234.641120] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 234.646163] do_syscall_64+0x95/0x470
[ 234.651106] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 234.656004] RIP: 0033:0x7fcaf9c9cfd0
[ 234.660686] Code: 73 01
c3 48 8b 0d c0 6e 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00
83 3d cd cf 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73
31 c3 48 83 ec 08 e8 ee cb 01 00 48 89 04 24
[ 234.670128] RSP: 002b:00007ffd3b01ddd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 234.674811] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fcaf9c9cfd0
[ 234.679387] RDX: 000000000000000d RSI: 00007fcafa5c1000 RDI: 0000000000000001
[ 234.683848] RBP: 00007fcafa5c1000 R08: 000000000000000a R09: 00007fcafa59b740
[ 234.688167] R10: 00007ffd3b01d8e0 R11: 0000000000000246 R12: 00007fcaf9f75400
[ 234.692386] R13: 000000000000000d R14: 0000000000000001 R15: 0000000000000000
[ 234.696495] irq event stamp: 153067
[ 234.700525] hardirqs last enabled at (153067): [<ffffffff83258c39>] _raw_spin_unlock_irqrestore+0x59/0x70
[ 234.704665] hardirqs last disabled at (153066): [<ffffffff83259382>] _raw_spin_lock_irqsave+0x22/0x90
[ 234.708722] softirqs last enabled at (153058): [<ffffffff836006c5>] __do_softirq+0x6c5/0xb4e
[ 234.712673] softirqs last disabled at (153051): [<ffffffff81227c1d>] irq_exit+0x17d/0x1d0
[ 234.716601] ---[ end trace 5dbf096843ee9ce6 ]---
Fixes: df097a278c75 ("IB/mlx5: Use the new mlx5 core notifier API")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731083852.584-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Driver shouldn't allow to use UMR to register a MR when
umr_modify_atomic_disabled is set. Otherwise it will always end up with a
failure in the post send flow which sets the UMR WQE to modify atomic access
right.
Fixes: c8d75a980fab ("IB/mlx5: Respect new UMR capabilities")
Signed-off-by: Guy Levi <guyle@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731081929.32559-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Delete DEBUG ODP dead code which is leftover from development
stage and doesn't need to be part of the upstream kernel.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731115627.5433-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
The fix for IB port statistics initialization ("IB/core: Fix querying
total rdma stats") is needed before we take a follow-on patch to
for-next.
Signed-off-by: Doug Ledford <dledford@redhat.com>
|