Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 779e0bf47632c609c59f527f9711ecd3214dccb0 ]
In procedure ib_register_device, procedure kobject_uevent is called
(advertising that the device is ready for userspace usage) even when
device_enable_and_get() returned an error.
As a result, various RDMA modules attempted to register for the device
even while the device driver was preparing to unregister the device.
Fix this by advertising the device availability only after enabling the
device succeeds.
Fixes: e7a5b4aafd82 ("RDMA/device: Don't fire uevent before device is fully initialized")
Link: https://lore.kernel.org/r/20201208073545.9723-3-leon@kernel.org
Suggested-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0cb42c0265837fafa2b4f302c8a7fed2631d7869 ]
ib_unregister_device_queued() can only be used by drivers using the new
dealloc_device callback flow, and it has a safety WARN_ON to ensure
drivers are using it properly.
However, if unregister and register are raced there is a special
destruction path that maintains the uniform error handling semantic of
'caller does ib_dealloc_device() on failure'. This requires disabling the
dealloc_device callback which triggers the WARN_ON.
Instead of using NULL to disable the callback use a special function
pointer so the WARN_ON does not trigger.
Fixes: d0899892edd0 ("RDMA/device: Provide APIs from the core code to help unregistration")
Link: https://lore.kernel.org/r/0-v1-a36d512e0a99+762-syz_dealloc_driver_jgg@nvidia.com
Reported-by: syzbot+4088ed905e4ae2b0e13b@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f2f2b3bbf0d9f8d090b9a019679223b2bd1c66c4 upstream.
If name memory allocation fails the name will be left empty and
device_add_one() will crash:
kobject: (0000000004952746): attempted to be registered with empty name!
WARNING: CPU: 0 PID: 329 at lib/kobject.c:234 kobject_add_internal+0x7ac/0x9a0 lib/kobject.c:234
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 329 Comm: syz-executor.5 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x197/0x210 lib/dump_stack.c:118
panic+0x2e3/0x75c kernel/panic.c:221
__warn.cold+0x2f/0x3e kernel/panic.c:582
report_bug+0x289/0x300 lib/bug.c:195
fixup_bug arch/x86/kernel/traps.c:174 [inline]
fixup_bug arch/x86/kernel/traps.c:169 [inline]
do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
RIP: 0010:kobject_add_internal+0x7ac/0x9a0 lib/kobject.c:234
Code: 1a 98 ca f9 e9 f0 f8 ff ff 4c 89 f7 e8 6d 98 ca f9 e9 95 f9 ff ff e8 c3 f0 8b f9 4c 89 e6 48 c7 c7 a0 0e 1a 89 e8 e3 41 5c f9 <0f> 0b 41 bd ea ff ff ff e9 52 ff ff ff e8 a2 f0 8b f9 0f 0b e8 9b
RSP: 0018:ffffc90005b27908 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000040000 RSI: ffffffff815eae46 RDI: fffff52000b64f13
RBP: ffffc90005b27960 R08: ffff88805aeba480 R09: ffffed1015d06659
R10: ffffed1015d06658 R11: ffff8880ae8332c7 R12: ffff8880a37fd000
R13: 0000000000000000 R14: ffff888096691780 R15: 0000000000000001
kobject_add_varg lib/kobject.c:390 [inline]
kobject_add+0x150/0x1c0 lib/kobject.c:442
device_add+0x3be/0x1d00 drivers/base/core.c:2412
add_one_compat_dev drivers/infiniband/core/device.c:901 [inline]
add_one_compat_dev+0x46a/0x7e0 drivers/infiniband/core/device.c:857
rdma_dev_init_net+0x2eb/0x490 drivers/infiniband/core/device.c:1120
ops_init+0xb3/0x420 net/core/net_namespace.c:137
setup_net+0x2d5/0x8b0 net/core/net_namespace.c:327
copy_net_ns+0x29e/0x5a0 net/core/net_namespace.c:468
create_new_namespaces+0x403/0xb50 kernel/nsproxy.c:108
unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:229
ksys_unshare+0x444/0x980 kernel/fork.c:2955
__do_sys_unshare kernel/fork.c:3023 [inline]
__se_sys_unshare kernel/fork.c:3021 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:3021
do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Link: https://lore.kernel.org/r/20200309193200.GA10633@ziepe.ca
Cc: stable@kernel.org
Fixes: 4e0f7b907072 ("RDMA/core: Implement compat device/sysfs tree in net namespace")
Reported-by: syzbot+ab4dae63f7d310641ded@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6b57cea9221b0247ad5111b348522625e489a8e4 ]
Currently when the low level driver notifies Pkey, GID, and port change
events they are notified to the registered handlers in the order they are
registered.
IB core and other ULPs such as IPoIB are interested in GID, LID, Pkey
change events.
Since all GID queries done by ULPs are serviced by IB core, and the IB
core deferes cache updates to a work queue, it is possible for other
clients to see stale cache data when they handle their own events.
For example, the below call tree shows how ipoib will call
rdma_query_gid() concurrently with the update to the cache sitting in the
WQ.
mlx5_ib_handle_event()
ib_dispatch_event()
ib_cache_event()
queue_work() -> slow cache update
[..]
ipoib_event()
queue_work()
[..]
work handler
ipoib_ib_dev_flush_light()
__ipoib_ib_dev_flush()
ipoib_dev_addr_changed_valid()
rdma_query_gid() <- Returns old GID, cache not updated.
Move all the event dispatch to a work queue so that the cache update is
always done before any clients are notified.
Fixes: f35faa4ba956 ("IB/core: Simplify ib_query_gid to always refer to cache")
Link: https://lore.kernel.org/r/20191212113024.336702-3-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 55bfe905fa97633438c13fb029aed85371d85480 ]
Improve return code from ib_modify_port() by doing the following:
- Use "-EOPNOTSUPP" instead "-ENOSYS" which is the proper return code
- Allow only fake IB_PORT_CM_SUP manipulation for RoCE providers that
didn't implement the modify_port callback, otherwise return
"-EOPNOTSUPP"
Fixes: 61e0962d5221 ("IB: Avoid ib_modify_port() failure for RoCE devices")
Link: https://lore.kernel.org/r/20191028155931.1114-2-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c9121262d57b8a3be4f08073546436ba0128ca6a ]
The dma_set_max_seg_size() call in setup_dma_device() does not have any
effect since device->dev.dma_parms is NULL. Fix this by initializing
device->dev.dma_parms first.
Link: https://lore.kernel.org/r/20191025225830.257535-5-bvanassche@acm.org
Fixes: d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
When rdmacm module is not loaded, and when netlink message is received to
get char device info, it results into a deadlock due to recursive locking
of rdma_nl_mutex with the below call sequence.
[..]
rdma_nl_rcv()
mutex_lock()
[..]
rdma_nl_rcv_msg()
ib_get_client_nl_info()
request_module()
iw_cm_init()
rdma_nl_register()
mutex_lock(); <- Deadlock, acquiring mutex again
Due to above call sequence, following call trace and deadlock is observed.
kernel: __mutex_lock+0x35e/0x860
kernel: ? __mutex_lock+0x129/0x860
kernel: ? rdma_nl_register+0x1a/0x90 [ib_core]
kernel: rdma_nl_register+0x1a/0x90 [ib_core]
kernel: ? 0xffffffffc029b000
kernel: iw_cm_init+0x34/0x1000 [iw_cm]
kernel: do_one_initcall+0x67/0x2d4
kernel: ? kmem_cache_alloc_trace+0x1ec/0x2a0
kernel: do_init_module+0x5a/0x223
kernel: load_module+0x1998/0x1e10
kernel: ? __symbol_put+0x60/0x60
kernel: __do_sys_finit_module+0x94/0xe0
kernel: do_syscall_64+0x5a/0x270
kernel: entry_SYSCALL_64_after_hwframe+0x49/0xbe
process stack trace:
[<0>] __request_module+0x1c9/0x460
[<0>] ib_get_client_nl_info+0x5e/0xb0 [ib_core]
[<0>] nldev_get_chardev+0x1ac/0x320 [ib_core]
[<0>] rdma_nl_rcv_msg+0xeb/0x1d0 [ib_core]
[<0>] rdma_nl_rcv+0xcd/0x120 [ib_core]
[<0>] netlink_unicast+0x179/0x220
[<0>] netlink_sendmsg+0x2f6/0x3f0
[<0>] sock_sendmsg+0x30/0x40
[<0>] ___sys_sendmsg+0x27a/0x290
[<0>] __sys_sendmsg+0x58/0xa0
[<0>] do_syscall_64+0x5a/0x270
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
To overcome this deadlock and to allow multiple netlink messages to
progress in parallel, following scheme is implemented.
1. Split the lock protecting the cb_table into a per-index lock, and make
it a rwlock. This lock is used to ensure no callbacks are running after
unregistration returns. Since a module will not be registered once it
is already running callbacks, this avoids the deadlock.
2. Use smp_store_release() to update the cb_table during registration so
that no lock is required. This avoids lockdep problems with thinking
all the rwsems are the same lock class.
Fixes: 0e2d00eb6fd45 ("RDMA: Add NLDEV_GET_CHARDEV to allow char dev discovery and autoload")
Link: https://lore.kernel.org/r/20191015080733.18625-1-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
iwarp_query_port
If an iWARP driver is probed and removed while there are no ips set for
the device, it will lead to a reference count leak on the inet device of
the netdevice.
In addition, the netdevice was accessed after already calling netdev_put,
which could lead to using the netdev after already freed.
Fixes: 4929116bdf72 ("RDMA/core: Add common iWARP query port")
Link: https://lore.kernel.org/r/20190925123332.10746-1-michal.kalderon@marvell.com
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Jason Gunthorpe says:
====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================
The branch is based on v5.3-rc5 due to dependencies
* odp_fixes:
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
RDMA/core: Make invalidate_range a device operation
RDMA/odp: Use kvcalloc for the dma_list and page_list
RDMA/odp: Check for overflow when computing the umem_odp end
RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
RDMA/odp: Split creating a umem_odp from ib_umem_get
RDMA/odp: Make the three ways to create a umem_odp clear
RMDA/odp: Consolidate umem_odp initialization
RDMA/odp: Make it clearer when a umem is an implicit ODP umem
RDMA/odp: Iterate over the whole rbtree directly
RDMA/odp: Use the common interval tree library instead of generic
RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The callback function 'invalidate_range' is implemented in a driver so the
place for it is in the ib_device_ops structure and not in ib_ucontext.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Link: https://lore.kernel.org/r/20190819111710.18440-11-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add support for a common iWARP query port function, the new function
includes a common code that is used by the iWARP devices to update the
port attributes like max_mtu, active_mtu, state, and phys_state, the
function also includes a call for the driver-specific query_port callback
to query the device-specific port attributes.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Link: https://lore.kernel.org/r/20190807103138.17219-4-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Due to the complexity of client->remove() callbacks it is desirable to not
hold any locks while calling them. Remove the last one by tracking only
the highest client ID and running backwards from there over the xarray.
Since the only purpose of that lock was to protect the linked list, we can
drop the lock.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731081841.32345-3-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
lockdep reports:
WARNING: possible circular locking dependency detected
modprobe/302 is trying to acquire lock:
0000000007c8919c ((wq_completion)ib_cm){+.+.}, at: flush_workqueue+0xdf/0x990
but task is already holding lock:
000000002d3d2ca9 (&device->client_data_rwsem){++++}, at: remove_client_context+0x79/0xd0 [ib_core]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&device->client_data_rwsem){++++}:
down_read+0x3f/0x160
ib_get_net_dev_by_params+0xd5/0x200 [ib_core]
cma_ib_req_handler+0x5f6/0x2090 [rdma_cm]
cm_process_work+0x29/0x110 [ib_cm]
cm_req_handler+0x10f5/0x1c00 [ib_cm]
cm_work_handler+0x54c/0x311d [ib_cm]
process_one_work+0x4aa/0xa30
worker_thread+0x62/0x5b0
kthread+0x1ca/0x1f0
ret_from_fork+0x24/0x30
-> #1 ((work_completion)(&(&work->work)->work)){+.+.}:
process_one_work+0x45f/0xa30
worker_thread+0x62/0x5b0
kthread+0x1ca/0x1f0
ret_from_fork+0x24/0x30
-> #0 ((wq_completion)ib_cm){+.+.}:
lock_acquire+0xc8/0x1d0
flush_workqueue+0x102/0x990
cm_remove_one+0x30e/0x3c0 [ib_cm]
remove_client_context+0x94/0xd0 [ib_core]
disable_device+0x10a/0x1f0 [ib_core]
__ib_unregister_device+0x5a/0xe0 [ib_core]
ib_unregister_device+0x21/0x30 [ib_core]
mlx5_ib_stage_ib_reg_cleanup+0x9/0x10 [mlx5_ib]
__mlx5_ib_remove+0x3d/0x70 [mlx5_ib]
mlx5_ib_remove+0x12e/0x140 [mlx5_ib]
mlx5_remove_device+0x144/0x150 [mlx5_core]
mlx5_unregister_interface+0x3f/0xf0 [mlx5_core]
mlx5_ib_cleanup+0x10/0x3a [mlx5_ib]
__x64_sys_delete_module+0x227/0x350
do_syscall_64+0xc3/0x6a4
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Which is due to the read side of the client_data_rwsem being obtained
recursively through a work queue flush during cm client removal.
The lock is being held across the remove in remove_client_context() so
that the function is a fence, once it returns the client is removed. This
is required so that the two callers do not proceed with destruction until
the client completes removal.
Instead of using client_data_rwsem use the existing device unregistration
refcount and add a similar client unregistration (client->uses) refcount.
This will fence the two unregistration paths without holding any locks.
Cc: <stable@vger.kernel.org>
Fixes: 921eab1143aa ("RDMA/devices: Re-organize device.c locking")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731081841.32345-2-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Now that IB core supports RDMA device binding with specific net namespace,
enable IB core to accept netlink commands in non init_net namespaces.
This is done by having per net namespace netlink socket.
At present only netlink device handling client RDMA_NL_NLDEV supports
device handling in multiple net namespaces. Hence do not accept netlink
messages for other clients in non init_net net namespaces.
Link: https://lore.kernel.org/r/20190723070205.6247-1-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
While compiled with CONFIG_DEBUG_MUTEXES, the kernel ensures that mutex is
not held during destroy. Hence add mutex_destroy() for mutexes used in
RDMA modules.
Link: https://lore.kernel.org/r/20190723065733.4899-2-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Pull rdma updates from Jason Gunthorpe:
"A smaller cycle this time. Notably we see another new driver, 'Soft
iWarp', and the deletion of an ancient unused driver for nes.
- Revise and simplify the signature offload RDMA MR APIs
- More progress on hoisting object allocation boiler plate code out
of the drivers
- Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib,
i40iw
- Tree wide cleanups: struct_size, put_user_page, xarray, rst doc
conversion
- Removal of obsolete ib_ucm chardev and nes driver
- netlink based discovery of chardevs and autoloading of the modules
providing them
- Move more of the rdamvt/hfi1 uapi to include/uapi/rdma
- New driver 'siw' for software based iWarp running on top of netdev,
much like rxe's software RoCE.
- mlx5 feature to report events in their raw devx format to userspace
- Expose per-object counters through rdma tool
- Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core
from netdev"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (194 commits)
RMDA/siw: Require a 64 bit arch
RDMA/siw: Mark expected switch fall-throughs
RDMA/core: Fix -Wunused-const-variable warnings
rdma/siw: Remove set but not used variable 's'
rdma/siw: Add missing dependencies on LIBCRC32C and DMA_VIRT_OPS
RDMA/siw: Add missing rtnl_lock around access to ifa
rdma/siw: Use proper enumerated type in map_cqe_status
RDMA/siw: Remove unnecessary kthread create/destroy printouts
IB/rdmavt: Fix variable shadowing issue in rvt_create_cq
RDMA/core: Fix race when resolving IP address
RDMA/core: Make rdma_counter.h compile stand alone
IB/core: Work on the caller socket net namespace in nldev_newlink()
RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
RDMA/mlx5: Set RDMA DIM to be enabled by default
RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink
RDMA/core: Provide RDMA DIM support for ULPs
linux/dim: Implement RDMA adaptive moderation (DIM)
IB/mlx5: Report correctly tag matching rendezvous capability
docs: infiniband: add it to the driver-api bookset
IB/mlx5: Implement VHCA tunnel mechanism in DEVX
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"Bug fixes, code clean up, and new features:
- IMA policy rules can be defined in terms of LSM labels, making the
IMA policy dependent on LSM policy label changes, in particular LSM
label deletions. The new environment, in which IMA-appraisal is
being used, frequently updates the LSM policy and permits LSM label
deletions.
- Prevent an mmap'ed shared file opened for write from also being
mmap'ed execute. In the long term, making this and other similar
changes at the VFS layer would be preferable.
- The IMA per policy rule template format support is needed for a
couple of new/proposed features (eg. kexec boot command line
measurement, appended signatures, and VFS provided file hashes).
- Other than the "boot-aggregate" record in the IMA measuremeent
list, all other measurements are of file data. Measuring and
storing the kexec boot command line in the IMA measurement list is
the first buffer based measurement included in the measurement
list"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
integrity: Introduce struct evm_xattr
ima: Update MAX_TEMPLATE_NAME_LEN to fit largest reasonable definition
KEXEC: Call ima_kexec_cmdline to measure the boot command line args
IMA: Define a new template field buf
IMA: Define a new hook to measure the kexec boot command line arguments
IMA: support for per policy rule template formats
integrity: Fix __integrity_init_keyring() section mismatch
ima: Use designated initializers for struct ima_event_data
ima: use the lsm policy update notifier
LSM: switch to blocking policy update notifiers
x86/ima: fix the Kconfig dependency for IMA_ARCH_POLICY
ima: Make arch_policy_entry static
ima: prevent a file already mmap'ed write to be mmap'ed execute
x86/ima: check EFI SetupMode too
|
|
Added parameter in ib_device for enabling dynamic interrupt moderation so
that it can be configured in userspace using rdma tool.
In order to set adaptive-moderation for an ib device the command is:
rdma dev set [DEV] adaptive-moderation [on|off]
Please set on/off.
rdma dev show
0: mlx5_0: node_type ca fw 16.26.0055 node_guid 248a:0703:00a5:29d0
sys_image_guid 248a:0703:00a5:29d0 adaptive-moderation on
rdma resource show cq
dev mlx5_0 cqn 0 cqe 1023 users 4 poll-ctx UNBOUND_WORKQUEUE
adaptive-moderation off comm [ib_core]
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This patch adds the ability to return all available counters together with
their properties and hwstats.
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
In auto mode all QPs belong to one category are bind automatically to a
single counter set. Currently only "qp type" is supported.
In this mode the qp counter is set in RST2INIT modification, and when a qp
is destroyed the counter is unbound.
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add an API to support set/clear per-port auto mode.
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
For dependencies in next patches.
Resolve conflicts:
- Use uverbs_get_cleared_udata() with new cq allocation flow
- Continue to delete nes despite SPDX conflict
- Resolve list appends in mlx5_command_str()
- Use u16 for vport_rule stuff
- Resolve list appends in struct ib_client
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This function will map the previously dma mapped SG lists for PI
(protection information) and data to an appropriate memory region for
future registration.
The given MR must be allocated as IB_MR_TYPE_INTEGRITY.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This is a preparation for signature verbs API re-design. In the new
design a single MR with IB_MR_TYPE_INTEGRITY type will be used to perform
the needed mapping for data integrity operations.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Update the struct ib_client for all modules exporting cdevs related to the
ibdevice to also implement RDMA_NLDEV_CMD_GET_CHARDEV. All cdevs are now
autoloadable and discoverable by userspace over netlink instead of relying
on sysfs.
uverbs also exposes the DRIVER_ID for drivers that are able to support
driver id binding in rdma-core.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Allow userspace to issue a netlink query against the ib_device for
something like "uverbs" and get back the char dev name, inode major/minor,
and interface ABI information for "uverbs0".
Since we are now in netlink this can also trigger a module autoload to
make the uverbs device come into existence.
Largely this will let us replace searching and reading inside sysfs to
setup devices, and provides an alternative (using driver_id) to device
name based provider binding for things like rxe.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
lockdep_assert_held_write()
All callers of lockdep_assert_held_exclusive() use it to verify the
correct locking state of either a semaphore (ldisc_sem in tty,
mmap_sem for perf events, i_rwsem of inode for dax) or rwlock by
apparmor. Thus it makes sense to rename _exclusive to _write since
that's the semantics callers care. Additionally there is already
lockdep_assert_held_read(), which this new naming is more consistent with.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190531100651.3969-1-nborisov@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Atomic policy updaters are not very useful as they cannot
usually perform the policy updates on their own. Since it
seems that there is no strict need for the atomicity,
switch to the blocking variant. While doing so, rename
the functions accordingly.
Signed-off-by: Janne Karhunen <janne.karhunen@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
|
|
Ensure that CQ is allocated and freed by IB/core and not by drivers.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
This more closely follows how other subsytems work, with owner being a
member of the structure containing the function pointers.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
No reason for every driver to emit code to set this, just make it part of
the driver's existing static const ops structure.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
No reason for every driver to emit code to set this, just make it part of
the driver's existing static const ops structure.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This happens if assign_name() returns failure when called from
ib_register_device(), that will lead to the following panic in every time
that someone touches the port_data's data members.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 19 PID: 1994 Comm: systemd-udevd Not tainted 5.1.0-rc5+ #1
Hardware name: HP ProLiant DL360p Gen8, BIOS P71 12/20/2013
RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
Code: 85 ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 58 66 66 90
66 90 48 89 c3 fa 66 66 90 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 0f
94 c2 84 d2 74 05 48 89 d8 5b c3 89 c6 e8 b4 85 8a
RSP: 0018:ffffa8d7079a7c08 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000202 RCX: ffffa8d7079a7bf8
RDX: 0000000000000001 RSI: ffff93607c990000 RDI: 00000000000000c0
RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffc08c4dd8
R10: 0000000000000000 R11: 0000000000000001 R12: 00000000000000c0
R13: ffff93607c990000 R14: ffffffffc05a9740 R15: ffffa8d7079a7e98
FS: 00007f1c6ee438c0(0000) GS:ffff93609f6c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 0000000819fca002 CR4: 00000000000606e0
Call Trace:
free_netdevs+0x4d/0xe0 [ib_core]
ib_dealloc_device+0x51/0xb0 [ib_core]
__mlx5_ib_add+0x5e/0x70 [mlx5_ib]
mlx5_add_device+0x57/0xe0 [mlx5_core]
mlx5_register_interface+0x85/0xc0 [mlx5_core]
? 0xffffffffc0474000
do_one_initcall+0x4e/0x1d4
? _cond_resched+0x15/0x30
? kmem_cache_alloc_trace+0x15f/0x1c0
do_init_module+0x5a/0x218
load_module+0x186b/0x1e40
? m_show+0x1c0/0x1c0
__do_sys_finit_module+0x94/0xe0
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 8ceb1357b337 ("RDMA/device: Consolidate ib_device per_port data into one place")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The return value from ib_device_check_mandatory() is always 0 - change it
to be void.
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
SRP logic used device name and port index as symlink to relevant
kobject. If the IB device is renamed then the prior name will be re-used
by the next device plugged in and sysfs will panic as SRP will try to
re-use the same name.
mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0
sysfs: cannot create duplicate filename '/class/infiniband_srp/srp-mlx5_0-1'
CPU: 3 PID: 1107 Comm: modprobe Not tainted 5.1.0-for-upstream-perf-2019-05-12_15-09-52-87 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5a/0x73
sysfs_warn_dup+0x58/0x70
sysfs_do_create_link_sd.isra.2+0xa3/0xb0
device_add+0x33f/0x660
srp_add_one+0x301/0x4f0 [ib_srp]
add_client_context+0x99/0xe0 [ib_core]
enable_device_and_get+0xd1/0x1b0 [ib_core]
ib_register_device+0x533/0x710 [ib_core]
? mutex_lock+0xe/0x30
__mlx5_ib_add+0x23/0x70 [mlx5_ib]
mlx5_add_device+0x4e/0xd0 [mlx5_core]
mlx5_register_interface+0x85/0xc0 [mlx5_core]
? 0xffffffffa0791000
do_one_initcall+0x4b/0x1cb
? kmem_cache_alloc_trace+0xc6/0x1d0
? do_init_module+0x22/0x21f
do_init_module+0x5a/0x21f
load_module+0x17f2/0x1ca0
? m_show+0x1c0/0x1c0
__do_sys_finit_module+0x94/0xe0
do_syscall_64+0x48/0x120
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f157cce10d9
The module load/unload sequence was used to trigger such kernel panic:
sudo modprobe ib_srp
sudo modprobe -r mlx5_ib
sudo modprobe -r mlx5_core
sudo modprobe mlx5_core
Have SRP track the name of the core device so that it can't have a name
collision.
Fixes: d21943dd19b5 ("RDMA/core: Implement IB device rename function")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
When the refcount is 0 the device is invisible to netlink. However in the
patch below the refcount = 1 was moved to after the device_add(). This
creates a race where userspace can issue a netlink query after the
device_add() event and not see the device as visible.
Ensure that no uevent is fired before device is fully registered.
Fixes: d79af7242bb2 ("RDMA/device: Expose ib_device_try_get(()")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Integrate iw_cm_verbs data members into ib_device_ops and ib_device
structs, this is done to achieve the following:
1) Avoid memory related bugs durring error unwind
2) Make the code more cleaner
3) Reduce code duplication
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The driver interface cannot manipulate the sysfs of the compat device,
only of the full device so we must avoid calling the driver sysfs APIs on
compat devices.
This prevents an oops:
Call Trace:
dump_stack+0x5a/0x73
kobject_init+0x74/0x80
kobject_init_and_add+0x35/0xb0
hfi1_create_port_files+0x6e/0x3c0 [hfi1]
ib_setup_port_attrs+0x43b/0x560 [ib_core]
add_one_compat_dev+0x16a/0x230 [ib_core]
rdma_dev_init_net+0x110/0x160 [ib_core]
ops_init+0x38/0xf0
setup_net+0xcf/0x1e0
copy_net_ns+0xb7/0x130
create_new_namespaces+0x11a/0x1b0
unshare_nsproxy_namespaces+0x55/0xa0
ksys_unshare+0x1a7/0x340
__x64_sys_unshare+0xe/0x20
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 5417783eabb2 ("RDMA/core: Support core port attributes in non init_net")
Reported-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Similarly to dev/netdev/etc printk helpers, add standard printk helpers
for the RDMA subsystem.
Example output:
efa 0000:00:06.0 efa_0: Hello World!
efa_0: Hello World! (no parent device set)
(NULL ib_device): Hello World! (ibdev is NULL)
Cc: Jason Baron <jbaron@akamai.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Provide an option to change the net namespace of a rdma device through a
netlink command. When multiple rdma devices exists in a system, and when
containers are used, this will limit rdma device visibility to a specified
net namespace.
An example command to change net namespace of mlx5_1 device to the
previously created net namespace 'foo' is:
$ ip netns add foo
$ rdma dev set mlx5_1 netns foo
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Introduce a helper function that changes rdma device's net namespace which
performs mini disable/enable sequence to have device visible only in
assigned net namespace.
Device unregistration, device rename and device change net namespace
may be invoked concurrently.
(a) device unregistration needs to wait if a device change (rename or net
namespace change) operation is in progress.
(b) device net namespace change should not proceed if the unregistration
has started.
(c) while one cpu is changing device net namespace, other cpu should not
be able to rename or change net namespace.
To address above concurrency,
(a) Use unreg_mutex to synchronize between ib_unregister_device() and net
namespace change operation
(b) In cases where unregister_device() has started unregistration before
change_netns got chance to acquire unreg_mutex, validate the refcount
- if it dropped to zero, abort the net namespace change operation.
Finally use the helper function to change net namespace of ib device to
move the device back to init_net when such net is deleted.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
So we can use the disable_device() helper while changing the net namespace
of the rdma device in a subsequent patch, move free_netdevs() out of it.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Convert SRQ allocation from drivers to be in the IB/core
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Simplify drivers by ensuring lifetime of ib_ah object. The changes
in .create_ah() go hand in hand with relevant update in .destroy_ah().
We will use this opportunity and convert .destroy_ah() to don't fail, as
it was suggested a long time ago, because there is nothing to do in case
of failure during destroy.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Combine contiguous regions of PAGE_SIZE pages into single scatter list
entry while building the scatter table for a umem. This minimizes the
number of the entries in the scatter list and reduces the DMA mapping
overhead, particularly with the IOMMU.
Set default max_seg_size in core for IB devices to 2G and do not combine
if we exceed this limit.
Also, purge npages in struct ib_umem as we now DMA map the umem SGL with
sg_nents and npage computation is not needed. Drivers should now be using
ib_umem_num_pages(), so fix the last stragglers.
Move npages tracking to ib_umem_odp as ODP drivers still need it.
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Gal Pressman <galpress@amazon.com>
Tested-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add netlink command that enables/disables sharing rdma device among
multiple net namespaces.
Using rdma tool,
$rdma sys set netns shared (default mode)
When rdma subsystem netns mode is set to shared mode, rdma devices
will be accessible in all net namespaces.
Using rdma tool,
$rdma sys set netns exclusive
When rdma subsystem netns mode is set to exclusive mode, devices
will be accessible in only one net namespace at any given
point of time.
If there are any net namespaces other than default init_net exists,
while executing this command, it will fail and mode cannot be changed.
To change this mode, netlink command is used instead of sysctl, because
netlink command allows to auto load a module.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add an interface via netlink command to query whether rdma devices are
shared among multiple net namespaces or not. When using RDMAtool, it can
be queried as,
$rdma system show netns
netns shared
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Extend ib_device_get_by_index() API to check device access for
net namespace for serving netlink commands.
Also enforce net ns check on dumpit commands which iterate over all
registered rdma devices and which don't call ib_device_get_by_index().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Introduce an API rdma_dev_access_netns() to check whether a rdma device
can be accessed from the specified net namespace or not.
Use rdma_dev_access_netns() while opening character uverbs, umad network
device and also check while rdma cm_id binds to rdma device.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add module parameter to change a sharing mode of ib_core early in the
boot process. This parameter helps to those systems where modern up
to date rdma tool (iproute2) package may not be available during
kernel upgrade cycle.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|