Age | Commit message (Collapse) | Author | Files | Lines |
|
commit ed7a01fd3fd77f40b4ef2562b966a5decd8928d2 upstream.
Tracking CM_ID resource is performed in two stages: creation of cm_id
and connecting it to the cma_dev. It is needed because rdma-cm protocol
exports two separate user-visible calls rdma_create_id and rdma_accept.
At the time of CM_ID creation, the real owner of that object is unknown
yet and we need to grab task_struct. This task_struct is released or
reassigned in attach phase later on. but call to rdma_destroy_id left
this task_struct unreleased.
Such separation is unique to CM_ID and other restrack objects initialize
in one shot. It means that it is safe to use "res->valid" check to catch
unfinished CM_ID flow and release task_struct for that object.
Fixes: 00313983cda6 ("RDMA/nldev: provide detailed CM_ID information")
Reported-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 425784aa5b029eeb80498c73a68f62c3ad1d3b3f ]
The async_file might be freed before the disassociation has been ended,
causing qp shutdown to use after free on it.
Since uverbs_destroy_ufile_hw is not a fence, it returns if a
disassociation is ongoing in another thread. It has to be written this way
to avoid deadlock. However this means that the ufile FD close cannot
destroy anything that may still be used by an active kref, such as the the
async_file.
To fix that move the kref_put() to be in ib_uverbs_release_file().
BUG: unable to handle kernel paging request at ffffffffba682787
PGD bc80e067 P4D bc80e067 PUD bc80f063 PMD 1313df163 PTE 80000000bc682061
Oops: 0003 [#1] SMP PTI
CPU: 1 PID: 32410 Comm: bash Tainted: G OE 4.20.0-rc6+ #3
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:__pv_queued_spin_lock_slowpath+0x1b3/0x2a0
Code: 98 83 e2 60 49 89 df 48 8b 04 c5 80 18 72 ba 48 8d
ba 80 32 02 00 ba 00 80 00 00 4c 8d 65 14 41 bd 01 00 00 00 48 01 c7 85
d2 <48> 89 2f 48 89 fb 74 14 8b 45 08 85 c0 75 42 84 d2 74 6b f3 90 83
RSP: 0018:ffffc1bbc064fb58 EFLAGS: 00010006
RAX: ffffffffba65f4e7 RBX: ffff9f209c656c00 RCX: 0000000000000001
RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffffffffba682787
RBP: ffff9f217bb23280 R08: 0000000000000001 R09: 0000000000000000
R10: ffff9f209d2c7800 R11: ffffffffffffffe8 R12: ffff9f217bb23294
R13: 0000000000000001 R14: 0000000000000000 R15: ffff9f209c656c00
FS: 00007fac55aad740(0000) GS:ffff9f217bb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffba682787 CR3: 000000012f8e0000 CR4: 00000000000006e0
Call Trace:
_raw_spin_lock_irq+0x27/0x30
ib_uverbs_release_uevent+0x1e/0xa0 [ib_uverbs]
uverbs_free_qp+0x7e/0x90 [ib_uverbs]
destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs]
__uverbs_cleanup_ufile+0x73/0x90 [ib_uverbs]
uverbs_destroy_ufile_hw+0x5d/0x120 [ib_uverbs]
ib_uverbs_remove_one+0xea/0x240 [ib_uverbs]
ib_unregister_device+0xfb/0x200 [ib_core]
mlx5_ib_remove+0x51/0xe0 [mlx5_ib]
mlx5_remove_device+0xc1/0xd0 [mlx5_core]
mlx5_unregister_device+0x3d/0xb0 [mlx5_core]
remove_one+0x2a/0x90 [mlx5_core]
pci_device_remove+0x3b/0xc0
device_release_driver_internal+0x16d/0x240
unbind_store+0xb2/0x100
kernfs_fop_write+0x102/0x180
__vfs_write+0x36/0x1a0
? __alloc_fd+0xa9/0x170
? set_close_on_exec+0x49/0x70
vfs_write+0xad/0x1a0
ksys_write+0x52/0xc0
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fac551aac60
Cc: <stable@vger.kernel.org> # 4.2
Fixes: 036b10635739 ("IB/uverbs: Enable device removal when there are active user space applications")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 770b7d96cfff6a8bf6c9f261ba6f135dc9edf484 ]
We encountered a use-after-free bug when unloading the driver:
[ 3562.116059] BUG: KASAN: use-after-free in ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
[ 3562.117233] Read of size 4 at addr ffff8882ca5aa868 by task kworker/u13:2/23862
[ 3562.118385]
[ 3562.119519] CPU: 2 PID: 23862 Comm: kworker/u13:2 Tainted: G OE 5.1.0-for-upstream-dbg-2019-05-19_16-44-30-13 #1
[ 3562.121806] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014
[ 3562.123075] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[ 3562.124383] Call Trace:
[ 3562.125640] dump_stack+0x9a/0xeb
[ 3562.126911] print_address_description+0xe3/0x2e0
[ 3562.128223] ? ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
[ 3562.129545] __kasan_report+0x15c/0x1df
[ 3562.130866] ? ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
[ 3562.132174] kasan_report+0xe/0x20
[ 3562.133514] ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
[ 3562.134835] ? find_mad_agent+0xa00/0xa00 [ib_core]
[ 3562.136158] ? qlist_free_all+0x51/0xb0
[ 3562.137498] ? mlx4_ib_sqp_comp_worker+0x1970/0x1970 [mlx4_ib]
[ 3562.138833] ? quarantine_reduce+0x1fa/0x270
[ 3562.140171] ? kasan_unpoison_shadow+0x30/0x40
[ 3562.141522] ib_mad_recv_done+0xdf6/0x3000 [ib_core]
[ 3562.142880] ? _raw_spin_unlock_irqrestore+0x46/0x70
[ 3562.144277] ? ib_mad_send_done+0x1810/0x1810 [ib_core]
[ 3562.145649] ? mlx4_ib_destroy_cq+0x2a0/0x2a0 [mlx4_ib]
[ 3562.147008] ? _raw_spin_unlock_irqrestore+0x46/0x70
[ 3562.148380] ? debug_object_deactivate+0x2b9/0x4a0
[ 3562.149814] __ib_process_cq+0xe2/0x1d0 [ib_core]
[ 3562.151195] ib_cq_poll_work+0x45/0xf0 [ib_core]
[ 3562.152577] process_one_work+0x90c/0x1860
[ 3562.153959] ? pwq_dec_nr_in_flight+0x320/0x320
[ 3562.155320] worker_thread+0x87/0xbb0
[ 3562.156687] ? __kthread_parkme+0xb6/0x180
[ 3562.158058] ? process_one_work+0x1860/0x1860
[ 3562.159429] kthread+0x320/0x3e0
[ 3562.161391] ? kthread_park+0x120/0x120
[ 3562.162744] ret_from_fork+0x24/0x30
...
[ 3562.187615] Freed by task 31682:
[ 3562.188602] save_stack+0x19/0x80
[ 3562.189586] __kasan_slab_free+0x11d/0x160
[ 3562.190571] kfree+0xf5/0x2f0
[ 3562.191552] ib_mad_port_close+0x200/0x380 [ib_core]
[ 3562.192538] ib_mad_remove_device+0xf0/0x230 [ib_core]
[ 3562.193538] remove_client_context+0xa6/0xe0 [ib_core]
[ 3562.194514] disable_device+0x14e/0x260 [ib_core]
[ 3562.195488] __ib_unregister_device+0x79/0x150 [ib_core]
[ 3562.196462] ib_unregister_device+0x21/0x30 [ib_core]
[ 3562.197439] mlx4_ib_remove+0x162/0x690 [mlx4_ib]
[ 3562.198408] mlx4_remove_device+0x204/0x2c0 [mlx4_core]
[ 3562.199381] mlx4_unregister_interface+0x49/0x1d0 [mlx4_core]
[ 3562.200356] mlx4_ib_cleanup+0xc/0x1d [mlx4_ib]
[ 3562.201329] __x64_sys_delete_module+0x2d2/0x400
[ 3562.202288] do_syscall_64+0x95/0x470
[ 3562.203277] entry_SYSCALL_64_after_hwframe+0x49/0xbe
The problem was that the MAD PD was deallocated before the MAD CQ.
There was completion work pending for the CQ when the PD got deallocated.
When the mad completion handling reached procedure
ib_mad_post_receive_mads(), we got a use-after-free bug in the following
line of code in that procedure:
sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey;
(the pd pointer in the above line is no longer valid, because the
pd has been deallocated).
We fix this by allocating the PD before the CQ in procedure
ib_mad_port_open(), and deallocating the PD after freeing the CQ
in procedure ib_mad_port_close().
Since the CQ completion work queue is flushed during ib_free_cq(),
no completions will be pending for that CQ when the PD is later
deallocated.
Note that freeing the CQ before deallocating the PD is the practice
in the ULPs.
Fixes: 4be90bc60df4 ("IB/mad: Remove ib_get_dma_mr calls")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190801121449.24973-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 61f259821dd3306e49b7d42a3f90fb5a4ff3351b ]
Some processors may mispredict an array bounds check and
speculatively access memory that they should not. With
a user supplied array index we like to play things safe
by masking the value with the array size before it is
used as an index.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20190731043957.GA1600@agluck-desk2.amr.corp.intel.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Like commit 641114d2af31 ("RDMA: Directly cast the sockaddr union to
sockaddr") we need to quiet gcc 9 from warning about this crazy union.
That commit did not fix all of the warnings in 4.19 and older kernels
because the logic in roce_resolve_route_from_path() was rewritten
between 4.19 and 5.2 when that change happened.
Cc: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 641114d2af312d39ca9bbc2369d18a5823da51c6 upstream.
gcc 9 now does allocation size tracking and thinks that passing the member
of a union and then accessing beyond that member's bounds is an overflow.
Instead of using the union member, use the entire union with a cast to
get to the sockaddr. gcc will now know that the memory extends the full
size of the union.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5d7ed2f27bbd482fd29e6b2e204b1a1ee8a0b268 ]
When two netdev have same link local addresses (such as vlan and non
vlan), two rdma cm listen id should be able to bind to following different
addresses.
listener-1: addr=lla, scope_id=A, port=X
listener-2: addr=lla, scope_id=B, port=X
However while comparing the addresses only addr and port are considered,
due to which 2nd listener fails to listen.
In below example of two listeners, 2nd listener is failing with address in
use error.
$ rping -sv -a fe80::268a:7ff:feb3:d113%ens2f1 -p 4545&
$ rping -sv -a fe80::268a:7ff:feb3:d113%ens2f1.200 -p 4545
rdma_bind_addr: Address already in use
To overcome this, consider the scope_ids as well which forms the accurate
IPv6 link local address.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 535005ca8e5e71918d64074032f4b9d4fef8981e upstream.
The open-coded variant missed destroy of SELinux created QP, reuse already
existing ib_detroy_qp() call and use this opportunity to clean
ib_create_qp() from double prints and unclear exit paths.
Reported-by: Parav Pandit <parav@mellanox.com>
Fixes: d291f1a65232 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6e88e672b69f0e627acdae74a527b730ea224b6b upstream.
If the MAD agents isn't allowed to manage the subnet, or fails to register
for the LSM notifier, the security context is leaked. Free the context in
these cases.
Fixes: 47a2b338fe63 ("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reported-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d60667fc398ed34b3c7456b020481c55c760e503 upstream.
If the notifier runs after the security context is freed an access of
freed memory can occur.
Fixes: 47a2b338fe63 ("IB/core: Enforce security on management datagrams")
Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5fc01fb846bce8fa6d5f95e2625b8ce0f8e86810 upstream.
If cma_acquire_dev_by_src_ip() returns error in addr_handler(), the
device state changes back to RDMA_CM_ADDR_BOUND but the resolved source
IP address is still left. After that, if rdma_destroy_id() is called
after rdma_listen(), the device is freed without removed from
listen_any_list in cma_cancel_operation(). Revert to the previous IP
address if acquiring device fails.
Reported-by: syzbot+f3ce716af730c8f96637@syzkaller.appspotmail.com
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a9666c1cae8dbcd1a9aacd08a778bf2a28eea300 upstream.
Unsafe global rkey is considered dangerous because it exposes memory
registered for all memory in the system. Only users with a QP on the same
PD can use the rkey, and generally those QPs will already know the
value. However, out of caution, do not expose the value to unprivleged
users on the local system. Require CAP_NET_ADMIN instead.
Cc: <stable@vger.kernel.org> # 4.16
Fixes: 29cf1351d450 ("RDMA/nldev: provide detailed PD information")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 37fbd834b4e492dc41743830cbe435f35120abd8 ]
When support for bonding of RoCE devices was added, there was
necessarily a link between the RoCE device and the paired netdevice that
was part of the bond. If you remove the mlx4_en module, that paired
association is broken (the RoCE device is still present but the paired
netdevice has been released). We need to account for this in
is_upper_ndev_bond_master_filter() and filter out those links with a
broken pairing or else we later oops in netdev_next_upper_dev_rcu().
Fixes: 408f1242d940 ("IB/core: Delete lower netdevice default GID entries in bonding scenario")
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d52ef88a9f4be523425730da3239cf87bee936da ]
Currently when MAC address is changed, regardless of the netdev reg_state,
GID entries are removed and added to reflect the new MAC address and new
default GID entries.
When a bonding device is used and the underlying PCI device is removed
several netdevice events are generated. Two events of the interest are
CHANGEADDR and UNREGISTER event on lower(slave) netdevice of the bond
netdevice.
Sometimes CHANGEADDR event is generated when netdev state is
UNREGISTERING (after UNREGISTER event is generated). In this scenario, GID
entries for default GIDs are added and never deleted because GID entries
are deleted only when netdev state is < UNREGISTERED.
This leads to non zero reference count on the netdevice. Due to this, PCI
device unbind operation is getting stuck.
To avoid it, when changing mac address, add GID entries only if netdev is
in REGISTERED state.
Fixes: 03db3a2d81e6 ("IB/core: Add RoCE GID table management")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e54b6a3bcd1ec972b25a164bdf495d9e7120b107 ]
Add missing check for failure of cm_init_av_by_path
Fixes: e1444b5a163e ("IB/cm: Fix automatic path migration support")
Reported-by: Slava Shwartsman <slavash@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0f6ef65d1c6ec8deb5d0f11f86631ec4cfe8f22e ]
If the provider driver (such as rdma_rxe) doesn't support pma counters,
avoid exposing its directory similar to optional hw_counters directory.
If core fails to read the PMA counter, return an error so that user can
retry later if needed.
Fixes: 35c4cbb17811 ("IB/core: Create get_perf_mad function in sysfs.c")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
hdr.cmd can be indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
drivers/infiniband/core/ucma.c:1686 ucma_write() warn: potential
spectre issue 'ucma_cmd_table' [r] (local cap)
Fix this by sanitizing hdr.cmd before using it to index
ucm_cmd_table.
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
hdr.cmd can be indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
drivers/infiniband/core/ucm.c:1127 ib_ucm_write() warn: potential
spectre issue 'ucm_cmd_table' [r] (local cap)
Fix this by sanitizing hdr.cmd before using it to index
ucm_cmd_table.
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Currently add_modify_gid() for IB link layer has followong issue
in cache update path.
When GID update event occurs, core releases reference to the GID
table without updating its state and/or entry pointer.
CPU-0 CPU-1
------ -----
ib_cache_update() IPoIB ULP
add_modify_gid() [..]
put_gid_entry()
refcnt = 0, but
state = valid,
entry is valid.
(work item is not yet executed).
ipoib_create_ah()
rdma_create_ah()
rdma_get_gid_attr() <--
Tries to acquire gid_attr
which has refcnt = 0.
This is incorrect.
GID entry state and entry pointer is provides the accurate GID enty
state. Such fields must be updated with rwlock to protect against
readers and, such fields must be in sane state before refcount can drop
to zero. Otherwise above race condition can happen leading to
use-after-free situation.
Following backtrace has been observed when cache update for an IB port
is triggered while IPoIB ULP is creating an AH.
Therefore, when updating GID entry, first mark a valid entry as invalid
through state and set the barrier so that no callers can acquired
the GID entry, followed by release reference to it.
refcount_t: increment on 0; use-after-free.
WARNING: CPU: 4 PID: 29106 at lib/refcount.c:153 refcount_inc_checked+0x30/0x50
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
RIP: 0010:refcount_inc_checked+0x30/0x50
RSP: 0018:ffff8802ad36f600 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000008 RDI: ffffffff86710100
RBP: ffff8802d6e60a30 R08: ffffed005d67bf8b R09: ffffed005d67bf8b
R10: 0000000000000001 R11: ffffed005d67bf8a R12: ffff88027620cee8
R13: ffff8802d6e60988 R14: ffff8802d6e60a78 R15: 0000000000000202
FS: 0000000000000000(0000) GS:ffff8802eb200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3ab35e5c88 CR3: 00000002ce84a000 CR4: 00000000000006e0
IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
Call Trace:
rdma_get_gid_attr+0x220/0x310 [ib_core]
? lock_acquire+0x145/0x3a0
rdma_fill_sgid_attr+0x32c/0x470 [ib_core]
rdma_create_ah+0x89/0x160 [ib_core]
? rdma_fill_sgid_attr+0x470/0x470 [ib_core]
? ipoib_create_ah+0x52/0x260 [ib_ipoib]
ipoib_create_ah+0xf5/0x260 [ib_ipoib]
ipoib_mcast_join_complete+0xbbe/0x2540 [ib_ipoib]
Fixes: b150c3862d21 ("IB/core: Introduce GID entry reference counts")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Make sure we free struct uverbs_api once we clean the radix tree. It was
allocated by uverbs_alloc_api().
Fixes: 9ed3e5f44772 ("IB/uverbs: Build the specs into a radix tree at runtime")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Uverbs shouldn't enforce QP state in the command unless the user set the QP
state bit in the attribute mask.
In addition, only copy qp attr fields which have the corresponding bit set
in the attribute mask over to the internal attr structure.
Fixes: 88de869bbe4f ("RDMA/uverbs: Ensure validity of current QP state value")
Fixes: bc38a6abdd5a ("[PATCH] IB uverbs: core implementation")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
There is a race condition between ucma_close() and ucma_resolve_ip():
CPU0 CPU1
ucma_resolve_ip(): ucma_close():
ctx = ucma_get_ctx(file, cmd.id);
list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
mutex_lock(&mut);
idr_remove(&ctx_idr, ctx->id);
mutex_unlock(&mut);
...
mutex_lock(&mut);
if (!ctx->closing) {
mutex_unlock(&mut);
rdma_destroy_id(ctx->cm_id);
...
ucma_free_ctx(ctx);
ret = rdma_resolve_addr();
ucma_put_ctx(ctx);
Before idr_remove(), ucma_get_ctx() could still find the ctx
and after rdma_destroy_id(), rdma_resolve_addr() may still
access id_priv pointer. Also, ucma_put_ctx() may use ctx after
ucma_free_ctx() too.
ucma_close() should call ucma_put_ctx() too which tests the
refcnt and waits for the last one releasing it. The similar
pattern is already used by ucma_destroy_id().
Reported-and-tested-by: syzbot+da2591e115d57a9cbb8b@syzkaller.appspotmail.com
Reported-by: syzbot+cfe3c1e8ef634ba8964b@syzkaller.appspotmail.com
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
|
|
Currently a uverbs completion event queue is flushed of events in
ib_uverbs_comp_event_close() with the queue spinlock held and then
released. Yet setting ev_queue->is_closed is not set until later in
uverbs_hot_unplug_completion_event_file().
In between the time ib_uverbs_comp_event_close() releases the lock and
uverbs_hot_unplug_completion_event_file() acquires the lock, a completion
event can arrive and be inserted into the event queue by
ib_uverbs_comp_handler().
This can cause a "double add" list_add warning or crash depending on the
kernel configuration, or a memory leak because the event is never dequeued
since the queue is already closed down.
So add setting ev_queue->is_closed = 1 to ib_uverbs_comp_event_close().
Cc: stable@vger.kernel.org
Fixes: 1e7710f3f656 ("IB/core: Change completion channel to use the reworked objects schema")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
When AF_IB addresses are used during rdma_resolve_addr() a lock is not
held. A cma device can get removed while list traversal is in progress
which may lead to crash. ie
CPU0 CPU1
==== ====
rdma_resolve_addr()
cma_resolve_ib_dev()
list_for_each() cma_remove_one()
cur_dev->device mutex_lock(&lock)
list_del();
mutex_unlock(&lock);
cma_process_remove();
Therefore, hold a lock while traversing the list which avoids such
situation.
Cc: <stable@vger.kernel.org> # 3.10
Fixes: f17df3b0dede ("RDMA/cma: Add support for AF_IB to rdma_resolve_addr()")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
If ib_uverbs_create_uapi() fails, dev_num should be freed from the bitmap.
Fixes: 7d96c9b17636 ("IB/uverbs: Have the core code create the uverbs_root_spec")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The object lock was supposed to always be released during destroy, but
when the destruction retry series was integrated with the destroy series
it created a failure path that missed the unlock.
Keep with convention, if destroy fails the caller must undo all locking.
Fixes: 87ad80abc70d ("IB/uverbs: Consolidate uobject destruction")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
The current code grabs the private_data of whatever file descriptor
userspace has supplied and implicitly casts it to a `struct ucma_file *`,
potentially causing a type confusion.
This is probably fine in practice because the pointer is only used for
comparisons, it is never actually dereferenced; and even in the
comparisons, it is unlikely that a file from another filesystem would have
a ->private_data pointer that happens to also be valid in this context.
But ->private_data is not always guaranteed to be a valid pointer to an
object owned by the file's filesystem; for example, some filesystems just
cram numbers in there.
Check the type of the supplied file descriptor to be safe, analogous to how
other places in the kernel do it.
Fixes: 88314e4dda1e ("RDMA/cma: add support for rdma_migrate_id()")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.
Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.
We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held. Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range. Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.
This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.
I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.
The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small. Then we are looking for a proper process tear down.
[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
rdma.git merge resolution for the 4.19 merge window
Conflicts:
drivers/infiniband/core/rdma_core.c
- Use the rdma code and revise with the new spelling for
atomic_fetch_add_unless
drivers/nvme/host/rdma.c
- Replace max_sge with max_send_sge in new blk code
drivers/nvme/target/rdma.c
- Use the blk code and revise to use NULL for ib_post_recv when
appropriate
- Replace max_sge with max_recv_sge in new blk code
net/rds/ib_send.c
- Use the net code and revise to use NULL for ib_post_recv when
appropriate
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Resolve merge conflicts from the -rc cycle against the rdma.git tree:
Conflicts:
drivers/infiniband/core/uverbs_cmd.c
- New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
- Merge removal of file->ucontext in for-next with new code in -rc
drivers/infiniband/core/uverbs_main.c
- for-next removed code from ib_uverbs_write() that was modified
in for-rc
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Filter functions returns either 0 or 1, therefore better change their
return type from int to bool to reflect the same. Additionally some
filter functions have suffix of _filter some doesn't. Make all filter
function consistent to have __filter suffix to improve code readability.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Update all GID table entries of the netdevice whose MAC address changed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Currently following issues exist:
1. Default GIDs of the lower (slave) netdevice if the bond netdevice is
added. Rather default GID should be of bond master netdevice.
2. Due to this, when failover event occurs FAILOVER event handler attempts
to delete the GID of the upper device and tries to add the default GID
of the lower device. This is incorrect behavior.
To have simple and correct code:
(a) Split default GIDs addition out of add_netdev_ips(). This allows
easier removal in future if RoCE default GIDs are removed.
(b) Add default GIDs of the bond master device by using right filter and
callback function.
(c) Remove unused function enum_netdev_default_gids().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Now that we correctly delete the default GIDs of lower devices during
CHANGEUPPER event, add default GIDs of the bonding master device.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
When NETDEV_CHANGEUPPER event occurs, lower device is not yet established
as slave of the master, and when upper device is bond device, default GID
entries not deleted.
Due to this, when bond device is fully configured, default GID entries of
bond device cannot be added as default GID entries are occupied by the
lower netdevice. This is incorrect.
Default GID entries should really be of bond netdevice because in all RoCE
GIDs (default or IP), MAC address of the bond device will be used. It is
confusing to have default GID of netdevice which is not really used for
any purpose.
Therefore, as first step, implement
(a) filter function which filters if a CHANGEUPPER event netdevice and
associated upper device is master device or not.
(b) callback function which deletes the default GIDs of lower (event
netdevice).
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Currently bond_delete_netdev_default_gids() is called by two callers.
(a) del_netdev_default_ips_join()
(b) del_netdev_default_ips()
Both above functions changes the argument order while calling
bond_delete_netdev_default_gids(). This required silly
del_netdev_default_ips() wrapper.
Additionally, del_netdev_default_ips() deletes default GIDs not IP based
GIDs. del_netdev_default_ips() having _ips suffix is confusing.
Therefore, get rid of confusing del_netdev_default_ips() and simplify
bond_delete_netdev_default_gids() to follow same argument order as its
caller.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Add comment for handling CHANGEUPPER netevent handling.
To improve code readability,
(a) move cmd definitions to its respective if-else branches,
(b) avoid single line structure definitions.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Even though this interface is marked CONFIG_BROKEN we still expect it to
compile, at least until we delete it completely.
Also mark INFINIBAND_USER_ACCESS_UCM with COMPILE_TEST so these situations
can be detected.
Fixes: e7ff98aefc9e ("RDMA/cma: Constify path record, ib_cm_event, listen_id pointers")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
|
|
Now that the ioctl path and uobjects are converted to use uverbs_api, it
is now safe to remove the disassociation protection from the common ioctl
code.
This completes the work to make destroy functions continue to work even
after device disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Everything now uses the uverbs_uapi data structure.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Convert the ioctl method syscall path to use the uverbs_api data
structures. The new uapi structure includes all the same information, just
in a different and more optimal way.
- Use attr_bkey instead of 2 level radix trees for everything related to
attributes. This includes the attribute storage, presence, and
detection of missing mandatory attributes.
- Avoid iterating over all attribute storage at finish, instead use
find_first_bit with the attr_bkey to locate only those attrs that need
cleanup.
- Organize things to always run, and always rely on, cleanup. This
avoids a bunch of tricky error unwind cases.
- Locate the method using the radix tree, and locate the attributes
using a very efficient incremental radix tree lookup
- Use the precomputed destroy_bkey to handle uobject destruction
- Use the precomputed allocation sizes and precomputed 'need_stack'
to avoid maths in the fast path. This is optimal if userspace
does not pass (many) unsupported attributes.
Overall this results in much better codegen for the attribute accessors,
everything is now stored in bitmaps or linear arrays indexed by attr_bkey.
The compiler can compute attr_bkey values at compile time for all method
attributes, meaning things like uverbs_attr_is_valid() now compile into
single instruction bit tests.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
Several handlers need temporary allocations for the life of the method,
switch them to use the uverbs_alloc allocator.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
|
|
This is similar in spirit to devm, it keeps track of any allocations
linked to this method call and ensures they are all freed when the method
exits. Further, if there is space in the internal/onstack buffer then the
allocator will hand out that memory and avoid an expensive call to
kalloc/kfree in the syscall path.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Memory in the bundle is valuable, do not waste it holding an 8 byte
pointer for the rare case of writing to a PTR_OUT. We can compute the
pointer by storing a small 1 byte array offset and the base address of the
uattr memory in the bundle private memory.
This also means we can access the kernel's copy of the ib_uverbs_attr, so
drop the copy of flags as well.
Since the uattr base should be private bundle information this also
de-inlines the already too big uverbs_copy_to inline and moves
create_udata into uverbs_ioctl.c so they can see the private struct
definition.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
|
|
This already existed as the anonymous 'ctx' structure, but this was not
really a useful form. Hoist this struct into bundle_priv and rework the
internal things to use it instead.
Move a bunch of the processing internal state into the priv and reduce the
excessive use of function arguments.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
|
|
Currently the struct uverbs_obj_type stored in the ib_uobject is part of
the .rodata segment of the module that defines the object. This is a
problem if drivers define new uapi objects as we will be left with a
dangling pointer after device disassociation.
Switch the uverbs_obj_type for struct uverbs_api_object, which is
allocated memory that is part of the uverbs_api and is guaranteed to
always exist. Further this moves the 'type_class' into this memory which
means access to the IDR/FD function pointers is also guaranteed. Drivers
cannot define new types.
This makes it safe to continue to use all uobjects, including driver
defined ones, after disassociation.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
|
This radix tree datastructure is intended to replace the 'hash' structure
used today for parsing ioctl methods during system calls. This first
commit introduces the structure and builds it from the existing .rodata
descriptions.
The so-called hash arrangement is actually a 5 level open coded radix tree.
This new version uses a 3 level radix tree built using the radix tree
library.
Overall this is much less code and much easier to build as the radix tree
API allows for dynamic modification during the building. There is a small
memory penalty to pay for this, but since the radix tree is allocated on
a per device basis, a few kb of RAM seems immaterial considering the
gained simplicity.
The radix tree is similar to the existing tree, but also has a 'attr_bkey'
concept, which is a small value'd index for each method attribute. This is
used to simplify and improve performance of everything in the next
patches.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
|
|
There is no reason for drivers to do this, the core code should take of
everything. The drivers will provide their information from rodata to
describe their modifications to the core's base uapi specification.
The core uses this to build up the runtime uapi for each device.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
|
|
This is missing a zeroing of the high bits of flags, and is also not
correct for big endian machines. Properly zero extend the 32 bit flags
into the 64 bit stack variable.
Reported-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Fixes: bccd06223f21 ("IB/uverbs: Add UVERBS_ATTR_FLAGS_IN to the specs language")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
|