path: root/drivers/infiniband/core
Age | Commit message | Author | Files, Lines
2020-11-16 | treewide: rename nla_strlcpy to nla_strscpy | Francis Laniel | 1 file, -5/+5
Calls to nla_strlcpy are now replaced by calls to nla_strscpy, which is the new name of this function. Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
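A minimal sketch of the caller-visible difference (the helper and attribute choice here are hypothetical, not from the patch): nla_strlcpy() returned the source length, while nla_strscpy() returns the number of bytes copied or -E2BIG on truncation, so the result can be error-checked directly.

    #include <linux/if_link.h>
    #include <net/netlink.h>

    /* hypothetical helper: copy a NUL-terminated name attribute */
    static int example_parse_name(struct nlattr *tb[], char *name, size_t len)
    {
        ssize_t ret;

        if (!tb[IFLA_IFNAME])
            return -EINVAL;

        ret = nla_strscpy(name, tb[IFLA_IFNAME], len);
        if (ret < 0)        /* -E2BIG: the attribute did not fit */
            return ret;

        return 0;
    }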
2020-11-12 | RDMA/umem: Use ib_dma_max_seg_size instead of dma_get_max_seg_size | Christoph Hellwig | 1 file, -4/+4
RDMA ULPs must not call DMA mapping APIs directly but instead use the ib_dma_* wrappers. Fixes: 0c16d9635e3a ("RDMA/umem: Move to allocate SG table from pages") Link: https://lore.kernel.org/r/20201106181941.1878556-3-hch@lst.de Reported-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
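A short sketch of the rule this fix enforces, assuming a hypothetical ULP helper (my_cap_sg_len is illustrative, not a kernel API):

    #include <linux/kernel.h>
    #include <rdma/ib_verbs.h>

    /* cap an SG element by the device's DMA limit, via the wrapper */
    static size_t my_cap_sg_len(struct ib_device *ibdev, size_t want)
    {
        /* do not call dma_get_max_seg_size(ibdev->dma_device) directly */
        return min_t(size_t, want, ib_dma_max_seg_size(ibdev));
    }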
2020-11-12 | RDMA/core: Make FD destroy callback void | Leon Romanovsky | 3 files, -6/+5
All FD object destroy implementations return 0, so declare this callback void. Link: https://lore.kernel.org/r/20201104144556.3809085-3-leon@kernel.org Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-12 | RDMA/core: Postpone uobject cleanup on failure till FD close | Leon Romanovsky | 10 files, -65/+39
Remove the ib_is_destroyable_retryable() concept. The idea here was to allow the drivers to forcibly clean the HW object even if they otherwise didn't want to (eg because of usecnt). This was an attempt to clean up in a world where drivers were not allowed to fail HW object destruction. Now that we are going back to allowing HW objects to fail destroy this doesn't make sense. Instead if a uobject's HW object can't be destroyed it is left on the uobject list and it is up to uverbs_destroy_ufile_hw() to clean it. Multiple passes over the uobject list allow hidden dependencies to be resolved. If that fails the HW driver is broken, throw a WARN_ON and leak the HW object memory. All the other tricky failure paths (eg on creation error unwind) have already been updated to this new model. Link: https://lore.kernel.org/r/20201104144556.3809085-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-12 | RDMA/cm: Make the local_id_table xarray non-irq | Jason Gunthorpe | 1 file, -6/+6
The xarray is never mutated from an IRQ handler, only from work queues under a spinlock_irq. Thus there is no reason for it to be an IRQ type xarray. This was copied over from the original IDR code, but the recent rework put the xarray inside another spinlock_irq which will unbalance the unlocking. Fixes: c206f8bad15d ("RDMA/cm: Make it clearer how concurrency works in cm_req_handler()") Link: https://lore.kernel.org/r/0-v1-808b6da3bd3f+1857-cm_xarray_no_irq_jgg@nvidia.com Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
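A minimal sketch of the distinction (illustrative names, not the cm.c code): the IRQ flag only matters when an IRQ handler can take the xarray's internal lock.

    #include <linux/xarray.h>

    /* IRQ-safe internal lock: needed only if an IRQ handler mutates it */
    static DEFINE_XARRAY_FLAGS(id_table_irq, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);

    /* plain lock: sufficient when mutation happens only in process or
     * workqueue context, as is the case for the local_id_table */
    static DEFINE_XARRAY_FLAGS(id_table, XA_FLAGS_ALLOC);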
2020-11-02 | IB/core: Add support for NDR link speed | Meir Lichtinger | 1 file, -0/+4
Add the new IBTA speed NDR, supporting a signaling rate of 100Gb/s. Link: https://lore.kernel.org/r/20201026133738.1340432-2-leon@kernel.org Signed-off-by: Meir Lichtinger <meirl@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-02 | RDMA/mlx5: Use ib_umem_find_best_pgsz() for mkc's | Jason Gunthorpe | 1 file, -0/+9
Now that all the PAS arrays or UMR XLT's for mkcs are filled using rdma_for_each_block() we can use the common ib_umem_find_best_pgsz() algorithm. Link: https://lore.kernel.org/r/20201026132314.1336717-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-31 | RDMA: Convert various random sprintf sysfs _show uses to sysfs_emit | Joe Perches | 4 files, -50/+58
Manual changes for sysfs_emit as cocci scripts can't easily convert them. Link: https://lore.kernel.org/r/ecde7791467cddb570c6f6d2c908ffbab9145cac.1602122880.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-31 | RDMA: Manual changes for sysfs_emit and neatening | Joe Perches | 1 file, -29/+36
Make changes to use sysfs_emit in the RDMA code as cocci scripts can not be written to handle _all_ the possible variants of various sprintf family uses in sysfs show functions. While there, make the code more legible and update its style to be more like the typical kernel styles.

Miscellanea:
o Use intermediate pointers for dereferences
o Add and use string lookup functions
o return early when any intermediate call fails so normal return is at the bottom of the function
o mlx4/mcg.c:sysfs_show_group: use scnprintf to format intermediate strings

Link: https://lore.kernel.org/r/f5c9e4c9d8dafca1b7b70bd597ee7f8f219c31c8.1602122880.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-28 | RDMA/core: Fix error return in _ib_modify_qp() | Jing Xiangfeng | 1 file, -1/+3
Return the error code from PTR_ERR() in the error handling path instead of 0. Fixes: 51aab12631dd ("RDMA/core: Get xmit slave for LAG") Link: https://lore.kernel.org/r/20201016075845.129562-1-jingxiangfeng@huawei.com Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-28 | RDMA: Add rdma_connect_locked() | Jason Gunthorpe | 1 file, -10/+38
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED: either the handler triggers a completion and another thread does rdma_connect(), or the handler directly calls rdma_connect(). In all cases rdma_connect() needs to hold the handler_mutex, but when handlers are invoked this is already held by the core code. This causes ULPs using the 2nd method to deadlock. Provide rdma_connect_locked() and have all ULPs call it from their handlers. Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Fixes: 2a7cec538169 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state") Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
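A minimal sketch of the 2nd flow after this change, assuming a hypothetical ULP handler (conn_param setup is elided): inside a cm_id event handler the core already holds handler_mutex, so plain rdma_connect() would deadlock and the _locked variant must be used.

    #include <rdma/rdma_cm.h>

    static int my_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *ev)
    {
        struct rdma_conn_param param = {};

        switch (ev->event) {
        case RDMA_CM_EVENT_ROUTE_RESOLVED:
            /* handler context: handler_mutex is already held */
            return rdma_connect_locked(id, &param);
        default:
            return 0;
        }
    }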
2020-10-27 | RDMA: Convert sysfs device * show functions to use sysfs_emit() | Joe Perches | 4 files, -24/+33
Done with cocci script:

    @@
    identifier d_show;
    identifier dev, attr, buf;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        <...
        return
    -       sprintf(buf,
    +       sysfs_emit(buf,
        ...);
        ...>
    }

    @@
    identifier d_show;
    identifier dev, attr, buf;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        <...
        return
    -       snprintf(buf, PAGE_SIZE,
    +       sysfs_emit(buf,
        ...);
        ...>
    }

    @@
    identifier d_show;
    identifier dev, attr, buf;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        <...
        return
    -       scnprintf(buf, PAGE_SIZE,
    +       sysfs_emit(buf,
        ...);
        ...>
    }

    @@
    identifier d_show;
    identifier dev, attr, buf;
    expression chr;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        <...
        return
    -       strcpy(buf, chr);
    +       sysfs_emit(buf, chr);
        ...>
    }

    @@
    identifier d_show;
    identifier dev, attr, buf;
    identifier len;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        <...
        len =
    -       sprintf(buf,
    +       sysfs_emit(buf,
        ...);
        ...>
        return len;
    }

    @@
    identifier d_show;
    identifier dev, attr, buf;
    identifier len;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        <...
        len =
    -       snprintf(buf, PAGE_SIZE,
    +       sysfs_emit(buf,
        ...);
        ...>
        return len;
    }

    @@
    identifier d_show;
    identifier dev, attr, buf;
    identifier len;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        <...
        len =
    -       scnprintf(buf, PAGE_SIZE,
    +       sysfs_emit(buf,
        ...);
        ...>
        return len;
    }

    @@
    identifier d_show;
    identifier dev, attr, buf;
    identifier len;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        <...
    -   len += scnprintf(buf + len, PAGE_SIZE - len,
    +   len += sysfs_emit_at(buf, len,
        ...);
        ...>
        return len;
    }

    @@
    identifier d_show;
    identifier dev, attr, buf;
    expression chr;
    @@
    ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf)
    {
        ...
    -   strcpy(buf, chr);
    -   return strlen(buf);
    +   return sysfs_emit(buf, chr);
    }

Link: https://lore.kernel.org/r/7f406fa8e3aa2552c022bec680f621e38d1fe414.1602122879.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
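A before/after sketch of what the first rule does to a typical show function (the attribute name and value are hypothetical): sysfs_emit() knows the buffer is a full PAGE_SIZE, so the explicit bound disappears.

    #include <linux/device.h>
    #include <linux/sysfs.h>

    static ssize_t state_show(struct device *dev,
                              struct device_attribute *attr, char *buf)
    {
        /* before: return snprintf(buf, PAGE_SIZE, "%d\n", some_state); */
        return sysfs_emit(buf, "%d\n", 1 /* some_state, hypothetical */);
    }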
2020-10-27 | RDMA/uverbs: Fix false error in query gid IOCTL | Gal Pressman | 1 file, -3/+0
Some drivers (such as EFA) have a GID table, but aren't IB/RoCE devices. Remove the unnecessary rdma_ib_or_roce() check. This fixes rdma-core failures for EFA when it uses the new ioctl interface for querying the GID table. Fixes: 9f85cbe50aa0 ("RDMA/uverbs: Expose the new GID query API to user space") Link: https://lore.kernel.org/r/20201026082621.32463-1-galpress@amazon.com Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-27 | RDMA: Remove AH from uverbs_cmd_mask | Jason Gunthorpe | 3 files, -5/+11
Drivers that need a uverbs AH should instead set the create_user_ah() op similar to reg_user_mr(). MODIFY_AH and QUERY_AH cmds were never implemented so are just deleted. Link: https://lore.kernel.org/r/11-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-27 | RDMA/core: Remove uverbs_ex_cmd_mask | Jason Gunthorpe | 2 files, -19/+1
No driver sets it, and the core code sets a maximum mask, so simply remove it. Disabled operations are now handled either by having a NULL ops pointer, or by having the common driver callbacks check for unsupported extended attributes. Link: https://lore.kernel.org/r/9-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-27 | RDMA: Check create_flags during create_qp | Jason Gunthorpe | 1 file, -0/+1
Each driver should check that the QP attrs create_flags is supported. Unfortunately when create_flags was added to the QP attrs the drivers were not updated. uverbs_ex_cmd_mask was used to block it - even though kernel drivers use these flags too. Check that flags is zero in all drivers that don't use it, remove IB_USER_VERBS_EX_CMD_CREATE_QP from uverbs_ex_cmd_mask. Fix the error code to be EOPNOTSUPP. Link: https://lore.kernel.org/r/8-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
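A minimal sketch of the per-driver check this series adds (the helper is hypothetical; the same shape applies to the create_cq flags check in the next entry): a driver that supports no create_flags rejects anything non-zero rather than silently ignoring it.

    #include <rdma/ib_verbs.h>

    static int my_create_qp_check(struct ib_qp_init_attr *attr)
    {
        if (attr->create_flags)
            return -EOPNOTSUPP;    /* the error code this series settles on */
        /* ... continue with QP creation ... */
        return 0;
    }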
2020-10-27 | RDMA: Check flags during create_cq | Jason Gunthorpe | 1 file, -0/+1
Each driver should check that the CQ attrs is supported. Unfortunately when flags was added to the CQ attrs the drivers were not updated, and uverbs_ex_cmd_mask was used to block it. This was missed when create CQ was converted to ioctl, so non-zero flags could have been passed into drivers. Check that flags is zero in all drivers that don't use it, remove IB_USER_VERBS_EX_CMD_CREATE_CQ from uverbs_ex_cmd_mask. Fixes: 41b2a71fc848 ("IB/uverbs: Move ioctl path of create_cq and destroy_cq to a new file") Link: https://lore.kernel.org/r/7-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-27 | RDMA: Check attr_mask during modify_qp | Jason Gunthorpe | 2 files, -6/+3
Each driver should check that it can support the provided attr_mask during modify_qp. IB_USER_VERBS_EX_CMD_MODIFY_QP was being used to block modify_qp_ex because the driver didn't check RATE_LIMIT. Link: https://lore.kernel.org/r/6-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-27 | RDMA: Check srq_type during create_srq | Jason Gunthorpe | 1 file, -0/+1
uverbs was blocking srq_types the driver doesn't support based on the CREATE_XSRQ cmd_mask. Fix all drivers to check for supported srq_types during create_srq and move CREATE_XSRQ to the core code. Link: https://lore.kernel.org/r/5-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-27 | RDMA: Move more uverbs_cmd_mask settings to the core | Jason Gunthorpe | 2 files, -5/+19
These functions all depend on the driver providing a specific op:

- REREG_MR is rereg_user_mr(). bnxt_re set this without providing the op
- ATTACH/DETACH_MCAST is attach_mcast()/detach_mcast(). usnic set this without providing the op
- OPEN_QP doesn't involve the driver but requires an XRCD. qedr provides xrcd but forgot to set it, usnic doesn't provide XRCD but set it anyhow.
- OPEN/CLOSE_XRCD are the ops alloc_xrcd()/dealloc_xrcd()
- CREATE_SRQ/DESTROY_SRQ are the ops create_srq()/destroy_srq()
- QUERY/MODIFY_SRQ is op query_srq()/modify_srq(). hns sets this but sometimes supplies a NULL op.
- RESIZE_CQ is op resize_cq(). bnxt_re sets this but doesn't supply an op
- ALLOC/DEALLOC_MW is alloc_mw()/dealloc_mw(). cxgb4 provided a (now deleted) implementation but no userspace

All drivers were checked that no drivers provide the op without also setting uverbs_cmd_mask so this should have no functional change. Link: https://lore.kernel.org/r/4-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
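A rough sketch of the core-side pattern this series moves toward (the function is illustrative, not the actual core code): derive the command-mask bit from the presence of the driver op instead of trusting each driver to keep the two in sync by hand.

    #include <linux/bits.h>
    #include <rdma/ib_verbs.h>

    static void set_cmd_mask_from_ops(struct ib_device *dev)
    {
        if (dev->ops.create_srq)
            dev->uverbs_cmd_mask |=
                BIT_ULL(IB_USER_VERBS_CMD_CREATE_SRQ) |
                BIT_ULL(IB_USER_VERBS_CMD_DESTROY_SRQ);
        if (dev->ops.rereg_user_mr)
            dev->uverbs_cmd_mask |= BIT_ULL(IB_USER_VERBS_CMD_REREG_MR);
    }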
2020-10-27 | RDMA: Remove elements in uverbs_cmd_mask that all drivers set | Jason Gunthorpe | 1 file, -0/+16
This is a step toward eliminating uverbs_cmd_mask. Preset this list in the core code. Only the op reg_user_mr wasn't already being required from the drivers. Link: https://lore.kernel.org/r/3-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-27 | RDMA: Remove uverbs_ex_cmd_mask values that are linked to functions | Jason Gunthorpe | 2 files, -1/+12
For a while now the uverbs layer has checked whether the driver implements a function before allowing the ucmd to proceed. This largely obsoletes the cmd_mask stuff, but there are some tricky bits in drivers preventing it from being removed. Remove the easy elements of uverbs_ex_cmd_mask by pre-setting them in the core code. These are triggered solely based on the related ops function pointer. query_device_ex is not triggered based on an op, but all drivers already implement something compatible with the extension, so enable it globally too. Link: https://lore.kernel.org/r/2-v1-caa70ba3d1ab+1436e-ucmd_mask_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-17 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma | Linus Torvalds | 28 files, -1298/+1804
Pull rdma updates from Jason Gunthorpe: "A usual cycle for RDMA with a typical mix of driver and core subsystem updates:

- Driver minor changes and bug fixes for mlx5, efa, rxe, vmw_pvrdma, hns, usnic, qib, qedr, cxgb4, bnxt_re
- Various rtrs fixes and updates
- Bug fix for mlx4 CM emulation for virtualization scenarios where MRA wasn't working right
- Use tracepoints instead of pr_debug in the CM code
- Scrub the locking in ucma and cma to close more syzkaller bugs
- Use tasklet_setup in the subsystem
- Revert the idea that 'destroy' operations are not allowed to fail at the driver level. This proved unworkable from a HW perspective.
- Revise how the umem API works so drivers make fewer mistakes using it
- XRC support for qedr
- Convert uverbs objects RWQ and MW to the new allocation scheme
- Large queue entry sizes for hns
- Use hmm_range_fault() for mlx5 On Demand Paging
- uverbs APIs to inspect the GID table instead of sysfs
- Move some of the RDMA code for building large page SGLs into lib/scatterlist"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (191 commits)
  RDMA/ucma: Fix use after free in destroy id flow
  RDMA/rxe: Handle skb_clone() failure in rxe_recv.c
  RDMA/rxe: Move the definitions for rxe_av.network_type to uAPI
  RDMA: Explicitly pass in the dma_device to ib_register_device
  lib/scatterlist: Do not limit max_segment to PAGE_ALIGNED values
  IB/mlx4: Convert rej_tmout radix-tree to XArray
  RDMA/rxe: Fix bug rejecting all multicast packets
  RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()
  RDMA/rxe: Remove duplicate entries in struct rxe_mr
  IB/hfi,rdmavt,qib,opa_vnic: Update MAINTAINERS
  IB/rdmavt: Fix sizeof mismatch
  MAINTAINERS: CISCO VIC LOW LATENCY NIC DRIVER
  RDMA/bnxt_re: Fix sizeof mismatch for allocation of pbl_tbl.
  RDMA/bnxt_re: Use rdma_umem_for_each_dma_block()
  RDMA/umem: Move to allocate SG table from pages
  lib/scatterlist: Add support in dynamic allocation of SG table from pages
  tools/testing/scatterlist: Show errors in human readable form
  tools/testing/scatterlist: Rejuvenate bit-rotten test
  RDMA/ipoib: Set rtnl_link_ops for ipoib interfaces
  RDMA/uverbs: Expose the new GID query API to user space
  ...
2020-10-16 | mm: remove the now-unnecessary mmget_still_valid() hack | Jann Horn | 1 file, -3/+0
The preceding patches have ensured that core dumping properly takes the mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all its users. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 | RDMA/ucma: Fix use after free in destroy id flow | Maor Gottlieb | 1 file, -5/+6
ucma_free_ctx() should call __destroy_id() on all the connection requests that have not been delivered to user space. Currently it calls it on the context itself, which causes a use after free. Fixes the trace:

    BUG: Unable to handle kernel data access on write at 0x5deadbeef0000108
    Faulting instruction address: 0xc0080000002428f4
    Oops: Kernel access of bad area, sig: 11 [#1]
    Call Trace:
    [c000000207f2b680] [c00800000024280c] .__destroy_id+0x28c/0x610 [rdma_ucm] (unreliable)
    [c000000207f2b750] [c0080000002429c4] .__destroy_id+0x444/0x610 [rdma_ucm]
    [c000000207f2b820] [c008000000242c24] .ucma_close+0x94/0xf0 [rdma_ucm]
    [c000000207f2b8c0] [c00000000046fbdc] .__fput+0xac/0x330
    [c000000207f2b960] [c00000000015d48c] .task_work_run+0xbc/0x110
    [c000000207f2b9f0] [c00000000012fb00] .do_exit+0x430/0xc50
    [c000000207f2bae0] [c0000000001303ec] .do_group_exit+0x5c/0xd0
    [c000000207f2bb70] [c000000000144a34] .get_signal+0x194/0xe30
    [c000000207f2bc60] [c00000000001f6b4] .do_notify_resume+0x124/0x470
    [c000000207f2bd60] [c000000000032484] .interrupt_exit_user_prepare+0x1b4/0x240
    [c000000207f2be20] [c000000000010034] interrupt_return+0x14/0x1c0

Rename listen_ctx to conn_req_ctx as the poor name was the cause of this bug. Fixes: a1d33b70dbbc ("RDMA/ucma: Rework how new connections are passed through event delivery") Link: https://lore.kernel.org/r/20201012045600.418271-4-leon@kernel.org Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-16 | RDMA: Explicitly pass in the dma_device to ib_register_device | Jason Gunthorpe | 1 file, -53/+22
The code in setup_dma_device has become rather convoluted, move all of this to the drivers. Drivers now pass in a DMA capable struct device which will be used to setup DMA, or drivers must fully configure the ibdev for DMA and pass in NULL. Other than setting the masks in rvt all drivers were doing this already anyhow. mthca, mlx4 and mlx5 were already setting up the maximum DMA segment size based on their hardware limits in:

    __mthca_init_one()
        dma_set_max_seg_size (1G)
    __mlx4_init_one()
        dma_set_max_seg_size (1G)
    mlx5_pci_init()
        set_dma_caps()
            dma_set_max_seg_size (2G)

Other non software drivers (except usnic) were extended to UINT_MAX [1, 2] instead of 2G as before.

[1] https://lore.kernel.org/linux-rdma/20200924114940.GE9475@nvidia.com/
[2] https://lore.kernel.org/linux-rdma/20200924114940.GE9475@nvidia.com/

Link: https://lore.kernel.org/r/20201008082752.275846-1-leon@kernel.org Link: https://lore.kernel.org/r/6b2ed339933d066622d5715903870676d8cc523a.1602590106.git.mchehab+huawei@kernel.org Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
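A minimal sketch of the new registration contract for a PCI-based driver (driver and device names are hypothetical): pass the DMA-capable struct device directly, or NULL if the ibdev was fully configured for DMA by the driver itself.

    #include <linux/dma-mapping.h>
    #include <linux/pci.h>
    #include <linux/sizes.h>
    #include <rdma/ib_verbs.h>

    static int my_register(struct ib_device *ibdev, struct pci_dev *pdev)
    {
        /* optionally raise the segment cap first, as mlx5 does */
        dma_set_max_seg_size(&pdev->dev, SZ_2G);

        return ib_register_device(ibdev, "mydev%d", &pdev->dev);
    }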
2020-10-16 | Merge branch 'dynamic_sg' into rdma.git for-next | Jason Gunthorpe | 2 files, -84/+16
From Maor Gottlieb:

====================
This series extends __sg_alloc_table_from_pages to allow chaining of new pages to an already initialized SG table. This allows for drivers to utilize the optimization of merging contiguous pages without a need to pre allocate all the pages and hold them in a very large temporary buffer prior to the call to SG table initialization. The last patch changes the Infiniband core to use the new API. It removes duplicate functionality from the code and benefits from the optimization of allocating dynamic SG table from pages. On a huge pages system with a 2MB page size, without this change, the SG table would contain 512x as many SG entries.
====================

* branch 'dynamic_sg':
  RDMA/umem: Move to allocate SG table from pages
  lib/scatterlist: Add support in dynamic allocation of SG table from pages
  tools/testing/scatterlist: Show errors in human readable form
  tools/testing/scatterlist: Rejuvenate bit-rotten test
2020-10-06 | RDMA/umem: Move to allocate SG table from pages | Maor Gottlieb | 1 file, -82/+12
Remove the implementation of ib_umem_add_sg_table and instead call __sg_alloc_table_from_pages, which already has the logic to merge contiguous pages. Besides removing duplicated functionality, this reduces the memory consumption of the SG table significantly. Prior to this patch, the SG table was allocated in advance without regard to contiguous pages. On a huge pages system with a 2MB page size, without this change, the SG table would contain 512x as many SG entries. E.g. for a 100GB memory registration:

                Number of entries    Size
    Before      26214400             600.0MB
    After       51200                1.2MB

Link: https://lore.kernel.org/r/20201004154340.1080481-5-leon@kernel.org Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
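As a rough check on the table above (assuming a 4KB base page size): 100GB / 4KB = 26,214,400 entries before the change, versus 100GB / 2MB = 51,200 entries once contiguous ranges are merged into 2MB segments, which is the 512x reduction quoted above.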
2020-10-05 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 3 files, -9/+19
Pull networking fixes from David Miller:

 1) Make sure SKB control block is in the proper state during IPSEC ESP-in-TCP encapsulation. From Sabrina Dubroca.
 2) Various kinds of attributes were not being cloned properly when we build new xfrm_state objects from existing ones. Fix from Antony Antony.
 3) Make sure to keep BTF sections, from Tony Ambardar.
 4) TX DMA channels need proper locking in lantiq driver, from Hauke Mehrtens.
 5) Honour route MTU during forwarding, always. From Maciej Żenczykowski.
 6) Fix races in kTLS which can result in crashes, from Rohit Maheshwari.
 7) Skip TCP DSACKs with ridiculous sequence ranges, from Priyaranjan Jha.
 8) Use correct address family in xfrm state lookups, from Herbert Xu.
 9) A bridge FDB flush should not clear out user managed fdb entries with the ext_learn flag set, from Nikolay Aleksandrov.
10) Fix nested locking of netdev address lists, from Taehee Yoo.
11) Fix handling of 32-bit DATA_FIN values in mptcp, from Mat Martineau.
12) Fix r8169 data corruptions on RTL8402 chips, from Heiner Kallweit.
13) Don't free command entries in mlx5 while comp handler could still be running, from Eran Ben Elisha.
14) Error flow of request_irq() in mlx5 is busted; due to an off by one we try to free an IRQ that was never allocated. From Maor Gottlieb.
15) Fix leak when dumping netlink policies, from Johannes Berg.
16) Sendpage cannot be performed when a page is a slab page, or the page count is < 1. Some subsystems such as nvme were doing so. Create a "sendpage_ok()" helper and use it as needed, from Coly Li.
17) Don't leak request socket when using syncookies with mptcp, from Paolo Abeni.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits)
  net/core: check length before updating Ethertype in skb_mpls_{push,pop}
  net: mvneta: fix double free of txq->buf
  net_sched: check error pointer in tcf_dump_walker()
  net: team: fix memory leak in __team_options_register
  net: typhoon: Fix a typo Typoon --> Typhoon
  net: hinic: fix DEVLINK build errors
  net: stmmac: Modify configuration method of EEE timers
  tcp: fix syn cookied MPTCP request socket leak
  libceph: use sendpage_ok() in ceph_tcp_sendpage()
  scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map()
  drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage()
  tcp: use sendpage_ok() to detect misused .sendpage
  nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage()
  net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send
  net: introduce helper sendpage_ok() in include/linux/net.h
  net: usb: pegasus: Proper error handing when setting pegasus' MAC address
  net: core: document two new elements of struct net_device
  netlink: fix policy dump leak
  net/mlx5e: Fix race condition on nhe->n pointer in neigh update
  net/mlx5e: Fix VLAN create flow
  ...
2020-10-02 | RDMA/uverbs: Expose the new GID query API to user space | Avihai Horon | 1 file, -1/+195
Expose the query GID table and entry API to user space by adding two new methods and method handlers to the device object. This API provides a faster way to query a GID table using a single call, and will be used in libibverbs to improve the current approach, which requires multiple calls to open, close and read multiple sysfs files for a single GID table entry. Link: https://lore.kernel.org/r/20200923165015.2491894-5-leon@kernel.org Signed-off-by: Avihai Horon <avihaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-02 | RDMA/core: Introduce new GID table query API | Avihai Horon | 1 file, -3/+63
Introduce rdma_query_gid_table, which enables querying all the GID tables of a given device and copying the attributes of all valid GID entries to a provided buffer. This API provides a faster way to query a GID table using a single call, and will be used in libibverbs to improve the current approach, which requires multiple calls to open, close and read multiple sysfs files for a single GID table entry. Link: https://lore.kernel.org/r/20200923165015.2491894-4-leon@kernel.org Signed-off-by: Avihai Horon <avihaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
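A minimal sketch of an in-kernel caller of the new API (the wrapper is hypothetical; sizing and allocating the entries buffer is the caller's responsibility):

    #include <rdma/ib_cache.h>

    static ssize_t my_dump_gids(struct ib_device *dev,
                                struct ib_uverbs_gid_entry *entries,
                                size_t max_entries)
    {
        /* returns the number of valid entries copied, or a negative errno */
        return rdma_query_gid_table(dev, entries, max_entries);
    }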
2020-10-02 | RDMA/core: Modify enum ib_gid_type and enum rdma_network_type | Avihai Horon | 4 files, -5/+14
Separate IB_GID_TYPE_IB and IB_GID_TYPE_ROCE into two different values, so that enum ib_gid_type will match the gid types of the new query GID table API which will be introduced in the following patches. This change in enum ib_gid_type also requires changing enum rdma_network_type, by separating the RDMA_NETWORK_IB and RDMA_NETWORK_ROCE_V1 values. Link: https://lore.kernel.org/r/20200923165015.2491894-3-leon@kernel.org Signed-off-by: Avihai Horon <avihaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-02 | RDMA/core: Change rdma_get_gid_attr returned error code | Avihai Horon | 2 files, -2/+3
Change the error code returned from rdma_get_gid_attr when the GID entry is invalid but the GID index is in the gid table size range to -ENODATA instead of -EINVAL. This change is done in order to provide a more accurate error reporting to be used by the new GID query API in user space. Nevertheless, -EINVAL is still returned from sysfs in the aforementioned case to maintain compatibility with user space that expects -EINVAL. Link: https://lore.kernel.org/r/20200923165015.2491894-2-leon@kernel.org Signed-off-by: Avihai Horon <avihaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-02 | RDMA/core: Constify struct attribute_group | Rikard Falkeborn | 1 file, -6/+6
The only usage of the pma_table field in the ib_port struct is to pass its address to sysfs_create_group() and sysfs_remove_group(). Make it const to make it possible to constify a couple of static struct attribute_group. This allows the compiler to put them in read-only memory. Link: https://lore.kernel.org/r/20200930224004.24279-2-rikard.falkeborn@gmail.com Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
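A short sketch of the constification this enables (pma_attrs here is a placeholder attribute array): with a const pointer field, whole static groups can live in read-only memory.

    #include <linux/sysfs.h>

    static struct attribute *pma_attrs[] = { NULL };    /* placeholder */

    static const struct attribute_group pma_group = {
        .name  = "counters",
        .attrs = pma_attrs,
    };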
2020-10-01 | IB/core: Enable ODP sync without faulting | Yishai Hadas | 1 file, -10/+25
Enable ODP sync without faulting; this improves performance by reducing the number of page faults in the system. The gain from this option is that the device page table can be aligned with the pages present in the CPU page table without causing page faults. As a result, the data-path overhead of the hardware triggering a fault, which ends up calling the driver to bring in the pages, is dropped. Link: https://lore.kernel.org/r/20200930163828.1336747-3-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-01 | IB/core: Improve ODP to use hmm_range_fault() | Yishai Hadas | 1 file, -166/+112
Move to use hmm_range_fault() instead of get_user_pages_remote() to improve performance in a few aspects:

- Dropping the need to allocate and free memory to hold its output
- No need any more to use put_page() to unpin the pages
- The logic to detect contiguous pages is done based on the returned order; no need to run per page and evaluate.

In addition, moving to hmm_range_fault() enables reducing page faults in the system with its snapshot mode; this will be introduced in the next patches from this series. As part of this, clean up some flows and use the required data structures to work with hmm_range_fault(). Link: https://lore.kernel.org/r/20200930163828.1336747-2-leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
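A sketch of the canonical hmm_range_fault() retry loop per Documentation/vm/hmm.rst (the wrapper is hypothetical; the ODP code follows this general shape, with the range pre-populated by the caller):

    #include <linux/hmm.h>
    #include <linux/mmap_lock.h>
    #include <linux/mmu_notifier.h>

    static int my_fault_range(struct mm_struct *mm, struct hmm_range *range)
    {
        int ret;

        do {
            range->notifier_seq = mmu_interval_read_begin(range->notifier);
            mmap_read_lock(mm);
            ret = hmm_range_fault(range);
            mmap_read_unlock(mm);
        } while (ret == -EBUSY);    /* collided with an invalidation, retry */

        return ret;
    }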
2020-09-30 | RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel() | Jason Gunthorpe | 1 file, -6/+5
This three thread race can result in the work being run once the callback becomes NULL:

    CPU1                 CPU2                  CPU3
    netevent_callback()
                         process_one_req()
                                               rdma_addr_cancel()
    [..]
    spin_lock_bh()
    set_timeout()
    spin_unlock_bh()
                         spin_lock_bh()
                         list_del_init(&req->list);
                         spin_unlock_bh()
                         req->callback = NULL
                                               spin_lock_bh()
                                               if (!list_empty(&req->list))
                                               // Skipped!
                                               // cancel_delayed_work(&req->work);
                                               spin_unlock_bh()
                         process_one_req() // again
                         req->callback() // BOOM
                                               cancel_delayed_work_sync()

The solution is to always cancel the work once it is completed so any in between set_timeout() does not result in it running again. Cc: stable@vger.kernel.org Fixes: 44e75052bc2a ("RDMA/rdma_cm: Make rdma_addr_cancel into a fence") Link: https://lore.kernel.org/r/20200930072007.1009692-1-leon@kernel.org Reported-by: Dan Aloni <dan@kernelim.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-30 | RDMA/core: Remove ucontext->closing | Jason Gunthorpe | 1 file, -1/+0
Nothing reads this any more, and the reason for its existence has passed due to the deferred fput() scheme. Fixes: 8ea1f989aa07 ("drivers/IB,usnic: reduce scope of mmap_sem") Link: https://lore.kernel.org/r/0-v1-df64ff042436+42-uctx_closing_jgg@nvidia.com Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-29 | RDMA/core: Align write and ioctl checks of QP types | Leon Romanovsky | 1 file, -2/+15
The ioctl flow checks that the user provides only a supported list of QP types, while the write flow didn't, relying on the driver to check it. Align the two flows to fail as early as possible. Link: https://lore.kernel.org/r/20200926102450.2966017-8-leon@kernel.org Reviewed-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-29 | net: core: introduce struct netdev_nested_priv for nested interface infrastructure | Taehee Yoo | 3 files, -9/+19
Functions related to nested interface infrastructure such as netdev_walk_all_{ upper | lower }_dev() pass both private functions and a "data" pointer to handle their own things. At this point, the data pointer type is void *. In order to make it easier to expand common variables and functions, this new netdev_nested_priv structure is added. In the following patch, a new member variable will be added into this struct to fix the lockdep issue. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
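A minimal sketch of the new wrapper type in use (my_ctx and both helpers are hypothetical); a walker callback now receives the struct instead of a bare void *:

    #include <linux/netdevice.h>

    struct my_ctx { int count; };    /* hypothetical per-walk state */

    static int my_walk_cb(struct net_device *dev,
                          struct netdev_nested_priv *priv)
    {
        struct my_ctx *ctx = priv->data;

        ctx->count++;        /* e.g. count nested upper devices */
        return 0;            /* non-zero would stop the walk */
    }

    static void my_count_uppers(struct net_device *dev, struct my_ctx *ctx)
    {
        struct netdev_nested_priv priv = { .data = ctx };

        /* caller must hold rcu_read_lock() for the _rcu walker */
        netdev_walk_all_upper_dev_rcu(dev, my_walk_cb, &priv);
    }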
2020-09-23 | RDMA/restrack: Improve readability in task name management | Leon Romanovsky | 10 files, -114/+157
Use rdma_restrack_set_name() and rdma_restrack_parent_name() instead of tricky uses of rdma_restrack_attach_task()/rdma_restrack_uadd(). This makes all restracks uniformly added using rdma_restrack_add(). Link: https://lore.kernel.org/r/20200922091106.2152715-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-23 | RDMA/restrack: Simplify restrack tracking in kernel flows | Leon Romanovsky | 7 files, -50/+22
Have a single rdma_restrack_add() that adds an entry; there is no reason to split the user/kernel here, as rdma_restrack_set_task() is responsible for this difference. This patch prepares the code for the future requirement of making restrack mandatory for managing ib objects. Link: https://lore.kernel.org/r/20200922091106.2152715-5-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-23 | RDMA/restrack: Count references to the verbs objects | Leon Romanovsky | 9 files, -31/+70
Refactor the restrack code to make sure the kref inside the restrack entry properly kref's the object in which it is embedded. This slight change is needed for future conversions of MR and QP which are refcounted before the release and kfree. The ideal flow from ib_core perspective is as follows:

* Allocate ib_* structure with rdma_zalloc_*.
* Set everything that is known to ib_core to that newly created object.
* Initialize kref with restrack help
* Call to driver specific allocation functions.
* Insert into restrack DB
....
* Return and release restrack with restrack_put.

Largely this means a rdma_restrack_new() should be called near allocating the containing structure. Link: https://lore.kernel.org/r/20200922091106.2152715-4-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
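A rough sketch of that ideal flow, under the assumption that the restrack helpers live in the core-internal restrack.h header (error handling and the driver call are elided; not the actual core code):

    #include "restrack.h"    /* core-internal, drivers/infiniband/core */

    static void example_create_flow(struct ib_pd *pd)
    {
        /* init the embedded kref right after allocating the container */
        rdma_restrack_new(&pd->res, RDMA_RESTRACK_PD);
        /* ... driver-specific allocation via pd->device->ops ... */
        rdma_restrack_add(&pd->res);    /* insert into the restrack DB */
    }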
2020-09-23 | RDMA/cma: Delete from restrack DB after successful destroy | Leon Romanovsky | 1 file, -2/+1
Update the code to have a destroy pattern similar to other IB objects. This change creates asymmetry in the rdma_id_private create flow to make sure that memory is managed by restrack. Link: https://lore.kernel.org/r/20200922091106.2152715-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-19 | RDMA/ucma: Rework ucma_migrate_id() to avoid races with destroy | Jason Gunthorpe | 1 file, -50/+29
ucma_destroy_id() assumes that all things accessing the ctx will do so via the xarray. This assumption is violated only in the case the FD is being closed; then the ctx is reached via the ctx_list. Normally this is OK since ucma_destroy_id() cannot run concurrently with release(); however, when ucma_migrate_id() is involved this can be violated as the close of the 2nd FD can run concurrently with destroy on the first:

    CPU0                           CPU1
    ucma_destroy_id(fda)
                                   ucma_migrate_id(fda -> fdb)
                                     ucma_get_ctx()
    xa_lock()
    _ucma_find_context()
    xa_erase()
    xa_unlock()
                                     xa_lock()
                                     ctx->file = new_file
                                     list_move()
                                     xa_unlock()
                                     ucma_put_ctx()

                                   ucma_close(fdb)
                                     _destroy_id()
                                     kfree(ctx)
    _destroy_id()
      wait_for_completion()
      // boom, ctx was freed

The ctx->file must be modified under the handler and xa_lock, and prior to modification the ID must be rechecked that it is still reachable from cur_file, ie there is no parallel destroy or migrate. To make this work remove the double locking and streamline the control flow. The double locking was obsoleted by the handler lock now directly preventing new uevents from being created, and the ctx_list cannot be read while holding fgets on both files. Removing the double locking also removes the need to check for the same file. Fixes: 88314e4dda1e ("RDMA/cma: add support for rdma_migrate_id()") Link: https://lore.kernel.org/r/0-v1-05c5a4090305+3a872-ucma_syz_migrate_jgg@nvidia.com Reported-and-tested-by: syzbot+cc6fc752b3819e082d0c@syzkaller.appspotmail.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-18 | Merge branch 'mlx5_active_speed' into rdma.git for-next | Jason Gunthorpe | 3 files, -5/+6
Leon Romanovsky says:

====================
IBTA declares speed as 16 bits, but kernel stores it in u8. This series fixes the in-kernel declaration while keeping the external interface intact.
====================

Based on the mlx5-next branch at git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux due to dependencies.

* branch 'mlx5_active_speed':
  RDMA: Fix link active_speed size
  RDMA/mlx5: Delete duplicated mlx5_ptys_width enum
  net/mlx5: Refactor query port speed functions
2020-09-18 | RDMA: Fix link active_speed size | Aharon Landau | 2 files, -2/+3
According to the IB spec active_speed size should be u16 and not u8 as before. Changing it to allow further extensions in offered speeds. Link: https://lore.kernel.org/r/20200917090223.1018224-4-leon@kernel.org Signed-off-by: Aharon Landau <aharonl@mellanox.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 | RDMA: Convert RWQ table logic to ib_core allocation scheme | Leon Romanovsky | 4 files, -35/+26
Move struct ib_rwq_ind_table allocation to ib_core. Link: https://lore.kernel.org/r/20200902081623.746359-3-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 | RDMA: Clean MW allocation and free flows | Leon Romanovsky | 3 files, -8/+20
Move allocation and destruction of memory windows under ib_core responsibility and clean drivers to ensure that no updates to MW ib_core structures are done in driver layer. Link: https://lore.kernel.org/r/20200902081623.746359-2-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-17 | RDMA/cma: Fix use after free race in roce multicast join | Jason Gunthorpe | 1 file, -108/+88
The roce path triggers a work queue that continues to touch the id_priv but doesn't hold any reference on it. Further, unlike in the IB case, the work queue is not fenced during rdma_destroy_id(). This can trigger a use after free if a destroy is triggered in the incredibly narrow window after the queue_work and the work starting and obtaining the handler_mutex. The only purpose of this work queue is to run the ULP event callback from the standard context, so switch the design to use the existing cma_work_handler() scheme. This simplifies quite a lot of the flow:

- Use the cma_work_handler() callback to launch the work for roce. This requires generating the event synchronously inside the rdma_join_multicast(), which in turn means the dummy struct ib_sa_multicast can become a simple stack variable.
- cm_work_handler() used the id_priv kref, so we can entirely eliminate the kref inside struct cma_multicast. Since the cma_multicast never leaks into an unprotected work queue the kfree can be done at the same time as for IB.
- Eliminating the general multicast.ib requires using cma_set_mgid() in a few places to recompute the mgid.

Fixes: 3c86aa70bf67 ("RDMA/cm: Add RDMA CM support for IBoE devices") Link: https://lore.kernel.org/r/20200902081122.745412-9-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>