<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/net/rds/ib.h, branch linux-7.1.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-7.1.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-7.1.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-04-12T20:33:19+00:00</updated>
<entry>
<title>net/rds: Optimize rds_ib_laddr_check</title>
<updated>2026-04-12T20:33:19+00:00</updated>
<author>
<name>Håkon Bugge</name>
<email>haakon.bugge@oracle.com</email>
</author>
<published>2026-04-08T08:04:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=236f718ac885965fa886440b9898dfae185c9733'/>
<id>urn:sha1:236f718ac885965fa886440b9898dfae185c9733</id>
<content type='text'>
rds_ib_laddr_check() creates a CM_ID and attempts to bind the address
in question to it. This is done in order to qualify the allegedly local
address as a usable IB/RoCE address.

In the field, ExaWatcher runs rds-ping to all ports in the fabric from
all local ports. This is done using all active ToS'es. In a full rack system,
we have 14 cell servers and eight db servers. Typically, 6 ToS'es are
used. This implies 528 rds-ping invocations per ExaWatcher's "RDSinfo"
interval.

Adding to this, each rds-ping invocation creates eight sockets and
binds the local address to them:

socket(AF_RDS, SOCK_SEQPACKET, 0)       = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0),
	sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 4
bind(4, {sa_family=AF_INET, sin_port=htons(0),
	sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 5
bind(5, {sa_family=AF_INET, sin_port=htons(0),
	sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 6
bind(6, {sa_family=AF_INET, sin_port=htons(0),
	sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 7
bind(7, {sa_family=AF_INET, sin_port=htons(0),
	sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 8
bind(8, {sa_family=AF_INET, sin_port=htons(0),
	sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 9
bind(9, {sa_family=AF_INET, sin_port=htons(0),
	sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0)       = 10
bind(10, {sa_family=AF_INET, sin_port=htons(0),
	sin_addr=inet_addr("192.168.36.2")}, 16) = 0

So, at every interval ExaWatcher executes rds-ping's, 4224 CM_IDs are
allocated, considering this full-rack system. After a CM_ID has
been allocated, rdma_bind_addr() is called, with the port number being
zero. This implies that the CMA will attempt to search for an un-used
ephemeral port. Simplified, the algorithm is to start at a random
position in the available port space, and then if needed, iterate
until an un-used port is found.
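
As a rough illustration of the simplified search described above, the
following Python sketch starts at a random offset in a port range and
walks forward, wrapping around, until it finds a free port. It is not
the CMA code; the function name, the set used for book-keeping and the
port range are all assumptions made for this example.

```python
import random


def pick_ephemeral_port(used_ports, low, high, rng):
    """Return a free port in [low, high], or None if every port is taken.

    Hypothetical helper: used_ports is a plain set standing in for the
    real book-keeping of ports that are already bound.
    """
    span = high - low + 1
    start = rng.randrange(span)  # random starting position in the range
    for step in range(span):
        port = low + (start + step) % span  # wrap around at the end
        if port not in used_ports:
            used_ports.add(port)
            return port
    return None


if __name__ == "__main__":
    rng = random.Random()
    used = set()
    # Eight binds with port zero, as in one rds-ping invocation.
    ports = [pick_ephemeral_port(used, 32768, 60999, rng) for _ in range(8)]
    print(ports)
```

Because every call draws a fresh starting position, consecutive calls
tend to land far apart in the port space, which is the lack of spatial
locality discussed below.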

The book-keeping of used ports uses the idr system, which again uses
slab to allocate new struct idr_layer's. The size is 2092 bytes and
slab tries to reduce the wasted space. Hence, it chooses an order:3
allocation, for which 15 idr_layer structs will fit and only 1388
bytes are wasted per the 32KiB order:3 chunk.
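
The numbers above can be re-derived with a few lines of Python. This
is only the arithmetic, assuming 4 KiB pages (consistent with the
32KiB order:3 chunk); it does not model how slab actually selects an
order.

```python
PAGE_SIZE = 4096        # assumed page size in bytes
IDR_LAYER_SIZE = 2092   # size of struct idr_layer, from the text above


def slab_fit(order):
    """Return (chunk bytes, objects that fit, wasted bytes) for an order."""
    chunk = PAGE_SIZE * 2 ** order
    objects = chunk // IDR_LAYER_SIZE
    wasted = chunk - objects * IDR_LAYER_SIZE
    return chunk, objects, wasted


if __name__ == "__main__":
    for order in range(4):
        chunk, objects, wasted = slab_fit(order)
        print(f"order:{order} chunk={chunk} objects={objects} wasted={wasted}")
```

For order:3 this yields 15 objects and 1388 wasted bytes, matching the
figures quoted above, while the lower orders waste more per chunk.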

Although this order:3 allocation seems like a good space/speed
trade-off, it does not resonate well with how it is used by the CMA. The
combination of the randomized starting point in the port space (which
has close to zero spatial locality) and the close proximity in time of
the 4224 invocations of the rds-ping's, creates a memory hog for
order:3 allocations.

These costly allocations may need reclaims and/or compaction. At
worst, they may fail and produce a stack trace such as (from uek4):

[&lt;ffffffff811a72d5&gt;] __inc_zone_page_state+0x35/0x40
[&lt;ffffffff811c2e97&gt;] page_add_file_rmap+0x57/0x60
[&lt;ffffffffa37ca1df&gt;] remove_migration_pte+0x3f/0x3c0 [ksplice_6cn872bt_vmlinux_new]
[&lt;ffffffff811c3de8&gt;] rmap_walk+0xd8/0x340
[&lt;ffffffff811e8860&gt;] remove_migration_ptes+0x40/0x50
[&lt;ffffffff811ea83c&gt;] migrate_pages+0x3ec/0x890
[&lt;ffffffff811afa0d&gt;] compact_zone+0x32d/0x9a0
[&lt;ffffffff811b00ed&gt;] compact_zone_order+0x6d/0x90
[&lt;ffffffff811b03b2&gt;] try_to_compact_pages+0x102/0x270
[&lt;ffffffff81190e56&gt;] __alloc_pages_direct_compact+0x46/0x100
[&lt;ffffffff8119165b&gt;] __alloc_pages_nodemask+0x74b/0xaa0
[&lt;ffffffff811d8411&gt;] alloc_pages_current+0x91/0x110
[&lt;ffffffff811e3b0b&gt;] new_slab+0x38b/0x480
[&lt;ffffffffa41323c7&gt;] __slab_alloc+0x3b7/0x4a0 [ksplice_s0dk66a8_vmlinux_new]
[&lt;ffffffff811e42ab&gt;] kmem_cache_alloc+0x1fb/0x250
[&lt;ffffffff8131fdd6&gt;] idr_layer_alloc+0x36/0x90
[&lt;ffffffff8132029c&gt;] idr_get_empty_slot+0x28c/0x3d0
[&lt;ffffffff813204ad&gt;] idr_alloc+0x4d/0xf0
[&lt;ffffffffa051727d&gt;] cma_alloc_port+0x4d/0xa0 [rdma_cm]
[&lt;ffffffffa0517cbe&gt;] rdma_bind_addr+0x2ae/0x5b0 [rdma_cm]
[&lt;ffffffffa09d8083&gt;] rds_ib_laddr_check+0x83/0x2c0 [ksplice_6l2xst5i_rds_rdma_new]
[&lt;ffffffffa05f892b&gt;] rds_trans_get_preferred+0x5b/0xa0 [rds]
[&lt;ffffffffa05f09f2&gt;] rds_bind+0x212/0x280 [rds]
[&lt;ffffffff815b4016&gt;] SYSC_bind+0xe6/0x120
[&lt;ffffffff815b4d3e&gt;] SyS_bind+0xe/0x10
[&lt;ffffffff816b031a&gt;] system_call_fastpath+0x18/0xd4

To avoid these excessive calls to rdma_bind_addr(), we optimize
rds_ib_laddr_check() by simply checking if the address in question has
been used before. The rds_rdma module keeps track of addresses
associated with IB devices, and the function rds_ib_get_device() is
used to determine if the address already has been qualified as a valid
local address. If not found, we call the legacy rds_ib_laddr_check(),
now renamed to rds_ib_laddr_check_cm().
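
The shape of this fast path can be sketched in Python as follows. The
class, its method names and the set of known addresses are
hypothetical stand-ins for illustration; they are not the rds_rdma
data structures, and the sketch does not claim to show when the module
records an address.

```python
class LocalAddrChecker:
    """Fast-path lookup with a fallback to an expensive check."""

    def __init__(self, cm_check):
        # cm_check stands in for the legacy CM_ID based check.
        self._cm_check = cm_check
        # Stand-in for the addresses tracked per IB device.
        self._known = set()

    def note_address(self, addr):
        """Record an address as associated with an IB device."""
        self._known.add(addr)

    def laddr_check(self, addr):
        """Return True if addr is usable, trying the cheap lookup first."""
        if addr in self._known:
            return True              # already qualified, no CM_ID needed
        return self._cm_check(addr)  # slow path


if __name__ == "__main__":
    slow_calls = []

    def slow_check(addr):
        slow_calls.append(addr)
        return True

    checker = LocalAddrChecker(slow_check)
    checker.laddr_check("192.168.36.2")   # first use takes the slow path
    checker.note_address("192.168.36.2")
    for _ in range(8):                    # later binds skip the slow path
        checker.laddr_check("192.168.36.2")
    print(len(slow_calls))
```

Once an address is known, repeated binds such as the eight per
rds-ping invocation no longer reach the slow path.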

Signed-off-by: Håkon Bugge &lt;haakon.bugge@oracle.com&gt;
Signed-off-by: Somasundaram Krishnasamy &lt;somasundaram.krishnasamy@oracle.com&gt;
Signed-off-by: Gerd Rausch &lt;gerd.rausch@oracle.com&gt;
Signed-off-by: Allison Henderson &lt;achender@kernel.org&gt;
Link: https://patch.msgid.link/20260408080420.540032-2-achender@kernel.org
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>RDS: IB: Remove unused declarations</title>
<updated>2024-08-01T16:03:28+00:00</updated>
<author>
<name>Yue Haibing</name>
<email>yuehaibing@huawei.com</email>
</author>
<published>2024-07-31T06:36:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f9c141fc33393923451bdd190c67d9d4f5d2c67f'/>
<id>urn:sha1:f9c141fc33393923451bdd190c67d9d4f5d2c67f</id>
<content type='text'>
Commit f4f943c958a2 ("RDS: IB: ack more receive completions to improve performance")
removed rds_ib_recv_tasklet_fn() implementation but not the declaration.
And commit ec16227e1414 ("RDS/IB: Infiniband transport") declared but never implemented
other functions.

Signed-off-by: Yue Haibing &lt;yuehaibing@huawei.com&gt;
Reviewed-by: Simon Horman &lt;horms@kernel.org&gt;
Reviewed-by: Allison Henderson &lt;allison.henderson@oracle.com&gt;
Link: https://patch.msgid.link/20240731063630.3592046-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>rds: stop using dmapool</title>
<updated>2020-11-17T19:22:06+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-11-06T18:19:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=42f2611cc1738b201701e717246e11e86bef4e1e'/>
<id>urn:sha1:42f2611cc1738b201701e717246e11e86bef4e1e</id>
<content type='text'>
RDMA ULPs should only perform DMA through the ib_dma_* API instead of
using the hidden dma_device directly.  In addition using the dma coherent
API family that dmapool is a part of can be very inefficient on platforms
that are not DMA coherent.  Switch to use slab allocations and the
ib_dma_* APIs instead.

Link: https://lore.kernel.org/r/20201106181941.1878556-6-hch@lst.de
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Santosh Shilimkar &lt;santosh.shilimkar@oracle.com&gt;
Signed-off-by: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
</content>
</entry>
<entry>
<title>RDMA: Lift ibdev_to_node from rds to common code</title>
<updated>2020-11-12T17:33:44+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-11-06T18:19:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8ecfca68dc4cbee1272a0161e3f2fb9387dc6930'/>
<id>urn:sha1:8ecfca68dc4cbee1272a0161e3f2fb9387dc6930</id>
<content type='text'>
Lift the ibdev_to_node from rds to common code and document it.

Link: https://lore.kernel.org/r/20201106181941.1878556-4-hch@lst.de
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jason Gunthorpe &lt;jgg@nvidia.com&gt;
</content>
</entry>
<entry>
<title>net/rds: NULL pointer de-reference in rds_ib_add_one()</title>
<updated>2020-06-15T19:58:59+00:00</updated>
<author>
<name>Ka-Cheong Poon</name>
<email>ka-cheong.poon@oracle.com</email>
</author>
<published>2020-06-15T07:40:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=33cf601da7a42b1a78d04746aea47b85ccb5c49b'/>
<id>urn:sha1:33cf601da7a42b1a78d04746aea47b85ccb5c49b</id>
<content type='text'>
The parent field of a struct device may be NULL.  The macro
ibdev_to_node() should check for that.
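
The idea of the check, modelled in Python rather than as the actual C
macro: return a "no node" value when there is no parent instead of
de-referencing it. The Device class is hypothetical, and the value -1
for NUMA_NO_NODE is an assumption of this sketch.

```python
NUMA_NO_NODE = -1  # assumed sentinel meaning "no NUMA node"


class Device:
    """Hypothetical stand-in for a device with an optional parent."""

    def __init__(self, parent=None, numa_node=NUMA_NO_NODE):
        self.parent = parent
        self.numa_node = numa_node


def ibdev_to_node(ibdev):
    """Return the parent's NUMA node, or NUMA_NO_NODE without a parent."""
    parent = ibdev.parent
    if parent is None:
        return NUMA_NO_NODE
    return parent.numa_node


if __name__ == "__main__":
    pci = Device(numa_node=1)
    print(ibdev_to_node(Device(parent=pci)))  # device with a parent
    print(ibdev_to_node(Device()))            # device without a parent
```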

Signed-off-by: Ka-Cheong Poon &lt;ka-cheong.poon@oracle.com&gt;
Acked-by: Santosh Shilimkar &lt;santosh.shilimkar@oracle.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>RDMA/rds: Remove FMR support for memory registration</title>
<updated>2020-06-02T23:32:53+00:00</updated>
<author>
<name>Max Gurtovoy</name>
<email>maxg@mellanox.com</email>
</author>
<published>2020-05-28T19:45:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=07549ee21ce5247143ffb069bf838025d86b908c'/>
<id>urn:sha1:07549ee21ce5247143ffb069bf838025d86b908c</id>
<content type='text'>
Use FRWR method for memory registration by default and remove the ancient
and unsafe FMR method.

Link: https://lore.kernel.org/r/3-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com
Signed-off-by: Max Gurtovoy &lt;maxg@mellanox.com&gt;
Signed-off-by: Jason Gunthorpe &lt;jgg@mellanox.com&gt;
</content>
</entry>
<entry>
<title>net/rds: Handle ODP mr registration/unregistration</title>
<updated>2020-01-18T09:48:19+00:00</updated>
<author>
<name>Hans Westgaard Ry</name>
<email>hans.westgaard.ry@oracle.com</email>
</author>
<published>2020-01-15T12:43:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2eafa1746f17872483d1033b0116ec71435ea19d'/>
<id>urn:sha1:2eafa1746f17872483d1033b0116ec71435ea19d</id>
<content type='text'>
On-Demand-Paging MRs are registered using ib_reg_user_mr and
unregistered with ib_dereg_mr.

Signed-off-by: Hans Westgaard Ry &lt;hans.westgaard.ry@oracle.com&gt;
Acked-by: Santosh Shilimkar &lt;santosh.shilimkar@oracle.com&gt;
Signed-off-by: Leon Romanovsky &lt;leonro@mellanox.com&gt;
</content>
</entry>
<entry>
<title>net/rds: Use DMA memory pool allocation for rds_header</title>
<updated>2019-10-03T19:11:08+00:00</updated>
<author>
<name>Ka-Cheong Poon</name>
<email>ka-cheong.poon@oracle.com</email>
</author>
<published>2019-10-03T04:11:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9b17f5884be4484e4d9090a9dccf17e763e0589b'/>
<id>urn:sha1:9b17f5884be4484e4d9090a9dccf17e763e0589b</id>
<content type='text'>
Currently, RDS calls ib_dma_alloc_coherent() to allocate a large piece
of contiguous DMA coherent memory to store struct rds_header for
sending/receiving packets.  The memory allocated is then partitioned
into struct rds_header.  This is not necessary and can be costly at
times when memory is fragmented.  Instead, RDS should use the DMA
memory pool interface to handle this.  The DMA addresses of the pre-
allocated headers are stored in an array.  At send/receive ring
initialization and refill time, this array is de-referenced to get
the DMA addresses.  This array is not accessed at send/receive packet
processing.

Suggested-by: Håkon Bugge &lt;haakon.bugge@oracle.com&gt;
Signed-off-by: Ka-Cheong Poon &lt;ka-cheong.poon@oracle.com&gt;
Acked-by: Santosh Shilimkar &lt;santosh.shilimkar@oracle.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: rds: add service level support in rds-info</title>
<updated>2019-08-24T23:55:25+00:00</updated>
<author>
<name>Zhu Yanjun</name>
<email>yanjun.zhu@oracle.com</email>
</author>
<published>2019-08-24T01:04:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e0e6d062822529dbe9be21939359b0d1e065bb0f'/>
<id>urn:sha1:e0e6d062822529dbe9be21939359b0d1e065bb0f</id>
<content type='text'>
From IB specification 7.6.5 SERVICE LEVEL, Service Level (SL)
is used to identify different flows within an IBA subnet.
It is carried in the local route header of the packet.

Before this commit, the output of "rds-info -I" is as
below:
"
RDS IB Connections:
 LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
192.2.95.3  192.2.95.1  2   0  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   0  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
"
After this commit, the output is as below:
"
RDS IB Connections:
 LocalAddr  RemoteAddr Tos SL  LocalDev               RemoteDev
192.2.95.3  192.2.95.1  2   2  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   1  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
"

The commit fe3475af3bdf ("net: rds: add per rds connection cache
statistics") adds cache_allocs in struct rds_info_rdma_connection
as below:
struct rds_info_rdma_connection {
...
        __u32           rdma_mr_max;
        __u32           rdma_mr_size;
        __u8            tos;
        __u32           cache_allocs;
 };
The peer struct in rds-tools of struct rds_info_rdma_connection is as
below:
struct rds_info_rdma_connection {
...
        uint32_t        rdma_mr_max;
        uint32_t        rdma_mr_size;
        uint8_t         tos;
        uint8_t         sl;
        uint32_t        cache_allocs;
};
The difference between userspace and kernel is the member variable sl,
which is missing from the kernel struct. The two definitions therefore
disagree, which is risky, so this commit is needed to remove the mismatch.
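
To see what the missing member can mean for the layout, the Python
sketch below computes the offset of cache_allocs for the four fields
listed above, with and without sl. Whether the real structs are packed
is not stated here, so both natural alignment and a packed layout are
shown; the fields elided by "..." are not modelled, and offsets are
relative to rdma_mr_max.

```python
import struct

# I is a 32-bit unsigned field, B an 8-bit unsigned field.
KERNEL_TAIL = "IIBI"   # rdma_mr_max, rdma_mr_size, tos, cache_allocs
USER_TAIL = "IIBBI"    # the same fields with sl placed after tos

NATURAL = "@"          # native alignment, as a C compiler would lay it out
PACKED = "="           # no padding between fields


def cache_allocs_offset(layout, fields):
    """Offset of the final 32-bit field, relative to rdma_mr_max."""
    return struct.calcsize(layout + fields) - struct.calcsize(layout + "I")


if __name__ == "__main__":
    for name, layout in (("natural", NATURAL), ("packed", PACKED)):
        kernel = cache_allocs_offset(layout, KERNEL_TAIL)
        user = cache_allocs_offset(layout, USER_TAIL)
        print(f"{name}: kernel={kernel} userspace={user}")
```

With natural alignment sl falls into padding after tos, so
cache_allocs sits at the same offset in both and userspace merely
reads an sl byte the kernel never set. In a packed layout the offset
of cache_allocs itself differs between the two definitions.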

Fixes: fe3475af3bdf ("net: rds: add per rds connection cache statistics")
CC: Joe Jin &lt;joe.jin@oracle.com&gt;
CC: JUNXIAO_BI &lt;junxiao.bi@oracle.com&gt;
Suggested-by: Gerd Rausch &lt;gerd.rausch@oracle.com&gt;
Signed-off-by: Zhu Yanjun &lt;yanjun.zhu@oracle.com&gt;
Acked-by: Santosh Shilimkar &lt;santosh.shilimkar@oracle.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net/rds: Keep track of and wait for FRWR segments in use upon shutdown</title>
<updated>2019-07-17T19:06:52+00:00</updated>
<author>
<name>Gerd Rausch</name>
<email>gerd.rausch@oracle.com</email>
</author>
<published>2019-07-16T22:29:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3a2886cca703fde5ee21baea9fedf8b1389c59d7'/>
<id>urn:sha1:3a2886cca703fde5ee21baea9fedf8b1389c59d7</id>
<content type='text'>
Since "rds_ib_free_frmr" and "rds_ib_free_frmr_list" simply put
the FRMR memory segments on the "drop_list" or "free_list",
and it is the job of "rds_ib_flush_mr_pool" to reap those entries
by ultimately issuing an "IB_WR_LOCAL_INV" work-request,
we need to trigger and then wait for all those memory segments
attached to a particular connection to be fully released before
we can move on to release the QP, CQ, etc.

So we make "rds_ib_conn_path_shutdown" wait for one more
atomic_t called "i_fastreg_inuse_count" that keeps track of how
many FRWR memory segments are out there marked "FRMR_IS_INUSE"
(and also wake_up rds_ib_ring_empty_wait, as they go away).
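
The pattern of counting segments in use and waiting for the count to
drain can be sketched in Python with a condition variable. This is an
analogy only: the class and method names are made up for the example,
and the kernel code uses an atomic_t and a wait queue rather than a
Python lock.

```python
import threading


class InUseCounter:
    """Counts resources in use and lets a shutdown path wait for zero."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def get(self):
        """Mark one more resource as in use."""
        with self._cond:
            self._count += 1

    def put(self):
        """Release one resource and wake waiters when none remain."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_idle(self, timeout=None):
        """Block until nothing is in use; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)


if __name__ == "__main__":
    counter = InUseCounter()
    counter.get()
    counter.get()

    def release_all():
        counter.put()
        counter.put()

    worker = threading.Thread(target=release_all)
    worker.start()
    print(counter.wait_idle(timeout=5.0))  # shutdown path waits here
    worker.join()
```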

Signed-off-by: Gerd Rausch &lt;gerd.rausch@oracle.com&gt;
Acked-by: Santosh Shilimkar &lt;santosh.shilimkar@oracle.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
