Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull NFS client updates from Trond Myklebust:
"Highlights include:
- massive cleanup of the NFS read/write code by Anna and Dros
- support multiple NFS read/write requests per page in order to deal
with non-page aligned pNFS striping. Also cleans up the r/wsize <
page size code nicely.
- stable fix for ensuring inode is declared uptodate only after all
the attributes have been checked.
- stable fix for a kernel Oops when remounting
- NFS over RDMA client fixes
- move the pNFS files layout driver into its own subdirectory"
* tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
NFS: populate ->net in mount data when remounting
pnfs: fix lockup caused by pnfs_generic_pg_test
NFSv4.1: Fix typo in dprintk
NFSv4.1: Comment is now wrong and redundant to code
NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state
xprtrdma: Disconnect on registration failure
xprtrdma: Remove BUG_ON() call sites
xprtrdma: Avoid deadlock when credit window is reset
SUNRPC: Move congestion window constants to header file
xprtrdma: Reset connection timeout after successful reconnect
xprtrdma: Use macros for reconnection timeout constants
xprtrdma: Allocate missing pagelist
xprtrdma: Remove Tavor MTU setting
xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting
xprtrdma: Reduce the number of hardway buffer allocations
xprtrdma: Limit work done by completion handler
xprtrmda: Reduce calls to ib_poll_cq() in completion handlers
xprtrmda: Reduce lock contention in completion handlers
xprtrdma: Split the completion queue
xprtrdma: Make rpcrdma_ep_destroy() return void
...
|
|
Pull nfsd updates from Bruce Fields:
"The largest piece is a long-overdue rewrite of the xdr code to remove
some annoying limitations: for example, there was no way to return
ACLs larger than 4K, and readdir results were returned only in 4k
chunks, limiting performance on large directories.
Also:
- part of Neil Brown's work to make NFS work reliably over the
loopback interface (so client and server can run on the same
machine without deadlocks). The rest of it is coming through
other trees.
- cleanup and bugfixes for some of the server RDMA code, from
Steve Wise.
- Various cleanup of NFSv4 state code in preparation for an
overhaul of the locking, from Jeff, Trond, and Benny.
- smaller bugfixes and cleanup from Christoph Hellwig and
Kinglong Mee.
Thanks to everyone!
This summer looks likely to be busier than usual for knfsd. Hopefully
we won't break it too badly; testing definitely welcomed"
* 'for-3.16' of git://linux-nfs.org/~bfields/linux: (100 commits)
nfsd4: fix FREE_STATEID lockowner leak
svcrdma: Fence LOCAL_INV work requests
svcrdma: refactor marshalling logic
nfsd: don't halt scanning the DRC LRU list when there's an RC_INPROG entry
nfs4: remove unused CHANGE_SECURITY_LABEL
nfsd4: kill READ64
nfsd4: kill READ32
nfsd4: simplify server xdr->next_page use
nfsd4: hash deleg stateid only on successful nfs4_set_delegation
nfsd4: rename recall_lock to state_lock
nfsd: remove unneeded zeroing of fields in nfsd4_proc_compound
nfsd: fix setting of NFS4_OO_CONFIRMED in nfsd4_open
nfsd4: use recall_lock for delegation hashing
nfsd: fix laundromat next-run-time calculation
nfsd: make nfsd4_encode_fattr static
SUNRPC/NFSD: Remove using of dprintk with KERN_WARNING
nfsd: remove unused function nfsd_read_file
nfsd: getattr for FATTR4_WORD0_FILES_AVAIL needs the statfs buffer
NFSD: Error out when getting more than one fsloc/secinfo/uuid
NFSD: Using type of uint32_t for ex_nflavors instead of int
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchy
Documentation/cgroups/unified-hierarchy.txt
is added.
Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"
* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:
====================
pull request: wireless-next 2014-06-06
Please accept this batch of fixes intended for the 3.16 stream.
For the bluetooth bits, Gustavo says:
"Here some more patches for 3.16. We know that Linus already opened the merge
window, but this is fix only pull request, and most of the patches here are
also tagged for stable."
Along with that, Andrea Merello provides a fix for the broken scanning
in the venerable at76c50x driver...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
unregister_netdevice_many() API is error prone and we had too
many bugs because of dangling LIST_HEAD on stacks.
See commit f87e6f47933e3e ("net: dont leave active on stack LIST_HEAD")
In fact, instead of making sure no caller leaves an active list_head,
just force a list_del() in the callee. No one seems to need to access
the list after unregister_netdevice_many()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.linuxfoundation.org/llvmlinux/kernel
Pull LLVM patches from Behan Webster:
"Next set of patches to support compiling the kernel with clang.
They've been soaking in linux-next since the last merge window.
More still in the works for the next merge window..."
* tag 'llvmlinux-for-v3.16' of git://git.linuxfoundation.org/llvmlinux/kernel:
arm, unwind, LLVMLinux: Enable clang to be used for unwinding the stack
ARM: LLVMLinux: Change "extern inline" to "static inline" in glue-cache.h
all: LLVMLinux: Change DWARF flag to support gcc and clang
net: netfilter: LLVMLinux: vlais-netfilter
crypto: LLVMLinux: aligned-attribute.patch
|
|
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...
|
|
Replaced non-standard C use of Variable Length Arrays In Structs (VLAIS) in
xt_repldata.h with a C99 compliant flexible array member and then calculated
offsets to the other struct members. These other members aren't referenced by
name in this code, however this patch maintains the same memory layout and
padding as was previously accomplished using VLAIS.
Had the original structure been ordered differently, with the entries VLA at
the end, then it could have been a flexible member, and this patch would have
been a lot simpler. However since the data stored in this structure is
ultimately exported to userspace, the order of this structure can't be changed.
This patch makes no attempt to change the existing behavior, merely the way in
which the current layout is accomplished using standard C99 constructs. As such
the code can now be compiled with either gcc or clang.
This version of the patch removes the trailing alignment that the VLAIS
structure would allocate in order to simplify the patch.
Author: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Signed-off-by: Vinícius Tinti <viniciustinti@gmail.com>
|
|
During key removal, the key object is freed, but not taken out of the
llsec key list properly. Fix that.
Signed-off-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fencing forces the invalidate to only happen after all prior send
work requests have been completed.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reported by : Devesh Sharma <Devesh.Sharma@Emulex.Com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
This patch refactors the NFSRDMA server marshalling logic to
remove the intermediary map structures. It also fixes an existing bug
where the NFSRDMA server was not minding the device fast register page
list length limitations.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
|
|
The rpc code makes available to the NFS server an array of pages to
encod into. The server represents its reply as an xdr buf, with the
head pointing into the first page in that array, the pages ** array
starting just after that, and the tail (if any) sharing any leftover
space in the page used by the head.
While encoding, we use xdr_stream->page_ptr to keep track of which page
we're currently using.
Currently we set xdr_stream->page_ptr to buf->pages, which makes the
head a weird exception to the rule that page_ptr always points to the
page we're currently encoding into. So, instead set it to buf->pages -
1 (the page actually containing the head), and remove the need for a
little unintuitive logic in xdr_get_next_encode_buffer() and
xdr_truncate_encode.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
|
|
ip_rt_put(rt) is always called in "error" branches above, but was missed in
skb_cow_head branch. As rt is not yet bound to skb here we have to release it by
hand.
Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add ceph_monc_wait_osdmap(), which will block until the osdmap with the
specified epoch is received or timeout occurs.
Export both of these as they are going to be needed by rbd.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
Add support for mon_get_version requests to libceph. This reuses much
of the ceph_mon_generic_request infrastructure, with one exception.
Older OSDs don't set mon_get_version reply hdr->tid even if the
original request had a non-zero tid, which makes it impossible to
lookup ceph_mon_generic_request contexts by tid in get_generic_reply()
for such replies. As a workaround, we allocate a reply message on the
reply path. This can probably interfere with revoke, but I don't see
a better way.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
Recognize poolop requests in debugfs monc dump, fix prink format
specifiers - tid is unsigned.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
To avoid the confusion of having two variables, shrink the function to
only use the parameter variable for looping.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
drivers/net/xen-netback/netback.c
net/core/filter.c
A filter bug fix overlapped some cleanups and a conversion
over to some new insn generation macros.
A xen-netback bug fix overlapped the addition of multi-queue
support.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
BPF classic->internal converter broke SKF_AD_PKTTYPE extension, since
pkt_type_offset() was failing to find skb->pkt_type field which is defined as:
__u8 pkt_type:3,
fclone:2,
ipvs_property:1,
peeked:1,
nf_trace:1;
Fix it by searching for 3 most significant bits and shift them by 5 at run-time
Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pablo Neira Ayuso says:
====================
Netfilter/nf_tables fixes for net-next
This patchset contains fixes for recent updates available in your
net-next, they are:
1) Fix double memory allocation for accounting objects that results
in a leak, this slipped through with the new quota extension,
patch from Mathieu Poirier.
2) Fix broken ordering when adding set element transactions.
3) Make sure that objects are released in reverse order in the abort
path, to avoid possible use-after-free when accessing dependencies.
4) Allow to delete several objects (as long as dependencies are
fulfilled) by using one batch. This includes changes in the use
counter semantics of the nf_tables objects.
5) Fix illegal sleeping allocation from rcu callback.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
br_manage_promisc() incorrectly expects br_auto_port() to return only 0
or 1, while it actually returns flags, i.e., a subset of BR_AUTO_MASK.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If an MPLS packet requires segmentation then use mpls_features
to determine if the software implementation should be used.
As no driver advertises MPLS GSO segmentation this will always be
the case.
I had not noticed that this was necessary before as software MPLS GSO
segmentation was already being used in my test environment. I believe that
the reason for that is the skbs in question always had fragments and the
driver I used does not advertise NETIF_F_FRAGLIST (which seems to be the
case for most drivers). Thus software segmentation was activated by
skb_gso_ok().
This introduces the overhead of an extra call to skb_network_protocol()
in the case where where CONFIG_NET_MPLS_GSO is set and
skb->ip_summed == CHECKSUM_NONE.
Thanks to Jesse Gross for prompting me to investigate this.
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
|
|
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is available since v3.15-rc5.
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
According to RFC1035 "[...] the total length of a domain name (i.e.,
label octets and label length octets) is restricted to 255 octets or
less."
Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Added VXLAN link configuration for sending UDP checksums, and allowing
TX and RX of UDP6 checksums.
Also, call common iptunnel_handle_offloads and added GSO support for
checksums.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Call gso_make_checksum. This should have the benefit of using a
checksum that may have been previously computed for the packet.
This also adds NETIF_F_GSO_GRE_CSUM to differentiate devices that
offload GRE GSO with and without the GRE checksum offloaed.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Added a new netif feature for GSO_UDP_TUNNEL_CSUM. This indicates
that a device is capable of computing the UDP checksum in the
encapsulating header of a UDP tunnel.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Call common gso_make_checksum when calculating checksum for a
TCP GSO segment.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When creating a GSO packet segment we may need to set more than
one checksum in the packet (for instance a TCP checksum and
UDP checksum for VXLAN encapsulation). To be efficient, we want
to do checksum calculation for any part of the packet at most once.
This patch adds csum_start offset to skb_gso_cb. This tracks the
starting offset for skb->csum which is initially set in skb_segment.
When a protocol needs to compute a transport checksum it calls
gso_make_checksum which computes the checksum value from the start
of transport header to csum_start and then adds in skb->csum to get
the full checksum. skb->csum and csum_start are then updated to reflect
the checksum of the resultant packet starting from the transport header.
This patch also adds a flag to skbuff, encap_hdr_csum, which is set
in *gso_segment fucntions to indicate that a tunnel protocol needs
checksum calculation
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Call common functions to set checksum for UDP tunnel.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Added udp_set_csum and udp6_set_csum functions to set UDP checksums
in packets. These are for simple UDP packets such as those that might
be created in UDP tunnels.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit efe4208 ("ipv6: make lookups simpler and faster") introduced a
regression in udp_v6_mcast_next(), resulting in multicast packets not
reaching the destination sockets under certain conditions.
The packet's IPv6 addresses are wrongly compared to the IPv6 addresses
from the function's socket argument, which indicates the starting point
for looping, instead of the loop variable. If the addresses from the
first socket do not match the packet's addresses, no socket in the list
will match.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 30f38d2fdd79f13fc929489f7e6e517b4a4bfe63.
fib_triestat is surrounded by a big lie: while it claims that it's a
seq_file (fib_triestat_seq_open, fib_triestat_seq_show), it isn't:
static const struct file_operations fib_triestat_fops = {
.owner = THIS_MODULE,
.open = fib_triestat_seq_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release_net,
};
Yes, fib_triestat is just a regular file.
A small detail (assuming CONFIG_NET_NS=y) is that while for seq_files
you could do seq_file_net() to get the net ptr, doing so for a regular
file would be wrong and would dereference an invalid pointer.
The fib_triestat lie claimed a victim, and trying to show the file would
be bad for the kernel. This patch just reverts the issue and fixes
fib_triestat, which still needs a rewrite to either be a seq_file or
stop claiming it is.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If rpcrdma_register_external() fails during request marshaling, the
current RPC request is killed. Instead, this RPC should be retried
after reconnecting the transport instance.
The most likely reason for registration failure with FRMR is a
failed post_send, which would be due to a remote transport
disconnect or memory exhaustion. These issues can be recovered
by a retry.
Problems encountered in the marshaling logic itself will not be
corrected by trying again, so these should still kill a request.
Now that we've added a clean exit for marshaling errors, take the
opportunity to defang some BUG_ON's.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
If an error occurs in the marshaling logic, fail the RPC request
being processed, but leave the client running.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Update the cwnd while processing the server's reply. Otherwise the
next task on the xprt_sending queue is still subject to the old
credit window. Currently, no task is awoken if the old congestion
window is still exceeded, even if the new window is larger, and a
deadlock results.
This is an issue during a transport reconnect. Servers don't
normally shrink the credit window, but the client does reset it to
1 when reconnecting so the server can safely grow it again.
As a minor optimization, remove the hack of grabbing the initial
cwnd size (which happens to be RPC_CWNDSCALE) and using that value
as the congestion scaling factor. The scaling value is invariant,
and we are better off without the multiplication operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
I would like to use one of the RPC client's congestion algorithm
constants in transport-specific code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
If the new connection is able to make forward progress, reset the
re-establish timeout. Otherwise it keeps growing even if disconnect
events are rare.
The same behavior as TCP is adopted: reconnect immediately if the
transport instance has been able to make some forward progress.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up: Ensure the same max and min constant values are used
everywhere when setting reconnect timeouts.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
GETACL relies on transport layer to alloc memory for reply buffer.
However xprtrdma assumes that the reply buffer (pagelist) has been
pre-allocated in upper layer. This problem was reported by IOL OFA lab
test on PPC.
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Edward Mossman <emossman@iol.unh.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Clean up. Remove HCA-specific clutter in xprtrdma, which is
supposed to be device-independent.
Hal Rosenstock <hal@dev.mellanox.co.il> observes:
> Note that there is OpenSM option (enable_quirks) to return 1K MTU
> in SA PathRecord responses for Tavor so that can be used for this.
> The default setting for enable_quirks is FALSE so that would need
> changing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Devesh Sharma <Devesh.Sharma@Emulex.Com> reports that after a
disconnect, his HCA is failing to create a fresh QP, leaving
ia_ri->ri_id->qp set to NULL. But xprtrdma still allows RPCs to
wake up and post LOCAL_INV as they exit, causing an oops.
rpcrdma_ep_connect() is allowing the wake-up by leaking the QP
creation error code (-EPERM in this case) to the RPC client's
generic layer. xprt_connect_status() does not recognize -EPERM, so
it kills pending RPC tasks immediately rather than retrying the
connect.
Re-arrange the QP creation logic so that when it fails on reconnect,
it leaves ->qp with the old QP rather than NULL. If pending RPC
tasks wake and exit, LOCAL_INV work requests will flush rather than
oops.
On initial connect, leaving ->qp == NULL is OK, since there are no
pending RPCs that might use ->qp. But be sure not to try to destroy
a NULL QP when rpcrdma_ep_connect() is retried.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
While marshaling an RPC/RDMA request, the inline_{rsize,wsize}
settings determine whether an inline request is used, or whether
read or write chunks lists are built. The current default value of
these settings is 1024. Any RPC request smaller than 1024 bytes is
sent to the NFS server completely inline.
rpcrdma_buffer_create() allocates and pre-registers a set of RPC
buffers for each transport instance, also based on the inline rsize
and wsize settings.
RPC/RDMA requests and replies are built in these buffers. However,
if an RPC/RDMA request is expected to be larger than 1024, a buffer
has to be allocated and registered for that RPC, and deregistered
and released when the RPC is complete. This is known has a
"hardway allocation."
Since the introduction of NFSv4, the size of RPC requests has become
larger, and hardway allocations are thus more frequent. Hardway
allocations are significant overhead, and they waste the existing
RPC buffers pre-allocated by rpcrdma_buffer_create().
We'd like fewer hardway allocations.
Increasing the size of the pre-registered buffers is the most direct
way to do this. However, a blanket increase of the inline thresholds
has interoperability consequences.
On my 64-bit system, rpcrdma_buffer_create() requests roughly 7000
bytes for each RPC request buffer, using kmalloc(). Due to internal
fragmentation, this wastes nearly 1200 bytes because kmalloc()
already returns an 8192-byte piece of memory for a 7000-byte
allocation request, though the extra space remains unused.
So let's round up the size of the pre-allocated buffers, and make
use of the unused space in the kmalloc'd memory.
This change reduces the amount of hardway allocated memory for an
NFSv4 general connectathon run from 1322092 to 9472 bytes (99%).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Sagi Grimberg <sagig@dev.mellanox.co.il> points out that a steady
stream of CQ events could starve other work because of the boundless
loop pooling in rpcrdma_{send,recv}_poll().
Instead of a (potentially infinite) while loop, return after
collecting a budgeted number of completions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Change the completion handlers to grab up to 16 items per
ib_poll_cq() call. No extra ib_poll_cq() is needed if fewer than 16
items are returned.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Skip the ib_poll_cq() after re-arming, if the provider knows there
are no additional items waiting. (Have a look at commit ed23a727 for
more details).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
The current CQ handler uses the ib_wc.opcode field to distinguish
between event types. However, the contents of that field are not
reliable if the completion status is not IB_WC_SUCCESS.
When an error completion occurs on a send event, the CQ handler
schedules a tasklet with something that is not a struct rpcrdma_rep.
This is never correct behavior, and sometimes it results in a panic.
To resolve this issue, split the completion queue into a send CQ and
a receive CQ. The send CQ handler now handles only struct rpcrdma_mw
wr_id's, and the receive CQ handler now handles only struct
rpcrdma_rep wr_id's.
Fix suggested by Shirley Ma <shirley.ma@oracle.com>
Reported-by: Rafael Reiter <rafael.reiter@ims.co.at>
Fixes: 5c635e09cec0feeeb310968e51dad01040244851
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Klemens Senn <klemens.senn@ims.co.at>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|