Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull nfsd updates from Chuck Lever:
"Bruce has announced he is leaving Red Hat at the end of the month and
is stepping back from his role as NFSD co-maintainer. As a result,
this includes a patch removing him from the MAINTAINERS file.
There is one patch in here that Jeff Layton was carrying in the locks
tree. Since he had only one for this cycle, he asked us to send it to
you via the nfsd tree.
There continues to be 0-day reports from Robert Morris @MIT. This time
we include a fix for a crash in the COPY_NOTIFY operation.
Highlights:
- Bruce steps down as NFSD maintainer
- Prepare for dynamic nfsd thread management
- More work on supporting re-exporting NFS mounts
- One fs/locks patch on behalf of Jeff Layton
Notable bug fixes:
- Fix zero-length NFSv3 WRITEs
- Fix directory cinfo on FS's that do not support iversion
- Fix WRITE verifiers for stable writes
- Fix crash on COPY_NOTIFY with a special state ID"
* tag 'nfsd-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (51 commits)
SUNRPC: Fix sockaddr handling in svcsock_accept_class trace points
SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
fs/locks: fix fcntl_getlk64/fcntl_setlk64 stub prototypes
nfsd: fix crash on COPY_NOTIFY with special stateid
MAINTAINERS: remove bfields
NFSD: Move fill_pre_wcc() and fill_post_wcc()
Revert "nfsd: skip some unnecessary stats in the v4 case"
NFSD: Trace boot verifier resets
NFSD: Rename boot verifier functions
NFSD: Clean up the nfsd_net::nfssvc_boot field
NFSD: Write verifier might go backwards
nfsd: Add a tracepoint for errors in nfsd4_clone_file_range()
NFSD: De-duplicate net_generic(nf->nf_net, nfsd_net_id)
NFSD: De-duplicate net_generic(SVC_NET(rqstp), nfsd_net_id)
NFSD: Clean up nfsd_vfs_write()
nfsd: Replace use of rwsem with errseq_t
NFSD: Fix verifier returned in stable WRITEs
nfsd: Retry once in nfsd_open on an -EOPENSTALE return
nfsd: Add errno mapping for EREMOTEIO
nfsd: map EBADF
...
|
|
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
While testing, I got an unexpected KASAN splat:
Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: BUG: KASAN: stack-out-of-bounds in trace_event_raw_event_svc_xprt_create_err+0x190/0x210 [sunrpc]
Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: Read of size 28 at addr ffffc9000008f728 by task mount.nfs/4628
The memcpy() in the TP_fast_assign section of this trace point
copies the size of the destination buffer in order that the buffer
won't be overrun.
In other similar trace points, the source buffer for this memcpy is
a "struct sockaddr_storage" so the actual length of the source
buffer is always long enough to prevent the memcpy from reading
uninitialized or unallocated memory.
However, for this trace point, the source buffer can be as small as
a "struct sockaddr_in". For AF_INET sockaddrs, the memcpy() reads
memory that follows the source buffer, which is not always valid
memory.
To avoid copying past the end of the passed-in sockaddr, make the
source address's length available to the memcpy(). It would be a
little nicer if the tracing infrastructure was more friendly about
storing socket addresses that are not AF_INET, but I could not find
a way to make printk("%pIS") work with a dynamic array.
Reported-by: KASAN
Fixes: 4b8f380e46e4 ("SUNRPC: Tracepoint to record errors in svc_xpo_create()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
I'm about to add more information to the server-side SUNRPC
tracepoints, so I'm going to offset the increased trace log
consumption by getting rid of some tracepoints that fire frequently
but don't offer much value.
trace_svc_xprt_received() was useful for debugging, perhaps, but
is not generally informative.
trace_svc_handle_xprt() reports largely the same information as
trace_svc_xdr_recvfrom().
As a clean-up, rename trace_svc_xprt_do_enqueue() to match
svc_xprt_dequeue().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This is an operational low memory situation that needs to be
flagged. The new tracepoint records a timestamp and the nfsd thread
that failed to allocate pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Restore performance on memory-starved servers
* tag 'nfsd-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: improve error response to over-size gss credential
SUNRPC: don't pause on incomplete allocation
|
|
alloc_pages_bulk_array() attempts to allocate at least one page based on
the provided pages, and then opportunistically allocates more if that
can be done without dropping the spinlock.
So if it returns fewer than requested, that could just mean that it
needed to drop the lock. In that case, try again immediately.
Only pause for a time if no progress could be made.
Reported-and-tested-by: Mike Javorski <mike.javorski@gmail.com>
Reported-and-tested-by: Lothar Paltins <lopa@mailbox.org>
Fixes: f6e70aab9dfe ("SUNRPC: refresh rq_pages using a bulk page allocator")
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Mel Gorman <mgorman@suse.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Pull nfsd updates from Chuck Lever:
"New features:
- Support for server-side disconnect injection via debugfs
- Protocol definitions for new RPC_AUTH_TLS authentication flavor
Performance improvements:
- Reduce page allocator traffic in the NFSD splice read actor
- Reduce CPU utilization in svcrdma's Send completion handler
Notable bug fixes:
- Stabilize lockd operation when re-exporting NFS mounts
- Fix the use of %.*s in NFSD tracepoints
- Fix /proc/sys/fs/nfs/nsm_use_hostnames"
* tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
nfsd: fix crash on LOCKT on reexported NFSv3
nfs: don't allow reexport reclaims
lockd: don't attempt blocking locks on nfs reexports
nfs: don't atempt blocking locks on nfs reexports
Keep read and write fds with each nlm_file
lockd: update nlm_lookup_file reexport comment
nlm: minor refactoring
nlm: minor nlm_lookup_file argument change
lockd: lockd server-side shouldn't set fl_ops
SUNRPC: Add documentation for the fail_sunrpc/ directory
SUNRPC: Server-side disconnect injection
SUNRPC: Move client-side disconnect injection
SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory
svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free()
nfsd4: Fix forced-expiry locking
rpc: fix gss_svc_init cleanup on failure
SUNRPC: Add RPC_AUTH_TLS protocol numbers
lockd: change the proc_handler for nsm_use_hostnames
sysctl: introduce new proc handler proc_dobool
SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency()
...
|
|
If the attempt to reserve a slot fails, we currently leak the XPT_BUSY
flag on the socket. Among other things, this make it impossible to close
the socket.
Fixes: 82011c80b3ec ("SUNRPC: Move svc_xprt_received() call sites")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Replacing a page in rq_pages[] requires a get_page(), which is a
bus-locked operation, and a put_page(), which can be even more
costly.
To reduce the cost of replacing a page in rq_pages[], batch the
put_page() operations by collecting "freed" pages in a pagevec,
and then release those pages when the pagevec is full. This
pagevec is also emptied when each RPC completes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Reduce the rate at which nfsd threads hammer on the page allocator. This
improves throughput scalability by enabling the threads to run more
independently of each other.
[mgorman: Update interpretation of alloc_pages_bulk return value]
Link: https://lkml.kernel.org/r/20210325114228.27719-8-mgorman@techsingularity.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "SUNRPC consumer for the bulk page allocator"
This patch set and the measurements below are based on yesterday's
bulk allocator series:
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
The patches change SUNRPC to invoke the array-based bulk allocator
instead of alloc_page().
The micro-benchmark results are promising. I ran a mixture of 256KB
reads and writes over NFSv3. The server's kernel is built with KASAN
enabled, so the comparison is exaggerated but I believe it is still
valid.
I instrumented svc_recv() to measure the latency of each call to
svc_alloc_arg() and report it via a trace point. The following results
are averages across the trace events.
Single page: 25.007 us per call over 532,571 calls
Bulk list: 6.258 us per call over 517,034 calls
Bulk array: 4.590 us per call over 517,442 calls
This patch (of 2)
Refactor:
I'm about to use the loop variable @i for something else.
As far as the "i++" is concerned, that is a post-increment. The
value of @i is not used subsequently, so the increment operator
is unnecessary and can be removed.
Also note that nfsd_read_actor() was renamed nfsd_splice_actor()
by commit cf8208d0eabd ("sendfile: convert nfsd to
splice_direct_to_actor()").
Link: https://lkml.kernel.org/r/20210325114228.27719-7-mgorman@techsingularity.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, XPT_BUSY is not cleared until xpo_recvfrom returns.
That effectively blocks the receipt and handling of the next RPC
message until the current one has been taken off the transport.
This strict ordering is a requirement for socket transports.
For our kernel RPC/RDMA transport implementation, however, dequeuing
an ingress message is nothing more than a list_del(). The transport
can safely be marked un-busy as soon as that is done.
To keep the changes simpler, this patch just moves the
svc_xprt_received() call site from svc_handle_xprt() into the
transports, so that the actual optimization can be done in a
subsequent patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Prepare svc_xprt_received() to be called from transport code instead
of from generic RPC server code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Refactor a bit of commonly used logic so that every site that wants
a close deferred to an nfsd thread does all the right things
(set_bit(XPT_CLOSE) then enqueue).
Also, once XPT_CLOSE is set on a transport, it is never cleared. If
XPT_CLOSE is already set, then the close is already being handled
and the enqueue can be skipped.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
[ This problem is in mainline, but only rt has the chops to be
able to detect it. ]
Lockdep reports a circular lock dependency between serv->sv_lock and
softirq_ctl.lock on system shutdown, when using a kernel built with
CONFIG_PREEMPT_RT=y, and a nfs mount exists.
This is due to the definition of spin_lock_bh on rt:
local_bh_disable();
rt_spin_lock(lock);
which forces a softirq_ctl.lock -> serv->sv_lock dependency. This is
not a problem as long as _every_ lock of serv->sv_lock is a:
spin_lock_bh(&serv->sv_lock);
but there is one of the form:
spin_lock(&serv->sv_lock);
This is what is causing the circular dependency splat. The spin_lock()
grabs the lock without first grabbing softirq_ctl.lock via local_bh_disable.
If later on in the critical region, someone does a local_bh_disable, we
get a serv->sv_lock -> softirq_ctrl.lock dependency established. Deadlock.
Fix is to make serv->sv_lock be locked with spin_lock_bh everywhere, no
exceptions.
[ OK ] Stopped target NFS client services.
Stopping Logout off all iSCSI sessions on shutdown...
Stopping NFS server and services...
[ 109.442380]
[ 109.442385] ======================================================
[ 109.442386] WARNING: possible circular locking dependency detected
[ 109.442387] 5.10.16-rt30 #1 Not tainted
[ 109.442389] ------------------------------------------------------
[ 109.442390] nfsd/1032 is trying to acquire lock:
[ 109.442392] ffff994237617f60 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip+0xd9/0x270
[ 109.442405]
[ 109.442405] but task is already holding lock:
[ 109.442406] ffff994245cb00b0 (&serv->sv_lock){+.+.}-{0:0}, at: svc_close_list+0x1f/0x90
[ 109.442415]
[ 109.442415] which lock already depends on the new lock.
[ 109.442415]
[ 109.442416]
[ 109.442416] the existing dependency chain (in reverse order) is:
[ 109.442417]
[ 109.442417] -> #1 (&serv->sv_lock){+.+.}-{0:0}:
[ 109.442421] rt_spin_lock+0x2b/0xc0
[ 109.442428] svc_add_new_perm_xprt+0x42/0xa0
[ 109.442430] svc_addsock+0x135/0x220
[ 109.442434] write_ports+0x4b3/0x620
[ 109.442438] nfsctl_transaction_write+0x45/0x80
[ 109.442440] vfs_write+0xff/0x420
[ 109.442444] ksys_write+0x4f/0xc0
[ 109.442446] do_syscall_64+0x33/0x40
[ 109.442450] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 109.442454]
[ 109.442454] -> #0 ((softirq_ctrl.lock).lock){+.+.}-{2:2}:
[ 109.442457] __lock_acquire+0x1264/0x20b0
[ 109.442463] lock_acquire+0xc2/0x400
[ 109.442466] rt_spin_lock+0x2b/0xc0
[ 109.442469] __local_bh_disable_ip+0xd9/0x270
[ 109.442471] svc_xprt_do_enqueue+0xc0/0x4d0
[ 109.442474] svc_close_list+0x60/0x90
[ 109.442476] svc_close_net+0x49/0x1a0
[ 109.442478] svc_shutdown_net+0x12/0x40
[ 109.442480] nfsd_destroy+0xc5/0x180
[ 109.442482] nfsd+0x1bc/0x270
[ 109.442483] kthread+0x194/0x1b0
[ 109.442487] ret_from_fork+0x22/0x30
[ 109.442492]
[ 109.442492] other info that might help us debug this:
[ 109.442492]
[ 109.442493] Possible unsafe locking scenario:
[ 109.442493]
[ 109.442493] CPU0 CPU1
[ 109.442494] ---- ----
[ 109.442495] lock(&serv->sv_lock);
[ 109.442496] lock((softirq_ctrl.lock).lock);
[ 109.442498] lock(&serv->sv_lock);
[ 109.442499] lock((softirq_ctrl.lock).lock);
[ 109.442501]
[ 109.442501] *** DEADLOCK ***
[ 109.442501]
[ 109.442501] 3 locks held by nfsd/1032:
[ 109.442503] #0: ffffffff93b49258 (nfsd_mutex){+.+.}-{3:3}, at: nfsd+0x19a/0x270
[ 109.442508] #1: ffff994245cb00b0 (&serv->sv_lock){+.+.}-{0:0}, at: svc_close_list+0x1f/0x90
[ 109.442512] #2: ffffffff93a81b20 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xc0
[ 109.442518]
[ 109.442518] stack backtrace:
[ 109.442519] CPU: 0 PID: 1032 Comm: nfsd Not tainted 5.10.16-rt30 #1
[ 109.442522] Hardware name: Supermicro X9DRL-3F/iF/X9DRL-3F/iF, BIOS 3.2 09/22/2015
[ 109.442524] Call Trace:
[ 109.442527] dump_stack+0x77/0x97
[ 109.442533] check_noncircular+0xdc/0xf0
[ 109.442546] __lock_acquire+0x1264/0x20b0
[ 109.442553] lock_acquire+0xc2/0x400
[ 109.442564] rt_spin_lock+0x2b/0xc0
[ 109.442570] __local_bh_disable_ip+0xd9/0x270
[ 109.442573] svc_xprt_do_enqueue+0xc0/0x4d0
[ 109.442577] svc_close_list+0x60/0x90
[ 109.442581] svc_close_net+0x49/0x1a0
[ 109.442585] svc_shutdown_net+0x12/0x40
[ 109.442588] nfsd_destroy+0xc5/0x180
[ 109.442590] nfsd+0x1bc/0x270
[ 109.442595] kthread+0x194/0x1b0
[ 109.442600] ret_from_fork+0x22/0x30
[ 109.518225] nfsd: last server has exited, flushing export cache
[ OK ] Stopped NFSv4 ID-name mapping service.
[ OK ] Stopped GSSAPI Proxy Daemon.
[ OK ] Stopped NFS Mount Daemon.
[ OK ] Stopped NFS status monitor for NFSv2/3 locking..
Fixes: 719f8bcc883e ("svcrpc: fix xpt_list traversal locking on shutdown")
Signed-off-by: Joe Korty <joe.korty@concurrent-rt.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Commit 156708adf2d9 ("SUNRPC: Move the svc_xdr_recvfrom()
tracepoint") tried to capture the correct XID in the trace record,
but this line in svc_recv:
rqstp->rq_xid = svc_getu32(&rqstp->rq_arg.head[0]);
alters the size of rq_arg.head[0].iov_len. The tracepoint records
the correct XID but an incorrect value for the length of the
xdr_buf's head.
To keep the trace callsites simple, I've created two trace classes.
One assumes the xdr_buf contains a full RPC message, and the XID
can be extracted from it. The other assumes the contents of the
xdr_buf are arbitrary, and the xid will be provided by the caller.
Currently there is only one user of each class, but I expect we will
need a few more tracepoints using each class as time goes on.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Commit c509f15a5801 ("SUNRPC: Split the xdr_buf event class") added
display of the rqst's XID to the svc_xdr_buf_class. However, when
the recvfrom tracepoint fires, rq_xid has yet to be filled in with
the current XID. So it ends up recording the previous XID that was
handled by that svc_rqst.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Pull NFS client updates from Anna Schumaker:
"New features and improvements:
- Sunrpc receive buffer sizes only change when establishing a GSS credentials
- Add more sunrpc tracepoints
- Improve on tracepoints to capture internal NFS I/O errors
Other bugfixes and cleanups:
- Move a dprintk() to after a call to nfs_alloc_fattr()
- Fix off-by-one issues in rpc_ntop6
- Fix a few coccicheck warnings
- Use the correct SPDX license identifiers
- Fix rpc_call_done assignment for BIND_CONN_TO_SESSION
- Replace zero-length array with flexible array
- Remove duplicate headers
- Set invalid blocks after NFSv4 writes to update space_used attribute
- Fix direct WRITE throughput regression"
* tag 'nfs-for-5.8-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFS: Fix direct WRITE throughput regression
SUNRPC: rpc_xprt lifetime events should record xprt->state
xprtrdma: Make xprt_rdma_slot_table_entries static
nfs: set invalid blocks after NFSv4 writes
NFS: remove redundant initialization of variable result
sunrpc: add missing newline when printing parameter 'auth_hashtable_size' by sysfs
NFS: Add a tracepoint in nfs_set_pgio_error()
NFS: Trace short NFS READs
NFS: nfs_xdr_status should record the procedure name
SUNRPC: Set SOFTCONN when destroying GSS contexts
SUNRPC: rpc_call_null_helper() should set RPC_TASK_SOFT
SUNRPC: rpc_call_null_helper() already sets RPC_TASK_NULLCREDS
SUNRPC: trace RPC client lifetime events
SUNRPC: Trace transport lifetime events
SUNRPC: Split the xdr_buf event class
SUNRPC: Add tracepoint to rpc_call_rpcerror()
SUNRPC: Update the RPC_SHOW_SOCKET() macro
SUNRPC: Update the rpc_show_task_flags() macro
SUNRPC: Trace GSS context lifetimes
SUNRPC: receive buffer size estimation values almost never change
...
|
|
To help tie the recorded xdr_buf to a particular RPC transaction,
the client side version of this class should display task ID
information and the server side one should show the request's XID.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
- Rename these so they are easy to enable and search for as a set
- Move the tracepoints to get a more accurate sense of control flow
- Tracepoints should not fire on xprt shutdown
- Display memory address in case data structure had been corrupted
- Abandon dprintk in these paths
I haven't ever gotten one of these tracepoints to trigger. I wonder
if we should simply remove them.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
In lieu of dprintks or tracepoints in each individual transport
implementation, introduce tracepoints in the generic part of the RPC
layer. These typically fire for connection lifetime events, so
shouldn't contribute a lot of noise.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Capture transport creation failures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
It appears that the RPC/RDMA transport does not need serialization
of calls to its xpo_sendto method. Move the mutex into the socket
methods that still need that serialization.
Tail latencies are unambiguously better with this patch applied.
fio randrw 8KB 70/30 on NFSv3, smaller numbers are better:
clat percentiles (usec):
With xpt_mutex:
r | 99.99th=[ 8848]
w | 99.99th=[ 9634]
Without xpt_mutex:
r | 99.99th=[ 8586]
w | 99.99th=[ 8979]
Serializing the construction of RPC/RDMA transport headers is not
really necessary at this point, because the Linux NFS server
implementation never changes its credit grant on a connection. If
that should change, then svc_rdma_sendto will need to serialize
access to the transport's credit grant fields.
Reported-by: kbuild test robot <lkp@intel.com>
[ cel: fix uninitialized variable warning ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Utilize the xpo_release_rqst transport method to ensure that each
rqstp's svc_rdma_recv_ctxt object is released even when the server
cannot return a Reply for that rqstp.
Without this fix, each RPC whose Reply cannot be sent leaks one
svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped
Receive buffer, and any pages that might be part of the Reply
message.
The leak is infrequent unless the network fabric is unreliable or
Kerberos is in use, as GSS sequence window overruns, which result
in connection loss, are more common on fast transports.
Fixes: 3a88092ee319 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Currently, after the forward channel connection goes away,
backchannel operations are causing soft lockups on the server
because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
Such backchannel Calls are aggressively retried until the client
reconnects.
Backchannel Calls should use RPC_TASK_NOCONNECT rather than
RPC_TASK_SOFTCONN. If there is no forward connection, the server is
not capable of establishing a connection back to the client, thus
that backchannel request should fail before the server attempts to
send it. Commit 58255a4e3ce5 ("NFSD: NFSv4 callback client should
use RPC_TASK_SOFTCONN") was merged several years before
RPC_TASK_NOCONNECT was available.
Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
callback connection depends on the first callback RPC to initiate
a connection to the client. Thus NFSv4.0 needs to continue to use
RPC_TASK_SOFTCONN.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> # v4.20+
|
|
'maxlen' is the total size of the destination buffer. There is only one
caller and this value is 256.
When we compute the size already used and what we would like to add in
the buffer, the trailling NULL character is not taken into account.
However, this trailling character will be added by the 'strcat' once we
have checked that we have enough place.
So, there is a off-by-one issue and 1 byte of the stack could be
erroneously overwridden.
Take into account the trailling NULL, when checking if there is enough
place in the destination buffer.
While at it, also replace a 'sprintf' by a safer 'snprintf', check for
output truncation and avoid a superfluous 'strlen'.
Fixes: dc9a16e49dbba ("svc: Add /proc/sys/sunrpc/transport files")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
[ cel: very minor fix to documenting comment
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This class can be used to create trace points in either the RPC
client or RPC server paths. It simply displays the length of each
part of an xdr_buf, which is useful to determine that the transport
and XDR codecs are operating correctly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Reported-by: David Wysochanski <dwysocha@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Temporary sockets should inherit the credential (and hence the user
namespace) from the parent listener transport.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
In order to be able to interpret uids and gids correctly in knfsd, we
should cache the user namespace of the process that created the RPC
server's listener. To do so, we refcount the credential of that process.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
In the rpc server, When something happens that might be reason to wake
up a thread to do something, what we do is
- modify xpt_flags, sk_sock->flags, xpt_reserved, or
xpt_nr_rqsts to indicate the new situation
- call svc_xprt_enqueue() to decide whether to wake up a thread.
svc_xprt_enqueue may require multiple conditions to be true before
queueing up a thread to handle the xprt. In the SMP case, one of the
other CPU's may have set another required condition, and in that case,
although both CPUs run svc_xprt_enqueue(), it's possible that neither
call sees the writes done by the other CPU in time, and neither one
recognizes that all the required conditions have been set. A socket
could therefore be ignored indefinitely.
Add memory barries to ensure that any svc_xprt_enqueue() call will
always see the conditions changed by other CPUs before deciding to
ignore a socket.
I've never seen this race reported. In the unlikely event it happens,
another event will usually come along and the problem will fix itself.
So I don't think this is worth backporting to stable.
Chuck tried this patch and said "I don't see any performance
regressions, but my server has only a single last-level CPU cache."
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
The long name seemed cute till I wanted to refer to it somewhere else.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Use READ_ONCE() to tell the compiler to not optimse away the read of
xprt->xpt_flags in svc_xprt_release_slot().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
_svc_create_xprt() returns positive port number
so its non-zero return value is not an error
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
if node have NFSv41+ mounts inside several net namespaces
it can lead to use-after-free in svc_process_common()
svc_process_common()
/* Setup reply header */
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
svc_process_common() can use incorrect rqstp->rq_xprt,
its caller function bc_svc_process() takes it from serv->sv_bc_xprt.
The problem is that serv is global structure but sv_bc_xprt
is assigned per-netnamespace.
According to Trond, the whole "let's set up rqstp->rq_xprt
for the back channel" is nothing but a giant hack in order
to work around the fact that svc_process_common() uses it
to find the xpt_ops, and perform a couple of (meaningless
for the back channel) tests of xpt_flags.
All we really need in svc_process_common() is to be able to run
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr()
Bruce J Fields points that this xpo_prep_reply_hdr() call
is an awfully roundabout way just to do "svc_putnl(resv, 0);"
in the tcp case.
This patch does not initialiuze rqstp->rq_xprt in bc_svc_process(),
now it calls svc_process_common() with rqstp->rq_xprt = NULL.
To adjust reply header svc_process_common() just check
rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for tcp case.
To handle rqstp->rq_xprt = NULL case in functions called from
svc_process_common() patch intruduces net namespace pointer
svc_rqst->rq_bc_net and adjust SVC_NET() definition.
Some other function was also adopted to properly handle described case.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: stable@vger.kernel.org
Fixes: 23c20ecd4475 ("NFS: callback up - users counting cleanup")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Pull nfsd updates from Bruce Fields:
"Olga added support for the NFSv4.2 asynchronous copy protocol. We
already supported COPY, by copying a limited amount of data and then
returning a short result, letting the client resend. The asynchronous
protocol should offer better performance at the expense of some
complexity.
The other highlight is Trond's work to convert the duplicate reply
cache to a red-black tree, and to move it and some other server caches
to RCU. (Previously these have meant taking global spinlocks on every
RPC)
Otherwise, some RDMA work and miscellaneous bugfixes"
* tag 'nfsd-4.20' of git://linux-nfs.org/~bfields/linux: (30 commits)
lockd: fix access beyond unterminated strings in prints
nfsd: Fix an Oops in free_session()
nfsd: correctly decrement odstate refcount in error path
svcrdma: Increase the default connection credit limit
svcrdma: Remove try_module_get from backchannel
svcrdma: Remove ->release_rqst call in bc reply handler
svcrdma: Reduce max_send_sges
nfsd: fix fall-through annotations
knfsd: Improve lookup performance in the duplicate reply cache using an rbtree
knfsd: Further simplify the cache lookup
knfsd: Simplify NFS duplicate replay cache
knfsd: Remove dead code from nfsd_cache_lookup
SUNRPC: Simplify TCP receive code
SUNRPC: Replace the cache_detail->hash_lock with a regular spinlock
SUNRPC: Remove non-RCU protected lookup
NFS: Fix up a typo in nfs_dns_ent_put
NFS: Lockless DNS lookups
knfsd: Lockless lookup of NFSv4 identities.
SUNRPC: Lockless server RPCSEC_GSS context lookup
knfsd: Allow lockless lookups of the exports
...
|
|
In call_xpt_users(), we delete the entry from the list, but we
do not reinitialise it. This triggers the list poisoning when
we later call unregister_xpt_user() in nfsd4_del_conns().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Treat socket write space handling in the same way we now treat transport
congestion: by denying the XPRT_LOCK until the transport signals that it
has free buffer space.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Record the time between when a rqstp is enqueued on a transport
and when it is dequeued. This includes how long the rqstp waits on
the queue and how long it takes the kernel scheduler to wake a
nfsd thread to service it.
The svc_xprt_dequeue trace point is altered to include the number
of microseconds between xprt_enqueue and xprt_dequeue.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Introduce a mechanism to report the server-side execution latency of
each RPC. The goal is to enable user space to filter the trace
record for latency outliers, build histograms, etc.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
TP_printk defines a format string that is passed to user space for
converting raw trace event records to something human-readable.
My user space's printf (Oracle Linux 7), however, does not have a
%pI format specifier. The result is that what is supposed to be an
IP address in the output of "trace-cmd report" is just a string that
says the field couldn't be displayed.
To fix this, adopt the same approach as the client: maintain a pre-
formated presentation address for occasions when %pI is not
available.
The location of the trace_svc_send trace point is adjusted so that
rqst->rq_xprt is not NULL when the trace event is recorded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
There doesn't seem to be a lot of value in calling trace_svc_recv
in the failing case.
1. There are two very common cases: one is the transport is not
ready, and the other is shutdown. Neither is terribly interesting.
2. The trace record for the failing case contains nothing but
the status code.
Therefore the trace point call site in the error exit is removed.
Since the trace point is now recording a length instead of a
status, rename the status field and remove the case that records a
zero XID.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
There are three cases where svc_xprt_do_enqueue() returns without
waking an nfsd thread:
1. There is no work to do
2. The transport is already busy
3. There are no available nfsd threads
Only 3. is truly interesting. Move the trace point so it records
that there was work to do and either an nfsd thread was awoken, or
a free one could not found.
As an additional clean up, remove a redundant comment and a couple
of dprintk call sites.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Reduce the amount of noise generated by trace_svc_xprt_dequeue by
moving it to the end of svc_get_next_xprt. This generates exactly
one trace event when a ready xprt is found, rather than spurious
events when there is no work to do. The empty events contain no
information that can't be obtained simply by tracing function calls
to svc_xprt_dequeue.
A small additional benefit is simplification of the svc_xprt_event
trace class, which no longer has to handle the case when the @xprt
parameter is NULL.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Clean up: Instead of returning a value that is used to set or clear
a bit, just make ->xpo_secure_port mangle that bit, and return void.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Clean up: Noticed during code inspection that there is already a
local automatic variable "xprt" so dereferencing rqst->rq_xprt
again is unnecessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
With all callbacks converted, and the timer callback prototype
switched over, the TIMER_FUNC_TYPE cast is no longer needed,
so remove it. Conversion was done with the following scripts:
perl -pi -e 's|\(TIMER_FUNC_TYPE\)||g' \
$(git grep TIMER_FUNC_TYPE | cut -d: -f1 | sort -u)
perl -pi -e 's|\(TIMER_DATA_TYPE\)||g' \
$(git grep TIMER_DATA_TYPE | cut -d: -f1 | sort -u)
The now unused macros are also dropped from include/linux/timer.h.
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Pull nfsd updates from Bruce Fields:
"Lots of good bugfixes, including:
- fix a number of races in the NFSv4+ state code
- fix some shutdown crashes in multiple-network-namespace cases
- relax our 4.1 session limits; if you've an artificially low limit
to the number of 4.1 clients that can mount simultaneously, try
upgrading"
* tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux: (22 commits)
SUNRPC: Improve ordering of transport processing
nfsd: deal with revoked delegations appropriately
svcrdma: Enqueue after setting XPT_CLOSE in completion handlers
nfsd: use nfs->ns.inum as net ID
rpc: remove some BUG()s
svcrdma: Preserve CB send buffer across retransmits
nfds: avoid gettimeofday for nfssvc_boot time
fs, nfsd: convert nfs4_file.fi_ref from atomic_t to refcount_t
fs, nfsd: convert nfs4_cntl_odstate.co_odcount from atomic_t to refcount_t
fs, nfsd: convert nfs4_stid.sc_count from atomic_t to refcount_t
lockd: double unregister of inetaddr notifiers
nfsd4: catch some false session retries
nfsd4: fix cached replies to solo SEQUENCE compounds
sunrcp: make function _svc_create_xprt static
SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status
nfsd: use ARRAY_SIZE
nfsd: give out fewer session slots as limit approaches
nfsd: increase DRC cache limit
nfsd: remove unnecessary nofilehandle checks
nfs_common: convert int to bool
...
|