summaryrefslogtreecommitdiff
path: root/fs/nfs
AgeCommit message (Collapse)AuthorFilesLines
2019-03-03NFSv4.1: Bump the default callback session slot count to 16Trond Myklebust1-1/+1
Users can still control this value explicitly using the max_session_cb_slots module parameter, but let's bump the default up to 16 for now. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Clean up mirror DS initialisationTrond Myklebust1-35/+31
Get rid of the redundant parameter and rename the function ff_layout_mirror_valid() to ff_layout_init_mirror_ds() for clarity. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Remove dead code in ff_layout_mirror_valid()Trond Myklebust1-15/+0
nfs4_ff_alloc_deviceid_node() guarantees that if mirror->mirror_ds is a valid pointer, then so is mirror->mirror_ds->ds. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfile: Simplify nfs4_ff_layout_select_ds_stateid()Trond Myklebust3-26/+10
Pass in a pointer to the mirror rather than forcing another array access. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfile: Simplify nfs4_ff_layout_ds_version()Trond Myklebust2-5/+5
Pass in a pointer to the mirror rather than forcing another array access. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Simplify ff_layout_get_ds_cred()Trond Myklebust3-8/+9
Pass in a pointer to the mirror rather than forcing another array access. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Simplify nfs4_ff_find_or_create_ds_client()Trond Myklebust3-10/+6
Pass in a pointer to the mirror rather than forcing another array access. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Simplify nfs4_ff_layout_select_ds_fh()Trond Myklebust3-16/+5
Pass in a pointer to the mirror rather than having to retrieve it from the array and then verify the resulting pointer. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Speed up read failover when DSes are downTrond Myklebust3-12/+73
If we notice that a DS may be down, we should attempt to read from the other mirrors first before we go back to retry the dead DS. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Don't invalidate DS deviceids for being unresponsiveTrond Myklebust2-21/+3
If the DS is unresponsive, we want to just mark it as such, while reporting the errors. If the server later returns the same deviceid in a new layout, then we don't want to have to look it up again. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Remove bogus checks for invalid deviceidsTrond Myklebust1-20/+0
We already check the deviceids before we start the RPC call. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Avoid unnecessary layout invalidationsTrond Myklebust1-3/+3
In ff_layout_mirror_valid() we may not want to invalidate the layout segment despite the call to GETDEVICEINFO failing. The reason is that a read may still be able to make progress on another mirror. So instead we let the caller (in this case nfs4_ff_layout_prepare_ds()) decide whether or not it needs to invalidate. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: refactor calls to fs4_ff_layout_prepare_ds()Trond Myklebust3-22/+28
While we may want to skip attempting to connect to a downed mirror when we're deciding which mirror to select for a read, we do not want to do so once we've committed to attempting the I/O in ff_layout_read/write_pagelist(), or ff_layout_initiate_commit() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFSv4: Handle early exit in layoutget by returning an errorTrond Myklebust1-2/+4
If the LAYOUTGET rpc call exits early without an error, convert it to EAGAIN. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Send LAYOUTERROR when failing over mirrored readsTrond Myklebust3-6/+57
When a read to the preferred mirror returns an error, the flexfiles driver records the error in the inode list and currently marks the layout for return before failing over the attempted read to the next mirror. What we actually want to do is fire off a LAYOUTERROR to notify the MDS that there is an issue with the preferred mirror, then we fail over. Only once we've failed to read from all mirrors should we return the layout. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFSv4.2: Add client support for the generic 'layouterror' RPC callTrond Myklebust5-1/+269
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFSv4/flexfiles: Abort I/O early if the layout segment was invalidatedTrond Myklebust1-0/+17
If a layout segment gets invalidated while a pNFS I/O operation is queued for transmission, then we ideally want to abort immediately. This is particularly the case when there is a large number of I/O related RPCs queued in the RPC layer, and the layout segment gets invalidated due to an ENOSPC error, or an EACCES (because the client was fenced). We may end up forced to spam the MDS with a lot of otherwise unnecessary LAYOUTERRORs after that I/O fails. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFSv4/pnfs: Fix barriers in nfs4_mark_deviceid_unavailable()Trond Myklebust1-0/+3
Fix the memory barriers in nfs4_mark_deviceid_unavailable() and nfs4_test_deviceid_unavailable(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS/flexfiles: Fix up sparse RCU annotationsTrond Myklebust1-2/+2
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFSv4/flexfiles: Fix invalid deref in FF_LAYOUT_DEVID_NODE()Trond Myklebust1-13/+19
If the attempt to instantiate the mirror's layout DS pointer failed, then that pointer may hold a value of type ERR_PTR(), so we need to check that before we dereference it. Fixes: 65990d1afbd2d ("pNFS/flexfiles: Fix a deadlock on LAYOUTGET") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-02NFS: Add missing encode / decode sequence_maxsz to v4.2 operationsAnna Schumaker1-0/+10
These really should have been there from the beginning, but we never noticed because there was enough slack in the RPC request for the extra bytes. Chuck's recent patch to use au_cslack and au_rslack to compute buffer size shrunk the buffer enough that this was now a problem for SEEK operations on my test client. Fixes: f4ac1674f5da4 ("nfs: Add ALLOCATE support") Fixes: 2e72448b07dc3 ("NFS: Add COPY nfs operation") Fixes: cb95deea0b4aa ("NFS OFFLOAD_CANCEL xdr") Fixes: 624bd5b7b683c ("nfs: Add DEALLOCATE support") Fixes: 1c6dcbe5ceff8 ("NFS: Implement SEEK") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFSv4.1: Don't process the sequence op more than once.Trond Myklebust1-1/+1
Ensure that if we call nfs41_sequence_process() a second time for the same rpc_task, then we only process the results once. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-03-01NFSv4.1: Reinitialise sequence results before retransmitting a requestTrond Myklebust1-4/+8
If we have to retransmit a request, we should ensure that we reinitialise the sequence results structure, since in the event of a signal we need to treat the request as if it had not been sent. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org
2019-02-25Merge tag 'nfs-rdma-for-5.1-1' of ↵Trond Myklebust9-654/+406
git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA client updates for 5.1 New features: - Convert rpc auth layer to use xdr_streams - Config option to disable insecure enctypes - Reduce size of RPC receive buffers Bugfixes and cleanups: - Fix sparse warnings - Check inline size before providing a write chunk - Reduce the receive doorbell rate - Various tracepoint improvements [Trond: Fix up merge conflicts] Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-23NFS/pnfs: Bulk destroy of layouts needs to be safe w.r.t. umountTrond Myklebust2-10/+24
If a bulk layout recall or a metadata server reboot coincides with a umount, then holding a reference to an inode is unsafe unless we also hold a reference to the super block. Fixes: fd9a8d7160937 ("NFSv4.1: Fix bulk recall and destroy of layouts") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21NFS: Fix a soft lockup in the delegation recovery codeTrond Myklebust2-8/+13
Fix a soft lockup when NFS client delegation recovery is attempted but the inode is in the process of being freed. When the igrab(inode) call fails, and we have to restart the recovery process, we need to ensure that we won't attempt to recover the same delegation again. Fixes: 45870d6909d5a ("NFSv4.1: Test delegation stateids when server...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21NFSv4.1: Avoid false retries when RPC calls are interruptedTrond Myklebust3-60/+55
A 'false retry' in NFSv4.1 occurs when the client attempts to transmit a new RPC call using a slot+sequence number combination that references an already cached one. Currently, the Linux NFS client will do this if a user process interrupts an RPC call that is in progress. The problem with doing so is that we defeat the main mechanism used by the server to differentiate between a new call and a replayed one. Even if the server is able to perfectly cache the arguments of the old call, it cannot know if the client intended to replay or send a new call. The obvious fix is to bump the sequence number pre-emptively if an RPC call is interrupted, but in order to deal with the corner cases where the interrupted call is not actually received and processed by the server, we need to interpret the error NFS4ERR_SEQ_MISORDERED as a sign that we need to either wait or locate a correct sequence number that lies between the value we sent, and the last value that was acked by a SEQUENCE call on that slot. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Tested-by: Jason Tibbitts <tibbs@math.uh.edu>
2019-02-21nfs: fix xfstest generic/099 failed on nfsv3ZhangXiaoxu1-2/+0
After setxattr, the nfsv3 cached the acl which set by user. But at the backend, the shared file system (eg. ext4) will check the acl, if it can merged with mode, it won't add acl to the file. So, the nfsv3 cached acl is redundant. Don't 'set_cached_acl' when setxattr. Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21pNFS: Avoid read/modify/write when it is not necessaryKazuo Ito1-14/+26
As the block and SCSI layouts can only read/write fixed-length blocks, we must perform read-modify-write when data to be written is not aligned to a block boundary or smaller than the block size. (612aa983a0410 pnfs: add flag to force read-modify-write in ->write_begin) The current code tries to see if we have to do read-modify-write on block-oriented pNFS layouts by just checking !PageUptodate(page), but the same condition also applies for overwriting of any uncached potions of existing files, making such operations excessively slow even it is block-aligned. The change does not affect the optimization for modify-write-read cases (38c73044f5f4d NFS: read-modify-write page updating), because partial update of !PageUptodate() pages can only happen in layouts that can do arbitrary length read/write and never in block-based ones. Testing results: We ran fio on one of the pNFS clients running 4.20 kernel (vanilla and patched) in this configuration to read/write/overwrite files on the storage array, exported as pnfs share by the server. pNFS clients ---1G Ethernet--- pNFS server (HP DL360 G8) (HP DL360 G8) | | | | +------8G Fiber Channel--------+ | Storage Array (HP P6350) Throughput of overwrite (both buffered and O_SYNC) is noticeably improved. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ buffered | 4| 21.3 | 232 | overwrite| 32| 22.2 | 256 | | 512| 22.4 | 260 | ---------+----------+----------------+ O_SYNC | 4| 3.84| 4.77| overwrite| 32| 12.2 | 32.0 | | 512| 18.5 | 152 | ---------+----------+----------------+ Read and write (buffered and O_SYNC) by the same client remain unchanged by the patch either negatively or positively, as they should do. Ops. |block size| Throughput | | (KiB) | (MiB/s) | | | 4.20 | patched| ---------+----------+----------------+ read | 4| 548 | 550 | | 32| 547 | 551 | | 512| 548 | 551 | ---------+----------+----------------+ buffered | 4| 237 | 244 | write | 32| 261 | 268 | | 512| 265 | 272 | ---------+----------+----------------+ O_SYNC | 4| 0.46| 0.46| write | 32| 3.60| 3.57| | 512| 105 | 106 | ---------+----------+----------------+ Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp> Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21pNFS: Fix potential corruption of page being writtenKazuo Ito1-1/+1
nfs_want_read_modify_write() didn't check for !PagePrivate when pNFS block or SCSI layout was in use, therefore we could lose data forever if the page being written was filled by a read before completion. Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21NFS: Fix typo in comments of nfs_readdir_alloc_pages()zhangliguang1-2/+2
This fixes the typo in comments of nfs_readdir_alloc_pages(). Because nfs_readdir_large_page and nfs_readdir_free_pagearray had been renamed. Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21NFS: Remove redundant semicolonzhangliguang1-1/+1
This removes redundant semicolon for ending code. Fixes: c7944ebb9ce9 ("NFSv4: Fix lookup revalidate of regular files") Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21NFS: readdirplus optimization by cache mechanismluanshi2-7/+86
When listing very large directories via NFS, clients may take a long time to complete. There are about three factors involved: First of all, ls and practically every other method of listing a directory including python os.listdir and find rely on libc readdir(). However readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory, it is going to take an insanely long time to read all the directory entries. Secondly, libc readdir() reads 32K of directory entries at a time, in kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will be called for one page, which introduces many readdirplus rpc calls. Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry) to fill one page (filled by dentry), we found that nearly one third of data was wasted. To solve above problems, pagecache mechanism was introduced. One NFS readdirplus rpc will ask for a large data (more than 32k), the data can fill more than one page, the cached pages can be used for next readdir call. This can reduce many readdirplus rpc calls and improve readdirplus performance. TESTING: When listing very large directories(include 300 thousand files) via NFS time ls -l /nfs_mount | wc -l without the patch: 300001 real 1m53.524s user 0m2.314s sys 0m2.599s with the patch: 300001 real 0m23.487s user 0m2.305s sys 0m2.558s Improved performance: 79.6% readdirplus rpc calls decrease: 85% Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21fs/nfs: Fix nfs_parse_devname to not modify it's argumentEric W. Biederman1-1/+1
In the rare and unsupported case of a hostname list nfs_parse_devname will modify dev_name. There is no need to modify dev_name as the all that is being computed is the length of the hostname, so the computed length can just be shorted. Fixes: dc04589827f7 ("NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-21NFS: drop useless LIST_HEADJulia Lawall1-1/+0
Drop LIST_HEAD where the variable it declares has never been used. The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier x; @@ - LIST_HEAD(x); ... when != x // </smpl> Fixes: 0e20162ed1e9 ("NFSv4.1 Use MDS auth flavor for data server connection") Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Fix sparse annotations for nfs_set_open_stateid_locked()Trond Myklebust1-0/+4
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Fix up documentation warningsTrond Myklebust15-62/+75
Fix up some compiler warnings about function parameters, etc not being correctly described or formatted. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: ENOMEM should also be a fatal error.Trond Myklebust1-0/+1
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: EINTR is also a fatal error.Trond Myklebust1-0/+1
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Ensure NFS writeback allocations don't recurse back into NFS.Trond Myklebust1-1/+6
All the allocations that we can hit in the NFS layer and sunrpc layers themselves are already marked as GFP_NOFS, but we need to ensure that any calls to generic kernel functionality do the right thing as well. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Pass error information to the pgio error cleanup routineTrond Myklebust4-7/+15
Allow the caller to pass error information when cleaning up a failed I/O request so that we can conditionally take action to cancel the request altogether if the error turned out to be fatal. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Clean up list moves of struct nfs_pageTrond Myklebust2-10/+5
In several places we're just moving the struct nfs_page from one list to another by first removing from the existing list, then adding to the new one. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()Trond Myklebust1-1/+1
If the I/O completion failed with a fatal error, then we should just exit nfs_pageio_complete_mirror() rather than try to recoalesce. Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.0+
2019-02-20NFS: Fix an I/O request leakage in nfs_do_recoalesceTrond Myklebust1-1/+0
Whether we need to exit early, or just reprocess the list, we must not lost track of the request which failed to get recoalesced. Fixes: 03d5eb65b538 ("NFS: Fix a memory leak in nfs_do_recoalesce") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.0+
2019-02-20NFS: Fix I/O request leakagesTrond Myklebust1-5/+21
When we fail to add the request to the I/O queue, we currently leave it to the caller to free the failed request. However since some of the requests that fail are actually created by nfs_pageio_add_request() itself, and are not passed back the caller, this leads to a leakage issue, which can again cause page locks to leak. This commit addresses the leakage by freeing the created requests on error, using desc->pg_completion_ops->error_cleanup() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Fixes: a7d42ddb30997 ("nfs: add mirroring support to pgio layer") Cc: stable@vger.kernel.org # v4.0: c18b96a1b862: nfs: clean up rest of reqs Cc: stable@vger.kernel.org # v4.0: d600ad1f2bdb: NFS41: pop some layoutget Cc: stable@vger.kernel.org # v4.0+
2019-02-20Merge branch 'fixes-v5.1-rc6' of ↵Linus Torvalds1-14/+17
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull keys fixes from James Morris: - Handle quotas better, allowing full quota to be reached. - Fix the creation of shortcuts in the assoc_array internal representation when the index key needs to be an exact multiple of the machine word size. - Fix a dependency loop between the request_key contruction record and the request_key authentication key. The construction record isn't really necessary and can be dispensed with. - Set the timestamp on a new key rather than leaving it as 0. This would ordinarily be fine - provided the system clock is never set to a time before 1970 * 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: keys: Timestamp new keys keys: Fix dependency loop between construction record and auth key assoc_array: Fix shortcut creation KEYS: allow reaching the keys quotas exactly
2019-02-16keys: Fix dependency loop between construction record and auth keyDavid Howells1-14/+17
In the request_key() upcall mechanism there's a dependency loop by which if a key type driver overrides the ->request_key hook and the userspace side manages to lose the authorisation key, the auth key and the internal construction record (struct key_construction) can keep each other pinned. Fix this by the following changes: (1) Killing off the construction record and using the auth key instead. (2) Including the operation name in the auth key payload and making the payload available outside of security/keys/. (3) The ->request_key hook is given the authkey instead of the cons record and operation name. Changes (2) and (3) allow the auth key to naturally be cleaned up if the keyring it is in is destroyed or cleared or the auth key is unlinked. Fixes: 7ee02a316600 ("keys: Fix dependency loop between construction record and auth key") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.morris@microsoft.com>
2019-02-14NFS: Account for XDR pad of buf->pagesChuck Lever3-15/+16
Certain NFS results (eg. READLINK) might expect a data payload that is not an exact multiple of 4 bytes. In this case, XDR encoding is required to pad that payload so its length on the wire is a multiple of 4 bytes. The constants that define the maximum size of each NFS result do not appear to account for this extra word. In each case where the data payload is to be received into pages: - 1 word is added to the size of the receive buffer allocated by call_allocate - rpc_inline_rcv_pages subtracts 1 word from @hdrsize so that the extra buffer space falls into the rcv_buf's tail iovec - If buf->pagelen is word-aligned, an XDR pad is not needed and is thus removed from the tail Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-14SUNRPC: Introduce rpc_prepare_reply_pages()Chuck Lever3-73/+34
prepare_reply_buffer() and its NFSv4 equivalents expose the details of the RPC header and the auth slack values to upper layer consumers, creating a layering violation, and duplicating code. Remedy these issues by adding a new RPC client API that hides those details from upper layers in a common helper function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-13NFS: Add trace events to report non-zero NFS status codesChuck Lever6-4/+133
These can help field troubleshooting without needing the overhead of a full network capture (ie, tcpdump). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>