path: root/fs
2021-08-27  fscache: Remove the object list procfile  (David Howells; 7 files, -440/+0)
Remove the object list procfile from fscache as objects will become entirely internal to the cache. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431198332.2908479.5847286163455099669.stgit@warthog.procyon.org.uk/
2021-08-27  fscache, cachefiles: Remove the histogram stuff  (David Howells; 14 files, -336/+0)
Remove the histogram stuff as it's mostly going to be outdated. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431195953.2908479.16770977195634296638.stgit@warthog.procyon.org.uk/
2021-08-27  fscache: Procfile to display cookies  (David Howells; 3 files, -0/+111)
Add /proc/fs/fscache/cookies to display active cookies. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/158861211871.340223.7223853943667440807.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465771021.1376105.6933857529128238020.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588460994.3465195.16963417803501149328.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162431194785.2908479.786917990782538164.stgit@warthog.procyon.org.uk/
2021-08-27  fscache: Add a cookie debug ID and use that in traces  (David Howells; 3 files, -16/+28)
Add a cookie debug ID and use that in traces and in procfiles rather than displaying the (hashed) pointer to the cookie. This is easier to correlate and we don't lose anything when interpreting oops output since that shows unhashed addresses and registers that aren't comparable to the hashed values. Changes in ver #2: fix the fscache_op tracepoint to handle a NULL cookie pointer. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/158861210988.340223.11688464116498247790.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465769844.1376105.14119502774019865432.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588459097.3465195.1273313637721852165.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162431193544.2908479.17556704572948300790.stgit@warthog.procyon.org.uk/
2021-08-27  dax: remove bdev_dax_supported  (Christoph Hellwig; 3 files, -3/+6)
All callers already have a dax_device obtained from fs_dax_get_by_bdev at hand, so just pass that to dax_supported() instead of doing another lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20210826135510.6293-10-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-08-27  xfs: factor out a xfs_buftarg_is_dax helper  (Christoph Hellwig; 1 file, -4/+11)
Refactor the DAX setup code in preparation of removing bdev_dax_supported. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20210826135510.6293-9-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-08-27  fsdax: improve the FS_DAX Kconfig description and help text  (Christoph Hellwig; 1 file, -3/+18)
Rename the main option text to clarify it is for file system access, and add a bit of text that explains how to actually switch an nvdimm to an fsdax-capable state. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210826135510.6293-2-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-08-26  hostfs: support splice_write  (Johannes Berg; 1 file, -0/+1)
There's really no good reason not to, and e.g. trace-cmd currently requires it for the temporary per-CPU files. Hook up splice_write just like everyone else does. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Richard Weinberger <richard@nod.at>
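For illustration, a minimal sketch of what "hook up splice_write like everyone else" looks like in a file_operations table; the surrounding fields are assumptions for a generic filesystem, not the exact hostfs table:

```c
#include <linux/fs.h>

/* Sketch: wire up splice_write via the generic iter-based helper, as most
 * other filesystems do. Only .splice_write is the point here; the other
 * fields are illustrative stand-ins. */
static const struct file_operations example_file_fops = {
	.llseek		= generic_file_llseek,
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	.splice_read	= generic_file_splice_read,
	.splice_write	= iter_file_splice_write,	/* the one-line addition */
};
```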
2021-08-26  nfsd: fix crash on LOCKT on reexported NFSv3  (J. Bruce Fields; 1 file, -2/+3)
Unlike other filesystems, NFSv3 tries to use fl_file in the GETLK case. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-26  nfs: don't allow reexport reclaims  (J. Bruce Fields; 3 files, -0/+7)
In the reexport case, nfsd is currently passing along locks with the reclaim bit set. The client sends a new lock request, which is granted if there's currently no conflict--even if it's possible a conflicting lock could have been briefly held in the interim. We don't currently have any way to safely grant reclaim, so for now let's just deny them all. I'm doing this by passing the reclaim bit to nfs and letting it fail the call, with the idea that eventually the client might be able to do something more forgiving here. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-26  lockd: don't attempt blocking locks on nfs reexports  (J. Bruce Fields; 1 file, -3/+10)
As in the v4 case, it doesn't work well to block waiting for a lock on an nfs filesystem. As in the v4 case, that means we're depending on the client to poll. It's probably incorrect to depend on that, but I *think* clients do poll in practice. In any case, it's an improvement over hanging the lockd thread indefinitely as we currently are. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-26  nfs: don't attempt blocking locks on nfs reexports  (J. Bruce Fields; 2 files, -3/+7)
NFS implements blocking locks by blocking inside its lock method. In the reexport case, this blocks the nfs server thread, which could lead to deadlocks since an nfs server thread might be required to unlock the conflicting lock. It also causes a crash, since the nfs server thread assumes it can free the lock when its lm_notify lock callback is called. Ideally the nfs lock method would return without blocking in this case, but for now it works just not to attempt blocking locks. The difference is just that the original client will have to poll (as it does in the v4.0 case) instead of getting a callback when the lock becomes available. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-26  Merge tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client  (Linus Torvalds; 5 files, -15/+27)
Pull ceph fixes from Ilya Dryomov: "Two memory management fixes for the filesystem"

* tag 'ceph-for-5.14-rc8' of git://github.com/ceph/ceph-client:
  ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()
  ceph: correctly handle releasing an embedded cap flush
2021-08-26  Merge tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds; 1 file, -1/+1)
Pull btrfs fix from David Sterba: "One more fix that I think qualifies for a late merge. It's a revert of a one-liner fix that meanwhile got backported to stable kernels and we got reports from users. The broken fix prevents creating compressed inline extents, which could be noticeable on space consumption. Technically it's a regression as the patch was merged in 5.14-rc1 but got propagated to several stable kernels and has higher exposure than a 'typical' development cycle bug"

* tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: compression: don't try to compress if we don't have enough pages"
2021-08-26  iomap: standardize tracepoint formatting and storage  (Darrick J. Wong; 1 file, -10/+18)
Print all the offset, pos, and length quantities in hexadecimal. While we're at it, update the types of the tracepoint structure fields to match the types of the values being recorded in them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-26  cifs: Do not leak EDEADLK to getdents64 for STATUS_USER_SESSION_DELETED  (Ronnie Sahlberg; 1 file, -1/+22)
RHBZ: 1994393 If we hit a STATUS_USER_SESSION_DELETED for the Create part in the Create/QueryDirectory compound that starts a directory scan, we will leak EDEADLK back to userspace and surprise glibc and the application. Pick this up in initiate_cifs_search() and retry a small number of times before we return an error to userspace. Cc: stable@vger.kernel.org Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
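As a rough sketch of the retry shape described above (all names here are hypothetical stand-ins, not the actual cifs functions or structures):

```c
/* Hypothetical sketch: bound the number of attempts and only retry on the
 * specific failure mapped from STATUS_USER_SESSION_DELETED, so -EDEADLK is
 * never leaked to the getdents64 caller. */
#define MAX_SESSION_RETRIES	3

struct dir_scan;					/* hypothetical */
int open_and_query_directory(struct dir_scan *scan);	/* hypothetical compound Create/QueryDirectory */

static int start_dir_scan_with_retry(struct dir_scan *scan)
{
	int rc, tries = 0;

	do {
		rc = open_and_query_directory(scan);
		if (rc != -EDEADLK)	/* anything else is returned as-is */
			break;
	} while (++tries < MAX_SESSION_RETRIES);

	return rc;
}
```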
2021-08-25  cifs: cifs_md4 convert to SPDX identifier  (Steve French; 2 files, -6/+1)
Add SPDX license identifier and replace license boilerplate for cifs_md4.c Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-25  cifs: create a MD4 module and switch cifs.ko to use it  (Ronnie Sahlberg; 6 files, -15/+238)
MD4 support will likely be removed from the crypto directory, but is needed for compression of NTLMSSP in SMB3 mounts. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-25  cifs: fork arc4 and create a separate module for it for cifs and other users  (Ronnie Sahlberg; 7 files, -5/+128)
We cannot drop ARC4 and basically destroy CIFS connectivity for almost all CIFS users, so create a new forked ARC4 module that CIFS and other subsystems with a hard dependency on ARC4 can use. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-25  cifs: remove support for NTLM and weaker authentication algorithms  (Ronnie Sahlberg; 14 files, -720/+5)
This removes support for NTLM and weaker authentication algorithms for SMB1, and drops the dependency on DES. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-25  cifs: enable fscache usage even for files opened as rw  (Shyam Prasad N; 6 files, -11/+76)
So far, the fscache implementation we had supported only a small set of use cases, particularly files opened with O_RDONLY. This commit enables it even for rw-based file opens. It also enables the reuse of cached data in the case of the mount option cache=singleclient, where it is guaranteed that this is the only client (and server) which operates on the files. There's also a single-line change in fscache.c to get around a bug seen in fscache. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-25  smb3: fix posix extensions mount option  (Steve French; 1 file, -2/+9)
We were incorrectly initializing the posix extensions in the conversion to the new mount API. CC: <stable@vger.kernel.org> # 5.11+ Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-08-25  cifs: fix wrong release in sess_alloc_buffer() failed path  (Ding Hui; 1 file, -1/+1)
smb_buf is allocated by small_smb_init_no_tc(), and buf type is CIFS_SMALL_BUFFER, so we should use cifs_small_buf_release() to release it in failed path. Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
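A minimal sketch of the release pairing described above; the setup-request call is a hypothetical stand-in, only cifs_small_buf_release() is the real helper named by the commit:

```c
struct cifs_ses;				/* opaque, real type lives in cifs headers */
int send_setup_request(struct cifs_ses *ses, void *smb_buf);	/* hypothetical stand-in */
void cifs_small_buf_release(void *buf);		/* declared in cifs headers */

/* Sketch: a buffer obtained from the small-buffer pool (CIFS_SMALL_BUFFER)
 * must be returned through the matching small-buffer release helper on the
 * failed path, not the regular buffer release. */
static int example_sess_send(struct cifs_ses *ses, void *smb_buf)
{
	int rc = send_setup_request(ses, smb_buf);

	if (rc)
		cifs_small_buf_release(smb_buf);
	return rc;
}
```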
2021-08-25  CIFS: Fix a potential linear read overflow  (Len Baker; 1 file, -7/+2)
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated. Also, the strnlen() call does not avoid the read overflow in the strlcpy function when a non-NUL-terminated string is passed. So, replace this block with a call to kstrndup() that avoids this type of overflow and does the same. Fixes: 066ce6899484d ("cifs: rename cifs_strlcpy_to_host and make it use new functions") Signed-off-by: Len Baker <len.baker@gmx.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
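As a rough illustration of the pattern (not the exact cifs code), the bounded duplication helper replaces the strnlen()/strlcpy() combination:

```c
#include <linux/slab.h>
#include <linux/string.h>

/* Sketch: duplicate a possibly non-NUL-terminated source string with a hard
 * length bound. kstrndup() never reads past 'max' bytes, unlike strlcpy(),
 * which walks the whole source to compute its return value. */
static char *copy_bounded_string(const char *src, size_t max)
{
	return kstrndup(src, max, GFP_KERNEL);	/* NULL on allocation failure */
}
```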
2021-08-25  io_uring: don't free request to slab  (Hao Xu; 1 file, -1/+1)
It's not necessary to free the request back to the slab when we fail to get an sqe; just move it to state->free_list. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/20210825175856.194299-1-haoxu@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25  pipe: do FASYNC notifications for every pipe IO, not just state changes  (Linus Torvalds; 1 file, -12/+8)
It turns out that the SIGIO/FASYNC situation is almost exactly the same as the EPOLLET case was: user space really wants to be notified after every operation.

Now, in a perfect world it should be sufficient to only notify user space on "state transitions" when the IO state changes (ie when a pipe goes from unreadable to readable, or from unwritable to writable). User space should then do as much as possible - fully emptying the buffer or what not - and we'll notify it again the next time the state changes.

But as with EPOLLET, we have at least one case (stress-ng) where the kernel sent SIGIO due to the pipe being marked for asynchronous notification, but the user space signal handler then didn't actually necessarily read it all before returning (it read more than what was written, but since there could be multiple writes, it could leave data pending). The user space code then expected to get another SIGIO for subsequent writes - even though the pipe had been readable the whole time - and would only then read more.

This is arguably a user space bug - and Colin King already fixed the stress-ng code in question - but the kernel regression rules are clear: it doesn't matter if kernel people think that user space did something silly and wrong. What matters is that it used to work. So if user space depends on specific historical kernel behavior, it's a regression when that behavior changes. It's on us: we were silly to have that non-optimal historical behavior, and our old kernel behavior was what user space was tested against.

Because of how the FASYNC notification was tied to wakeup behavior, this was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix and clarify pipe read/write wakeup logic"), but at the time it seems nobody noticed. Probably because the stress-ng problem case ends up being timing-dependent too. It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers") only to be broken again by commit 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads"). And at that point the kernel test robot noticed the performance regression in the stress-ng.sigio.ops_per_sec case.

So the "Fixes" tag below is somewhat ad hoc, but it matches when the issue was noticed. Fix it for good (knock wood) by simply making the kill_fasync() case separate from the wakeup case. FASYNC is quite rare, and we clearly shouldn't even try to use the "avoid unnecessary wakeups" logic for it.

Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/ Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Oliver Sang <oliver.sang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Colin Ian King <colin.king@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
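For context, a small userspace sketch of the FASYNC pattern the stress-ng case relies on (standard fcntl/SIGIO API; the drain loop is an assumption about how a consumer would typically behave):

```c
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static int pipe_rd;

/* Drain whatever is currently in the pipe; O_NONBLOCK ends the loop. With
 * per-IO FASYNC notifications, a handler that leaves data behind still gets
 * another SIGIO when the next write happens. */
static void on_sigio(int sig)
{
	char buf[4096];

	(void)sig;
	while (read(pipe_rd, buf, sizeof(buf)) > 0)
		;
}

int setup_pipe_sigio(int fds[2])
{
	int fl;

	if (pipe(fds) < 0)
		return -1;
	pipe_rd = fds[0];

	signal(SIGIO, on_sigio);
	fl = fcntl(pipe_rd, F_GETFL);
	if (fl < 0 ||
	    fcntl(pipe_rd, F_SETFL, fl | O_ASYNC | O_NONBLOCK) < 0 ||
	    fcntl(pipe_rd, F_SETOWN, getpid()) < 0)
		return -1;
	return 0;
}
```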
2021-08-25  ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()  (Tuo Li; 1 file, -3/+5)
kcalloc() is called to allocate memory for m->m_info, and if it fails, ceph_mdsmap_destroy() behind the label out_err will be called:

  ceph_mdsmap_destroy(m);

In ceph_mdsmap_destroy(), m->m_info is dereferenced through:

  kfree(m->m_info[i].export_targets);

To fix this possible null-pointer dereference, check m->m_info before the for loop that frees m->m_info[i].export_targets.

[ jlayton: fix up whitespace damage; only kfree(m->m_info) if it's non-NULL ]

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Tuo Li <islituo@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
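A minimal sketch of the shape of the fix; the types are illustrative stand-ins whose field names mirror the commit text (m_info, export_targets), not the real ceph structures:

```c
#include <linux/slab.h>
#include <linux/types.h>

struct example_mds_info {
	u32 *export_targets;
};

struct example_mdsmap {
	int num_mds;
	struct example_mds_info *m_info;	/* may be NULL if kcalloc() failed */
};

/* Sketch: only walk the per-MDS array if the allocation succeeded, then free
 * the array itself, then the map. */
static void example_mdsmap_destroy(struct example_mdsmap *m)
{
	int i;

	if (m->m_info) {
		for (i = 0; i < m->num_mds; i++)
			kfree(m->m_info[i].export_targets);
		kfree(m->m_info);
	}
	kfree(m);
}
```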
2021-08-25  ceph: correctly handle releasing an embedded cap flush  (Xiubo Li; 4 files, -12/+22)
The ceph_cap_flush structures are usually dynamically allocated, but the ceph_cap_snap has an embedded one. When force umounting, the client will try to remove all the session caps. During this, it will free them, but that should not be done with the ones embedded in a capsnap. Fix this by adding a new boolean that indicates that the cap flush is embedded in a capsnap, and skip freeing it if that's set. At the same time, switch to using list_del_init() when detaching the i_list and g_list heads. It's possible for a forced umount to remove these objects but then handle_cap_flushsnap_ack() races in and does the list_del_init() again, corrupting memory. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52283 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-25  cachefiles: Use file_inode() rather than accessing ->f_inode  (David Howells; 1 file, -2/+2)
Use the file_inode() helper rather than accessing ->f_inode directly. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431192403.2908479.4590814090994846904.stgit@warthog.procyon.org.uk/
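For illustration, the substitution looks like this (a generic sketch, not the exact cachefiles call site):

```c
#include <linux/fs.h>

/* Sketch: use the accessor instead of poking at struct file internals. */
static inline struct inode *backing_inode_of(struct file *file)
{
	return file_inode(file);	/* rather than file->f_inode */
}
```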
2021-08-25  netfs: Move cookie debug ID to struct netfs_cache_resources  (David Howells; 1 file, -1/+1)
Move the cookie debug ID from struct netfs_read_request to struct netfs_cache_resources and drop the 'cookie_' prefix. This makes it available for things that want to use netfs_cache_resources without having a netfs_read_request. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/162431190784.2908479.13386972676539789127.stgit@warthog.procyon.org.uk/
2021-08-25  fscache: Select netfs stats if fscache stats are enabled  (David Howells; 1 file, -0/+1)
Unconditionally select the stats produced by the netfs lib if fscache stats are enabled as the former are displayed in the latter's procfile. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-cachefs@redhat.com Reviewed-by: Jeff Layton <jlayton@redhat.com> Link: https://lore.kernel.org/r/162280352566.3319242.10615341893991206961.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/162431189627.2908479.9165349423842063755.stgit@warthog.procyon.org.uk/
2021-08-25  erofs: fix double free of 'copied'  (Gao Xiang; 1 file, -0/+1)
Dan reported a new smatch warning [1] "fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'" Due to new chunk-based format handling logic, the error path can be called after kfree(copied). Set "copied = NULL" after kfree(copied) to fix this. [1] https://lore.kernel.org/r/202108251030.bELQozR7-lkp@intel.com Link: https://lore.kernel.org/r/20210825120757.11034-1-hsiangkao@linux.alibaba.com Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
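The fix is the classic "NULL after free" pattern; a minimal, self-contained sketch (the consumer function is a hypothetical stand-in, not erofs code):

```c
#include <linux/errno.h>
#include <linux/slab.h>

int consume_buffer(void *buf);	/* hypothetical stand-in for the real user */

/* Sketch: NULL the pointer right after freeing it so a shared error path
 * that calls kfree() again becomes a harmless kfree(NULL) instead of a
 * double free. */
static int example_read_inode(void)
{
	void *copied = kmalloc(64, GFP_KERNEL);
	int err;

	if (!copied)
		return -ENOMEM;

	err = consume_buffer(copied);
	kfree(copied);
	copied = NULL;			/* the one-line fix */
	if (err)
		goto err_out;
	return 0;

err_out:
	kfree(copied);			/* now a no-op rather than a double free */
	return err;
}
```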
2021-08-25Revert "btrfs: compression: don't try to compress if we don't have enough pages"Qu Wenruo1-1/+1
This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763. [BUG] It's no longer possible to create compressed inline extent after commit f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages"). [CAUSE] For compression code, there are several possible reasons we have a range that needs to be compressed while it's no more than one page. - Compressed inline write The data is always smaller than one sector and the test lacks the condition to properly recognize a non-inline extent. - Compressed subpage write For the incoming subpage compressed write support, we require page alignment of the delalloc range. And for 64K page size, we can compress just one page into smaller sectors. For those reasons, the requirement for the data to be more than one page is not correct, and is already causing regression for compressed inline data writeback. The idea of skipping one page to avoid wasting CPU time could be revisited in the future. [FIX] Fix it by reverting the offending commit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-25  io_uring: accept directly into fixed file table  (Pavel Begunkov; 1 file, -6/+18)
As done with the open opcodes, allow accept to skip installing the fd into the process's file table and put it directly into io_uring's fixed file table. Same restrictions and design as for open. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-25  io_uring: hand code io_accept() fd installing  (Pavel Begunkov; 1 file, -7/+20)
Make io_accept() handle file descriptor allocation and installation itself. A preparation patch for bypassing file tables. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/5b73d204caa0ce979ccb98136695b60f52a3d98c.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
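For reference, a generic sketch of the hand-coded fd installation pattern (reserve, produce the file, install or back out); get_a_file() is a hypothetical stand-in for the code that performs the accept, not an io_uring function:

```c
#include <linux/err.h>
#include <linux/file.h>
#include <linux/fs.h>

struct file *get_a_file(void);	/* hypothetical stand-in */

/* Sketch: reserve a descriptor first, then either install the produced
 * struct file or release the reservation on failure. */
static int example_install_fd(unsigned int flags)
{
	struct file *file;
	int fd;

	fd = get_unused_fd_flags(flags);
	if (fd < 0)
		return fd;

	file = get_a_file();
	if (IS_ERR(file)) {
		put_unused_fd(fd);
		return PTR_ERR(file);
	}

	fd_install(fd, file);
	return fd;
}
```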
2021-08-25  io_uring: openat directly into fixed fd table  (Pavel Begunkov; 1 file, -8/+66)
Instead of opening a file into a process's file table as usual and then registering the fd within io_uring, some users may want to skip the first step and place it directly into io_uring's fixed file table. This patch adds such a capability for IORING_OP_OPENAT and IORING_OP_OPENAT2.

The behaviour is controlled by setting sqe->file_index, where 0 implies the old behaviour using normal file tables. If a non-zero value is specified, then it will behave as described and place the file into the fixed file slot sqe->file_index - 1. A file table should already have been created, and the slot should be valid and empty, otherwise the operation will fail.

Keep the error codes consistent with IORING_OP_FILES_UPDATE: ENXIO and EINVAL on inappropriate fixed tables, and EBADF on collision with an already registered file.

Note: IOSQE_FIXED_FILE can't be used to switch between modes, because accept takes a file, and it already uses the flag with a different meaning.

Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
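A userspace sketch of the sqe->file_index convention described above, assuming liburing and a fixed file table that was already registered with io_uring_register_files(); the helper name and error handling are illustrative:

```c
#include <errno.h>
#include <fcntl.h>
#include <liburing.h>

/* Sketch: open 'path' directly into fixed slot 'slot' of the registered
 * file table. file_index == 0 would mean the normal fd path; otherwise the
 * target slot is file_index - 1, per the commit text. */
static int openat_into_fixed_slot(struct io_uring *ring, const char *path,
				  unsigned int slot)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	if (!sqe)
		return -EBUSY;

	io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
	sqe->file_index = slot + 1;

	io_uring_submit(ring);
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;			/* negative errno on failure */
	io_uring_cqe_seen(ring, cqe);
	return ret < 0 ? ret : 0;
}
```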
2021-08-25  configfs: fix a race in configfs_lookup()  (Sishuai Gong; 1 file, -3/+4)
When configfs_lookup() is executing list_for_each_entry(), it is possible that configfs_dir_lseek() is calling list_del(). Some unfortunate interleavings of them can cause a kernel NULL pointer dereference error:

  Thread 1                          Thread 2
  //configfs_dir_lseek()            //configfs_lookup()
  list_del(&cursor->s_sibling);
                                    list_for_each_entry(sd, ...)

Fix this by grabbing configfs_dirent_lock in configfs_lookup() while iterating ->s_children. Signed-off-by: Sishuai Gong <sishuai@purdue.edu> Signed-off-by: Christoph Hellwig <hch@lst.de>
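A minimal sketch of the fix's shape: the lookup-side list walk must hold the same lock that the lseek path holds around its list_del(). The types and lock here are illustrative stand-ins, not the real configfs_dirent structures:

```c
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/string.h>

static DEFINE_SPINLOCK(example_dirent_lock);	/* stands in for configfs_dirent_lock */

struct example_dirent {
	struct list_head s_sibling;
	const char *s_name;
};

/* Sketch: iterate ->s_children under the lock so a concurrent list_del()
 * of the lseek cursor cannot be observed mid-walk. */
static struct example_dirent *lookup_child(struct list_head *children,
					   const char *name)
{
	struct example_dirent *sd, *found = NULL;

	spin_lock(&example_dirent_lock);
	list_for_each_entry(sd, children, s_sibling) {
		if (sd->s_name && !strcmp(sd->s_name, name)) {
			found = sd;
			break;
		}
	}
	spin_unlock(&example_dirent_lock);
	return found;
}
```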
2021-08-25  configfs: fold configfs_attach_attr into configfs_lookup  (Christoph Hellwig; 1 file, -49/+24)
This makes it clearer what gets added to the dcache and prepares for an additional locking fix. Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-25  configfs: simplify the configfs_dirent_is_ready  (Christoph Hellwig; 1 file, -3/+1)
Return the error directly instead of using a goto. Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-25  configfs: return -ENAMETOOLONG earlier in configfs_lookup  (Christoph Hellwig; 1 file, -2/+3)
Just like most other file systems: get the simple checks out of the way first. Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-25  xfs: fix I_DONTCACHE  (Dave Chinner; 2 files, -2/+3)
Yup, the VFS hoist broke it, and nobody noticed. Bulkstat workloads make it clear that it doesn't work as it should. Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-24  block: mark blkdev_fsync static  (Christoph Hellwig; 1 file, -2/+2)
blkdev_fsync is only used inside of block_dev.c since the removal of the raw driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210824151823.1575100-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-24  fs: clean up after mandatory file locking support removal  (Lukas Bulwahn; 1 file, -7/+3)
Commit 3efee0567b4a ("fs: remove mandatory file locking support") removes some operations in functions rw_verify_area(). As these functions are now simplified, do some syntactic clean-up as follow-up to the removal as well, which was pointed out by compiler warnings and static analysis. No functional change. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-08-24  xfs: only set IOMAP_F_SHARED when providing a srcmap to a write  (Darrick J. Wong; 1 file, -4/+4)
While prototyping a free space defragmentation tool, I observed an unexpected IO error that can be recreated by the following sequence of commands:

  # xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1
  # cp --reflink=always file1 file2
  # punch-alternating -o 1 file2
  # xfs_io -c "funshare 0 10m" file2
  fallocate: Input/output error

I then scraped this (abbreviated) stack trace from dmesg:

  WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
  CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
  RIP: 0010:iomap_write_begin+0x376/0x450
  RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297
  RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000
  RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c
  RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50
  R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40
  R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000
  FS: 00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0
  Call Trace:
   iomap_unshare_actor+0x95/0x140
   iomap_apply+0xfa/0x300
   iomap_file_unshare+0x44/0x60
   xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
   xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
   vfs_fallocate+0x133/0x330
   __x64_sys_fallocate+0x3e/0x70
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x7f4b8f79140a

Looking at the iomap tracepoints, I saw this:

  iomap_iter:        dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
  iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED
  iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags
  iomap_iter:        dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
  iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED
  console:           WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450

The first time funshare calls ->iomap_begin, xfs sees that the first block is shared and creates a 128k delalloc reservation in the COW fork. The delalloc reservation is returned as dstmap, and the shared block is returned as srcmap. So far so good.

funshare calls ->iomap_begin to try the second block. This time there's no srcmap (punch-alternating punched it out!) but we still have the delalloc reservation in the COW fork. Therefore, we again return the reservation as dstmap and the hole as srcmap. iomap_unshare_iter incorrectly tries to unshare the hole, which __iomap_write_begin rejects because shared regions must be fully written and therefore cannot require zeroing.

Therefore, change the buffered write iomap_begin function not to set IOMAP_F_SHARED when there isn't a source mapping to read from for the unsharing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-08-24  Keep read and write fds with each nlm_file  (J. Bruce Fields; 5 files, -41/+102)
We shouldn't really be using a read-only file descriptor to take a write lock. Most filesystems will put up with it. But NFS, for example, won't. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-08-23  io_uring: add support for IORING_OP_LINKAT  (Dmitry Kadashev; 3 files, -1/+74)
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and arguments. In some internal places 'hardlink' is used instead of 'link' to avoid confusion with the SQE links. Name 'link' conflicts with the existing 'link' member of io_kiocb. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>
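A userspace usage sketch, assuming a liburing version that exposes io_uring_prep_linkat() mirroring linkat(2); the file names are illustrative:

```c
#include <errno.h>
#include <fcntl.h>
#include <liburing.h>

/* Sketch: queue a hard link of "a.txt" to "b.txt", both relative to the
 * current directory, with no flags. Completion handling is omitted. */
static int queue_linkat(struct io_uring *ring)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EBUSY;

	io_uring_prep_linkat(sqe, AT_FDCWD, "a.txt", AT_FDCWD, "b.txt", 0);
	return io_uring_submit(ring);
}
```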
2021-08-23  io_uring: add support for IORING_OP_SYMLINKAT  (Dmitry Kadashev; 3 files, -2/+69)
IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23  block: use the percpu bio cache in __blkdev_direct_IO  (Christoph Hellwig; 1 file, -2/+4)
Use bio_alloc_kiocb to dip into the percpu cache of bios when the caller asks for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23  io_uring: enable use of bio alloc cache  (Jens Axboe; 1 file, -1/+1)
Mark polled IO as being safe for dipping into the bio allocation cache, in case the targeted bio_set has it enabled. This brings an IOPOLL gen2 Optane QD=128 workload from ~3.2M IOPS to ~3.5M IOPS. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23  io_uring: fix io_try_cancel_userdata race for iowq  (Pavel Begunkov; 1 file, -2/+3)
WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
Call Trace:
 io_async_cancel fs/io_uring.c:6014 [inline]
 io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
 io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
 io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
 io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

io_try_cancel_userdata() can be called from io_async_cancel() executing in the io-wq context, so the warning fires, which is there to alert anyone accessing task->io_uring->io_wq in a racy way. However, io_wq_put_and_exit() always first waits for all threads to complete, so the only detail left is to zero tctx->io_wq after the context is removed.

note: one little assumption is that when IO_WQ_WORK_CANCEL, the executor won't touch ->io_wq, because io_wq_destroy() might cancel left pending requests in such a way.

Cc: stable@vger.kernel.org Reported-by: syzbot+b0c9d1588ae92866515f@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>