summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2018-12-20vfs: Separate changing mount flags full remountDavid Howells1-54/+92
Separate just the changing of mount flags (MS_REMOUNT|MS_BIND) from full remount because the mount data will get parsed with the new fs_context stuff prior to doing a remount - and this causes the syscall to fail under some circumstances. To quote Eric's explanation: [...] mount(..., MS_REMOUNT|MS_BIND, ...) now validates the mount options string, which breaks systemd unit files with ProtectControlGroups=yes (e.g. systemd-networkd.service) when systemd does the following to change a cgroup (v1) mount to read-only: mount(NULL, "/run/systemd/unit-root/sys/fs/cgroup/systemd", NULL, MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) ... when the kernel has CONFIG_CGROUPS=y but no cgroup subsystems enabled, since in that case the error "cgroup1: Need name or subsystem set" is hit when the mount options string is empty. Probably it doesn't make sense to validate the mount options string at all in the MS_REMOUNT|MS_BIND case, though maybe you had something else in mind. This is also worthwhile doing because we will need to add a mount_setattr() syscall to take over the remount-bind function. Reported-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Howells <dhowells@redhat.com>
2018-12-20vfs: Suppress MS_* flag defs within the kernel unless explicitly enabledDavid Howells3-0/+3
Only the mount namespace code that implements mount(2) should be using the MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is included. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Howells <dhowells@redhat.com>
2018-12-20iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()"Dave Chinner1-7/+0
This reverts commit 61c6de667263184125d5ca75e894fcad632b0dd3. The reverted commit added page reference counting to iomap page structures that are used to track block size < page size state. This was supposed to align the code with page migration page accounting assumptions, but what it has done instead is break XFS filesystems. Every fstests run I've done on sub-page block size XFS filesystems has since picking up this commit 2 days ago has failed with bad page state errors such as: # ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038" .... SECTION -- xfs FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 test1 4.20.0-rc6-dgc+ MKFS_OPTIONS -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc MOUNT_OPTIONS -- /dev/sdc /mnt/scratch generic/038 454s ... run fstests generic/038 at 2018-12-20 18:43:05 XFS (sdc): Unmounting Filesystem XFS (sdc): Mounting V5 Filesystem XFS (sdc): Ending clean mount BUG: Bad page state in process kswapd0 pfn:3a7fa page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1 flags: 0xfffffc0000000() raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360 raw: 0000000000000001 0000000000000000 00000000ffffffff page dumped because: non-NULL mapping CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 Call Trace: dump_stack+0x67/0x90 bad_page.cold.116+0x8a/0xbd free_pcppages_bulk+0x4bf/0x6a0 free_unref_page_list+0x10f/0x1f0 shrink_page_list+0x49d/0xf50 shrink_inactive_list+0x19d/0x3b0 shrink_node_memcg.constprop.77+0x398/0x690 ? shrink_slab.constprop.81+0x278/0x3f0 shrink_node+0x7a/0x2f0 kswapd+0x34b/0x6d0 ? node_reclaim+0x240/0x240 kthread+0x11f/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x24/0x30 Disabling lock debugging due to kernel taint .... The failures are from anyway that frees pages and empties the per-cpu page magazines, so it's not a predictable failure or an easy to debug failure. generic/038 is a reliable reproducer of this problem - it has a 9 in 10 failure rate on one of my test machines. Failure on other machines have been at random points in fstests runs but every run has ended up tripping this problem. Hence generic/038 was used to bisect the failure because it was the most reliable failure. It is too close to the 4.20 release (not to mention holidays) to try to diagnose, fix and test the underlying cause of the problem, so reverting the commit is the only option we have right now. The revert has been tested against a current tot 4.20-rc7+ kernel across multiple machines running sub-page block size XFs filesystems and none of the bad page state failures have been seen. Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-20xfs: stringify scrub types in ftrace outputDarrick J. Wong2-28/+79
Use __print_symbolic to print the scrub type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-20xfs: stringify btree cursor types in ftrace outputDarrick J. Wong3-14/+49
Use __print_symbolic to print the btree type in ftrace output. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-20xfs: move XFS_INODE_FORMAT_STR mappings to libxfsDarrick J. Wong2-5/+15
Move XFS_INODE_FORMAT_STR to libxfs so that we don't forget to keep it updated, and add necessary TRACE_DEFINE_ENUM. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-20xfs: move XFS_AG_BTREE_CMP_FORMAT_STR mappings to libxfsDarrick J. Wong2-4/+5
Move XFS_AG_BTREE_CMP_FORMAT_STR to libxfs so that we don't forget to keep it updated, and TRACE_DEFINE_ENUM the values while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-20xfs: fix symbolic enum printing in ftrace outputDarrick J. Wong3-0/+26
ftrace's __print_symbolic() has a (very poorly documented) requirement that any enum values used in the symbol to string translation table be wrapped in a TRACE_DEFINE_ENUM so that the enum value can be encoded in the ftrace ring buffer. Fix this unsatisfied requirement. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-20xfs: fix function pointer type in ftrace formatDarrick J. Wong1-1/+1
Use %pS instead of %pF in ftrace strings so that we record the actual function address instead of the function descriptor. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-12-19ext4: check for shutdown and r/o file system in ext4_write_inode()Theodore Ts'o1-2/+7
If the file system has been shut down or is read-only, then ext4_write_inode() needs to bail out early. Also use jbd2_complete_transaction() instead of ext4_force_commit() so we only force a commit if it is needed. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2018-12-19ext4: force inode writes when nfsd calls commit_metadata()Theodore Ts'o1-0/+11
Some time back, nfsd switched from calling vfs_fsync() to using a new commit_metadata() hook in export_operations(). If the file system did not provide a commit_metadata() hook, it fell back to using sync_inode_metadata(). Unfortunately doesn't work on all file systems. In particular, it doesn't work on ext4 due to how the inode gets journalled --- the VFS writeback code will not always call ext4_write_inode(). So we need to provide our own ext4_nfs_commit_metdata() method which calls ext4_write_inode() directly. Google-Bug-Id: 121195940 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2018-12-19NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'.NeilBrown26-307/+237
SUNRPC has two sorts of credentials, both of which appear as "struct rpc_cred". There are "generic credentials" which are supplied by clients such as NFS and passed in 'struct rpc_message' to indicate which user should be used to authorize the request, and there are low-level credentials such as AUTH_NULL, AUTH_UNIX, AUTH_GSS which describe the credential to be sent over the wires. This patch replaces all the generic credentials by 'struct cred' pointers - the credential structure used throughout Linux. For machine credentials, there is a special 'struct cred *' pointer which is statically allocated and recognized where needed as having a special meaning. A look-up of a low-level cred will map this to a machine credential. Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19NFS: struct nfs_open_dir_context: convert rpc_cred pointer to cred.NeilBrown4-17/+33
Use the common 'struct cred' to pass credentials for readdir. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19NFS: change access cache to use 'struct cred'.NeilBrown3-30/+39
Rather than keying the access cache with 'struct rpc_cred', use 'struct cred'. Then use cred_fscmp() to compare credentials rather than comparing the raw pointer. A benefit of this approach is that in the common case we avoid the rpc_lookup_cred_nonblock() call which can be slow when the cred cache is large. This also keeps many fewer items pinned in the rpc cred cache, so the cred cache is less likely to get large. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19NFS: move credential expiry tracking out of SUNRPC into NFS.NeilBrown2-3/+23
NFS needs to know when a credential is about to expire so that it can modify write-back behaviour to finish the write inside the expiry time. It currently uses functions in SUNRPC code which make use of a fairly complex callback scheme and flags in the generic credientials. As I am working to discard the generic credentials, this has to change. This patch moves the logic into NFS, in part by finding and caching the low-level credential in the open_context. We then make direct cred-api calls on that. This makes the code much simpler and removes a dependency on generic rpc credentials. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19NFS/SUNRPC: don't lookup machine credential until rpcauth_bindcred().NeilBrown4-42/+11
When NFS creates a machine credential, it is a "generic" credential, not tied to any auth protocol, and is really just a container for the princpal name. This doesn't get linked to a genuine credential until rpcauth_bindcred() is called. The lookup always succeeds, so various places that test if the machine credential is NULL, are pointless. As a step towards getting rid of generic credentials, this patch gets rid of generic machine credentials. The nfs_client and rpc_client just hold a pointer to a constant principal name. When a machine credential is wanted, a special static 'struct rpc_cred' pointer is used. rpcauth_bindcred() recognizes this, finds the principal from the client, and binds the correct credential. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19NFSv4: don't require lock for get_renew_cred or get_machine_credNeilBrown4-25/+16
This lock is no longer necessary. If nfs4_get_renew_cred() needs to hunt through the open-state creds for a user cred, it still takes the lock to stablize the rbtree, but otherwise there are no races. Note that this completely removes the lock from nfs4_renew_state(). It appears that the original need for the locking here was removed long ago, and there is no longer anything to protect. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19NFSv4: add cl_root_cred for use when machine cred is not available.NeilBrown2-8/+14
NFSv4 state management tries a root credential when no machine credential is available, as can happen with kerberos. It does this by replacing the cl_machine_cred with a root credential. This means that any user of the machine credential needs to take a lock while getting a reference to the machine credential, which is a little cumbersome. So introduce an explicit cl_root_cred, and never free either credential until client shutdown. This means that no locking is needed to reference these credentials. Future patches will make use of this. This is only a temporary addition. both cl_machine_cred and cl_root_cred will disappear later in the series. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19SUNRPC: remove uid and gid from struct auth_credNeilBrown2-10/+10
Use cred->fsuid and cred->fsgid instead. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19SUNRPC: remove groupinfo from struct auth_cred.NeilBrown1-13/+1
We can use cred->groupinfo (from the 'struct cred') instead. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19SUNRPC: add 'struct cred *' to auth_cred and rpc_credNeilBrown2-1/+29
The SUNRPC credential framework was put together before Linux has 'struct cred'. Now that we have it, it makes sense to use it. This first step just includes a suitable 'struct cred *' pointer in every 'struct auth_cred' and almost every 'struct rpc_cred'. The rpc_cred used for auth_null has a NULL 'struct cred *' as nothing else really makes sense. For rpc_cred, the pointer is reference counted. For auth_cred it isn't. struct auth_cred are either allocated on the stack, in which case the thread owns a reference to the auth, or are part of 'struct generic_cred' in which case gc_base owns the reference, and "acred" shares it. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19nfs: fix comment to nfs_generic_pg_test which does the oppositePavel Tikhomirov1-1/+1
Please see comment to filelayout_pg_test for reference. To: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: linux-nfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19NFSv4: cleanup remove unused nfs4_xdev_fs_typeOlga Kornievskaia1-1/+0
commit e8f25e6d6d19 "NFS: Remove the NFS v4 xdev mount function" removed the last use of this. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19ext4: avoid declaring fs inconsistent due to invalid file handlesTheodore Ts'o8-41/+65
If we receive a file handle, either from NFS or open_by_handle_at(2), and it points at an inode which has not been initialized, and the file system has metadata checksums enabled, we shouldn't try to get the inode, discover the checksum is invalid, and then declare the file system as being inconsistent. This can be reproduced by creating a test file system via "mke2fs -t ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that directory, and then running the following program. #define _GNU_SOURCE #include <fcntl.h> struct handle { struct file_handle fh; unsigned char fid[MAX_HANDLE_SZ]; }; int main(int argc, char **argv) { struct handle h = {{8, 1 }, { 12, }}; open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY); return 0; } Google-Bug-Id: 120690101 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2018-12-19ext4: include terminating u32 in size of xattr entries when expanding inodesTheodore Ts'o1-1/+1
In ext4_expand_extra_isize_ea(), we calculate the total size of the xattr header, plus the xattr entries so we know how much of the beginning part of the xattrs to move when expanding the inode extra size. We need to include the terminating u32 at the end of the xattr entries, or else if there is uninitialized, non-zero bytes after the xattr entries and before the xattr values, the list of xattr entries won't be properly terminated. Reported-by: Steve Graham <stgraham2000@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2018-12-19smb3: Fix rmdir compounding regression to strict serversRonnie Sahlberg3-17/+25
Some servers require that the setinfo matches the exact size, and in this case compounding changes introduced by commit c2e0fe3f5aae ("cifs: make rmdir() use compounding") caused us to send 8 bytes (padded length) instead of 1 byte (the size of the structure). See MS-FSCC section 2.4.11. Fixing this when we send a SET_INFO command for delete file disposition, then ends up as an iov of a single byte but this causes problems with SMB3 and encryption. To avoid this, instead of creating a one byte iov for the disposition value and then appending an additional iov with a 7 byte padding we now handle this as a single 8 byte iov containing both the disposition byte as well as the padding in one single buffer. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Paulo Alcantara <palcantara@suse.de>
2018-12-19binder: fix use-after-free due to ksys_close() during fdget()Todd Kjos1-0/+29
44d8047f1d8 ("binder: use standard functions to allocate fds") exposed a pre-existing issue in the binder driver. fdget() is used in ksys_ioctl() as a performance optimization. One of the rules associated with fdget() is that ksys_close() must not be called between the fdget() and the fdput(). There is a case where this requirement is not met in the binder driver which results in the reference count dropping to 0 when the device is still in use. This can result in use-after-free or other issues. If userpace has passed a file-descriptor for the binder driver using a BINDER_TYPE_FDA object, then kys_close() is called on it when handling a binder_ioctl(BC_FREE_BUFFER) command. This violates the assumptions for using fdget(). The problem is fixed by deferring the close using task_work_add(). A new variant of __close_fd() was created that returns a struct file with a reference. The fput() is deferred instead of using ksys_close(). Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Todd Kjos <tkjos@google.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-18Merge tag 'spi-nor/for-4.21' of git://git.infradead.org/linux-mtd into mtd/nextBoris Brezillon65-359/+602
Core changes: - Parse the 4BAIT SFDP section - Add a bunch of SPI NOR entries to the flash_info table - Add the concept of SFDP fixups and use it to fix a bug on MX25L25635F - A bunch of minor cleanups/comestic changes
2018-12-18Merge tag 'nand/for-4.21' of git://git.infradead.org/linux-mtd into mtd/nextBoris Brezillon20-109/+200
NAND core changes: - kernel-doc miscellaneous fixes. - Third batch of fixes/cleanup to the raw NAND core impacting various controller drivers (ams-delta, marvell, fsmc, denali, tegra, vf610): * Stopping to pass mtd_info objects to internal functions * Reorganizing code to avoid forward declarations * Dropping useless test in nand_legacy_set_defaults() * Moving nand_exec_op() to internal.h * Adding nand_[de]select_target() helpers * Passing the CS line to be selected in struct nand_operation * Making ->select_chip() optional when ->exec_op() is implemented * Deprecating the ->select_chip() hook * Moving the ->exec_op() method to nand_controller_ops * Moving ->setup_data_interface() to nand_controller_ops * Deprecating the dummy_controller field * Fixing JEDEC detection * Providing a helper for polling GPIO R/B pin Raw NAND chip drivers changes: - Macronix: * Flagging 1.8V AC chips with a broken GET_FEATURES(TIMINGS) Raw NAND controllers drivers changes: - Ams-delta: * Fixing the error path * SPDX tag added * May be compiled with COMPILE_TEST=y * Conversion to ->exec_op() interface * Dropping .IOADDR_R/W use * Use GPIO API for data I/O - Denali: * Removing denali_reset_banks() * Removing ->dev_ready() hook * Including <linux/bits.h> instead of <linux/bitops.h> * Changes to comply with the above fixes/cleanup done in the core. - FSMC: * Adding an SPDX tag to replace the license text * Making conversion from chip to fsmc consistent * Fixing unchecked return value in fsmc_read_page_hwecc * Changes to comply with the above fixes/cleanup done in the core. - Marvell: * Preventing timeouts on a loaded machine (fix) * Changes to comply with the above fixes/cleanup done in the core. - OMAP2: * Pass the parent of pdev to dma_request_chan() (fix) - R852: * Use generic DMA API - sh_flctl: * Converting to SPDX identifiers - Sunxi: * Write pageprog related opcodes to the right register: WCMD_SET (fix) - Tegra: * Stop implementing ->select_chip() - VF610: * Adding an SPDX tag to replace the license text * Changes to comply with the above fixes/cleanup done in the core. - Various trivial/spelling/coding style fixes. SPI-NAND drivers changes: - Removing the depreacated mt29f_spinand driver from staging. - Adding support for: * Toshiba TC58CVG2S0H * GigaDevice GD5FxGQ4xA * Winbond W25N01GV
2018-12-18xfs: Fix x32 ioctls when cmd numbers differ from ia32.Nick Bowler1-3/+15
Several ioctl structs change size between native 32-bit (ia32) and x32 applications, because x32 follows the native 64-bit (amd64) integer alignment rules and uses 64-bit time_t. In these instances, the ioctl number changes so userspace simply gets -ENOTTY. This scenario can be handled by simply adding more cases. Looking at the different ioctls implemented here: - All the ones marked 'No size or alignment issue on any arch' should presumably all be fine. - All the ones under BROKEN_X86_ALIGNMENT are different under integer alignment rules. Since x32 matches amd64 here, we just need both sets of cases handled. - XFS_IOC_SWAPEXT has both integer alignment differences and time_t differences. Since x32 matches amd64 here, we need to add a case which calls the native implementation. - The remaining ioctls have neither 64-bit integers nor time_t, so x32 matches ia32 here and no change is required at this level. The bulkstat ioctl implementations have some pointer chasing which is handled separately. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-18xfs: Fix bulkstat compat ioctls on x32 userspace.Nick Bowler1-4/+30
The bulkstat family of ioctls are problematic on x32, because there is a mixup of native 32-bit and 64-bit conventions. The xfs_fsop_bulkreq struct contains pointers and 32-bit integers so that matches the native 32-bit layout, and that means the ioctl implementation goes into the regular compat path on x32. However, the 'ubuffer' member of that struct in turn refers to either struct xfs_inogrp or xfs_bstat (or an array of these). On x32, those structures match the native 64-bit layout. The compat implementation writes out the 32-bit version of these structures. This is not the expected format for x32 userspace, causing problems. Fortunately the functions which actually output these xfs_inogrp and xfs_bstat structures have an easy way to select which output format is required, so we just need a little tweak to select the right format on x32. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-18xfs: Align compat attrlist_by_handle with native implementation.Nick Bowler1-0/+6
While inspecting the ioctl implementations, I noticed that the compat implementation of XFS_IOC_ATTRLIST_BY_HANDLE does not do exactly the same thing as the native implementation. Specifically, the "cursor" does not appear to be written out to userspace on the compat path, like it is on the native path. This adjusts the compat implementation to copy out the cursor just like the native implementation does. The attrlist cursor does not require any special compat handling. This fixes xfstests xfs/269 on both IA-32 and x32 userspace, when running on an amd64 kernel. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Fixes: 0facef7fb053b ("xfs: in _attrlist_by_handle, copy the cursor back to userspace") Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-18quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls.Javier Barrio1-1/+2
Commit 1fa5efe3622db58cb8c7b9a50665e9eb9a6c7e97 (ext4: Use generic helpers for quotaon and quotaoff) made possible to call quotactl(Q_XQUOTAON/OFF) on ext4 filesystems with sysfile quota support. This leads to calling dquot_enable/disable without s_umount held in excl. mode, because quotactl_cmd_onoff checks only for Q_QUOTAON/OFF. The following WARN_ON_ONCE triggers (in this case for dquot_enable, ext4, latest Linus' tree): [ 117.807056] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: quota,prjquota [...] [ 155.036847] WARNING: CPU: 0 PID: 2343 at fs/quota/dquot.c:2469 dquot_enable+0x34/0xb9 [ 155.036851] Modules linked in: quota_v2 quota_tree ipv6 af_packet joydev mousedev psmouse serio_raw pcspkr i2c_piix4 intel_agp intel_gtt e1000 ttm drm_kms_helper drm agpgart fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_core input_leds kvm_intel kvm irqbypass qemu_fw_cfg floppy evdev parport_pc parport button crc32c_generic dm_mod ata_generic pata_acpi ata_piix libata loop ext4 crc16 mbcache jbd2 usb_storage usbcore sd_mod scsi_mod [ 155.036901] CPU: 0 PID: 2343 Comm: qctl Not tainted 4.20.0-rc6-00025-gf5d582777bcb #9 [ 155.036903] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 155.036911] RIP: 0010:dquot_enable+0x34/0xb9 [ 155.036915] Code: 41 56 41 55 41 54 55 53 4c 8b 6f 28 74 02 0f 0b 4d 8d 7d 70 49 89 fc 89 cb 41 89 d6 89 f5 4c 89 ff e8 23 09 ea ff 85 c0 74 0a <0f> 0b 4c 89 ff e8 8b 09 ea ff 85 db 74 6a 41 8b b5 f8 00 00 00 0f [ 155.036918] RSP: 0018:ffffb09b00493e08 EFLAGS: 00010202 [ 155.036922] RAX: 0000000000000001 RBX: 0000000000000008 RCX: 0000000000000008 [ 155.036924] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9781b67cd870 [ 155.036926] RBP: 0000000000000002 R08: 0000000000000000 R09: 61c8864680b583eb [ 155.036929] R10: ffffb09b00493e48 R11: ffffffffff7ce7d4 R12: ffff9781b7ee8d78 [ 155.036932] R13: ffff9781b67cd800 R14: 0000000000000004 R15: ffff9781b67cd870 [ 155.036936] FS: 00007fd813250b88(0000) GS:ffff9781ba000000(0000) knlGS:0000000000000000 [ 155.036939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 155.036942] CR2: 00007fd812ff61d6 CR3: 000000007c882000 CR4: 00000000000006b0 [ 155.036951] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 155.036953] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 155.036955] Call Trace: [ 155.037004] dquot_quota_enable+0x8b/0xd0 [ 155.037011] kernel_quotactl+0x628/0x74e [ 155.037027] ? do_mprotect_pkey+0x2a6/0x2cd [ 155.037034] __x64_sys_quotactl+0x1a/0x1d [ 155.037041] do_syscall_64+0x55/0xe4 [ 155.037078] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 155.037105] RIP: 0033:0x7fd812fe1198 [ 155.037109] Code: 02 77 0d 48 89 c1 48 c1 e9 3f 75 04 48 8b 04 24 48 83 c4 50 5b c3 48 83 ec 08 49 89 ca 48 63 d2 48 63 ff b8 b3 00 00 00 0f 05 <48> 89 c7 e8 c1 eb ff ff 5a c3 48 63 ff b8 bb 00 00 00 0f 05 48 89 [ 155.037112] RSP: 002b:00007ffe8cd7b050 EFLAGS: 00000206 ORIG_RAX: 00000000000000b3 [ 155.037116] RAX: ffffffffffffffda RBX: 00007ffe8cd7b148 RCX: 00007fd812fe1198 [ 155.037119] RDX: 0000000000000000 RSI: 00007ffe8cd7cea9 RDI: 0000000000580102 [ 155.037121] RBP: 00007ffe8cd7b0f0 R08: 000055fc8eba8a9d R09: 0000000000000000 [ 155.037124] R10: 00007ffe8cd7b074 R11: 0000000000000206 R12: 00007ffe8cd7b168 [ 155.037126] R13: 000055fc8eba8897 R14: 0000000000000000 R15: 0000000000000000 [ 155.037131] ---[ end trace 210f864257175c51 ]--- and then the syscall proceeds without s_umount locking. This patch locks the superblock ->s_umount sem. in exclusive mode for all Q_XQUOTAON/OFF quotactls too in addition to Q_QUOTAON/OFF. AFAICT, other than ext4, only xfs and ocfs2 are affected by this change. The VFS will now call in xfs_quota_* functions with s_umount held, which wasn't the case before. This looks good to me but I can not say for sure. Ext4 and ocfs2 where already beeing called with s_umount exclusive via quota_quotaon/off which is basically the same. Signed-off-by: Javier Barrio <javier.barrio.mart@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-12-18gfs2: take jdata unstuff into account in do_growBob Peterson1-0/+2
Before this patch, function do_grow would not reserve enough journal blocks in the transaction to unstuff jdata files while growing them. This patch adds the logic to add one more block if the file to grow is jdata. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-12-18aio: abstract out io_event filler helperJens Axboe1-4/+10
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18aio: split out iocb copy from io_submit_one()Jens Axboe1-30/+38
In preparation of handing in iocbs in a different fashion as well. Also make it clear that the iocb being passed in isn't modified, by marking it const throughout. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18aio: use iocb_put() instead of open coding itJens Axboe1-2/+1
Replace the percpu_ref_put() + kmem_cache_free() with a call to iocb_put() instead. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18aio: only use blk plugs for > 2 depth submissionsJens Axboe1-4/+14
Plugging is meant to optimize submission of a string of IOs, if we don't have more than 2 being submitted, don't bother setting up a plug. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18aio: don't zero entire aio_kiocb aio_get_req()Jens Axboe1-2/+7
It's 192 bytes, fairly substantial. Most items don't need to be cleared, especially not upfront. Clear the ones we do need to clear, and leave the other ones for setup when the iocb is prepared and submitted. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18aio: separate out ring reservation from req allocationChristoph Hellwig1-13/+17
This is in preparation for certain types of IO not needing a ring reserveration. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18aio: use assigned completion handlerJens Axboe1-1/+1
We know this is a read/write request, but in preparation for having different kinds of those, ensure that we call the assigned handler instead of assuming it's aio_complete_rq(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18Merge branch 'for-4.21/block' into for-4.21/aioJens Axboe6-27/+68
* for-4.21/block: (351 commits) blk-mq: enable IO poll if .nr_queues of type poll > 0 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() blk-mq: skip zero-queue maps in blk_mq_map_swqueue block: fix blk-iolatency accounting underflow blk-mq: fix dispatch from sw queue block: mq-deadline: Fix write completion handling nvme-pci: don't share queue maps blk-mq: only dispatch to non-defauly queue maps if they have queues blk-mq: export hctx->type in debugfs instead of sysfs blk-mq: fix allocation for queue mapping table blk-wbt: export internal state via debugfs blk-mq-debugfs: support rq_qos block: update sysfs documentation block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add() aoe: add __exit annotation block: clear REQ_HIPRI if polling is not supported blk-mq: replace and kill blk_mq_request_issue_directly blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests blk-mq: refactor the code of issue request directly block: remove the bio_integrity_advance export ...
2018-12-18vfs: replace current_kernel_time64 with ktime equivalentArnd Bergmann1-1/+3
current_time is the last remaining caller of current_kernel_time64(), which is a wrapper around ktime_get_coarse_real_ts64(). This calls the latter directly for consistency with the rest of the kernel that is moving to the ktime_get_ family of time accessors, as now documented in Documentation/core-api/timekeeping.rst. An open questions is whether we may want to actually call the more accurate ktime_get_real_ts64() for file systems that save high-resolution timestamps in their on-disk format. This would add a small overhead to each update of the inode stamps but lead to inode timestamps to actually have a usable resolution better than one jiffy (1 to 10 milliseconds normally). Experiments on a variety of hardware platforms show a typical time of around 100 CPU cycles to read the cycle counter and calculate the accurate time from that. On old platforms without a cycle counter, this can be signiciantly higher, up to several microseconds to access a hardware clock, but those have become very rare by now. I traced the original addition of the current_kernel_time() call to set the nanosecond fields back to linux-2.5.48, where Andi Kleen added a patch with subject "nanosecond stat timefields". Andi explains that the motivation was to introduce as little overhead as possible back then. At this time, reading the clock hardware was also more expensive when most architectures did not have a cycle counter. One side effect of having more accurate inode timestamp would be having to write out the inode every time that mtime/ctime/atime get touched on most systems, whereas many file systems today only write it when the timestamps have changed, i.e. at most once per jiffy unless something else changes as well. That change would certainly be noticed in some workloads, which is enough reason to not do it without a good reason, regardless of the cost of reading the time. One thing we could still consider however would be to round the timestamps from current_time() to multiples of NSEC_PER_JIFFY, e.g. full milliseconds rather than having six or seven meaningless but confusing digits at the end of the timestamp. Link: http://lkml.kernel.org/r/20180726130820.4174359-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-12-18exofs_mount(): fix leaks on failure exitsAl Viro1-8/+29
... and don't abuse mount_nodev(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Howells <dhowells@redhat.com>
2018-12-17btrfs: Fix typos in comments and stringsAndrea Gelmini25-69/+70
The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: improve error handling of btrfs_add_linkJohannes Thumshirn1-1/+6
In the error handling block, err holds the return value of either btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked since it's introduction with commit fe66a05a0679 (Btrfs: improve error handling for btrfs_insert_dir_item callers) in 2012. If the error handling in the error handling fails, there's not much left to do and the abort either happened earlier in the callees or is necessary here. So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort the transaction, but still return the original code of the failure stored in 'ret' as this will be reported to the user. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: use generic_remap_file_range_prep() for cloning and deduplicationFilipe Manana1-491/+133
Since cloning and deduplication are no longer Btrfs specific operations, we now have generic code to handle parameter validation, compare file ranges used for deduplication, clear capabilities when cloning, etc. This change makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a few bugs, such as: 1) When cloning, the destination file's capabilities were not dropped (the fstest generic/513 tests this); 2) We were not checking if the destination file is immutable; 3) Not checking if either the source or destination files are swap files (swap file support is coming soon for Btrfs); 4) System limits were not checked (resource limits and O_LARGEFILE). Note that the generic helper generic_remap_file_range_prep() does start and waits for writeback by calling filemap_write_and_wait_range(), however that is not enough for Btrfs for two reasons: 1) With compression, we need to start writeback twice in order to get the pages marked for writeback and ordered extents created; 2) filemap_write_and_wait_range() (and all its other variants) only waits for the IO to complete, but we need to wait for the ordered extents to finish, so that when we do the actual reflinking operations the file extent items are in the fs tree. This is also important due to the fact that the generic helper, for the deduplication case, compares the contents of the pages in the requested range, which might require reading extents from disk in the very unlikely case that pages get invalidated after writeback finishes (so the file extent items must be up to date in the fs tree). Since these reasons are specific to Btrfs we have to do it in the Btrfs code before calling generic_remap_file_range_prep(). This also results in a simpler way of dealing with existing delalloc in the source/target ranges, specially for the deduplication case where we used to lock all the pages first and then if we found any dealloc for the range, or ordered extent, we would unlock the pages trigger writeback and wait for ordered extents to complete, then lock all the pages again and check if deduplication can be done. So now we get a simpler approach: lock the inodes, then trigger writeback and then wait for ordered extents to complete. So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it) to eliminate duplicated code, fix a few bugs and benefit from future bug fixes done there - for example the recent clone and dedupe bugs involving reflinking a partial EOF block got a counterpart fix in the generic helper, since it affected all filesystems supporting these operations, so we no longer need special checks in Btrfs for them. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Refactor main loop in extent_readpagesNikolay Borisov1-20/+14
extent_readpages processes all pages in the readlist in batches of 16, this is implemented by a single for loop but thanks to an if condition the loop does 2 things based on whether we've filled the batch or not. Additionally due to the structure of the code there is an additional check which deals with partial batches. Streamline all of this by explicitly using two loops. The outter one is used to process all pages while the inner one just fills in the batch of 16 (currently). Due to this new structure the code guarantees that all pages are processed in the loop hence the code to deal with any leftovers is eliminated. This also enable the compiler to inline __extent_readpages: ./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160) Function old new delta extent_readpages 476 1136 +660 __extent_readpages 820 - -820 Total: Before=44315, After=44155, chg -0.36% Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove 1st shrink/grow phase from balanceNikolay Borisov1-53/+0
The first step of the rebalance process ensures there is 1MiB free on each device. This number seems rather small. And in fact when talking to the original authors their opinions were: "man that's a little bonkers" "i don't think we even need that code anymore" "I think it was there to make sure we had room for the blank 1M at the beginning. I bet it goes all the way back to v0" "we just don't need any of that tho, i say we just delete it" Clearly, this piece of code has lost its original intent throughout the years. It doesn't really bring any real practical benefits to the relocation process. Additionally, this patch makes the balance process more lightweight by removing a pair of shrink/grow operations which are rather expensive for heavily populated filesystems. This is mainly due to shrink requiring relocating block groups, involving heavy use of the btree. The intermediate shrink/grow can fail and leave the filesystem in a middle state that would need to be changed back by the user. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: send, fix race with transaction commits that create snapshotsFilipe Manana1-6/+23
If we create a snapshot of a snapshot currently being used by a send operation, we can end up with send failing unexpectedly (returning -ENOENT error to user space for example). The following diagram shows how this happens. CPU 1 CPU2 CPU3 btrfs_ioctl_send() (...) create_snapshot() -> creates snapshot of a root used by the send task btrfs_commit_transaction() create_pending_snapshot() __get_inode_info() btrfs_search_slot() btrfs_search_slot_get_root() down_read commit_root_sem get reference on eb of the commit root -> eb with bytenr == X up_read commit_root_sem btrfs_cow_block(root node) btrfs_free_tree_block() -> creates delayed ref to free the extent btrfs_run_delayed_refs() -> runs the delayed ref, adds extent to fs_info->pinned_extents btrfs_finish_extent_commit() unpin_extent_range() -> marks extent as free in the free space cache transaction commit finishes btrfs_start_transaction() (...) btrfs_cow_block() btrfs_alloc_tree_block() btrfs_reserve_extent() -> allocates extent at bytenr == X btrfs_init_new_buffer(bytenr X) btrfs_find_create_tree_block() alloc_extent_buffer(bytenr X) find_extent_buffer(bytenr X) -> returns existing eb, which the send task got (...) -> modifies content of the eb with bytenr == X -> uses an eb that now belongs to some other tree and no more matches the commit root of the snapshot, resuts will be unpredictable The consequences of this race can be various, and can lead to searches in the commit root performed by the send task failing unexpectedly (unable to find inode items, returning -ENOENT to user space, for example) or not failing because an inode item with the same number was added to the tree that reused the metadata extent, in which case send can behave incorrectly in the worst case or just fail later for some reason. Fix this by performing a copy of the commit root's extent buffer when doing a search in the context of a send operation. CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e2e9: Btrfs: move get root out of btrfs_search_slot to a helper CC: stable@vger.kernel.org # 4.4.x: f9ddfd0592a: Btrfs: remove unused check of skip_locking CC: stable@vger.kernel.org # 4.4.x Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>