summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-08-13fscrypt: allow unprivileged users to add/remove keys for v2 policiesEric Biggers3-33/+336
Allow the FS_IOC_ADD_ENCRYPTION_KEY and FS_IOC_REMOVE_ENCRYPTION_KEY ioctls to be used by non-root users to add and remove encryption keys from the filesystem-level crypto keyrings, subject to limitations. Motivation: while privileged fscrypt key management is sufficient for some users (e.g. Android and Chromium OS, where a privileged process manages all keys), the old API by design also allows non-root users to set up and use encrypted directories, and we don't want to regress on that. Especially, we don't want to force users to continue using the old API, running into the visibility mismatch between files and keyrings and being unable to "lock" encrypted directories. Intuitively, the ioctls have to be privileged since they manipulate filesystem-level state. However, it's actually safe to make them unprivileged if we very carefully enforce some specific limitations. First, each key must be identified by a cryptographic hash so that a user can't add the wrong key for another user's files. For v2 encryption policies, we use the key_identifier for this. v1 policies don't have this, so managing keys for them remains privileged. Second, each key a user adds is charged to their quota for the keyrings service. Thus, a user can't exhaust memory by adding a huge number of keys. By default each non-root user is allowed up to 200 keys; this can be changed using the existing sysctl 'kernel.keys.maxkeys'. Third, if multiple users add the same key, we keep track of those users of the key (of which there remains a single copy), and won't really remove the key, i.e. "lock" the encrypted files, until all those users have removed it. This prevents denial of service attacks that would be possible under simpler schemes, such allowing the first user who added a key to remove it -- since that could be a malicious user who has compromised the key. Of course, encryption keys should be kept secret, but the idea is that using encryption should never be *less* secure than not using encryption, even if your key was compromised. We tolerate that a user will be unable to really remove a key, i.e. unable to "lock" their encrypted files, if another user has added the same key. But in a sense, this is actually a good thing because it will avoid providing a false notion of security where a key appears to have been removed when actually it's still in memory, available to any attacker who compromises the operating system kernel. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: v2 encryption policy supportEric Biggers7-185/+684
Add a new fscrypt policy version, "v2". It has the following changes from the original policy version, which we call "v1" (*): - Master keys (the user-provided encryption keys) are only ever used as input to HKDF-SHA512. This is more flexible and less error-prone, and it avoids the quirks and limitations of the AES-128-ECB based KDF. Three classes of cryptographically isolated subkeys are defined: - Per-file keys, like used in v1 policies except for the new KDF. - Per-mode keys. These implement the semantics of the DIRECT_KEY flag, which for v1 policies made the master key be used directly. These are also planned to be used for inline encryption when support for it is added. - Key identifiers (see below). - Each master key is identified by a 16-byte master_key_identifier, which is derived from the key itself using HKDF-SHA512. This prevents users from associating the wrong key with an encrypted file or directory. This was easily possible with v1 policies, which identified the key by an arbitrary 8-byte master_key_descriptor. - The key must be provided in the filesystem-level keyring, not in a process-subscribed keyring. The following UAPI additions are made: - The existing ioctl FS_IOC_SET_ENCRYPTION_POLICY can now be passed a fscrypt_policy_v2 to set a v2 encryption policy. It's disambiguated from fscrypt_policy/fscrypt_policy_v1 by the version code prefix. - A new ioctl FS_IOC_GET_ENCRYPTION_POLICY_EX is added. It allows getting the v1 or v2 encryption policy of an encrypted file or directory. The existing FS_IOC_GET_ENCRYPTION_POLICY ioctl could not be used because it did not have a way for userspace to indicate which policy structure is expected. The new ioctl includes a size field, so it is extensible to future fscrypt policy versions. - The ioctls FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY, and FS_IOC_GET_ENCRYPTION_KEY_STATUS now support managing keys for v2 encryption policies. Such keys are kept logically separate from keys for v1 encryption policies, and are identified by 'identifier' rather than by 'descriptor'. The 'identifier' need not be provided when adding a key, since the kernel will calculate it anyway. This patch temporarily keeps adding/removing v2 policy keys behind the same permission check done for adding/removing v1 policy keys: capable(CAP_SYS_ADMIN). However, the next patch will carefully take advantage of the cryptographically secure master_key_identifier to allow non-root users to add/remove v2 policy keys, thus providing a full replacement for v1 policies. (*) Actually, in the API fscrypt_policy::version is 0 while on-disk fscrypt_context::format is 1. But I believe it makes the most sense to advance both to '2' to have them be in sync, and to consider the numbering to start at 1 except for the API quirk. Reviewed-by: Paul Crowley <paulcrowley@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: add an HKDF-SHA512 implementationEric Biggers4-0/+199
Add an implementation of HKDF (RFC 5869) to fscrypt, for the purpose of deriving additional key material from the fscrypt master keys for v2 encryption policies. HKDF is a key derivation function built on top of HMAC. We choose SHA-512 for the underlying unkeyed hash, and use an "hmac(sha512)" transform allocated from the crypto API. We'll be using this to replace the AES-ECB based KDF currently used to derive the per-file encryption keys. While the AES-ECB based KDF is believed to meet the original security requirements, it is nonstandard and has problems that don't exist in modern KDFs such as HKDF: 1. It's reversible. Given a derived key and nonce, an attacker can easily compute the master key. This is okay if the master key and derived keys are equally hard to compromise, but now we'd like to be more robust against threats such as a derived key being compromised through a timing attack, or a derived key for an in-use file being compromised after the master key has already been removed. 2. It doesn't evenly distribute the entropy from the master key; each 16 input bytes only affects the corresponding 16 output bytes. 3. It isn't easily extensible to deriving other values or keys, such as a public hash for securely identifying the key, or per-mode keys. Per-mode keys will be immediately useful for Adiantum encryption, for which fscrypt currently uses the master key directly, introducing unnecessary usage constraints. Per-mode keys will also be useful for hardware inline encryption, which is currently being worked on. HKDF solves all the above problems. Reviewed-by: Paul Crowley <paulcrowley@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctlEric Biggers1-0/+63
Add a new fscrypt ioctl, FS_IOC_GET_ENCRYPTION_KEY_STATUS. Given a key specified by 'struct fscrypt_key_specifier' (the same way a key is specified for the other fscrypt key management ioctls), it returns status information in a 'struct fscrypt_get_key_status_arg'. The main motivation for this is that applications need to be able to check whether an encrypted directory is "unlocked" or not, so that they can add the key if it is not, and avoid adding the key (which may involve prompting the user for a passphrase) if it already is. It's possible to use some workarounds such as checking whether opening a regular file fails with ENOKEY, or checking whether the filenames "look like gibberish" or not. However, no workaround is usable in all cases. Like the other key management ioctls, the keyrings syscalls may seem at first to be a good fit for this. Unfortunately, they are not. Even if we exposed the keyring ID of the ->s_master_keys keyring and gave everyone Search permission on it (note: currently the keyrings permission system would also allow everyone to "invalidate" the keyring too), the fscrypt keys have an additional state that doesn't map cleanly to the keyrings API: the secret can be removed, but we can be still tracking the files that were using the key, and the removal can be re-attempted or the secret added again. After later patches, some applications will also need a way to determine whether a key was added by the current user vs. by some other user. Reserved fields are included in fscrypt_get_key_status_arg for this and other future extensions. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctlEric Biggers3-5/+411
Add a new fscrypt ioctl, FS_IOC_REMOVE_ENCRYPTION_KEY. This ioctl removes an encryption key that was added by FS_IOC_ADD_ENCRYPTION_KEY. It wipes the secret key itself, then "locks" the encrypted files and directories that had been unlocked using that key -- implemented by evicting the relevant dentries and inodes from the VFS caches. The problem this solves is that many fscrypt users want the ability to remove encryption keys, causing the corresponding encrypted directories to appear "locked" (presented in ciphertext form) again. Moreover, users want removing an encryption key to *really* remove it, in the sense that the removed keys cannot be recovered even if kernel memory is compromised, e.g. by the exploit of a kernel security vulnerability or by a physical attack. This is desirable after a user logs out of the system, for example. In many cases users even already assume this to be the case and are surprised to hear when it's not. It is not sufficient to simply unlink the master key from the keyring (or to revoke or invalidate it), since the actual encryption transform objects are still pinned in memory by their inodes. Therefore, to really remove a key we must also evict the relevant inodes. Currently one workaround is to run 'sync && echo 2 > /proc/sys/vm/drop_caches'. But, that evicts all unused inodes in the system rather than just the inodes associated with the key being removed, causing severe performance problems. Moreover, it requires root privileges, so regular users can't "lock" their encrypted files. Another workaround, used in Chromium OS kernels, is to add a new VFS-level ioctl FS_IOC_DROP_CACHE which is a more restricted version of drop_caches that operates on a single super_block. It does: shrink_dcache_sb(sb); invalidate_inodes(sb, false); But it's still a hack. Yet, the major users of filesystem encryption want this feature badly enough that they are actually using these hacks. To properly solve the problem, start maintaining a list of the inodes which have been "unlocked" using each master key. Originally this wasn't possible because the kernel didn't keep track of in-use master keys at all. But, with the ->s_master_keys keyring it is now possible. Then, add an ioctl FS_IOC_REMOVE_ENCRYPTION_KEY. It finds the specified master key in ->s_master_keys, then wipes the secret key itself, which prevents any additional inodes from being unlocked with the key. Then, it syncs the filesystem and evicts the inodes in the key's list. The normal inode eviction code will free and wipe the per-file keys (in ->i_crypt_info). Note that freeing ->i_crypt_info without evicting the inodes was also considered, but would have been racy. Some inodes may still be in use when a master key is removed, and we can't simply revoke random file descriptors, mmap's, etc. Thus, the ioctl simply skips in-use inodes, and returns -EBUSY to indicate that some inodes weren't evicted. The master key *secret* is still removed, but the fscrypt_master_key struct remains to keep track of the remaining inodes. Userspace can then retry the ioctl to evict the remaining inodes. Alternatively, if userspace adds the key again, the refreshed secret will be associated with the existing list of inodes so they remain correctly tracked for future key removals. The ioctl doesn't wipe pagecache pages. Thus, we tolerate that after a kernel compromise some portions of plaintext file contents may still be recoverable from memory. This can be solved by enabling page poisoning system-wide, which security conscious users may choose to do. But it's very difficult to solve otherwise, e.g. note that plaintext file contents may have been read in other places than pagecache pages. Like FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY is initially restricted to privileged users only. This is sufficient for some use cases, but not all. A later patch will relax this restriction, but it will require introducing key hashes, among other changes. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctlEric Biggers6-3/+393
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an encryption key to the filesystem's fscrypt keyring ->s_master_keys, making any files encrypted with that key appear "unlocked". Why we need this ~~~~~~~~~~~~~~~~ The main problem is that the "locked/unlocked" (ciphertext/plaintext) status of encrypted files is global, but the fscrypt keys are not. fscrypt only looks for keys in the keyring(s) the process accessing the filesystem is subscribed to: the thread keyring, process keyring, and session keyring, where the session keyring may contain the user keyring. Therefore, userspace has to put fscrypt keys in the keyrings for individual users or sessions. But this means that when a process with a different keyring tries to access encrypted files, whether they appear "unlocked" or not is nondeterministic. This is because it depends on whether the files are currently present in the inode cache. Fixing this by consistently providing each process its own view of the filesystem depending on whether it has the key or not isn't feasible due to how the VFS caches work. Furthermore, while sometimes users expect this behavior, it is misguided for two reasons. First, it would be an OS-level access control mechanism largely redundant with existing access control mechanisms such as UNIX file permissions, ACLs, LSMs, etc. Encryption is actually for protecting the data at rest. Second, almost all users of fscrypt actually do need the keys to be global. The largest users of fscrypt, Android and Chromium OS, achieve this by having PID 1 create a "session keyring" that is inherited by every process. This works, but it isn't scalable because it prevents session keyrings from being used for any other purpose. On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't similarly abuse the session keyring, so to make 'sudo' work on all systems it has to link all the user keyrings into root's user keyring [2]. This is ugly and raises security concerns. Moreover it can't make the keys available to system services, such as sshd trying to access the user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to read certificates from the user's home directory (see [5]); or to Docker containers (see [6], [7]). By having an API to add a key to the *filesystem* we'll be able to fix the above bugs, remove userspace workarounds, and clearly express the intended semantics: the locked/unlocked status of an encrypted directory is global, and encryption is orthogonal to OS-level access control. Why not use the add_key() syscall ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We use an ioctl for this API rather than the existing add_key() system call because the ioctl gives us the flexibility needed to implement fscrypt-specific semantics that will be introduced in later patches: - Supporting key removal with the semantics such that the secret is removed immediately and any unused inodes using the key are evicted; also, the eviction of any in-use inodes can be retried. - Calculating a key-dependent cryptographic identifier and returning it to userspace. - Allowing keys to be added and removed by non-root users, but only keys for v2 encryption policies; and to prevent denial-of-service attacks, users can only remove keys they themselves have added, and a key is only really removed after all users who added it have removed it. Trying to shoehorn these semantics into the keyrings syscalls would be very difficult, whereas the ioctls make things much easier. However, to reuse code the implementation still uses the keyrings service internally. Thus we get lockless RCU-mode key lookups without having to re-implement it, and the keys automatically show up in /proc/keys for debugging purposes. References: [1] https://github.com/google/fscrypt [2] https://goo.gl/55cCrI#heading=h.vf09isp98isb [3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939 [4] https://github.com/google/fscrypt/issues/116 [5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715 [6] https://github.com/google/fscrypt/issues/128 [7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: rename keyinfo.c to keysetup.cEric Biggers3-2/+2
Rename keyinfo.c to keysetup.c since this better describes what the file does (sets up the key), and it matches the new file keysetup_v1.c. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: move v1 policy key setup to keysetup_v1.cEric Biggers4-322/+369
In preparation for introducing v2 encryption policies which will find and derive encryption keys differently from the current v1 encryption policies, move the v1 policy-specific key setup code from keyinfo.c into keysetup_v1.c. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: refactor key setup code in preparation for v2 policiesEric Biggers2-112/+146
Do some more refactoring of the key setup code, in preparation for introducing a filesystem-level keyring and v2 encryption policies: - Now that ci_inode exists, don't pass around the inode unnecessarily. - Define a function setup_file_encryption_key() which handles the crypto key setup given an under-construction fscrypt_info. Don't pass the fscrypt_context, since everything is in the fscrypt_info. [This will be extended for v2 policies and the fs-level keyring.] - Define a function fscrypt_set_derived_key() which sets the per-file key, without depending on anything specific to v1 policies. [This will also be used for v2 policies.] - Define a function fscrypt_setup_v1_file_key() which takes the raw master key, thus separating finding the key from using it. [This will also be used if the key is found in the fs-level keyring.] Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: rename fscrypt_master_key to fscrypt_direct_keyEric Biggers2-69/+68
In preparation for introducing a filesystem-level keyring which will contain fscrypt master keys, rename the existing 'struct fscrypt_master_key' to 'struct fscrypt_direct_key'. This is the structure in the existing table of master keys that's maintained to deduplicate the crypto transforms for v1 DIRECT_KEY policies. I've chosen to keep this table as-is rather than make it automagically add/remove the keys to/from the filesystem-level keyring, since that would add a lot of extra complexity to the filesystem-level keyring. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: add ->ci_inode to fscrypt_infoEric Biggers2-0/+5
Add an inode back-pointer to 'struct fscrypt_info', such that inode->i_crypt_info->ci_inode == inode. This will be useful for: 1. Evicting the inodes when a fscrypt key is removed, since we'll track the inodes using a given key by linking their fscrypt_infos together, rather than the inodes directly. This avoids bloating 'struct inode' with a new list_head. 2. Simplifying the per-file key setup, since the inode pointer won't have to be passed around everywhere just in case something goes wrong and it's needed for fscrypt_warn(). Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: use FSCRYPT_* definitions, not FS_*Eric Biggers5-43/+44
Update fs/crypto/ to use the new names for the UAPI constants rather than the old names, then make the old definitions conditional on !__KERNEL__. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: use ENOPKG when crypto API support missingEric Biggers1-9/+11
Return ENOPKG rather than ENOENT when trying to open a file that's encrypted using algorithms not available in the kernel's crypto API. This avoids an ambiguity, since ENOENT is also returned when the file doesn't exist. Note: this is the same approach I'm taking for fs-verity. Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: improve warnings for missing crypto API supportEric Biggers1-5/+14
Users of fscrypt with non-default algorithms will encounter an error like the following if they fail to include the needed algorithms into the crypto API when configuring the kernel (as per the documentation): Error allocating 'adiantum(xchacha12,aes)' transform: -2 This requires that the user figure out what the "-2" error means. Make it more friendly by printing a warning like the following instead: Missing crypto API support for Adiantum (API name: "adiantum(xchacha12,aes)") Also upgrade the log level for *other* errors to KERN_ERR. Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: improve warning messages for unsupported encryption contextsEric Biggers1-3/+15
When fs/crypto/ encounters an inode with an invalid encryption context, currently it prints a warning if the pair of encryption modes are unrecognized, but it's silent if there are other problems such as unsupported context size, format, or flags. To help people debug such situations, add more warning messages. Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: make fscrypt_msg() take inode instead of super_blockEric Biggers5-35/+28
Most of the warning and error messages in fs/crypto/ are for situations related to a specific inode, not merely to a super_block. So to make things easier, make fscrypt_msg() take an inode rather than a super_block, and make it print the inode number. Note: This is the same approach I'm taking for fsverity_msg(). Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: clean up base64 encoding/decodingEric Biggers1-17/+17
Some minor cleanups for the code that base64 encodes and decodes encrypted filenames and long name digests: - Rename "digest_{encode,decode}()" => "base64_{encode,decode}()" since they are used for filenames too, not just for long name digests. - Replace 'while' loops with more conventional 'for' loops. - Use 'u8' for binary data. Keep 'char' for string data. - Fully constify the lookup table (pointer was not const). - Improve comment. No actual change in behavior. Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-13fscrypt: remove loadable module related codeEric Biggers3-26/+1
Since commit 643fa9612bf1 ("fscrypt: remove filesystem specific build config option"), fs/crypto/ can no longer be built as a loadable module. Thus it no longer needs a module_exit function, nor a MODULE_LICENSE. So remove them, and change module_init to late_initcall. Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-03Merge tag 'xfs-5.3-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2-1/+8
Pull xfs fixes from Darrick Wong: - Avoid leaking kernel stack contents to userspace - Fix a potential null pointer dereference in the dabtree scrub code * tag 'xfs-5.3-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Fix possible null-pointer dereferences in xchk_da_btree_block_check_sibling() xfs: fix stack contents leakage in the v1 inumber ioctls
2019-08-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds2-8/+39
Merge misc fixes from Andrew Morton: "17 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: drivers/acpi/scan.c: document why we don't need the device_hotplug_lock memremap: move from kernel/ to mm/ lib/test_meminit.c: use GFP_ATOMIC in RCU critical section asm-generic: fix -Wtype-limits compiler warnings cgroup: kselftest: relax fs_spec checks mm/memory_hotplug.c: remove unneeded return for void function mm/migrate.c: initialize pud_entry in migrate_vma() coredump: split pipe command whitespace before expanding template page flags: prioritize kasan bits over last-cpuid ubsan: build ubsan.c more conservatively kasan: remove clang version check for KASAN_STACK mm: compaction: avoid 100% CPU usage during compaction when a task is killed mm: migrate: fix reference check race between __find_get_block() and migration mm: vmscan: check if mem cgroup is disabled or not before calling memcg slab shrinker ocfs2: remove set but not used variable 'last_hash' Revert "kmemleak: allow to coexist with fault injection" kernel/signal.c: fix a kernel-doc markup
2019-08-03coredump: split pipe command whitespace before expanding templatePaul Wise1-5/+39
Save the offsets of the start of each argument to avoid having to update pointers to each argument after every corename krealloc and to avoid having to duplicate the memory for the dump command. Executable names containing spaces were previously being expanded from %e or %E and then split in the middle of the filename. This is incorrect behaviour since an argument list can represent arguments with spaces. The splitting could lead to extra arguments being passed to the core dump handler that it might have interpreted as options or ignored completely. Core dump handlers that are not aware of this Linux kernel issue will be using %e or %E without considering that it may be split and so they will be vulnerable to processes with spaces in their names breaking their argument list. If their internals are otherwise well written, such as if they are written in shell but quote arguments, they will work better after this change than before. If they are not well written, then there is a slight chance of breakage depending on the details of the code but they will already be fairly broken by the split filenames. Core dump handlers that are aware of this Linux kernel issue will be placing %e or %E as the last item in their core_pattern and then aggregating all of the remaining arguments into one, separated by spaces. Alternatively they will be obtaining the filename via other methods. Both of these will be compatible with the new arrangement. A side effect from this change is that unknown template types (for example %z) result in an empty argument to the dump handler instead of the argument being dropped. This is a desired change as: It is easier for dump handlers to process empty arguments than dropped ones, especially if they are written in shell or don't pass each template item with a preceding command-line option in order to differentiate between individual template types. Most core_patterns in the wild do not use options so they can confuse different template types (especially numeric ones) if an earlier one gets dropped in old kernels. If the kernel introduces a new template type and a core_pattern uses it, the core dump handler might not expect that the argument can be dropped in old kernels. For example, this can result in security issues when %d is dropped in old kernels. This happened with the corekeeper package in Debian and resulted in the interface between corekeeper and Linux having to be rewritten to use command-line options to differentiate between template types. The core_pattern for most core dump handlers is written by the handler author who would generally not insert unknown template types so this change should be compatible with all the core dump handlers that exist. Link: http://lkml.kernel.org/r/20190528051142.24939-1-pabs3@bonedaddy.net Fixes: 74aadce98605 ("core_pattern: allow passing of arguments to user mode helper when core_pattern is a pipe") Signed-off-by: Paul Wise <pabs3@bonedaddy.net> Reported-by: Jakub Wilk <jwilk@jwilk.net> [https://bugs.debian.org/924398] Reported-by: Paul Wise <pabs3@bonedaddy.net> [https://lore.kernel.org/linux-fsdevel/c8b7ecb8508895bf4adb62a748e2ea2c71854597.camel@bonedaddy.net/] Suggested-by: Jakub Wilk <jwilk@jwilk.net> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03ocfs2: remove set but not used variable 'last_hash'YueHaibing1-3/+0
Fixes gcc '-Wunused-but-set-variable' warning: fs/ocfs2/xattr.c: In function ocfs2_xattr_bucket_find: fs/ocfs2/xattr.c:3828:6: warning: variable last_hash set but not used [-Wunused-but-set-variable] It's never used and can be removed. Link: http://lkml.kernel.org/r/20190716132110.34836-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03Merge tag 'for-linus-20190802' of git://git.kernel.dk/linux-blockLinus Torvalds2-27/+62
Pull block fixes from Jens Axboe: "Here's a small collection of fixes that should go into this series. This contains: - io_uring potential use-after-free fix (Jackie) - loop regression fix (Jan) - O_DIRECT fragmented bio regression fix (Damien) - Mark Denis as the new floppy maintainer (Denis) - ataflop switch fall-through annotation (Gustavo) - libata zpodd overflow fix (Kees) - libata ahci deferred probe fix (Miquel) - nbd invalidation BUG_ON() fix (Munehisa) - dasd endless loop fix (Stefan)" * tag 'for-linus-20190802' of git://git.kernel.dk/linux-block: s390/dasd: fix endless loop after read unit address configuration block: Fix __blkdev_direct_IO() for bio fragments MAINTAINERS: floppy: take over maintainership nbd: replace kill_bdev() with __invalidate_device() again ata: libahci: do not complain in case of deferred probe io_uring: fix KASAN use after free in io_sq_wq_submit_work loop: Fix mount(2) failure due to race with LOOP_SET_FD libata: zpodd: Fix small read overflow in zpodd_get_mech_type() ataflop: Mark expected switch fall-through
2019-08-03Merge tag 'for-5.3-rc2-tag' of ↵Linus Torvalds4-67/+47
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - tiny race window during 2 transactions aborting at the same time can accidentally lead to a commit - regression fix, possible deadlock during fiemap - fix for an old bug when incremental send can fail on a file that has been deduplicated in a special way * tag 'for-5.3-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix deadlock between fiemap and transaction commits Btrfs: fix race leading to fs corruption after transaction abort Btrfs: fix incremental send failure after deduplication
2019-08-02Merge tag 'gfs2-v5.3-rc2.fixes' of ↵Linus Torvalds1-3/+12
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fix from Andreas Gruenbacher: "Fix gfs2 cluster coherency bug" * tag 'gfs2-v5.3-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Inode dirtying fix
2019-08-01block: Fix __blkdev_direct_IO() for bio fragmentsDamien Le Moal1-1/+2
The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO (patch 6a43074e2f46) introduced two problems with BIO fragment handling for direct IOs: 1) The dio size processed is calculated by incrementing the ret variable by the size of the bio fragment issued for the dio. However, this size is obtained directly from bio->bi_iter.bi_size AFTER the bio submission which may result in referencing the bi_size value after the bio completed, resulting in an incorrect value use. 2) The ret variable is not incremented by the size of the last bio fragment issued for the bio, leading to an invalid IO size being returned to the user. Fix both problem by using dio->size (which is incremented before the bio submission) to update the value of ret after bio submissions, including for the last bio fragment issued. Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO") Reported-by: Masato Suzuki <masato.suzuki@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-31Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-4/+1
Pull mount_capable() fix from Al Viro. * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Unbreak mount_capable()
2019-07-31gfs2: Inode dirtying fixAndreas Gruenbacher1-3/+12
With the recent iomap write page reclaim deadlock fix, it turns out that the GLF_DIRTY flag isn't always set when it needs to be anymore: previously, this happened as a side effect of always adding the inode buffer head to the current transaction with gfs2_trans_add_meta, but this isn't happening consistently anymore. Fix by removing an additional unnecessary gfs2_trans_add_meta call and by setting the GLF_DIRTY flag in gfs2_iomap_end. (The GLF_DIRTY flag causes inode_go_sync to flush the transaction log when syncing out the glock of that inode. When the flag isn't set, inode_go_sync will skip inodes, including ones with an i_state of I_DIRTY_PAGES, which will lead to cluster incoherency.) In addition, in gfs2_iomap_page_done, if the metadata has changed, mark the inode as I_DIRTY_DATASYNC to have the inode added to the current transaction: we don't expect metadata to change here, but let's err on the safe side. Fixes: d0a22a4b03b8 ("gfs2: Fix iomap write page reclaim deadlock"); Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-07-31Unbreak mount_capable()Al Viro1-4/+1
In "consolidate the capability checks in sget_{fc,userns}())" the wrong argument had been passed to mount_capable() by sget_fc(). That mistake had been further obscured later, when switching mount_capable() to fs_context has moved the calculation of bogus argument from sget_fc() to mount_capable() itself. It should've been fc->user_ns all along. Screwed-up-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: Christian Brauner <christian@brauner.io> Tested-by: Christian Brauner <christian@brauner.io> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-31io_uring: fix KASAN use after free in io_sq_wq_submit_workJackie Liu1-1/+2
[root@localhost ~]# ./liburing/test/link QEMU Standard PC report that: [ 29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622f1d2-dirty #86 [ 29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work [ 29.379929] Call Trace: [ 29.379953] dump_stack+0xa9/0x10e [ 29.379970] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.379986] print_address_description.cold.6+0x9/0x317 [ 29.379999] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380010] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380026] __kasan_report.cold.7+0x1a/0x34 [ 29.380044] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380061] kasan_report+0xe/0x12 [ 29.380076] io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380104] ? io_sq_thread+0xaf0/0xaf0 [ 29.380152] process_one_work+0xb59/0x19e0 [ 29.380184] ? pwq_dec_nr_in_flight+0x2c0/0x2c0 [ 29.380221] worker_thread+0x8c/0xf40 [ 29.380248] ? __kthread_parkme+0xab/0x110 [ 29.380265] ? process_one_work+0x19e0/0x19e0 [ 29.380278] kthread+0x30b/0x3d0 [ 29.380292] ? kthread_create_on_node+0xe0/0xe0 [ 29.380311] ret_from_fork+0x3a/0x50 [ 29.380635] Allocated by task 209: [ 29.381255] save_stack+0x19/0x80 [ 29.381268] __kasan_kmalloc.constprop.6+0xc1/0xd0 [ 29.381279] kmem_cache_alloc+0xc0/0x240 [ 29.381289] io_submit_sqe+0x11bc/0x1c70 [ 29.381300] io_ring_submit+0x174/0x3c0 [ 29.381311] __x64_sys_io_uring_enter+0x601/0x780 [ 29.381322] do_syscall_64+0x9f/0x4d0 [ 29.381336] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 29.381633] Freed by task 84: [ 29.382186] save_stack+0x19/0x80 [ 29.382198] __kasan_slab_free+0x11d/0x160 [ 29.382210] kmem_cache_free+0x8c/0x2f0 [ 29.382220] io_put_req+0x22/0x30 [ 29.382230] io_sq_wq_submit_work+0x28b/0xe90 [ 29.382241] process_one_work+0xb59/0x19e0 [ 29.382251] worker_thread+0x8c/0xf40 [ 29.382262] kthread+0x30b/0x3d0 [ 29.382272] ret_from_fork+0x3a/0x50 [ 29.382569] The buggy address belongs to the object at ffff888067172140 which belongs to the cache io_kiocb of size 224 [ 29.384692] The buggy address is located 120 bytes inside of 224-byte region [ffff888067172140, ffff888067172220) [ 29.386723] The buggy address belongs to the page: [ 29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0 [ 29.387587] flags: 0x100000000000200(slab) [ 29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180 [ 29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 [ 29.387624] page dumped because: kasan: bad access detected [ 29.387920] Memory state around the buggy address: [ 29.388771] ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 29.390062] ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 29.392578] ^ [ 29.393480] ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc [ 29.394744] ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 29.396003] ================================================================== [ 29.397260] Disabling lock debugging due to kernel taint io_sq_wq_submit_work free and read req again. Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Cc: linux-block@vger.kernel.org Cc: stable@vger.kernel.org Fixes: f7b76ac9d17e ("io_uring: fix counter inc/dec mismatch in async_list") Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-31Merge branch 'dax-fix-5.3-rc3' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax fix from Dan Williams: "Fix a botched manual patch update that got dropped between testing and application" * 'dax-fix-5.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Fix missed wakeup in put_unlocked_entry()
2019-07-30Merge tag 'f2fs-for-5.4-rc3' of ↵Linus Torvalds3-100/+81
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: "This set of patches adjust to follow recent setflags changes and fix two regressions" * tag 'f2fs-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: use EINVAL for superblock with invalid magic f2fs: fix to read source block before invalidating it f2fs: remove redundant check from f2fs_setflags_common() f2fs: use generic checking function for FS_IOC_FSSETXATTR f2fs: use generic checking and prep function for FS_IOC_SETFLAGS
2019-07-30loop: Fix mount(2) failure due to race with LOOP_SET_FDJan Kara1-25/+58
Commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") made LOOP_SET_FD ioctl acquire exclusive block device reference while it updates loop device binding. However this can make perfectly valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding temporarily the exclusive bdev reference in cases like this: for i in {a..z}{a..z}; do dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024 mkfs.ext2 $i.image mkdir mnt$i done echo "Run" for i in {a..z}{a..z}; do mount -o loop -t ext2 $i.image mnt$i & done Fix the problem by not getting full exclusive bdev reference in LOOP_SET_FD but instead just mark the bdev as being claimed while we update the binding information. This just blocks new exclusive openers instead of failing them with EBUSY thus fixing the problem. Fixes: 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") Cc: stable@vger.kernel.org Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-30xfs: Fix possible null-pointer dereferences in ↵Jia-Ju Bai1-1/+5
xchk_da_btree_block_check_sibling() In xchk_da_btree_block_check_sibling(), there is an if statement on line 274 to check whether ds->state->altpath.blk[level].bp is NULL: if (ds->state->altpath.blk[level].bp) When ds->state->altpath.blk[level].bp is NULL, it is used on line 281: xfs_trans_brelse(..., ds->state->altpath.blk[level].bp); struct xfs_buf_log_item *bip = bp->b_log_item; ASSERT(bp->b_transp == tp); Thus, possible null-pointer dereferences may occur. To fix these bugs, ds->state->altpath.blk[level].bp is checked before being used. These bugs are found by a static analysis tool STCheck written by us. Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-07-30Btrfs: fix deadlock between fiemap and transaction commitsFilipe Manana3-5/+22
The fiemap handler locks a file range that can have unflushed delalloc, and after locking the range, it tries to attach to a running transaction. If the running transaction started its commit, that is, it is in state TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the flushoncommit option or the transaction is creating a snapshot for the subvolume that contains the file that fiemap is operating on, we end up deadlocking. This happens because fiemap is blocked on the transaction, waiting for it to complete, and the transaction is waiting for the flushed dealloc to complete, which requires locking the file range that the fiemap task already locked. The following stack traces serve as an example of when this deadlock happens: (...) [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [404571.515956] Call Trace: [404571.516360] ? __schedule+0x3ae/0x7b0 [404571.516730] schedule+0x3a/0xb0 [404571.517104] lock_extent_bits+0x1ec/0x2a0 [btrfs] [404571.517465] ? remove_wait_queue+0x60/0x60 [404571.517832] btrfs_finish_ordered_io+0x292/0x800 [btrfs] [404571.518202] normal_work_helper+0xea/0x530 [btrfs] [404571.518566] process_one_work+0x21e/0x5c0 [404571.518990] worker_thread+0x4f/0x3b0 [404571.519413] ? process_one_work+0x5c0/0x5c0 [404571.519829] kthread+0x103/0x140 [404571.520191] ? kthread_create_worker_on_cpu+0x70/0x70 [404571.520565] ret_from_fork+0x3a/0x50 [404571.520915] kworker/u8:6 D 0 31651 2 0x80004000 [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs] (...) [404571.537000] fsstress D 0 13117 13115 0x00004000 [404571.537263] Call Trace: [404571.537524] ? __schedule+0x3ae/0x7b0 [404571.537788] schedule+0x3a/0xb0 [404571.538066] wait_current_trans+0xc8/0x100 [btrfs] [404571.538349] ? remove_wait_queue+0x60/0x60 [404571.538680] start_transaction+0x33c/0x500 [btrfs] [404571.539076] btrfs_check_shared+0xa3/0x1f0 [btrfs] [404571.539513] ? extent_fiemap+0x2ce/0x650 [btrfs] [404571.539866] extent_fiemap+0x2ce/0x650 [btrfs] [404571.540170] do_vfs_ioctl+0x526/0x6f0 [404571.540436] ksys_ioctl+0x70/0x80 [404571.540734] __x64_sys_ioctl+0x16/0x20 [404571.540997] do_syscall_64+0x60/0x1d0 [404571.541279] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [404571.543729] btrfs D 0 14210 14208 0x00004000 [404571.544023] Call Trace: [404571.544275] ? __schedule+0x3ae/0x7b0 [404571.544526] ? wait_for_completion+0x112/0x1a0 [404571.544795] schedule+0x3a/0xb0 [404571.545064] schedule_timeout+0x1ff/0x390 [404571.545351] ? lock_acquire+0xa6/0x190 [404571.545638] ? wait_for_completion+0x49/0x1a0 [404571.545890] ? wait_for_completion+0x112/0x1a0 [404571.546228] wait_for_completion+0x131/0x1a0 [404571.546503] ? wake_up_q+0x70/0x70 [404571.546775] btrfs_wait_ordered_extents+0x27c/0x400 [btrfs] [404571.547159] btrfs_commit_transaction+0x3b0/0xae0 [btrfs] [404571.547449] ? btrfs_mksubvol+0x4a4/0x640 [btrfs] [404571.547703] ? remove_wait_queue+0x60/0x60 [404571.547969] btrfs_mksubvol+0x605/0x640 [btrfs] [404571.548226] ? __sb_start_write+0xd4/0x1c0 [404571.548512] ? mnt_want_write_file+0x24/0x50 [404571.548789] btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs] [404571.549048] btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs] [404571.549307] btrfs_ioctl+0x133f/0x3150 [btrfs] [404571.549549] ? mem_cgroup_charge_statistics+0x4c/0xd0 [404571.549792] ? mem_cgroup_commit_charge+0x84/0x4b0 [404571.550064] ? __handle_mm_fault+0xe3e/0x11f0 [404571.550306] ? do_raw_spin_unlock+0x49/0xc0 [404571.550608] ? _raw_spin_unlock+0x24/0x30 [404571.550976] ? __handle_mm_fault+0xedf/0x11f0 [404571.551319] ? do_vfs_ioctl+0xa2/0x6f0 [404571.551659] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [404571.552087] do_vfs_ioctl+0xa2/0x6f0 [404571.552355] ksys_ioctl+0x70/0x80 [404571.552621] __x64_sys_ioctl+0x16/0x20 [404571.552864] do_syscall_64+0x60/0x1d0 [404571.553104] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) If we were joining the transaction instead of attaching to it, we would not risk a deadlock because a join only blocks if the transaction is in a state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc flush performed by a transaction is done before it reaches that state, when it is in the state TRANS_STATE_COMMIT_START. However a transaction join is intended for use cases where we do modify the filesystem, and fiemap only needs to peek at delayed references from the current transaction in order to determine if extents are shared, and, besides that, when there is no current transaction or when it blocks to wait for a current committing transaction to complete, it creates a new transaction without reserving any space. Such unnecessary transactions, besides doing unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary rotation of the precious backup roots. So fix this by adding a new transaction join variant, named join_nostart, which behaves like the regular join, but it does not create a transaction when none currently exists or after waiting for a committing transaction to complete. Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-30Btrfs: fix race leading to fs corruption after transaction abortFilipe Manana1-0/+10
When one transaction is finishing its commit, it is possible for another transaction to start and enter its initial commit phase as well. If the first ends up getting aborted, we have a small time window where the second transaction commit does not notice that the previous transaction aborted and ends up committing, writing a superblock that points to btrees that reference extent buffers (nodes and leafs) that were not persisted to disk. The consequence is that after mounting the filesystem again, we will be unable to load some btree nodes/leafs, either because the content on disk is either garbage (or just zeroes) or corresponds to the old content of a previouly COWed or deleted node/leaf, resulting in the well known error messages "parent transid verify failed on ...". The following sequence diagram illustrates how this can happen. CPU 1 CPU 2 <at transaction N> btrfs_commit_transaction() (...) --> sets transaction state to TRANS_STATE_UNBLOCKED --> sets fs_info->running_transaction to NULL (...) btrfs_start_transaction() start_transaction() wait_current_trans() --> returns immediately because fs_info->running_transaction is NULL join_transaction() --> creates transaction N + 1 --> sets fs_info->running_transaction to transaction N + 1 --> adds transaction N + 1 to the fs_info->trans_list list --> returns transaction handle pointing to the new transaction N + 1 (...) btrfs_sync_file() btrfs_start_transaction() --> returns handle to transaction N + 1 (...) btrfs_write_and_wait_transaction() --> writeback of some extent buffer fails, returns an error btrfs_handle_fs_error() --> sets BTRFS_FS_STATE_ERROR in fs_info->fs_state --> jumps to label "scrub_continue" cleanup_transaction() btrfs_abort_transaction(N) --> sets BTRFS_FS_STATE_TRANS_ABORTED flag in fs_info->fs_state --> sets aborted field in the transaction and transaction handle structures, for transaction N only --> removes transaction from the list fs_info->trans_list btrfs_commit_transaction(N + 1) --> transaction N + 1 was not aborted, so it proceeds (...) --> sets the transaction's state to TRANS_STATE_COMMIT_START --> does not find the previous transaction (N) in the fs_info->trans_list, so it doesn't know that transaction was aborted, and the commit of transaction N + 1 proceeds (...) --> sets transaction N + 1 state to TRANS_STATE_UNBLOCKED btrfs_write_and_wait_transaction() --> succeeds writing all extent buffers created in the transaction N + 1 write_all_supers() --> succeeds --> we now have a superblock on disk that points to trees that refer to at least one extent buffer that was never persisted So fix this by updating the transaction commit path to check if the flag BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting the transaction to the TRANS_STATE_COMMIT_START we do not find any previous transaction in the fs_info->trans_list. If the flag is set, just fail the transaction commit with -EROFS, as we do in other places. The exact error code for the previous transaction abort was already logged and reported. Fixes: 49b25e0540904b ("btrfs: enhance transaction abort infrastructure") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-30Btrfs: fix incremental send failure after deduplicationFilipe Manana1-62/+15
When doing an incremental send operation we can fail if we previously did deduplication operations against a file that exists in both snapshots. In that case we will fail the send operation with -EIO and print a message to dmesg/syslog like the following: BTRFS error (device sdc): Send: inconsistent snapshot, found updated \ extent for inode 257 without updated inode item, send root is 258, \ parent root is 257 This requires that we deduplicate to the same file in both snapshots for the same amount of times on each snapshot. The issue happens because a deduplication only updates the iversion of an inode and does not update any other field of the inode, therefore if we deduplicate the file on each snapshot for the same amount of time, the inode will have the same iversion value (stored as the "sequence" field on the inode item) on both snapshots, therefore it will be seen as unchanged between in the send snapshot while there are new/updated/deleted extent items when comparing to the parent snapshot. This makes the send operation return -EIO and print an error message. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt # Create our first file. The first half of the file has several 64Kb # extents while the second half as a single 512Kb extent. $ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo $ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo # Create the base snapshot and the parent send stream from it. $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1 $ btrfs send -f /tmp/1.snap /mnt/mysnap1 # Create our second file, that has exactly the same data as the first # file. $ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar # Create the second snapshot, used for the incremental send, before # doing the file deduplication. $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2 # Now before creating the incremental send stream: # # 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This # will drop several extent items and add a new one, also updating # the inode's iversion (sequence field in inode item) by 1, but not # any other field of the inode; # # 2) Deduplicate into a different subrange of file foo in snapshot # mysnap2. This will replace an extent item with a new one, also # updating the inode's iversion by 1 but not any other field of the # inode. # # After these two deduplication operations, the inode items, for file # foo, are identical in both snapshots, but we have different extent # items for this inode in both snapshots. We want to check this doesn't # cause send to fail with an error or produce an incorrect stream. $ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo $ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo # Create the incremental send stream. $ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2 ERROR: send ioctl failed with -5: Input/output error This issue started happening back in 2015 when deduplication was updated to not update the inode's ctime and mtime and update only the iversion. Back then we would hit a BUG_ON() in send, but later in 2016 send was updated to return -EIO and print the error message instead of doing the BUG_ON(). A test case for fstests follows soon. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933 Fixes: 1c919a5e13702c ("btrfs: don't update mtime/ctime on deduped inodes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-29dax: Fix missed wakeup in put_unlocked_entry()Jan Kara1-1/+1
The condition checking whether put_unlocked_entry() needs to wake up following waiter got broken by commit 23c84eb78375 ("dax: Fix missed wakeup with PMD faults"). We need to wake the waiter whenever the passed entry is valid (i.e., non-NULL and not special conflict entry). This could lead to processes never being woken up when waiting for entry lock. Fix the condition. Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/20190729120228.GC17833@quack2.suse.cz Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-29f2fs: use EINVAL for superblock with invalid magicIcenowy Zheng1-24/+24
The kernel mount_block_root() function expects -EACESS or -EINVAL for a unmountable filesystem when trying to mount the root with different filesystem types. However, in 5.3-rc1 the behavior when F2FS code cannot find valid block changed to return -EFSCORRUPTED(-EUCLEAN), and this error code makes mount_block_root() fail when trying to probe F2FS. When the magic number of the superblock mismatches, it has a high probability that it's just not a F2FS. In this case return -EINVAL seems to be a better result, and this return value can make mount_block_root() probing work again. Return -EINVAL when the superblock has magic mismatch, -EFSCORRUPTED in other cases (the magic matches but the superblock cannot be recognized). Fixes: 10f966bbf521 ("f2fs: use generic EFSBADCRC/EFSCORRUPTED") Signed-off-by: Icenowy Zheng <icenowy@aosc.io> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-29xfs: fix stack contents leakage in the v1 inumber ioctlsDarrick J. Wong1-0/+3
Explicitly initialize the onstack structures to zero so we don't leak kernel memory into userspace when converting the in-core inumbers structure to the v1 inogrp ioctl structure. Add a comment about why we have to use memset to ensure that the padding holes in the structures are set to zero. Fixes: 5f19c7fc6873351 ("xfs: introduce v5 inode group structure") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2019-07-28Merge tag 'spdx-5.3-rc2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull SPDX fixes from Greg KH: "Here are some small SPDX fixes for 5.3-rc2 for things that came in during the 5.3-rc1 merge window that we previously missed. Only three small patches here: - two uapi patches to resolve some SPDX tags that were not correct - fix an invalid SPDX tag in the iomap Makefile file All have been properly reviewed on the public mailing lists" * tag 'spdx-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: iomap: fix Invalid License ID treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers again treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers
2019-07-28Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Two fixes for the fair scheduling class: - Prevent freeing memory which is accessible by concurrent readers - Make the RCU annotations for numa groups consistent" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Use RCU accessors consistently for ->numa_group sched/fair: Don't free p->numa_faults with concurrent readers
2019-07-27Merge tag 'Wimplicit-fallthrough-5.3-rc2' of ↵Linus Torvalds2-37/+68
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull Wimplicit-fallthrough enablement from Gustavo A. R. Silva: "This marks switch cases where we are expecting to fall through, and globally enables the -Wimplicit-fallthrough option in the main Makefile. Finally, some missing-break fixes that have been tagged for -stable: - drm/amdkfd: Fix missing break in switch statement - drm/amdgpu/gfx10: Fix missing break in switch statement With these changes, we completely get rid of all the fall-through warnings in the kernel" * tag 'Wimplicit-fallthrough-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: Makefile: Globally enable fall-through warning drm/i915: Mark expected switch fall-throughs drm/amd/display: Mark expected switch fall-throughs drm/amdkfd/kfd_mqd_manager_v10: Avoid fall-through warning drm/amdgpu/gfx10: Fix missing break in switch statement drm/amdkfd: Fix missing break in switch statement perf/x86/intel: Mark expected switch fall-throughs mtd: onenand_base: Mark expected switch fall-through afs: fsclient: Mark expected switch fall-throughs afs: yfsclient: Mark expected switch fall-throughs can: mark expected switch fall-throughs firewire: mark expected switch fall-throughs
2019-07-27f2fs: fix to read source block before invalidating itJaegeuk Kim1-36/+34
f2fs_allocate_data_block() invalidates old block address and enable new block address. Then, if we try to read old block by f2fs_submit_page_bio(), it will give WARN due to reading invalid blocks. Let's make the order sanely back. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-07-26Merge tag 'for-5.3-rc1-tag' of ↵Linus Torvalds2-8/+12
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two regression fixes: - hangs caused by a missing barrier in the locking code - memory leaks of extent_state due to bad handling of a cached pointer" * tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix extent_state leak in btrfs_lock_and_flush_ordered_range btrfs: Fix deadlock caused by missing memory barrier
2019-07-26Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+2
Pull vfs umount_tree() leak fix from Al Viro: "Fix braino introduced in 'switch the remnants of releasing the mountpoint away from fs_pin'. The most visible result is leaking struct mount when mounting btrfs, making it impossible to shut down" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix the struct mount leak in umount_tree()
2019-07-26Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-blockLinus Torvalds2-25/+114
Pull block fixes from Jens Axboe: - Several io_uring fixes/improvements: - Blocking fix for O_DIRECT (me) - Latter page slowness for registered buffers (me) - Fix poll hang under certain conditions (me) - Defer sequence check fix for wrapped rings (Zhengyuan) - Mismatch in async inc/dec accounting (Zhengyuan) - Memory ordering issue that could cause stall (Zhengyuan) - Track sequential defer in bytes, not pages (Zhengyuan) - NVMe pull request from Christoph - Set of hang fixes for wbt (Josef) - Redundant error message kill for libahci (Ding) - Remove unused blk_mq_sched_started_request() and related ops (Marcos) - drbd dynamic alloc shash descriptor to reduce stack use (Arnd) - blkcg ->pd_stat() non-debug print (Tejun) - bcache memory leak fix (Wei) - Comment fix (Akinobu) - BFQ perf regression fix (Paolo) * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits) io_uring: ensure ->list is initialized for poll commands Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP drbd: dynamically allocate shash descriptor block: blk-mq: Remove blk_mq_sched_started_request and started_request bcache: fix possible memory leak in bch_cached_dev_run() io_uring: track io length in async_list based on bytes io_uring: don't use iov_iter_advance() for fixed buffers block: properly handle IOCB_NOWAIT for async O_DIRECT IO blk-mq: allow REQ_NOWAIT to return an error inline io_uring: add a memory barrier before atomic_read rq-qos: use a mb for got_token rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule rq-qos: don't reset has_sleepers on spurious wakeups rq-qos: fix missed wake-ups in rq_qos_throttle wait: add wq_has_single_sleeper helper block, bfq: check also in-flight I/O in dispatch plugging block: fix sysfs module parameters directory path in comment ...
2019-07-26fix the struct mount leak in umount_tree()Al Viro1-2/+2
We need to drop everything we remove from the tree, whether mnt_has_parent() is true or not. Usually the bug manifests as a slow memory leak (leaked struct mount for initramfs); it becomes much more visible in mount_subtree() users, such as btrfs. There we leak a struct mount for btrfs superblock being mounted, which prevents fs shutdown on subsequent umount. Fixes: 56cbb429d911 ("switch the remnants of releasing the mountpoint away from fs_pin") Reported-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-26btrfs: fix extent_state leak in btrfs_lock_and_flush_ordered_rangeNaohiro Aota1-5/+6
btrfs_lock_and_flush_ordered_range() loads given "*cached_state" into cachedp, which, in general, is NULL. Then, lock_extent_bits() updates "cachedp", but it never goes backs to the caller. Thus the caller still see its "cached_state" to be NULL and never free the state allocated under btrfs_lock_and_flush_ordered_range(). As a result, we will see massive state leak with e.g. fstests btrfs/005. Fix this bug by properly handling the pointers. Fixes: bd80d94efb83 ("btrfs: Always use a cached extent_state in btrfs_lock_and_flush_ordered_range") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-26afs: fsclient: Mark expected switch fall-throughsGustavo A. R. Silva1-18/+33
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warnings: Warning level 3 was used: -Wimplicit-fallthrough=3 fs/afs/fsclient.c: In function ‘afs_deliver_fs_fetch_acl’: fs/afs/fsclient.c:2199:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/fsclient.c:2202:2: note: here case 1: ^~~~ fs/afs/fsclient.c:2216:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/fsclient.c:2219:2: note: here case 2: ^~~~ fs/afs/fsclient.c:2225:19: warning: this statement may fall through [-Wimplicit-fallthrough=] call->unmarshall++; ~~~~~~~~~~~~~~~~^~ fs/afs/fsclient.c:2228:2: note: here case 3: ^~~~ This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>