summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-07-09NFS: Fix show_nfs_errors macros againChuck Lever1-82/+82
I noticed that NFS status values stopped working again. trace_print_symbols_seq() takes an unsigned long. Passing a negative errno or negative NFSERR value just confuses it, and since we're using C macros here and not static inline functions, all bets are off due to implicit type conversion. Straight-line the calling conventions so that error codes are stored in the trace record as positive values in an unsigned long field, mapped to symbolic as an unsigned long, and displayed as a negative value, to continue to enable grepping on "error=-". It's often the case that an error value that is positive is a byte count but when it's negative, it's an error (e.g. nfs4_write). Fix those cases so that the value that is eventually stored in the error field is a positive NFS status or errno, or zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09NFS4: Add a trace event to record invalid CB sequence IDsChuck Lever2-8/+58
Help debug NFSv4 callback failures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09Merge branch 'siginfo-linus' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull force_sig() argument change from Eric Biederman: "A source of error over the years has been that force_sig has taken a task parameter when it is only safe to use force_sig with the current task. The force_sig function is built for delivering synchronous signals such as SIGSEGV where the userspace application caused a synchronous fault (such as a page fault) and the kernel responded with a signal. Because the name force_sig does not make this clear, and because the force_sig takes a task parameter the function force_sig has been abused for sending other kinds of signals over the years. Slowly those have been fixed when the oopses have been tracked down. This set of changes fixes the remaining abusers of force_sig and carefully rips out the task parameter from force_sig and friends making this kind of error almost impossible in the future" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus signal: Remove the signal number and task parameters from force_sig_info signal: Factor force_sig_info_to_task out of force_sig_info signal: Generate the siginfo in force_sig signal: Move the computation of force into send_signal and correct it. signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal signal: Remove the task parameter from force_sig_fault signal: Use force_sig_fault_to_task for the two calls that don't deliver to current signal: Explicitly call force_sig_fault on current signal/unicore32: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from ptrace_break signal/nds32: Remove tsk parameter from send_sigtrap signal/riscv: Remove tsk parameter from do_trap signal/sh: Remove tsk parameter from force_sig_info_fault signal/um: Remove task parameter from send_sigtrap signal/x86: Remove task parameter from send_sigtrap signal: Remove task parameter from force_sig_mceerr signal: Remove task parameter from force_sig signal: Remove task parameter from force_sigsegv ...
2019-07-09pstore: Fix double-free in pstore_mkfile() failure pathNorbert Manthey1-7/+6
The pstore_mkfile() function is passed a pointer to a struct pstore_record. On success it consumes this 'record' pointer and references it from the created inode. On failure, however, it may or may not free the record. There are even two different code paths which return -ENOMEM -- one of which does and the other doesn't free the record. Make the behaviour deterministic by never consuming and freeing the record when returning failure, allowing the caller to do the cleanup consistently. Signed-off-by: Norbert Manthey <nmanthey@amazon.de> Link: https://lore.kernel.org/r/1562331960-26198-1-git-send-email-nmanthey@amazon.de Fixes: 83f70f0769ddd ("pstore: Do not duplicate record metadata") Fixes: 1dfff7dd67d1a ("pstore: Pass record contents instead of copying") Cc: stable@vger.kernel.org [kees: also move "private" allocation location, rename inode cleanup label] Signed-off-by: Kees Cook <keescook@chromium.org>
2019-07-09pstore: no need to check return value of debugfs_create functionsGreg Kroah-Hartman1-16/+2
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Cc: Kees Cook <keescook@chromium.org> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2019-07-09pstore/ram: Improve backward compatibility with older ChromebooksDouglas Anderson1-0/+21
When you try to run an upstream kernel on an old ARM-based Chromebook you'll find that console-ramoops doesn't work. Old ARM-based Chromebooks, before <https://crrev.com/c/439792> ("ramoops: support upstream {console,pmsg,ftrace}-size properties") used to create a "ramoops" node at the top level that looked like: / { ramoops { compatible = "ramoops"; reg = <...>; record-size = <...>; dump-oops; }; }; ...and these Chromebooks assumed that the downstream kernel would make console_size / pmsg_size match the record size. The above ramoops node was added by the firmware so it's not easy to make any changes. Let's match the expected behavior, but only for those using the old backward-compatible way of working where ramoops is right under the root node. NOTE: if there are some out-of-tree devices that had ramoops at the top level, left everything but the record size as 0, and somehow doesn't want this behavior, we can try to add more conditions here. Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2019-07-09Merge branch 'linus' of ↵Linus Torvalds3-48/+17
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "Here is the crypto update for 5.3: API: - Test shash interface directly in testmgr - cra_driver_name is now mandatory Algorithms: - Replace arc4 crypto_cipher with library helper - Implement 5 way interleave for ECB, CBC and CTR on arm64 - Add xxhash - Add continuous self-test on noise source to drbg - Update jitter RNG Drivers: - Add support for SHA204A random number generator - Add support for 7211 in iproc-rng200 - Fix fuzz test failures in inside-secure - Fix fuzz test failures in talitos - Fix fuzz test failures in qat" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (143 commits) crypto: stm32/hash - remove interruptible condition for dma crypto: stm32/hash - Fix hmac issue more than 256 bytes crypto: stm32/crc32 - rename driver file crypto: amcc - remove memset after dma_alloc_coherent crypto: ccp - Switch to SPDX license identifiers crypto: ccp - Validate the the error value used to index error messages crypto: doc - Fix formatting of new crypto engine content crypto: doc - Add parameter documentation crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR crypto: arm64/aes-ce - add 5 way interleave routines crypto: talitos - drop icv_ool crypto: talitos - fix hash on SEC1. crypto: talitos - move struct talitos_edesc into talitos.h lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE crypto/NX: Set receive window credits to max number of CRBs in RxFIFO crypto: asymmetric_keys - select CRYPTO_HASH where needed crypto: serpent - mark __serpent_setkey_sbox noinline crypto: testmgr - dynamically allocate crypto_shash crypto: testmgr - dynamically allocate testvec_config crypto: talitos - eliminate unneeded 'done' functions at build time ...
2019-07-09nfsd: Fix misuse of strlcpyJoe Perches1-1/+1
Probable cut&paste typo - use the correct field size. (Not currently a practical problem since these two fields have the same size, but we should fix it anyway.) Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-07-09Merge tag 'keys-acl-20190703' of ↵Linus Torvalds10-21/+78
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull keyring ACL support from David Howells: "This changes the permissions model used by keys and keyrings to be based on an internal ACL by the following means: - Replace the permissions mask internally with an ACL that contains a list of ACEs, each with a specific subject with a permissions mask. Potted default ACLs are available for new keys and keyrings. ACE subjects can be macroised to indicate the UID and GID specified on the key (which remain). Future commits will be able to add additional subject types, such as specific UIDs or domain tags/namespaces. Also split a number of permissions to give finer control. Examples include splitting the revocation permit from the change-attributes permit, thereby allowing someone to be granted permission to revoke a key without allowing them to change the owner; also the ability to join a keyring is split from the ability to link to it, thereby stopping a process accessing a keyring by joining it and thus acquiring use of possessor permits. - Provide a keyctl to allow the granting or denial of one or more permits to a specific subject. Direct access to the ACL is not granted, and the ACL cannot be viewed" * tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: keys: Provide KEYCTL_GRANT_PERMISSION keys: Replace uid/gid/perm permissions checking with an ACL
2019-07-09Merge tag 'keys-namespace-20190627' of ↵Linus Torvalds5-8/+12
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull keyring namespacing from David Howells: "These patches help make keys and keyrings more namespace aware. Firstly some miscellaneous patches to make the process easier: - Simplify key index_key handling so that the word-sized chunks assoc_array requires don't have to be shifted about, making it easier to add more bits into the key. - Cache the hash value in the key so that we don't have to calculate on every key we examine during a search (it involves a bunch of multiplications). - Allow keying_search() to search non-recursively. Then the main patches: - Make it so that keyring names are per-user_namespace from the point of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not accessible cross-user_namespace. keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this. - Move the user and user-session keyrings to the user_namespace rather than the user_struct. This prevents them propagating directly across user_namespaces boundaries (ie. the KEY_SPEC_* flags will only pick from the current user_namespace). - Make it possible to include the target namespace in which the key shall operate in the index_key. This will allow the possibility of multiple keys with the same description, but different target domains to be held in the same keyring. keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this. - Make it so that keys are implicitly invalidated by removal of a domain tag, causing them to be garbage collected. - Institute a network namespace domain tag that allows keys to be differentiated by the network namespace in which they operate. New keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned the network domain in force when they are created. - Make it so that the desired network namespace can be handed down into the request_key() mechanism. This allows AFS, NFS, etc. to request keys specific to the network namespace of the superblock. This also means that the keys in the DNS record cache are thenceforth namespaced, provided network filesystems pass the appropriate network namespace down into dns_query(). For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other cache keyrings, such as idmapper keyrings, also need to set the domain tag - for which they need access to the network namespace of the superblock" * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: keys: Pass the network namespace into request_key mechanism keys: Network namespace domain tag keys: Garbage collect keys for which the domain has been removed keys: Include target namespace in match criteria keys: Move the user and user-session keyrings to the user_namespace keys: Namespace keyring names keys: Add a 'recurse' flag for keyring searches keys: Cache the hash value to avoid lots of recalculation keys: Simplify key description management
2019-07-09Merge branch 'x86-core-for-linus' of ↵Linus Torvalds2-0/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 AVX512 status update from Ingo Molnar: "This adds a new ABI that the main scheduler probably doesn't want to deal with but HPC job schedulers might want to use: the AVX512_elapsed_ms field in the new /proc/<pid>/arch_status task status file, which allows the user-space job scheduler to cluster such tasks, to avoid turbo frequency drops" * 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation/filesystems/proc.txt: Add arch_status file x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status proc: Add /proc/<pid>/arch_status
2019-07-09Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Remove the unused per rq load array and all its infrastructure, by Dietmar Eggemann. - Add utilization clamping support by Patrick Bellasi. This is a refinement of the energy aware scheduling framework with support for boosting of interactive and capping of background workloads: to make sure critical GUI threads get maximum frequency ASAP, and to make sure background processing doesn't unnecessarily move to cpufreq governor to higher frequencies and less energy efficient CPU modes. - Add the bare minimum of tracepoints required for LISA EAS regression testing, by Qais Yousef - which allows automated testing of various power management features, including energy aware scheduling. - Restructure the former tsk_nr_cpus_allowed() facility that the -rt kernel used to modify the scheduler's CPU affinity logic such as migrate_disable() - introduce the task->cpus_ptr value instead of taking the address of &task->cpus_allowed directly - by Sebastian Andrzej Siewior. - Misc optimizations, fixes, cleanups and small enhancements - see the Git log for details. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) sched/uclamp: Add uclamp support to energy_compute() sched/uclamp: Add uclamp_util_with() sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks sched/uclamp: Set default clamps for RT tasks sched/uclamp: Reset uclamp values on RESET_ON_FORK sched/uclamp: Extend sched_setattr() to support utilization clamping sched/core: Allow sched_setattr() to use the current policy sched/uclamp: Add system default clamps sched/uclamp: Enforce last task's UCLAMP_MAX sched/uclamp: Add bucket local max tracking sched/uclamp: Add CPU's clamp buckets refcounting sched/fair: Rename weighted_cpuload() to cpu_runnable_load() sched/debug: Export the newly added tracepoints sched/debug: Add sched_overutilized tracepoint sched/debug: Add new tracepoint to track PELT at se level sched/debug: Add new tracepoints to track PELT at rq level sched/debug: Add a new sched_trace_*() helper functions sched/autogroup: Make autogroup_path() always available sched/wait: Deduplicate code with do-while sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity() ...
2019-07-09Merge branch 'locking-core-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle are: - rwsem scalability improvements, phase #2, by Waiman Long, which are rather impressive: "On a 2-socket 40-core 80-thread Skylake system with 40 reader and writer locking threads, the min/mean/max locking operations done in a 5-second testing window before the patchset were: 40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810 40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255 After the patchset, they became: 40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741 40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098" There's a lot of changes to the locking implementation that makes it similar to qrwlock, including owner handoff for more fair locking. Another microbenchmark shows how across the spectrum the improvements are: "With a locking microbenchmark running on 5.1 based kernel, the total locking rates (in kops/s) on a 2-socket Skylake system with equal numbers of readers and writers (mixed) before and after this patchset were: # of Threads Before Patch After Patch ------------ ------------ ----------- 2 2,618 4,193 4 1,202 3,726 8 802 3,622 16 729 3,359 32 319 2,826 64 102 2,744" The changes are extensive and the patch-set has been through several iterations addressing various locking workloads. There might be more regressions, but unless they are pathological I believe we want to use this new implementation as the baseline going forward. - jump-label optimizations by Daniel Bristot de Oliveira: the primary motivation was to remove IPI disturbance of isolated RT-workload CPUs, which resulted in the implementation of batched jump-label updates. Beyond the improvement of the real-time characteristics kernel, in one test this patchset improved static key update overhead from 57 msecs to just 1.4 msecs - which is a nice speedup as well. - atomic64_t cross-arch type cleanups by Mark Rutland: over the last ~10 years of atomic64_t existence the various types used by the APIs only had to be self-consistent within each architecture - which means they became wildly inconsistent across architectures. Mark puts and end to this by reworking all the atomic64 implementations to use 's64' as the base type for atomic64_t, and to ensure that this type is consistently used for parameters and return values in the API, avoiding further problems in this area. - A large set of small improvements to lockdep by Yuyang Du: type cleanups, output cleanups, function return type and othr cleanups all around the place. - A set of percpu ops cleanups and fixes by Peter Zijlstra. - Misc other changes - please see the Git log for more details" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits) locking/lockdep: increase size of counters for lockdep statistics locking/atomics: Use sed(1) instead of non-standard head(1) option locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING x86/jump_label: Make tp_vec_nr static x86/percpu: Optimize raw_cpu_xchg() x86/percpu, sched/fair: Avoid local_clock() x86/percpu, x86/irq: Relax {set,get}_irq_regs() x86/percpu: Relax smp_processor_id() x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() locking/rwsem: Guard against making count negative locking/rwsem: Adaptive disabling of reader optimistic spinning locking/rwsem: Enable time-based spinning on reader-owned rwsem locking/rwsem: Make rwsem->owner an atomic_long_t locking/rwsem: Enable readers spinning on writer locking/rwsem: Clarify usage of owner's nonspinaable bit locking/rwsem: Wake up almost all readers in wait queue locking/rwsem: More optimal RT task handling of null owner locking/rwsem: Always release wait_lock before waking up tasks locking/rwsem: Implement lock handoff to prevent lock starvation locking/rwsem: Make rwsem_spin_on_owner() return owner state ...
2019-07-08ubifs: Don't leak orphans on memory during commitRichard Weinberger1-26/+24
If an orphan has child orphans (xattrs), and due to a commit the parent orpahn cannot get free()'ed immediately, put also all child orphans on the erase list. Otherwise UBIFS will free() them only upon unmount and we waste memory. Fixes: 988bec41318f ("ubifs: orphan: Handle xattrs like files") Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08ubifs: Check link count of inodes when killing orphans.Richard Weinberger1-9/+35
O_TMPFILE files can change their link count back to non-zero. This corner case needs to get addressed in the orphans subsystem too. Fixes: 474b93704f32 ("ubifs: Implement O_TMPFILE") Reported-by: Lars Persson <lists@bofh.nu> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08ubifs: Add support for zstd compression.Michele Dionisio4-1/+40
zstd shows a good compression rate and is faster than lzo, also on slow ARM cores. Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Signed-off-by: Michele Dionisio <michele.dionisio@gmail.com> [rw: rewrote commit message] Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08ubifs: support offline signed imagesSascha Hauer7-44/+225
HMACs can only be generated on the system the UBIFS image is running on. To support offline signed images we add a PKCS#7 signature to the UBIFS image which can be created by mkfs.ubifs. Both the master node and the superblock need to be authenticated, during normal runtime both are protected with HMACs. For offline signature support however only a single signature is desired. We add a signature covering the superblock node directly behind it. To protect the master node a hash of the master node is added to the superblock which is used when the master node doesn't contain a HMAC. Transition to a read/write filesystem is also supported. During transition first the master node is rewritten with a HMAC (implicitly, it is written anyway as the FS is marked dirty). Afterwards the superblock is rewritten with a HMAC. Once after the image has been mounted read/write it is HMAC only, the signature is no longer required or even present on the filesystem. In an offline signed image the master node is authenticated by the superblock. In a transition to r/w we have to make sure that the master node is rewritten before the superblock node. In this case the master node gets a HMAC and its authenticity no longer depends on the superblock node. There are some cases in which the current code first writes the superblock node though, so with this patch writing of the superblock node is delayed until the master node is written. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08ubifs: remove unnecessary check in ubifs_log_start_commitLiu Song1-4/+1
In ubifs_log_start_commit, the value of c->lhead_offs is zero or set to zero by code bellow. /* Switch to the next log LEB */ if (c->lhead_offs) { c->lhead_lnum = ubifs_next_log_lnum(c, c->lhead_lnum); ubifs_assert(c->lhead_lnum != c->ltail_lnum); c->lhead_offs = 0; } The value of 'len' can not exceed 'max_len' which assigned value by code bellow. max_len = UBIFS_CS_NODE_SZ + c->jhead_cnt * UBIFS_REF_NODE_SZ; The value of c->lhead_offs changed by code bellow and cannot exceed 'max_len'. c->lhead_offs += len; if (c->lhead_offs == c->leb_size) { c->lhead_lnum = ubifs_next_log_lnum(c, c->lhead_lnum); c->lhead_offs = 0; } Usually, the size of PEB is between 64KB and 256KB. So the value of c->lhead_offs is far less than c->leb_size. The check 'if (c->lhead_offs == c->leb_size)' could never to be true. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08ubifs: Fix typo of output in get_cs_sqnumLiu Song1-1/+1
"Not a CS node" makes more sense than "Node a CS node". Signed-off-by: Liu Song <liu.song11@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08ubifs: Simplify redundant codeLiu Song1-2/+1
cbuf's size can be simply assigned. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08ubifs: Correctly use tnc_next() in search_dh_cookie()Richard Weinberger1-5/+11
Commit c877154d307f fixed an uninitialized variable and optimized the function to not call tnc_next() in the first iteration of the loop. While this seemed perfectly legit and wise, it turned out to be illegal. If the lookup function does not find an exact match it will rewind the cursor by 1. The rewinded cursor will not match the name hash we are looking for and this results in a spurious -ENOENT. So we need to move to the next entry in case of an non-exact match, but not if the match was exact. While we are here, update the documentation to avoid further confusion. Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: c877154d307f ("ubifs: Fix uninitialized variable in search_dh_cookie()") Fixes: 781f675e2d7e ("ubifs: Fix unlink code wrt. double hash lookups") Signed-off-by: Richard Weinberger <richard@nod.at>
2019-07-08Merge tag 'v5.2' into perf/core, to pick up fixesIngo Molnar21-100/+180
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-08ceph: fix end offset in truncate_inode_pages_range callLuis Henriques1-1/+1
Commit e450f4d1a5d6 ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()") fixed the end offset parameter used to call filemap_write_and_wait_range and invalidate_inode_pages2_range. Unfortunately it missed truncate_inode_pages_range, introducing a regression that is easily detected by xfstest generic/130. The problem is that when doing direct IO it is possible that an extra page is truncated from the page cache when the end offset is page aligned. This can cause data loss if that page hasn't been sync'ed to the OSDs. While there, change code to use PAGE_ALIGN macro instead. Cc: stable@vger.kernel.org Fixes: e450f4d1a5d6 ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()") Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: use generic_delete_inode() for ->drop_inodeLuis Henriques3-12/+1
ceph_drop_inode() implementation is not any different from the generic function, thus there's no point in keeping it around. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: use ceph_evict_inode to cleanup inode's resourceYan, Zheng3-4/+7
remove_session_caps() relies on __wait_on_freeing_inode(), to wait for freeing inode to remove its caps. But VFS wakes freeing inode waiters before calling destroy_inode(). Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/40102 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: initialize superblock s_time_gran to 1Luis Henriques1-1/+1
Having granularity set to 1us results in having inode timestamps with a accurancy different from the fuse client (i.e. atime, ctime and mtime will always end with '000'). This patch normalizes this behaviour and sets the granularity to 1. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08libceph: rename r_unsafe_item to r_private_itemIlya Dryomov1-3/+3
This list item remained from when we had safe and unsafe replies (commit vs ack). It has since become a private list item for use by clients. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: don't NULL terminate virtual xattrsJeff Layton1-25/+59
The convention with xattrs is to not store the termination with string data, given that it returns the length. This is how setfattr/getfattr operate. Most of ceph's virtual xattr routines use snprintf to plop the string directly into the destination buffer, but snprintf always NULL terminates the string. This means that if we send the kernel a buffer that is the exact length needed to hold the string, it'll end up truncated. Add a ceph_fmt_xattr helper function to format the string into an on-stack buffer that should always be large enough to hold the whole thing and then memcpy the result into the destination buffer. If it does turn out that the formatted string won't fit in the on-stack buffer, then return -E2BIG and do a WARN_ONCE(). Change over most of the virtual xattr routines to use the new helper. A couple of the xattrs are sourced from strings however, and it's difficult to know how long they'll be. Just have those memcpy the result in place after verifying the length. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Acked-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: return -ERANGE if virtual xattr value didn't fit in bufferJeff Layton1-7/+7
The getxattr manpage states that we should return ERANGE if the destination buffer size is too small to hold the value. ceph_vxattrcb_layout does this internally, but we should be doing this for all vxattrs. Fix the only caller of getxattr_cb to check the returned size against the buffer length and return -ERANGE if it doesn't fit. Drop the same check in ceph_vxattrcb_layout and just rely on the caller to handle it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Acked-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: make getxattr_cb return ssize_tJeff Layton1-45/+45
The getxattr_cb functions return size_t, which is unsigned and then cast that value to int and then ssize_t before returning it. While all of this works, it relies on implicit casting rules for signed/unsigned conversions. Change getxattr_cb to return ssize_t to better conform with what the caller actually wants. Also, remove some suspicious casts. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Acked-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: more precise CEPH_CLIENT_CAPS_PENDING_CAPSNAPYan, Zheng1-11/+30
Client uses this flag to tell mds if there is more cap snap need to flush. It's mainly for the case that client needs to re-send cap/snap flushes after mds failover, but CEPH_CAP_ANY_FILE_WR on corresponding inodes are all released before mds failover. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: kick flushing and flush snaps before sending normal cap messageYan, Zheng1-4/+14
Otherwise client may send cap flush messages in wrong order. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: clear CEPH_I_KICK_FLUSH flag inside __kick_flushing_caps()Yan, Zheng1-9/+4
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: increment change_attribute on local changesJeff Layton2-0/+7
We don't set SB_I_VERSION on ceph since we need to manage it ourselves, so we must increment it whenever we update the file times. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: handle change_attr in cap messagesJeff Layton3-9/+13
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: add change_attr field to ceph_inode_infoJeff Layton3-2/+8
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: allow querying of STATX_BTIME in ceph_getattrJeff Layton1-3/+13
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: handle btime in cap messagesJeff Layton3-7/+14
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: add btime field to ceph_inode_infoJeff Layton4-8/+17
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: have MDS map decoding use entity_addr_t decoderJeff Layton1-4/+8
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: remove request from waiting list before unregisterYan, Zheng1-0/+2
Link: https://tracker.ceph.com/issues/40339 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: don't blindly unregister session that is in opening stateYan, Zheng1-33/+26
handle_cap_export() may add placeholder caps to session that is in opening state. These caps' session pointer become wild after session get unregistered. The fix is not to unregister session in opening state during mds failovers, just let client to reconnect later when mds is recovered. Link: https://tracker.ceph.com/issues/40190 Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: fix infinite loop in get_quota_realm()Yan, Zheng1-2/+13
get_quota_realm() enters infinite loop if quota inode has no caps. This can happen after client gets evicted. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: add selinux supportYan, Zheng7-17/+173
When creating new file/directory, use security_dentry_init_security() to prepare selinux context for the new inode, then send openc/mkdir request to MDS, together with selinux xattr. security_dentry_init_security() only supports single security module and only selinux has dentry_init_security hook. So only selinux is supported for now. We can add support for other security modules once kernel has a generic version of dentry_init_security() Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: rename struct ceph_acls_info to ceph_acl_sec_ctxYan, Zheng5-52/+55
Also rename ceph_release_acls_info() to ceph_release_acl_sec_ctx(). And move their definitions to different files. This is preparation for security label support. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: fix debug print format in __set_xattr()Yan, Zheng1-2/+2
name is not '\0' terminated. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: fix warning PTR_ERR_OR_ZERO can be usedHariprasad Kelam1-1/+1
change1: fix below warning reported by coccicheck /fs/ceph/export.c:371:33-39: WARNING: PTR_ERR_OR_ZERO can be used change2: typecasted PTR_ERR_OR_ZERO to long as dout expecting long Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: hold i_ceph_lock when removing caps for freeing inodeYan, Zheng3-6/+8
ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask() on a freeing inode. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: ensure d_name/d_parent stability in ceph_mdsc_lease_send_msg()Yan, Zheng3-16/+16
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-07-08ceph: use READ_ONCE to access d_parent in RCU critical sectionYan, Zheng1-2/+2
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>