summaryrefslogtreecommitdiff
path: root/fs/dcache.c
AgeCommit message (Collapse)AuthorFilesLines
2018-02-04Merge tag 'usercopy-v4.16-rc1' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardened usercopy whitelisting from Kees Cook: "Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass all hardened usercopy checks since these sizes cannot change at runtime.) This new check is WARN-by-default, so any mistakes can be found over the next several releases without breaking anyone's system. The series has roughly the following sections: - remove %p and improve reporting with offset - prepare infrastructure and whitelist kmalloc - update VFS subsystem with whitelists - update SCSI subsystem with whitelists - update network subsystem with whitelists - update process memory with whitelists - update per-architecture thread_struct with whitelists - update KVM with whitelists and fix ioctl bug - mark all other allocations as not whitelisted - update lkdtm for more sensible test overage" * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits) lkdtm: Update usercopy tests for whitelisting usercopy: Restrict non-usercopy caches to size 0 kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl kvm: whitelist struct kvm_vcpu_arch arm: Implement thread_struct whitelist for hardened usercopy arm64: Implement thread_struct whitelist for hardened usercopy x86: Implement thread_struct whitelist for hardened usercopy fork: Provide usercopy whitelisting for task_struct fork: Define usercopy region in thread_stack slab caches fork: Define usercopy region in mm_struct slab caches net: Restrict unwhitelisted proto caches to size 0 sctp: Copy struct sctp_sock.autoclose to userspace using put_user() sctp: Define usercopy region in SCTP proto slab cache caif: Define usercopy region in caif proto slab cache ip: Define usercopy region in IP proto slab cache net: Define usercopy region in struct proto slab cache scsi: Define usercopy region in scsi_sense_cache slab cache cifs: Define usercopy region in cifs_request slab cache vxfs: Define usercopy region in vxfs_inode slab cache ufs: Define usercopy region in ufs_inode_cache slab cache ...
2018-02-01Merge branch 'KASAN-read_word_at_a_time'Linus Torvalds1-6/+1
Merge KASAN word-at-a-time fixups from Andrey Ryabinin. The word-at-a-time optimizations have caused headaches for KASAN, since the whole point is that we access byte streams in bigger chunks, and KASAN can be unhappy about the potential extra access at the end of the string. We used to have a horrible hack in dcache, and then people got complaints from the strscpy() case. This fixes it all up properly, by adding an explicit helper for the "access byte stream one word at a time" case. * emailed patches from Andrey Ryabinin <aryabinin@virtuozzo.com>: fs: dcache: Revert "manually unpoison dname after allocation to shut up kasan's reports" fs/dcache: Use read_word_at_a_time() in dentry_string_cmp() lib/strscpy: Shut up KASAN false-positives in strscpy() compiler.h: Add read_word_at_a_time() function. compiler.h, kasan: Avoid duplicating __read_once_size_nocheck()
2018-02-01fs: dcache: Revert "manually unpoison dname after allocation to shut up ↵Andrey Ryabinin1-5/+0
kasan's reports" This reverts commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd. It's no longer needed since dentry_string_cmp() now uses read_word_at_a_time() to avoid kasan's reports. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01fs/dcache: Use read_word_at_a_time() in dentry_string_cmp()Andrey Ryabinin1-1/+1
dentry_string_cmp() performs the word-at-a-time reads from 'cs' and may read slightly more than it was requested in kmallac(). Normally this would make KASAN to report out-of-bounds access, but this was workarounded by commit df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports"). This workaround is not perfect, since it allows out-of-bounds access to dentry's name for all the code, not just in dentry_string_cmp(). So it would be better to use read_word_at_a_time() instead and revert commit df4c0e36f1b1. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-01Merge branch 'work.dcache' of ↵Linus Torvalds1-7/+16
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: "Neil Brown's d_move()/d_path() race fix" * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: close race between getcwd() and d_move()
2018-01-31Merge branch 'work.misc' of ↵Linus Torvalds1-15/+17
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "All kinds of misc stuff, without any unifying topic, from various people. Neil's d_anon patch, several bugfixes, introduction of kvmalloc analogue of kmemdup_user(), extending bitfield.h to deal with fixed-endians, assorted cleanups all over the place..." * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) alpha: osf_sys.c: use timespec64 where appropriate alpha: osf_sys.c: fix put_tv32 regression jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path dcache: delete unused d_hash_mask dcache: subtract d_hash_shift from 32 in advance fs/buffer.c: fold init_buffer() into init_page_buffers() fs: fold __inode_permission() into inode_permission() fs: add RWF_APPEND sctp: use vmemdup_user() rather than badly open-coding memdup_user() snd_ctl_elem_init_enum_names(): switch to vmemdup_user() replace_user_tlv(): switch to vmemdup_user() new primitive: vmemdup_user() memdup_user(): switch to GFP_USER eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget() eventfd: fold eventfd_ctx_read() into eventfd_read() eventfd: convert to use anon_inode_getfd() nfs4file: get rid of pointless include of btrfs.h uvc_v4l2: clean copyin/copyout up vme_user: don't use __copy_..._user() usx2y: don't bother with memdup_user() for 16-byte structure ...
2018-01-26dcache: delete unused d_hash_maskAlexey Dobriyan1-3/+2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-26dcache: subtract d_hash_shift from 32 in advanceAlexey Dobriyan1-1/+3
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-24vfs: factor out helpers d_instantiate_anon() and d_alloc_anon()Miklos Szeredi1-31/+56
Those helpers are going to be used by overlayfs to implement NFS export decode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-24ovl: verify directory index entries on mountAmir Goldstein1-0/+1
Directory index entries should have 'upper' xattr pointing to the real upper dir. Verifying that the upper dir file handle is not stale is expensive, so only verify stale directory index entries on mount if NFS export feature is enabled. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-01-15vfs: Define usercopy region in names_cache slab cachesDavid Windsor1-2/+2
VFS pathnames are stored in the names_cache slab cache, either inline or across an entire allocation entry (when approaching PATH_MAX). These are copied to/from userspace, so they must be entirely whitelisted. cache object allocation: include/linux/fs.h: #define __getname() kmem_cache_alloc(names_cachep, GFP_KERNEL) example usage trace: strncpy_from_user+0x4d/0x170 getname_flags+0x6f/0x1f0 user_path_at_empty+0x23/0x40 do_mount+0x69/0xda0 SyS_mount+0x83/0xd0 fs/namei.c: getname_flags(...): ... result = __getname(); ... kname = (char *)result->iname; result->name = kname; len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX); ... if (unlikely(len == EMBEDDED_NAME_MAX)) { const size_t size = offsetof(struct filename, iname[1]); kname = (char *)result; result = kzalloc(size, GFP_KERNEL); ... result->name = kname; len = strncpy_from_user(kname, filename, PATH_MAX); In support of usercopy hardening, this patch defines the entire cache object in the names_cache slab cache as whitelisted, since it may entirely hold name strings to be copied to/from userspace. This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust commit log, add usage trace] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-15dcache: Define usercopy region in dentry_cache slab cacheDavid Windsor1-2/+3
When a dentry name is short enough, it can be stored directly in the dentry itself (instead in a separate kmalloc allocation). These dentry short names, stored in struct dentry.d_iname and therefore contained in the dentry_cache slab cache, need to be coped to userspace. cache object allocation: fs/dcache.c: __d_alloc(...): ... dentry = kmem_cache_alloc(dentry_cache, ...); ... dentry->d_name.name = dentry->d_iname; example usage trace: filldir+0xb0/0x140 dcache_readdir+0x82/0x170 iterate_dir+0x142/0x1b0 SyS_getdents+0xb5/0x160 fs/readdir.c: (called via ctx.actor by dir_emit) filldir(..., const char *name, ...): ... copy_to_user(..., name, namlen) fs/libfs.c: dcache_readdir(...): ... next = next_positive(dentry, p, 1) ... dir_emit(..., next->d_name.name, ...) In support of usercopy hardening, this patch defines a region in the dentry_cache slab cache in which userspace copy operations are allowed. This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamic copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: adjust hunks for kmalloc-specific things moved later] [kees: adjust commit log, provide usage trace] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2017-12-28VFS: close race between getcwd() and d_move()NeilBrown1-7/+16
d_move() will call __d_drop() and then __d_rehash() on the dentry being moved. This creates a small window when the dentry appears to be unhashed. Many tests of d_unhashed() are made under ->d_lock and so are safe from racing with this window, but some aren't. In particular, getcwd() calls d_unlinked() (which calls d_unhashed()) without d_lock protection, so it can race. This races has been seen in practice with lustre, which uses d_move() as part of name lookup. See: https://jira.hpdd.intel.com/browse/LU-9735 It could race with a regular rename(), and result in ENOENT instead of either the 'before' or 'after' name. The race can be demonstrated with a simple program which has two threads, one renaming a directory back and forth while another calls getcwd() within that directory: it should never fail, but does. See: https://patchwork.kernel.org/patch/9455345/ We could fix this race by taking d_lock and rechecking when d_unhashed() reports true. Alternately when can remove the window, which is the approach this patch takes. ___d_drop() is introduce which does *not* clear d_hash.pprev so the dentry still appears to be hashed. __d_drop() calls ___d_drop(), then clears d_hash.pprev. __d_move() now uses ___d_drop() and only clears d_hash.pprev when not rehashing. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-26VFS: don't keep disconnected dentries on d_anonNeilBrown1-10/+12
The original purpose of the per-superblock d_anon list was to keep disconnected dentries in the cache between consecutive requests to the NFS server. Dentries can be disconnected if a client holds a file open and repeatedly performs IO on it, and if the server drops the dentry, whether due to memory pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches". This purpose was thwarted by commit 75a6f82a0d10 ("freeing unlinked file indefinitely delayed") which caused disconnected dentries to be freed as soon as their refcount reached zero. This means that, when a dentry being used by nfsd gets disconnected, a new one needs to be allocated for every request (unless requests overlap). As the dentry has no name, no parent, and no children, there is little of value to cache. As small memory allocations are typically fast (from per-cpu free lists) this likely has little cost. This means that the original purpose of s_anon is no longer relevant: there is no longer any need to keep disconnected dentries on a list so they appear to be hashed. However, s_anon now has a new use. When you mount an NFS filesystem, the dentry stored in s_root is just a placebo. The "real" root dentry is allocated using d_obtain_root() and so it kept on the s_anon list. I don't know the reason for this, but suspect it related to NFSv4 where a mount of "server:/some/path" require NFS to look up the root filehandle on the server, then walk down "/some" and "/path" to get the filehandle to mount. Whatever the reason, NFS depends on the s_anon list and on shrink_dcache_for_umount() pruning all dentries on this list. So we cannot simply remove s_anon. We could just leave the code unchanged, but apart from that being potentially confusing, the (unfair) bit-spin-lock which protects s_anon can become a bottle neck when lots of disconnected dentries are being created. So this patch renames s_anon to s_roots, and stops storing disconnected dentries on the list. Only dentries obtained with d_obtain_root() are now stored on this list. There are many fewer of these (only NFS and NILFS2 use the call, and only during filesystem mount) so contention on the bit-lock will not be a problem. Possibly an alternate solution should be found for NFS and NILFS2, but that would require understanding their needs first. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-07vfs: remove unused hardirq.hYang Shi1-1/+0
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by vfs at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-04fs/dcache: Use release-acquire for name/length updatePaul E. McKenney1-7/+3
The code in __d_alloc() carefully orders filling in the NUL character of the name (and the length, hash, and the name itself) with assigning of the name itself. However, prepend_name() does not order the accesses to the ->name and ->len fields, other than on TSO systems. This commit therefore replaces prepend_name()'s READ_ONCE() of ->name with an smp_load_acquire(), which orders against the subsequent READ_ONCE() of ->len. Because READ_ONCE() now incorporates smp_read_barrier_depends(), prepend_name()'s smp_read_barrier_depends() is removed. Finally, to save a line, the smp_wmb()/store pair in __d_alloc() is replaced by smp_store_release(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <linux-fsdevel@vger.kernel.org>
2017-11-16kmemcheck: remove annotationsLevin, Alexander (Sasha Levin)1-2/+0
Patch series "kmemcheck: kill kmemcheck", v2. As discussed at LSF/MM, kill kmemcheck. KASan is a replacement that is able to work without the limitation of kmemcheck (single CPU, slow). KASan is already upstream. We are also not aware of any users of kmemcheck (or users who don't consider KASan as a suitable replacement). The only objection was that since KASAN wasn't supported by all GCC versions provided by distros at that time we should hold off for 2 years, and try again. Now that 2 years have passed, and all distros provide gcc that supports KASAN, kill kmemcheck again for the very same reasons. This patch (of 4): Remove kmemcheck annotations, and calls to kmemcheck from the kernel. [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs] Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-25locking/atomics, fs/dcache: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()Mark Rutland1-9/+9
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't currently harmful. However, for some features it is necessary to instrument reads and writes separately, which is not possible with ACCESS_ONCE(). This distinction is critical to correct operation. It's possible to transform the bulk of kernel code using the Coccinelle script below. However, this doesn't handle comments, leaving references to ACCESS_ONCE() instances which have been removed. As a preparatory step, this patch converts the dcache code and comments to use {READ,WRITE}_ONCE() consistently. ---- virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-4-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-24locking/barriers: Convert users of lockless_dereference() to READ_ONCE()Will Deacon1-2/+2
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it can be used instead of lockless_dereference() without any change in semantics. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-15Merge branch 'work.mount' of ↵Linus Torvalds1-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ->s_options removal from Al Viro: "Preparations for fsmount/fsopen stuff (coming next cycle). Everything gets moved to explicit ->show_options(), killing ->s_options off + some cosmetic bits around fs/namespace.c and friends. Basically, the stuff needed to work with fsmount series with minimum of conflicts with other work. It's not strictly required for this merge window, but it would reduce the PITA during the coming cycle, so it would be nice to have those bits and pieces out of the way" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: isofs: Fix isofs_show_options() VFS: Kill off s_options and helpers orangefs: Implement show_options 9p: Implement show_options isofs: Implement show_options afs: Implement show_options affs: Implement show_options befs: Implement show_options spufs: Implement show_options bpf: Implement show_options ramfs: Implement show_options pstore: Implement show_options omfs: Implement show_options hugetlbfs: Implement show_options VFS: Don't use save/replace_mount_options if not using generic_show_options VFS: Provide empty name qstr VFS: Make get_filesystem() return the affected filesystem VFS: Clean up whitespace in fs/namespace.c and fs/super.c Provide a function to create a NUL-terminated string from unterminated data
2017-07-11fs/dcache.c: fix spin lockup issue on nlru->lockSahitya Tummala1-2/+3
__list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer duration if there are more number of items in the lru list. As per the current code, it can hold the spin lock for upto maximum UINT_MAX entries at a time. So if there are more number of items in the lru list, then "BUG: spinlock lockup suspected" is observed in the below path: spin_bug+0x90 do_raw_spin_lock+0xfc _raw_spin_lock+0x28 list_lru_add+0x28 dput+0x1c8 path_put+0x20 terminate_walk+0x3c path_lookupat+0x100 filename_lookup+0x6c user_path_at_empty+0x54 SyS_faccessat+0xd0 el0_svc_naked+0x24 This nlru->lock is acquired by another CPU in this path - d_lru_shrink_move+0x34 dentry_lru_isolate_shrink+0x48 __list_lru_walk_one.isra.10+0x94 list_lru_walk_node+0x40 shrink_dcache_sb+0x60 do_remount_sb+0xbc do_emergency_remount+0xb0 process_one_work+0x228 worker_thread+0x2e0 kthread+0xf4 ret_from_fork+0x10 Fix this lockup by reducing the number of entries to be shrinked from the lru list to 1024 at once. Also, add cond_resched() before processing the lru list again. Link: http://marc.info/?t=149722864900001&r=1&w=2 Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-08Merge branch 'work.misc' of ↵Linus Torvalds1-0/+32
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc filesystem updates from Al Viro: "Assorted normal VFS / filesystems stuff..." * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: dentry name snapshots Make statfs properly return read-only state after emergency remount fs/dcache: init in_lookup_hashtable minix: Deinline get_block, save 2691 bytes fs: Reorder inode_owner_or_capable() to avoid needless fs: warn in case userspace lied about modprobe return
2017-07-08dentry name snapshotsAl Viro1-0/+27
take_dentry_name_snapshot() takes a safe snapshot of dentry name; if the name is a short one, it gets copied into caller-supplied structure, otherwise an extra reference to external name is grabbed (those are never modified). In either case the pointer to stable string is stored into the same structure. dentry must be held by the caller of take_dentry_name_snapshot(), but may be freely dropped afterwards - the snapshot will stay until destroyed by release_dentry_name_snapshot(). Intended use: struct name_snapshot s; take_dentry_name_snapshot(&s, dentry); ... access s.name ... release_dentry_name_snapshot(&s); Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name to pass down with event. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-07mm: update callers to use HASH_ZERO flagPavel Tatashin1-14/+4
Update dcache, inode, pid, mountpoint, and mount hash tables to use HASH_ZERO, and remove initialization after allocations. In case of places where HASH_EARLY was used such as in __pv_init_lock_hash the zeroed hash table was already assumed, because memblock zeroes the memory. CPU: SPARC M6, Memory: 7T Before fix: Dentry cache hash table entries: 1073741824 Inode-cache hash table entries: 536870912 Mount-cache hash table entries: 16777216 Mountpoint-cache hash table entries: 16777216 ftrace: allocating 20414 entries in 40 pages Total time: 11.798s After fix: Dentry cache hash table entries: 1073741824 Inode-cache hash table entries: 536870912 Mount-cache hash table entries: 16777216 Mountpoint-cache hash table entries: 16777216 ftrace: allocating 20414 entries in 40 pages Total time: 3.198s CPU: Intel Xeon E5-2630, Memory: 2.2T: Before fix: Dentry cache hash table entries: 536870912 Inode-cache hash table entries: 268435456 Mount-cache hash table entries: 8388608 Mountpoint-cache hash table entries: 8388608 CPU: Physical Processor ID: 0 Total time: 3.245s After fix: Dentry cache hash table entries: 536870912 Inode-cache hash table entries: 268435456 Mount-cache hash table entries: 8388608 Mountpoint-cache hash table entries: 8388608 CPU: Physical Processor ID: 0 Total time: 3.244s Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06VFS: Provide empty name qstrDavid Howells1-2/+6
Provide an empty name (ie. "") qstr for general use. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30fs/dcache: init in_lookup_hashtableSebastian Andrzej Siewior1-0/+5
in_lookup_hashtable was introduced in commit 94bdd655caba ("parallel lookups machinery, part 3") and never initialized but since it is in the data it is all zeros. But we need this for -RT. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15Hang/soft lockup in d_invalidate with simultaneous callsAl Viro1-6/+4
It's not hard to trigger a bunch of d_invalidate() on the same dentry in parallel. They end up fighting each other - any dentry picked for removal by one will be skipped by the rest and we'll go for the next iteration through the entire subtree, even if everything is being skipped. Morevoer, we immediately go back to scanning the subtree. The only thing we really need is to dissolve all mounts in the subtree and as soon as we've nothing left to do, we can just unhash the dentry and bugger off. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-03fs: don't set *REFERENCED on single use objectsJosef Bacik1-2/+2
By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or inode we create. This is problematic as this means that it takes two trips through the LRU for any of these objects to be reclaimed, regardless of their actual lifetime. With enough pressure from these caches we can easily evict our working set from page cache with single use objects. So instead only set *REFERENCED if we've already been added to the LRU list. This means that we've been touched since the first time we were accessed, and so more likely to need to hang out in cache. To illustrate this issue I wrote the following scripts https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure on my test box. It is a single socket 4 core CPU with 16gib of RAM and I tested on an Intel 2tib NVME drive. The cache-pressure.sh script creates a new file system and creates 2 6.5gib files in order to take up 13gib of the 16gib of ram with pagecache. Then it runs a test program that reads these 2 files in a loop, and keeps track of how often it has to read bytes for each loop. On an ideal system with no pressure we should have to read 0 bytes indefinitely. The second thing this script does is start a fs_mark job that creates a ton of 0 length files, putting pressure on the system with slab only allocations. On exit the script prints out how many bytes were read by the read-file program. The results are as follows Without patch: /mnt/btrfs-test/reads/file1: total read during loops 27262988288 /mnt/btrfs-test/reads/file2: total read during loops 27262976000 With patch: /mnt/btrfs-test/reads/file2: total read during loops 18640457728 /mnt/btrfs-test/reads/file1: total read during loops 9565376512 This patch results in a 50% reduction of the amount of pages evicted from our working set. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-10mnt: Protect the mountpoint hashtable with mount_lockEric W. Biederman1-2/+5
Protecting the mountpoint hashtable with namespace_sem was sufficient until a call to umount_mnt was added to mntput_no_expire. At which point it became possible for multiple calls of put_mountpoint on the same hash chain to happen on the same time. Kristen Johansen <kjlx@templeofstupid.com> reported: > This can cause a panic when simultaneous callers of put_mountpoint > attempt to free the same mountpoint. This occurs because some callers > hold the mount_hash_lock, while others hold the namespace lock. Some > even hold both. > > In this submitter's case, the panic manifested itself as a GP fault in > put_mountpoint() when it called hlist_del() and attempted to dereference > a m_hash.pprev that had been poisioned by another thread. Al Viro observed that the simple fix is to switch from using the namespace_sem to the mount_lock to protect the mountpoint hash table. I have taken Al's suggested patch moved put_mountpoint in pivot_root (instead of taking mount_lock an additional time), and have replaced new_mountpoint with get_mountpoint a function that does the hash table lookup and addition under the mount_lock. The introduction of get_mounptoint ensures that only the mount_lock is needed to manipulate the mountpoint hashtable. d_set_mounted is modified to only set DCACHE_MOUNTED if it is not already set. This allows get_mountpoint to use the setting of DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry happens exactly once. Cc: stable@vger.kernel.org Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts") Reported-by: Krister Johansen <kjlx@templeofstupid.com> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds1-1/+1
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-04vfs: remove unused have_submounts() functionIan Kent1-33/+0
Now that path_has_submounts() has been added have_submounts() is no longer used so remove it. Link: http://lkml.kernel.org/r/20161011053428.27645.12310.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-04vfs: add path_has_submounts()Ian Kent1-0/+39
d_mountpoint() can only be used reliably to establish if a dentry is not mounted in any namespace. It isn't aware of the possibility there may be multiple mounts using the given dentry, possibly in a different namespace. Add function, path_has_submounts(), that checks is a struct path contains mounts (or is a mountpoint itself) to handle this case. Link: http://lkml.kernel.org/r/20161011053403.27645.55242.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-08-07Merge branch 'for-linus-2' of ↵Linus Torvalds1-40/+15
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted cleanups and fixes. In the "trivial API change" department - ->d_compare() losing 'parent' argument" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cachefiles: Fix race between inactivating and culling a cache object 9p: use clone_fid() 9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()" vfs: make dentry_needs_remove_privs() internal vfs: remove file_needs_remove_privs() vfs: fix deadlock in file_remove_privs() on overlayfs get rid of 'parent' argument of ->d_compare() cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare() affs ->d_compare(): don't bother with ->d_inode fold _d_rehash() and __d_rehash() together fold dentry_rcuwalk_invalidate() into its only remaining caller
2016-08-06Merge branch 'work.const-qstr' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull qstr constification updates from Al Viro: "Fairly self-contained bunch - surprising lot of places passes struct qstr * as an argument when const struct qstr * would suffice; it complicates analysis for no good reason. I'd prefer to feed that separately from the assorted fixes (those are in #for-linus and with somewhat trickier topology)" * 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: qstr: constify instances in adfs qstr: constify instances in lustre qstr: constify instances in f2fs qstr: constify instances in ext2 qstr: constify instances in vfat qstr: constify instances in procfs qstr: constify instances in fuse qstr constify instances in fs/dcache.c qstr: constify instances in nfs qstr: constify instances in ocfs2 qstr: constify instances in autofs4 qstr: constify instances in hfs qstr: constify instances in hfsplus qstr: constify instances in logfs qstr: constify dentry_init_security
2016-07-31get rid of 'parent' argument of ->d_compare()Al Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-30fold _d_rehash() and __d_rehash() togetherAl Viro1-23/+11
The only place where we feed to __d_rehash() something other than d_hash(dentry->d_name.hash) is __d_move(), where we give it d_hash of another dentry. Postpone rehashing until we'd switched the names and we are rid of that exception, along with the need to keep _d_rehash() and __d_rehash() separate. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-30fold dentry_rcuwalk_invalidate() into its only remaining callerAl Viro1-15/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-28Merge branch 'work.misc' of ↵Linus Torvalds1-139/+69
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted cleanups and fixes. Probably the most interesting part long-term is ->d_init() - that will have a bunch of followups in (at least) ceph and lustre, but we'll need to sort the barrier-related rules before it can get used for really non-trivial stuff. Another fun thing is the merge of ->d_iput() callers (dentry_iput() and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all except the one in __d_lookup_lru())" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) fs/dcache.c: avoid soft-lockup in dput() vfs: new d_init method vfs: Update lookup_dcache() comment bdev: get rid of ->bd_inodes Remove last traces of ->sync_page new helper: d_same_name() dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends() vfs: clean up documentation vfs: document ->d_real() vfs: merge .d_select_inode() into .d_real() unify dentry_iput() and dentry_unlink_inode() binfmt_misc: ->s_root is not going anywhere drop redundant ->owner initializations ufs: get rid of redundant checks orangefs: constify inode_operations missed comment updates from ->direct_IO() prototype change file_inode(f)->i_mapping is f->f_mapping trim fsnotify hooks a bit 9p: new helper - v9fs_parent_fid() debugfs: ->d_parent is never NULL or negative ...
2016-07-28Merge branch 'salted-string-hash'Linus Torvalds1-13/+10
This changes the vfs dentry hashing to mix in the parent pointer at the _beginning_ of the hash, rather than at the end. That actually improves both the hash and the code generation, because we can move more of the computation to the "static" part of the dcache setup, and do less at lookup runtime. It turns out that a lot of other hash users also really wanted to mix in a base pointer as a 'salt' for the hash, and so the slightly extended interface ends up working well for other cases too. Users that want a string hash that is purely about the string pass in a 'salt' pointer of NULL. * merge branch 'salted-string-hash': fs/dcache.c: Save one 32-bit multiply in dcache lookup vfs: make the string hashes salt the hash
2016-07-24fs/dcache.c: avoid soft-lockup in dput()Wei Fang1-2/+5
We triggered soft-lockup under stress test which open/access/write/close one file concurrently on more than five different CPUs: WARN: soft lockup - CPU#0 stuck for 11s! [who:30631] ... [<ffffffc0003986f8>] dput+0x100/0x298 [<ffffffc00038c2dc>] terminate_walk+0x4c/0x60 [<ffffffc00038f56c>] path_lookupat+0x5cc/0x7a8 [<ffffffc00038f780>] filename_lookup+0x38/0xf0 [<ffffffc000391180>] user_path_at_empty+0x78/0xd0 [<ffffffc0003911f4>] user_path_at+0x1c/0x28 [<ffffffc00037d4fc>] SyS_faccessat+0xb4/0x230 ->d_lock trylock may failed many times because of concurrently operations, and dput() may execute a long time. Fix this by replacing cpu_relax() with cond_resched(). dput() used to be sleepable, so make it sleepable again should be safe. Cc: <stable@vger.kernel.org> Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-24vfs: new d_init methodMiklos Szeredi1-0/+11
Allow filesystem to initialize dentry at allocation time. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-24Merge branch 'test.d_iput' into work.miscAl Viro1-35/+10
2016-07-21qstr constify instances in fs/dcache.cAl Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-01Merge branch 'd_real' of ↵Al Viro1-3/+0
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into work.misc
2016-07-01new helper: d_same_name()Al Viro1-91/+36
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-01dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()He Kuang1-4/+3
lockless_dereference() was added which can be used in place of hard-coding smp_read_barrier_depends(). Signed-off-by: He Kuang <hekuang@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-01Merge branch 'for-linus' into work.miscAl Viro1-13/+66
2016-06-30vfs: merge .d_select_inode() into .d_real()Miklos Szeredi1-3/+0
The two methods essentially do the same: find the real dentry/inode belonging to an overlay dentry. The difference is in the usage: vfs_open() uses ->d_select_inode() and expects the function to perform copy-up if necessary based on the open flags argument. file_dentry() uses ->d_real() passing in the overlay dentry as well as the underlying inode. vfs_rename() uses ->d_select_inode() but passes zero flags. ->d_real() with a zero inode would have worked just as well here. This patch merges the functionality of ->d_select_inode() into ->d_real() by adding an 'open_flags' argument to the latter. [Al Viro] Make the signature of d_real() match that of ->d_real() again. And constify the inode argument, while we are at it. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-06-20fix idiotic braino in d_alloc_parallel()Al Viro1-5/+12
Check for d_unhashed() while searching in in-lookup hash was absolutely wrong. Worse, it masked a deadlock on dget() done under bitlock that nests inside ->d_lock. Thanks to J. R. Okajima for spotting it. Spotted-by: "J. R. Okajima" <hooanon05g@gmail.com> Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-06-12fs/dcache.c: Save one 32-bit multiply in dcache lookupGeorge Spelvin1-1/+1
Noe that we're mixing in the parent pointer earlier, we don't need to use hash_32() to mix its bits. Instead, we can just take the msbits of the hash value directly. For those applications which use the partial_name_hash(), move the multiply to end_name_hash. Signed-off-by: George Spelvin <linux@sciencehorizons.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>