summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2014-12-11binfmt_misc: replace get_unused_fd() with get_unused_fd_flags(0)Yann Droneaud1-1/+1
This patch replaces calls to get_unused_fd() with equivalent call to get_unused_fd_flags(0) to preserve current behavor for existing code. In a further patch, get_unused_fd() will be removed so that new code start using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either by default or choosen by userspace. Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11proc: task_state: ptrace_parent() doesn't need pid_alive() checkOleg Nesterov1-7/+6
p->ptrace != 0 means that release_task(p) was not called, so pid_alive() buys nothing and we can remove this check. Other callers already use it directly without additional checks. Note: with or without this patch ptrace_parent() can return the pointer to the freed task, this will be explained/fixed later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11proc: task_state: move the main seq_printf() outside of rcu_read_lock()Oleg Nesterov1-6/+6
task_state() does seq_printf() under rcu_read_lock(), but this is only needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate tgid/ngid and drop rcu lock. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11proc: task_state: deuglify the max_fds calculationOleg Nesterov1-12/+11
1. The usage of fdt looks very ugly, it can't be NULL if ->files is not NULL. We can use "unsigned int max_fds" instead. 2. This also allows to move seq_printf(max_fds) outside of task_lock() and join it with the previous seq_printf(). See also the next patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11proc: task_state: read cred->group_info outside of task_lock()Oleg Nesterov1-2/+1
task_state() reads cred->group_info under task_lock() because a long ago it was task_struct->group_info and it was actually protected by task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just adds the confusion, move task_unlock() up. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11fs/proc.c: use rb_entry_safe() instead of rb_entry()Nicolas Dichtel1-12/+4
Better to use existing macro that rewriting them. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11procfs: fix error handling of proc_register()Debabrata Banerjee1-1/+8
proc_register() error paths are leaking inodes and directory refcounts. Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11fs/proc: use a rb tree for the directory entriesNicolas Dichtel4-64/+113
When a lot of netdevices are created, one of the bottleneck is the creation of proc entries. This serie aims to accelerate this part. The current implementation for the directories in /proc is using a single linked list. This is slow when handling directories with large numbers of entries (eg netdevice-related entries when lots of tunnels are opened). This patch replaces this linked list by a red-black tree. Here are some numbers: dummy30000.batch contains 30 000 times 'link add type dummy'. Before the patch: $ time ip -b dummy30000.batch real 2m31.950s user 0m0.440s sys 2m21.440s $ time rmmod dummy real 1m35.764s user 0m0.000s sys 1m24.088s After the patch: $ time ip -b dummy30000.batch real 2m0.874s user 0m0.448s sys 1m49.720s $ time rmmod dummy real 1m13.988s user 0m0.000s sys 1m1.008s The idea of improving this part was suggested by Thierry Herbelot. [akpm@linux-foundation.org: initialise proc_root.subdir at compile time] Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Thierry Herbelot <thierry.herbelot@6wind.com>. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11mm: fix huge zero page accounting in smaps reportKirill A. Shutemov1-36/+68
As a small zero page, huge zero page should not be accounted in smaps report as normal page. For small pages we rely on vm_normal_page() to filter out zero page, but vm_normal_page() is not designed to handle pmds. We only get here due hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not necessary compatible on each and every architecture. Let's add separate codepath to handle pmds. follow_trans_huge_pmd() will detect huge zero page for us. We would need pmd_dirty() helper to do this properly. The patch adds it to THP-enabled architectures which don't yet have one. [akpm@linux-foundation.org: use do_div to fix 32-bit build] Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengwei Yin <yfw.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11fs/char_dev.c: remove pointless assignment from __register_chrdev_region()Jan Kara1-1/+0
At one place we assign major number we found to ret. That assignment is then never used and actually doesn't make any sense given how the code is currently structured (the assignment comes from pre-git times). Just remove it. Coverity id: 1226852. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: remove unneeded NULL checkDan Carpenter1-1/+1
In commit 1faf289454b9 ("ocfs2_dlm: disallow a domain join if node maps mismatch") we introduced a new earlier NULL check so this one is not needed. Also static checkers complain because we dereference it first and then check for NULL. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: remove bogus NULL check in ocfs2_move_extents()Dan Carpenter1-3/+0
"inode" isn't NULL here, and also we dereference it on the previous line so static checkers get annoyed. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: do not set filesystem readonly if link downjiangyiwen2-2/+2
Do not set the filesystem readonly if the storage link is down. In this case, metadata is not corrupted and only -EIO is returned. And if it is indeed corrupted metadata, it has already called ocfs2_error() in ocfs2_validate_inode_block(). Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: do not set OCFS2_LOCK_UPCONVERT_FINISHING if nonblocking lock can not ↵Xue jiufei2-6/+37
be granted at once ocfs2_readpages() use nonblocking flag to avoid page lock inversion. It will trigger cluster hang because that flag OCFS2_LOCK_UPCONVERT_FINISHING is not cleared if nonblocking lock cannot be granted at once. The flag would prevent dc thread from downconverting. So other nodes cannot acheive this lockres for ever. So we should not set OCFS2_LOCK_UPCONVERT_FINISHING when receiving ast if nonblocking lock had already returned. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: fix error handling when creating debugfs root in ocfs2_init()Jan Kara1-1/+2
Error handling if creation of root of debugfs in ocfs2_init() fails is broken. Although error code is set we fail to exit ocfs2_init() with error and thus initialization ends with success. Later when mounting a filesystem, ocfs2 debugfs entries end up being created in the root of debugfs filesystem which is confusing. Fix the error handling to bail out. Coverity id: 1227009. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: remove filesize checks for sync I/O journal commitGoldwyn Rodrigues1-3/+1
Filesize is not a good indication that the file needs to be synced. An example where this breaks is: 1. Open the file in O_SYNC|O_RDWR 2. Read a small portion of the file (say 64 bytes) 3. Lseek to starting of the file 4. Write 64 bytes If the node crashes, it is not written out to disk because this was not committed in the journal and the other node which reads the file after recovery reads stale data (even if the write on the other node was successful) Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: o2net: fix connect expiredJunxiao Bi1-1/+1
Set nn_persistent_error to -ENOTCONN will stop reconnect since the "stop" condition in o2net_start_connect() will be true. stop = (nn->nn_sc || (nn->nn_persistent_error && (nn->nn_persistent_error != -ENOTCONN || timeout == 0))); This will make connection never be established if the first connection request is lost. Set nn_persistent_error to 0 when connect expired to fix this. With this changes, dlm will not be waken up when connect expired, this is OK since dlm depends on network, dlm can do nothing in this case if waken up. Let it wait there for network recover and connect built again to continue. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: o2dlm: fix a race between purge and master querySrinivas Eeda1-0/+12
Node A sends master query request to node B which is the master. At this time lockres happens to be on purgelist. dlm_master_request_handler gets the dlm spinlock, finds the resource and releases the dlm spin lock. Right at this dlm_thread on this node could purge the lockres. dlm_master_request_handler can then acquire lockres spinlock and reply to Node A that node B is the master even though lockres on node B is purged. The above scenario will now make node A falsely think node B is the master which is inconsistent. Further if another node C tries to master the same resource, every node will respond they are not the master. Node C then masters the resource and sends assert master to all nodes. This will now make node A crash with the following message. dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current owner is 10! Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Tested-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: report error from o2hb_do_disk_heartbeat() to userJan Kara1-2/+2
Report return value of o2hb_do_disk_heartbeat() as a part of ML_HEARTBEAT message so that we know whether a heartbeat actually happened or not. This also makes assigned but otherwise unused 'ret' variable used. Coverity id: 1227053. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: remove bogus test from ocfs2_read_locked_inode()Jan Kara1-2/+1
'args' are always set for ocfs2_read_locked_inode() and brelse() checks whether bh is NULL. So the test (args && bh) is unnecessary (plus the args part is really confusing anyway). Remove it. Coverity id: 1128856. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: Fix xattr check in ocfs2_get_xattr_nolock()Jan Kara1-1/+1
ocfs2_get_xattr_nolock() checks whether inode has any extended attributes (OCFS2_HAS_XATTR_FL). If not, it just sets 'ret' to -ENODATA but continues with checking inline and external attributes anyway (which is pointless although it does not harm). Just return immediately when we know there are no extended attributes in the inode. Coverity id: 1226906. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2: fix an off-by-one BUG_ON() statementDan Carpenter1-1/+1
The ->si_slots[] array is allocated in ocfs2_init_slot_info() it has "->max_slots" number of elements so this test should be >= instead of >. Static checker work. Compile tested only. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11ocfs2/dlm: let sender retry if dlm_dispatch_assert_master failed with -ENOMEMJoseph Qi1-5/+13
Do not BUG() if GFP_ATOMIC allocation fails in dlm_dispatch_assert_master. Instead, return -ENOMEM to the sender and then retry. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Alex Chen <alex.chen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11fs/cifs/smb2file.c: replace count*size kzalloc by kcallocFabian Frederick1-2/+2
kcalloc manages count*sizeof overflow. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Steve French <sfrench@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11fs/cifs/file.c: replace count*size kzalloc by kcallocFabian Frederick1-2/+2
kcalloc manages count*sizeof overflow. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Steve French <sfrench@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11fs/cifs: remove obsolete __constantFabian Frederick7-47/+47
Replace all __constant_foo to foo() except in smb2status.h (1700 lines to update). Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Steve French <sfrench@samba.org> Cc: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10Merge branch 'x86-mpx-for-linus' of ↵Linus Torvalds2-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MPX support from Thomas Gleixner: "This enables support for x86 MPX. MPX is a new debug feature for bound checking in user space. It requires kernel support to handle the bound tables and decode the bound violating instruction in the trap handler" * 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: asm-generic: Remove asm-generic arch_bprm_mm_init() mm: Make arch_unmap()/bprm_mm_init() available to all architectures x86: Cleanly separate use of asm-generic/mm_hooks.h x86 mpx: Change return type of get_reg_offset() fs: Do not include mpx.h in exec.c x86, mpx: Add documentation on Intel MPX x86, mpx: Cleanup unused bound tables x86, mpx: On-demand kernel allocation of bounds tables x86, mpx: Decode MPX instruction to get bound violation information x86, mpx: Add MPX-specific mmap interface x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific x86, mpx: Add MPX to disabled features ia64: Sync struct siginfo with general version mips: Sync struct siginfo with general version mpx: Extend siginfo structure to include bound violation information x86, mpx: Rename cfg_reg_u and status_reg x86: mpx: Give bndX registers actual names x86: Remove arbitrary instruction size limit in instruction decoder
2014-12-10Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-5/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle are: - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y. This instruments might_sleep() checks to catch places that nest blocking primitives - such as mutex usage in a wait loop. Such bugs can result in hard to debug races/hangs. Another category of invalid nesting that this facility will detect is the calling of blocking functions from within schedule() -> sched_submit_work() -> blk_schedule_flush_plug(). There's some potential for false positives (if secondary blocking primitives themselves are not ready yet for this facility), but the kernel will warn once about such bugs per bootup, so the warning isn't much of a nuisance. This feature comes with a number of fixes, for problems uncovered with it, so no messages are expected normally. - Another round of sched/numa optimizations and refinements, for CONFIG_NUMA_BALANCING=y. - Another round of sched/dl fixes and refinements. Plus various smaller fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched: Add missing rcu protection to wake_up_all_idle_cpus sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK sched/numa: Init numa balancing fields of init_task sched/deadline: Remove unnecessary definitions in cpudeadline.h sched/cpupri: Remove unnecessary definitions in cpupri.h sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task() sched/fair: Fix stale overloaded status in the busiest group finding logic sched: Move p->nr_cpus_allowed check to select_task_rq() sched/completion: Document when to use wait_for_completion_io_*() sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC sched/fair: Kill task_struct::numa_entry and numa_group::task_list sched: Refactor task_struct to use numa_faults instead of numa_* pointers sched/deadline: Don't check CONFIG_SMP in switched_from_dl() sched/deadline: Reschedule from switched_from_dl() after a successful pull sched/deadline: Push task away if the deadline is equal to curr during wakeup sched/deadline: Add deadline rq status print sched/deadline: Fix artificial overrun introduced by yield_task_dl() sched/rt: Clean up check_preempt_equal_prio() sched/core: Use dl_bw_of() under rcu_read_lock_sched() sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu ...
2014-12-03fat: fix oops on corrupted vfat fsAl Viro1-9/+11
a) don't bother with ->d_time for positives - we only check it for negatives anyway. b) make sure to set it at unlink and rmdir time - at *that* point soon-to-be negative dentry matches then-current directory contents c) don't go into renaming of old alias in vfat_lookup() unless it has the same parent (which it will, unless we are seeing corrupted image) [hirofumi@mail.parknet.co.jp: make change minimum, don't call d_move() for dir] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: <stable@vger.kernel.org> [3.17.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-02Merge tag 'ext4_for_linus_urgent' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfix from Ted Ts'o: "Fix an ext4 metadata checksum regression introduced in v3.18-rc3" * tag 'ext4_for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: fix regression where we fail to initialize checksum seed when loading
2014-12-02jbd2: fix regression where we fail to initialize checksum seed when loadingDarrick J. Wong1-3/+2
When we're enabling journal features, we cannot use the predicate jbd2_journal_has_csum_v2or3() because we haven't yet set the sb feature flag fields! Moreover, we just finished loading the shash driver, so the test is unnecessary; calculate the seed always. Without this patch, we fail to initialize the checksum seed the first time we turn on journal_checksum, which means that all journal blocks written during that first mount are corrupt. Transactions written after the second mount will be fine, since the feature flag will be set in the journal superblock. xfstests generic/{034,321,322} are the regression tests. (This is important for 3.18.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM> Reported-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-30btrfs: zero out left over bytes after processing compression streamsChris Mason4-5/+67
Don Bailey noticed that our page zeroing for compression at end-io time isn't complete. This reworks a patch from Linus to push the zeroing into the zlib and lzo specific functions instead of trying to handle the corners inside btrfs_decompress_buf2page Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reported-by: Don A. Bailey <donb@securitymouse.com> cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-26Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2-5/+12
Pull nfsd bugfixes from Bruce Fields: "These fix one mishandling of the case when security labels are configured out, and two races in the 4.1 backchannel code" * 'for-3.18' of git://linux-nfs.org/~bfields/linux: nfsd: Fix slot wake up race in the nfsv4.1 callback code SUNRPC: Fix locking around callback channel reply receive nfsd: correctly define v4.2 support attributes
2014-11-26Merge git://git.kvack.org/~bcrl/aio-fixesLinus Torvalds1-7/+14
Pull aio fix from Ben LaHaise: "Dirty page accounting fix for aio" * git://git.kvack.org/~bcrl/aio-fixes: aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer
2014-11-23Merge branch 'for-linus' of ↵Linus Torvalds3-15/+25
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs deadlock fix from Chris Mason: "This has a fix for a long standing deadlock that we've been trying to nail down for a while. It ended up being a bad interaction with the fair reader/writer locks and the order btrfs reacquires locks in the btree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix lockups from btrfs_clear_path_blocking
2014-11-21Merge branch 'overlayfs-current' of ↵Al Viro28-347/+400
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-linus "The biggest change is to rename the filesystem from "overlayfs" to "overlay". This will allow legacy overlayfs to be easily carried by distros alongside the new mainline one. Also fix a couple of copy-up races and allow escaping comma character in filenames." The last bit is about commas in pathname mount options...
2014-11-20ovl: ovl_dir_fsync() cleanupMiklos Szeredi1-2/+2
Check against !OVL_PATH_LOWER instead of OVL_PATH_MERGE. For a copied up directory the two are currently equivalent. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20ovl: pass dentry into ovl_dir_read_merged()Miklos Szeredi1-21/+14
Pass dentry into ovl_dir_read_merged() insted of upperpath and lowerpath. This cleans up callers and paves the way for multi-layer directory reads. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20ovl: use lockless_dereference() for upperdentryMiklos Szeredi1-6/+1
Don't open code lockless_dereference() in ovl_upperdentry_dereference(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20ovl: allow filenames with commaMiklos Szeredi1-3/+45
Allow option separator (comma) to be escaped with backslash. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20ovl: fix race in private xattr checksMiklos Szeredi1-9/+18
Xattr operations can race with copy up. This does not matter as long as we consistently fiter out "trunsted.overlay.opaque" attribute on upper directories. Previously we checked parent against OVL_PATH_MERGE. This is too general, and prone to race with copy-up. I.e. we found the parent to be on the lower layer but ovl_dentry_real() would return the copied-up dentry, possibly with the "opaque" attribute. So instead use ovl_path_real() and decide to filter the attributes based on the actual type of the dentry we'll use. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20ovl: fix remove/copy-up raceMiklos Szeredi1-12/+19
ovl_remove_and_whiteout() needs to check if upper dentry exists or not after having locked upper parent directory. Previously we used a "type" value computed before locking the upper parent directory, which is susceptible to racing with copy-up. There's a similar check in ovl_check_empty_and_clear(). This one is not actually racy, since copy-up doesn't change the "emptyness" property of a directory. Add a comment to this effect, and check the existence of upper dentry locally to make the code cleaner. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-11-20ovl: rename filesystem type to "overlay"Miklos Szeredi4-7/+7
Some distributions carry an "old" format of overlayfs while mainline has a "new" format. The distros will possibly want to keep the old overlayfs alongside the new for compatibility reasons. To make it possible to differentiate the two versions change the name of the new one from "overlayfs" to "overlay". Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Andy Whitcroft <apw@canonical.com>
2014-11-19nfsd: Fix slot wake up race in the nfsv4.1 callback codeTrond Myklebust1-2/+6
The currect code for nfsd41_cb_get_slot() and nfsd4_cb_done() has no locking in order to guarantee atomicity, and so allows for races of the form. Task 1 Task 2 ====== ====== if (test_and_set_bit(0) != 0) { clear_bit(0) rpc_wake_up_next(queue) rpc_sleep_on(queue) return false; } This patch breaks the race condition by adding a retest of the bit after the call to rpc_sleep_on(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-11-19btrfs: fix lockups from btrfs_clear_path_blockingChris Mason3-15/+25
The fair reader/writer locks mean that btrfs_clear_path_blocking needs to strictly follow lock ordering rules even when we already have blocking locks on a given path. Before we can clear a blocking lock on the path, we need to make sure all of the locks have been converted to blocking. This will remove lock inversions against anyone spinning in write_lock() against the buffers we're trying to get read locks on. These inversions didn't exist before the fair read/writer locks, but now we need to be more careful. We papered over this deadlock in the past by changing btrfs_try_read_lock() to be a true trylock against both the spinlock and the blocking lock. This was slower, and not sufficient to fix all the deadlocks. This patch adds a btrfs_tree_read_lock_atomic(), which basically means get the spinlock but trylock on the blocking lock. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Reported-by: Patrick Schmid <schmid@phys.ethz.ch> cc: stable@vger.kernel.org #v3.15+
2014-11-19isofs: avoid unused function warningArnd Bergmann1-21/+21
With the isofs_hash() function removed, isofs_hash_ms() is the only user of isofs_hash_common(), but it's defined inside of an #ifdef, which triggers this gcc warning in ARM axm55xx_defconfig starting with v3.18-rc3: fs/isofs/inode.c:177:1: warning: 'isofs_hash_common' defined but not used [-Wunused-function] This patch moves the function inside of the same #ifdef section to avoid that warning, which seems the best compromise of a relatively harmless patch for a late -rc. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: b0afd8e5db7b ("isofs: don't bother with ->d_op for normal case") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19vfs: fix reference leak in d_prune_aliases()Yan, Zheng1-0/+1
In "d_prune_alias(): just lock the parent and call __dentry_kill()" the old dget + d_drop + dput has been replaced with lock_parent + __dentry_kill; unfortunately, dput() does more than just killing dentry - it also drops the reference to parent. New variant leaks that reference and needs dput(parent) after killing the child off. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19nfsd: correctly define v4.2 support attributesChristoph Hellwig1-3/+6
Even when security labels are disabled we support at least the same attributes as v4.1. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-11-18fs: Do not include mpx.h in exec.cDave Hansen1-1/+0
We no longer need mpx.h in exec.c. This will obviously also break the build for non-x86 builds. We get the MPX includes that we need from mmu_context.h now. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141118003608.837015B3@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-18x86, mpx: On-demand kernel allocation of bounds tablesDave Hansen1-0/+2
This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables *need* to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly *require* anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this *could* be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are *HUGE*. An X-GB virtual area needs 4*X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>