summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2014-01-11kernfs: fix get_active failure handling in kernfs_seq_*()Tejun Heo1-7/+44
When kernfs_seq_start() fails to obtain an active reference, it returns ERR_PTR(-ENODEV). kernfs_seq_stop() is then invoked with the error pointer value; however, it still proceeds to invoke kernfs_put_active() on the node leading to unbalanced put. If kernfs_seq_stop() is called even after active ref failure, it should skip invocation of @ops->seq_stop() and put_active. Unfortunately, this is a bit complicated because active ref failure isn't the only thing which may fail with ERR_PTR(-ENODEV). @ops->seq_start/next() may also fail with the error value and kernfs_seq_stop() doesn't have a way to tell apart those failures. Work it around by factoring out the active part of kernfs_seq_stop() into kernfs_seq_stop_active() and invoking it directly if @ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating kernfs_seq_stop() to skip kernfs_seq_stop_active() on ERR_PTR(-ENODEV). This is a bit nasty but ensures that the active put is skipped iff get_active failed in kernfs_seq_start(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-24Merge 3.13-rc5 into staging-nextGreg Kroah-Hartman27-269/+409
We want these fixes here to handle some merge issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-22Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds1-9/+46
Pull AIO leak fixes from Ben LaHaise: "I've put these two patches plus Linus's change through a round of tests, and it passes millions of iterations of the aio numa migratepage test, as well as a number of repetitions of a few simple read and write tests. The first patch fixes the memory leak Kent introduced, while the second patch makes aio_migratepage() much more paranoid and robust" * git://git.kvack.org/~bcrl/aio-next: aio/migratepages: make aio migrate pages sane aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
2013-12-22aio: clean up and fix aio_setup_ring page mappingLinus Torvalds1-35/+23
Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages migration") the aio ring setup code has used a special per-ring backing inode for the page allocations, rather than just using random anonymous pages. However, rather than remembering the pages as it allocated them, it would allocate the pages, insert them into the file mapping (dirty, so that they couldn't be free'd), and then forget about them. And then to look them up again, it would mmap the mapping, and then use "get_user_pages()" to get back an array of the pages we just created. Now, not only is that incredibly inefficient, it also leaked all the pages if the mmap failed (which could happen due to excessive number of mappings, for example). So clean it all up, making it much more straightforward. Also remove some left-overs of the previous (broken) mm_populate() usage that was removed in commit d6c355c7dabc ("aio: fix race in ring buffer page lookup introduced by page migration support") but left the pointless and now misleading MAP_POPULATE flag around. Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-22aio/migratepages: make aio migrate pages saneBenjamin LaHaise1-8/+44
The arbitrary restriction on page counts offered by the core migrate_page_move_mapping() code results in rather suspicious looking fiddling with page reference counts in the aio_migratepage() operation. To fix this, make migrate_page_move_mapping() take an extra_count parameter that allows aio to tell the code about its own reference count on the page being migrated. While cleaning up aio_migratepage(), make it validate that the old page being passed in is actually what aio_migratepage() expects to prevent misbehaviour in the case of races. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-12-22aio: fix kioctx leak introduced by "aio: Fix a trinity splat"Benjamin LaHaise1-1/+2
e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference counting to correct a bug trinity found. Unfortunately, the change lead to kioctxes being leaked because there was no final reference count to put. Add that reference count back in to fix things. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
2013-12-21Merge tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfsLinus Torvalds10-83/+167
Pull xfs bugfixes from Ben Myers: "This contains fixes for some asserts related to project quotas, a memory leak, a hang when disabling group or project quotas before disabling user quotas, Dave's email address, several fixes for the alignment of file allocation to stripe unit/width geometry, a fix for an assertion with xfs_zero_remaining_bytes, and the behavior of metadata writeback in the face of IO errors. Details: - fix memory leak in xfs_dir2_node_removename - fix quota assertion in xfs_setattr_size - fix quota assertions in xfs_qm_vop_create_dqattach - fix for hang when disabling group and project quotas before disabling user quotas - fix Dave Chinner's email address in MAINTAINERS - fix for file allocation alignment - fix for assertion in xfs_buf_stale by removing xfsbdstrat - fix for alignment with swalloc mount option - fix for "retry forever" semantics on IO errors" * tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs: xfs: abort metadata writeback on permanent errors xfs: swalloc doesn't align allocations properly xfs: remove xfsbdstrat error xfs: align initial file allocations correctly MAINTAINERS: fix incorrect mail address of XFS maintainer xfs: fix infinite loop by detaching the group/project hints from user dquot xfs: fix assertion failure at xfs_setattr_nonsize xfs: fix false assertion at xfs_qm_vop_create_dqattach xfs: fix memory leak in xfs_dir2_node_removename
2013-12-21pstore: Don't allow high traffic options on fragile devicesLuck, Tony1-2/+5
Some pstore backing devices use on board flash as persistent storage. These have limited numbers of write cycles so it is a poor idea to use them from high frequency operations. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-19Merge tag 'driver-core-3.13-rc5' of ↵Linus Torvalds1-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "Here's a single sysfs fix for 3.13-rc5 that resolves a lockdep issue in sysfs that has been reported" * tag 'driver-core-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: sysfs: give different locking key to regular and bin files
2013-12-17Merge branch 'for-linus' of ↵Linus Torvalds2-80/+64
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull two Ceph fixes from Sage Weil: "One of these is fixing a regression from the d_flags file type patch that went into -rc1 that broke instantiation of inodes and dentries (we were doing dentries first). The other is just an off-by-one corner case" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: Avoid data inconsistency due to d-cache aliasing in readpage() ceph: initialize inode before instantiating dentry
2013-12-17kernfs: add kernfs_dir_opsTejun Heo2-2/+44
Add support for mkdir(2), rmdir(2) and rename(2) syscalls. This is implemented through optional kernfs_dir_ops callback table which can be specified on kernfs_create_root(). An implemented callback is invoked when the matching syscall is invoked. As kernfs keep dcache syncs with internal representation and revalidates dentries on each access, the implementation of these methods is extremely simple. Each just discovers the relevant kernfs_node(s) and invokes the requested callback which is allowed to do any kernfs operations and the end result doesn't necessarily have to match the expected semantics of the syscall. This will be used to convert cgroup to use kernfs instead of its own filesystem implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: allow negative dentriesTejun Heo1-24/+13
kernfs doesn't allow negative dentries - kernfs_iop_lookup() returns ERR_PTR(-ENOENT) instead of NULL which short-circuits negative dentry creation and kernfs's d_delete() callback, kernfs_dop_delete(), returns 1 for all removed nodes. This in turn allows kernfs_dop_revalidate() to assume that there's no negative dentry for kernfs. This worked fine for sysfs but kernfs is scheduled to grow mkdir(2) support which depend on negative dentries. This patch updates so that kernfs allows negative dentries. The required changes are almost trivial - kernfs_iop_lookup() now returns NULL instead of ERR_PTR(-ENOENT) when the target kernfs_node doesn't exist, kernfs_dop_delete() is removed and kernfs_dop_revalidate() is updated to check whether the target dentry is negative and request fresh lookup if so. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: update kernfs_rename_ns() to consider KERNFS_STATIC_NAMETejun Heo1-1/+5
kernfs_rename_ns() currently assumes that the target sysfs_dirent has a copied name. This has been okay because sysfs supports rename only for directories which always have copied names; however, there's nothing in kernfs interface which calls for such restriction and currently invoking kernfs_rename_ns() on a regular file leads to oops because it ends up trying to kfree() a static name. This patch updates kernfs_rename_ns() so that it skips kfree() of the old name if it's static. This allows it to be used for all node types. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: mark static names with KERNFS_STATIC_NAMETejun Heo4-15/+22
Because sysfs used struct attribute which are supposed to stay constant, sysfs didn't copy names when creating regular files. The specified string for name was supposed to stay constant. Such distinction isn't inherent for kernfs. kernfs_create_file[_ns]() should be able to take the same @name as kernfs_create_dir[_ns]() As there can be huge number of sysfs attributes, we still want to be able to use static names for sysfs attributes. This patch renames kernfs_create_file_ns_key() to __kernfs_create_file() and adds @name_is_static parameter so that the caller can explicitly indicate that @name can be used without copying. kernfs is updated to use KERNFS_STATIC_NAME to distinguish static and copied names. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: add REMOVED check to create and rename pathsTejun Heo1-0/+7
kernfs currently assumes that the caller doesn't try to create a new node under a removed parent, rename a removed node, or move a node under a removed node. While this works fine for sysfs, it'd be nice to have protection against such cases especially given that kernfs is planned to add support for mkdir, rmdir and rename requsts from userland which may make race conditions more likely. This patch updates create and rename paths to check REMOVED and fail the operation with -ENOENT if performed on or towards removed nodes. Note that remove path already has such check. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: add @mode to kernfs_create_dir[_ns]()Tejun Heo3-6/+9
sysfs assumed 0755 for all newly created directories and kernfs inherited it. This assumption is unnecessarily restrictive and inconsistent with kernfs_create_file[_ns](). This patch adds @mode parameter to kernfs_create_dir[_ns]() and update uses in sysfs accordingly. Among others, this will be useful for implementations of the planned ->mkdir() method. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17xfs: abort metadata writeback on permanent errorsDave Chinner3-5/+32
If we are doing aysnc writeback of metadata, we can get write errors but have nobody to report them to. At the moment, we simply attempt to reissue the write from io completion in the hope that it's a transient error. When it's not a transient error, the buffer is stuck forever in this loop, and we cannot break out of it. Eventually, unmount will hang because the AIL cannot be emptied and everything goes downhill from them. To solve this problem, only retry the write IO once before aborting it. We don't throw the buffer away because some transient errors can last minutes (e.g. FC path failover) or even hours (thin provisioned devices that have run out of backing space) before they go away. Hence we really want to keep trying until we can't try any more. Because the buffer was not cleaned, however, it does not get removed from the AIL and hence the next pass across the AIL will start IO on it again. As such, we still get the "retry forever" semantics that we currently have, but we allow other access to the buffer in the mean time. Meanwhile the filesystem can continue to modify the buffer and relog it, so the IO errors won't hang the log or the filesystem. Now when we are pushing the AIL, we can see all these "permanent IO error" buffers and we can issue a warning about failures before we retry the IO. We can also catch these buffers when unmounting an issue a corruption warning, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: swalloc doesn't align allocations properlyDave Chinner1-6/+17
When swalloc is specified as a mount option, allocations are supposed to be aligned to the stripe width rather than the stripe unit of the underlying filesystem. However, it does not do this. What the implementation does is round up the allocation size to a stripe width, hence ensuring that all allocations span a full stripe width. It does not, however, ensure that that allocation is aligned to a stripe width, and hence the allocations can span multiple underlying stripes and so still see RMW cycles for things like direct IO on MD RAID. So, if the swalloc mount option is set, change the allocation alignment in xfs_bmap_btalloc() to use the stripe width rather than the stripe unit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: remove xfsbdstrat errorChristoph Hellwig5-29/+43
The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that handles the case of a shut down filesystem. Most of the users have private, uncached buffers that can just be freed in this case, but the complex error handling in xfs_bioerror_relse messes up the case when it's called without a locked buffer. Remove xfsbdstrat and opencode the error handling in the callers. All but one can simply return an error and don't need to deal with buffer state, and the one caller that cares about the buffer state could do with a major cleanup as well, but we'll defer that to later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: align initial file allocations correctlyDave Chinner1-2/+7
The function xfs_bmap_isaeof() is used to indicate that an allocation is occurring at or past the end of file, and as such should be aligned to the underlying storage geometry if possible. Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the behaviour of this function for empty files - it turned off allocation alignment for this case accidentally. Hence large initial allocations from direct IO are not getting correctly aligned to the underlying geometry, and that is cause write performance to drop in alignment sensitive configurations. Fix it by considering allocation into empty files as requiring aligned allocation again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit f9b395a8ef8f34d19cae2cde361e19c96e097fad)
2013-12-17xfs: fix infinite loop by detaching the group/project hints from user dquotJie Liu1-21/+50
xfs_quota(8) will hang up if trying to turn group/project quota off before the user quota is off, this could be 100% reproduced by: # mount -ouquota,gquota /dev/sda7 /xfs # mkdir /xfs/test # xfs_quota -xc 'off -g' /xfs <-- hangs up # echo w > /proc/sysrq-trigger # dmesg SysRq : Show Blocked State task PC stack pid father xfs_quota D 0000000000000000 0 27574 2551 0x00000000 [snip] Call Trace: [<ffffffff81aaa21d>] schedule+0xad/0xc0 [<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0 [<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0 [<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0 [<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs] [<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30 [<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs] [<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs] [<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs] [<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs] [<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs] [<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs] [<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs] [<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs] [<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs] [<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0 [<ffffffff812fba4b>] ? iput+0x5b/0x90 [<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0 [<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b It's fine if we turn user quota off at first, then turn off other kind of quotas if they are enabled since the group/project dquot refcount is decreased to zero once the user quota if off. Otherwise, those dquots refcount is non-zero due to the user dquot might refer to them as hint(s). Hence, above operation cause an infinite loop at xfs_qm_dquot_walk() while trying to purge dquot cache. This problem has been around since Linux 3.4, it was introduced by: [ b84a3a9675 xfs: remove the per-filesystem list of dquots ] Originally we will release the group dquot pointers because the user dquots maybe carrying around as a hint via xfs_qm_detach_gdquots(). However, with above change, there is no such work to be done before purging group/project dquot cache. In order to solve this problem, this patch introduces a special routine xfs_qm_dqpurge_hints(), and it would release the group/project dquot pointers the user dquots maybe carrying around as a hint, and then it will proceed to purge the user dquot cache if requested. Cc: stable@vger.kernel.org Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit df8052e7dae00bde6f21b40b6e3e1099770f3afc)
2013-12-17xfs: fix assertion failure at xfs_setattr_nonsizeJie Liu1-1/+2
For CRC enabled v5 super block, change a file's ownership can simply trigger an ASSERT failure at xfs_setattr_nonsize() if both group and project quota are enabled, i.e, [ 305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621 [ 305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable] [ 305.383939] Call Trace: [ 305.385536] [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs] [ 305.387142] [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs] [ 305.388727] [<ffffffff811ca388>] notify_change+0x1a8/0x350 [ 305.390298] [<ffffffff811ac39d>] chown_common+0xfd/0x110 [ 305.391868] [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110 [ 305.393440] [<ffffffff811ad760>] SyS_lchown+0x20/0x30 [ 305.394995] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f [ 305.399870] RIP [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs] This fix adjust the assertion to check if the super block support both quota inodes or not. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 5a01dd54f4a7fb513062070c5acef20d13cad980)
2013-12-17xfs: fix false assertion at xfs_qm_vop_create_dqattachJie Liu1-6/+3
After the previous fix, there still has another ASSERT failure if turning off any type of quota while fsstress is running at the same time. Backtrace in this case: [ 50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118 [ 50.867924] ------------[ cut here ]------------ ... <snip> [ 50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable] [ 50.867999] invalid opcode: 0000 [#1] SMP [ 50.869407] Call Trace: [ 50.869446] [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs] [ 50.869512] [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs] [ 50.869564] [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs] [ 50.869615] [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs] [ 50.869655] [<ffffffff811becd5>] vfs_mkdir+0x95/0x130 [ 50.869689] [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0 [ 50.869723] [<ffffffff811bf689>] SyS_mkdir+0x19/0x20 [ 50.869757] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f [ 50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip> [ 50.870003] RIP [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs] [ 50.870050] RSP <ffff88002941fd60> [ 50.879251] ---[ end trace c93a2b342341c65b ]--- We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(), however the assertion itself is not right IMHO. While performing quota off, we firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking any special locks, see xfs_qm_scall_quotaoff(). Hence there is no guarantee that the desired quota is still active. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 37eb9706ebf5b99d14c6086cdeef2c2f73f9c9fb)
2013-12-17xfs: fix memory leak in xfs_dir2_node_removenameMark Tinguely1-13/+13
Fix the leak of kernel memory in xfs_dir2_node_removename() when xfs_dir2_leafn_remove() returns an error code. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit ef701600fd26cace9d513ee174688a2b83832126)
2013-12-13ceph: Avoid data inconsistency due to d-cache aliasing in readpage()Li Wang1-2/+6
If the length of data to be read in readpage() is exactly PAGE_CACHE_SIZE, the original code does not flush d-cache for data consistency after finishing reading. This patches fixes this. Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-13ceph: initialize inode before instantiating dentryYan, Zheng1-78/+58
commit b18825a7c8 (Put a small type field into struct dentry::d_flags) put a type field into struct dentry::d_flags. __d_instantiate() set the field by checking inode->i_mode. So we should initialize inode before instantiating dentry when handling mds reply. Fixes: http://tracker.ceph.com/issues/6930 Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-13Merge branch 'akpm' (fixes from Andrew)Linus Torvalds1-5/+9
Merge patches from Andrew Morton: "13 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: memcg: do not allow task about to OOM kill to bypass the limit mm: memcg: fix race condition between memcg teardown and swapin thp: move preallocated PTE page table on move_huge_pmd() mfd/rtc: s5m: fix register updating by adding regmap for RTC rtc: s5m: enable IRQ wake during suspend rtc: s5m: limit endless loop waiting for register update rtc: s5m: fix unsuccesful IRQ request during probe drivers/rtc/rtc-s5m.c: fix info->rtc assignment include/linux/kernel.h: make might_fault() a nop for !MMU drivers/rtc/rtc-at91rm9200.c: correct alarm over day/month wrap procfs: also fix proc_reg_get_unmapped_area() for !MMU case mm: memcg: do not declare OOM from __GFP_NOFAIL allocations include/linux/hugetlb.h: make isolate_huge_page() an inline
2013-12-13procfs: also fix proc_reg_get_unmapped_area() for !MMU caseJan Beulich1-5/+9
Commit fad1a86e25e0 ("procfs: call default get_unmapped_area on MMU-present architectures"), as its title says, took care of only the MMU case, leaving the !MMU side still in the regressed state (returning -EIO in all cases where pde->proc_fops->get_unmapped_area is NULL). From the fad1a86e25e0 changelog: "Commit c4fe24485729 ("sparc: fix PCI device proc file mmap(2)") added proc_reg_get_unmapped_area in proc_reg_file_ops and proc_reg_file_ops_no_compat, by which now mmap always returns EIO if get_unmapped_area method is not defined for the target procfs file, which causes regression of mmap on /proc/vmcore. To address this issue, like get_unmapped_area(), call default current->mm->get_unmapped_area on MMU-present architectures if pde->proc_fops->get_unmapped_area, i.e. the one in actual file operation in the procfs file, is not defined" Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: <stable@vger.kernel.org> [3.12.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-13Merge branch 'for-linus' of ↵Linus Torvalds5-42/+73
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small collection of fixes. It was rebased this morning, but I was just fixing signed-off-by tags with the wrong email" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix access_ok() check in btrfs_ioctl_send() Btrfs: make sure we cleanup all reloc roots if error happens Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation Btrfs: fix an oops when doing balance relocation Btrfs: don't miss skinny extent items on delayed ref head contention btrfs: call mnt_drop_write after interrupted subvol deletion Btrfs: don't clear the default compression type
2013-12-13Merge branch 'for-3.13' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-1/+8
Pull nfsd reply cache bugfix from Bruce Fields: "One bugfix for nfsd crashes" * 'for-3.13' of git://linux-nfs.org/~bfields/linux: nfsd: when reusing an existing repcache entry, unhash it first
2013-12-12dcache: allow word-at-a-time name hashing with big-endian CPUsWill Deacon2-7/+2
When explicitly hashing the end of a string with the word-at-a-time interface, we have to be careful which end of the word we pick up. On big-endian CPUs, the upper-bits will contain the data we're after, so ensure we generate our masks accordingly (and avoid hashing whatever random junk may have been sitting after the string). This patch adds a new dcache helper, bytemask_from_count, which creates a mask appropriate for the CPU endianness. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12Merge tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfsLinus Torvalds4-5/+12
Pull xfs bugfixes from Ben Myers: - fix for buffer overrun in agfl with growfs on v4 superblock - return EINVAL if requested discard length is less than a block - fix possible memory corruption in xfs_attrlist_by_handle() * tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfs: xfs: growfs overruns AGFL buffer on V4 filesystems xfs: don't perform discard if the given range length is less than block size xfs: underflow bug in xfs_attrlist_by_handle()
2013-12-12Btrfs: fix access_ok() check in btrfs_ioctl_send()Dan Carpenter1-2/+2
The closing parenthesis is in the wrong place. We want to check "sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of "sizeof(*arg->clone_sources * arg->clone_sources_count)". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org
2013-12-12Btrfs: make sure we cleanup all reloc roots if error happensWang Shilong1-0/+7
I hit an oops when merging reloc roots fails, the reason is that new reloc roots may be added and we should make sure we cleanup all reloc roots. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12Btrfs: skip building backref tree for uuid and quota tree when doing balance ↵Wang Shilong1-1/+3
relocation Quota tree and UUID Tree is only cowed, they can not be snapshoted. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12Btrfs: fix an oops when doing balance relocationWang Shilong1-23/+47
I hit an oops when inserting reloc root into @reloc_root_tree(it can be easily triggered when forcing cow for relocation root) [ 866.494539] [<ffffffffa0499579>] btrfs_init_reloc_root+0x79/0xb0 [btrfs] [ 866.495321] [<ffffffffa044c240>] record_root_in_trans+0xb0/0x110 [btrfs] [ 866.496109] [<ffffffffa044d758>] btrfs_record_root_in_trans+0x48/0x80 [btrfs] [ 866.496908] [<ffffffffa0494da8>] select_reloc_root+0xa8/0x210 [btrfs] [ 866.497703] [<ffffffffa0495c8a>] do_relocation+0x16a/0x540 [btrfs] This is because reloc root inserted into @reloc_root_tree is not within one transaction,reloc root may be cowed and root block bytenr will be reused then oops happens.We should update reloc root in @reloc_root_tree when cow reloc root node, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12Btrfs: don't miss skinny extent items on delayed ref head contentionFilipe David Borba Manana1-12/+10
Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup of skinny extent items. This can happen when the execution flow is the following: * We do an extent tree lookup and fail to find a skinny extent item; * As a result, we attempt to see if a non-skinny extent item exists, either by looking at previous item in the leaf or by doing another full extent tree search; * We have a transaction and then we check for a matching delayed ref head in the transaction's delayed refs rbtree; * We find such delayed ref head and then we try to lock it with a call to mutex_trylock(); * The lock was contended so we jump to the label "again", which repeats the extent tree search but for a non-skinny extent item, because we set previously metadata variable to 0 and the search key to look for a non-skinny extent-item; * After the jump (and after releasing the transaction's delayed refs lock), a skinny extent item might have been added to the extent tree but we will miss it because metadata is set to 0 and the search key is set for a non-skinny extent-item. The fix here is to not reset metadata to 0 and to jump to the initial search key setup if the delayed ref head is contended, instead of jumping directly to the extent tree search label ("again"). This issue was found while investigating the issue reported at Bugzilla 64961. David Sterba suspected this function was missing extent items, and that this could be caused by the last change to this function, which was made in the following patch: [PATCH] Btrfs: optimize btrfs_lookup_extent_info() (commit 74be9510876a66ad9826613ac8a526d26f9e7f01) But in fact this issue already existed before, because after failing to find a skinny extent item, the code set the search key for a non-skinny extent item, and on contention of a matching delayed ref head it would not search the extent tree for a skinny extent item anymore. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12btrfs: call mnt_drop_write after interrupted subvol deletionDavid Sterba1-1/+2
If btrfs_ioctl_snap_destroy blocks on the mutex and the process is killed, mnt_write count is unbalanced and leads to unmountable filesystem. CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12Btrfs: don't clear the default compression typeMiao Xie1-3/+2
We met a oops caused by the wrong compression type: [ 556.512356] BUG: unable to handle kernel NULL pointer dereference at (null) [ 556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98 [SNIP] [ 556.512490] [<ffffffff811dbb44>] ? list_del+0xd/0x2b [ 556.512539] [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs] [ 556.512546] [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10 [ 556.512576] [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs] [ 556.512601] [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs] [ 556.512627] [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs] [ 556.512655] [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs] [ 556.512661] [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8 [ 556.512689] [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs] [ 556.512695] [<ffffffff8104fa4e>] kthread+0x8d/0x95 [ 556.512699] [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d [ 556.512704] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65 [ 556.512709] [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0 [ 556.512713] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65 Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o nodatacow <dev> <mnt> # touch <mnt>/<file> # chattr =c <mnt>/<file> # dd if=/dev/zero of=<mnt>/<file> bs=1M count=10 It is because we cleared the default compression type when setting the nodatacow. In fact, we needn't do it because we have used COMPRESS flag to indicate if we need compressed the file data or not, needn't use the variant -- compress_type -- in btrfs_info to do the same thing, and just use it to hold the default compression type. Or we would get a wrong compress type for a file whose own compress flag is set but the compress flag of its filesystem is not set. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12kernfs: s/sysfs/kernfs/ in internal functions and whatever is leftTejun Heo6-251/+254
kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/sysfs_*()/kernfs_*()/ in all internal functions * s/sysfs/kernfs/ in internal strings, comments and whatever is remaining * Uniformly rename various vfs operations so that they're consistently named and distinguishable. This patch is strictly rename only and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-12kernfs: s/sysfs/kernfs/ in global variablesTejun Heo6-73/+73
kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/sysfs_mutex/kernfs_mutex/ * s/sysfs_dentry_ops/kernfs_dops/ * s/sysfs_dir_operations/kernfs_dir_fops/ * s/sysfs_dir_inode_operations/kernfs_dir_iops/ * s/kernfs_file_operations/kernfs_file_fops/ - renamed for consistency * s/sysfs_symlink_inode_operations/kernfs_symlink_iops/ * s/sysfs_aops/kernfs_aops/ * s/sysfs_backing_dev_info/kernfs_bdi/ * s/sysfs_inode_operations/kernfs_iops/ * s/sysfs_dir_cachep/kernfs_node_cache/ * s/sysfs_ops/kernfs_sops/ This patch is strictly rename only and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-12kernfs: s/sysfs/kernfs/ in constantsTejun Heo8-45/+45
kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/SYSFS_DIR/KERNFS_DIR/ * s/SYSFS_KOBJ_ATTR/KERNFS_FILE/ * s/SYSFS_KOBJ_LINK/KERNFS_LINK/ * s/SYSFS_{TYPE_FLAGS}/KERNFS_{TYPE_FLAGS}/ * s/SYSFS_FLAG_{FLAG}/KERNFS_{FLAG}/ * s/sysfs_type()/kernfs_type()/ * s/SD_DEACTIVATED_BIAS/KN_DEACTIVATED_BIAS/ This patch is strictly rename only and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-12kernfs: s/sysfs/kernfs/ in various data structuresTejun Heo7-138/+138
kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/sysfs_open_dirent/kernfs_open_node/ * s/sysfs_open_file/kernfs_open_file/ * s/sysfs_inode_attrs/kernfs_iattrs/ * s/sysfs_addrm_cxt/kernfs_addrm_cxt/ * s/sysfs_super_info/kernfs_super_info/ * s/sysfs_info()/kernfs_info()/ * s/sysfs_open_dirent_lock/kernfs_open_node_lock/ * s/sysfs_open_file_mutex/kernfs_open_file_mutex/ * s/sysfs_of()/kernfs_of()/ This patch is strictly rename only and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-12kernfs: drop s_ prefix from kernfs_node membersTejun Heo8-178/+174
kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. s_ prefix for kernfs members is used inconsistently and a misnomer now. It's not like kernfs_node is used widely across the kernel making the ability to grep for the members particularly useful. Let's just drop the prefix. This patch is strictly rename only and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-12kernfs: s/sysfs_dirent/kernfs_node/ and rename its friends accordinglyTejun Heo12-668/+669
kernfs has just been separated out from sysfs and we're already in full conflict mode. Nothing can make the situation any worse. Let's take the chance to name things properly. This patch performs the following renames. * s/sysfs_elem_dir/kernfs_elem_dir/ * s/sysfs_elem_symlink/kernfs_elem_symlink/ * s/sysfs_elem_attr/kernfs_elem_file/ * s/sysfs_dirent/kernfs_node/ * s/sd/kn/ in kernfs proper * s/parent_sd/parent/ * s/target_sd/target/ * s/dir_sd/parent/ * s/to_sysfs_dirent()/rb_to_kn()/ * misc renames of local vars when they conflict with the above Because md, mic and gpio dig into sysfs details, this patch ends up modifying them. All are sysfs_dirent renames and trivial. While we can avoid these by introducing a dummy wrapping struct sysfs_dirent around kernfs_node, given the limited usage outside kernfs and sysfs proper, I don't think such workaround is called for. This patch is strictly rename only and doesn't introduce any functional difference. - mic / gpio renames were missing. Spotted by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Neil Brown <neilb@suse.de> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11sysfs: fix use-after-free in sysfs_kill_sb()Tejun Heo1-1/+3
While restructuring the [u]mount path, 4b93dc9b1c68 ("sysfs, kernfs: prepare mount path for kernfs") incorrectly updated sysfs_kill_sb() so that it first kills super_block and then tries to dereference its namespace tag to drop it. Fix it by caching namespace tag before killing the superblock and then drop the cached namespace tag. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Tested-by: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/g/20131205031051.GC5135@yliu-dev.sh.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11sysfs: bail early from kernfs_file_mmap() to avoid spurious lockdep warningTejun Heo2-8/+22
This is v3.14 fix for the same issue that a8b14744429f ("sysfs: give different locking key to regular and bin files") addresses for v3.13. Due to the extensive kernfs reorganization in v3.14 branch, the same fix couldn't be ported as-is. The v3.13 fix was ignored while merging it into v3.14 branch. 027a485d12e0 ("sysfs: use a separate locking class for open files depending on mmap") assigned different lockdep key to sysfs_open_file->mutex depending on whether the file implements mmap or not in an attempt to avoid spurious lockdep warning caused by merging of regular and bin file paths. While this restored some of the original behavior of using different locks (at least lockdep is concerned) for the different clases of files. The restoration wasn't full because now the lockdep key assignment depends on whether the file has mmap or not instead of whether it's a regular file or not. This means that bin files which don't implement mmap will get assigned the same lockdep class as regular files. This is problematic because file_operations for bin files still implements the mmap file operation and checking whether the sysfs file actually implements mmap happens in the file operation after grabbing @sysfs_open_file->mutex. We still end up adding locking dependency from mmap locking to sysfs_open_file->mutex to the regular file mutex which triggers spurious circular locking warning. For v3.13, a8b14744429f ("sysfs: give different locking key to regular and bin files") fixed it by giving sysfs_open_file->mutex different lockdep keys depending on whether the file is regular or bin instead of whether mmap exists or not; however, due to the way sysfs is now layered behind kernfs, this approach is no longer viable. kernfs can tell whether a sysfs node has mmap implemented or not but can't tell whether a bin file from a regular one. This patch updates kernfs such that kernfs_file_mmap() checks SYSFS_FLAG_HAS_MMAP and bail before grabbing sysfs_open_file->mutex so that it doesn't add spurious locking dependency from mmap to sysfs_open_file->mutex and changes sysfs so that it specifies kernfs_ops->mmap iff the sysfs file implements mmap. Combined, this ensures that sysfs_open_file->mutex is grabbed under mmap path iff the sysfs file actually implements mmap. As sysfs_open_file->mutex is already given a different lockdep key if mmap is implemented, this removes the spurious locking dependency. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/g/20131203184324.GA11320@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11nfsd: when reusing an existing repcache entry, unhash it firstJeff Layton1-1/+8
The DRC code will attempt to reuse an existing, expired cache entry in preference to allocating a new one. It'll then search the cache, and if it gets a hit it'll then free the cache entry that it was going to reuse. The cache code doesn't unhash the entry that it's going to reuse however, so it's possible for it end up designating an entry for reuse and then subsequently freeing the same entry after it finds it. This leads it to a later use-after-free situation and usually some list corruption warnings or an oops. Fix this by simply unhashing the entry that we intend to reuse. That will mean that it's not findable via a search and should prevent this situation from occurring. Cc: stable@vger.kernel.org # v3.10+ Reported-by: Christoph Hellwig <hch@infradead.org> Reported-by: g. artim <gartim@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-12-10xfs: growfs overruns AGFL buffer on V4 filesystemsDave Chinner1-1/+5
This loop in xfs_growfs_data_private() is incorrect for V4 superblocks filesystems: for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++) agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK); For V4 filesystems, we don't have a agfl header structure, and so XFS_AGFL_SIZE() returns an entire sector's worth of entries, which we then index from an offset into the sector. Hence: buffer overrun. This problem was introduced in 3.10 by commit 77c95bba ("xfs: add CRC checks to the AGFL") which changed the AGFL structure but failed to update the growfs code to handle the different structures. Fix it by using the correct offset into the buffer for both V4 and V5 filesystems. Cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit b7d961b35b3ab69609aeea93f870269cb6e7ba4d)
2013-12-10xfs: don't perform discard if the given range length is less than block sizeJie Liu1-2/+3
For discard operation, we should return EINVAL if the given range length is less than a block size, otherwise it will go through the file system to discard data blocks as the end range might be evaluated to -1, e.g, # fstrim -v -o 0 -l 100 /xfs7 /xfs7: 9811378176 bytes were trimmed This issue can be triggered via xfstests/generic/288. Also, it seems to get the request queue pointer via bdev_get_queue() instead of the hard code pointer dereference is not a bad thing. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit f9fd0135610084abef6867d984e9951c3099950d)