path: root/fs
Age | Commit message | Author | Files | Lines
2014-09-05 | ext4: prepare to drop EXT4_STATE_DELALLOC_RESERVED | Theodore Ts'o | 5 | -14/+17
The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented because it was too hard to make sure the mballoc and get_block flags could be reliably passed down through all of the codepaths that end up calling ext4_mb_new_blocks(). Since then, we have mb_flags passed down through most of the code paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky as it used to be. This commit plumbs in the last of what is required, and then adds a WARN_ON check to make sure we haven't missed anything. If this passes a full regression test run, we can then drop EXT4_STATE_DELALLOC_RESERVED. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-05 | ext4: pass allocation_request struct to ext4_(alloc,splice)_branch | Theodore Ts'o | 1 | -44/+38
Instead of initializing the allocation_request structure in ext4_alloc_branch(), set it up in ext4_ind_map_blocks(), and then pass it to ext4_alloc_branch() and ext4_splice_branch(). This allows ext4_ind_map_blocks to pass flags in the allocation request structure without having to add Yet Another argument to ext4_alloc_branch(). Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-02 | ext4: track extent status tree shrinker delay statistics | Zheng Liu | 4 | -21/+193
This commit adds some statistics to the extent status tree shrinker. The purpose of adding these is to collect more details when we encounter a stall caused by the extent status tree shrinker. Here we count the following statistics:

    stats:
      the number of all objects on all extent status trees
      the number of reclaimable objects on the lru list
      cache hits/misses
      the last sorted interval
      the number of inodes on the lru list
    average:
      scan time for shrinking some objects
      the number of shrunk objects
    maximum:
      the inode that has the max nr. of objects on the lru list
      the maximum scan time for shrinking some objects

The output looks like below:

    $ cat /proc/fs/ext4/sda1/es_shrinker_info
    stats:
      28228 objects
      6341 reclaimable objects
      5281/631 cache hits/misses
      586 ms last sorted interval
      250 inodes on lru list
    average:
      153 us scan time
      128 shrunk objects
    maximum:
      255 inode (255 objects, 198 reclaimable)
      125723 us max scan time

If the lru list has never been sorted, the following line will not be printed:

      586 ms last sorted interval

If the lru list is empty, the following lines also will not be printed:

      250 inodes on lru list
      ...
    maximum:
      255 inode (255 objects, 198 reclaimable)
      0 us max scan time

Meanwhile, this commit defines a new trace point to print some details in __ext4_es_shrink(). Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
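To make the layout above concrete, here is a minimal userspace C sketch of the kind of counter structure and /proc-style formatter the commit describes. The struct fields, helper names, and the suppression of the "last sorted interval" and lru lines are modeled on the description above; none of this is the actual ext4 code.

    /* Illustrative sketch only; names and layout are assumptions, not ext4 internals. */
    #include <stdio.h>

    struct es_shrink_stats {
        unsigned long objects;          /* objects on all extent status trees */
        unsigned long reclaimable;      /* reclaimable objects on the lru list */
        unsigned long cache_hits;
        unsigned long cache_misses;
        unsigned long last_sorted_ms;   /* interval since the lru was last sorted */
        unsigned long lru_inodes;       /* inodes on the lru list */
        unsigned long avg_scan_us;      /* average scan time */
        unsigned long avg_shrunk;       /* average number of shrunk objects */
        unsigned long max_scan_us;      /* worst-case scan time */
    };

    static void es_shrink_stats_show(const struct es_shrink_stats *s)
    {
        printf("stats:\n  %lu objects\n  %lu reclaimable objects\n"
               "  %lu/%lu cache hits/misses\n",
               s->objects, s->reclaimable, s->cache_hits, s->cache_misses);
        if (s->last_sorted_ms)          /* omitted if the lru was never sorted */
            printf("  %lu ms last sorted interval\n", s->last_sorted_ms);
        if (s->lru_inodes)              /* omitted if the lru list is empty */
            printf("  %lu inodes on lru list\n", s->lru_inodes);
        printf("average:\n  %lu us scan time\n  %lu shrunk objects\n"
               "maximum:\n  %lu us max scan time\n",
               s->avg_scan_us, s->avg_shrunk, s->max_scan_us);
    }

    int main(void)
    {
        struct es_shrink_stats s = {
            .objects = 28228, .reclaimable = 6341,
            .cache_hits = 5281, .cache_misses = 631,
            .last_sorted_ms = 586, .lru_inodes = 250,
            .avg_scan_us = 153, .avg_shrunk = 128,
            .max_scan_us = 125723,
        };
        es_shrink_stats_show(&s);
        return 0;
    }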
2014-09-02 | ext4: improve extents status tree trace point | Zheng Liu | 1 | -3/+3
This commit improves the trace point of the extents status tree. We rename trace_ext4_es_shrink_enter in ext4_es_count() because it is also used in ext4_es_scan(), and otherwise we cannot tell the two callers apart in the trace output. Further, this commit fixes a variable name in a trace point in order to keep consistency with the others. Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-02 | ext4: fix comments about get_blocks | Seunghun Lee | 1 | -2/+2
get_blocks is renamed to get_block. Signed-off-by: Seunghun Lee <waydi1@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-02 | ext4: enable block_validity by default | Darrick J. Wong | 1 | -2/+2
Enable by default the block_validity feature, which checks for collisions between newly allocated blocks and critical system metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-02 | jbd2: fold __wait_cp_io into jbd2_log_do_checkpoint() | Theodore Ts'o | 1 | -56/+31
__wait_cp_io() is only called by jbd2_log_do_checkpoint(). Fold it in to make it a bit easier to understand. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-02 | jbd2: fold __process_buffer() into jbd2_log_do_checkpoint() | Theodore Ts'o | 1 | -111/+84
__process_buffer() is only called by jbd2_log_do_checkpoint(), and it had a very complex locking protocol where it would be called with the j_list_lock, and sometimes exit with the lock held (if the return code was 0), or release the lock. This was confusing both to humans and to smatch (which erroneously complained that the lock was taken twice). Folding __process_buffer() into the caller allows us to simplify the control flow, making the resulting function easier to read and reason about, and dropping the compiled size of fs/jbd2/checkpoint.c by 150 bytes (over 4% of the text size). Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-01 | ext4: rename ext4_ext_find_extent() to ext4_find_extent() | Theodore Ts'o | 5 | -28/+27
Make the function name less redundant. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 | ext4: reuse path object in ext4_move_extents() | Theodore Ts'o | 1 | -15/+12
Reuse the path object in ext4_move_extents() so we don't unnecessarily free and reallocate it. Also clean up the get_ext_path() wrapper so that it has the same semantics of freeing the path object on error as ext4_ext_find_extent(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 | ext4: reuse path object in ext4_ext_shift_extents() | Theodore Ts'o | 1 | -17/+8
Now that the semantics of ext4_ext_find_extent() are much cleaner, it's safe and more efficient to reuse the path object across the multiple calls to ext4_ext_find_extent() in ext4_ext_shift_extents(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 | ext4: teach ext4_ext_find_extent() to realloc path if necessary | Theodore Ts'o | 2 | -10/+11
This adds additional safety in case for some reason we end up reusing a path structure which isn't big enough for the current depth of the inode. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 | ext4: allow a NULL argument to ext4_ext_drop_refs() | Theodore Ts'o | 4 | -51/+29
Teach ext4_ext_drop_refs() to accept a NULL argument, much like kfree(). This allows us to drop a lot of checks to make sure path is non-NULL before calling ext4_ext_drop_refs(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
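The kfree()-style convention described here can be shown with a small standalone sketch; struct ext_path and path_drop_refs() below are simplified stand-ins for the ext4 types, not the real code:

    #include <stdlib.h>

    struct ext_path {
        void *bh;                       /* stand-in for a buffer_head reference */
    };

    static void path_drop_refs(struct ext_path *path)
    {
        if (!path)                      /* tolerate NULL, like kfree() */
            return;
        free(path->bh);
        path->bh = NULL;
    }

    int main(void)
    {
        struct ext_path *path = calloc(1, sizeof(*path));

        if (path)
            path->bh = malloc(64);      /* pretend we hold a block reference */

        path_drop_refs(path);           /* releases the reference */
        path_drop_refs(NULL);           /* no NULL check needed at the call site */
        free(path);
        return 0;
    }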
2014-09-01 | ext4: call ext4_ext_drop_refs() from ext4_ext_find_extent() | Theodore Ts'o | 1 | -7/+4
In nearly all of the calls to ext4_ext_find_extent() where the caller is trying to recycle the path object, ext4_ext_drop_refs() gets called to release the buffer heads before the path object gets overwritten. To simplify things for the callers, and to avoid the possibility of a memory leak, make ext4_ext_find_extent() responsible for dropping the buffers. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 | ext4: drop EXT4_EX_NOFREE_ON_ERR from rest of extents handling code | Theodore Ts'o | 3 | -58/+59
Drop EXT4_EX_NOFREE_ON_ERR from ext4_ext_create_new_leaf(), ext4_split_extent(), and ext4_convert_unwritten_extents_endio(). This requires fixing all of their callers to allow ext4_ext_find_extent() to free the struct ext4_ext_path object in case of an error, and there are interlocking dependencies all the way up to ext4_ext_map_blocks(), ext4_swap_extents(), and ext4_ext_remove_space(). Once this is done, we can drop the EXT4_EX_NOFREE_ON_ERR flag since it is no longer necessary. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 | ext4: drop EXT4_EX_NOFREE_ON_ERR in convert_initialized_extent() | Theodore Ts'o | 1 | -5/+5
Transfer responsibility of freeing struct ext4_ext_path on error to ext4_ext_find_extent(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 | ext4: collapse ext4_convert_initialized_extents() | Theodore Ts'o | 1 | -77/+59
The function ext4_convert_initialized_extents() is only called by a single function --- ext4_ext_convert_initialized_extents(). Inline the code and get rid of the unnecessary bits in order to simplify the code. Rename ext4_ext_convert_initialized_extents() to convert_initialized_extents() since it's a static function that is actually only used in a single caller, ext4_ext_map_blocks(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 | ext4: teach ext4_ext_find_extent() to free path on error | Theodore Ts'o | 3 | -12/+21
Right now, there are places where it is all too easy to leak memory on an error path, via a usage like this:

    struct ext4_ext_path *path = NULL;

    while (...) {
        ...
        path = ext4_ext_find_extent(inode, block, path, 0);
        if (IS_ERR(path)) {
            /* oops, if path was non-NULL before the call to
             * ext4_ext_find_extent, we've leaked it!  :-( */
            ...
            return PTR_ERR(path);
        }
        ...
    }

Unfortunately, there are some code paths where we are doing the following instead:

    path = ext4_ext_find_extent(inode, block, orig_path, 0);

and where it's important that we _not_ free orig_path in the case where ext4_ext_find_extent() returns an error. So change the function signature of ext4_ext_find_extent() so that it takes a struct ext4_ext_path ** for its third argument, and by default, on an error, it will free the struct ext4_ext_path, and then zero out the struct ext4_ext_path * pointer. In order to avoid causing problems, we add a flag EXT4_EX_NOFREE_ON_ERR which causes ext4_ext_find_extent() to use the original behavior of forcing the caller to deal with freeing the original path pointer on the error case. The goal is to get rid of EXT4_EX_NOFREE_ON_ERR entirely, but this allows for a gentle transition and makes the patches easier to verify. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
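A compact userspace sketch of the calling convention this commit describes: the third argument becomes a double pointer, and on error the callee frees the old path and NULLs the caller's pointer. ERR_PTR/IS_ERR/PTR_ERR are re-implemented here only so the example builds outside the kernel, and find_extent() is a hypothetical stand-in for ext4_ext_find_extent():

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct ext_path { int depth; };

    static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
    static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
    static inline int IS_ERR(const void *p) { return (uintptr_t)p >= (uintptr_t)-4095; }

    /* On failure, free any path the caller handed in and NULL the caller's
     * pointer, so the caller can simply return the error without leaking. */
    static struct ext_path *find_extent(struct ext_path **ppath, int block)
    {
        struct ext_path *path = *ppath;

        if (!path)
            path = malloc(sizeof(*path));
        if (!path || block < 0) {           /* simulated failure */
            free(path);
            *ppath = NULL;                  /* caller's pointer is now safe */
            return ERR_PTR(-EIO);
        }
        path->depth = 1;
        *ppath = path;
        return path;
    }

    int main(void)
    {
        struct ext_path *path = NULL;

        path = find_extent(&path, 42);      /* ok */
        path = find_extent(&path, -1);      /* fails: old path freed for us */
        if (IS_ERR(path))
            printf("error %ld, nothing leaked\n", PTR_ERR(path));
        return 0;
    }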
2014-09-01 | ext4: fix accidental flag aliasing in ext4_map_blocks flags | Theodore Ts'o | 1 | -2/+3
Commit b8a8684502a0f introduced an accidental flag aliasing between EXT4_EX_NOCACHE and EXT4_GET_BLOCKS_CONVERT_UNWRITTEN. Fortunately, this didn't introduce any untoward side effects --- we got lucky. Nevertheless, fix this and leave a warning to hopefully prevent this from happening in the future. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
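One way to keep two flag namespaces that travel in the same flags word from colliding again is a compile-time overlap check. The sketch below uses illustrative bit values, not the real ext4 assignments:

    #include <assert.h>

    /* ext4_map_blocks-style flags (illustrative values) */
    #define GET_BLOCKS_CREATE            0x0001
    #define GET_BLOCKS_CONVERT_UNWRITTEN 0x0800

    /* extent-handling flags that may be OR'd into the same word */
    #define EX_NOCACHE                   0x1000
    #define EX_FORCE_CACHE               0x2000

    /* If two flags that can share a flags word also share a bit, setting one
     * silently sets the other.  A static assertion documents the boundary and
     * breaks the build if someone aliases them again. */
    static_assert((GET_BLOCKS_CONVERT_UNWRITTEN & (EX_NOCACHE | EX_FORCE_CACHE)) == 0,
                  "map_blocks flags must not overlap extent-handling flags");

    int main(void) { return 0; }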
2014-09-01 | ext4: fix ZERO_RANGE bug hidden by flag aliasing | Theodore Ts'o | 1 | -7/+14
We accidentally aliased the EXT4_EX_NOCACHE and EXT4_GET_BLOCKS_CONVERT_UNWRITTEN flags, which apparently was hiding a bug that was unmasked when this flag aliasing issue was addressed (see the subsequent commit). The reproduction case was:

    fsx -N 10000 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W /vdb/junk

... which would cause fsx to report corruption in the data file. The fix we have is a bit of an overkill, but I'd much rather be conservative for now, and we can optimize ZERO_RANGE_FL handling later. The fact that we need to zap the extent_status cache for the inode is unfortunate, but correctness is far more important than performance. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Namjae Jeon <namjae.jeon@samsung.com>
2014-08-31 | ext4: fix ext4_swap_extents() error handling | Theodore Ts'o | 1 | -33/+29
If ext4_ext_find_extent() returns an error, we have to clear path1 or path2, or else we would end up trying to free an ERR_PTR, which would be bad. Also eliminate some redundant code and mark the error paths as unlikely(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-31 | ext4: refactor ext4_move_extents code base | Dmitry Monakhov | 3 | -891/+338
ext4_move_extents is too complex for review. It duplicates almost every function available elsewhere in the codebase. It also has a useless artificial restriction that orig_offset == donor_offset. But in fact the logic of ext4_move_extents is very simple:

    Iterate the extents one by one (similar to ext4_fill_fiemap_extents)
      -> Iterate each page covered by the extent (similar to generic_perform_write)
        -> Swap the extents covered by the page (can be shared with IOC_MOVE_DATA)

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-31 | ext4: use ext4_ext_next_allocated_block instead of mext_next_extent | Dmitry Monakhov | 3 | -12/+8
This allows us to make mext_next_extent static and potentially get rid of it. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-31 | ext4: use ext4_update_i_disksize instead of opencoded ones | Dmitry Monakhov | 1 | -4/+1
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 | ext4: remove a duplicate call in ext4_init_new_dir() | Wang Shilong | 1 | -4/+0
ext4_journal_get_write_access() has just been called in ext4_append(); calling it again here is redundant. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 | ext4: convert do_split() to use the ERR_PTR convention | Theodore Ts'o | 1 | -12/+11
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 | ext4: convert dx_probe() to use the ERR_PTR convention | Theodore Ts'o | 1 | -54/+35
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 | ext4: convert ext4_bread() to use the ERR_PTR convention | Theodore Ts'o | 5 | -43/+34
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 | ext4: convert ext4_getblk() to use the ERR_PTR convention | Theodore Ts'o | 3 | -33/+30
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 | ext4: convert ext4_dx_find_entry() to use the ERR_PTR convention | Theodore Ts'o | 1 | -26/+20
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 | Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 | Linus Torvalds | 10 | -124/+208
Pull ext4 bugfixes from Ted Ts'o:
 "Ext4 bug fixes for 3.17, to provide better handling of memory allocation failures, and to fix some journaling bugs involving journal checksums and FALLOC_FL_ZERO_RANGE"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix same-dir rename when inline data directory overflows
  jbd2: fix descriptor block size handling errors with journal_csum
  jbd2: fix infinite loop when recovering corrupt journal blocks
  ext4: update i_disksize coherently with block allocation on error path
  ext4: fix transaction issues for ext4_fallocate and ext_zero_range
  ext4: fix incorrect journal credits reservation in ext4_zero_range
  ext4: move i_size,i_disksize update routines to helper function
  ext4: fix BUG_ON in mb_free_blocks()
  ext4: propagate errors up to ext4_find_entry()'s callers
2014-08-29 | ext4: fix same-dir rename when inline data directory overflows | Darrick J. Wong | 1 | -3/+18
When performing a same-directory rename, it's possible that adding or setting the new directory entry will cause the directory to overflow the inline data area, which causes the directory to be converted to an extent-based directory. Under this circumstance it is necessary to re-read the directory when deleting the old dirent because the "old directory" context still points to i_block in the inode table, which is now an extent tree root! The delete fails with an FS error, and the subsequent fsck complains about incorrect link counts and hardlinked directories. Test case (originally found with flat_dir_test in the metadata_csum test program):

    # mkfs.ext4 -O inline_data /dev/sda
    # mount /dev/sda /mnt
    # mkdir /mnt/x
    # touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian
    # sync
    # for i in /mnt/x/*; do mv $i $i.longer; done
    # ls -la /mnt/x/
    total 0
    -rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer
    -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright
    -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer
    -rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer

    (Hey! Why are there four files now??)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-29 | jbd2: fix descriptor block size handling errors with journal_csum | Darrick J. Wong | 5 | -44/+70
It turns out that there are some serious problems with the on-disk format of journal checksum v2. The foremost is that the function to calculate descriptor tag size returns sizes that are too big. This causes alignment issues on some architectures and is compounded by the fact that some parts of jbd2 use the structure size (incorrectly) to determine the presence of a 64bit journal instead of checking the feature flags. Therefore, introduce journal checksum v3, which enlarges the descriptor block tag format to allow for full 32-bit checksums of journal blocks, fix the journal tag function to return the correct sizes, and fix the jbd2 recovery code to use feature flags to determine 64bitness. Add a few function helpers so we don't have to open-code quite so many pieces. Switching to a 16-byte descriptor tag size was found to increase journal size overhead by a maximum of 0.1%, in the worst case of converting a 32-bit journal with no checksumming to a 32-bit journal with checksum v3 enabled. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: TR Reardon <thomas_reardon@hotmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
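A simplified sketch of the "derive the tag size from feature flags, not from sizeof()" idea. The structs, flag values, and sizes below follow the description above but are illustrative; they are not the on-disk jbd2 definitions:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define FEATURE_INCOMPAT_64BIT    0x1
    #define FEATURE_INCOMPAT_CSUM_V2  0x2
    #define FEATURE_INCOMPAT_CSUM_V3  0x4

    /* checksum v3 tag: room for a full 32-bit checksum of the journal block */
    struct block_tag3 {
        uint32_t t_blocknr;
        uint32_t t_flags;
        uint32_t t_blocknr_high;
        uint32_t t_checksum;
    };

    /* pre-v3 tag: only part of the checksum fits */
    struct block_tag {
        uint32_t t_blocknr;
        uint16_t t_checksum;
        uint16_t t_flags;
        uint32_t t_blocknr_high;   /* present only on 64-bit journals */
    };

    static size_t tag_bytes(uint32_t features)
    {
        if (features & FEATURE_INCOMPAT_CSUM_V3)
            return sizeof(struct block_tag3);
        if (features & FEATURE_INCOMPAT_64BIT)     /* decide by flag, not sizeof */
            return sizeof(struct block_tag);
        return offsetof(struct block_tag, t_blocknr_high);
    }

    int main(void)
    {
        printf("32-bit, no csum : %zu bytes\n", tag_bytes(0));
        printf("64-bit, csum v2 : %zu bytes\n",
               tag_bytes(FEATURE_INCOMPAT_64BIT | FEATURE_INCOMPAT_CSUM_V2));
        printf("64-bit, csum v3 : %zu bytes\n",
               tag_bytes(FEATURE_INCOMPAT_64BIT | FEATURE_INCOMPAT_CSUM_V3));
        return 0;
    }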
2014-08-29 | jbd2: fix infinite loop when recovering corrupt journal blocks | Darrick J. Wong | 1 | -2/+5
When recovering the journal, don't fall into an infinite loop if we encounter a corrupt journal block. Instead, just skip the block and return an error, which fails the mount and thus forces the user to run a full filesystem fsck. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-29 | ext4: update i_disksize coherently with block allocation on error path | Dmitry Monakhov | 1 | -2/+8
In the delalloc case, i_disksize may be less than i_size, so we have to update i_disksize each time we allocate and submit blocks beyond i_disksize. We weren't doing this on the error paths, so fix this. testcase: xfstest generic/019 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
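The invariant being enforced can be shown with a toy model: once blocks beyond the on-disk size have been allocated and submitted, i_disksize is advanced even when a later step fails. Everything below (struct toy_inode, submit_blocks()) is a stand-in, not the ext4 implementation:

    #include <stdio.h>

    struct toy_inode {
        long long i_size;       /* in-memory file size */
        long long i_disksize;   /* size recorded on disk */
    };

    static int submit_blocks(struct toy_inode *inode, long long new_end, int simulate_error)
    {
        /* ... blocks up to new_end are now allocated and submitted ... */

        /* advance i_disksize before bailing out, so a failure after this
         * point cannot leave allocated data beyond the on-disk size */
        if (new_end > inode->i_disksize)
            inode->i_disksize = new_end;

        if (simulate_error)
            return -1;          /* error path still reports the failure */
        return 0;
    }

    int main(void)
    {
        struct toy_inode inode = { .i_size = 8192, .i_disksize = 4096 };

        submit_blocks(&inode, 8192, 1);     /* fails, but i_disksize is coherent */
        printf("i_disksize = %lld\n", inode.i_disksize);
        return 0;
    }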
2014-08-28 | ext4: fix transaction issues for ext4_fallocate and ext_zero_range | Dmitry Monakhov | 1 | -33/+35
After commit f282ac19d86f we use different transactions for preallocation and the i_disksize update, which results in complaints from fsck after a power failure (spotted by generic/019). IMHO this is a regression because the fs becomes inconsistent; what's more, 'e2fsck -p' will no longer work (which drives admins crazy). The same transaction requirement applies to the ctime/mtime updates. testcase: xfstest generic/019 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-28 | ext4: fix incorrect journal credits reservation in ext4_zero_range | Dmitry Monakhov | 1 | -2/+9
Currently we reserve only 4 blocks, but in the worst-case scenario ext4_zero_partial_blocks() may want to zero out and convert two non-adjacent blocks. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-27 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs | Linus Torvalds | 17 | -135/+312
Pull btrfs fixes from Chris Mason:
 "The biggest of these comes from Liu Bo, who tracked down a hang we've been hitting since moving to kernel workqueues (it's a btrfs bug, not in the generic code). His patch needs backporting to 3.16 and 3.15 stable, which I'll send once this is in.
  Otherwise these are assorted fixes. Most were integrated last week during KS, but I wanted to give everyone the chance to test the result, so I waited for rc2 to come out before sending"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix task hang under heavy compressed write
  Btrfs: fix filemap_flush call in btrfs_file_release
  Btrfs: fix crash on endio of reading corrupted block
  btrfs: fix leak in qgroup_subtree_accounting() error path
  btrfs: Use right extent length when inserting overlap extent map.
  Btrfs: clone, don't create invalid hole extent map
  Btrfs: don't monopolize a core when evicting inode
  Btrfs: fix hole detection during file fsync
  Btrfs: ensure tmpfile inode is always persisted with link count of 0
  Btrfs: race free update of commit root for ro snapshots
  Btrfs: fix regression of btrfs device replace
  Btrfs: don't consider the missing device when allocating new chunks
  Btrfs: Fix wrong device size when we are resizing the device
  Btrfs: don't write any data into a readonly device when scrub
  Btrfs: Fix the problem that the replace destroys the seed filesystem
  btrfs: Return right extent when fiemap gives unaligned offset and len.
  Btrfs: fix wrong extent mapping for DirectIO
  Btrfs: fix wrong write range for filemap_fdatawrite_range()
  Btrfs: fix wrong missing device counter decrease
  Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
  ...
2014-08-26 | Merge tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs | Linus Torvalds | 2 | -29/+76
Pull NFS client fixes from Trond Myklebust:
 "Highlights:
   - more fixes for read/write codepath regressions
     * sleeping while holding the inode lock
     * stricter enforcement of page contiguity when coalescing requests
     * fix up error handling in the page coalescing code
   - don't busy wait on SIGKILL in the file locking code"
* tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: Don't busy-wait on SIGKILL in __nfs_iocounter_wait
  nfs: can_coalesce_requests must enforce contiguity
  nfs: disallow duplicate pages in pgio page vectors
  nfs: don't sleep with inode lock in lock_and_join_requests
  nfs: fix error handling in lock_and_join_requests
  nfs: use blocking page_group_lock in add_request
  nfs: fix nonblocking calls to nfs_page_group_lock
  nfs: change nfs_page_group_lock argument
2014-08-25 | aio: fix reqs_available handling | Benjamin LaHaise | 1 | -4/+73
As reported by Dan Aloni, commit f8567a3845ac ("aio: fix aio request leak when events are reaped by userspace") introduces a regression when user code attempts to perform io_submit() with more events than are available in the ring buffer. Reverting that commit would reintroduce a regression when user space event reaping is used. Fixing this bug is a bit more involved than the previous attempts to fix this regression. Since we do not have a single point at which we can count events as being reaped by user space and io_getevents(), we have to track event completion by looking at the number of events left in the event ring. So long as there are as many events in the ring buffer as there have been completion events generated, we cannot call put_reqs_available(). The code to check for this is now placed in refill_reqs_available(). A test program from Dan, modified by me for verifying this bug, is available at http://www.kvack.org/~bcrl/20140824-aio_bug.c . Reported-by: Dan Aloni <dan@kernelim.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Acked-by: Dan Aloni <dan@kernelim.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: stable@vger.kernel.org # v3.16 and anything that f8567a3845ac was backported to Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
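A toy model of the accounting described above: submission capacity is only credited back for completions that user space has actually consumed from the ring. Field and function names mirror the description, not the aio implementation:

    #include <stdio.h>

    #define RING_NR 8

    struct toy_ctx {
        unsigned head, tail;          /* ring indices, free-running */
        unsigned completed_events;    /* completions not yet credited back */
        unsigned reqs_available;      /* submission capacity left */
    };

    /* credit back only those completions whose ring slots were consumed */
    static void refill_reqs_available(struct toy_ctx *ctx)
    {
        unsigned in_ring = ctx->tail - ctx->head;   /* events still unreaped */

        if (ctx->completed_events > in_ring) {
            unsigned reaped = ctx->completed_events - in_ring;

            ctx->completed_events -= reaped;
            ctx->reqs_available += reaped;
        }
    }

    int main(void)
    {
        struct toy_ctx ctx = { .reqs_available = RING_NR };

        ctx.reqs_available -= 2;          /* two io_submit()s */
        ctx.tail += 2;                    /* both complete into the ring */
        ctx.completed_events += 2;
        refill_reqs_available(&ctx);      /* nothing reaped yet: no credit */
        printf("available after completion: %u\n", ctx.reqs_available);

        ctx.head += 2;                    /* user space reaps both events */
        refill_reqs_available(&ctx);
        printf("available after reaping:    %u\n", ctx.reqs_available);
        return 0;
    }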
2014-08-24 | Btrfs: fix task hang under heavy compressed write | Liu Bo | 12 | -61/+141
This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now uses kernel workqueues, but this migration introduced the hang problem. Btrfs has a kind of work that is queued in an ordered way, which means that its ordered_func() must be processed FIFO, so it usually looks like:

    normal_work_helper(arg)
        work = container_of(arg, struct btrfs_work, normal_work);
        work->func()                    <---- (we name it work X)
        for ordered_work in wq->ordered_list
            ordered_work->ordered_func()
            ordered_work->ordered_free()

The hang is a rare case. First, when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request:

    btrfs_readpages()
        for page that is not in page cache
            __do_readpage()
                submit_extent_page()
                    btrfs_submit_bio_hook()
                        btrfs_bio_wq_end_io()
                        submit_bio()
            end_workqueue_bio()         <-- (ret by the 1st endio)
                queue a work (named work Y) for the 2nd,
                also the real, endio()

So the hang occurs when work Y's work_struct and work X's work_struct happen to share the same address. A bit more explanation:

    A, B, C -- struct btrfs_work
    arg     -- struct work_struct

    kthread:
    worker_thread()
        pick up a work_struct from @worklist
        process_one_work(arg)
            worker->current_work = arg;     <-- arg is A->normal_work
            worker->current_func(arg)
                normal_work_helper(arg)
                    A = container_of(arg, struct btrfs_work, normal_work);
                    A->func()
                    A->ordered_func()
                    A->ordered_free()       <-- A gets freed

                    B->ordered_func()
                        submit_compressed_extents()
                            find_free_extent()
                                load_free_space_inode()
                                    ...     <-- (the above readahead stack)
                                    end_workqueue_bio()
                                        btrfs_queue_work(work C)
                    B->ordered_free()

Since work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that the kernel workqueue code worker_thread() still has its worker->current_work pointer set to work A->normal_work, i.e. arg's address. Meanwhile, work C is allocated after work A is freed, so work C->normal_work and work A->normal_work are likely to share the same address (I confirmed this with ftrace output, so I'm not just guessing; it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it (see find_worker_executing_work()), it treats work C as a collision and skips it, which ends up with nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides this, there are other cases that can lead to deadlock, but the real problem is that all btrfs workqueues share one work->func, normal_work_helper, so this patch makes each workqueue have its own helper function, which is only a wrapper of normal_work_helper. With this patch, I no longer hit the above hang. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
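The shape of the fix can be sketched in a few lines: give each queue its own thin helper so that a recycled work address queued through a different helper no longer matches the "already being processed" test. is_collision() below is a simplified model of that test, not the kernel workqueue code:

    #include <stdio.h>

    typedef void (*work_fn)(void *);

    struct worker {
        void   *current_work;   /* address of the work being processed */
        work_fn current_func;   /* helper it was queued with */
    };

    static void normal_work_helper(void *work) { (void)work; /* common body */ }

    /* one thin wrapper per workqueue */
    static void endio_helper(void *work)  { normal_work_helper(work); }
    static void worker_helper(void *work) { normal_work_helper(work); }

    /* simplified collision test: a match needs BOTH the work address
     * and the helper function to be the same */
    static int is_collision(const struct worker *w, void *work, work_fn fn)
    {
        return w->current_work == work && w->current_func == fn;
    }

    int main(void)
    {
        char slot[16];          /* stands in for a reused allocation address */
        struct worker w = { .current_work = slot, .current_func = endio_helper };

        /* before the fix: every queue used the same helper, so a new work
         * that happened to reuse "slot" looked like a collision and was skipped */
        printf("same helper  -> collision: %d\n", is_collision(&w, slot, endio_helper));

        /* after the fix: a work queued on another queue carries a different
         * helper, so the recycled address is not mistaken for the old work */
        printf("other helper -> collision: %d\n", is_collision(&w, slot, worker_helper));
        return 0;
    }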
2014-08-24 | ext4: move i_size,i_disksize update routines to helper function | Dmitry Monakhov | 3 | -39/+28
Cc: stable@vger.kernel.org # needed for bug fix patches Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-24 | ext4: fix BUG_ON in mb_free_blocks() | Theodore Ts'o | 1 | -0/+5
If we suffer a block allocation failure (for example due to a memory allocation failure), it's possible that we will call ext4_discard_allocated_blocks() before we've actually allocated any blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0) triggering the BUG_ON in mb_free_blocks():

    BUG_ON(last >= (sb->s_blocksize << 3));

Fix this by bailing out of ext4_discard_allocated_blocks() if fe_len is zero. Also fix a missing ext4_mb_unload_buddy() call in ext4_discard_allocated_blocks(). Google-Bug-Id: 16844242 Fixes: 86f0afd463215fc3e58020493482faa4ac3a4d69 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
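The guard being added is easy to model: if nothing was ever allocated (length still zero), return before calling the free routine whose bounds check would otherwise trip. The names below are stand-ins for the mballoc fields:

    #include <assert.h>
    #include <stdio.h>

    #define GROUP_BLOCKS 32768          /* blocks per group in this toy example */

    static void mb_free_blocks_toy(unsigned start, unsigned len)
    {
        unsigned last = start + len - 1;    /* underflows when len == 0 */

        assert(last < GROUP_BLOCKS);        /* the BUG_ON equivalent */
        printf("freed %u block(s) at %u\n", len, start);
    }

    static void discard_allocated_blocks_toy(unsigned fe_start, unsigned fe_len)
    {
        if (fe_len == 0)                    /* the fix: nothing was allocated */
            return;
        mb_free_blocks_toy(fe_start, fe_len);
    }

    int main(void)
    {
        discard_allocated_blocks_toy(0, 0);     /* no longer trips the assert */
        discard_allocated_blocks_toy(100, 8);
        return 0;
    }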
2014-08-24 | ext4: propagate errors up to ext4_find_entry()'s callers | Theodore Ts'o | 2 | -3/+34
If we run into some kind of error, such as ENOMEM, while calling ext4_getblk() or ext4_dx_find_entry(), we need to make sure this error gets propagated up to ext4_find_entry() and then to its callers. This way, transient errors such as ENOMEM can get propagated to the VFS. This is important so that the system calls return the appropriate error, and also so that in the case of ext4_lookup(), we return an error instead of a NULL inode, since that will result in a negative dentry cache entry that will stick around long past the OOM condition which caused a transient ENOMEM error. Google-Bug-Id: #17142205 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2014-08-23 | nfs: Don't busy-wait on SIGKILL in __nfs_iocounter_wait | David Jeffery | 1 | -1/+1
If a SIGKILL is sent to a task waiting in __nfs_iocounter_wait, it will busy-wait or soft lockup in its while loop. nfs_wait_bit_killable won't sleep, and the loop won't exit on the error return. Stop the busy-wait by breaking out of the loop when nfs_wait_bit_killable returns an error. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
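The pattern of the fix, sketched in userspace: when the killable wait helper reports a pending fatal signal, the caller must leave its retry loop instead of spinning. The toy helpers below model the behavior described above, not the NFS code:

    #include <errno.h>
    #include <stdio.h>

    /* pretend a fatal signal (SIGKILL) becomes pending on the third pass */
    static int fatal_signal_pending_toy(int iteration)
    {
        return iteration >= 3;
    }

    /* models the killable wait helper: 0 while it can keep waiting, an error
     * (the kernel would use ERESTARTSYS; EINTR stands in here) once a fatal
     * signal is pending, at which point it no longer sleeps */
    static int wait_bit_killable_toy(int iteration)
    {
        return fatal_signal_pending_toy(iteration) ? -EINTR : 0;
    }

    static int iocounter_wait_toy(int *counter)
    {
        int ret = 0;
        int iteration = 0;

        while (*counter > 0) {
            ret = wait_bit_killable_toy(iteration++);
            if (ret)
                break;      /* the fix: leave the loop on error instead of spinning */
        }
        return ret;
    }

    int main(void)
    {
        int counter = 1;    /* never drops to zero in this toy run */

        printf("returned %d\n", iocounter_wait_toy(&counter));
        return 0;
    }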
2014-08-23 | nfs: can_coalesce_requests must enforce contiguity | Weston Andros Adamson | 1 | -0/+8
Commit 6094f83864c1d1296566a282cba05ba613f151ee "nfs: allow coalescing of subpage requests" got rid of the requirement that requests cover whole pages, but it made some incorrect assumptions. It turns out that callers of this interface can map adjacent requests (by file position, as seen by req_offset + req->wb_bytes) to different pages, even when they could share a page. An example is the direct I/O interface - iov_iter_get_pages_alloc may return one segment with a partial page filled, while the next segment (which is adjacent in file position) starts with a new page. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
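A sketch of the stricter coalescing test this commit describes: requests must be contiguous both in file offset and in page terms. The struct and helper below are illustrative, not the nfs_page API:

    #include <stdbool.h>
    #include <stdio.h>

    #define TOY_PAGE_SIZE 4096

    struct toy_req {
        const void *wb_page;     /* page backing this request */
        unsigned    wb_pgbase;   /* offset of the data within that page */
        unsigned    wb_bytes;    /* length of the request */
        long long   wb_offset;   /* file offset (what req_offset() reports) */
    };

    static bool can_coalesce(const struct toy_req *prev, const struct toy_req *req)
    {
        /* must be contiguous in the file... */
        if (prev->wb_offset + prev->wb_bytes != req->wb_offset)
            return false;
        /* ...and contiguous in page terms: either continuing on the same page,
         * or the previous request ends its page and the new one starts at the
         * beginning of another page */
        if (req->wb_page == prev->wb_page)
            return req->wb_pgbase == prev->wb_pgbase + prev->wb_bytes;
        return prev->wb_pgbase + prev->wb_bytes == TOY_PAGE_SIZE &&
               req->wb_pgbase == 0;
    }

    int main(void)
    {
        char page_a[TOY_PAGE_SIZE], page_b[TOY_PAGE_SIZE];
        struct toy_req p = { page_a, 0,    TOY_PAGE_SIZE, 0 };               /* fills page_a */
        struct toy_req q = { page_b, 0,    1024, TOY_PAGE_SIZE };            /* next page, offset 0 */
        struct toy_req r = { page_b, 2048, 1024, TOY_PAGE_SIZE + 1024 };     /* gap within the page */

        printf("p then q: %d\n", can_coalesce(&p, &q));  /* 1: contiguous both ways */
        printf("q then r: %d\n", can_coalesce(&q, &r));  /* 0: adjacent in file only */
        return 0;
    }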
2014-08-23 | nfs: disallow duplicate pages in pgio page vectors | Weston Andros Adamson | 1 | -3/+15
Adjacent requests that share the same page are allowed, but should only use one entry in the page vector. This avoids overrunning the page vector - it is sized based on how many bytes there are, not by request count. This fixes issues that manifest as "Redzone overwritten" bugs (the vector overrun) and hangs waiting on page read / write, as it waits on the same page more than once. This also adds bounds checking to the page vector, with a graceful failure (WARN_ON_ONCE and a pgio error returned to the application). Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-23 | nfs: don't sleep with inode lock in lock_and_join_requests | Weston Andros Adamson | 2 | -1/+28
This handles the 'nonblock=false' case in nfs_lock_and_join_requests. If the group is already locked and blocking is allowed, drop the inode lock and wait for the group lock to be cleared before trying it all again. This should fix warnings found in peterz's tree (sched/wait branch), where might_sleep() checks are added to wait.[ch]. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-23 | nfs: fix error handling in lock_and_join_requests | Weston Andros Adamson | 1 | -1/+4
This fixes handling of errors from nfs_page_group_lock in nfs_lock_and_join_requests. It now releases the inode lock and the reference to the head request. Reported-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-23 | nfs: use blocking page_group_lock in add_request | Weston Andros Adamson | 1 | -11/+2
__nfs_pageio_add_request was calling nfs_page_group_lock nonblocking, but this can return -EAGAIN which would end up passing -EIO to the application. There is no reason not to block in this path, so change the two calls to do so. Also, there is no need to check the return value of nfs_page_group_lock when nonblock=false, so remove the error handling code. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>