summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2014-06-10Btrfs: fix NULL pointer crash of deleting a seed deviceLiu Bo1-4/+8
Same as normal devices, seed devices should be initialized with fs_info->dev_root as well, otherwise we'll get a NULL pointer crash. Cc: Chris Murphy <lists@colorremedies.com> Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: fix joining same transaction handle more than twiceWang Shilong1-2/+9
We hit something like the following function call flows: |->run_delalloc_range() |->btrfs_join_transaction() |->cow_file_range() |->btrfs_join_transaction() |->find_free_extent() |->btrfs_join_transaction() Trace infomation can be seen as: [ 7411.127040] ------------[ cut here ]------------ [ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 start_transaction+0x561/0x580 [btrfs]() [ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G O 3.13.0+ #4 [ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013 [ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5) [ 7411.127092] Call Trace: [ 7411.127097] [<ffffffff815b87b0>] dump_stack+0x45/0x56 [ 7411.127101] [<ffffffff81051ffd>] warn_slowpath_common+0x7d/0xa0 [ 7411.127102] [<ffffffff810520da>] warn_slowpath_null+0x1a/0x20 [ 7411.127109] [<ffffffffa0444fb1>] start_transaction+0x561/0x580 [btrfs] [ 7411.127115] [<ffffffffa0445027>] btrfs_join_transaction+0x17/0x20 [btrfs] [ 7411.127120] [<ffffffffa0431c91>] find_free_extent+0xa21/0xb50 [btrfs] [ 7411.127126] [<ffffffffa0431f68>] btrfs_reserve_extent+0xa8/0x1a0 [btrfs] [ 7411.127131] [<ffffffffa04322ce>] btrfs_alloc_free_block+0xee/0x440 [btrfs] [ 7411.127137] [<ffffffffa043bd6e>] ? btree_set_page_dirty+0xe/0x10 [btrfs] [ 7411.127142] [<ffffffffa041da51>] __btrfs_cow_block+0x121/0x530 [btrfs] [ 7411.127146] [<ffffffffa041dfff>] btrfs_cow_block+0x11f/0x1c0 [btrfs] [ 7411.127151] [<ffffffffa0421b74>] btrfs_search_slot+0x1d4/0x9c0 [btrfs] [ 7411.127157] [<ffffffffa0438567>] btrfs_lookup_file_extent+0x37/0x40 [btrfs] [ 7411.127163] [<ffffffffa0456bfc>] __btrfs_drop_extents+0x16c/0xd90 [btrfs] [ 7411.127169] [<ffffffffa0444ae3>] ? start_transaction+0x93/0x580 [btrfs] [ 7411.127171] [<ffffffff811663e2>] ? kmem_cache_alloc+0x132/0x140 [ 7411.127176] [<ffffffffa041cd9a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [ 7411.127182] [<ffffffffa044aa61>] cow_file_range_inline+0x181/0x2e0 [btrfs] [ 7411.127187] [<ffffffffa044aead>] cow_file_range+0x2ed/0x440 [btrfs] [ 7411.127194] [<ffffffffa0464d7f>] ? free_extent_buffer+0x4f/0xb0 [btrfs] [ 7411.127200] [<ffffffffa044b38f>] run_delalloc_nocow+0x38f/0xa60 [btrfs] [ 7411.127207] [<ffffffffa0461600>] ? test_range_bit+0x30/0x180 [btrfs] [ 7411.127212] [<ffffffffa044bd48>] run_delalloc_range+0x2e8/0x350 [btrfs] [ 7411.127219] [<ffffffffa04618f9>] ? find_lock_delalloc_range+0x1a9/0x1e0 [btrfs] [ 7411.127222] [<ffffffff812a1e71>] ? blk_queue_bio+0x2c1/0x330 [ 7411.127228] [<ffffffffa0462ad4>] __extent_writepage+0x2f4/0x760 [btrfs] Here we fix it by avoiding joining transaction again if we have held a transaction handle when allocating chunk in find_free_extent(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: use helpers for last_trans_log_full_commit instead of opencodeMiao Xie4-27/+36
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: check if items are ordered when a leaf is marked dirtyFilipe Manana1-0/+6
To ease finding bugs during development related to modifying btree leaves in such a way that it makes its items not sorted by key anymore. Since this is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY is set, which isn't meant to be enabled for regular users. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: don't access non-existent key when csum tree is emptyFilipe Manana1-1/+1
When the csum tree is empty, our leaf (path->nodes[0]) has a number of items equal to 0 and since btrfs_header_nritems() returns an unsigned integer (and so is our local nritems variable) the following comparison always evaluates to false: if (path->slots[0] >= nritems - 1) { As the casting rules lead to: if ((u32)0 >= (u32)4294967295) { This makes us access key at slot paths->slots[0] + 1 (1) of the empty leaf some lines below: btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot); if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID || found_key.type != BTRFS_EXTENT_CSUM_KEY) { found_next = 1; goto insert; } So just don't access such non-existent slot and don't set found_next to 1 when the tree is empty. It's very unlikely we'll get a random key with the objectid and type values above, which is where we could go into trouble. If nritems is 0, just set found_next to 1 anyway as it will make us insert a csum item covering our whole extent (or the whole leaf) when the tree is empty. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: make sure there are not any read requests before stopping workersWang Shilong1-0/+5
In close_ctree(), after we have stopped all workers,there maybe still some read requests(for example readahead) to submit and this *maybe* trigger an oops that user reported before: kernel BUG at fs/btrfs/async-thread.c:619! By hacking codes, i can reproduce this problem with one cpu available. We fix this potential problem by invalidating all btree inode pages before stopping all workers. Thanks to Miao for pointing out this problem. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: fix possible memory leak in btrfs_create_tree()Tsutomu Itoh1-0/+1
In btrfs_create_tree(), if btrfs_insert_root() fails, we should free root->commit_root. Reported-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: remove useless ACL checkZhangZhen1-7/+0
posix_acl_xattr_set() already does the check, and it's the only way to feed in an ACL from userspace. So the check here is useless, remove it. Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: btrfs_rm_device() should zero mirror SB as wellAnand Jain1-0/+31
This fix will ensure all SB copies on the disk is zeroed when the disk is intentionally removed. This helps to better manage disks in the user land. This version of patch also merges the Zach patch as below. btrfs: don't double brelse on device rm Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: use bitfield instead of integer data type for the some variants in ↵Miao Xie12-94/+109
btrfs_root Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: send, fix more issues related to directory renamesFilipe Manana1-94/+96
This is a continuation of the previous changes titled: Btrfs: fix incremental send's decision to delay a dir move/rename Btrfs: part 2, fix incremental send's decision to delay a dir move/rename There's a few more cases where a directory rename/move must be delayed which was previously overlooked. If our immediate ancestor has a lower inode number than ours and it doesn't have a delayed rename/move operation associated to it, it doesn't mean there isn't any non-direct ancestor of our current inode that needs to be renamed/moved before our current inode (i.e. with a higher inode number than ours). So we can't stop the search if our immediate ancestor has a lower inode number than ours, we need to navigate the directory hierarchy upwards until we hit the root or: 1) find an ancestor with an higher inode number that was renamed/moved in the send root too (or already has a pending rename/move registered); 2) find an ancestor that is a new directory (higher inode number than ours and exists only in the send root). Reproducer for case 1) $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b $ mkdir -p /mnt/a/c/d $ mkdir /mnt/a/b/e $ mkdir /mnt/a/c/d/f $ mv /mnt/a/b /mnt/a/c/d/2b $ mkdir /mnt/a/x $ mkdir /mnt/a/y $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/a/x /mnt/a/y $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e $ mv /mnt/a/c/d /mnt/a/h/2d $ mv /mnt/a/c /mnt/a/h/2d/2b/2c $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send Simple reproducer for case 2) $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b $ mkdir /mnt/a/c $ mv /mnt/a/b /mnt/a/c/b2 $ mkdir /mnt/a/e $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/a/c/b2 /mnt/a/e/b3 $ mkdir /mnt/a/e/b3/f $ mkdir /mnt/a/h $ mv /mnt/a/c /mnt/a/e/b3/f/c2 $ mv /mnt/a/e /mnt/a/h/e2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send Another simple reproducer for case 2) $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b $ mkdir /mnt/a/c $ mkdir /mnt/a/b/d $ mkdir /mnt/a/c/e $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mkdir /mnt/a/b/d/f $ mkdir /mnt/a/b/g $ mv /mnt/a/c/e /mnt/a/b/g/e2 $ mv /mnt/a/c /mnt/a/b/d/f/c2 $ mv /mnt/a/b/d/f /mnt/a/b/g/e2/f2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send More complex reproducer for case 2) $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b $ mkdir -p /mnt/a/c/d $ mkdir /mnt/a/b/e $ mkdir /mnt/a/c/d/f $ mv /mnt/a/b /mnt/a/c/d/2b $ mkdir /mnt/a/x $ mkdir /mnt/a/y $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/a/x /mnt/a/y $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e $ mv /mnt/a/c/d /mnt/a/h/2d $ mv /mnt/a/c /mnt/a/h/2d/2b/2c $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send For both cases the incremental send would enter an infinite loop when building path strings. While solving these cases, this change also re-implements the code to detect when directory moves/renames should be delayed. Instead of dealing with several specific cases separately, it's now more generic handling all cases with a simple detection algorithm and if when applying a delayed move/rename there's a path loop detected, it further delays the move/rename registering a new ancestor inode as the dependency inode (so our rename happens after that ancestor is renamed). Tests for these cases is being added to xfstests too. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: send, remove dead code from __get_cur_name_and_parentFilipe Manana1-6/+0
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: send, account for orphan directories when building path stringsFilipe Manana1-24/+9
If we have directories with a pending move/rename operation, we must take into account any orphan directories that got created before executing the pending move/rename. Those orphan directories are directories with an inode number higher then the current send progress and that don't exist in the parent snapshot, they are created before current progress reaches their inode number, with a generated name of the form oN-M-I and at the root of the filesystem tree, and later when progress matches their inode number, moved/renamed to their final location. Reproducer: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b/c/d $ mkdir /mnt/a/b/e $ mv /mnt/a/b/c /mnt/a/b/e/CC $ mkdir /mnt/a/b/e/CC/d/f $ mkdir /mnt/a/g $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mkdir /mnt/a/g/h $ mv /mnt/a/b/e /mnt/a/g/h/EE $ mv /mnt/a/g/h/EE/CC/d /mnt/a/g/h/EE/DD $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send The second receive command failed with the following error: ERROR: rename a/b/e/CC/d -> o264-7-0/EE/DD failed. No such file or directory A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: send, avoid unnecessary inode item lookup in the btreeFilipe Manana1-6/+7
Regardless of whether the caller is interested or not in knowing the inode's generation (dir_gen != NULL), get_first_ref always does a btree lookup to get the inode item. Avoid this useless lookup if dir_gen parameter is NULL (which is in some cases). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: add dev maxs limit for __btrfs_alloc_chunk in kernel spaceGui Hecheng1-0/+16
For RAID0,5,6,10, For system chunk, there shouldn't be too many stripes to make a btrfs_chunk that exceeds BTRFS_SYSTEM_CHUNK_ARRAY_SIZE For data/meta chunk, there shouldn't be too many stripes to make a btrfs_chunk that exceeds a leaf. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: fix wrong max system array size check in kernel spaceGui Hecheng1-1/+2
For system chunk array, We copy a "disk_key" and an chunk item each time, so there should be enough space to hold both of them, not only the chunk item. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: Add check to avoid cleanup roots already in fs_info->dead_roots.Qu Wenruo1-7/+30
Current btrfs_orphan_cleanup will also cleanup roots which is already in fs_info->dead_roots without protection. This will have conditional race with fs_info->cleaner_kthread. This patch will use refs in root->root_item to detect roots in dead_roots and avoid conflicts. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: reclaim the reserved metadata space at backgroundMiao Xie4-1/+114
Before applying this patch, the task had to reclaim the metadata space by itself if the metadata space was not enough. And When the task started the space reclamation, all the other tasks which wanted to reserve the metadata space were blocked. At some cases, they would be blocked for a long time, it made the performance fluctuate wildly. So we introduce the background metadata space reclamation, when the space is about to be exhausted, we insert a reclaim work into the workqueue, the worker of the workqueue helps us to reclaim the reserved space at the background. By this way, the tasks needn't reclaim the space by themselves at most cases, and even if the tasks have to reclaim the space or are blocked for the space reclamation, they will get enough space more quickly. Here is my test result(Tested by compilebench): Memory: 2GB CPU: 2Cores * 1CPU Partition: 40GB(SSD) Test command: # compilebench -D <mnt> -m Without this patch: intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s) compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s) read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s) delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s) With this patch: intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s) compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s) read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s) delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s) Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: output warning instead of error when loading free space cache failedMiao Xie1-2/+2
If we fail to load a free space cache, we can rebuild it from the extent tree, so it is not a serious error, we should not output a error message that would make the users uncomfortable. This patch uses warning message instead of it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: Add ctime/mtime update for btrfs device add/remove.Qu Wenruo1-2/+24
Btrfs will send uevent to udev inform the device change, but ctime/mtime for the block device inode is not udpated, which cause libblkid used by btrfs-progs unable to detect device change and use old cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an error message. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Cc: Karel Zak <kzak@redhat.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: assert that send is not in progres before root deletionDavid Sterba2-13/+1
CC: Miao Xie <miaox@cn.fujitsu.com> CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: protect snapshots from deleting during sendDavid Sterba3-2/+53
The patch "Btrfs: fix protection between send and root deletion" (18f687d538449373c37c) does not actually prevent to delete the snapshot and just takes care during background cleaning, but this seems rather user unfriendly, this patch implements the idea presented in http://www.spinics.net/lists/linux-btrfs/msg30813.html - add an internal root_item flag to denote a dead root - check if the send_in_progress is set and refuse to delete, otherwise set the flag and proceed - check the flag in send similar to the btrfs_root_readonly checks, for all involved roots The root lookup in send via btrfs_read_fs_root_no_name will check if the root is really dead or not. If it is, ENOENT, aborted send. If it's alive, it's protected by send_in_progress, send can continue. CC: Miao Xie <miaox@cn.fujitsu.com> CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: remove redundant null check in btrfs_dentry_release()Daeseok Youn1-2/+1
It doesn't need to check NULL for kfree() Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: implement inode_operations callback tmpfileFilipe Manana1-20/+98
This implements the tmpfile callback of struct inode_operations, introduced in the linux kernel 3.11, and implemented already by some filesystems. This callback is invoked by the VFS when the flag O_TMPFILE is passed to the open system call. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.cz>
2014-06-10btrfs: make FS_INFO ioctl available to anyoneDavid Sterba1-3/+0
This ioctl provides basic info about the filesystem that can be obtained in other ways (eg. sysfs), there's no reason to restrict it to CAP_SYSADMIN. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: make DEV_INFO ioctl available to anyoneDavid Sterba1-3/+0
This ioctl provides basic info about the devices that can be obtained in other ways (eg. sysfs), there's no reason to restrict it to CAP_SYSADMIN. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: export more from FS_INFO to sysfsDavid Sterba1-0/+40
Similar to the FS_INFO updates, export the basic filesystem info through sysfs: node size, sector size and clone alignment. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: retrieve more info from FS_INFO ioctlDavid Sterba1-0/+4
Provide the basic information about filesystem through the ioctl: * b-tree node size (same as leaf size) * sector size * expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments Backward compatibility: if the values are 0, kernel does not provide this information, the applications should ignore them. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: balance filter: add limit of processed chunksDavid Sterba3-1/+25
This started as debugging helper, to watch the effects of converting between raid levels on multiple devices, but could be useful standalone. In my case the usage filter was not finegrained enough and led to converting too many chunks at once. Another example use is in connection with drange+devid or vrange filters that allow to work with a specific chunk or even with a chunk on a given device. The limit filter applies last, the value of 0 means no limiting. CC: Ilya Dryomov <idryomov@gmail.com> CC: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: fix leaf corruption caused by ENOSPC while hole punchingFilipe Manana1-1/+19
While running a stress test with multiple threads writing to the same btrfs file system, I ended up with a situation where a leaf was corrupted in that it had 2 file extent item keys that had the same exact key. I was able to detect this quickly thanks to the following patch which triggers an assertion as soon as a leaf is marked dirty if there are duplicated keys or out of order keys: Btrfs: check if items are ordered when a leaf is marked dirty (https://patchwork.kernel.org/patch/3955431/) Basically while running the test, I got the following in dmesg: [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]() (...) [28877.415917] Call Trace: [28877.415922] [<ffffffff816f1189>] dump_stack+0x4e/0x68 [28877.415926] [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0 [28877.415929] [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20 [28877.415944] [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs] [28877.415949] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0 [28877.415962] [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs] [28877.415972] [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs] [28877.415984] [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs] (...) [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24 [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778 (...) [29854.132637] item 23 key (3486 108 667648) itemoff 2694 itemsize 53 [29854.132638] extent data disk bytenr 14574411776 nr 286720 [29854.132639] extent data offset 0 nr 286720 ram 286720 [29854.132640] item 24 key (3486 108 954368) itemoff 2641 itemsize 53 [29854.132641] extent data disk bytenr 0 nr 0 [29854.132643] extent data offset 0 nr 0 ram 0 [29854.132644] item 25 key (3486 108 954368) itemoff 2588 itemsize 53 [29854.132645] extent data disk bytenr 8699670528 nr 77824 [29854.132646] extent data offset 0 nr 77824 ram 77824 [29854.132647] item 26 key (3486 108 1146880) itemoff 2535 itemsize 53 [29854.132648] extent data disk bytenr 8699670528 nr 77824 [29854.132649] extent data offset 0 nr 77824 ram 77824 (...) [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901! (...) [29854.132771] Call Trace: [29854.132779] [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs] [29854.132791] [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs] [29854.132794] [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0 [29854.132797] [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10 [29854.132800] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0 [29854.132810] [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs] [29854.132820] [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs] [29854.132830] [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs] (...) So this is caused by getting an -ENOSPC error while punching a file hole, more specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one or more file extent items due to lack of enough free space. When this happens, in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction block reservation object to root->fs_info->trans_block_rsv, end our transaction and start a new transaction basically - and, we keep increasing our current offset (cur_offset) as long as it's smaller than the end of the target range (lockend) - this makes use leave the loop with cur_offset == drop_end which in turn makes us call fill_holes() for inserting a file extent item that represents a 0 bytes range hole (and this insertion succeeds, as in the meanwhile more space became available). This 0 bytes file hole extent item is a problem because any subsequent caller of __btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a start file offset that is equal to the offset of the hole, will not remove this extent item due to the following conditional in the while loop of __btrfs_drop_extents: if (extent_end <= search_start) { path->slots[0]++; goto next_slot; } This later makes the call to setup_items_for_insert() (at the very end of __btrfs_drop_extents), insert a new file extent item with the same offset as the 0 bytes file hole extent item that follows it. Needless is to say that this causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook), where we perform leaf sanity checks or in subsequent operations that manipulate file extent items, as in the fallocate call as shown by the dmesg trace above. Without my other patch to perform the leaf sanity checks once a leaf is marked as dirty (if the integrity checker is enabled), it would have been much harder to debug this issue. This change might fix a few similar issues reported by users in the mailing list regarding assertion failures in btrfs_set_item_key_safe calls performed by __btrfs_drop_extents, such as the following report: http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938 Asking fill_holes() to create a 0 bytes wide file hole item also produced the first warning in the trace above, as we passed a range to btrfs_drop_extent_cache that has an end smaller (by -1) than its start. On 3.14 kernels this issue manifests itself through leaf corruption, as we get duplicated file extent item keys in a leaf when calling setup_items_for_insert(), but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(), instead we have callers of __btrfs_drop_extents(), namely the functions inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling btrfs_insert_empty_item() to insert the new file extent item, which would fail with error -EEXIST, instead of inserting a duplicated key - which is still a serious issue as it would make all similar file extent item replace operations keep failing if they target the same file range. Cc: stable@vger.kernel.org Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: do not increment on bio_index one by oneLiu Bo1-1/+1
'bio_index' is just a index, it's really not necessary to do increment one by one. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: read inode size after acquiring the mutex when punching a holeFilipe Manana1-1/+2
In a previous change, commit 12870f1c9b2de7d475d22e73fd7db1b418599725, I accidentally moved the roundup of inode->i_size to outside of the critical section delimited by the inode mutex, which is not atomic and not correct since the size can be changed by other task before we acquire the mutex. Therefore fix it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: Remove unnecessary check for NULLTobias Klauser1-2/+2
iput() already checks for the inode being NULL, thus it's unnecessary to check before calling. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: fix inline compressed read err corruptionZach Brown1-10/+5
uncompress_inline() is dropping the error from btrfs_decompress() after testing it and zeroing the page that was supposed to hold decompressed data. This can silently turn compressed inline data in to zeros if decompression fails due to corrupt compressed data or memory allocation failure. I verified this by manually forcing the error from btrfs_decompress() for a silly named copy of od: if (!strcmp(current->comm, "failod")) ret = -ENOMEM; # od -x /mnt/btrfs/dir/80 | head -1 0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031 # echo 3 > /proc/sys/vm/drop_caches # cp $(which od) /tmp/failod # /tmp/failod -x /mnt/btrfs/dir/80 | head -1 0000000 0000 0000 0000 0000 0000 0000 0000 0000 The fix is to pass the error to its caller. Which still has a BUG_ON(). So we fix that too. There seems to be no reason for the zeroing of the page on the error from btrfs_decompress() but not from the allocation error a few lines above. So the page zeroing is removed. Signed-off-by: Zach Brown <zab@redhat.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: return ptr error from compression workspaceZach Brown1-3/+3
The btrfs compression wrappers translated errors from workspace allocation to either -ENOMEM or -1. The compression type workspace allocators are already returning a ERR_PTR(-ENOMEM). Just return that and get rid of the magical -1. This helps a future patch return errors from the compression wrappers. Signed-off-by: Zach Brown <zab@redhat.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: return errno instead of -1 from compressionZach Brown2-20/+20
The compression layer seems to have been built to return -1 and have callers make up errors that make sense. This isn't great because there are different errors that originate down in the compression layer. Let's return real negative errnos from the compression layer so that callers can pass on the error without having to guess what happened. ENOMEM for allocation failure, E2BIG when compression exceeds the uncompressed input, and EIO for everything else. This helps a future path return errors from btrfs_decompress(). Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10btrfs: check_int: propagate out-of-memory error upwardsStefan Behrens1-1/+4
This issue was not causing any harm but IMO (and in the opinion of the static code checker) it is better to propagate this error status upwards. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Btrfs: fix hang on error (such as ENOSPC) when writing extent pagesFilipe Manana1-5/+11
When running low on available disk space and having several processes doing buffered file IO, I got the following trace in dmesg: [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds. [ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1 [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000 [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs] [ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40 [ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490 [ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001 [ 4202.720922] Call Trace: [ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70 [ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs] [ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0 [ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs] [ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs] [ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530 (...) [ 4202.721027] 2 locks held by kworker/u8:1/5450: [ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530 [ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530 [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds. [ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1 [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001 [ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40 [ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490 [ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff [ 4202.721718] Call Trace: [ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70 [ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270 [ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140 [ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40 [ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0 [ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120 [ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310 [ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs] [ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs] [ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0 [ 4202.721799] [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs] [ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs] (...) It turns out that extent_io.c:__extent_writepage(), which ends up being called through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting -ENOSPC when calling the fill_delalloc callback. In this situation, it returned without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook) ever being called for the respective page, which prevents the ordered extent's bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work is never queued into the endio_write_workers queue. This makes the task that called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered extent. This is fairly easy to reproduce using a small filesystem and fsstress on a quad core vm: mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd mount /dev/sdd /mnt fsstress -p 6 -d /mnt -n 100000 -x \ "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \ -f allocsp=0 \ -f bulkstat=0 \ -f bulkstat1=0 \ -f chown=0 \ -f creat=1 \ -f dread=0 \ -f dwrite=0 \ -f fallocate=1 \ -f fdatasync=0 \ -f fiemap=0 \ -f freesp=0 \ -f fsync=0 \ -f getattr=0 \ -f getdents=0 \ -f link=0 \ -f mkdir=0 \ -f mknod=0 \ -f punch=1 \ -f read=0 \ -f readlink=0 \ -f rename=0 \ -f resvsp=0 \ -f rmdir=0 \ -f setxattr=0 \ -f stat=0 \ -f symlink=0 \ -f sync=0 \ -f truncate=1 \ -f unlink=0 \ -f unresvsp=0 \ -f write=4 So just ensure that if an error happens while writing the extent page we call the writepage_end_io_hook callback. Also make it return the error code and ensure the caller (extent_write_cache_pages) processes all pages in the page vector even if an error happens only for some of them, so that ordered extents end up released. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10Merge branch 'xfs-misc-fixes-3-for-3.16' into for-nextDave Chinner13-53/+60
2014-06-10Merge branch 'xfs-da-geom' into for-nextDave Chinner26-778/+819
2014-06-10xfs: fix xfs_da_args sparse warning in xfs_readdirDave Chinner1-1/+1
The kbuild test robot reported: >> fs/xfs/xfs_dir2_readdir.c:672:41: sparse: Using plain integer as NULL pointer Fix it. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-10nfsd4: fix FREE_STATEID lockowner leakJ. Bruce Fields1-1/+1
27b11428b7de ("nfsd4: remove lockowner when removing lock stateid") introduced a memory leak. Cc: stable@vger.kernel.org Reported-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-09NFSv4.1: Fix typo in dprintkTom Haynes1-1/+1
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-09NFSv4.1: Comment is now wrong and redundant to codeTom Haynes1-4/+1
The save of the write offset was removed some time ago, so that part of the comment is bogus. The remainder is pretty self-evident. So off with it! Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-09Merge tag 'for-linus-3.16-merge-window' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs Pull 9p fixes from Eric Van Hensbergen: "Two bug fixes, one in xattr error path and the other in parsing major/minor numbers from devices" * tag 'for-linus-3.16-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: 9P: fix return value in v9fs_fid_xattr_set fs/9p: adjust sscanf parameters accordingly to the variable types
2014-06-09Merge tag 'ext4_for_linus' of ↵Linus Torvalds20-429/+494
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Clean ups and miscellaneous bug fixes, in particular for the new collapse_range and zero_range fallocate functions. In addition, improve the scalability of adding and remove inodes from the orphan list" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: handle symlink properly with inline_data ext4: fix wrong assert in ext4_mb_normalize_request() ext4: fix zeroing of page during writeback ext4: remove unused local variable "stored" from ext4_readdir(...) ext4: fix ZERO_RANGE test failure in data journalling ext4: reduce contention on s_orphan_lock ext4: use sbi in ext4_orphan_{add|del}() ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged() ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access ext4: remove unnecessary double parentheses ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails ext4: make local functions static ext4: fix block bitmap validation when bigalloc, ^flex_bg ext4: fix block bitmap initialization under sparse_super2 ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem ext4: avoid unneeded lookup when xattr name is invalid ext4: fix data integrity sync in ordered mode ext4: remove obsoleted check ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode ext4: fix locking for O_APPEND writes ...
2014-06-08Merge branch 'next' (accumulated 3.16 merge window patches) into masterLinus Torvalds213-4503/+2190
Now that 3.15 is released, this merges the 'next' branch into 'master', bringing us to the normal situation where my 'master' branch is the merge window. * accumulated work in next: (6809 commits) ufs: sb mutex merge + mutex_destroy powerpc: update comments for generic idle conversion cris: update comments for generic idle conversion idle: remove cpu_idle() forward declarations nbd: zero from and len fields in NBD_CMD_DISCONNECT. mm: convert some level-less printks to pr_* MAINTAINERS: adi-buildroot-devel is moderated MAINTAINERS: add linux-api for review of API/ABI changes mm/kmemleak-test.c: use pr_fmt for logging fs/dlm/debug_fs.c: replace seq_printf by seq_puts fs/dlm/lockspace.c: convert simple_str to kstr fs/dlm/config.c: convert simple_str to kstr mm: mark remap_file_pages() syscall as deprecated mm: memcontrol: remove unnecessary memcg argument from soft limit functions mm: memcontrol: clean up memcg zoneinfo lookup mm/memblock.c: call kmemleak directly from memblock_(alloc|free) mm/mempool.c: update the kmemleak stack trace for mempool allocations lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations mm: introduce kmemleak_update_trace() mm/kmemleak.c: use %u to print ->checksum ...
2014-06-08Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd into nextLinus Torvalds4-81/+98
Pull exofs raid6 support from Boaz Harrosh: "These simple patches will enable raid6 using the kernel's raid6_pq engine for support under exofs and pnfs-objects. There is nothing needed to do at exofs and pnfs-obj. Just fire your mkfs.exofs with --raid=6 (that was already supported before) and off you go as usual. The ORE will pick up the new map and will start writing two devices of redundancy bits. The patches are so simple because most of the ORE was already for the general raid case, only a few bug fixes were needed and the actual wiring into the raid6_pq engine" * 'for-linus' of git://git.open-osd.org/linux-open-osd: ore: Support for raid 6 ore: Remove redundant dev_order(), more cleanups ore: (trivial) reformat some code
2014-06-08f2fs: support f2fs_fiemapJaegeuk Kim3-0/+8
This patch links f2fs_fiemap with generic function with get_block. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-08Merge branch 'for-linus' of ↵Linus Torvalds1-2/+5
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "I had this in my 3.16 merge window queue, but it is small and obvious enough for 3.15. I cherry-picked and retested against current rc8" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: send, fix corrupted path strings for long paths