summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2015-06-03Btrfs: fail on mismatched subvol and subvolid mount optionsOmar Sandoval1-8/+25
There's nothing to stop a user from passing both subvol= and subvolid= to mount, but if they don't refer to the same subvolume, someone is going to be surprised at some point. Error out on this case, but allow users to pass in both if they do match (which they could, for example, get out of /proc/mounts). Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: clean up error handling in mount_subvol()Omar Sandoval1-28/+33
In preparation for new functionality in mount_subvol(), give it ownership of subvol_name and tidy up the error paths. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: remove all subvol options before mounting top-levelOmar Sandoval1-36/+20
Currently, setup_root_args() substitutes 's/subvol=[^,]*/subvolid=0/'. But, this means that if the user passes both a subvol and subvolid for some reason, we won't actually mount the top-level when we recursively mount. For example, consider: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt btrfs subvol create /mnt/subvol1 # subvolid=257 btrfs subvol create /mnt/subvol2 # subvolid=258 umount /mnt mount -osubvol=/subvol1,subvolid=258 /dev/sdb /mnt In the final mount, subvol=/subvol1,subvolid=258 becomes subvolid=0,subvolid=258, and the last option takes precedence, so we mount subvol2 and try to look up subvol1 inside of it, which fails. So, instead, do a thorough scan through the argument list and remove any subvol= and subvolid= options, then append subvolid=0 to the end. This implicitly makes subvol= take precedence over subvolid=, but we're about to add a stricter check for that. This also makes setup_root_args() more generic, which we'll need soon. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: lock superblock before remounting for rw subvolOmar Sandoval1-0/+2
Since commit 0723a0473fb4 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options"), when mounting a subvolume read/write when another subvolume has previously been mounted read-only, we first do a remount. However, this should be done with the superblock locked, as per sync_filesystem(): /* * We need to be protected against the filesystem going from * r/o to r/w or vice versa. */ WARN_ON(!rwsem_is_locked(&sb->s_umount)); This WARN_ON can easily be hit with: mkfs.btrfs -f /dev/vdb mount /dev/vdb /mnt btrfs subvol create /mnt/vol1 btrfs subvol create /mnt/vol2 umount /mnt mount -oro,subvol=/vol1 /dev/vdb /mnt mount -orw,subvol=/vol2 /dev/vdb /mnt2 Fixes: 0723a0473fb4 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options") Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: wake up extent state waiters on unlock through clear_extent_bitsFilipe Manana1-1/+6
When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits() through free_io_failure(), we weren't waking up any tasks waiting for the extent's state EXTENT_LOCKED bit, leading to an hang. So make sure clear_extent_bits() ends up waking up any waiters if the bit EXTENT_LOCKED is supplied by its callers. Zygo Blaxell was experiencing such hangs at inode eviction time after file unlinks. Thanks to him for a set of scripts to reproduce the issue. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: fix chunk allocation regression leading to transaction abortFilipe Manana1-3/+19
With commit 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") introduced in the kernel 4.1 merge window, we end up using part of a device hole for which there are already pending chunks or pinned chunks. Before that commit we didn't use the hole and would just move on to the next hole in the device. However when we adjust the start offset for the chunk allocation and we have pinned chunks, we set it blindly to the end offset of the pinned chunk we are currently processing, which is dangerous because we can have a pending chunk that has a start offset that matches the end offset of our pinned chunk - leading us to a case where we end up getting two pending chunks that start at the same physical device offset, which makes us later abort the current transaction with -EEXIST when finishing the chunk allocation at btrfs_create_pending_block_groups(): [194737.659017] ------------[ cut here ]------------ [194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [194737.662209] BTRFS: Transaction aborted (error -17) [194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse [194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [194737.682999] 0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2 [194737.684540] ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78 [194737.686017] ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40 [194737.687509] Call Trace: [194737.688068] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [194737.689027] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [194737.690095] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [194737.691198] [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.693789] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [194737.695065] [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.696806] [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [194737.698683] [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs] [194737.700329] [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs] [194737.701924] [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [194737.703675] [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs] [194737.705417] [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [194737.707058] [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs] [194737.708560] [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs] [194737.710673] [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e [194737.712076] [<ffffffff811534c3>] new_sync_write+0x7c/0xa0 [194737.713293] [<ffffffff81153b58>] vfs_write+0xb2/0x117 [194737.714443] [<ffffffff81154424>] SyS_pwrite64+0x64/0x82 [194737.715646] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [194737.717175] ---[ end trace f2d5dc04e56d7e48 ]--- [194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by btrfs_create_pending_block_groups(), when it attempts to insert a duplicated device extent item via btrfs_alloc_dev_extent(). This issue was reproducible with fstests generic/038 running in a loop for several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard". Applying Jeff's recent patch titled "btrfs: add missing discards when unpinning extents with -o discard" makes the issue much easier to reproduce (usually within 4 to 5 hours), since it pins chunks for longer periods of time when an unused block group is deleted by the cleaner kthread. Fix this by making sure that we never adjust the start offset to a lower value than it currently has. Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole" Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03btrfs: use after free when closing devicesSasha Levin1-2/+2
__btrfs_close_devices() would call_rcu to free the device, which is racy with list_for_each_entry() accessing the memory to retrieve the next device on the list. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03btrfs: make root id query unprivilegedDavid Sterba1-4/+16
The INO_LOOKUP ioctl can lookup path for a given inode number and is thus restricted. As a sideefect it can find the root id of the containing subvolume and we're using this int the 'btrfs inspect rootid' command. The restriction is unnecessary in case we set the ioctl args args::treeid = 0 args::objectid = 256 (BTRFS_FIRST_FREE_OBJECTID) Then the path will be empty and the treeid is filled with the root id of the inode on which the ioctl is called. This behaviour is unchanged, after the root restriction is removed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: fix block group ->space_info null pointer dereferenceFilipe Manana1-2/+23
When we create a block group we add it to the rbtree of block groups before setting its ->space_info field (while it's NULL). This is problematic since other tasks can access the block group from the rbtree and attempt to use its ->space_info before it is set by btrfs_make_block_group(). This can happen for example when a concurrent fitrim ioctl operation is ongoing, which produces a trace like the following when CONFIG_DEBUG_PAGEALLOC is set. [11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0 [11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou [11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000 [11509.608179] RIP: 0010:[<ffffffff8107d675>] [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] RSP: 0018:ffff8801b3edf9e8 EFLAGS: 00010002 [11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000 [11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 [11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000 [11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000 [11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0 [11509.608179] FS: 00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000 [11509.608179] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0 [11509.608179] Stack: [11509.608179] 0000000000000000 0000000000000000 0000000000000004 0000000000000000 [11509.608179] ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000 [11509.608179] 0000000000000001 0000000000000000 ffff880100000000 00000000000006c4 [11509.608179] Call Trace: [11509.608179] [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02 [11509.608179] [<ffffffff8107e806>] lock_acquire+0xa5/0x116 [11509.608179] [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44 [11509.608179] [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs] [11509.608179] [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs] [11509.608179] [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs] [11509.608179] [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs] [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e [11509.608179] [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479 [11509.608179] [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e [11509.608179] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 [11509.608179] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f [11509.608179] [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f [11509.608179] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00 [11509.608179] RIP [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] RSP <ffff8801b3edf9e8> [11509.608179] CR2: 0000000000000018 [11509.608179] ---[ end trace 570a5c6769f0e49a ]--- Which corresponds to the following access in fs/btrfs/free-space-cache.c: static int do_trimming(struct btrfs_block_group_cache *block_group, u64 *total_trimmed, u64 start, u64 bytes, u64 reserved_start, u64 reserved_bytes, struct btrfs_trim_range *trim_entry) { struct btrfs_space_info *space_info = block_group->space_info; (...) spin_lock(&space_info->lock); ^^^^^ - block_group->space_info is NULL... Fix this by ensuring the block group's ->space_info is set before adding the block group to the rbtree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: check error before reporting missing device and add uuidAnand Jain1-1/+2
Report missing device when add is successful, otherwise it would exit as ENOMEM. And add uuid to the report. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03btrfs: Fix superblock csum type check.Qu Wenruo1-1/+1
Old csum type check is wrong and can't catch csum_type 1(not supported). Fix it to avoid hostile 0 division. Reported-by: Lukas Lueg <lukas.lueg@gmail.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: incremental send, fix clone operations for compressed extentsFilipe Manana1-1/+17
Marc reported a problem where the receiving end of an incremental send was performing clone operations that failed with -EINVAL. This happened because, unlike for uncompressed extents, we were not checking if the source clone offset and length, after summing the data offset, falls within the source file's boundaries. So make sure we do such checks when attempting to issue clone operations for compressed extents. Problem reproducible with the following steps: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount -o compress /dev/sdc /mnt2 # Create the file with a single extent of 128K. This creates a metadata file # extent item with a data start offset of 0 and a logical length of 128K. $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo # Now rewrite the range 64K to 112K of our file. This will make the inode's # metadata continue to point to the 128K extent we created before, but now # with an extent item that points to the extent with a data start offset of # 112K and a logical length of 16K. # That metadata file extent item is associated with the logical file offset # at 176K and covers the logical file range 176K to 192K. $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo # Now rewrite the range 180K to 12K. This will make the inode's metadata # continue to point the the 128K extent we created earlier, with a single # extent item that points to it with a start offset of 112K and a logical # length of 4K. # That metadata file extent item is associated with the logical file offset # at 176K and covers the logical file range 176K to 180K. $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ touch /mnt/bar # Calls the btrfs clone ioctl. $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \ -l $((4 * 1024)) /mnt/foo /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 At subvol /mnt/snap1 At subvol snap1 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2 At subvol /mnt/snap2 At snapshot snap2 ERROR: failed to clone extents to bar Invalid argument A test case for fstests follows soon. Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.cz> Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03btrfs: qgroup: Fix possible leak in btrfs_add_qgroup_relation()Christian Engelmayer1-4/+4
Commit 9c8b35b1ba21 ("btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.") introduced the allocation of a temporary ulist in function btrfs_add_qgroup_relation() and added the corresponding cleanup to the out path. However, the allocation was introduced before the src/dst level check that directly returns. Fix the possible leakage of the ulist by moving the allocation after the input validation. Detected by Coverity CID 1295988. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: fix mutex unlock without prior lock on space cache truncationFilipe Manana1-8/+6
If the call to btrfs_truncate_inode_items() failed and we don't have a block group, we were unlocking the cache_write_mutex without having locked it (we do it only if we have a block group). Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: log when missing device is createdAnand Jain1-0/+2
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03btrfs: fix warnings after changes in btrfs_abort_transactionDavid Sterba2-4/+4
fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’: fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, tree_root, ^ CC [M] fs/btrfs/ioctl.o fs/btrfs/ioctl.c: In function ‘create_subvol’: fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, root, PTR_ERR(new_root)); PTR_ERR returns long, but we're really using 'int' for the error codes everywhere so just set and use the local variable. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03btrfs: add 'cold' compiler annotations to all error handling functionsDavid Sterba2-0/+7
The annotated functios will be placed into .text.unlikely section. The annotation also hints compiler to move the code out of the hot paths, and may implicitly mark if-statement leading to that block as unlikely. This is a heuristic, the impact on the generated code is not significant. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03btrfs: report exact callsite where transaction abort occursDavid Sterba2-11/+9
WARN is called from a single location and all bugreports say that's in super.c __btrfs_abort_transaction. This is slightly confusing as we'd rather want to know the exact callsite. Whereas this information is printed in the syslog below the stacktrace, this requires further look and we usually see only the headline from WARNING. Moving the WARN into the macro has to inline some code and increases code by a few kilobytes: text data bss dec hex filename 835481 20305 14120 869906 d4612 btrfs.ko.before 842883 20305 14120 877308 d62fc btrfs.ko.after The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config. The increase is not small and could lead to worse icache use. The code is on error/exit paths that can be recognized by compiler as cold and moved out of the way so the impact is speculated to be low, if measurable at all. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03btrfs: let tree defrag work in SSD modeDavid Sterba1-3/+0
Long time ago (2008) the defrag was automatic for new b-tree writes but has been disabled after performance problems. There was a leftover in tree-defrag.c that effectively stops any defragmentation on b-trees. This is a bit unexpected and IMHO undesired. The SSD mode is an optimization and defrag is supposed to work if the users asks for it. Related commits: 6702ed490ca0bb44e17131818a5a18b773957c5a Btrfs: Add run time btree defrag, and an ioctl to force btree defrag e18e4809b10e6c9efb5fe10c1ddcb4ebb690d517 Btrfs: Add mount -o ssd, which includes optimizations for seek free storage b3236e68bf86b3ae87f58984a1822369225211cb Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there 9afbb0b752ef30a429c45b9de6706e28ad1a36e1 Btrfs: Disable tree defrag in SSD mode The last three commits switch the defrag+ssd off/on/off and the last one 3f157a2fd2ad731e1ed9964fecdc5f459f04a4a4 Btrfs: Online btree defragmentation fixes misses the bits from tree-defrag.c to revert to the behaviour introduced in e18e4809b10e. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: check pending chunks when shrinking fs to avoid corruptionFilipe Manana1-9/+40
When we shrink the usable size of a device (its total_bytes), we go over all the device extent items in the device tree and attempt to relocate the chunk of any device extent that goes beyond the new usable size for the device. We do that after setting the new usable size (total_bytes) in the device object, so that all new allocations (and reallocations) don't use areas of the device that go beyond the new (shorter) size. However we were not considering that before setting the new size in the device, pending chunks might have been created that use device extents that go beyond the new size, and those device extents are not yet in the device tree after we search the device tree - they are still attached to the list of new block group for some ongoing transaction handle, and they are only added to the device tree when the transaction handle is ended (via btrfs_create_pending_block_groups()). So check for pending chunks with device extents that go beyond the new size and if any exists, commit the current transaction and repeat the search in the device tree. Not doing this it would mean we would return success to user space while still having extents that go beyond the new size, and later user space could override those locations on the device while the fs still references them, causing all sorts of corruption and unexpected events. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: don't invalidate root dentry when subvolume deletion failsOmar Sandoval1-3/+1
Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"), mounted subvolumes can be deleted because d_invalidate() won't fail. However, we run into problems when we attempt to delete the default subvolume while it is mounted as the root filesystem: # btrfs subvol list / ID 257 gen 306 top level 5 path rootvol ID 267 gen 334 top level 5 path snap1 # btrfs subvol get-default / ID 267 gen 334 top level 5 path snap1 # btrfs inspect-internal rootid / 267 # mount -o subvol=/ /dev/vda1 /mnt # btrfs subvol del /mnt/snap1 Delete subvolume (no-commit): '/mnt/snap1' ERROR: cannot delete '/mnt/snap1' - Operation not permitted # findmnt / findmnt: can't read /proc/mounts: No such file or directory # ls /proc # Markus reported that this same scenario simply led to a kernel oops. This happens because in btrfs_ioctl_snap_destroy(), we call d_invalidate() before we check may_destroy_subvol(), which means that we detach the submounts and drop the dentry before erroring out. Instead, we should only invalidate the dentry once the deletion has succeeded. Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate() will prune the dcache for the deleted subvolume. Cc: <stable@vger.kernel.org> Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate") Reported-by: Markus Schauler <mschauler@gmail.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: incremental send, check if orphanized dir inode needs delayed renameFilipe Manana1-19/+37
If a directory inode is orphanized, because some inode previously processed has a new name that collides with the old name of the current inode, we need to check if it needs its rename operation delayed too, as its ancestor-descendent relationship with some other inode might have been reversed between the parent and send snapshots and therefore its rename operation needs to happen after that other inode is renamed. For example, for the following reproducer where this is needed (provided by Robbie Ko): $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ mkdir -p /mnt/data/n1/n2 $ mkdir /mnt/data/n4 $ mkdir -p /mnt/data/t6/t7 $ mkdir /mnt/data/t5 $ mkdir /mnt/data/t7 $ mkdir /mnt/data/n4/t2 $ mkdir /mnt/data/t4 $ mkdir /mnt/data/t3 $ mv /mnt/data/t7 /mnt/data/n4/t2 $ mv /mnt/data/t4 /mnt/data/n4/t2/t7 $ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4 $ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5 $ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6 $ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6 $ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2 $ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4 $ mv /mnt/data/n4/t2 /mnt/data/n4/n1 $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6 $ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3 $ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2 ERROR: send ioctl failed with -12: Cannot allocate memory Where the parent snapshot directory hierarchy is the following: . (ino 256) |-- data/ (ino 257) |-- n4/ (ino 260) |-- t2/ (ino 265) |-- t7/ (ino 264) |-- t4/ (ino 266) |-- t5/ (ino 263) |-- t6/ (ino 261) |-- n1/ (ino 258) |-- n2/ (ino 259) |-- t7/ (ino 262) |-- t3/ (ino 267) And the send snapshot's directory hierarchy is the following: . (ino 256) |-- data/ (ino 257) |-- n4/ (ino 260) |-- n1/ (ino 258) |-- t2/ (ino 265) |-- n2/ (ino 259) |-- t3/ (ino 267) | |-- t7 (ino 264) | |-- t6/ (ino 261) | |-- t4/ (ino 266) | |-- t5/ (ino 263) | |-- t7/ (ino 262) While processing inode 262 we orphanize inode 264 and later attempt to rename inode 264 to its new name/location, which resulted in building an incorrect destination path string for the rename operation with the value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must have been done only after inode 267 is processed and renamed, as the ancestor-descendent relationship between inodes 264 and 267 was reversed between both snapshots, because otherwise it results in an infinite loop when building the path string for inode 264 when we are processing an inode with a number larger than 264. That loop is the following: start inode 264, send progress of 265 for example parent of 264 -> 267 parent of 267 -> 262 parent of 262 -> 259 parent of 259 -> 261 parent of 261 -> 263 parent of 263 -> 266 parent of 266 -> 264 |--> back to first iteration while current path string length is <= PATH_MAX, and fail with -ENOMEM otherwise So fix this by making the check if we need to delay a directory rename regardless of the current inode having been orphanized or not. A test case for fstests follows soon. Thanks to Robbie Ko for providing a reproducer for this problem. Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-03Btrfs: incremental send, don't delay directory renames unnecessarilyFilipe Manana1-2/+46
Even though we delay the rename of directories when they become descendents of other directories that were also renamed in the send root to prevent infinite path build loops, we were doing it in cases where this was not needed and was actually harmful resulting in infinite path build loops as we ended up with a circular dependency of delayed directory renames. Consider the following reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ mkdir /mnt/data $ mkdir /mnt/data/n1 $ mkdir /mnt/data/n1/n2 $ mkdir /mnt/data/n4 $ mkdir /mnt/data/n1/n2/p1 $ mkdir /mnt/data/n1/n2/p1/p2 $ mkdir /mnt/data/t6 $ mkdir /mnt/data/t7 $ mkdir -p /mnt/data/t5/t7 $ mkdir /mnt/data/t2 $ mkdir /mnt/data/t4 $ mkdir -p /mnt/data/t1/t3 $ mkdir /mnt/data/p1 $ mv /mnt/data/t1 /mnt/data/p1 $ mkdir -p /mnt/data/p1/p2 $ mv /mnt/data/t4 /mnt/data/p1/p2/t1 $ mv /mnt/data/t5 /mnt/data/n4/t5 $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/n4/t5/p2 $ mv /mnt/data/t7 /mnt/data/n4/t5/p2/t7 $ mv /mnt/data/t2 /mnt/data/n4/t1 $ mv /mnt/data/p1 /mnt/data/n4/t5/p2/p1 $ mv /mnt/data/n1/n2 /mnt/data/n4/t5/p2/p1/p2/n2 $ mv /mnt/data/n4/t5/p2/p1/p2/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1 $ mv /mnt/data/n4/t5/t7 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7 $ mv /mnt/data/n4/t5/p2/p1/t1/t3 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3 $ mv /mnt/data/n4/t5/p2/p1/p2/n2/p1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1 $ mv /mnt/data/t6 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t5 $ mv /mnt/data/n4/t5/p2/p1/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t1 $ mv /mnt/data/n1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/n1 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/n4/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/t1 $ mv /mnt/data/n4/t5/p2/p1/p2/n2/t1 /mnt/data/n4/ $ mv /mnt/data/n4/t5/p2/p1/p2/n2 /mnt/data/n4/t1/n2 $ mv /mnt/data/n4/t1/t7/p1 /mnt/data/n4/t1/n2/p1 $ mv /mnt/data/n4/t1/t3/t1 /mnt/data/n4/t1/n2/t1 $ mv /mnt/data/n4/t1/t3 /mnt/data/n4/t1/n2/t1/t3 $ mv /mnt/data/n4/t5/p2/p1/p2 /mnt/data/n4/t1/n2/p1/p2 $ mv /mnt/data/n4/t1/t7 /mnt/data/n4/t1/n2/p1/t7 $ mv /mnt/data/n4/t5/p2/p1 /mnt/data/n4/t1/n2/p1/p2/p1 $ mv /mnt/data/n4/t1/n2/t1/t3/t5 /mnt/data/n4/t1/n2/p1/p2/t5 $ mv /mnt/data/n4/t5 /mnt/data/n4/t1/n2/p1/p2/p1/t5 $ mv /mnt/data/n4/t1/n2/p1/p2/p1/t5/p2 /mnt/data/n4/t1/n2/p1/p2/p1/p2 $ mv /mnt/data/n4/t1/n2/p1/p2/p1/p2/t7 /mnt/data/n4/t1/t7 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive -vv /mnt2 ERROR: send ioctl failed with -12: Cannot allocate memory This reproducer resulted in an infinite path build loop when building the path for inode 266 because the following circular dependency of delayed directory renames was created: ino 272 <- ino 261 <- ino 259 <- ino 268 <- ino 267 <- ino 261 Where the notation "X <- Y" means the rename of inode X is delayed by the rename of inode Y (X will be renamed after Y is renamed). This resulted in an infinite path build loop of inode 266 because that inode has inode 261 as an ancestor in the send root and inode 261 is in the circular dependency of delayed renames listed above. Fix this by not delaying the rename of a directory inode if an ancestor of the inode in the send root, which has a delayed rename operation, is not also a descendent of the inode in the parent root. Thanks to Robbie Ko for sending the reproducer example. A test case for xfstests follows soon. Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-06-01Merge branch 'for-linus' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fix from Al Viro: "Off-by-one in d_walk()/__dentry_kill() race fix. It's very hard to hit; possible in the same conditions as the original bug, except that you need the skipped branch to contain all the remaining evictables, so that the d_walk()-calling loop in d_invalidate() decides there's nothing more to do and doesn't go for another pass - otherwise that next pass will sweep the sucker. So it's not too urgent, but seeing that the fix is obvious and the original commit has spread into all -stable branches..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: d_walk() might skip too much
2015-05-30Merge tag 'xfs-for-linus-4.1-rc6' of ↵Linus Torvalds8-79/+112
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "This is a little larger than I'd like late in the release cycle, but all the fixes are for regressions introduced in the 4.1-rc1 merge, or are needed back in -stable kernels fairly quickly as they are filesystem corruption or userspace visible correctness issues. Changes in this update: - regression fix for new rename whiteout code - regression fixes for new superblock generic per-cpu counter code - fix for incorrect error return sign introduced in 3.17 - metadata corruption fixes that need to go back to -stable kernels" * tag 'xfs-for-linus-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: fix broken i_nlink accounting for whiteout tmpfile inode xfs: xfs_iozero can return positive errno xfs: xfs_attr_inactive leaves inconsistent attr fork state behind xfs: extent size hints can round up extents past MAXEXTLEN xfs: inode and free block counters need to use __percpu_counter_compare percpu_counter: batch size aware __percpu_counter_compare() xfs: use percpu_counter_read_positive for mp->m_icount
2015-05-29d_walk() might skip too muchAl Viro1-4/+4
when we find that a child has died while we'd been trying to ascend, we should go into the first live sibling itself, rather than its sibling. Off-by-one in question had been introduced in "deal with deadlock in d_walk()" and the fix needs to be backported to all branches this one has been backported to. Cc: stable@vger.kernel.org # 3.2 and later Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-29omfs: fix potential integer overflow in allocatorBob Copeland1-1/+1
Both 'i' and 'bits_per_entry' are signed integers but the result is a u64 block number. Cast i to u64 to avoid truncation on 32-bit targets. Found by Coverity (CID 200679). Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-29omfs: fix sign confusion for bitmap loop counterBob Copeland1-1/+2
The count variable is used to iterate down to (below) zero from the size of the bitmap and handle the one-filling the remainder of the last partial bitmap block. The loop conditional expects count to be signed in order to detect when the final block is processed, after which count goes negative. Unfortunately, a recent change made this unsigned along with some other related fields. The result of is this is that during mount, omfs_get_imap will overrun the bitmap array and corrupt memory unless number of blocks happens to be a multiple of 8 * blocksize. Fix by changing count back to signed: it is guaranteed to fit in an s32 without overflow due to an enforced limit on the number of blocks in the filesystem. Signed-off-by: Bob Copeland <me@bobcopeland.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-29omfs: set error return when d_make_root() failsBob Copeland1-1/+3
A static checker found the following issue in the error path for omfs_fill_super: fs/omfs/inode.c:552 omfs_fill_super() warn: missing error code here? 'd_make_root()' failed. 'ret' = '0' Fix by returning -ENOMEM in this case. Signed-off-by: Bob Copeland <me@bobcopeland.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-29fs, omfs: add NULL terminator in the end up the token listSasha Levin1-1/+2
match_token() expects a NULL terminator at the end of the token list so that it would know where to stop. Not having one causes it to overrun to invalid memory. In practice, passing a mount option that omfs didn't recognize would sometimes panic the system. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Bob Copeland <me@bobcopeland.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-29fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappingsAndrew Morton1-1/+1
load_elf_binary() returns `retval', not `error'. Fixes: a87938b2e246b81b4fb ("fs/binfmt_elf.c: fix bug in loading of PIE binaries") Reported-by: James Hogan <james.hogan@imgtec.com> Cc: Michael Davidson <md@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-29xfs: fix broken i_nlink accounting for whiteout tmpfile inodeBrian Foster1-2/+8
XFS uses the internal tmpfile() infrastructure for the whiteout inode used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates the inode, drops di_nlink, adds the inode to the agi unlinked list, calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode, and then finishes the common inode setup (e.g., clear I_NEW and unlock). The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was pulled up out of that function as part of the following commit to resolve a deadlock issue: 330033d6 xfs: fix tmpfile/selinux deadlock and initialize security As a result, callers of xfs_create_tmpfile() are responsible for either calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout tmpfile allocation helper does neither. As a result, the vfs ->i_nlink becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links it back into the source dentry and calls xfs_bumplink(). Update the assert in xfs_rename() to help detect this problem in the future and update xfs_rename_alloc_whiteout() to decrement the link count as part of the manual tmpfile inode setup. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: xfs_iozero can return positive errnoDave Chinner1-1/+1
It was missed when we converted everything in XFs to use negative error numbers, so fix it now. Bug introduced in 3.17 by commit 2451337 ("xfs: global error sign conversion"), and should go back to stable kernels. Thanks to Brian Foster for noticing it. cc: <stable@vger.kernel.org> # 3.17, 3.18, 3.19, 4.0 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: xfs_attr_inactive leaves inconsistent attr fork state behindDave Chinner4-47/+58
xfs_attr_inactive() is supposed to clean up the attribute fork when the inode is being freed. While it removes attribute fork extents, it completely ignores attributes in local format, which means that there can still be active attributes on the inode after xfs_attr_inactive() has run. This leads to problems with concurrent inode writeback - the in-core inode attribute fork is removed without locking on the assumption that nothing will be attempting to access the attribute fork after a call to xfs_attr_inactive() because it isn't supposed to exist on disk any more. To fix this, make xfs_attr_inactive() completely remove all traces of the attribute fork from the inode, regardless of it's state. Further, also remove the in-core attribute fork structure safely so that there is nothing further that needs to be done by callers to clean up the attribute fork. This means we can remove the in-core and on-disk attribute forks atomically. Also, on error simply remove the in-memory attribute fork. There's nothing that can be done with it once we have failed to remove the on-disk attribute fork, so we may as well just blow it away here anyway. cc: <stable@vger.kernel.org> # 3.12 to 4.0 Reported-by: Waiman Long <waiman.long@hp.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: extent size hints can round up extents past MAXEXTLENDave Chinner1-12/+19
This results in BMBT corruption, as seen by this test: # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc .... # mount /dev/vdc /mnt/scratch # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo which results in this failure on a debug kernel: XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211 .... Call Trace: [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100 [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20 [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120 [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0 [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170 [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400 [<ffffffff81ddcc29>] ? down_write+0x29/0x60 [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310 [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120 [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570 [<ffffffff811cd680>] vfs_fallocate+0x140/0x260 [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80 [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17 The tracepoint that indicates the extent that triggered the assert failure is: xfs_iext_insert: idx 0 offset 0 block 16777224 count 2097152 flag 1 Clearly indicating that the extent length is greater than MAXEXTLEN, which is 2097151. A prior trace point shows the allocation was an exact size match and that a length greater than MAXEXTLEN was asked for: xfs_alloc_size_done: agno 1 agbno 8 minlen 2097152 maxlen 2097152 ^^^^^^^ ^^^^^^^ We don't see this problem with extent size hints through the IO path because we can't do single IOs large enough to trigger MAXEXTLEN allocation. fallocate(), OTOH, is not limited in it's allocation sizes and so needs help here. The issue is that the extent size hint alignment is rounding up the extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking into account extent size hints when calculating the maximum extent length to allocate. xfs_bmapi_reserve_delalloc() is already doing this, but direct extent allocation is not. Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is wrong, and it works only because delayed allocation extents are not limited in size to MAXEXTLEN in the in-core extent tree. hence this calculation does not work for direct allocation, and the delalloc code needs fixing. This may, in fact be the underlying bug that occassionally causes transaction overruns in delayed allocation extent conversion, so now we know it's wrong we should fix it, too. Many thanks to Brian Foster for finding this problem during review of this patch. Hence the fix, after much code reading, is to allow xfs_bmap_extsize_align() to align partial extents when full alignment would extend the alignment past MAXEXTLEN. We can safely do this because all callers have higher layer allocation loops that already handle short allocations, and so will simply run another allocation to cover the remainder of the requested allocation range that we ignored during alignment. The advantage of this approach is that it also removes the need for callers to do anything other than limit their requests to MAXEXTLEN - they don't really need to be aware of extent size hints at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: inode and free block counters need to use __percpu_counter_compareDave Chinner1-14/+20
Because the counters use a custom batch size, the comparison functions need to be aware of that batch size otherwise the comparison does not work correctly. This leads to ASSERT failures on generic/027 like this: XFS: Assertion failed: 0, file: fs/xfs/xfs_mount.c, line: 1099 ------------[ cut here ]------------ .... Call Trace: [<ffffffff81522a39>] xfs_mod_icount+0x99/0xc0 [<ffffffff815285cb>] xfs_trans_unreserve_and_mod_sb+0x28b/0x5b0 [<ffffffff8152f941>] xfs_log_commit_cil+0x321/0x580 [<ffffffff81528e17>] xfs_trans_commit+0xb7/0x260 [<ffffffff81503d4d>] xfs_bmap_finish+0xcd/0x1b0 [<ffffffff8151da41>] xfs_inactive_ifree+0x1e1/0x250 [<ffffffff8151dbe0>] xfs_inactive+0x130/0x200 [<ffffffff81523a21>] xfs_fs_evict_inode+0x91/0xf0 [<ffffffff811f3958>] evict+0xb8/0x190 [<ffffffff811f433b>] iput+0x18b/0x1f0 [<ffffffff811e8853>] do_unlinkat+0x1f3/0x320 [<ffffffff811d548a>] ? filp_close+0x5a/0x80 [<ffffffff811e999b>] SyS_unlinkat+0x1b/0x40 [<ffffffff81e0892e>] system_call_fastpath+0x12/0x71 This is a regression introduced by commit 501ab32 ("xfs: use generic percpu counters for inode counter"). This patch fixes the same problem for both the inode counter and the free block counter in the superblocks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: use percpu_counter_read_positive for mp->m_icountGeorge Wang1-3/+6
Function percpu_counter_read just return the current counter, which can be negative. This will cause the checking of "allocated inode counts <= m_maxicount" false positive. Use percpu_counter_read_positive can solve this problem, and be consistent with the purpose to introduce percpu mechanism to xfs. Signed-off-by: George Wang <xuw2015@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-28Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds13-74/+194
Pull cifs fixes from Steve French: "Back from SambaXP - now have 8 small CIFS bug fixes to merge" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Fix race condition on RFC1002_NEGATIVE_SESSION_RESPONSE Fix to convert SURROGATE PAIR cifs: potential missing check for posix_lock_file_wait Fix to check Unique id and FileType when client refer file directly. CIFS: remove an unneeded NULL check [cifs] fix null pointer check Fix that several functions handle incorrect value of mapchars cifs: Don't replace dentries for dfs mounts
2015-05-27Merge branch 'overlayfs-next' of ↵Linus Torvalds3-10/+36
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull two overlayfs fixes from Miklos Szeredi: "Overlayfs rmdir() failed to check for emptiness in one case; this was introduced in 4.0. The other bug was there since day one: failure to mount if upper fs is full, which bit some OpenWRT folks" * 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: mount read-only if workdir can't be created ovl: don't remove non-empty opaque directory
2015-05-27Btrfs: sysfs: don't fail seeding for the sake of sysfs kobject issueAnand Jain1-1/+1
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs: add support to add parent for fsidAnand Jain1-2/+2
To support seed sysfs layout and represent seed fsid under the sprout we need the facility to create fsid under the specified parent. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs: separate kobject and attribute creationAnand Jain2-14/+19
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs: btrfs_sysfs_remove_fsid() make it non staticAnand Jain2-1/+2
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs: make btrfs_sysfs_add_device() non staticAnand Jain1-0/+1
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs: make btrfs_sysfs_add_fsid() non staticAnand Jain2-1/+3
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs btrfs_kobj_rm_device() pass fs_devices instead of fs_infoAnand Jain4-10/+10
since btrfs_kobj_rm_device() does nothing with fs_info Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs btrfs_kobj_add_device() pass fs_devices instead of fs_infoAnand Jain4-7/+6
btrfs_kobj_add_device() does not need fs_info any more. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs: provide framework to remove all fsid sysfs kobjectAnand Jain1-1/+16
Just a helper function to clean up the sysfs fsid kobjects. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: sysfs: add pointer to access fs_info from fs_devicesAnand Jain3-0/+25
adds fs_info pointer with struct btrfs_fs_devices. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27Btrfs: introduce btrfs_get_fs_uuids to get fs_uuidsAnand Jain2-0/+5
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>