summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)AuthorFilesLines
2019-07-10Btrfs: fix corruption reading shared and compressed extents after hole punchingFilipe Manana1-2/+2
commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream. In the past we had data corruption when reading compressed extents that are shared within the same file and they are consecutive, this got fixed by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents"). However there was a case that was missing in those fixes, which is when the shared and compressed extents are referenced with a non-zero offset. The following shell script creates a reproducer for this issue: #!/bin/bash mkfs.btrfs -f /dev/sdc &> /dev/null mount -o compress /dev/sdc /mnt/sdc # Create a file with 3 consecutive compressed extents, each has an # uncompressed size of 128Kb and a compressed size of 4Kb. for ((i = 1; i <= 3; i++)); do head -c 4096 /dev/zero for ((j = 1; j <= 31; j++)); do head -c 4096 /dev/zero | tr '\0' "\377" done done > /mnt/sdc/foobar sync echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)" # Clone the first extent into offsets 128K and 256K. xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar sync echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)" # Punch holes into the regions that are already full of zeroes. xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar sync echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" echo "Dropping page cache..." sysctl -q vm.drop_caches=1 echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" umount /dev/sdc When running the script we get the following output: Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar linked 131072/131072 bytes at offset 131072 128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec) linked 131072/131072 bytes at offset 262144 128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec) Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Dropping page cache... Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar This happens because after reading all the pages of the extent in the range from 128K to 256K for example, we read the hole at offset 256K and then when reading the page at offset 260K we don't submit the existing bio, which is responsible for filling all the page in the range 128K to 256K only, therefore adding the pages from range 260K to 384K to the existing bio and submitting it after iterating over the entire range. Once the bio completes, the uncompressed data fills only the pages in the range 128K to 256K because there's no more data read from disk, leaving the pages in the range 260K to 384K unfilled. It is just a slightly different variant of what was solved by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents"). Fix this by forcing a bio submit, during readpages(), whenever we find a compressed extent map for a page that is different from the extent map for the previous page or has a different starting offset (in case it's the same compressed extent), instead of the extent map's original start offset. A test case for fstests follows soon. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents") Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-10btrfs: init csum_list before possible freeDan Robertson1-1/+1
commit e49be14b8d80e23bb7c53d78c21717a474ade76b upstream. The scrub_ctx csum_list member must be initialized before scrub_free_ctx is called. If the csum_list is not initialized beforehand, the list_empty call in scrub_free_csums will result in a null deref if the allocation fails in the for loop. Fixes: a2de733c78fa ("btrfs: scrub") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Dan Robertson <dan@dlrobertson.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-04-04Btrfs: fix fsync of files with multiple hard links in new directoriesFilipe Manana3-0/+39
commit 41bd60676923822de1df2c50b3f9a10171f4338a upstream. The log tree has a long standing problem that when a file is fsync'ed we only check for new ancestors, created in the current transaction, by following only the hard link for which the fsync was issued. We follow the ancestors using the VFS' dget_parent() API. This means that if we create a new link for a file in a directory that is new (or in an any other new ancestor directory) and then fsync the file using an old hard link, we end up not logging the new ancestor, and on log replay that new hard link and ancestor do not exist. In some cases, involving renames, the file will not exist at all. Example: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt mkdir /mnt/A touch /mnt/foo ln /mnt/foo /mnt/A/bar xfs_io -c fsync /mnt/foo <power failure> In this example after log replay only the hard link named 'foo' exists and directory A does not exist, which is unexpected. In other major linux filesystems, such as ext4, xfs and f2fs for example, both hard links exist and so does directory A after mounting again the filesystem. Checking if any new ancestors are new and need to be logged was added in 2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"), however only for the ancestors of the hard link (dentry) for which the fsync was issued, instead of checking for all ancestors for all of the inode's hard links. So fix this by tracking the id of the last transaction where a hard link was created for an inode and then on fsync fallback to a full transaction commit when an inode has more than one hard link and at least one new hard link was created in the current transaction. This is the simplest solution since this is not a common use case (adding frequently hard links for which there's an ancestor created in the current transaction and then fsync the file). In case it ever becomes a common use case, a solution that consists of iterating the fs/subvol btree for each hard link and check if any ancestor is new, could be implemented. This solves many unexpected scenarios reported by Jayashree Mohan and Vijay Chidambaram, and for which there is a new test case for fstests under review. Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") Reported-by: Vijay Chidambaram <vvijay03@gmail.com> Reported-by: Jayashree Mohan <jayashree2912@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - In btrfs_log_inode_parent(), inode is a struct inode pointer not a struct btrfs_inode pointer - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-04-04Btrfs: fix stale dir entries after unlink, inode eviction and fsyncFilipe Manana1-0/+29
commit bde6c242027b0f1d697d5333950b3a05761d40e4 upstream. If we remove a hard link from an inode, the inode gets evicted, then we fsync the inode and then power fail/crash, when the log tree is replayed, the parent directory inode still has entries pointing to the name that no longer exists, while our inode no longer has the BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected), leaving the filesystem in an inconsistent state. The stale directory entries can not be deleted (an attempt to delete them causes -ESTALE errors), which makes it impossible to delete the parent directory. This happens because we track the id of the transaction where the last unlink operation for the inode happened (last_unlink_trans) in an in-memory only field of the inode, that is, a value that is never persisted in the inode item stored on the fs/subvol btree. So if an inode is evicted and loaded again, the value for last_unlink_trans is set to 0, which prevents the fsync from logging the parent directory at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans to the id of the transaction that last modified the inode when we load the inode. This is a pessimistic approach but it always ensures correctness with the trade off of ocassional full transaction commits when an fsync is done against the inode in the same transaction where it was evicted and reloaded when our inode is a directory and often logging its parent unnecessarily when our inode is not a directory. The following test case for fstests triggers the problem: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_dm_flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file with 2 hard links. mkdir $SCRATCH_MNT/testdir touch $SCRATCH_MNT/testdir/foo ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar # Make sure everything done so far is durably persisted. sync # Now remove one of the links, trigger inode eviction and then fsync # our inode. unlink $SCRATCH_MNT/testdir/bar echo 2 > /proc/sys/vm/drop_caches $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo # Silently drop all writes on our scratch device to simulate a power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again and mount the fs to trigger log/journal replay. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now verify our directory entries. echo "Entries in testdir:" ls -1 $SCRATCH_MNT/testdir # If we remove our inode, its parent should become empty and therefore we should # be able to remove the parent. rm -f $SCRATCH_MNT/testdir/* rmdir $SCRATCH_MNT/testdir _unmount_flakey # The fstests framework will call fsck against our filesystem which will verify # that all metadata is in a consistent state. status=0 exit The test failed on btrfs with: generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad) > --- tests/generic/098.out 2015-07-23 18:01:12.616175932 +0100 > +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad 2015-07-23 18:04:58.924138308 +0100 @@ -1,3 +1,6 @@ QA output created by 098 Entries in testdir: +bar foo +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty ... (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad' to see the entire diff) _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full) $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full (...) checking fs roots root 5 inode 258 errors 2001, no inode item, link count wrong unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref Checking filesystem on /dev/sdc (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-04-04Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.Yang Dongsheng2-9/+14
commit 6e17d30bfaf43e04d991392d8484f1c556810c33 upstream. We need to fill inode when we found a node for it in delayed_nodes_tree. But we did not fill the ->last_trans currently, it will cause the test of xfstest/generic/311 fail. Scenario of the 311 is shown as below: Problem: (1). test_fd = open(fname, O_RDWR|O_DIRECT) (2). pwrite(test_fd, buf, 4096, 0) (3). close(test_fd) (4). drop_all_caches() <-------- "echo 3 > /proc/sys/vm/drop_caches" (5). test_fd = open(fname, O_RDWR|O_DIRECT) (6). fsync(test_fd); <-------- we did not get the correct log entry for the file Reason: When we re-open this file in (5), we would find a node in delayed_nodes_tree and fill the inode we are lookup with the information. But the ->last_trans is not filled, then the fsync() will check the ->last_trans and found it's 0 then say this inode is already in our tree which is commited, not recording the extents for it. Fix: This patch fill the ->last_trans properly and set the runtime_flags if needed in this situation. Then we can get the log entries we expected after (6) and generic/311 passed. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-04-04btrfs: dev-replace: go back to suspended state if target device is missingAnand Jain1-0/+2
commit 0d228ece59a35a9b9e8ff0d40653234a6d90f61e upstream. At the time of forced unmount we place the running replace to BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes back and expect the target device is missing. Then let the replace state continue to be in BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub running as part of replace. Fixes: e93c89c1aaaa ("Btrfs: add new sources for device replace code") Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11btrfs: relocation: set trans to be NULL after ending transactionPan Bian1-0/+1
commit 42a657f57628402c73237547f0134e083e2f6764 upstream. The function relocate_block_group calls btrfs_end_transaction to release trans when update_backref_cache returns 1, and then continues the loop body. If btrfs_block_rsv_refill fails this time, it will jump out the loop and the freed trans will be accessed. This may result in a use-after-free bug. The patch assigns NULL to trans after trans is released so that it will not be accessed. Fixes: 0647bf564f1 ("Btrfs: improve forever loop when doing balance relocation") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11Btrfs: fix race between enabling quotas and subvolume creationFilipe Manana1-1/+2
commit 552f0329c75b3e1d7f9bb8c9e421d37403f192cd upstream. We have a race between enabling quotas end subvolume creation that cause subvolume creation to fail with -EINVAL, and the following diagram shows how it happens: CPU 0 CPU 1 btrfs_ioctl() btrfs_ioctl_quota_ctl() btrfs_quota_enable() mutex_lock(fs_info->qgroup_ioctl_lock) btrfs_ioctl() create_subvol() btrfs_qgroup_inherit() -> save fs_info->quota_root into quota_root -> stores a NULL value -> tries to lock the mutex qgroup_ioctl_lock -> blocks waiting for the task at CPU0 -> sets BTRFS_FS_QUOTA_ENABLED in fs_info -> sets quota_root in fs_info->quota_root (non-NULL value) mutex_unlock(fs_info->qgroup_ioctl_lock) -> checks quota enabled flag is set -> returns -EINVAL because fs_info->quota_root was NULL before it acquired the mutex qgroup_ioctl_lock -> ioctl returns -EINVAL Returning -EINVAL to user space will be confusing if all the arguments passed to the subvolume creation ioctl were valid. Fix it by grabbing the value from fs_info->quota_root after acquiring the mutex. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11Btrfs: ensure path name is null terminated at btrfs_control_ioctlFilipe Manana1-0/+1
commit f505754fd6599230371cb01b9332754ddc104be1 upstream. We were using the path name received from user space without checking that it is null terminated. While btrfs-progs is well behaved and does proper validation and null termination, someone could call the ioctl and pass a non-null terminated patch, leading to buffer overrun problems in the kernel. The ioctl is protected by CAP_SYS_ADMIN. So just set the last byte of the path to a null character, similar to what we do in other ioctls (add/remove/resize device, snapshot creation, etc). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11btrfs: Always try all copies when reading extent buffersNikolay Borisov1-9/+1
commit f8397d69daef06d358430d3054662fb597e37c00 upstream. When a metadata read is served the endio routine btree_readpage_end_io_hook is called which eventually runs the tree-checker. If tree-checker fails to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This leads to btree_read_extent_buffer_pages wrongly assuming that all available copies of this extent buffer are wrong and failing prematurely. Fix this modify btree_read_extent_buffer_pages to read all copies of the data. This failure was exhibitted in xfstests btrfs/124 which would spuriously fail its balance operations. The reason was that when balance was run following re-introduction of the missing raid1 disk __btrfs_map_block would map the read request to stripe 0, which corresponded to devid 2 (the disk which is being removed in the test): item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112 length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1 io_align 65536 io_width 65536 sector_size 4096 num_stripes 2 sub_stripes 1 stripe 0 devid 2 offset 2156920832 dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8 stripe 1 devid 1 offset 3553624064 dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca This caused read requests for a checksum item that to be routed to the stale disk which triggered the aforementioned logic involving EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of the balance operation. Fixes: a826d6dcb32d ("Btrfs: check items for correctness as we search") Suggested-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - Deleted code is slightly different - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11Btrfs: fix data corruption due to cloning of eof blockFilipe Manana1-2/+10
commit ac765f83f1397646c11092a032d4f62c3d478b81 upstream. We currently allow cloning a range from a file which includes the last block of the file even if the file's size is not aligned to the block size. This is fine and useful when the destination file has the same size, but when it does not and the range ends somewhere in the middle of the destination file, it leads to corruption because the bytes between the EOF and the end of the block have undefined data (when there is support for discard/trimming they have a value of 0x00). Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ export foo_size=$((256 * 1024 + 100)) $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar $ od -A d -t x1 /mnt/bar 0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 * 0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c * 0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00 0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 * 1048576 The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527 (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead of 0xb5. This is similar to the problem we had for deduplication that got recently fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when deduplicating between different files"). Fix this by not allowing such operations to be performed and return the errno -EINVAL to user space. This is what XFS is doing as well at the VFS level. This change however now makes us return -EINVAL instead of -EOPNOTSUPP for cases where the source range maps to an inline extent and the destination range's end is smaller then the destination file's size, since the detection of inline extents is done during the actual process of dropping file extent items (at __btrfs_drop_extents()). Returning the -EINVAL error is done early on and solely based on the input parameters (offsets and length) and destination file's size. This makes us consistent with XFS and anyone else supporting cloning since this case is now checked at a higher level in the VFS and is where the -EINVAL will be returned from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1 by commit 07d19dc9fbe9 ("vfs: avoid problematic remapping requests into partial EOF block"). So this change is more geared towards stable kernels, as it's unlikely the new VFS checks get removed intentionally. A test case for fstests follows soon, as well as an update to filter existing tests that expect -EOPNOTSUPP to accept -EINVAL as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11Btrfs: fix cur_offset in the error case for nocowRobbie Ko1-3/+2
commit 506481b20e818db40b6198815904ecd2d6daee64 upstream. When the cow_file_range fails, the related resources are unlocked according to the range [start..end), so the unlock cannot be repeated in run_delalloc_nocow. In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is not updated correctly, so move the cur_offset update before cow_file_range. kernel BUG at mm/page-writeback.c:2663! Internal error: Oops - BUG: 0 [#1] SMP CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O Hardware name: Realtek_RTD1296 (DT) Workqueue: writeback wb_workfn (flush-btrfs-1) task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000 PC is at clear_page_dirty_for_io+0x1bc/0x1e8 LR is at clear_page_dirty_for_io+0x14/0x1e8 pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145 sp : ffffffc02e9af4f0 Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020) Call trace: [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8 [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs] [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs] [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs] [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs] [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs] [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs] [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs] [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs] [<ffffffc00033d758>] do_writepages+0x30/0x88 [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198 [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0 [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0 [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0 [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250 [<ffffffc0002b0014>] process_one_work+0x1dc/0x388 [<ffffffc0002b02f0>] worker_thread+0x130/0x500 [<ffffffc0002b6344>] kthread+0x10c/0x110 [<ffffffc000284590>] ret_from_fork+0x10/0x40 Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000) Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11Btrfs: fix use-after-free when dumping free spaceFilipe Manana1-0/+2
commit 9084cb6a24bf5838a665af92ded1af8363f9e563 upstream. We were iterating a block group's free space cache rbtree without locking first the lock that protects it (the free_space_ctl->free_space_offset rbtree is protected by the free_space_ctl->tree_lock spinlock). KASAN reported an use-after-free problem when iterating such a rbtree due to a concurrent rbtree delete: [ 9520.359168] ================================================================== [ 9520.359656] BUG: KASAN: use-after-free in rb_next+0x13/0x90 [ 9520.359949] Read of size 8 at addr ffff8800b7ada500 by task btrfs-transacti/1721 [ 9520.360357] [ 9520.360530] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G L 4.19.0-rc8-nbor #555 [ 9520.360990] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 9520.362682] Call Trace: [ 9520.362887] dump_stack+0xa4/0xf5 [ 9520.363146] print_address_description+0x78/0x280 [ 9520.363412] kasan_report+0x263/0x390 [ 9520.363650] ? rb_next+0x13/0x90 [ 9520.363873] __asan_load8+0x54/0x90 [ 9520.364102] rb_next+0x13/0x90 [ 9520.364380] btrfs_dump_free_space+0x146/0x160 [btrfs] [ 9520.364697] dump_space_info+0x2cd/0x310 [btrfs] [ 9520.364997] btrfs_reserve_extent+0x1ee/0x1f0 [btrfs] [ 9520.365310] __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs] [ 9520.365646] ? btrfs_update_time+0x180/0x180 [btrfs] [ 9520.365923] ? _raw_spin_unlock+0x27/0x40 [ 9520.366204] ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs] [ 9520.366549] btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs] [ 9520.366880] cache_save_setup+0x42e/0x580 [btrfs] [ 9520.367220] ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs] [ 9520.367518] ? lock_downgrade+0x2f0/0x2f0 [ 9520.367799] ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs] [ 9520.368104] ? kasan_check_read+0x11/0x20 [ 9520.368349] ? do_raw_spin_unlock+0xa8/0x140 [ 9520.368638] btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs] [ 9520.368978] ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs] [ 9520.369282] ? do_raw_spin_unlock+0xa8/0x140 [ 9520.369534] ? _raw_spin_unlock+0x27/0x40 [ 9520.369811] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs] [ 9520.370137] commit_cowonly_roots+0x4b9/0x610 [btrfs] [ 9520.370560] ? commit_fs_roots+0x350/0x350 [btrfs] [ 9520.370926] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs] [ 9520.371285] btrfs_commit_transaction+0x5e5/0x10e0 [btrfs] [ 9520.371612] ? btrfs_apply_pending_changes+0x90/0x90 [btrfs] [ 9520.371943] ? start_transaction+0x168/0x6c0 [btrfs] [ 9520.372257] transaction_kthread+0x21c/0x240 [btrfs] [ 9520.372537] kthread+0x1d2/0x1f0 [ 9520.372793] ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs] [ 9520.373090] ? kthread_park+0xb0/0xb0 [ 9520.373329] ret_from_fork+0x3a/0x50 [ 9520.373567] [ 9520.373738] Allocated by task 1804: [ 9520.373974] kasan_kmalloc+0xff/0x180 [ 9520.374208] kasan_slab_alloc+0x11/0x20 [ 9520.374447] kmem_cache_alloc+0xfc/0x2d0 [ 9520.374731] __btrfs_add_free_space+0x40/0x580 [btrfs] [ 9520.375044] unpin_extent_range+0x4f7/0x7a0 [btrfs] [ 9520.375383] btrfs_finish_extent_commit+0x15f/0x4d0 [btrfs] [ 9520.375707] btrfs_commit_transaction+0xb06/0x10e0 [btrfs] [ 9520.376027] btrfs_alloc_data_chunk_ondemand+0x237/0x5c0 [btrfs] [ 9520.376365] btrfs_check_data_free_space+0x81/0xd0 [btrfs] [ 9520.376689] btrfs_delalloc_reserve_space+0x25/0x80 [btrfs] [ 9520.377018] btrfs_direct_IO+0x42e/0x6d0 [btrfs] [ 9520.377284] generic_file_direct_write+0x11e/0x220 [ 9520.377587] btrfs_file_write_iter+0x472/0xac0 [btrfs] [ 9520.377875] aio_write+0x25c/0x360 [ 9520.378106] io_submit_one+0xaa0/0xdc0 [ 9520.378343] __se_sys_io_submit+0xfa/0x2f0 [ 9520.378589] __x64_sys_io_submit+0x43/0x50 [ 9520.378840] do_syscall_64+0x7d/0x240 [ 9520.379081] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 9520.379387] [ 9520.379557] Freed by task 1802: [ 9520.379782] __kasan_slab_free+0x173/0x260 [ 9520.380028] kasan_slab_free+0xe/0x10 [ 9520.380262] kmem_cache_free+0xc1/0x2c0 [ 9520.380544] btrfs_find_space_for_alloc+0x4cd/0x4e0 [btrfs] [ 9520.380866] find_free_extent+0xa99/0x17e0 [btrfs] [ 9520.381166] btrfs_reserve_extent+0xd5/0x1f0 [btrfs] [ 9520.381474] btrfs_get_blocks_direct+0x60b/0xbd0 [btrfs] [ 9520.381761] __blockdev_direct_IO+0x10ee/0x58a1 [ 9520.382059] btrfs_direct_IO+0x25a/0x6d0 [btrfs] [ 9520.382321] generic_file_direct_write+0x11e/0x220 [ 9520.382623] btrfs_file_write_iter+0x472/0xac0 [btrfs] [ 9520.382904] aio_write+0x25c/0x360 [ 9520.383172] io_submit_one+0xaa0/0xdc0 [ 9520.383416] __se_sys_io_submit+0xfa/0x2f0 [ 9520.383678] __x64_sys_io_submit+0x43/0x50 [ 9520.383927] do_syscall_64+0x7d/0x240 [ 9520.384165] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 9520.384439] [ 9520.384610] The buggy address belongs to the object at ffff8800b7ada500 which belongs to the cache btrfs_free_space of size 72 [ 9520.385175] The buggy address is located 0 bytes inside of 72-byte region [ffff8800b7ada500, ffff8800b7ada548) [ 9520.385691] The buggy address belongs to the page: [ 9520.385957] page:ffffea0002deb680 count:1 mapcount:0 mapping:ffff880108a1d700 index:0x0 compound_mapcount: 0 [ 9520.388030] flags: 0x8100(slab|head) [ 9520.388281] raw: 0000000000008100 ffffea0002deb608 ffffea0002728808 ffff880108a1d700 [ 9520.388722] raw: 0000000000000000 0000000000130013 00000001ffffffff 0000000000000000 [ 9520.389169] page dumped because: kasan: bad access detected [ 9520.389473] [ 9520.389658] Memory state around the buggy address: [ 9520.389943] ffff8800b7ada400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 9520.390368] ffff8800b7ada480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 9520.390796] >ffff8800b7ada500: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc [ 9520.391223] ^ [ 9520.391461] ffff8800b7ada580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 9520.391885] ffff8800b7ada600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 9520.392313] ================================================================== [ 9520.392772] BTRFS critical (device vdc): entry offset 2258497536, bytes 131072, bitmap no [ 9520.393247] BUG: unable to handle kernel NULL pointer dereference at 0000000000000011 [ 9520.393705] PGD 800000010dbab067 P4D 800000010dbab067 PUD 107551067 PMD 0 [ 9520.394059] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 9520.394378] CPU: 4 PID: 1721 Comm: btrfs-transacti Tainted: G B L 4.19.0-rc8-nbor #555 [ 9520.394858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 9520.395350] RIP: 0010:rb_next+0x3c/0x90 [ 9520.396461] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292 [ 9520.396762] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c [ 9520.397115] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011 [ 9520.397468] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc [ 9520.397821] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000 [ 9520.398188] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000 [ 9520.398555] FS: 0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000 [ 9520.399007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9520.399335] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0 [ 9520.399679] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9520.400023] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9520.400400] Call Trace: [ 9520.400648] btrfs_dump_free_space+0x146/0x160 [btrfs] [ 9520.400974] dump_space_info+0x2cd/0x310 [btrfs] [ 9520.401287] btrfs_reserve_extent+0x1ee/0x1f0 [btrfs] [ 9520.401609] __btrfs_prealloc_file_range+0x1cc/0x620 [btrfs] [ 9520.401952] ? btrfs_update_time+0x180/0x180 [btrfs] [ 9520.402232] ? _raw_spin_unlock+0x27/0x40 [ 9520.402522] ? btrfs_alloc_data_chunk_ondemand+0x2c0/0x5c0 [btrfs] [ 9520.402882] btrfs_prealloc_file_range_trans+0x23/0x30 [btrfs] [ 9520.403261] cache_save_setup+0x42e/0x580 [btrfs] [ 9520.403570] ? btrfs_check_data_free_space+0xd0/0xd0 [btrfs] [ 9520.403871] ? lock_downgrade+0x2f0/0x2f0 [ 9520.404161] ? btrfs_write_dirty_block_groups+0x11f/0x6e0 [btrfs] [ 9520.404481] ? kasan_check_read+0x11/0x20 [ 9520.404732] ? do_raw_spin_unlock+0xa8/0x140 [ 9520.405026] btrfs_write_dirty_block_groups+0x2af/0x6e0 [btrfs] [ 9520.405375] ? btrfs_start_dirty_block_groups+0x870/0x870 [btrfs] [ 9520.405694] ? do_raw_spin_unlock+0xa8/0x140 [ 9520.405958] ? _raw_spin_unlock+0x27/0x40 [ 9520.406243] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs] [ 9520.406574] commit_cowonly_roots+0x4b9/0x610 [btrfs] [ 9520.406899] ? commit_fs_roots+0x350/0x350 [btrfs] [ 9520.407253] ? btrfs_run_delayed_refs+0x1b8/0x230 [btrfs] [ 9520.407589] btrfs_commit_transaction+0x5e5/0x10e0 [btrfs] [ 9520.407925] ? btrfs_apply_pending_changes+0x90/0x90 [btrfs] [ 9520.408262] ? start_transaction+0x168/0x6c0 [btrfs] [ 9520.408582] transaction_kthread+0x21c/0x240 [btrfs] [ 9520.408870] kthread+0x1d2/0x1f0 [ 9520.409138] ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs] [ 9520.409440] ? kthread_park+0xb0/0xb0 [ 9520.409682] ret_from_fork+0x3a/0x50 [ 9520.410508] Dumping ftrace buffer: [ 9520.410764] (ftrace buffer empty) [ 9520.411007] CR2: 0000000000000011 [ 9520.411297] ---[ end trace 01a0863445cf360a ]--- [ 9520.411568] RIP: 0010:rb_next+0x3c/0x90 [ 9520.412644] RSP: 0018:ffff8801074ff780 EFLAGS: 00010292 [ 9520.412932] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff81b5ac4c [ 9520.413274] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000011 [ 9520.413616] RBP: ffff8801074ff7a0 R08: ffffed0021d64ccc R09: ffffed0021d64ccc [ 9520.414007] R10: 0000000000000001 R11: ffffed0021d64ccb R12: ffff8800b91e0000 [ 9520.414349] R13: ffff8800a3ceba48 R14: ffff8800b627bf80 R15: 0000000000020000 [ 9520.416074] FS: 0000000000000000(0000) GS:ffff88010eb00000(0000) knlGS:0000000000000000 [ 9520.416536] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9520.416848] CR2: 0000000000000011 CR3: 0000000106b52000 CR4: 00000000000006a0 [ 9520.418477] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9520.418846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9520.419204] Kernel panic - not syncing: Fatal exception [ 9520.419666] Dumping ftrace buffer: [ 9520.419930] (ftrace buffer empty) [ 9520.420168] Kernel Offset: disabled [ 9520.420406] ---[ end Kernel panic - not syncing: Fatal exception ]--- Fix this by acquiring the respective lock before iterating the rbtree. Reported-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11Btrfs: fix null pointer dereference on compressed write path errorFilipe Manana1-0/+1
commit 3527a018c00e5dbada2f9d7ed5576437b6dd5cfb upstream. At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns 0 (success creating an inline extent) or an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11Btrfs: don't clean dirty pages during buffered writesChris Mason1-6/+24
commit 7703bdd8d23e6ef057af3253958a793ec6066b28 upstream. During buffered writes, we follow this basic series of steps: again: lock all the pages wait for writeback on all the pages Take the extent range lock wait for ordered extents on the whole range clean all the pages if (copy_from_user_in_atomic() hits a fault) { drop our locks goto again; } dirty all the pages release all the locks The extra waiting, cleaning and locking are there to make sure we don't modify pages in flight to the drive, after they've been crc'd. If some of the pages in the range were already dirty when the write began, and we need to goto again, we create a window where a dirty page has been cleaned and unlocked. It may be reclaimed before we're able to lock it again, which means we'll read the old contents off the drive and lose any modifications that had been pending writeback. We don't actually need to clean the pages. All of the other locking in place makes sure we don't start IO on the pages, so we can just leave them dirty for the duration of the write. Fixes: 73d59314e6ed (the original btrfs merge) Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - Keep passing additional argument of GFP_NOFS to clear_extent_bit() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11btrfs: wait on caching when putting the bg cacheJosef Bacik1-0/+1
commit 3aa7c7a31c26321696b92841d5103461c6f3f517 upstream. While testing my backport I noticed there was a panic if I ran generic/416 generic/417 generic/418 all in a row. This just happened to uncover a race where we had outstanding IO after we destroy all of our workqueues, and then we'd go to queue the endio work on those free'd workqueues. This is because we aren't waiting for the caching threads to be done before freeing everything up, so to fix this make sure we wait on any outstanding caching that's being done before we free up the block group, so we're sure to be done with all IO by the time we get to btrfs_stop_all_workers(). This fixes the panic I was seeing consistently in testing. ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:6112! SMP PTI Modules linked in: CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Workqueue: btrfs-cache btrfs_cache_helper RIP: 0010:btrfs_map_bio+0x346/0x370 RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000 RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000 RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000 R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00 FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btree_submit_bio_hook+0x8a/0xd0 submit_one_bio+0x5d/0x80 read_extent_buffer_pages+0x18a/0x320 btree_read_extent_buffer_pages+0xbc/0x200 ? alloc_extent_buffer+0x359/0x3e0 read_tree_block+0x3d/0x60 read_block_for_search.isra.30+0x1a5/0x360 btrfs_search_slot+0x41b/0xa10 btrfs_next_old_leaf+0x212/0x470 caching_thread+0x323/0x490 normal_work_helper+0xc5/0x310 process_one_work+0x141/0x340 worker_thread+0x44/0x3c0 kthread+0xf8/0x130 ? process_one_work+0x340/0x340 ? kthread_bind+0x10/0x10 ret_from_fork+0x35/0x40 RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0 ---[ end trace 827eb13e50846033 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-02-11btrfs: fix error handling in btrfs_dev_replace_startJeff Mahoney1-2/+5
commit 5c06147128fbbdf7a84232c5f0d808f53153defe upstream. When we fail to start a transaction in btrfs_dev_replace_start, we leave dev_replace->replace_start set to STARTED but clear ->srcdev and ->tgtdev. Later, that can result in an Oops in btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED implies that ->srcdev is valid. Also fix error handling when the state is already STARTED or SUSPENDED while starting. That, too, will clear ->srcdev and ->tgtdev even though it doesn't own them. This should be an impossible case to hit since we should be protected by the BTRFS_FS_EXCL_OP bit being set. Let's add an ASSERT there while we're at it. Fixes: e93c89c1aaaaa (Btrfs: add new sources for device replace code) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-12-17btrfs: rename total_bytes to avoid confusionAnand Jain1-5/+5
commit 3c1dbdf54a31f4f049a33214c3096595988786bf upstream. we are assigning number_devices to the total_bytes, that's very confusing for a moment Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-12-17Btrfs: fix btrfs_write_inode vs delayed iput deadlockJosef Bacik2-26/+0
commit 3c4276936f6fbe52884b4ea4e6cc120b890a0f9f upstream. We recently ran into the following deadlock involving btrfs_write_inode(): [ +0.005066] __schedule+0x38e/0x8c0 [ +0.007144] schedule+0x36/0x80 [ +0.006447] bit_wait+0x11/0x60 [ +0.006446] __wait_on_bit+0xbe/0x110 [ +0.007487] ? bit_wait_io+0x60/0x60 [ +0.007319] __inode_wait_for_writeback+0x96/0xc0 [ +0.009568] ? autoremove_wake_function+0x40/0x40 [ +0.009565] inode_wait_for_writeback+0x21/0x30 [ +0.009224] evict+0xb0/0x190 [ +0.006099] iput+0x1a8/0x210 [ +0.006103] btrfs_run_delayed_iputs+0x73/0xc0 [ +0.009047] btrfs_commit_transaction+0x799/0x8c0 [ +0.009567] btrfs_write_inode+0x81/0xb0 [ +0.008008] __writeback_single_inode+0x267/0x320 [ +0.009569] writeback_sb_inodes+0x25b/0x4e0 [ +0.008702] wb_writeback+0x102/0x2d0 [ +0.007487] wb_workfn+0xa4/0x310 [ +0.006794] ? wb_workfn+0xa4/0x310 [ +0.007143] process_one_work+0x150/0x410 [ +0.008179] worker_thread+0x6d/0x520 [ +0.007490] kthread+0x12c/0x160 [ +0.006620] ? put_pwq_unlocked+0x80/0x80 [ +0.008185] ? kthread_park+0xa0/0xa0 [ +0.007484] ? do_syscall_64+0x53/0x150 [ +0.007837] ret_from_fork+0x29/0x40 Writeback calls: btrfs_write_inode btrfs_commit_transaction btrfs_run_delayed_iputs If iput() is called on that same inode, evict() will wait for writeback forever. btrfs_write_inode() was originally added way back in 4730a4bc5bf3 ("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode() hasn't been used for O_SYNC since 148f948ba877 ("vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so btrfs_write_inode() is actually unnecessary (and leads to a bunch of unnecessary commits). Get rid of it, which also gets rid of the deadlock. Signed-off-by: Josef Bacik <jbacik@fb.com> [Omar: new commit message] Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: deleted function is slightly different] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-12-17btrfs: use correct compare function of dirty_metadata_bytesEthan Lien1-4/+6
commit d814a49198eafa6163698bdd93961302f3a877a4 upstream. We use customized, nodesize batch value to update dirty_metadata_bytes. We should also use batch version of compare function or we will easily goto fast path and get false result from percpu_counter_compare(). Fixes: e2d845211eda ("Btrfs: use percpu counter for dirty metadata count") Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: In __btrfs_btree_balance_dirty(), use root->fs_info instead of fs_info] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-11-20Btrfs: reserve space for O_TMPFILE orphan item deletionOmar Sandoval1-1/+2
commit 399b0bbf5f680797d3599fa14f16706ffc470145 upstream. btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but it doesn't reserve space to do so. Even before the removal of the orphan_block_rsv it wasn't using it. Fixes: ef3b9af50bfa ("Btrfs: implement inode_operations callback tmpfile") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-11-20Btrfs: don't return ino to ino cache if inode item removal failsOmar Sandoval1-12/+13
commit c08db7d8d295a4f3a10faaca376de011afff7950 upstream. In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode item will still be in the tree but we still return the ino to the ino cache. That will blow up later when someone tries to allocate that ino, so don't return it to the cache. Fixes: 581bb050941b ("Btrfs: Cache free inode numbers in memory") Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - Pass inode, not btrfs_inode, to btrfs_orphan_del() - Pass btrfs_root, not btrfs_fs_info, to btrfs_free_block_rsv() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-11-20Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()Omar Sandoval1-1/+4
commit 0552210997badb6a60740a26ff9d976a416510f0 upstream. btrfs_free_extent() can fail because of ENOMEM. There's no reason to panic here, we can just abort the transaction. Fixes: f4b9aa8d3b87 ("btrfs_truncate") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - Also pass root to btrfs_abort_transaction() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21Btrfs: send, fix invalid access to commit roots due to concurrent snapshottingRobbie Ko1-2/+14
commit 6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 upstream. [BUG] btrfs incremental send BUG happens when creating a snapshot of snapshot that is being used by send. [REASON] The problem can happen if while we are doing a send one of the snapshots used (parent or send) is snapshotted, because snapshoting implies COWing the root of the source subvolume/snapshot. 1. When doing an incremental send, the send process will get the commit roots from the parent and send snapshots, and add references to them through extent_buffer_get(). 2. When a snapshot/subvolume is snapshotted, its root node is COWed (transaction.c:create_pending_snapshot()). 3. COWing releases the space used by the node immediately, through: __btrfs_cow_block() --btrfs_free_tree_block() ----btrfs_add_free_space(bytenr of node) 4. Because send doesn't hold a transaction open, it's possible that the transaction used to create the snapshot commits, switches the commit root and the old space used by the previous root node gets assigned to some other node allocation. Allocation of a new node will use the existing extent buffer found in memory, which we previously got a reference through extent_buffer_get(), and allow the extent buffer's content (pages) to be modified: btrfs_alloc_tree_block --btrfs_reserve_extent ----find_free_extent (get bytenr of old node) --btrfs_init_new_buffer (use bytenr of old node) ----btrfs_find_create_tree_block ------alloc_extent_buffer --------find_extent_buffer (get old node) 5. So send can access invalid memory content and have unpredictable behaviour. [FIX] So we fix the problem by copying the commit roots of the send and parent snapshots and use those copies. CallTrace looks like this: ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:1861! invalid opcode: 0000 [#1] SMP CPU: 6 PID: 24235 Comm: btrfs Tainted: P O 3.10.105 #23721 ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000 RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs] RSP: 0018:ffff88041b723b68 EFLAGS: 00010246 RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000 RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000 RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66 R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890 R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001 FS: 00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000) CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880 ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50 ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400 Call Trace: [<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs] [<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs] [<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs] [<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs] [<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs] [<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs] [<ffffffff81111133>] ? handle_pte_fault+0x373/0x990 [<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20 [<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0 [<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0 [<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500 [<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90 [<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990 [<ffffffff81034f83>] ? do_fork+0x113/0x3b0 [<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c [<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0 [<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b ---[ end trace 29576629ee80b2e1 ]--- Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function") Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: s/fs_info/left_root->fs_info/] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21do d_instantiate/unlock_new_inode combinations safelyAl Viro1-12/+4
commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream. For anything NFS-exported we do _not_ want to unlock new inode before it has grown an alias; original set of fixes got the ordering right, but missed the nasty complication in case of lockdep being enabled - unlock_new_inode() does lockdep_annotate_inode_mutex_key(inode) which can only be done before anyone gets a chance to touch ->i_mutex. Unfortunately, flipping the order and doing unlock_new_inode() before d_instantiate() opens a window when mkdir can race with open-by-fhandle on a guessed fhandle, leading to multiple aliases for a directory inode and all the breakage that follows from that. Correct solution: a new primitive (d_instantiate_new()) combining these two in the right order - lockdep annotate, then d_instantiate(), then the rest of unlock_new_inode(). All combinations of d_instantiate() with unlock_new_inode() should be converted to that. Tested-by: Mike Marshall <hubcap@omnibond.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [bwh: Backported to 3.16: - Drop changes in orangefs - Apply similar change to ext3 - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21Btrfs: don't leave dangling dentry if symlink creation failedFilipe Manana1-4/+7
commit d50866d00fb39fcf72307001763ee9cc92625a43 upstream. When we are creating a symlink we might fail with an error after we created its inode and added the corresponding directory indexes to its parent inode. In this case we end up never removing the directory indexes because the inode eviction handler, called for our symlink inode on the final iput(), only removes items associated with the symlink inode and not with the parent inode. Example: $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt $ touch /mnt/foo $ ln -s /mnt/foo /mnt/bar ln: failed to create symbolic link ‘bar’: Cannot allocate memory $ umount /mnt $ btrfsck /dev/sdi Checking filesystem on /dev/sdi UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1 checking extents checking free space cache checking fs roots root 5 inode 258 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref found 131073 bytes used err is 1 total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 124305 file data blocks allocated: 262144 referenced 262144 btrfs-progs v4.2.3 So fix this by adding the directory index entries as the very last step of symlink creation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21Btrfs: use insert_inode_locked4 for inode creationChris Mason1-67/+109
commit b0d5d10f41a0f1cd839408dd94427f2db3553bca upstream. Btrfs was inserting inodes into the hash table before we had fully set the inode up on disk. This leaves us open to rare races that allow two different inodes in memory for the same [root, inode] pair. This patch fixes things by using insert_inode_locked4 to insert an I_NEW inode and unlock_new_inode when we're ready for the rest of the kernel to use the inode. It also makes sure to init the operations pointers on the inode before going into the error handling paths. Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21Btrfs: ensure tmpfile inode is always persisted with link count of 0Filipe Manana1-0/+15
commit 5762b5c958abbecb7fb9f4596a6476d1ce91ecf6 upstream. If we open a file with O_TMPFILE, don't do any further operation on it (so that the inode item isn't updated) and then force a transaction commit, we get a persisted inode item with a link count of 1, and not 0 as it should be. Steps to reproduce it (requires a modern xfs_io with -T support): $ mkfs.btrfs -f /dev/sdd $ mount -o /dev/sdd /mnt $ xfs_io -T /mnt & $ sync Then btrfs-debug-tree shows the inode item with a link count of 1: $ btrfs-debug-tree /dev/sdd (...) fs tree key (FS_TREE ROOT_ITEM 0) leaf 29556736 items 4 free space 15851 generation 6 owner 5 fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0 chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1 item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 inode ref index 0 namelen 2 name: .. item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160 inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1 item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0 orphan item checksum tree key (CSUM_TREE ROOT_ITEM 0) (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21btrfs: Fix possible softlock on single core machinesNikolay Borisov1-0/+1
commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 upstream. do_chunk_alloc implements a loop checking whether there is a pending chunk allocation and if so causes the caller do loop. Generally this loop is executed only once, however testing with btrfs/072 on a single core vm machines uncovered an extreme case where the system could loop indefinitely. This is due to a missing cond_resched when loop which doesn't give a chance to the previous chunk allocator finish its job. The fix is to simply add the missing cond_resched. Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21Btrfs: bail out on error during replay_dir_deletesLiu Bo1-1/+3
commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 upstream. If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs to bail out, otherwise @ret would be forced to be 0 after 'break;' and the caller won't be aware of it. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21Btrfs: fix NULL pointer dereference in log_dir_itemsLiu Bo1-2/+5
commit 80c0b4210a963e31529e15bf90519708ec947596 upstream. 0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is returned, path->nodes[0] could be NULL, log_dir_items lacks such a check for <0 and we may run into a null pointer dereference panic. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21btrfs: tests/qgroup: Fix wrong tree backref levelQu Wenruo1-1/+1
commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a upstream. The extent tree of the test fs is like the following: BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919 item 0 key (4096 168 4096) itemoff 3944 itemsize 51 extent refs 1 gen 1 flags 2 tree block key (68719476736 0 0) level 1 ^^^^^^^ ref#0: tree block backref root 5 And it's using an empty tree for fs tree, so there is no way that its level can be 1. For REAL (created by mkfs) fs tree backref with no skinny metadata, the result should look like: item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51 refs 1 gen 4 flags TREE_BLOCK tree block key (256 INODE_ITEM 0) level 0 ^^^^^^^ tree block backref root 5 Fix the level to 0, so it won't break later tree level checker. Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21Btrfs: fix copy_items() return value when logging an inodeFilipe Manana1-0/+1
commit 8434ec46c6e3232cebc25a910363b29f5c617820 upstream. When logging an inode, at tree-log.c:copy_items(), if we call btrfs_next_leaf() at the loop which checks for the need to log holes, we need to make sure copy_items() returns the value 1 to its caller and not 0 (on success). This is because the path the caller passed was released and is now different from what is was before, and the caller expects a return value of 0 to mean both success and that the path has not changed, while a return value of 1 means both success and signals the caller that it can not reuse the path, it has to perform another tree search. Even though this is a case that should not be triggered on normal circumstances or very rare at least, its consequences can be very unpredictable (especially when replaying a log tree). Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21btrfs: Handle error from btrfs_uuid_tree_rem call in ↵Nikolay Borisov1-5/+11
_btrfs_ioctl_set_received_subvol commit d87ff75863e92a500538ab53318c5740f196631e upstream. As with every function which deals with modifying the btree btrfs_uuid_tree_rem can fail for any number of reasons (ie. EIO/ENOMEM). Handle return error value from this function gracefully by aborting the transaction. Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - btrfs_{abort,end}_transaction() take a pointer to btrfs_root - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21btrfs: Refactor transaction handling in received subvolume ioctlNikolay Borisov1-5/+1
commit efd38150af45375b46576d0110a323d7fab7e142 upstream. If btrfs_transaction_commit fails it will proceed to call cleanup_transaction, which in turn already does btrfs_abort_transaction. So let's remove the unnecessary code duplication. Also let's be explicit about handling failure of btrfs_uuid_tree_add by calling btrfs_end_transaction. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - btrfs_{abort,end}_transaction() take a pointer to btrfs_root - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-21Btrfs: fix unexpected cow in run_delalloc_nocowLiu Bo1-4/+33
commit 5811375325420052fcadd944792a416a43072b7f upstream. Fstests generic/475 provides a way to fail metadata reads while checking if checksum exists for the inode inside run_delalloc_nocow(), and csum_exist_in_range() interprets error (-EIO) as inode having checksum and makes its caller enter the cow path. In case of free space inode, this ends up with a warning in cow_file_range(). The same problem applies to btrfs_cross_ref_exist() since it may also read metadata in between. With this, run_delalloc_nocow() bails out when errors occur at the two places. Fixes: 17d217fe970d ("Btrfs: fix nodatasum handling in balancing code") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-10-03mm: drop vm_ops->remap_pages and generic_file_remap_pages() stubKirill A. Shutemov1-1/+0
commit d83a08db5ba6072caa658745881f4baa9bad6a08 upstream. Nobody uses it anymore. [akpm@linux-foundation.org: fix filemap_xip.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.16: - Deleted code is slightly different - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-09-26btrfs: relocation: Only remove reloc rb_trees if reloc control has been ↵Qu Wenruo1-11/+12
initialized commit 389305b2aa68723c754f88d9dbd268a400e10664 upstream. Invalid reloc tree can cause kernel NULL pointer dereference when btrfs does some cleanup of the reloc roots. It turns out that fs_info::reloc_ctl can be NULL in btrfs_recover_relocation() as we allocate relocation control after all reloc roots have been verified. So when we hit: note, we haven't called set_reloc_control() thus fs_info::reloc_ctl is still NULL. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-06-17btrfs: alloc_chunk: fix DUP stripe size handlingHans van Kranenburg1-5/+6
commit 92e222df7b8f05c565009c7383321b593eca488b upstream. In case of using DUP, we search for enough unallocated disk space on a device to hold two stripes. The devices_info[ndevs-1].max_avail that holds the amount of unallocated space found is directly assigned to stripe_size, while it's actually twice the stripe size. Later on in the code, an unconditional division of stripe_size by dev_stripes corrects the value, but in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size. The unconditional division later tries to correct stripe_size, but will actually make sure we can't allocate more than half the max_chunk_size. Fix this by moving the division by dev_stripes before the max chunk size check, so it always contains the right value, instead of putting a duct tape division in further on to get it fixed again. Since in all other cases than DUP, dev_stripes is 1, this change only affects DUP. Other attempts in the past were made to fix this: * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried to fix the same problem, but still resulted in part of the code acting on a wrongly doubled stripe_size value. * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally broke this fix again. The real problem was already introduced with the rest of the code in 73c5de0051. The user visible result however will be that the max chunk size for DUP will suddenly double, while it's actually acting according to the limits in the code again like it was 5 years ago. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation") Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6") Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: We were using do_div() here] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-06-17btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodesZygo Blaxell1-1/+10
commit c8195a7b1ad5648857ce20ba24f384faed8512bc upstream. Until v4.14, this warning was very infrequent: WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0 Modules linked in: [...] CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1 Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014 Call Trace: dump_stack+0x85/0xc2 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 find_parent_nodes+0xc41/0x14e0 __btrfs_find_all_roots+0xad/0x120 ? extent_same_check_offsets+0x70/0x70 iterate_extent_inodes+0x168/0x300 iterate_inodes_from_logical+0x87/0xb0 ? iterate_inodes_from_logical+0x87/0xb0 ? extent_same_check_offsets+0x70/0x70 btrfs_ioctl+0x8ac/0x2820 ? lock_acquire+0xc2/0x200 do_vfs_ioctl+0x91/0x700 ? __fget+0x112/0x200 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc6 ? trace_hardirqs_off_caller+0x1f/0x140 Starting with v4.14 (specifically 86d5f9944252 ("btrfs: convert prelimary reference tracking to use rbtrees")) the WARN_ON occurs three orders of magnitude more frequently--almost once per second while running workloads like bees. Replace the WARN_ON() with a comment rationale for its removal. The rationale is paraphrased from an explanation by Edmund Nadolski <enadolski@suse.de> on the linux-btrfs mailing list. Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()") Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-06-17Btrfs: fix use-after-free on root->orphan_block_rsvLiu Bo1-13/+16
commit 1a932ef4e47984dee227834667b5ff5a334e4805 upstream. I got these from running generic/475, WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs] Call Trace: btrfs_orphan_release_metadata+0x9f/0x200 [btrfs] btrfs_orphan_del+0x10d/0x170 [btrfs] btrfs_setattr+0x500/0x640 [btrfs] notify_change+0x7ae/0x870 do_truncate+0xca/0x130 vfs_truncate+0x2ee/0x3d0 do_sys_truncate+0xaf/0xf0 SyS_truncate+0xe/0x10 entry_SYSCALL_64_fastpath+0x1f/0x96 The race is between btrfs_orphan_commit_root and btrfs_orphan_del, t1 t2 btrfs_orphan_commit_root btrfs_orphan_del spin_lock check (&root->orphan_inodes) root->orphan_block_rsv = NULL; spin_unlock atomic_dec(&root->orphan_inodes); access root->orphan_block_rsv Accessing root->orphan_block_rsv must be done before decreasing root->orphan_inodes. Fixes: 703c88e03524 ("Btrfs: fix tracking of orphan inode count") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: Drop the added comment in a path that's unreachable here] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-06-17Btrfs: fix extent state leak from tree logLiu Bo1-2/+3
commit 55237a5f2431a72435e3ed39e4306e973c0446b7 upstream. It's possible that btrfs_sync_log() bails out after one of the two btrfs_write_marked_extents() which convert extent state's state bit into EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY and EXTENT_NEW are searched by free_log_tree() so that those extent states with EXTENT_NEED_WAIT lead to memory leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-06-17Btrfs: fix crash due to not cleaning up tree log block's dirty bitsLiu Bo1-0/+9
commit 1846430c24d66e85cc58286b3319c82cd54debb2 upstream. In cases that the whole fs flips into readonly status due to failures in critical sections, then log tree's blocks are still dirty, and this leads to a crash during umount time, the crash is about use-after-free, umount -> close_ctree -> stop workers -> iput(btree_inode) -> iput_final -> write_inode_now -> ... -> queue job on stop'd workers Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-06-17Btrfs: fix deadlock in run_delalloc_nocowLiu Bo1-1/+4
commit e89166990f11c3f21e1649d760dd35f9e410321c upstream. @cur_offset is not set back to what it should be (@cow_start) if btrfs_next_leaf() returns something wrong, and the range [cow_start, cur_offset) remains locked forever. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-06-17btrfs: Handle btrfs_set_extent_delalloc failure in fixup workerNikolay Borisov1-1/+9
commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream. This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers to deal with pages that have been improperly dirtied") and it didn't do any error handling then. This function might very well fail in ENOMEM situation, yet it's not handled, this could lead to inconsistent state. So let's handle the failure by setting the mapping error bit. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-03-03btrfs: Fix possible off-by-one in btrfs_search_path_in_treeNikolay Borisov1-1/+1
commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 upstream. The name char array passed to btrfs_search_path_in_tree is of size BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes are in the range of [0, 4079]. Currently the code uses the define but this represents an off-by-one. Implications: Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be written to extra space, not some padding that could be provided by the allocator. btrfs-progs store the arguments on stack, but kernel does own copy of the ioctl buffer and the off-by-one overwrite does not affect userspace, but the ending 0 might be lost. Kernel ioctl buffer is allocated dynamically so we're overwriting somebody else's memory, and the ioctl is privileged if args.objectid is not 256. Which is in most cases, but resolving a subvolume stored in another directory will trigger that path. Before this patch the buffer was one byte larger, but then the -1 was not added. Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ added implications ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-03-03Btrfs: disable FUA if mounted with nobarrierOmar Sandoval1-4/+5
commit 1b9e619c5bc8235cfba3dc4ced2fb0e3554a05d4 upstream. I was seeing disk flushes still happening when I mounted a Btrfs filesystem with nobarrier for testing. This is because we use FUA to write out the first super block, and on devices without FUA support, the block layer translates FUA to a flush. Even on devices supporting true FUA, using FUA when we asked for no barriers is surprising. Fixes: 387125fc722a8ed ("Btrfs: fix barrier flushes") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - I/O flag names are different, and are combined with the operation type - Use the do_barrier parameter instead of checking the NOBARRIER option again] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-03-03btrfs: fix missing error return in btrfs_drop_snapshotJeff Mahoney1-0/+1
commit e19182c0fff451e3744c1107d98f072e7ca377a0 upstream. If btrfs_del_root fails in btrfs_drop_snapshot, we'll pick up the error but then return 0 anyway due to mixing err and ret. Fixes: 79787eaab4612 ("btrfs: replace many BUG_ONs with proper error handling") Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-03-03btrfs: clear space cache inode generation alwaysJosef Bacik1-7/+7
commit 8e138e0d92c6c9d3d481674fb14e3439b495be37 upstream. We discovered a box that had double allocations, and suspected the space cache may be to blame. While auditing the write out path I noticed that if we've already setup the space cache we will just carry on. This means that any error we hit after cache_save_setup before we go to actually write the cache out we won't reset the inode generation, so whatever was already written will be considered correct, except it'll be stale. Fix this by _always_ resetting the generation on the block group inode, this way we only ever have valid or invalid cache. With this patch I was no longer able to reproduce cache corruption with dm-log-writes and my bpf error injection tool. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2018-02-13btrfs: avoid null pointer dereference on fs_info when calling btrfs_critColin Ian King1-2/+2
commit 3993b112dac968612b0b213ed59cb30f50b0015b upstream. There are checks on fs_info in __btrfs_panic to avoid dereferencing a null fs_info, however, there is a call to btrfs_crit that may also dereference a null fs_info. Fix this by adding a check to see if fs_info is null and only print the s_id if fs_info is non-null. Detected by CoverityScan CID#401973 ("Dereference after null check") Fixes: efe120a067c8 ("Btrfs: convert printk to btrfs_ and fix BTRFS prefix") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>