summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2016-11-20fs: Give dentry to inode_change_ok() instead of inodeJan Kara1-1/+1
commit 31051c85b5e2aaaf6315f74c72a732673632a905 upstream. inode_change_ok() will be resposible for clearing capabilities and IMA extended attributes and as such will need dentry. Give it as an argument to inode_change_ok() instead of an inode. Also rename inode_change_ok() to setattr_prepare() to better relect that it does also some modifications in addition to checks. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> [bwh: Backported to 3.2: - Drop changes to f2fs, lustre, orangefs, overlayfs - Adjust filenames, context - In nfsd, pass dentry to nfsd_sanitize_attrs() - In xfs, pass dentry to xfs_change_file_space(), xfs_set_mode(), xfs_setattr_nonsize(), and xfs_setattr_size() - Update ext3 as well - Mark pohmelfs as BROKEN; it's long dead upstream] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-08-23btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctlLuke Dashjr1-1/+1
commit 4c63c2454eff996c5e27991221106eb511f7db38 upstream. 32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr fail. Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2016-02-27btrfs: properly set the termination value of ctx->pos in readdirDavid Sterba1-1/+13
commit bc4ef7592f657ae81b017207a1098817126ad4cb upstream. The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor <services@swwu.com> Signed-off-by: David Sterba <dsterba@suse.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com> [bwh: Backported to 3.2: - s/ctx->pos/filp->f_pos/ - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-11-27Btrfs: fix race leading to BUG_ON when running delalloc for nodatacowFilipe Manana1-2/+8
commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d upstream. If we are using the NO_HOLES feature, we have a tiny time window when running delalloc for a nodatacow inode where we can race with a concurrent link or xattr add operation leading to a BUG_ON. This happens because at run_delalloc_nocow() we end up casting a leaf item of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a file extent item (struct btrfs_file_extent_item) and then analyse its extent type field, which won't match any of the expected extent types (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an explicit BUG_ON(1). The following sequence diagram shows how the race happens when running a no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 (Note the implicit hole for inode 257 regarding the [0, 8K[ range) CPU 1 CPU 2 run_dealloc_nocow() btrfs_lookup_file_extent() --> searches for a key with value (257 EXTENT_DATA 4096) in the fs/subvol tree --> returns us a path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() --> releases the path hard link added to our inode, with key (257 INODE_REF 500) added to the end of leaf X, so leaf X now has N + 1 keys --> searches for the key (257 INODE_REF 256), because it was the last key in leaf X before it released the path, with path->keep_locks set to 1 --> ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in the leaf, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 500) the loop iteration of run_dealloc_nocow() does not break out the loop and continues because the key referenced in the path at path->nodes[0] and path->slots[0] is for inode 257, its type is < BTRFS_EXTENT_DATA_KEY and its offset (500) is less then our delalloc range's end (8192) the item pointed by the path, an inode reference item, is (incorrectly) interpreted as a file extent item and we get an invalid extent type, leading to the BUG_ON(1): if (extent_type == BTRFS_FILE_EXTENT_REG || extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) } else { BUG_ON(1) } The same can happen if a xattr is added concurrently and ends up having a key with an offset smaller then the delalloc's range end. So fix this by skipping keys with a type smaller than BTRFS_EXTENT_DATA_KEY. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-11-27Btrfs: fix truncation of compressed and inlined extentsFilipe Manana1-16/+60
commit 0305cd5f7fca85dae392b9ba85b116896eb7c1c7 upstream. When truncating a file to a smaller size which consists of an inline extent that is compressed, we did not discard (or made unusable) the data between the new file size and the old file size, wasting metadata space and allowing for the truncated data to be leaked and the data corruption/loss mentioned below. We were also not correctly decrementing the number of bytes used by the inode, we were setting it to zero, giving a wrong report for callers of the stat(2) syscall. The fsck tool also reported an error about a mismatch between the nbytes of the file versus the real space used by the file. Now because we weren't discarding the truncated region of the file, it was possible for a caller of the clone ioctl to actually read the data that was truncated, allowing for a security breach without requiring root access to the system, using only standard filesystem operations. The scenario is the following: 1) User A creates a file which consists of an inline and compressed extent with a size of 2000 bytes - the file is not accessible to any other users (no read, write or execution permission for anyone else); 2) The user truncates the file to a size of 1000 bytes; 3) User A makes the file world readable; 4) User B creates a file consisting of an inline extent of 2000 bytes; 5) User B issues a clone operation from user A's file into its own file (using a length argument of 0, clone the whole range); 6) User B now gets to see the 1000 bytes that user A truncated from its file before it made its file world readbale. User B also lost the bytes in the range [1000, 2000[ bytes from its own file, but that might be ok if his/her intention was reading stale data from user A that was never supposed to be public. Note that this contrasts with the case where we truncate a file from 2000 bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In this case reading any byte from the range [1000, 2000[ will return a value of 0x00, instead of the original data. This problem exists since the clone ioctl was added and happens both with and without my recent data loss and file corruption fixes for the clone ioctl (patch "Btrfs: fix file corruption and data loss after cloning inline extents"). So fix this by truncating the compressed inline extents as we do for the non-compressed case, which involves decompressing, if the data isn't already in the page cache, compressing the truncated version of the extent, writing the compressed content into the inline extent and then truncate it. The following test case for fstests reproduces the problem. In order for the test to pass both this fix and my previous fix for the clone ioctl that forbids cloning a smaller inline extent into a larger one, which is titled "Btrfs: fix file corruption and data loss after cloning inline extents", are needed. Without that other fix the test fails in a different way that does not leak the truncated data, instead part of destination file gets replaced with zeroes (because the destination file has a larger inline extent than the source). seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount "-o compress" # Create our test files. File foo is going to be the source of a clone operation # and consists of a single inline extent with an uncompressed size of 512 bytes, # while file bar consists of a single inline extent with an uncompressed size of # 256 bytes. For our test's purpose, it's important that file bar has an inline # extent with a size smaller than foo's inline extent. $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \ -c "pwrite -S 0x2a 128 384" \ $SCRATCH_MNT/foo | _filter_xfs_io $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io # Now durably persist all metadata and data. We do this to make sure that we get # on disk an inline extent with a size of 512 bytes for file foo. sync # Now truncate our file foo to a smaller size. Because it consists of a # compressed and inline extent, btrfs did not shrink the inline extent to the # new size (if the extent was not compressed, btrfs would shrink it to 128 # bytes), it only updates the inode's i_size to 128 bytes. $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo # Now clone foo's inline extent into bar. # This clone operation should fail with errno EOPNOTSUPP because the source # file consists only of an inline extent and the file's size is smaller than # the inline extent of the destination (128 bytes < 256 bytes). However the # clone ioctl was not prepared to deal with a file that has a size smaller # than the size of its inline extent (something that happens only for compressed # inline extents), resulting in copying the full inline extent from the source # file into the destination file. # # Note that btrfs' clone operation for inline extents consists of removing the # inline extent from the destination inode and copy the inline extent from the # source inode into the destination inode, meaning that if the destination # inode's inline extent is larger (N bytes) than the source inode's inline # extent (M bytes), some bytes (N - M bytes) will be lost from the destination # file. Btrfs could copy the source inline extent's data into the destination's # inline extent so that we would not lose any data, but that's currently not # done due to the complexity that would be needed to deal with such cases # (specially when one or both extents are compressed), returning EOPNOTSUPP, as # it's normally not a very common case to clone very small files (only case # where we get inline extents) and copying inline extents does not save any # space (unlike for normal, non-inlined extents). $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Now because the above clone operation used to succeed, and due to foo's inline # extent not being shinked by the truncate operation, our file bar got the whole # inline extent copied from foo, making us lose the last 128 bytes from bar # which got replaced by the bytes in range [128, 256[ from foo before foo was # truncated - in other words, data loss from bar and being able to read old and # stale data from foo that should not be possible to read anymore through normal # filesystem operations. Contrast with the case where we truncate a file from a # size N to a smaller size M, truncate it back to size N and then read the range # [M, N[, we should always get the value 0x00 for all the bytes in that range. # We expected the clone operation to fail with errno EOPNOTSUPP and therefore # not modify our file's bar data/metadata. So its content should be 256 bytes # long with all bytes having the value 0xbb. # # Without the btrfs bug fix, the clone operation succeeded and resulted in # leaking truncated data from foo, the bytes that belonged to its range # [128, 256[, and losing data from bar in that same range. So reading the # file gave us the following content: # # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 # * # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a # * # 0000400 echo "File bar's content after the clone operation:" od -t x1 $SCRATCH_MNT/bar # Also because the foo's inline extent was not shrunk by the truncate # operation, btrfs' fsck, which is run by the fstests framework everytime a # test completes, failed reporting the following error: # # root 5 inode 257 errors 400, nbytes wrong status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> [bwh: Backported to 3.2: - Adjust parameters to btrfs_truncate_page() and btrfs_truncate_item() - Pass transaction pointer into truncate_inline_extent() - Add prototype of btrfs_truncate_page() - s/test_bit(BTRFS_ROOT_REF_COWS, &root->state)/root->ref_cows/ - Keep using BUG_ON() for other error cases, as there is no btrfs_abort_transaction() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-11-27Btrfs: don't use ram_bytes for uncompressed inline itemsChris Mason1-4/+11
commit 514ac8ad8793a097c0c9d89202c642479d6dfa34 upstream. If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect the new size. The fixe uses the size directly from the item header when reading uncompressed inlines, and also fixes truncate to update the size as it goes. Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Chris Mason <clm@fb.com> [bwh: Backported to 3.2: - Don't use btrfs_map_token API - There are fewer callers of btrfs_file_extent_inline_len() to change - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2015-10-13btrfs: skip waiting on ordered range for special filesJeff Mahoney1-1/+2
commit a30e577c96f59b1e1678ea5462432b09bf7d5cbc upstream. In btrfs_evict_inode, we properly truncate the page cache for evicted inodes but then we call btrfs_wait_ordered_range for every inode as well. It's the right thing to do for regular files but results in incorrect behavior for device inodes for block devices. filemap_fdatawrite_range gets called with inode->i_mapping which gets resolved to the block device inode before getting passed to wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens next depends on whether there's an open file handle associated with the inode. If there is, we write to the block device, which is unexpected behavior. If there isn't, we through normally and inode->i_data is used. We can also end up racing against open/close which can result in crashes when i_mapping points to a block device inode that has been closed. Since there can't be any page cache associated with special file inodes, it's safe to skip the btrfs_wait_ordered_range call entirely and avoid the problem. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911 Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2014-04-02Btrfs: setup inode location during btrfs_init_inode_lockedChris Mason1-9/+9
commit 90d3e592e99b8e374ead2b45148abf506493a959 upstream. We have a race during inode init because the BTRFS_I(inode)->location is setup after the inode hash table lock is dropped. btrfs_find_actor uses the location field, so our search might not find an existing inode in the hash table if we race with the inode init code. This commit changes things to setup the location field sooner. Also the find actor now uses only the location objectid to match inodes. For inode hashing, we just need a unique and stable test, it doesn't have to reflect the inode numbers we show to userland. Signed-off-by: Chris Mason <clm@fb.com> [bwh: Backported to 3.2: - No hashval in btrfs_iget_locked()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-04-25Btrfs: fix race between mmap writes and compressionChris Mason1-0/+14
commit 4adaa611020fa6ac65b0ac8db78276af4ec04e63 upstream. Btrfs uses page_mkwrite to ensure stable pages during crc calculations and mmap workloads. We call clear_page_dirty_for_io before we do any crcs, and this forces any application with the file mapped to wait for the crc to finish before it is allowed to change the file. With compression on, the clear_page_dirty_for_io step is happening after we've compressed the pages. This means the applications might be changing the pages while we are compressing them, and some of those modifications might not hit the disk. This commit adds the clear_page_dirty_for_io before compression starts and makes sure to redirty the page if we have to fallback to uncompressed IO as well. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2011-12-24Merge branch 'for-linus' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: call d_instantiate after all ops are setup Btrfs: fix worker lock misuse in find_worker
2011-12-23Btrfs: call d_instantiate after all ops are setupAl Viro1-4/+5
This closes races where btrfs is calling d_instantiate too soon during inode creation. All of the callers of btrfs_add_nondir are updated to instantiate after the inode is fully setup in memory. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-17Merge branches 'for-linus' and 'for-linus-3.2' of ↵Linus Torvalds1-33/+147
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: unplug every once and a while Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code Btrfs: only set cache_generation if we setup the block group Btrfs: don't panic if orphan item already exists Btrfs: fix leaked space in truncate Btrfs: fix how we do delalloc reservations and how we free reservations on error Btrfs: deal with enospc from dirtying inodes properly Btrfs: fix num_workers_starting bug and other bugs in async thread BTRFS: Establish i_ops before calling d_instantiate Btrfs: add a cond_resched() into the worker loop Btrfs: fix ctime update of on-disk inode btrfs: keep orphans for subvolume deletion Btrfs: fix inaccurate available space on raid0 profile Btrfs: fix wrong disk space information of the files Btrfs: fix wrong i_size when truncating a file to a larger size Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror * 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: lower the dirty balance poll interval
2011-12-15Merge branch 'for-chris' of ↵Chris Mason1-22/+74
http://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into integration Conflicts: fs/btrfs/inode.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: don't panic if orphan item already existsJosef Bacik1-1/+1
I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in a loop. This is because we will add an orphan item, do the truncate, the truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're left with an orphan item still in the fs. Then we come back later to do another truncate and it blows up because we already have an orphan item. This is ok so just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15Btrfs: fix leaked space in truncateJosef Bacik1-2/+3
We were occasionaly leaking space when running xfstest 269. This is because if we failed to start the transaction in the truncate loop we'd just goto out, but we need to break so that the inode is removed from the orphan list and the space is properly freed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15Btrfs: fix how we do delalloc reservations and how we free reservations on errorJosef Bacik1-0/+10
Running xfstests 269 with some tracing my scripts kept spitting out errors about releasing bytes that we didn't actually have reserved. This took me down a huge rabbit hole and it turns out the way we deal with reserved_extents is wrong, we need to only be setting it if the reservation succeeds, otherwise the free() method will come in and unreserve space that isn't actually reserved yet, which can lead to other warnings and such. The math was all working out right in the end, but it caused all sorts of other issues in addition to making my scripts yell and scream and generally make it impossible for me to track down the original issue I was looking for. The other problem is with our error handling in the reservation code. There are two cases that we need to deal with 1) We raced with free. In this case free won't free anything because csum_bytes is modified before we dro the lock in our reservation path, so free rightly doesn't release any space because the reservation code may be depending on that reservation. However if we fail, we need the reservation side to do the free at that point since that space is no longer in use. So as it stands the code was doing this fine and it worked out, except in case #2 2) We don't race with free. Nobody comes in and changes anything, and our reservation fails. In this case we didn't reserve anything anyway and we just need to clean up csum_bytes but not free anything. So we keep track of csum_bytes before we drop the lock and if it hasn't changed we know we can just decrement csum_bytes and carry on. Because of the case where we can race with free()'s since we have to drop our spin_lock to do the reservation, I'm going to serialize all reservations with the i_mutex. We already get this for free in the heavy use paths, truncate and file write all hold the i_mutex, just needed to add it to page_mkwrite and various ioctl/balance things. With this patch my space leak scripts no longer scream bloody murder. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15Btrfs: deal with enospc from dirtying inodes properlyJosef Bacik1-19/+61
Now that we're properly keeping track of delayed inode space we've been getting a lot of warnings out of btrfs_dirty_inode() when running xfstest 83. This is because a bunch of people call mark_inode_dirty, which is void so we can't return ENOSPC. This needs to be fixed in a few areas 1) file_update_time - this updates the mtime and such when writing to a file, which will call mark_inode_dirty. So copy file_update_time into btrfs so we can call btrfs_dirty_inode directly and return an error if we get one appropriately. 2) fix symlinks to use btrfs_setattr for ->setattr. For some reason we weren't setting ->setattr for symlinks, even though we should have been. This catches one of the cases where we were getting errors in mark_inode_dirty. 3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly instead of mark_inode_dirty. This lets us return errors properly for truncate and chown/anything related to setattr. 4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and print an error if we have one. The only remaining user we can't control for this is touch_atime(), but we don't really want to keep people from walking down the tree if we don't have space to save the atime update, so just complain but don't worry about it. With this patch xfstests 83 complains a handful of times instead of hundreds of times. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-12-15BTRFS: Establish i_ops before calling d_instantiateCasey Schaufler1-5/+26
The Smack LSM hook for security_d_instantiate checks the inode's i_op->getxattr value to determine if the containing filesystem supports extended attributes. The BTRFS filesystem sets the inode's i_op value only after it has instantiated the inode. This results in Smack incorrectly giving new BTRFS inodes attributes from the filesystem defaults on the assumption that values can't be stored on the filesystem. This patch moves the assignment of inode operation vectors ahead of the calls to d_instantiate, letting Smack know that the filesystem supports extended attributes. There should be no impact on the performance or behavior of BTRFS. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15btrfs: keep orphans for subvolume deletionArne Jansen1-0/+32
Since we have the free space caches, btrfs_orphan_cleanup also runs for the tree_root. Unfortunately this also cleans up the orphans used to mark subvol deletions in progress. Currently if a subvol deletion gets interrupted twice by umount/mount, the deletion will not be continued and the space permanently lost, though it would be possible to write a tool to recover those lost subvol deletions. This patch checks if the orphan belongs to a subvol (dead root) and skips the deletion. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: fix wrong disk space information of the filesMiao Xie1-1/+4
Btrfsck report errors after the 83th case of xfstests was run, The error number is 400, it means the used disk space of the file is wrong. The reason of this bug is that: The file truncation may fail when the space of the file system is not enough, and leave some file extents, whose offset are beyond the end of the files. When we want to expand those files, we will drop those file extents, and put in dummy file extents, and then we should update the i-node. But btrfs forgets to do it. This patch adds the forgotten i-node update. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15Btrfs: fix wrong i_size when truncating a file to a larger sizeMiao Xie1-6/+12
Btrfsck report error 100 after the 83th case of xfstests was run, it means the i_size of the file is wrong. The reason of this bug is that: Btrfs increased i_size of the file at the beginning, but it failed to expand the file, and failed to update the i_size to the old size because there is no enough space in the file system, so we found a wrong i_size. This patch fixes this bug by updating the i_size just when we pass the file expanding and get enough space to update i-node. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-01Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix meta data raid-repair merge problem Btrfs: skip allocation attempt from empty cluster Btrfs: skip block groups without enough space for a cluster Btrfs: start search for new cluster at the beginning Btrfs: reset cluster's max_size when creating bitmap Btrfs: initialize new bitmaps' list Btrfs: fix oops when calling statfs on readonly device Btrfs: Don't error on resizing FS to same size Btrfs: fix deadlock on metadata reservation when evicting a inode Fix URL of btrfs-progs git repository in docs btrfs scrub: handle -ENOMEM from init_ipath()
2011-11-30Btrfs: fix deadlock on metadata reservation when evicting a inodeMiao Xie1-1/+1
When I ran the xfstests, I found the test tasks was blocked on meta-data reservation. By debugging, I found the reason of this bug: start transaction | v reserve meta-data space | v flush delay allocation -> iput inode -> evict inode ^ | | v wait for delay allocation flush <- reserve meta-data space And besides that, the flush on evicting inode will block the thread, which is reclaiming the memory, and make oom happen easily. Fix this bug by skipping the flush step when evicting inode. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-11-22Merge branch 'for-linus' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: remove free-space-cache.c WARN during log replay Btrfs: sectorsize align offsets in fiemap Btrfs: clear pages dirty for io and set them extent mapped Btrfs: wait on caching if we're loading the free space cache Btrfs: prefix resize related printks with btrfs: btrfs: fix stat blocks accounting Btrfs: avoid unnecessary bitmap search for cluster setup Btrfs: fix to search one more bitmap for cluster setup btrfs: mirror_num should be int, not u64 btrfs: Fix up 32/64-bit compatibility for new ioctls Btrfs: fix barrier flushes Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
2011-11-20btrfs: fix stat blocks accountingDavid Sterba1-2/+4
Round inode bytes and delalloc bytes up to real blocksize before converting to sector size. Otherwise eg. files smaller than 512 are reported with zero blocks due to incorrect rounding. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-12Merge branch 'for-linus' of ↵Linus Torvalds1-27/+57
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: rename the option to nospace_cache Btrfs: handle bio_add_page failure gracefully in scrub Btrfs: fix deadlock caused by the race between relocation Btrfs: only map pages if we know we need them when reading the space cache Btrfs: fix orphan backref nodes Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush} Btrfs: fix unreleased path in btrfs_orphan_cleanup() Btrfs: fix no reserved space for writing out inode cache Btrfs: fix nocow when deleting the item Btrfs: tweak the delayed inode reservations again Btrfs: rework error handling in btrfs_mount() Btrfs: close devices on all error paths in open_ctree() Btrfs: avoid null dereference and leaks when bailing from open_ctree() Btrfs: fix subvol_name leak on error in btrfs_mount() Btrfs: fix memory leak in btrfs_parse_early_options() Btrfs: fix our reservations for updating an inode when completing io Btrfs: fix oops on NULL trans handle in btrfs_truncate btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
2011-11-11Btrfs: fix unreleased path in btrfs_orphan_cleanup()Miao Xie1-0/+3
When we did stress test for the space relocation, the deadlock happened. By debugging, We found it was caused by the carelessness that we forgot to unlock the read lock of the extent buffers in btrfs_orphan_cleanup() before we end the transaction handle, so the transaction commit task waited the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer, but that task waited the commit task to end the transaction commit, and the deadlock happened. Fix it. Signed-ff-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-11Btrfs: tweak the delayed inode reservations againChris Mason1-20/+44
Josef sent along an incremental to the inode reservation code to make sure we try and fall back to directly updating the inode item if things go horribly wrong. This reworks that patch slightly, adding a fallback function that will always try to update the inode item directly without going through the delayed_inode code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-09Btrfs: fix our reservations for updating an inode when completing ioJosef Bacik1-0/+1
People have been reporting ENOSPC crashes in finish_ordered_io. This is because we try to steal from the delalloc block rsv to satisfy a reservation to update the inode. The problem with this is we don't explicitly save space for updating the inode when doing delalloc. This is kind of a problem and we've gotten away with this because way back when we just stole from the delalloc reserve without any questions, and this worked out fine because generally speaking the leaf had been modified either by the mtime update when we did the original write or because we just updated the leaf when we inserted the file extent item, only on rare occasions had the leaf not actually been modified, and that was still ok because we'd just use a block or two out of the over-reservation that is delalloc. Then came the delayed inode stuff. This is amazing, except it wants a full reservation for updating the inode since it may do it at some point down the road after we've written the blocks and we have to recow everything again. This worked out because the delayed inode stuff just stole from the global reserve, that is until recently when I changed that because it caused other problems. So here we are, we're doing everything right and being screwed for it. So take an extra reservation for the inode at delalloc reservation time and carry it through the life of the delalloc reservation. If we need it we can steal it in the delayed inode stuff. If we have already stolen it try and do a normal metadata reservation. If that fails try to steal from the delalloc reservation. If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to solve this and in the meantime we'll steal from the global reserve. With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see any problems. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-08Btrfs: fix oops on NULL trans handle in btrfs_truncateChris Mason1-7/+9
If we fail to reserve space in the transaction during truncate, we can error out with a NULL trans handle. The cleanup code needs an extra check to make sure we aren't trying to use the bad handle. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-07Merge branch 'for-linus' of ↵Linus Torvalds1-310/+147
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits) Btrfs: check for a null fs root when writing to the backup root log Btrfs: fix race during transaction joins Btrfs: fix a potential btrfs_bio leak on scrub fixups Btrfs: rename btrfs_bio multi -> bbio for consistency Btrfs: stop leaking btrfs_bios on readahead Btrfs: stop the readahead threads on failed mount Btrfs: fix extent_buffer leak in the metadata IO error handling Btrfs: fix the new inspection ioctls for 32 bit compat Btrfs: fix delayed insertion reservation Btrfs: ClearPageError during writepage and clean_tree_block Btrfs: be smarter about committing the transaction in reserve_metadata_bytes Btrfs: make a delayed_block_rsv for the delayed item insertion Btrfs: add a log of past tree roots btrfs: separate superblock items out of fs_info Btrfs: use the global reserve when truncating the free space cache inode Btrfs: release metadata from global reserve if we have to fallback for unlink Btrfs: make sure to flush queued bios if write_cache_pages waits Btrfs: fix extent pinning bugs in the tree log Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN Btrfs: don't wait as long for more batches during SSD log commit ...
2011-11-06Merge git://git.jan-o-sch.net/btrfs-unstable into integrationChris Mason1-153/+4
Conflicts: fs/btrfs/Makefile fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06btrfs: separate superblock items out of fs_infoDavid Sterba1-1/+1
fs_info has now ~9kb, more than fits into one page. This will cause mount failure when memory is too fragmented. Top space consumers are super block structures super_copy and super_for_commit, ~2.8kb each. Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64) Add a wrapper for freeing fs_info and all of it's dynamically allocated members. Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06Btrfs: release metadata from global reserve if we have to fallback for unlinkJosef Bacik1-1/+4
I fixed a problem where we weren't reserving space for an orphan item when we had to fallback to using the global reserve for an unlink, but I introduced another problem. I was migrating the bytes from the transaction reserve to the global reserve and then releasing from the global reserve in btrfs_end_transaction(). The problem with this is that a migrate will jack up the size for the destination, but leave the size alone for the source, with the idea that you can do a release normally on the source and it all washes out, and then you can do a release again on the destination and it works out right. My way was skipping the release on the trans_block_rsv which still had the jacked up size from our original reservation. So instead release manually from the global reserve if this transaction was using it, and then set the trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction cleans everything up properly. With this patch xfstest 83 doesn't emit warnings about leaking space. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-02filesystems: add set_nlink()Miklos Szeredi1-2/+2
Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-20Btrfs: fix direct-io vs nodatacowLi Zefan1-2/+1
To reproduce the bug: # mount -o nodatacow /dev/sda7 /mnt/ # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct dd: writing `/mnt/tmp': Input/output error 1+0 records in 0+0 records out btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write() mistakenly takes it as an error. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20Btrfs: remove BUG_ON() in compress_file_range()Li Zefan1-1/+5
It's not a big deal if we fail to allocate the array, and instead of panic we can just give up compressing. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-19Btrfs: seperate out btrfs_block_rsv_check out into 2 different functionsJosef Bacik1-2/+2
Currently btrfs_block_rsv_check does 2 things, it will either refill a block reserve like in the truncate or refill case, or it will check to see if there is enough space in the global reserve and possibly refill it. However because of overcommit we could be well overcommitting ourselves just to try and refill the global reserve, when really we should just be committing the transaction. So breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill will try to reserve more metadata if it can and btrfs_block_rsv_check will not, it will only tell you if the factor of the total space is still reserved. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: reserve some space for an orphan item when unlinkingJosef Bacik1-1/+8
In __unlink_start_trans() if we don't have enough room for a reservation we will check to see if the unlink will free up space. If it does that's great, but we will still could add an orphan item, so we need to reserve enough space to add the orphan item. Do this and migrate the space the global reserve so it all works out right. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: fix the amount of space reserved for unlinkJosef Bacik1-1/+10
Our unlink reservations were a bit much, we were reserving 10 and I only count 8 possible items we're touching, so comment what we're reserving for and fix the count value. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: inline checksums into the disk free space cacheJosef Bacik1-5/+5
Yeah yeah I know this is how we used to do it and then I changed it, but damnit I'm changing it back. The fact is that writing out checksums will modify metadata, which could cause us to dirty a block group we've already written out, so we have to truncate it and all of it's checksums and re-write it which will write new checksums which could dirty a blockg roup that has already been written and you see where I'm going with this? This can cause unmount or really anything that depends on a transaction to commit to take it's sweet damned time to happen. So go back to the way it was, only this time we're specifically setting NODATACOW because we can't go through the COW pathway anyway and we're doing our own built-in cow'ing by truncating the free space cache. The other new thing is once we truncate the old cache and preallocate the new space, we don't need to do that song and dance at all for the rest of the transaction, we can just overwrite the existing space with the new cache if the block group changes for whatever reason, and the NODATACOW will let us do this fine. So keep track of which transaction we last cleared our cache in and if we cleared it in this transaction just say we're all setup and carry on. This survives xfstests and stress.sh. The inode cache will continue to use the normal csum infrastructure since it only gets written once and there will be no more modifications to the fs tree in a transaction commit. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: break out of orphan cleanup if we can't make progressJosef Bacik1-0/+11
I noticed while running xfstests 83 that if we didn't have enough space to delete our inode the orphan cleanup would just loop. This is because it keeps finding the same orphan item and keeps trying to kill it but can't because we don't get an error back from iput for deleting the inode. So keep track of the last guy we tried to kill, if it's the same as the one we're trying to kill currently we know we are having problems and can just error out. I don't have a way to test this so look hard and make sure it's right. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: use the global reserve as a backup for deleting inodesJosef Bacik1-1/+11
Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up with the mixed block group stuff. Because of this we can run into a situation where we don't have enough space to delete inodes, or even worse we can't free the inodes when we next mount the fs which causes the orphan code to lose its mind. So if we fail to make our reservation, steal from the global reserve. The global reserve will end up taking up the entire rest of the free space on the fs in this worst case so there really is no other option. With this patch test 83 doesn't freak out. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: fix orphan cleanup regressionJosef Bacik1-19/+17
In fixing how we deal with bad inodes, we had a regression in the orphan cleanup code, since it expects to get a bad inode back. So fix it to deal with getting -ESTALE back by deleting the orphan item manually and moving on. Thanks, Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: use the inode's mapping mask for allocating pagesJosef Bacik1-1/+2
Johannes pointed out we were allocating only kernel pages for doing writes, which is kind of a big deal if you are on 32bit and have more than a gig of ram. So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we don't re-enter. Thanks, Reported-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: stop passing a trans handle all around the reservation codeJosef Bacik1-2/+2
The only thing that we need to have a trans handle for is in reserve_metadata_bytes and thats to know how much flushing we can do. So instead of passing it around, just check current->journal_info for a trans_handle so we know if we can commit a transaction to try and free up space or not. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: handle enospc accounting for free space inodesJosef Bacik1-1/+1
Since free space inodes now use normal checksumming we need to make sure to account for their metadata use. So reserve metadata space, and then if we fail to write out the metadata we can just release it, otherwise it will be freed up when the io completes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: set truncate block rsv's sizeJosef Bacik1-0/+2
While debugging a different issue I noticed that we were always reserving space when we tried to use our truncate block rsv's. This is because they didn't have a ->size value, so use_block_rsv just assumes there is nothing reserved and it does a reserve_metadata_bytes. This is because btrfs_check_block_rsv() doesn't actually add to the size of the block rsv. That seems to be the right thing to do so set ->size to the minimum truncate size we need, since we will always only refill to that size anyway, and this way everything works out correctly. Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_checkJosef Bacik1-3/+3
If you run xfstest 224 it you will get lots of messages about not being able to delete inodes and that they will be cleaned up next mount. This is because btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to flush, so if there was not enough space, it simply failed. But in truncate and evict case we could easily flush space to try and get enough space to do our work, so make btrfs_block_rsv_check take a flush argument to pass down to reserve_metadata_bytes. Now xfstests 224 runs fine without all those complaints. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: reduce the amount of space needed for truncatesJosef Bacik1-4/+4
With btrfs_truncate_inode_items we always return if we have to go to another leaf, which makes us do our reservation again. This means we will only ever modify one leaf at a time, so we only need 1 items worth of slack space. Also, since we are deleting we will not be creating nodes as we go down, if anything we'll be free'ing them as we merge them together, so make a different calculation for truncate which will only have the worst case useage of COW'ing the entire path down to the leaf. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>