Age | Commit message (Collapse) | Author | Files | Lines |
|
commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c upstream.
While running a stress test I got the following warning triggered:
[191627.672810] ------------[ cut here ]------------
[191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
(...)
[191627.701485] Call Trace:
[191627.702037] [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[191627.702992] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[191627.704091] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[191627.705380] [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
[191627.706637] [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
[191627.707789] [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
[191627.709155] [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
[191627.712444] [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
[191627.714162] [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
[191627.715887] [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
[191627.717287] [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
[191627.728865] [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
[191627.730045] [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
[191627.731256] [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
[191627.732661] [<ffffffff81061119>] process_one_work+0x24c/0x4ae
[191627.733822] [<ffffffff810615b0>] worker_thread+0x206/0x2c2
[191627.734857] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
[191627.736052] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
[191627.737349] [<ffffffff810669a6>] kthread+0xef/0xf7
[191627.738267] [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
[191627.739330] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
[191627.741976] [<ffffffff81465592>] ret_from_fork+0x42/0x70
[191627.743080] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
[191627.744206] ---[ end trace bbfddacb7aaada8d ]---
$ cat -n fs/btrfs/file.c
691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
(...)
758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
759 if (key.objectid > ino ||
760 key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
761 break;
762
763 fi = btrfs_item_ptr(leaf, path->slots[0],
764 struct btrfs_file_extent_item);
765 extent_type = btrfs_file_extent_type(leaf, fi);
766
767 if (extent_type == BTRFS_FILE_EXTENT_REG ||
768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
(...)
774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
(...)
778 } else {
779 WARN_ON(1);
780 extent_end = search_start;
781 }
(...)
This happened because the item we were processing did not match a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen as exemplified below.
For example, consider the following scenario where we're using the
NO_HOLES feature and we have the following two neighbour leafs:
Leaf X (has N items) Leaf Y
[ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ]
slot N - 2 slot N - 1 slot 0
Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:
CPU 1 CPU 2
btrfs_finish_ordered_io()
insert_reserved_file_extent()
__btrfs_drop_extents()
Searches for the key
(257 EXTENT_DATA 4096) through
btrfs_lookup_file_extent()
Key not found and we get a path where
path->nodes[0] == leaf X and
path->slots[0] == N
Because path->slots[0] is >=
btrfs_header_nritems(leaf X), we call
btrfs_next_leaf()
btrfs_next_leaf() releases the path
inserts key
(257 INODE_REF 4096)
at the end of leaf X,
leaf X now has N + 1 keys,
and the new key is at
slot N
btrfs_next_leaf() searches for
key (257 INODE_REF 256), with
path->keep_locks set to 1,
because it was the last key it
saw in leaf X
finds it in leaf X again and
notices it's no longer the last
key of the leaf, so it returns 0
with path->nodes[0] == leaf X and
path->slots[0] == N (which is now
< btrfs_header_nritems(leaf X)),
pointing to the new key
(257 INODE_REF 4096)
__btrfs_drop_extents() casts the
item at path->nodes[0], slot
path->slots[0], to a struct
btrfs_file_extent_item - it does
not skip keys for the target
inode with a type less than
BTRFS_EXTENT_DATA_KEY
(BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
sees a bogus value for the type
field triggering the WARN_ON in
the trace shown above, and sets
extent_end = search_start (4096)
does the if-then-else logic to
fixup 0 length extent items created
by a past bug from hole punching:
if (extent_end == key.offset &&
extent_end >= search_start)
goto delete_extent_item;
that evaluates to true and it ends
up deleting the key pointed to by
path->slots[0], (257 INODE_REF 4096),
from leaf X
The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).
So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never casted to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 3a8b36f378060d20062a0918e99fae39ff077bf0 upstream.
When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.
Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:
1. fs_info->last_trans_committed == N - 1 and current transaction is
transaction N (fs_info->generation == N);
2. do a buffered write;
3. fsync our inode, this clears our inode's full sync flag, starts
an ordered extent and waits for it to complete - when it completes
at btrfs_finish_ordered_io(), the inode's last_trans is set to the
value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
btrfs_set_inode_last_trans);
4. transaction N is committed, so fs_info->last_trans_committed is now
set to the value N and fs_info->generation remains with the value N;
5. do another buffered write, when this happens btrfs_file_write_iter
sets our inode's last_trans to the value N + 1 (that is
fs_info->generation + 1 == N + 1);
6. transaction N + 1 is started and fs_info->generation now has the
value N + 1;
7. transaction N + 1 is committed, so fs_info->last_trans_committed
is set to the value N + 1;
8. fsync our inode - because it doesn't have the full sync flag set,
we only start the ordered extent, we don't wait for it to complete
(only in a later phase) therefore its last_trans field has the
value N + 1 set previously by btrfs_file_write_iter(), and so we
have:
inode->last_trans <= fs_info->last_trans_committed
(N + 1) (N + 1)
Which made us not log the last buffered write and exit the fsync
handler immediately, returning success (0) to user space and resulting
in data loss after a crash.
This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our main test file 'foo', the one we check for data loss.
# By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
# bit from its flags (btrfs inode specific flags).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
-c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
# Now create one other file and 2 directories. We will move this second file
# from one directory to the other later because it forces btrfs to commit its
# currently open transaction if we fsync the old parent directory. This is
# necessary to trigger the data loss bug that affected btrfs.
mkdir $SCRATCH_MNT/testdir_1
touch $SCRATCH_MNT/testdir_1/bar
mkdir $SCRATCH_MNT/testdir_2
# Make sure everything is durably persisted.
sync
# Write more 8Kb of data to our file.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
# Move our 'bar' file into a new directory.
mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
# Fsync our first directory. Because it had a file moved into some other
# directory, this made btrfs commit the currently open transaction. This is
# a condition necessary to trigger the data loss bug.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
# Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
# data we wrote previously to be persisted and available if a crash happens.
# This did not happen with btrfs, because of the transaction commit that
# happened when we fsynced the parent directory.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now check that all data we wrote before are available.
echo "File content after log replay:"
od -t x1 $SCRATCH_MNT/foo
status=0
exit
The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
*
0040000
Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 8192/8192 bytes at offset 8192
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.
Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:
1. write to file
2. fsync file
3. fsync file
|--> btrfs_inode_in_log() returns true and it set last_trans to 0
4. write to file
|--> btrfs_file_write_iter() no longers sets last_trans, so it
remained with a value of 0
5. fsync
|--> inode->last_trans == 0, so it bails out without logging the
second write
A test case for xfstests will be sent soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
possible
commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)
The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.
samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
commit 9e8c2af96e0d2d5fe298dd796fb6bc16e888a48d upstream.
... it does that itself (via kmap_atomic())
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"These are mostly bug fixes and a two small performance fixes. The
most important of the bunch are Josef's fix for a snapshotting
regression and Mark's update to fix compile problems on arm"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: create the uuid tree on remount rw
btrfs: change extent-same to copy entire argument struct
Btrfs: dir_inode_operations should use btrfs_update_time also
btrfs: Add btrfs: prefix to kernel log output
btrfs: refuse to remount read-write after abort
Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0
Btrfs: don't leak transaction in btrfs_sync_file()
Btrfs: add the missing mutex unlock in write_all_supers()
Btrfs: iput inode on allocation failure
Btrfs: remove space_info->reservation_progress
Btrfs: kill delay_iput arg to the wait_ordered functions
Btrfs: fix worst case calculator for space usage
Revert "Btrfs: rework the overcommit logic to be based on the total size"
Btrfs: improve replacing nocow extents
Btrfs: drop dir i_size when adding new names on replay
Btrfs: replay dir_index items before other items
Btrfs: check roots last log commit when checking if an inode has been logged
Btrfs: actually log directory we are fsync()'ing
Btrfs: actually limit the size of delalloc range
Btrfs: allocate the free space by the existed max extent size when ENOSPC
...
|
|
In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns
a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would
return without ending/freeing the transaction.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
"This is against 3.11-rc7, but was pulled and tested against your tree
as of yesterday. We do have two small incrementals queued up, but I
wanted to get this bunch out the door before I hop on an airplane.
This is a fairly large batch of fixes, performance improvements, and
cleanups from the usual Btrfs suspects.
We've included Stefan Behren's work to index subvolume UUIDs, which is
targeted at speeding up send/receive with many subvolumes or snapshots
in place. It closes a long standing performance issue that was built
in to the disk format.
Mark Fasheh's offline dedup work is also here. In this case offline
means the FS is mounted and active, but the dedup work is not done
inline during file IO. This is a building block where utilities are
able to ask the FS to dedup a series of extents. The kernel takes
care of verifying the data involved really is the same. Today this
involves reading both extents, but we'll continue to evolve the
patches"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
Btrfs: optimize key searches in btrfs_search_slot
Btrfs: don't use an async starter for most of our workers
Btrfs: only update disk_i_size as we remove extents
Btrfs: fix deadlock in uuid scan kthread
Btrfs: stop refusing the relocation of chunk 0
Btrfs: fix memory leak of uuid_root in free_fs_info
btrfs: reuse kbasename helper
btrfs: return btrfs error code for dev excl ops err
Btrfs: allow partial ordered extent completion
Btrfs: convert all bug_ons in free-space-cache.c
Btrfs: add support for asserts
Btrfs: adjust the fs_devices->missing count on unmount
Btrf: cleanup: don't check for root_refs == 0 twice
Btrfs: fix for patch "cleanup: don't check the same thing twice"
Btrfs: get rid of one BUG() in write_all_supers()
Btrfs: allocate prelim_ref with a slab allocater
Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC
Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
Btrfs: fix race between removing a dev and writing sbs
Btrfs: remove ourselves from the cluster list under lock
...
|
|
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in three more places.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
I noticed while looking at a deadlock that we are always starting a transaction
in cow_file_range(). This isn't really needed since we only need a transaction
if we are doing an inline extent, or if the allocator needs to allocate a chunk.
So push down all the transaction start stuff to be closer to where we actually
need a transaction in all of these cases. This will hopefully reduce our write
latency when we are committing often. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We can end up with inodes on the auto defrag list that exist on roots that are
going to be deleted. This is extra work we don't need to do, so just bail if
our root has 0 root refs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
I noticed while running multi-threaded fsync tests that sometimes fsck would
complain about an improper gap. This happens because we fail to add a hole
extent to the file, which was happening when we'd split a hole EM because
btrfs_drop_extent_cache was just discarding the whole em instead of splitting
it. So this patch fixes this by allowing us to split a hole em properly, which
means that added holes actually get logged properly and we no longer see this
fsck error. Thankfully we're tolerant of these sort of problems so a user would
not see any adverse effects of this bug, other than fsck complaining. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
"These are the usual mixture of bugs, cleanups and performance fixes.
Miao has some really nice tuning of our crc code as well as our
transaction commits.
Josef is peeling off more and more problems related to early enospc,
and has a number of important bug fixes in here too"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits)
Btrfs: wait ordered range before doing direct io
Btrfs: only do the tree_mod_log_free_eb if this is our last ref
Btrfs: hold the tree mod lock in __tree_mod_log_rewind
Btrfs: make backref walking code handle skinny metadata
Btrfs: fix crash regarding to ulist_add_merge
Btrfs: fix several potential problems in copy_nocow_pages_for_inode
Btrfs: cleanup the code of copy_nocow_pages_for_inode()
Btrfs: fix oops when recovering the file data by scrub function
Btrfs: make the chunk allocator completely tree lockless
Btrfs: cleanup orphaned root orphan item
Btrfs: fix wrong mirror number tuning
Btrfs: cleanup redundant code in btrfs_submit_direct()
Btrfs: remove btrfs_sector_sum structure
Btrfs: check if we can nocow if we don't have data space
Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
Btrfs: use a percpu to keep track of possibly pinned bytes
Btrfs: check for actual acls rather than just xattrs when caching no acl
Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
Btrfs: optimize reada_for_balance
Btrfs: optimize read_block_for_search
...
|
|
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().
To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.
Thanks Dave Chinner for this suggestion.
[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write. This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue. With this patch we now pass xfstests
generic/274. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
This has plagued us forever and I'm so over working around it. When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point. The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things. This also tends to bite the orphan
cleanup stuff too which keeps people from mounting. To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data. This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in six places.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
"These are mostly fixes. The biggest exceptions are Josef's skinny
extents and Jan Schmidt's code to rebuild our quota indexes if they
get out of sync (or you enable quotas on an existing filesystem).
The skinny extents are off by default because they are a new variation
on the extent allocation tree format. btrfstune -x enables them, and
the new format makes the extent allocation tree about 30% smaller.
I rebased this a few days ago to rework Dave Sterba's crc checks on
the super block, but almost all of these go back to rc6, since I
though 3.9 was due any minute.
The biggest missing fix is the tracepoint bug that was hit late in
3.9. I ran into problems with that in overnight testing and I'm still
tracking it down. I'll definitely have that fixed for rc2."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
Btrfs: allow superblock mismatch from older mkfs
btrfs: enhance superblock checks
btrfs: fix misleading variable name for flags
btrfs: use unsigned long type for extent state bits
Btrfs: improve the loop of scrub_stripe
btrfs: read entire device info under lock
btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
btrfs: handle errors returned from get_tree_block_key
btrfs: make static code static & remove dead code
Btrfs: deal with errors in write_dev_supers
Btrfs: remove almost all of the BUG()'s from tree-log.c
Btrfs: deal with free space cache errors while replaying log
Btrfs: automatic rescan after "quota enable" command
Btrfs: rescan for qgroups
Btrfs: split btrfs_qgroup_account_ref into four functions
Btrfs: allocate new chunks if the space is not enough for global rsv
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
btrfs: move leak debug code to functions
Btrfs: return free space in cow error path
Btrfs: set UUID in root_item for created trees
...
|
|
Faster kernel compiles by way of fewer unnecessary includes.
[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.
removed functions:
btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()
btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.
ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery. I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem. The
problem is if your application does something like this
[prealloc][prealloc][prealloc]
the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents. So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space. If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later. The data and such will still work,
but everything else is broken. This patch fixes this by not allowing extents
that are on the modified list to be merged. This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree. So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents. With this patch the testcase I've created no
longer creates a bogus file system after replaying the log. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
When logging changed extents I was logging ram_bytes as the current length,
which isn't correct, it's supposed to be the ram bytes of the original extent.
This is for compression where even if we split the extent we need to know the
ram bytes so when we uncompress the extent we know how big it will be. This was
still working out right with compression for some reason but I think we were
getting lucky. It was definitely off for prealloc which is why I noticed it,
btrfsck was complaining about it. With this patch btrfsck no longer complains
after a log replay. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"We've had a busy two weeks of bug fixing. The biggest patches in here
are some long standing early-enospc problems (Josef) and a very old
race where compression and mmap combine forces to lose writes (me).
I'm fairly sure the mmap bug goes all the way back to the introduction
of the compression code, which is proof that fsx doesn't trigger every
possible mmap corner after all.
I'm sure you'll notice one of these is from this morning, it's a small
and isolated use-after-free fix in our scrub error reporting. I
double checked it here."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: don't drop path when printing out tree errors in scrub
Btrfs: fix wrong return value of btrfs_lookup_csum()
Btrfs: fix wrong reservation of csums
Btrfs: fix double free in the btrfs_qgroup_account_ref()
Btrfs: limit the global reserve to 512mb
Btrfs: hold the ordered operations mutex when waiting on ordered extents
Btrfs: fix space accounting for unlink and rename
Btrfs: fix space leak when we fail to reserve metadata space
Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes
Btrfs: fix race between mmap writes and compression
Btrfs: fix memory leak in btrfs_create_tree()
Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
Btrfs: fix missing qgroup reservation before fallocating
Btrfs: handle a bogus chunk tree nicely
Btrfs: update to use fs_state bit
|
|
Steps to reproduce:
mkfs.btrfs <disk>
mount <disk> <mnt>
btrfs quota enable <mnt>
btrfs sub create <mnt>/subv
btrfs qgroup limit 10M <mnt>/subv
fallocate --length 20M <mnt>/subv/data
For the above example, fallocating will return successfully which
is not expected, we try to fix it by doing qgroup reservation before
fallocating.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"Eric's rcu barrier patch fixes a long standing problem with our
unmount code hanging on to devices in workqueue helpers. Liu Bo
nailed down a difficult assertion for in-memory extent mappings."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix warning of free_extent_map
Btrfs: fix warning when creating snapshots
Btrfs: return as soon as possible when edquot happens
Btrfs: return EIO if we have extent tree corruption
btrfs: use rcu_barrier() to wait for bdev puts at unmount
Btrfs: remove btrfs_try_spin_lock
Btrfs: get better concurrency for snapshot-aware defrag work
|
|
Users report that an extent map's list is still linked when it's actually
going to be freed from cache.
The story is that
a) when we're going to drop an extent map and may split this large one into
smaller ems, and if this large one is flagged as EXTENT_FLAG_LOGGING which means
that it's on the list to be logged, then the smaller ems split from it will also
be flagged as EXTENT_FLAG_LOGGING, and this is _not_ expected.
b) we'll keep ems from unlinking the list and freeing when they are flagged with
EXTENT_FLAG_LOGGING, because the log code holds one reference.
The end result is the warning, but the truth is that we set the flag
EXTENT_FLAG_LOGGING only during fsync.
So clear flag EXTENT_FLAG_LOGGING for extent maps split from a large one.
Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reported-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
"The biggest feature in the pull is the new (and still experimental)
raid56 code that David Woodhouse started long ago. I'm still working
on the parity logging setup that will avoid inconsistent parity after
a crash, so this is only for testing right now. But, I'd really like
to get it out to a broader audience to hammer out any performance
issues or other problems.
scrub does not yet correct errors on raid5/6 either.
Josef has another pass at fsync performance. The big change here is
to combine waiting for metadata with waiting for data, which is a big
latency win. It is also step one toward using atomics from the
hardware during a commit.
Mark Fasheh has a new way to use btrfs send/receive to send only the
metadata changes. SUSE is using this to make snapper more efficient
at finding changes between snapshosts.
Snapshot-aware defrag is also included.
Otherwise we have a large number of fixes and cleanups. Eric Sandeen
wins the award for removing the most lines, and I'm hoping we steal
this idea from XFS over and over again."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
btrfs: fixup/remove module.h usage as required
Btrfs: delete inline extents when we find them during logging
btrfs: try harder to allocate raid56 stripe cache
Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic
Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails
Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree
Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails
Btrfs: fix missing deleted items in btrfs_clean_quota_tree
btrfs: use only inline_pages from extent buffer
Btrfs: fix wrong reserved space when deleting a snapshot/subvolume
Btrfs: fix wrong reserved space in qgroup during snap/subv creation
Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot
btrfs: remove a printk from scan_one_device
Btrfs: fix NULL pointer after aborting a transaction
Btrfs: fix memory leak of log roots
Btrfs: copy everything if we've created an inline extent
btrfs: cleanup for open-coded alignment
Btrfs: do not change inode flags in rename
Btrfs: use reserved space for creating a snapshot
clear chunk_alloc flag on retryable failure
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
|
|
Though most of the btrfs codes are using ALIGN macro for page alignment,
there are still some codes using open-coded alignment like the
following:
------
u64 mask = ((u64)root->stripesize - 1);
u64 ret = (val + mask) & ~mask;
------
Or even hidden one:
------
num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------
Sometimes these open-coded alignment is not so easy to understand for
newbie like me.
This commit changes the open-coded alignment to the ALIGN macro for a
better readability.
Also there is a previous patch from David Sterba with similar changes,
but the patch is for 3.2 kernel and seems not merged.
http://www.spinics.net/lists/linux-btrfs/msg12747.html
Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
If we remount the fs to close the auto defragment or make the fs R/O,
we should stop the auto defragment.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Conflicts:
fs/btrfs/disk-io.c
|
|
Miao made the ordered operations stuff run async, which introduced a
deadlock where we could get somebody (sync) racing in and committing the
transaction while a commit was already happening. The new committer would
try and flush ordered operations which would hang waiting for the commit to
finish because it is done asynchronously and no longer inherits the callers
trans handle. To fix this we need to make the ordered operations list a per
transaction list. We can get new inodes added to the ordered operation list
by truncating them and then having another process writing to them, so this
makes it so that anybody trying to add an ordered operation _must_ start a
transaction in order to add itself to the list, which will keep new inodes
from getting added to the ordered operations list after we start committing.
This should fix the deadlock and also keeps us from doing a lot more work
than we need to during commit. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
Task0 - CPU0 Task1 - CPU1
mov %fs_state rax
or $0x1 rax
mov %fs_state rax
or $0x2 rax
mov rax %fs_state
mov rax %fs_state
The expected value is 3, but in fact, it is 2.
Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.
Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed. So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"We've got corner cases for updating i_size that ceph was hitting,
error handling for quotas when we run out of space, a very subtle
snapshot deletion race, a crash while removing devices, and one
deadlock between subvolume creation and the sb_internal code (thanks
lockdep)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: move d_instantiate outside the transaction during mksubvol
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
Btrfs: fix possible stale data exposure
Btrfs: fix missing i_size update
Btrfs: fix race between snapshot deletion and getting inode
Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
Btrfs: do not merge logged extents if we've removed them from the tree
btrfs: don't try to notify udev about missing devices
|
|
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().
And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.
Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.
(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0). So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.
Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
decrease ->sync_writers, because we have not increased it. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"It turns out that we had two crc bugs when running fsx-linux in a
loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down. Miao also has a new OOM fix in this v2 pull as well.
Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.
Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.
Arne's patches address problems with subvolume quotas. If the user
destroys quota groups incorrectly the FS will refuse to mount.
The rest are smaller fixes and plugs for memory leaks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
Btrfs: fix repeated delalloc work allocation
Btrfs: fix wrong max device number for single profile
Btrfs: fix missed transaction->aborted check
Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
Btrfs: put csums on the right ordered extent
Btrfs: use right range to find checksum for compressed extents
Btrfs: fix panic when recovering tree log
Btrfs: do not allow logged extents to be merged or removed
Btrfs: fix a regression in balance usage filter
Btrfs: prevent qgroup destroy when there are still relations
Btrfs: ignore orphan qgroup relations
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Btrfs: fix unlock order in btrfs_ioctl_resize
Btrfs: fix "mutually exclusive op is running" error code
Btrfs: bring back balance pause/resume logic
btrfs: update timestamps on truncate()
btrfs: fix btrfs_cont_expand() freeing IS_ERR em
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
Btrfs: fix off-by-one in lseek
...
|
|
xfstests case 285 complains.
It it because btrfs did not try to find unwritten delalloc
bytes(only dirty pages, not yet writeback) behind prealloc
extents, it ends up finding nothing while we're with SEEK_DATA.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
Lock end is inclusive.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
"A big set of fixes and features.
In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.
Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.
Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.
I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.
Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
...
|
|
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This starts a transaction and dirties the inode everytime we call it, which
is super expensive if you have a write heavy workload. We will be updating
the inode when the IO completes and we reserve the space for the inode
update when we reserve space for the write, so there is no chance of loss of
information or enospc issues. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree. So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
If we've written to a prealloc extent we need to know the original block len
for the extent. We can't figure this out currently since ->block_len is
just set to the extent length. So introduce ->orig_block_len so that we
know how many bytes were in the original extent for proper extent logging
that future patches will need. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|