summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2008-09-25Btrfs: Search data ordered extents first for checksums on readChris Mason4-20/+31
Checksum items are not inserted into the tree until all of the io from a given extent is complete. This means one dirty page from an extent may be written, freed, and then read again before the entire extent is on disk and the checksum item is inserted. The checksums themselves are stored in the ordered extent so they can be inserted in bulk when IO is complete. On read, if a checksum item isn't found, the ordered extents were being searched for a checksum record. This all worked most of the time, but the checksum insertion code tries to reduce the number of tree operations by pre-inserting checksum items based on i_size and a few other factors. This means the read code might find a checksum item that hasn't yet really been filled in. This commit changes things to check the ordered extents first and only dive into the btree if nothing was found. This removes the need for extra locking and is more reliable. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix 32 bit compiles by using an unsigned long byte count in the ↵Chris Mason1-1/+2
ordered extent The ordered extents have to fit in memory, so an unsigned long is sufficient. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Take the csum mutex while reading checksumsChris Mason5-5/+12
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: alloc_mutex latency reductionChris Mason2-20/+81
This releases the alloc_mutex in a few places that hold it for over long operations. btrfs_lookup_block_group is changed so that it doesn't need the mutex at all. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add some conditional schedules near the alloc_mutexChris Mason1-0/+2
This helps prevent stalls, especially while the snapshot cleaner is running hard Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use mutex_lock_nested for tree lockingChris Mason2-2/+6
Lockdep has the notion of locking subclasses so that you can identify locks you expect to be taken after other locks of the same class. This changes the per-extent buffer btree locking routines to use a subclass based on the level in the tree. Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs max level is also 8. So when lockdep is on, use a lower max level. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix some data=ordered related data corruptionsChris Mason10-83/+140
Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first. But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage. For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore. This commit also replaces do_sync_mapping_range with a local copy of the related functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use a mutex in the extent buffer for tree block lockingChris Mason4-13/+17
This replaces the use of the page cache lock bit for locking, which wasn't suitable for block size < page size and couldn't be used recursively. The mutexes alone don't fix either problem, but they are the first step. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Index extent buffers in an rbtreeChris Mason4-220/+129
Before, extent buffers were a temporary object, meant to map a number of pages at once and collect operations on them. But, a few extra fields have crept in, and they are also the best place to store a per-tree block lock field as well. This commit puts the extent buffers into an rbtree, and ensures a single extent buffer for each tree block. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Data ordered fixesChris Mason4-10/+43
* In btrfs_delete_inode, wait for ordered extents after calling truncate_inode_pages. This is much faster, and more correct * Properly clear our the PageChecked bit everywhere we redirty the page. * Change the writepage fixup handler to lock the page range and check to see if an ordered extent had been inserted since the improperly dirtied page was discovered * Wait for ordered extents outside the transaction. This isn't required for locking rules but does improve transaction latencies * Reduce contention on the alloc_mutex by dropping it while incrementing refs on a node/leaf and while dropping refs on a leaf. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Fix btrfs_wait_ordered_extent_range to properly waitChris Mason3-25/+49
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Keep extent mappings in ram until pending ordered extents are doneChris Mason6-20/+48
It was possible for stale mappings from disk to be used instead of the new pending ordered extent. This adds a flag to the extent map struct to keep it pinned until the pending ordered extent is actually on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Don't allow releasepage to succeed if EXTENT_ORDERED is setChris Mason2-6/+11
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Handle data checksumming on bios that span multiple ordered extentsChris Mason5-31/+69
Data checksumming is done right before the bio is sent down the IO stack, which means a single bio might span more than one ordered extent. In this case, the checksumming data is split between two ordered extents. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Cleanup and comment ordered-data.cChris Mason3-70/+121
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Force caching of metadata block groups on mount to avoid deadlockChris Mason1-0/+5
This is a temporary change to avoid deadlocks until the extent tree locking is fixed up. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs_next_leaf: do readahead when skip_locking is turned onChris Mason1-1/+2
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Add a per-inode lock around btrfs_drop_extentsChris Mason4-0/+21
btrfs_drop_extents is always called with a range lock held on the inode. But, it may operate on extents outside that range as it drops and splits them. This patch adds a per-inode mutex that is held while calling btrfs_drop_extents and while inserting new extents into the tree. It prevents races from two procs working against adjacent ranges in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Don't pin pages in ram until the entire ordered extent is on disk.Chris Mason4-19/+69
Checksum items are not inserted until the entire ordered extent is on disk, but individual pages might be clean and available for reclaim long before the whole extent is on disk. In order to allow those pages to be freed, we need to be able to search the list of ordered extents to find the checksum that is going to be inserted in the tree. This way if the page needs to be read back in before the checksums are in the btree, we'll be able to verify the checksum on the page. This commit adds the ability to search the pending ordered extents for a given offset in the file, and changes btrfs_releasepage to allow ordered pages to be freed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs_start_transaction: wait for commits in progress to finishChris Mason6-9/+51
btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Update on disk i_size only after pending ordered extents are doneChris Mason5-11/+119
This changes the ordered data code to update i_size after the extent is on disk. An on disk i_size is maintained in the in-memory btrfs inode structures, and this is updated as extents finish. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Use async helpers to deal with pages that have been improperly dirtiedChris Mason6-9/+106
Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly setup for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: New data=ordered implementationChris Mason14-502/+910
The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated an inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, The previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Drop some verbose printksChris Mason2-4/+0
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add locking around volume management (device add/remove/balance)Chris Mason6-40/+103
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix deadlock while searching for dead roots on mountChris Mason1-1/+9
btrfs_find_dead_roots called btrfs_read_fs_root_no_radix, which means we end up calling btrfs_search_slot with a path already held. The fix is to remember the key inside btrfs_find_dead_roots and drop the path. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Reduce contention on the root nodeChris Mason2-6/+21
This calls unlock_up sooner in btrfs_search_slot in order to decrease the amount of work done with the higher level tree locks held. Also, it changes btrfs_tree_lock to spin for a big against the page lock before scheduling. This makes a big difference in context switch rate under highly contended workloads. Longer term, a better locking structure is needed than the page lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Online btree defragmentation fixesChris Mason9-129/+190
The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a per-inode csum mutex to avoid races creating csum itemsChris Mason6-6/+21
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Change find_extent_buffer to use TestSetPageLockedChris Mason2-3/+6
This makes it possible for callers to check for extent_buffers in cache without deadlocking against any btree locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add btree locking to the tree defragmentation codeChris Mason4-201/+93
The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Replace the transaction work queue with kthreadsChris Mason8-118/+136
This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Add btrfs_end_transaction_throttle to force writers to wait for pending commitsChris Mason7-59/+55
The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.Chris Mason3-9/+23
This lowers the impact of snapshot deletion on the rest of the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a skip_locking parameter to struct path, and make various funcs ↵Chris Mason3-14/+25
honor it Allocations may need to read in block groups from the extent allocation tree, which will require a tree search and take locks on the extent allocation tree. But, those locks might already be held in other places, leading to deadlocks. Since the alloc_mutex serializes everything right now, it is safe to skip the btree locking while caching block groups. A better fix will be to either create a recursive lock or find a way to back off existing locks while caching block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Fix btrfs_next_leaf to check for new items after dropping locksChris Mason1-0/+7
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Fix btrfs_del_ordered_inode to allow forcing the drop during unlinksChris Mason5-13/+19
This allows us to delete an unlinked inode with dirty pages from the list instead of forcing commit to write these out before deleting the inode. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Drop locks in btrfs_search_slot when reading a tree block.Chris Mason4-39/+38
One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Replace the big fs_mutex with a collection of other locksChris Mason12-165/+101
Extent alloctions are still protected by a large alloc_mutex. Objectid allocations are covered by a objectid mutex Other btree operations are protected by a lock on individual btree nodes Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Start btree concurrency work.Chris Mason12-214/+579
The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a thread pool just for submit_bioChris Mason3-1/+10
If a bio submission is after a lock holder waiting for the bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25BTRFS_IOC_TRANS_START should be privileguedChristoph Hellwig1-0/+3
As mentioned in the comment next to it btrfs_ioctl_trans_start can do bad damage to filesystems and thus should be limited to privilegued users. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: split out ioctl.cChristoph Hellwig4-729/+796
Split the ioctl handling out of inode.c into a file of it's own. Also fix up checkpatch.pl warnings for the moved code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: kerneldoc comments for extent_map.cChristoph Hellwig1-12/+49
Add kerneldoc comments for all exported functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add a mount option to control worker thread pool sizeChris Mason3-16/+28
mount -o thread_pool_size changes the default, which is min(num_cpus + 2, 8). Larger thread pools would make more sense on very large disk arrays. This mount option controls the max size of each thread pool. There are multiple thread pools, so the total worker count will be larger than the mount option. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Worker thread optimizationsChris Mason2-34/+73
This changes the worker thread pool to maintain a list of idle threads, avoiding a complex search for a good thread to wake up. Threads have two states: idle - we try to reuse the last thread used in hopes of improving the batching ratios busy - each time a new work item is added to a busy task, the task is rotated to the end of the line. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add backport for the kthread work on kernels older than 2.6.20Chris Mason1-1/+8
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Fix mount -o max_inline=0Chris Mason1-2/+5
max_inline=0 used to force the max_inline size to one sector instead. Now it properly disables inline data items, while still being able to read any that happen to exist on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25Btrfs: Add async worker threads for pre and post IO checksummingChris Mason8-132/+626
Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25btrfs: allow scanning multiple devices during mountChristoph Hellwig1-5/+16
Allows to specify one or multiple device=/dev/foo options during mount so that ioctls on the control device can be avoided. Especially useful when trying to mount a multi-device setup as root. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>