kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message	Author	Files	Lines
2008-09-25	Btrfs: Search data ordered extents first for checksums on read	Chris Mason	4	-20/+31
	Checksum items are not inserted into the tree until all of the io from a given extent is complete. This means one dirty page from an extent may be written, freed, and then read again before the entire extent is on disk and the checksum item is inserted. The checksums themselves are stored in the ordered extent so they can be inserted in bulk when IO is complete. On read, if a checksum item isn't found, the ordered extents were being searched for a checksum record. This all worked most of the time, but the checksum insertion code tries to reduce the number of tree operations by pre-inserting checksum items based on i_size and a few other factors. This means the read code might find a checksum item that hasn't yet really been filled in. This commit changes things to check the ordered extents first and only dive into the btree if nothing was found. This removes the need for extra locking and is more reliable. Signed-off-by: Chris Mason <chris.mason@oracle.com>
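	The lookup order this commit describes can be sketched as follows. This is only an illustration under stated assumptions: the record type, the flat arrays standing in for the ordered extent list and the btree, and every `_sketch` name are hypothetical and are not the btrfs data structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one checksum for the block at file_offset. */
struct csum_rec_sketch {
    uint64_t file_offset;
    uint32_t sum;
};

enum csum_source_sketch { CSUM_NONE, CSUM_FROM_ORDERED, CSUM_FROM_BTREE };

/* Linear scan; a flat array stands in for the real lookup structures. */
static const struct csum_rec_sketch *
find_rec_sketch(const struct csum_rec_sketch *recs, size_t n, uint64_t off)
{
    for (size_t i = 0; i < n; i++)
        if (recs[i].file_offset == off)
            return &recs[i];
    return NULL;
}

/*
 * Check the pending ordered extents first and fall back to the btree
 * only when nothing is found, so a pre-inserted but not yet filled-in
 * btree item is never preferred over the pending checksum.
 */
static enum csum_source_sketch
lookup_csum_sketch(const struct csum_rec_sketch *ordered, size_t n_ordered,
                   const struct csum_rec_sketch *btree, size_t n_btree,
                   uint64_t off, uint32_t *sum)
{
    const struct csum_rec_sketch *rec;

    assert(sum != NULL);
    rec = find_rec_sketch(ordered, n_ordered, off);
    if (rec) {
        *sum = rec->sum;
        return CSUM_FROM_ORDERED;
    }
    rec = find_rec_sketch(btree, n_btree, off);
    if (rec) {
        *sum = rec->sum;
        return CSUM_FROM_BTREE;
    }
    return CSUM_NONE;
}
```

	In this sketch a placeholder btree item with an empty checksum is shadowed whenever the same offset still has a pending ordered extent record.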
2008-09-25	Btrfs: Fix 32 bit compiles by using an unsigned long byte count in the ordered extent	Chris Mason	1	-1/+2
	The ordered extents have to fit in memory, so an unsigned long is sufficient. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Take the csum mutex while reading checksums	Chris Mason	5	-5/+12
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: alloc_mutex latency reduction	Chris Mason	2	-20/+81
	This releases the alloc_mutex in a few places that hold it across overly long operations. btrfs_lookup_block_group is changed so that it doesn't need the mutex at all. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Add some conditional schedules near the alloc_mutex	Chris Mason	1	-0/+2
	This helps prevent stalls, especially while the snapshot cleaner is running hard. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Use mutex_lock_nested for tree locking	Chris Mason	2	-2/+6
	Lockdep has the notion of locking subclasses so that you can identify locks you expect to be taken after other locks of the same class. This changes the per-extent buffer btree locking routines to use a subclass based on the level in the tree. Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs max level is also 8. So when lockdep is on, use a lower max level. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Fix some data=ordered related data corruptions	Chris Mason	10	-83/+140
	Stress testing was showing data checksum errors, most of which were caused by a lookup bug in the extent_map tree. The tree was caching the last pointer returned, and searches would check the last pointer first. But, search callers also expect the search to return the very first matching extent in the range, which wasn't always true with the last pointer usage. For now, the code to cache the last return value is just removed. It is easy to fix, but I think lookups are rare enough that it isn't required anymore. This commit also replaces do_sync_mapping_range with a local copy of the related functions. Signed-off-by: Chris Mason <chris.mason@oracle.com>
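	The lookup contract mentioned here, returning the very first extent that matches the range, can be illustrated with a small sketch. It assumes a sorted array in place of the extent_map tree; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical extent record covering [start, start + len). */
struct extent_sketch {
    uint64_t start;
    uint64_t len;
};

/*
 * Return the index of the first extent overlapping [start, start + len),
 * or -1 if none does. The array is assumed to be sorted by start, so the
 * first hit in scan order is also the lowest matching extent.
 */
static int first_overlap_sketch(const struct extent_sketch *ext, int n,
                                uint64_t start, uint64_t len)
{
    uint64_t end = start + len;

    assert(end >= start); /* no wraparound in the sketch */
    for (int i = 0; i < n; i++) {
        if (ext[i].start < end && ext[i].start + ext[i].len > start)
            return i;
    }
    return -1;
}
```

	A cached pointer to a later extent can also overlap the requested range, which is why checking the cache first could hand back a match that is not the first one.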
2008-09-25	Btrfs: Use a mutex in the extent buffer for tree block locking	Chris Mason	4	-13/+17
	This replaces the use of the page cache lock bit for locking, which wasn't suitable for block size < page size and couldn't be used recursively. The mutexes alone don't fix either problem, but they are the first step. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Index extent buffers in an rbtree	Chris Mason	4	-220/+129
	Before, extent buffers were a temporary object, meant to map a number of pages at once and collect operations on them. But, a few extra fields have crept in, and they are also the best place to store a per-tree block lock field as well. This commit puts the extent buffers into an rbtree, and ensures a single extent buffer for each tree block. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Data ordered fixes	Chris Mason	4	-10/+43
	* In btrfs_delete_inode, wait for ordered extents after calling truncate_inode_pages. This is much faster, and more correct * Properly clear out the PageChecked bit everywhere we redirty the page. * Change the writepage fixup handler to lock the page range and check to see if an ordered extent had been inserted since the improperly dirtied page was discovered * Wait for ordered extents outside the transaction. This isn't required for locking rules but does improve transaction latencies * Reduce contention on the alloc_mutex by dropping it while incrementing refs on a node/leaf and while dropping refs on a leaf. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Fix btrfs_wait_ordered_extent_range to properly wait	Chris Mason	3	-25/+49
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Keep extent mappings in ram until pending ordered extents are done	Chris Mason	6	-20/+48
	It was possible for stale mappings from disk to be used instead of the new pending ordered extent. This adds a flag to the extent map struct to keep it pinned until the pending ordered extent is actually on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Don't allow releasepage to succeed if EXTENT_ORDERED is set	Chris Mason	2	-6/+11
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Handle data checksumming on bios that span multiple ordered extents	Chris Mason	5	-31/+69
	Data checksumming is done right before the bio is sent down the IO stack, which means a single bio might span more than one ordered extent. In this case, the checksumming data is split between two ordered extents. Signed-off-by: Chris Mason <chris.mason@oracle.com>
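	One way to picture the split is to compute how many bytes of a bio fall inside each ordered extent. The helper below is a hypothetical sketch of that arithmetic only; it is not the btrfs checksumming code, and its name and parameters are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of the bio range [bio_start, bio_start + bio_len) that fall
 * inside the extent range [ext_start, ext_start + ext_len).
 */
static uint64_t overlap_bytes_sketch(uint64_t bio_start, uint64_t bio_len,
                                     uint64_t ext_start, uint64_t ext_len)
{
    uint64_t bio_end = bio_start + bio_len;
    uint64_t ext_end = ext_start + ext_len;
    uint64_t lo, hi;

    assert(bio_end >= bio_start && ext_end >= ext_start); /* no wraparound */
    lo = bio_start > ext_start ? bio_start : ext_start;
    hi = bio_end < ext_end ? bio_end : ext_end;
    return hi > lo ? hi - lo : 0;
}
```

	For a bio that straddles two adjacent extents, the two overlaps add up to the bio length, so each extent receives the checksums for its own share of the data.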
2008-09-25	Btrfs: Cleanup and comment ordered-data.c	Chris Mason	3	-70/+121
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Force caching of metadata block groups on mount to avoid deadlock	Chris Mason	1	-0/+5
	This is a temporary change to avoid deadlocks until the extent tree locking is fixed up. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	btrfs_next_leaf: do readahead when skip_locking is turned on	Chris Mason	1	-1/+2
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Add a per-inode lock around btrfs_drop_extents	Chris Mason	4	-0/+21
	btrfs_drop_extents is always called with a range lock held on the inode. But, it may operate on extents outside that range as it drops and splits them. This patch adds a per-inode mutex that is held while calling btrfs_drop_extents and while inserting new extents into the tree. It prevents races from two procs working against adjacent ranges in the tree. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Don't pin pages in ram until the entire ordered extent is on disk.	Chris Mason	4	-19/+69
	Checksum items are not inserted until the entire ordered extent is on disk, but individual pages might be clean and available for reclaim long before the whole extent is on disk. In order to allow those pages to be freed, we need to be able to search the list of ordered extents to find the checksum that is going to be inserted in the tree. This way if the page needs to be read back in before the checksums are in the btree, we'll be able to verify the checksum on the page. This commit adds the ability to search the pending ordered extents for a given offset in the file, and changes btrfs_releasepage to allow ordered pages to be freed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	btrfs_start_transaction: wait for commits in progress to finish	Chris Mason	6	-9/+51
	btrfs_commit_transaction has to loop waiting for any writers in the transaction to finish before it can proceed. btrfs_start_transaction should be polite and not join a transaction that is in the process of being finished off. There are a few places that can't wait, basically the ones doing IO that might be needed to finish the transaction. For them, btrfs_join_transaction is added. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Update on disk i_size only after pending ordered extents are done	Chris Mason	5	-11/+119
	This changes the ordered data code to update i_size after the extent is on disk. An on disk i_size is maintained in the in-memory btrfs inode structures, and this is updated as extents finish. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Use async helpers to deal with pages that have been improperly dirtied	Chris Mason	6	-9/+106
	Higher layers sometimes call set_page_dirty without asking the filesystem to help. This causes many problems for the data=ordered and cow code. This commit detects pages that haven't been properly setup for IO and kicks off an async helper to deal with them. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: New data=ordered implementation	Chris Mason	14	-502/+910
	The old data=ordered code would force commit to wait until all the data extents from the transaction were fully on disk. This introduced large latencies into the commit and stalled new writers in the transaction for a long time. The new code changes the way data allocations and extents work: * When delayed allocation is filled, data extents are reserved, and the extent bit EXTENT_ORDERED is set on the entire range of the extent. A struct btrfs_ordered_extent is allocated and inserted into a per-inode rbtree to track the pending extents. * As each page is written EXTENT_ORDERED is cleared on the bytes corresponding to that page. * When all of the bytes corresponding to a single struct btrfs_ordered_extent are written, the previously reserved extent is inserted into the FS btree and into the extent allocation trees. The checksums for the file data are also updated. Signed-off-by: Chris Mason <chris.mason@oracle.com>
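	The completion rule in the last bullet can be sketched with a simple byte counter. This is an assumption-laden illustration: the commit tracks progress with the EXTENT_ORDERED bit on byte ranges, whereas the sketch uses a hypothetical `bytes_left` field, and none of the names below are taken from the btrfs source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a pending ordered extent. */
struct ordered_extent_sketch {
    uint64_t file_offset; /* start of the extent in the file */
    uint64_t len;         /* total bytes covered by the extent */
    uint64_t bytes_left;  /* bytes not yet written */
};

static void ordered_extent_init_sketch(struct ordered_extent_sketch *oe,
                                       uint64_t file_offset, uint64_t len)
{
    oe->file_offset = file_offset;
    oe->len = len;
    oe->bytes_left = len;
}

/*
 * Account for one completed page write. Returns 1 when this write
 * finishes the whole extent, which is the point where the reserved
 * extent and its checksums would be inserted into the trees.
 */
static int ordered_extent_page_written_sketch(struct ordered_extent_sketch *oe,
                                              uint64_t bytes)
{
    assert(bytes <= oe->bytes_left);
    oe->bytes_left -= bytes;
    return oe->bytes_left == 0;
}
```

	Only the final page write reports completion, so the commit no longer has to wait for every data extent in the transaction at once.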
2008-09-25	Btrfs: Drop some verbose printks	Chris Mason	2	-4/+0
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Add locking around volume management (device add/remove/balance)	Chris Mason	6	-40/+103
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Fix deadlock while searching for dead roots on mount	Chris Mason	1	-1/+9
	btrfs_find_dead_roots called btrfs_read_fs_root_no_radix, which means we end up calling btrfs_search_slot with a path already held. The fix is to remember the key inside btrfs_find_dead_roots and drop the path. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Reduce contention on the root node	Chris Mason	2	-6/+21
	This calls unlock_up sooner in btrfs_search_slot in order to decrease the amount of work done with the higher level tree locks held. Also, it changes btrfs_tree_lock to spin for a bit against the page lock before scheduling. This makes a big difference in context switch rate under highly contended workloads. Longer term, a better locking structure is needed than the page lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Online btree defragmentation fixes	Chris Mason	9	-129/+190
	The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, since it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Add a per-inode csum mutex to avoid races creating csum items	Chris Mason	6	-6/+21
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Change find_extent_buffer to use TestSetPageLocked	Chris Mason	2	-3/+6
	This makes it possible for callers to check for extent_buffers in cache without deadlocking against any btree locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Add btree locking to the tree defragmentation code	Chris Mason	4	-201/+93
	The online btree defragger is simplified and rewritten to use standard btree searches instead of a walk up / down mechanism. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Replace the transaction work queue with kthreads	Chris Mason	8	-118/+136
	This creates one kthread for commits and one kthread for deleting old snapshots. All the work queues are removed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Add btrfs_end_transaction_throttle to force writers to wait for pending commits	Chris Mason	7	-59/+55
	The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all of which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.	Chris Mason	3	-9/+23
	This lowers the impact of snapshot deletion on the rest of the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it	Chris Mason	3	-14/+25
	Allocations may need to read in block groups from the extent allocation tree, which will require a tree search and take locks on the extent allocation tree. But, those locks might already be held in other places, leading to deadlocks. Since the alloc_mutex serializes everything right now, it is safe to skip the btree locking while caching block groups. A better fix will be to either create a recursive lock or find a way to back off existing locks while caching block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Fix btrfs_next_leaf to check for new items after dropping locks	Chris Mason	1	-0/+7
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks	Chris Mason	5	-13/+19
	This allows us to delete an unlinked inode with dirty pages from the list instead of forcing commit to write these out before deleting the inode. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Drop locks in btrfs_search_slot when reading a tree block.	Chris Mason	4	-39/+38
	One lock per btree block can make for significant congestion if everyone has to wait for IO at the high levels of the btree. This drops locks held by a path when doing reads during a tree search. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Replace the big fs_mutex with a collection of other locks	Chris Mason	12	-165/+101
	Extent allocations are still protected by a large alloc_mutex. Objectid allocations are covered by an objectid mutex. Other btree operations are protected by a lock on individual btree nodes. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Start btree concurrency work.	Chris Mason	12	-214/+579
	The allocation trees and the chunk trees are serialized via their own dedicated mutexes. This means allocation location is still not very fine grained. The main FS btree is protected by locks on each block in the btree. Locks are taken top / down, and as processing finishes on a given level of the tree, the lock is released after locking the lower level. The end result of a search is now a path where only the lowest level is locked. Releasing or freeing the path drops any locks held. Signed-off-by: Chris Mason <chris.mason@oracle.com>
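	The top-down locking order described above (lock the lower level, then release the level above it) can be sketched in a single-threaded model. This is an illustration only: an integer flag stands in for a real per-block lock, each node has a single child to keep the walk short, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVEL_SKETCH 8

/* Hypothetical tree block: "locked" is a stand-in for a real lock. */
struct node_sketch {
    int locked;
    struct node_sketch *child; /* NULL at the lowest level */
};

/* Hypothetical path: the blocks visited from the root downward. */
struct path_sketch {
    struct node_sketch *nodes[MAX_LEVEL_SKETCH];
    int depth;
};

/* Walk down, taking the lower lock before dropping the upper one. */
static void search_sketch(struct node_sketch *root, struct path_sketch *p)
{
    struct node_sketch *cur = root;

    assert(root != NULL);
    p->depth = 0;
    cur->locked = 1;
    p->nodes[p->depth++] = cur;
    while (cur->child && p->depth < MAX_LEVEL_SKETCH) {
        struct node_sketch *next = cur->child;

        next->locked = 1; /* lock the lower level first */
        cur->locked = 0;  /* then release the level above */
        p->nodes[p->depth++] = next;
        cur = next;
    }
}

/* Releasing the path drops whatever locks it still holds. */
static void release_path_sketch(struct path_sketch *p)
{
    for (int i = 0; i < p->depth; i++)
        p->nodes[i]->locked = 0;
    p->depth = 0;
}
```

	After a search in this model, only the lowest block on the path remains locked, matching the end state the commit message describes.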
2008-09-25	Btrfs: Add a thread pool just for submit_bio	Chris Mason	3	-1/+10
	If a bio submission is after a lock holder waiting for the bio on the work queue, it is possible to deadlock. Move the bios into their own pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	BTRFS_IOC_TRANS_START should be privilegued	Christoph Hellwig	1	-0/+3
	As mentioned in the comment next to it, btrfs_ioctl_trans_start can do bad damage to filesystems and thus should be limited to privileged users. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: split out ioctl.c	Christoph Hellwig	4	-729/+796
	Split the ioctl handling out of inode.c into a file of its own. Also fix up checkpatch.pl warnings for the moved code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: kerneldoc comments for extent_map.c	Christoph Hellwig	1	-12/+49
	Add kerneldoc comments for all exported functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Add a mount option to control worker thread pool size	Chris Mason	3	-16/+28
	mount -o thread_pool_size changes the default, which is min(num_cpus + 2, 8). Larger thread pools would make more sense on very large disk arrays. This mount option controls the max size of each thread pool. There are multiple thread pools, so the total worker count will be larger than the mount option. Signed-off-by: Chris Mason <chris.mason@oracle.com>
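	The default stated above, min(num_cpus + 2, 8), works out as in the following sketch. The function names are hypothetical, and treating a mount option value of zero as "not set" is an assumption made for the example rather than something the commit states.

```c
#include <assert.h>

/* Default per-pool worker limit: min(num_cpus + 2, 8). */
static int default_thread_pool_size_sketch(int num_cpus)
{
    int size;

    assert(num_cpus > 0);
    size = num_cpus + 2;
    return size < 8 ? size : 8;
}

/*
 * Assumption for this sketch: a positive mount option value replaces
 * the default, and zero means the option was not given.
 */
static int effective_thread_pool_size_sketch(int num_cpus, int mount_opt)
{
    return mount_opt > 0 ? mount_opt
                         : default_thread_pool_size_sketch(num_cpus);
}
```

	Because the limit applies to each pool separately, the total number of workers can exceed the value computed here.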
2008-09-25	Btrfs: Worker thread optimizations	Chris Mason	2	-34/+73
	This changes the worker thread pool to maintain a list of idle threads, avoiding a complex search for a good thread to wake up. Threads have two states: * idle - we try to reuse the last thread used in hopes of improving the batching ratios * busy - each time a new work item is added to a busy task, the task is rotated to the end of the line. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Add backport for the kthread work on kernels older than 2.6.20	Chris Mason	1	-1/+8
	Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Fix mount -o max_inline=0	Chris Mason	1	-2/+5
	max_inline=0 used to force the max_inline size to one sector instead. Now it properly disables inline data items, while still being able to read any that happen to exist on disk. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	Btrfs: Add async worker threads for pre and post IO checksumming	Chris Mason	8	-132/+626
	Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25	btrfs: allow scanning multiple devices during mount	Christoph Hellwig	1	-5/+16
	Allows specifying one or more device=/dev/foo options during mount so that ioctls on the control device can be avoided. Especially useful when trying to mount a multi-device setup as root. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>