summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2014-10-07locks: flock_make_lock should return a struct file_lock (or PTR_ERR)Jeff Layton1-8/+11
Eliminate the need for a return pointer. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: set fl_owner for leases to filp instead of current->filesJeff Layton1-1/+1
Like flock locks, leases are owned by the file description. Now that the i_have_this_lease check in __break_lease is gone, we don't actually use the fl_owner for leases for anything. So, it's now safe to set this more appropriately to the same value as the fl_file. While we're at it, fix up the comments over the fl_owner_t definition since they're rather out of date. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-10-07locks: give lm_break a return valueJeff Layton2-12/+22
Christoph suggests: "Add a return value to lm_break so that the lock manager can tell the core code "you can delete this lease right now". That gets rid of the games with the timeout which require all kinds of race avoidance code in the users." Do that here and have the nfsd lease break routine use it when it detects that there was a race between setting up the lease and it being broken. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: __break_lease cleanup in preparation of allowing direct removal of leasesJeff Layton1-24/+25
Eliminate an unneeded "flock" variable. We can use "fl" as a loop cursor everywhere. Add a any_leases_conflict helper function as well to consolidate a bit of code. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: remove i_have_this_lease check from __break_leaseJeff Layton1-4/+2
I think that the intent of this code was to ensure that a process won't deadlock if it has one fd open with a lease on it and then breaks that lease by opening another fd. In that case it'll treat the __break_lease call as if it were non-blocking. This seems wrong -- the process could (for instance) be multithreaded and managing different fds via different threads. I also don't see any mention of this limitation in the (somewhat sketchy) documentation. Remove the check and the non-blocking behavior when i_have_this_lease is true. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-10-07locks: move freeing of leases outside of i_lockJeff Layton2-15/+25
There was only one place where we still could free a file_lock while holding the i_lock -- lease_modify. Add a new list_head argument to the lm_change operation, pass in a private list when calling it, and fix those callers to dispose of the list once the lock has been dropped. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: move i_lock acquisition into generic_*_lease handlersJeff Layton1-12/+9
Now that we have a saner internal API for managing leases, we no longer need to mandate that the inode->i_lock be held over most of the lease code. Push it down into generic_add_lease and generic_delete_lease. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: define a lm_setup handler for leasesJeff Layton2-48/+50
...and move the fasync setup into it for fcntl lease calls. At the same time, change the semantics of how the file_lock double-pointer is handled. Up until now, on a successful lease return you got a pointer to the lock on the list. This is bad, since that pointer can no longer be relied on as valid once the inode->i_lock has been released. Change the code to instead just zero out the pointer if the lease we passed in ended up being used. Then the callers can just check to see if it's NULL after the call and free it if it isn't. The priv argument has the same semantics. The lm_setup function can zero the pointer out to signal to the caller that it should not be freed after the function returns. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: plumb a "priv" pointer into the setlease routinesJeff Layton4-18/+29
In later patches, we're going to add a new lock_manager_operation to finish setting up the lease while still holding the i_lock. To do this, we'll need to pass a little bit of info in the fcntl setlease case (primarily an fasync structure). Plumb the extra pointer into there in advance of that. We declare this pointer as a void ** to make it clear that this is private info, and that the caller isn't required to set this unless the lm_setup specifically requires it. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07nfsd: don't keep a pointer to the lease in nfs4_fileJeff Layton2-10/+4
Now that we don't need to pass in an actual lease pointer to vfs_setlease on unlock, we can stop tracking a pointer to the lease in the nfs4_file. Switch all of the places that check the fi_lease to check fi_deleg_file instead. We always set that at the same time so it will have the same semantics. Cc: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: clean up vfs_setlease kerneldoc commentsJeff Layton1-24/+10
Some of the latter paragraphs seem ambiguous and just plain wrong. In particular the break_lease comment makes no sense. We call break_lease (and break_deleg) from all sorts of vfs-layer functions, so there is clearly such a method. Also get rid of some of the other comments about what's needed for a full implementation. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: generic_delete_lease doesn't need a file_lock at allJeff Layton2-21/+15
Ensure that it's OK to pass in a NULL file_lock double pointer on a F_UNLCK request and convert the vfs_setlease F_UNLCK callers to do just that. Finally, turn the BUG_ON in generic_setlease into a WARN_ON_ONCE with an error return. That's a problem we can handle without crashing the box if it occurs. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07nfsd: fix potential lease memory leak in nfs4_setleaseJeff Layton1-2/+5
It's unlikely to ever occur, but if there were already a lease set on the file then we could end up getting back a different pointer on a successful setlease attempt than the one we allocated. If that happens, the one we allocated could leak. In practice, I don't think this will happen due to the fact that we only try to set up the lease once per nfs4_file, but this error handling is a bit more correct given the current lease API. Cc: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: close potential race in lease_get_mtimeJeff Layton1-2/+12
lease_get_mtime is called without the i_lock held, so there's no guarantee about the stability of the list. Between the time when we assign "flock" and then dereference it to check whether it's a lease and for write, the lease could be freed. Ensure that that doesn't occur by taking the i_lock before trying to check the lease. Cc: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-10security: make security_file_set_fowner, f_setown and __f_setown void returnJeff Layton3-22/+9
security_file_set_fowner always returns 0, so make it f_setown and __f_setown void return functions and fix up the error handling in the callers. Cc: linux-security-module@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-10locks: consolidate "nolease" routinesJeff Layton5-35/+19
GFS2 and NFS have setlease routines that always just return -EINVAL. Turn that into a generic routine that can live in fs/libfs.c. Cc: <linux-nfs@vger.kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: <cluster-devel@redhat.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-10locks: remove lock_may_read and lock_may_writeJeff Layton1-80/+0
There are no callers of these functions. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-10lockd: rip out deferred lock handling from testlock codepathJeff Layton1-49/+6
As Kinglong points out, the nlm_block->b_fl field is no longer used at all. Also, vfs_test_lock in the generic locking code will only return FILE_LOCK_DEFERRED if FL_SLEEP is set, and it isn't here. The only other place that returns that value is the DLM lock code, but it only does that in dlm_posix_lock, never in dlm_posix_get. Remove all of the deferred locking code from the testlock codepath since it doesn't appear to ever be used anyway. I do have a small concern that this might cause a behavior change in the case where you have a block already sitting on the list when the testlock request comes in, but that looks like it doesn't really work properly anyway. I think it's best to just pass that down to vfs_test_lock and let the filesystem report that instead of trying to infer what's going on with the lock by looking at an existing block. Cc: cluster-devel@redhat.com Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
2014-09-10NFSD: Get reference of lockowner when coping file_lockKinglong Mee1-4/+21
v5: using nfs4_get_stateowner() instead of an inline function v3: Update based on Jeff's comments v2: Fix bad using of struct file_lock_operations for handle the owner Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-10NFSD: New helper nfs4_get_stateowner() for atomic_inc sop referenceKinglong Mee1-16/+16
v5: same as the first version Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-10locks: Copy fl_lmops information for conflock in locks_copy_conflock()Kinglong Mee2-20/+17
Commit d5b9026a67 ([PATCH] knfsd: locks: flag NFSv4-owned locks) using fl_lmops field in file_lock for checking nfsd4 lockowner. But, commit 1a747ee0cc (locks: don't call ->copy_lock methods on return of conflicting locks) causes the fl_lmops of conflock always be NULL. Also, commit 0996905f93 (lockd: posix_test_lock() should not call locks_copy_lock()) caused the fl_lmops of conflock always be NULL too. Make sure copy the private information by fl_copy_lock() in struct file_lock_operations, merge __locks_copy_lock() to fl_copy_lock(). Jeff advice, "Set fl_lmops on conflocks, but don't set fl_ops. fl_ops are superfluous, since they are callbacks into the filesystem. There should be no need to bother the filesystem at all with info in a conflock. But, lock _ownership_ matters for conflocks and that's indicated by the fl_lmops. So you really do want to copy the fl_lmops for conflocks I think." v5: add missing calling of locks_release_private() in nlmsvc_testlock() v4: only copy fl_lmops for conflock, don't copy fl_ops Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-10locks: New ops in lock_manager_operations for get/put ownerKinglong Mee1-2/+10
NFSD or other lockmanager may increase the owner's reference, so adds two new options for copying and releasing owner. v5: change order from 2/6 to 3/6 v4: rename lm_copy_owner/lm_release_owner to lm_get_owner/lm_put_owner Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-10locks: Rename __locks_copy_lock() to locks_copy_conflock()Kinglong Mee1-5/+5
Jeff advice, " Right now __locks_copy_lock is only used to copy conflocks. It would be good to rename that to something more distinct (i.e.locks_copy_conflock), to make it clear that we're generating a conflock there." v5: change order from 3/6 to 2/6 v4: new patch only renaming function name Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-10locks: Remove unused conf argument from lm_grantJoe Perches2-13/+7
This argument is always NULL so don't pass it around. [jlayton: remove dependencies on previous patches in series] Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-10locks: pass correct "before" pointer to locks_unlink_lock in generic_add_leaseJeff Layton1-1/+1
The argument to locks_unlink_lock can't be just any pointer to a pointer. It must be a pointer to the fl_next field in the previous lock in the list. Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-03nfs: do not start the callback thread until we set rqstp->rq_taskTrond Myklebust1-1/+2
This fixes an Oopsable race when starting up the callback server. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03lockd: Do not start the lockd thread before we've set nlmsvc_rqst->rq_taskTrond Myklebust1-1/+2
This fixes an Oopsable race when starting lockd. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-29nfsd4: remove labeled NFS warning from config helpJ. Bruce Fields1-3/+0
The working group appears committed to keeping the protocol stable, the code has gotten some use and seems to work OK. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-29NFSD: Update some as-yet unused 4.2 error codesAnna Schumaker1-1/+1
Recent NFS v4.2 drafts have removed NFS4ERR_METADATA_NOTSUPP and reassigned the error code to NFS4ERR_UNION_NOTSUPP. I also add in the NFS4ERR_OFFLOAD_NO_REQS error code. We're not using any of these yet, so there's no harm done. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-28NFSD: Remove duplicate initialization of file_lockKinglong Mee1-4/+2
locks_alloc_lock() has initialized struct file_lock, no need to re-initialize it here. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-18nfsd: allow turning off nfsv3 readdir_plusRajesh Ghanekar2-0/+9
One of our customer's application only needs file names, not file attributes. With directories having 10K+ inodes (assuming buffer cache has directory blocks cached having file names, but inode cache is limited and hence need eviction of older cached inodes), older inodes are evicted periodically. So if they keep on doing readdir(2) from NSF client on multiple directories, some directory's files are periodically removed from inode cache and hence new readdir(2) on same directory requires disk access to bring back inodes again to inode cache. As READDIRPLUS request fetches attributes also, doing getattr on each file on server, it causes unnecessary disk accesses. If READDIRPLUS on NFS client is returned with -ENOTSUPP, NFS client uses READDIR request which just gets the names of the files in a directory, not attributes, hence avoiding disk accesses on server. There's already a corresponding client-side mount option, but an export option reduces the need for configuration across multiple clients. This flag affects NFSv3 only. If it turns out it's needed for NFSv4 as well then we may have to figure out how to extend the behavior to NFSv4, but it's not currently obvious how to do that. Signed-off-by: Rajesh Ghanekar <rajesh_ghanekar@symantec.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd4: reserve adequate space for LOCK opJ. Bruce Fields1-0/+8
As of 8c7424cff6 "nfsd4: don't try to encode conflicting owner if low on space", we permit the server to process a LOCK operation even if there might not be space to return the conflicting lockowner, because we've made returning the conflicting lockowner optional. However, the rpc server still wants to know the most we might possibly return, so we need to take into account the possible conflicting lockowner in the svc_reserve_space() call here. Symptoms were log messages like "RPC request reserved 88 but used 108". Fixes: 8c7424cff6 "nfsd4: don't try to encode conflicting owner if low on space" Reported-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd4: remove obsolete commentJ. Bruce Fields1-7/+0
We do what Neil suggests now. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd3: Check write permission after checking existenceRoss Lagerwall1-5/+0
When creating a file that already exists in a read-only directory with O_EXCL, the NFSv3 server returns EACCES rather than EEXIST (which local files and the NFSv4 server return). Fix this by checking the MAY_CREATE permission only if the file does not exist. Since this already happens in do_nfsd_create, the check in nfsd3_proc_create can simply be removed. Signed-off-by: Ross Lagerwall <rosslagerwall@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd: call nfs4_put_deleg_lease outside of state_lockJeff Layton1-1/+5
Currently, we hold the state_lock when releasing the lease. That's potentially problematic in the future if we allow for setlease methods that can sleep. Move the nfs4_put_deleg_lease call out of the delegation unhashing routine (which was always a bit goofy anyway), and into the unlocked sections of the callers of unhash_delegation_locked. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd: protect lease-related nfs4_file fields with fi_lockJeff Layton1-9/+13
Currently these fields are protected with the state_lock, but that doesn't really make a lot of sense. These fields are "private" to the nfs4_file, and can be protected with the more granular fi_lock. The fi_lock is already held when setting these fields. Make the code hold the fp->fi_lock when clearing the lease-related fields in the nfs4_file, and no longer require that the state_lock be held when calling into this function. To prevent lock inversion with the i_lock, we also move the vfs_setlease and fput calls outside of the fi_lock. This also sets us up for allowing vfs_setlease calls to block in the future. Finally, remove a redundant NULL pointer check. unhash_delegation_locked locks the fp->fi_lock prior to that check, so fp in that function must never be NULL. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd: Reorder nfsd_cache_match to check more powerful discriminators firstTrond Myklebust1-7/+11
We would normally expect the xid and the checksum to be the best discriminators. Check them before looking at the procedure number, etc. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd: split DRC global spinlock into per-bucket locksTrond Myklebust1-23/+20
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd: convert num_drc_entries to an atomic_tTrond Myklebust1-16/+12
...so we can remove the spinlocking around it. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd: Remove the cache_hash listTrond Myklebust2-18/+2
Now that the lru list is per-bucket, we don't need a second list for searches. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd: convert the lru list into a per-bucket thingTrond Myklebust1-23/+50
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfsd: Clean up drc cache in preparation for global spinlock eliminationTrond Myklebust1-21/+24
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17nfs: Ensure that nfs_callback_start_svc sets the server rq_task...Trond Myklebust1-0/+1
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17lockd: Ensure that lockd_start_svc sets the server rq_task...Trond Myklebust1-0/+2
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-16Merge branch 'for-linus2' of ↵Linus Torvalds17-305/+541
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "These are all fixes I'd like to get out to a broader audience. The biggest of the bunch is Mark's quota fix, which is also in the SUSE kernel, and makes our subvolume quotas dramatically more accurate. I've been running xfstests with these against your current git overnight, but I'm queueing up longer tests as well" * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: disable strict file flushes for renames and truncates Btrfs: fix csum tree corruption, duplicate and outdated checksums Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch Btrfs: fix compressed write corruption on enospc btrfs: correctly handle return from ulist_add btrfs: qgroup: account shared subtrees during snapshot delete Btrfs: read lock extent buffer while walking backrefs Btrfs: __btrfs_mod_ref should always use no_quota btrfs: adjust statfs calculations according to raid profiles
2014-08-16Merge tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linuxLinus Torvalds1-29/+57
Pull file locking bugfixes from Jeff Layton: "Most of these patches are to fix a long-standing regression that crept in when the BKL was removed from the file-locking code. The code was converted to use a conventional spinlock, but some fl_release_private ops can block and you can end up sleeping inside the lock. There's also a patch to make /proc/locks show delegations as 'DELEG'" * tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux: locks: update Locking documentation to clarify fl_release_private behavior locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped locks: don't reuse file_lock in __posix_lock_file locks: don't call locks_release_private from locks_copy_lock locks: show delegations as "DELEG" in /proc/locks
2014-08-16Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds1-56/+30
Pull aio updates from Ben LaHaise. * git://git.kvack.org/~bcrl/aio-next: aio: use iovec array rather than the single one aio: fix some comments aio: use the macro rather than the inline magic number aio: remove the needless registration of ring file's private_data aio: remove no longer needed preempt_disable() aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx() aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
2014-08-15btrfs: disable strict file flushes for renames and truncatesChris Mason8-267/+6
Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15Btrfs: fix csum tree corruption, duplicate and outdated checksumsFilipe Manana1-1/+1
Under rare circumstances we can end up leaving 2 versions of a checksum for the same file extent range. The reason for this is that after calling btrfs_next_leaf we process slot 0 of the leaf it returns, instead of processing the slot set in path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after btrfs_next_leaf() releases the path and before it searches for the next leaf, another task might cause a split of the next leaf, which migrates some of its keys to the leaf we were processing before calling btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the same leaf but with path->slots[0] having a slot number corresponding to the first new key it got, that is, a slot number that didn't exist before calling btrfs_next_leaf(), as the leaf now has more keys than it had before. So we must really process the returned leaf starting at path->slots[0] always, as it isn't always 0, and the key at slot 0 can have an offset much lower than our search offset/bytenr. For example, consider the following scenario, where we have: sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568 four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472 Leaf N: slot = 0 slot = btrfs_header_nritems() - 1 |-------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] | |-------------------------------------------------------------------| Leaf N + 1: slot = 0 slot = btrfs_header_nritems() - 1 |--------------------------------------------------------------------| | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 | |--------------------------------------------------------------------| Because we are at the last slot of leaf N, we call btrfs_next_leaf() to find the next highest key, which releases the current path and then searches for that next key. However after releasing the path and before finding that next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore btrfs_next_leaf() will returns us a path again with leaf N but with the slot pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N is then: slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1 |----------------------------------------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] | |----------------------------------------------------------------------------------------------------| And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump into the "insert:" label, which will set tmp to: tmp = min((sums->len - total_bytes) >> blocksize_bits, (next_offset - file_key.offset) >> blocksize_bits) = min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) = min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4 and ins_size = csum_size * tmp = 4 * 4 = 16 bytes. In other words, we insert a new csum item in the tree with key (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong, because the item with key (CSUM CSUM 40161280) (the one that was moved from leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288 bytes of our data and won't get those old checksums removed. So this leaves us 2 different checksums for 3 4kb blocks of data in the tree, and breaks the logical rule: Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover An obvious bad effect of this is that a subsequent csum tree lookup to get the checksum of any of the blocks with logical offset of 40161280, 40165376 or 40169472 (the last 3 4kb blocks of file data), will get the old checksums. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15Btrfs: Fix memory corruption by ulist_add_merge() on 32bit archTakashi Iwai2-6/+20
We've got bug reports that btrfs crashes when quota is enabled on 32bit kernel, typically with the Oops like below: BUG: unable to handle kernel NULL pointer dereference at 00000004 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs] *pde = 00000000 Oops: 0000 [#1] SMP CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs] task: f1478130 ti: f147c000 task.ti: f147c000 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0 EIP is at find_parent_nodes+0x360/0x1380 [btrfs] EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690 Stack: 00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050 00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000 00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000 Call Trace: [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs] [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs] [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs] [<c025e38b>] process_one_work+0x11b/0x390 [<c025eea1>] worker_thread+0x101/0x340 [<c026432b>] kthread+0x9b/0xb0 [<c0712a71>] ret_from_kernel_thread+0x21/0x30 [<c0264290>] kthread_create_on_node+0x110/0x110 This indicates a NULL corruption in prefs_delayed list. The further investigation and bisection pointed that the call of ulist_add_merge() results in the corruption. ulist_add_merge() takes u64 as aux and writes a 64bit value into old_aux. The callers of this function in backref.c, however, pass a pointer of a pointer to old_aux. That is, the function overwrites 64bit value on 32bit pointer. This caused a NULL in the adjacent variable, in this case, prefs_delayed. Here is a quick attempt to band-aid over this: a new function, ulist_add_merge_ptr() is introduced to pass/store properly a pointer value instead of u64. There are still ugly void ** cast remaining in the callers because void ** cannot be taken implicitly. But, it's safer than explicit cast to u64, anyway. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046 Cc: <stable@vger.kernel.org> [v3.11+] Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Mason <clm@fb.com>