summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2010-08-07block: push down BKL into .open and .releaseArnd Bergmann1-8/+2
The open and release block_device_operations are currently called with the BKL held. In order to change that, we must first make sure that all drivers that currently rely on this have no regressions. This blindly pushes the BKL into all .open and .release operations for all block drivers to prepare for the next step. The drivers can subsequently replace the BKL with their own locks or remove it completely when it can be shown that it is not needed. The functions blkdev_get and blkdev_put are the only remaining users of the big kernel lock in the block layer, besides a few uses in the ioctl code, none of which need to serialize with blkdev_{get,put}. Most of these two functions is also under the protection of bdev->bd_mutex, including the actual calls to ->open and ->release, and the common code does not access any global data structures that need the BKL. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Add tracing to balance_dirty_pagesDave Chinner1-0/+5
Tracing high level background writeback events is good, but it doesn't give the entire picture. Add visibility into write throttling to catch IO dispatched by foreground throttling of processing dirtying lots of pages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Initial tracing supportDave Chinner1-8/+30
Trace queue/sched/exec parts of the writeback loop. This provides insight into when and why flusher threads are scheduled to run. e.g a sync invocation leaves traces like: sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 This also lays the foundation for adding more writeback tracing to provide deeper insight into the whole writeback path. The original tracing code is from Jens Axboe, though this version is a rewrite as a result of the code being traced changing significantly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07gcc-4.6: fs: fix unused but set warningsAndi Kleen1-2/+0
No real bugs I believe, just some dead code, and some shut up code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: merge bdi_writeback_task and bdi_start_fnChristoph Hellwig1-1/+34
Move all code for the writeback thread into fs/fs-writeback.c instead of splitting it over two functions in two files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: remove wb_listChristoph Hellwig1-2/+2
The wb_list member of struct backing_device_info always has exactly one element. Just use the direct bdi->wb pointer instead and simplify some code. Also remove bdi_task_init which is now trivial to prepare for the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07block: unify flags for struct bio and struct requestChristoph Hellwig9-27/+28
Remove the current bio flags and reuse the request flags for the bio, too. This allows to more easily trace the type of I/O from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superflous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07block: BARRIER request should imply SYNCChristoph Hellwig1-1/+1
A barrier request should by defintion have priority in get_request and let the queue be unplugged immediately as it's blocking all forward progress due to the queue draining. Most filesystems already get this implicitly by the way how submit_bh treats the buffer_ordered flag, and gfs2 sets it explicitly. But btrfs and XFS are still forgetting to set the flag, as is blkdev_issue_flush and some places in DM/MD. For XFS on metadata heavy workloads this gives a consistent speedup in the 2-3% range. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-02NFS: Fix a typo in include/linux/nfs_fs.hTrond Myklebust1-5/+0
nfs_commit_inode() needs to be defined irrespectively of whether or not we are supporting NFSv3 and NFSv4. Allow the compiler to optimise away code in the NFSv2-only case by converting it into an inlined stub function. Reported-and-tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-31Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds3-11/+31
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Ensure that writepage respects the nonblock flag NFS: kswapd must not block in nfs_release_page nfs: include space for the NUL in root path
2010-07-31CIFS: Remove __exit mark from cifs_exit_dns_resolver()David Howells2-2/+2
Remove the __exit mark from cifs_exit_dns_resolver() as it's called by the module init routine in case of error, and so may have been discarded during linkage. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-30NFS: Ensure that writepage respects the nonblock flagTrond Myklebust1-6/+17
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30NFS: kswapd must not block in nfs_release_pageTrond Myklebust2-4/+13
See https://bugzilla.kernel.org/show_bug.cgi?id=16056 If other processes are blocked waiting for kswapd to free up some memory so that they can make progress, then we cannot allow kswapd to block on those processes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-07-30nfs: include space for the NUL in root pathDan Carpenter1-1/+1
In root_nfs_name() it does the following: if (strlen(buf) + strlen(cp) > NFS_MAXPATHLEN) { printk(KERN_ERR "Root-NFS: Pathname for remote directory too long.\n"); return -1; } sprintf(nfs_export_path, buf, cp); In the original code if (strlen(buf) + strlen(cp) == NFS_MAXPATHLEN) then the sprintf() would lead to an overflow. Generally the rest of the code assumes that the path can have NFS_MAXPATHLEN (1024) characters and a NUL terminator so the fix is to add space to the nfs_export_path[] buffer. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30CRED: Fix get_task_cred() and task_state() to not resurrect dead credentialsDavid Howells1-1/+1
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a tasks credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001 FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0) Stack: ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45 <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000 <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246 Call Trace: [<ffffffff810698cd>] put_cred+0x13/0x15 [<ffffffff81069b45>] commit_creds+0x16b/0x175 [<ffffffff8106aace>] set_current_groups+0x47/0x4e [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00 48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b 04 25 00 cc 00 00 48 3b b8 58 04 00 00 75 RIP [<ffffffff81069881>] __put_cred+0xc/0x45 RSP <ffff88019e7e9eb8> ---[ end trace df391256a100ebdd ]--- Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-29ecryptfs: Bugfix for error related to ecryptfs_hash_bucketsAndre Osterhues1-8/+9
The function ecryptfs_uid_hash wrongly assumes that the second parameter to hash_long() is the number of hash buckets instead of the number of hash bits. This patch fixes that and renames the variable ecryptfs_hash_buckets to ecryptfs_hash_bits to make it clearer. Fixes: CVE-2010-2492 Signed-off-by: Andre Osterhues <aosterhues@escrypt.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-28Merge branch 'for-linus' of ↵Linus Torvalds9-36/+50
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: use complete_all and wake_up_all ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES" ceph: fix dentry lease release ceph: fix leak of dentry in ceph_init_dentry() error path ceph: fix pg_mapping leak on pg_temp updates ceph: fix d_release dop for snapdir, snapped dentries ceph: avoid dcache readdir for snapdir
2010-07-28GFS2: Use kmalloc when possible for ->readdir()Steven Whitehouse1-6/+25
If we don't need a huge amount of memory in ->readdir() then we can use kmalloc rather than vmalloc to allocate it. This should cut down on the greater overheads associated with vmalloc for smaller directories. We may be able to eliminate vmalloc entirely at some stage, but this is easy to do right away. Also using GFP_NOFS to avoid any issues wrt to deleting inodes while under a glock, and suggestion from Linus to factor out the alloc/dealloc. I've given this a test with a variety of different sized directories and it seems to work ok. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-28ceph: use complete_all and wake_up_allYehuda Sadeh6-20/+20
This fixes an issue triggered by running concurrent syncs. One of the syncs would go through while the other would just hang indefinitely. In any case, we never actually want to wake a single waiter, so the *_all functions should be used. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-279p: Pass the correct end of buffer to p9stat_readLatchesar Ionkov1-1/+1
Pass the correct end of the buffer to p9stat_read. Signed-off-by: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-07-26sysfs: allow creating symlinks from untagged to tagged directoriesEric W. Biederman1-1/+2
Supporting symlinks from untagged to tagged directories is reasonable, and needed to support CONFIG_SYSFS_DEPRECATED. So don't fail a prior allowing that case to work. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26sysfs: sysfs_delete_link handle symlinks from untagged to tagged directories.Eric W. Biederman1-1/+1
This happens for network devices when SYSFS_DEPRECATED is enabled. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26sysfs: Don't allow the creation of symlinks we can't removeEric W. Biederman1-5/+18
Recently my tagged sysfs support revealed a flaw in the device core that a few rare drivers are running into such that we don't always put network devices in a class subdirectory named net/. Since we are not creating the class directory the network devices wind up in a non-tagged directory, but the symlinks to the network devices from /sys/class/net are in a tagged directory. All of which works until we go to remove or rename the symlink. When we remove or rename a symlink we look in the namespace of the target of the symlink. Since the target of the symlink is in a non-tagged sysfs directory we don't have a namespace to look in, and we fail to remove the symlink. Detect this problem up front and simply don't create symlinks we won't be able to remove later. This prevents symlink leakage and fails in a much clearer and more understandable way. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Maciej W. Rozycki <macro@linux-mips.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-25ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"Robert P. J. Day1-1/+1
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-24ceph: fix dentry lease releaseSage Weil1-0/+1
When we embed a dentry lease release notification in a request, invalidate our lease so we don't think we still have it. Otherwise we can get all sorts of incorrect client behavior when multiple clients are interacting with the same part of the namespace. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: fix leak of dentry in ceph_init_dentry() error pathSage Weil1-1/+3
If we fail to allocate a ceph_dentry_info, don't leak the dn reference. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: fix pg_mapping leak on pg_temp updatesSage Weil1-11/+15
Free the ceph_pg_mapping structs when they are removed from the pg_temp rbtree. Also fix a leak in the __insert_pg_mapping() error path. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: fix d_release dop for snapdir, snapped dentriesSage Weil1-3/+9
We need to set the d_release dop for snapdir and snapped dentries so that the ceph_dentry_info struct gets released. We also use the dcache to cache readdir results when possible, which only works if we know when dentries are dropped from the cache. Since we don't use the dcache for readdir in the hidden snapdir, avoid that case in ceph_dentry_release. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23ceph: avoid dcache readdir for snapdirSage Weil1-0/+1
We should always go to the MDS for readdir on the hidden snapdir. The set of snapshots can change at any time; the client can't trust its cache for that. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-22CIFS: Fix a malicious redirect problem in the DNS lookup codeDavid Howells3-5/+74
Fix the security problem in the CIFS filesystem DNS lookup code in which a malicious redirect could be installed by a random user by simply adding a result record into one of their keyrings with add_key() and then invoking a CIFS CFS lookup [CVE-2010-2524]. This is done by creating an internal keyring specifically for the caching of DNS lookups. To enforce the use of this keyring, the module init routine creates a set of override credentials with the keyring installed as the thread keyring and instructs request_key() to only install lookup result keys in that keyring. The override is then applied around the call to request_key(). This has some additional benefits when a kernel service uses this module to request a key: (1) The result keys are owned by root, not the user that caused the lookup. (2) The result keys don't pop up in the user's keyrings. (3) The result keys don't come out of the quota of the user that caused the lookup. The keyring can be viewed as root by doing cat /proc/keys: 2a0ca6c3 I----- 1 perm 1f030000 0 0 keyring .dns_resolver: 1/4 It can then be listed with 'keyctl list' by root. # keyctl list 0x2a0ca6c3 1 key in keyring: 726766307: --alswrv 0 0 dns_resolver: foo.bar.com Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve French <smfrench@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-21Fix up trivial spelling errors ('taht' -> 'that')Linus Torvalds1-1/+1
Pointed out by Lucas who found the new one in a comment in setup_percpu.c. And then I fixed the others that I grepped for. Reported-by: Lucas <canolucas@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-21Merge branch 'for-linus' of ↵Linus Torvalds5-39/+72
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: do not include cap/dentry releases in replayed messages ceph: reuse request message when replaying against recovering mds ceph: fix creation of ipv6 sockets ceph: fix parsing of ipv6 addresses ceph: fix printing of ipv6 addrs ceph: add kfree() to error path ceph: fix leak of mon authorizer ceph: fix message revocation
2010-07-20Merge branch 'shrinker' of ↵Linus Torvalds18-74/+103
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev * 'shrinker' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev: xfs: track AGs with reclaimable inodes in per-ag radix tree xfs: convert inode shrinker to per-filesystem contexts mm: add context argument to shrinker callback
2010-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds2-23/+126
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE Btrfs: fix CLONE ioctl destination file size expansion to block boundary Btrfs: fix split_leaf double split corner case
2010-07-20xfs: track AGs with reclaimable inodes in per-ag radix treeDave Chinner2-7/+67
https://bugzilla.kernel.org/show_bug.cgi?id=16348 When the filesystem grows to a large number of allocation groups, the summing of recalimable inodes gets expensive. In many cases, most AGs won't have any reclaimable inodes and so we are wasting CPU time aggregating over these AGs. This is particularly important for the inode shrinker that gets called frequently under memory pressure. To avoid the overhead, track AGs with reclaimable inodes in the per-ag radix tree so that we can find all the AGs with reclaimable inodes via a simple gang tag lookup. This involves setting the tag when the first reclaimable inode is tracked in the AG, and removing the tag when the last reclaimable inode is removed from the tree. Then the summation process becomes a loop walking the radix tree summing AGs with the reclaim tag set. This significantly reduces the overhead of scanning - a 6400 AG filesystea now only uses about 25% of a cpu in kswapd while slab reclaim progresses instead of being permanently stuck at 100% CPU and making little progress. Clean filesystems filesystems will see no overhead and the overhead only increases linearly with the number of dirty AGs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-20xfs: convert inode shrinker to per-filesystem contextsDave Chinner4-53/+15
Now the shrinker passes us a context, wire up a shrinker context per filesystem. This allows us to remove the global mount list and the locking problems that introduced. It also means that a shrinker call does not need to traverse clean filesystems before finding a filesystem with reclaimable inodes. This significantly reduces scanning overhead when lots of filesystems are present. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-20Btrfs: fix checks in BTRFS_IOC_CLONE_RANGEDan Rosenberg1-2/+2
1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check whether the donor file is append-only before writing to it. 2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer overflow that allows a user to specify an out-of-bounds range to copy from the source file (if off + len wraps around). I haven't been able to successfully exploit this, but I'd imagine that a clever attacker could use this to read things he shouldn't. Even if it's not exploitable, it couldn't hurt to be safe. Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> cc: stable@kernel.org Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-20Btrfs: fix CLONE ioctl destination file size expansion to block boundarySage Weil1-3/+13
The CLONE and CLONE_RANGE ioctls round up the range of extents being cloned to the block size when the range to clone extends to the end of file (this is always the case with CLONE). It was then using that offset when extending the destination file's i_size. Fix this by not setting i_size beyond the originally requested ending offset. This bug was introduced by a22285a6 (2.6.35-rc1). Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-20Btrfs: fix split_leaf double split corner caseChris Mason1-18/+111
split_leaf was not properly balancing leaves when it was forced to split a leaf twice. This commit adds an extra push left and right before forcing the double split in hopes of getting the slot where we want to insert at either the start or end of the leaf. If the extra pushes do work, then we are able to avoid splitting twice and we keep the tree properly balanced. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19[S390] dasd: use correct label location for diag fba disksPeter Oberparleiter1-2/+12
Partition boundary calculation fails for DASD FBA disks under the following conditions: - disk is formatted with CMS FORMAT with a blocksize of more than 512 bytes - all of the disk is reserved to a single CMS file using CMS RESERVE - the disk is accessed using the DIAG mode of the DASD driver Under these circumstances, the partition detection code tries to read the CMS label block containing partition-relevant information from logical block offset 1, while it is in fact located at physical block offset 1. Fix this problem by using the correct CMS label block location depending on the device type as determined by the DASD SENSE ID information. Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2010-07-19mm: add context argument to shrinker callbackDave Chinner14-16/+23
The current shrinker implementation requires the registered callback to have global state to work from. This makes it difficult to shrink caches that are not global (e.g. per-filesystem caches). Pass the shrinker structure to the callback so that users can embed the shrinker structure in the context the shrinker needs to operate on and get back to it in the callback via container_of(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-07-18Merge branch 'upstream-linus' of ↵Linus Torvalds15-219/+498
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2: Silence gcc warning in ocfs2_write_zero_page(). jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node ocfs2: Don't duplicate pages past i_size during CoW. ocfs2: tighten up strlen() checking ocfs2: Make xattr reflink work with new local alloc reservation. ocfs2: make xattr extension work with new local alloc reservation. ocfs2: Remove the redundant cpu_to_le64. ocfs2/dlm: don't access beyond bitmap size ocfs2: No need to zero pages past i_size. ocfs2: Zero the tail cluster when extending past i_size. ocfs2: When zero extending, do it by page. ocfs2: Limit default local alloc size within bitmap range. ocfs2: Move orphan scan work to ocfs2_wq. fs/ocfs2/dlm: Add missing spin_unlock
2010-07-17ocfs2: Silence gcc warning in ocfs2_write_zero_page().Joel Becker1-1/+1
ocfs2_write_zero_page() has a loop that won't ever be skipped, but gcc doesn't know that. Set ret=0 just to make gcc happy. Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-16ceph: do not include cap/dentry releases in replayed messagesSage Weil2-0/+9
Strip the cap and dentry releases from replayed messages. They can cause the shared state to get out of sync because they were generated (with the request message) earlier, and no longer reflect the current client state. Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16ceph: reuse request message when replaying against recovering mdsSage Weil1-5/+22
Replayed rename operations (after an mds failure/recovery) were broken because the request paths were regenerated from the dentry names, which get mangled when d_move() is called. Instead, resend the previous request message when replaying completed operations. Just make sure the REPLAY flag is set and the target ino is filled in. This fixes problems with workloads doing renames when the MDS restarts, where the rename operation appears to succeed, but on mds restart then fails (leading to client confusion, app breakage, etc.). Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-16jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactionsJan Kara3-23/+25
OCFS2 uses t_commit trigger to compute and store checksum of the just committed blocks. When a buffer has b_frozen_data, checksum is computed for it instead of b_data but this can result in an old checksum being written to the filesystem in the following scenario: 1) transaction1 is opened 2) handle1 is opened 3) journal_access(handle1, bh) - This sets jh->b_transaction to transaction1 4) modify(bh) 5) journal_dirty(handle1, bh) 6) handle1 is closed 7) start committing transaction1, opening transaction2 8) handle2 is opened 9) journal_access(handle2, bh) - This copies off b_frozen_data to make it safe for transaction1 to commit. jh->b_next_transaction is set to transaction2. 10) jbd2_journal_write_metadata() checksums b_frozen_data 11) the journal correctly writes b_frozen_data to the disk journal 12) handle2 is closed - There was no dirty call for the bh on handle2, so it is never queued for any more journal operation 13) Checkpointing finally happens, and it just spools the bh via normal buffer writeback. This will write b_data, which was never triggered on and thus contains a wrong (old) checksum. This patch fixes the problem by calling the trigger at the moment data is frozen for journal commit - i.e., either when b_frozen_data is created by do_get_write_access or just before we write a buffer to the log if b_frozen_data does not exist. We also rename the trigger to t_frozen as that better describes when it is called. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down nodeWengang Wang1-8/+14
For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set before sending DLM_MIG_LOCKRES_MSG message to the target. We are using dlm_migration_can_proceed() for that purpose. However, if the node is down, dlm_migration_can_proceed() will also return "go ahead". In this rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove the BUG_ON() that trips over this condition. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15ocfs2: Don't duplicate pages past i_size during CoW.Tao Ma1-0/+6
During CoW, the pages after i_size don't contain valid data, so there's no need to read and duplicate them. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-07-15GFS2: rename causes kernel OopsBob Peterson1-1/+1
This patch fixes a kernel Oops in the GFS2 rename code. The problem was in the way the gfs2 directory code was trying to re-use sentinel directory entries. In the failing case, gfs2's rename function was renaming a file to another name that had the same non-trivial length. The file being renamed happened to be the first directory entry on the leaf block. First, the rename code (gfs2_rename in ops_inode.c) found the original directory entry and decided it could do its job by simply replacing the directory entry with another. Therefore it determined correctly that no block allocations were needed. Next, the rename code deleted the old directory entry prior to replacing it with the new name. Therefore, the soon-to-be replaced directory entry was temporarily made into a directory entry "sentinel" or a place holder at the start of a leaf block. Lastly, it went to re-add the replacement directory entry in that leaf block. However, when gfs2_dirent_find_space was looking for space in the leaf block, it used the wrong value for the sentinel. That threw off its calculations so later it decides it can't really re-use the sentinel and therefore must allocate a new leaf block. But because it previously decided to re-use the directory entry, it didn't waste the time to grab a new block allocation for the inode. Therefore, the inode's i_alloc pointer was still NULL and it crashes trying to reference it. In the case of sentinel directory entries, the entire dirent is reused, not just the "free space" portion of it, and therefore the function gfs2_dirent_find_space should use the value 0 rather than GFS2_DIRENT_SIZE(0) for the actual dirent size. Fixing this calculation enables the reproducer programs to work properly. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-15GFS2: BUG in gfs2_adjust_quotaAbhijith Das1-5/+3
HighMem pages on i686 do not get mapped to the buffer_heads and this was causing a NULL pointer dereference when we were trying to memset page buffers to zero. We now use zero_user() that kmaps the page and directly manipulates page data. This patch also fixes a boundary condition that was incorrect. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>