summaryrefslogtreecommitdiff
path: root/fs/fuse/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells1-2/+2
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-21fs: remove mapping->backing_dev_infoChristoph Hellwig1-1/+0
Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-06fuse: add memory barrier to INITMiklos Szeredi1-1/+1
Theoretically we need to order setting of various fields in fc with fc->initialized. No known bug reports related to this yet. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-01-06fuse: fix LOOKUP vs INIT compat handlingMiklos Szeredi1-2/+1
Analysis from Marc: "Commit 7078187a795f ("fuse: introduce fuse_simple_request() helper") from the above pull request triggers some EIO errors for me in some tests that rely on fuse Looking at the code changes and a bit of debugging info I think there's a general problem here that fuse_get_req checks and possibly waits for fc->initialized, and this was always called first. But this commit changes the ordering and in many places fc->minor is now possibly used before fuse_get_req, and we can't be sure that fc has been initialized. In my case fuse_lookup_init sets req->out.args[0].size to the wrong size because fc->minor at that point is still 0, leading to the EIO error." Fix by moving the compat adjustments into fuse_simple_request() to after fuse_get_req(). This is also more readable than the original, since now compatibility is handled in a single function instead of cluttering each operation. Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Tested-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Fixes: 7078187a795f ("fuse: introduce fuse_simple_request() helper")
2014-12-12fuse: introduce fuse_simple_request() helperMiklos Szeredi1-14/+8
The following pattern is repeated many times: req = fuse_get_req_nopages(fc); /* Initialize req->(in|out).args */ fuse_request_send(fc, req); err = req->out.h.error; fuse_put_request(req); Create a new replacement helper: /* Initialize args */ err = fuse_simple_request(fc, &args); In addition to reducing the code size, this will ease moving from the complex arg-based to a simpler page-based I/O on the fuse device. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-12fuse: flush requests on umountMiklos Szeredi1-15/+1
Use fuse_abort_conn() instead of fuse_conn_kill() in fuse_put_super(). This flushes and aborts requests still on any queues. But since we've already reset fc->connected, those requests would not be useful anyway and would be flushed when the fuse device is closed. Next patches will rely on requests being flushed before the superblock is destroyed. Use fuse_abort_conn() in cuse_process_init_reply() too, since it makes no difference there, and we can get rid of fuse_conn_kill(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-12fuse: don't wake up reserved req in fuse_conn_kill()Miklos Szeredi1-1/+0
Waking up reserved_req_waitq from fuse_conn_kill() doesn't make sense since we aren't chaging ff->reserved_req here, which is what this waitqueue signals. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-22fuse: add FUSE_NO_OPEN_SUPPORT flag to INITAndrew Gallagher1-1/+1
Here some additional changes to set a capability flag so that clients can detect when it's appropriate to return -ENOSYS from open. This amends the following commit introduced in 3.14: 7678ac50615d fuse: support clients that don't implement 'open' However we can only add the flag to 3.15 and later since there was no protocol version update in 3.14. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: <stable@vger.kernel.org> # v3.15+
2014-07-22fuse: s_time_gran fixMiklos Szeredi1-3/+0
Default s_time_gran is 1, don't overwrite that if userspace didn't explicitly specify one. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: <stable@vger.kernel.org> # v3.15+
2014-07-07fuse: handle large user and group IDMiklos Szeredi1-4/+16
If the number in "user_id=N" or "group_id=N" mount options was larger than INT_MAX then fuse returned EINVAL. Fix this to handle all valid uid/gid values. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
2014-07-07fuse: inode: drop castHimangi Saraogi1-1/+1
This patch removes the cast on data of type void * as it is not needed. The following Coccinelle semantic patch was used for making the change: @r@ expression x; void* e; type T; identifier f; @@ ( *((T *)e) | ((T *)x)[...] | ((T *)x)->f | - (T *) e ) Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28fuse: clear MS_I_VERSIONMiklos Szeredi1-1/+1
Fuse doesn't support i_version (yet). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28fuse: trust kernel i_ctime onlyMaxim Patlasov1-2/+4
Let the kernel maintain i_ctime locally: update i_ctime explicitly on truncate, fallocate, open(O_TRUNC), setxattr, removexattr, link, rename, unlink. The inode flag I_DIRTY_SYNC serves as indication that local i_ctime should be flushed to the server eventually. The patch sets the flag and updates i_ctime in course of operations listed above. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28fuse: fuse: add time_gran to INIT_OUTMiklos Szeredi1-0/+5
Allow userspace fs to specify time granularity. This is needed because with writeback_cache mode the kernel is responsible for generating mtime and ctime, but if the underlying filesystem doesn't support nanosecond granularity then the cache will contain a different value from the one stored on the filesystem resulting in a change of times after a cache flush. Make the default granularity 1s. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28fuse: add .write_inodeMiklos Szeredi1-0/+1
...and flush mtime from this. This allows us to use the kernel infrastructure for writing out dirty metadata (mtime at this point, but ctime in the next patches and also maybe atime). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-28fuse: do not use uninitialized i_modeMaxim Patlasov1-1/+1
When inode is in I_NEW state, inode->i_mode is not initialized yet. Do not use it before fuse_init_inode() is called. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-05Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major changes for 3.14 include support for the newly added ZERO_RANGE and COLLAPSE_RANGE fallocate operations, and scalability improvements in the jbd2 layer and in xattr handling when the extended attributes spill over into an external block. Other than that, the usual clean ups and minor bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits) ext4: fix premature freeing of partial clusters split across leaf blocks ext4: remove unneeded test of ret variable ext4: fix comment typo ext4: make ext4_block_zero_page_range static ext4: atomically set inode->i_flags in ext4_set_inode_flags() ext4: optimize Hurd tests when reading/writing inodes ext4: kill i_version support for Hurd-castrated file systems ext4: each filesystem creates and uses its own mb_cache fs/mbcache.c: doucple the locking of local from global data fs/mbcache.c: change block and index hash chain to hlist_bl_node ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate ext4: refactor ext4_fallocate code ext4: Update inode i_size after the preallocation ext4: fix partial cluster handling for bigalloc file systems ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents ext4: only call sync_filesystm() when remounting read-only fs: push sync_filesystem() down to the file system's remount_fs() jbd2: improve error messages for inconsistent journal heads jbd2: minimize region locked by j_list_lock in jbd2_journal_forget() jbd2: minimize region locked by j_list_lock in journal_get_create_access() ...
2014-04-05Merge branch 'for-linus' of ↵Linus Torvalds1-6/+23
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: "This series adds cached writeback support to fuse, improving write throughput" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix "uninitialized variable" warning fuse: Turn writeback cache on fuse: Fix O_DIRECT operations vs cached writeback misorder fuse: fuse_flush() should wait on writeback fuse: Implement write_begin/write_end callbacks fuse: restructure fuse_readpage() fuse: Flush files on wb close fuse: Trust kernel i_mtime only fuse: Trust kernel i_size only fuse: Connection bit for enabling writeback fuse: Prepare to handle short reads fuse: Linking file to inode helper
2014-04-04mm + fs: store shadow entries in page cacheJohannes Weiner1-1/+1
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-02fuse: Turn writeback cache onPavel Emelyanov1-1/+4
Introduce a bit kernel and userspace exchange between each-other on the init stage and turn writeback on if the userspace want this and mount option 'allow_wbcache' is present (controlled by fusermount). Also add each writable file into per-inode write list and call the generic_file_aio_write to make use of the Linux page cache engine. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02fuse: Trust kernel i_mtime onlyMaxim Patlasov1-3/+10
Let the kernel maintain i_mtime locally: - clear S_NOCMTIME - implement i_op->update_time() - flush mtime on fsync and last close - update i_mtime explicitly on truncate and fallocate Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime should be flushed to the server eventually. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-04-02fuse: Trust kernel i_size onlyPavel Emelyanov1-2/+9
Make fuse think that when writeback is on the inode's i_size is always up-to-date and not update it with the value received from the userspace. This is done because the page cache code may update i_size without letting the FS know. This assumption implies fixing the previously introduced short-read helper -- when a short read occurs the 'hole' is filled with zeroes. fuse_file_fallocate() is also fixed because now we should keep i_size up to date, so it must be updated if FUSE_FALLOCATE request succeeded. Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-03-13fs: push sync_filesystem() down to the file system's remount_fs()Theodore Ts'o1-0/+1
Previously, the no-op "mount -o mount /dev/xxx" operation when the file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly documented or guaraunteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only. However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara <jack@suse.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Anders Larsen <al@alarsen.net> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: xfs@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: codalist@coda.cs.cmu.edu Cc: linux-ext4@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: fuse-devel@lists.sourceforge.net Cc: cluster-devel@redhat.com Cc: linux-mtd@lists.infradead.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux-nfs@vger.kernel.org Cc: linux-nilfs@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: reiserfs-devel@vger.kernel.org
2013-10-25fuse: rcu-delay freeing fuse_connAl Viro1-1/+1
makes ->permission() and ->d_revalidate() safety in RCU mode independent from vfsmount_lock. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-25vfs: introduce d_instantiate_no_diralias()Miklos Szeredi1-2/+0
...which just returns -EBUSY if a directory alias would be created. This is to be used by fuse mkdir to make sure that a buggy or malicious userspace filesystem doesn't do anything nasty. Previously fuse used a private mutex for this purpose, which can now go away. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-09-13truncate: drop 'oldsize' truncate_pagecache() parameterKirill A. Shutemov1-1/+1
truncate_pagecache() doesn't care about old size since commit cedabed49b39 ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12mm/page-writeback.c: add strictlimit featureMaxim Patlasov1-1/+1
The feature prevents mistrusted filesystems (ie: FUSE mounts created by unprivileged users) to grow a large number of dirty pages before throttling. For such filesystems balance_dirty_pages always check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under "freerun", it's not allowed to skip bdi checks. The only use case for now is fuse: it sets bdi max_ratio to 1% by default and system administrators are supposed to expect that this limit won't be exceeded. The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A filesystem may set the flag when it initializes its BDI. The problematic scenario comes from the fact that nobody pays attention to the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse writeback). The implementation of fuse writeback releases original page (by calling end_page_writeback) almost immediately. A fuse request queued for real processing bears a copy of original page. Hence, if userspace fuse daemon doesn't finalize write requests in timely manner, an aggressive mmap writer can pollute virtually all memory by those temporary fuse page copies. They are carefully accounted in NR_WRITEBACK_TEMP, but nobody cares. To make further explanations shorter, let me use "NR_WRITEBACK_TEMP problem" as a shortcut for "a possibility of uncontrolled grow of amount of RAM consumed by temporary pages allocated by kernel fuse to process writeback". The problem was very easy to reproduce. There is a trivial example filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I added "sleep(1);" to the write methods, then recompiled and mounted it. Then created a huge file on the mount point and run a simple program which mmap-ed the file to a memory region, then wrote a data to the region. An hour later I observed almost all RAM consumed by fuse writeback. Since then some unrelated changes in kernel fuse made it more difficult to reproduce, but it is still possible now. Putting this theoretical happens-in-the-lab thing aside, there is another thing that really hurts real world (FUSE) users. This is write-through page cache policy FUSE currently uses. I.e. handling write(2), kernel fuse populates page cache and flushes user data to the server synchronously. This is excessively suboptimal. Pavel Emelyanov's patches ("writeback cache policy") solve the problem, but they also make resolving NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying a huge file to a fuse mount would result in memory starvation. Miklos, the maintainer of FUSE, believes strictlimit feature the way to go. And eventually putting FUSE topics aside, there is one more use-case for strictlimit feature. Using a slow USB stick (mass storage) in a machine with huge amount of RAM installed is a well-known pain. Let's make simple computations. Assuming 64GB of RAM installed, existing implementation of balance_dirty_pages will start throttling only after 9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So, the command "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but subsequent "umount /media/my-usb-storage/" will take more than two hours if effective throughput of the storage is, to say, 1MB/sec. After inclusion of strictlimit feature, it will be trivial to add a knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand. Manually or via udev rule. May be I'm wrong, but it seems to be quite a natural desire to limit the amount of dirty memory for some devices we are not fully trust (in the sense of sustainable throughput). [akpm@linux-foundation.org: fix warning in page-writeback.c] Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Cc: Jan Kara <jack@suse.cz> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-03fuse: hotfix truncate_pagecache() issueMaxim Patlasov1-1/+2
The way how fuse calls truncate_pagecache() from fuse_change_attributes() is completely wrong. Because, w/o i_mutex held, we never sure whether 'oldsize' and 'attr->size' are valid by the time of execution of truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we released fc->lock in the middle of fuse_change_attributes(), we completely loose control of actions which may happen with given inode until we reach truncate_pagecache. The list of potentially dangerous actions includes mmap-ed reads and writes, ftruncate(2) and write(2) extending file size. The typical outcome of doing truncate_pagecache() with outdated arguments is data corruption from user point of view. This is (in some sense) acceptable in cases when the issue is triggered by a change of the file on the server (i.e. externally wrt fuse operation), but it is absolutely intolerable in scenarios when a single fuse client modifies a file without any external intervention. A real life case I discovered by fsx-linux looked like this: 1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends FUSE_SETATTR to the server synchronously, but before getting fc->lock ... 2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP to the server synchronously, then calls fuse_change_attributes(). The latter updates i_size, releases fc->lock, but before comparing oldsize vs attr->size.. 3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and updating attributes and i_size, but now oldsize is equal to outarg.attr.size because i_size has just been updated (step 2). Hence, fuse_do_setattr() returns w/o calling truncate_pagecache(). 4. As soon as ftruncate(2) completes, the user extends file size by write(2) making a hole in the middle of file, then reads data from the hole either by read(2) or mmap-ed read. The user expects to get zero data from the hole, but gets stale data because truncate_pagecache() is not executed yet. The scenario above illustrates one side of the problem: not truncating the page cache even though we should. Another side corresponds to truncating page cache too late, when the state of inode changed significantly. Theoretically, the following is possible: 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size changed (due to our own fuse_do_setattr()) and is going to call truncate_pagecache() for some 'new_size' it believes valid right now. But by the time that particular truncate_pagecache() is called ... 2. fuse_do_setattr() returns (either having called truncate_pagecache() or not -- it doesn't matter). 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2). 4. mmap-ed write makes a page in the extended region dirty. The result will be the lost of data user wrote on the fourth step. The patch is a hotfix resolving the issue in a simplistic way: let's skip dangerous i_size update and truncate_pagecache if an operation changing file size is in progress. This simplistic approach looks correct for the cases w/o external changes. And to handle them properly, more sophisticated and intrusive techniques (e.g. NFS-like one) would be required. I'd like to postpone it until the issue is well discussed on the mailing list(s). Changed in v2: - improved patch description to cover both sides of the issue. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
2013-07-04mm: use totalram_pages instead of num_physpages at runtimeJiang Liu1-1/+1
The global variable num_physpages is scheduled to be removed, so use totalram_pages instead of num_physpages at runtime. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-03fuse: fix readdirplus Oops in fuse_dentry_revalidateMiklos Szeredi1-3/+4
Fix bug introduced by commit 4582a4ab2a "FUSE: Adapt readdirplus to application usage patterns". We need to check for a positive dentry; negative dentries are not added by readdirplus. Secondly we need to advise the use of readdirplus on the *parent*, otherwise the whole thing is useless. Thirdly all this is only relevant if "readdirplus_auto" mode is selected by the filesystem. We advise the use of readdirplus only if the dentry was still valid. If we had to redo the lookup then there was no use in doing the -plus version. Reported-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: Feng Shuo <steve.shuo.feng@gmail.com> CC: stable@vger.kernel.org
2013-05-01fuse: add flag to turn on async direct IOMiklos Szeredi1-1/+3
Without async DIO write requests to a single file were always serialized. With async DIO that's no longer the case. So don't turn on async DIO by default for fear of breaking backward compatibility. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-04-17fuse: skip blocking on allocations of synchronous requestsMaxim Patlasov1-2/+1
A task may have at most one synchronous request allocated. So these requests need not be otherwise limited. The patch re-works fuse_get_req() to follow this idea. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-04-17fuse: add flag fc->initializedMaxim Patlasov1-0/+3
Existing flag fc->blocked is used to suspend request allocation both in case of many background request submitted and period of time before init_reply arrives from userspace. Next patch will skip blocking allocations of synchronous request (disregarding fc->blocked). This is mostly OK, but we still need to suspend allocations if init_reply is not arrived yet. The patch introduces flag fc->initialized which will serve this purpose. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-04-17fuse: make request allocations for background processing explicitMaxim Patlasov1-0/+2
There are two types of processing requests in FUSE: synchronous (via fuse_request_send()) and asynchronous (via adding to fc->bg_queue). Fortunately, the type of processing is always known in advance, at the time of request allocation. This preparatory patch utilizes this fact making fuse_get_req() aware about the type. Next patches will use it. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-03-04fs: Limit sys_mount to only request filesystem modules.Eric W. Biederman1-0/+2
Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-27Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-26fs: encode_fh: return FILEID_INVALID if invalid fid_typeNamjae Jeon1-1/+1
This patch is a follow up on below patch: [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Vivek Trivedi <t.vivek@samsung.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-07fuse: allow control of adaptive readdirplus useEric Wong1-1/+3
For some filesystems (e.g. GlusterFS), the cost of performing a normal readdir and readdirplus are identical. Since adaptively using readdirplus has no benefit for those systems, give users/filesystems the option to control adaptive readdirplus use. v2 of this patch incorporates Miklos's suggestion to simplify the code, as well as improving consistency of macro names and documentation. Signed-off-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-31FUSE: Adapt readdirplus to application usage patternsFeng Shuo1-0/+1
Use the same adaptive readdirplus mechanism as NFS: http://permalink.gmane.org/gmane.linux.nfs/49299 If the user space implementation wants to disable readdirplus temporarily, it could just return ENOTSUPP. Then kernel will recall it with readdir. Signed-off-by: Feng Shuo <steve.shuo.feng@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-31Do not use RCU for current process credentialsAnatol Pomozov1-1/+1
Commit c69e8d9c0 added rcu lock to fuse/dir.c It was assuming that 'task' is some other process but in fact this parameter always equals to 'current'. Inline this parameter to make it more readable and remove RCU lock as it is not needed when access current process credentials. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24fuse: categorize fuse_get_req()Maxim Patlasov1-1/+1
The patch categorizes all fuse_get_req() invocations into two categories: - fuse_get_req_nopages(fc) - when caller doesn't care about req->pages - fuse_get_req(fc, n) - when caller need n page pointers (n > 0) Adding fuse_get_req_nopages() helps to avoid numerous fuse_get_req(fc, 0) scattered over code. Now it's clear from the first glance when a caller need fuse_req with page pointers. The patch doesn't make any logic changes. In multi-page case, it silly allocates array of FUSE_MAX_PAGES_PER_REQ page pointers. This will be amended by future patches. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24fuse: general infrastructure for pages[] of variable sizeMaxim Patlasov1-2/+2
The patch removes inline array of FUSE_MAX_PAGES_PER_REQ page pointers from fuse_req. Instead of that, req->pages may now point either to small inline array or to an array allocated dynamically. This essentially means that all callers of fuse_request_alloc[_nofs] should pass the number of pages needed explicitly. The patch doesn't make any logic changes. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24fuse: implement NFS-like readdirplus supportAnand V. Avati1-1/+4
This patch implements readdirplus support in FUSE, similar to NFS. The payload returned in the readdirplus call contains 'fuse_entry_out' structure thereby providing all the necessary inputs for 'faking' a lookup() operation on the spot. If the dentry and inode already existed (for e.g. in a re-run of ls -l) then just the inode attributes timeout and dentry timeout are refreshed. With a simple client->network->server implementation of a FUSE based filesystem, the following performance observations were made: Test: Performing a filesystem crawl over 20,000 files with sh# time ls -lR /mnt Without readdirplus: Run 1: 18.1s Run 2: 16.0s Run 3: 16.2s With readdirplus: Run 1: 4.1s Run 2: 3.8s Run 3: 3.8s The performance improvement is significant as it avoided 20,000 upcalls calls (lookup). Cache consistency is no worse than what already is. Signed-off-by: Anand V. Avati <avati@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-11-15userns: Support fuse interacting with multiple user namespacesEric W. Biederman1-9/+14
Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data. The connection between between a fuse filesystem and a fuse daemon is established when a fuse filesystem is mounted and provided with a file descriptor the fuse daemon created by opening /dev/fuse. For now restrict the communication of uids and gids between the fuse filesystem and the fuse daemon to the initial user namespace. Enforce this by verifying the file descriptor passed to the mount of fuse was opened in the initial user namespace. Ensuring the mount happens in the initial user namespace is not necessary as mounts from non-initial user namespaces are not yet allowed. In fuse_req_init_context convert the currrent fsuid and fsgid into the initial user namespace for the request that will be sent to the fuse daemon. In fuse_fill_attr convert the uid and gid passed from the fuse daemon from the initial user namespace into kuids and kgids. In iattr_to_fattr called from fuse_setattr convert kuids and kgids into the uids and gids in the initial user namespace before passing them to the fuse filesystem. In fuse_change_attributes_common called from fuse_dentry_revalidate, fuse_permission, fuse_geattr, and fuse_setattr, and fuse_iget convert the uid and gid from the fuse daemon into a kuid and a kgid to store on the fuse inode. By default fuse mounts are restricted to task whose uid, suid, and euid matches the fuse user_id and whose gid, sgid, and egid matches the fuse group id. Convert the user_id and group_id mount options into kuids and kgids at mount time, and use uid_eq and gid_eq to compare the in fuse_allow_task. Cc: Miklos Szeredi <miklos@szeredi.hu> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-10-03fs: push rcu_barrier() from deactivate_locked_super() to filesystemsKirill A. Shutemov1-0/+6
There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-30cuse: fix fuse_conn_kill()Miklos Szeredi1-5/+7
fuse_conn_kill() removed fc->entry, called fuse_ctl_remove_conn() and fuse_bdi_destroy(). None of which is appropriate for cuse cleanup. The fuse_ctl_remove_conn() decrements the nlink on the control filesystem, which is totally bogus. The others are harmless but unnecessary. So move these out from fuse_conn_kill() to fuse_put_super() where they belong. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-07-18fuse: add missing INIT flagsMiklos Szeredi1-1/+2
Add missing flags that userspace derived from the protocol version number. This makes the protocol more flexible. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-07-18fuse: invalidate inode mapping if mtime changesBrian Foster1-3/+24
We currently invalidate the inode address space mapping if the file size changes unexpectedly. In the case of a fuse network filesystem, a portion of a file could be overwritten remotely without changing the file size. Compare the old mtime as well to detect this condition and invalidate the mapping if the file has been updated. The original logic (to ignore changes in mtime) is preserved unless the client specifies FUSE_AUTO_INVAL_DATA on init. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-07-18fuse: add FUSE_AUTO_INVAL_DATA init flagBrian Foster1-1/+3
FUSE_AUTO_INVAL_DATA is provided to enable updated/auto cache invalidation logic. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-06-05Merge branch 'for-linus' of ↵Linus Torvalds1-1/+16
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix blksize calculation fuse: fix stat call on 32 bit platforms fuse: optimize fallocate on permanent failure fuse: add FALLOCATE operation fuse: Convert to kstrtoul_from_user