summaryrefslogtreecommitdiff
path: root/fs/ceph
AgeCommit message (Collapse)AuthorFilesLines
2014-10-15Merge branch 'for-linus' of ↵Linus Torvalds12-177/+386
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is the long-awaited discard support for RBD (Guangliang Zhao, Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's (Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and performance and debugging improvements (Yan, Zheng, John Spray), and a smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) ceph: fix divide-by-zero in __validate_layout() rbd: rbd workqueues need a resque worker libceph: ceph-msgr workqueue needs a resque worker ceph: fix bool assignments libceph: separate multiple ops with commas in debugfs output libceph: sync osd op definitions in rados.h libceph: remove redundant declaration ceph: additional debugfs output ceph: export ceph_session_state_name function ceph: include the initial ACL in create/mkdir/mknod MDS requests ceph: use pagelist to present MDS request data libceph: reference counting pagelist ceph: fix llistxattr on symlink ceph: send client metadata to MDS ceph: remove redundant code for max file size verification ceph: remove redundant io_iter_advance() ceph: move ceph_find_inode() outside the s_mutex ceph: request xattrs if xattr_version is zero rbd: set the remaining discard properties to enable support rbd: use helpers to handle discard for layered images correctly ...
2014-10-14ceph: fix divide-by-zero in __validate_layout()Yan, Zheng1-1/+1
The 'stripe_unit' field is 64 bits, casting it to 32 bits can result zero. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14ceph: fix bool assignmentsFabian Frederick1-13/+13
Fix some coccinelle warnings: fs/ceph/caps.c:2400:6-10: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2401:6-15: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2402:6-17: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2403:6-22: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2404:6-22: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2405:6-19: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2440:4-20: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2469:3-16: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2490:2-18: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2519:3-7: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2549:3-12: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2575:2-6: WARNING: Assignment of bool to 0/1 fs/ceph/caps.c:2589:3-7: WARNING: Assignment of bool to 0/1 Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-10-14ceph: additional debugfs outputJohn Spray2-0/+47
MDS session state and client global ID is useful instrumentation when testing. Signed-off-by: John Spray <john.spray@redhat.com>
2014-10-14ceph: export ceph_session_state_name functionJohn Spray2-7/+9
...so that it can be used from the ceph debugfs code when dumping session info. Signed-off-by: John Spray <john.spray@redhat.com>
2014-10-14ceph: include the initial ACL in create/mkdir/mknod MDS requestsYan, Zheng4-47/+170
Current code set new file/directory's initial ACL in a non-atomic manner. Client first sends request to MDS to create new file/directory, then set the initial ACL after the new file/directory is successfully created. The fix is include the initial ACL in create/mkdir/mknod MDS requests. So MDS can handle creating file/directory and setting the initial ACL in one request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14ceph: use pagelist to present MDS request dataYan, Zheng3-39/+28
Current code uses page array to present MDS request data. Pages in the array are allocated/freed by caller of ceph_mdsc_do_request(). If request is interrupted, the pages can be freed while they are still being used by the request message. The fix is use pagelist to present MDS request data. Pagelist is reference counted. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14libceph: reference counting pagelistYan, Zheng1-1/+0
this allow pagelist to present data that may be sent multiple times. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14ceph: fix llistxattr on symlinkYan, Zheng1-2/+1
only regular file and directory have vxattrs. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14ceph: send client metadata to MDSJohn Spray1-1/+70
Implement version 2 of CEPH_MSG_CLIENT_SESSION syntax, which includes additional client metadata to allow the MDS to report on clients by user-sensible names like hostname. Signed-off-by: John Spray <john.spray@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
2014-10-14ceph: remove redundant code for max file size verificationChao Yu2-15/+0
Both ceph_update_writeable_page and ceph_setattr will verify file size with max size ceph supported. There are two caller for ceph_update_writeable_page, ceph_write_begin and ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, we have no chance to change file size when mmap. Likewise we have already verified the size in inode_change_ok when we call ceph_setattr. So let's remove the redundant code for max file size verification. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
2014-10-14ceph: remove redundant io_iter_advance()Yan, Zheng1-1/+0
ceph_sync_read and generic_file_read_iter() have already advanced the IO iterator. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14ceph: move ceph_find_inode() outside the s_mutexYan, Zheng2-8/+10
ceph_find_inode() may wait on freeing inode, using it inside the s_mutex may cause deadlock. (the freeing inode is waiting for OSD read reply, but dispatch thread is blocked by the s_mutex) Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14ceph: request xattrs if xattr_version is zeroYan, Zheng5-30/+20
Following sequence of events can happen. - Client releases an inode, queues cap release message. - A 'lookup' reply brings the same inode back, but the reply doesn't contain xattrs because MDS didn't receive the cap release message and thought client already has up-to-data xattrs. The fix is force sending a getattr request to MDS if xattrs_version is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client does not have xattr. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14ceph: make sure request isn't in any waiting list when kicking request.Yan, Zheng1-0/+1
we may corrupt waiting list if a request in the waiting list is kicked. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14ceph: protect kick_requests() with mdsc->mutexYan, Zheng1-2/+3
Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14ceph: trim unused inodes before reconnecting to recovering MDSYan, Zheng1-10/+13
So the recovering MDS does not need to fetch these ununsed inodes during cache rejoin. This may reduce MDS recovery time. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-09vfs: Remove d_drop calls from d_revalidate implementationsEric W. Biederman1-1/+0
Now that d_invalidate always succeeds it is not longer necessary or desirable to hard code d_drop calls into filesystem specific d_revalidate implementations. Remove the unnecessary d_drop calls and rely on d_invalidate to drop the dentries. Using d_invalidate ensures that paths to mount points will not be dropped. Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-14Merge branch 'for-linus' of ↵Linus Torvalds5-17/+43
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is a lot of refactoring and hardening of the libceph and rbd code here from Ilya that fix various smaller bugs, and a few more important fixes with clone overlap. The main fix is a critical change to the request_fn handling to not sleep that was exposed by the recent mutex changes (which will also go to the 3.16 stable series). Yan Zheng has several fixes in here for CephFS fixing ACL handling, time stamps, and request resends when the MDS restarts. Finally, there are a few cleanups from Himangi Saraogi based on Coccinelle" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits) libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly rbd: remove extra newlines from rbd_warn() messages rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC rbd: rework rbd_request_fn() ceph: fix kick_requests() ceph: fix append mode write ceph: fix sizeof(struct tYpO *) typo ceph: remove redundant memset(0) rbd: take snap_id into account when reading in parent info rbd: do not read in parent info before snap context rbd: update mapping size only on refresh rbd: harden rbd_dev_refresh() and callers a bit rbd: split rbd_dev_spec_update() into two functions rbd: remove unnecessary asserts in rbd_dev_image_probe() rbd: introduce rbd_dev_header_info() rbd: show the entire chain of parent images ceph: replace comma with a semicolon rbd: use rbd_segment_name_free() instead of kfree() ceph: check zero length in ceph_sync_read() ceph: reset r_resend_mds after receiving -ESTALE ...
2014-08-07dcache: d_obtain_alias callers don't all want DISCONNECTEDJ. Bruce Fields1-1/+1
There are a few d_obtain_alias callers that are using it to get the root of a filesystem which may already have an alias somewhere else. This is not the same as the filehandle-lookup case, and none of them actually need DCACHE_DISCONNECTED set. It isn't really a serious problem, but it would really be clearer if we reserved DCACHE_DISCONNECTED for those cases where it's actually needed. In the btrfs case this was causing a spurious printk from nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED manually in 3a0dfa6a12e "Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol", and this replaces that workaround. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07ceph: fix kick_requests()Yan, Zheng1-2/+3
__do_request() may unregister the request. So we should update iterator 'p' before calling __do_request() Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-28ceph: fix append mode writeYan, Zheng1-6/+5
generic_write_checks() may update 'pos', so we need to pass 'pos' to ceph_sync_write() and ceph_sync_direct_write(); Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-28ceph: fix sizeof(struct tYpO *) typoIlya Dryomov1-1/+1
struct ceph_xattr -> struct ceph_inode_xattr Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-28ceph: remove redundant memset(0)Ilya Dryomov1-1/+1
xattrs array of pointers is allocated with kcalloc() - no need to memset() it to 0 right after that. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-24ceph: replace comma with a semicolonHimangi Saraogi1-1/+1
Replace a comma between expression statements by a semicolon. This changes the semantics of the code, but given the current indentation appears to be what is intended. A simplified version of the Coccinelle semantic patch that performs this transformation is as follows: // <smpl> @r@ expression e1,e2; @@ e1 -, +; e2; // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-07-21ceph: check zero length in ceph_sync_read()Yan, Zheng1-0/+3
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-14ceph: reset r_resend_mds after receiving -ESTALEYan, Zheng1-0/+1
this makes __choose_mds() choose mds according caps Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-08ceph: properly apply umask when ACL is enabledYan, Zheng1-2/+12
when ACL is enabled, posix_acl_create() may change inode's mode Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-08ceph: pass proper page offset to copy_page_to_iter()Yan, Zheng1-2/+5
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-08ceph: include time stamp in replayed MDS requestsYan, Zheng1-2/+8
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-08ceph: check unsupported fallocate modeYan, Zheng1-0/+3
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-13Merge branch 'for-linus' of ↵Linus Torvalds8-231/+310
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This has a mix of bug fixes and cleanups. Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT check when a second rbd image is mapped and a couple memory leaks. Zheng fixes several issues with fragmented directories and multiple MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's patches fix setting and unsetting RBD images read-only. Naturally there are several other cleanups mixed in for good measure" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) rbd: only set disk to read-only once rbd: move calls that may sleep out of spin lock range rbd: add ioctl for rbd ceph: use truncate_pagecache() instead of truncate_inode_pages() ceph: include time stamp in every MDS request rbd: fix ida/idr memory leak rbd: use reference counts for image requests rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() rbd: make sure we have latest osdmap on 'rbd map' libceph: add ceph_monc_wait_osdmap() libceph: mon_get_version request infrastructure libceph: recognize poolop requests in debugfs ceph: refactor readpage_nounlock() to make the logic clearer mds: check cap ID when handling cap export message ceph: remember subtree root dirfrag's auth MDS ceph: introduce ceph_fill_fragtree() ceph: handle cap import atomically ceph: pre-allocate ceph_cap struct for ceph_add_cap() ceph: update inode fields according to issued caps rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO ...
2014-06-12Merge branch 'for-linus' of ↵Linus Torvalds2-115/+74
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
2014-06-12ceph: switch to iter_file_splice_write()Al Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-08ceph: use truncate_pagecache() instead of truncate_inode_pages()Yan, Zheng1-2/+2
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-07fs/ceph/debugfs.c: replace seq_printf by seq_putsFabian Frederick1-3/+3
Replace seq_printf where possible. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-07fs/ceph: replace pr_warning by pr_warnFabian Frederick4-6/+6
Update the last pr_warning callsites in fs branch Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ceph: include time stamp in every MDS requestSage Weil2-1/+9
We recently modified the client/MDS protocol to include a timestamp in the client request. This allows ctime updates to follow the client's clock in most cases, which avoids subtle problems when clocks are out of sync and timestamps are updated sometimes by the MDS clock (for most requests) and sometimes by the client clock (for cap writeback). Signed-off-by: Sage Weil <sage@inktank.com>
2014-06-06ceph: refactor readpage_nounlock() to make the logic clearerZhang Zhen1-10/+7
If the return value of ceph_osdc_readpages() is not negative, it is certainly greater than or equal to zero. Remove the useless condition judgment and redundant braces. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06mds: check cap ID when handling cap export messageYan, Zheng1-1/+1
handle following sequence of events: - mds0 exports an inode to mds1. client receives the cap import message from mds1. caps from mds0 are removed while handling the cap import message. - mds1 exports an inode to mds0. client receives the cap export message from mds1. handle_cap_export() adds placeholder caps for mds0 - client receives the first cap export message (for exporting inode from mds0 to mds1) Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06ceph: remember subtree root dirfrag's auth MDSYan, Zheng1-1/+7
remember dirfrag's auth MDS when it's different from its parent inode's auth MDS. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06ceph: introduce ceph_fill_fragtree()Yan, Zheng1-45/+84
Move the code that update the i_fragtree into a separate function. Also add simple probabilistic test to decide whether the i_fragtree should be updated Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06ceph: handle cap import atomicallyYan, Zheng1-45/+54
cap import messages are processed by both handle_cap_import() and handle_cap_grant(). These two functions are not executed in the same atomic context, so they can races with cap release. The fix is make handle_cap_import() not release the i_ceph_lock when it returns. Let handle_cap_grant() release the lock after it finishes its job. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06ceph: pre-allocate ceph_cap struct for ceph_add_cap()Yan, Zheng3-79/+85
So that ceph_add_cap() can be used while i_ceph_lock is locked. This simplifies the code that handle cap import/export. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06ceph: update inode fields according to issued capsYan, Zheng2-57/+71
Cap message and request reply from non-auth MDS may carry stale information (corresponding locks are in LOCK states) even they have the newest inode version. So client should update inode fields according to issued caps. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06ceph: queue vmtruncate if necessary when handing cap grant/revokeYan, Zheng1-10/+16
cap grant/revoke message from non-auth MDS can update inode's size and truncate_seq/truncate_size. (the message arrives before auth MDS's cap trunc message) Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06ceph: remove useless ACL checkZhang Zhen1-6/+0
posix_acl_xattr_set() already does the check, and it's the only way to feed in an ACL from userspace. So the check here is useless, remove it. Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-06ceph: ceph_get_parent() can be staticFengguang Wu1-1/+1
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-02locks: ensure that fl_owner is always initialized properly in flock and ↵Jeff Layton1-8/+2
lease codepaths Currently, the fl_owner isn't set for flock locks. Some filesystems use byte-range locks to simulate flock locks and there is a common idiom in those that does: fl->fl_owner = (fl_owner_t)filp; fl->fl_start = 0; fl->fl_end = OFFSET_MAX; Since flock locks are generally "owned" by the open file description, move this into the common flock lock setup code. The fl_start and fl_end fields are already set appropriately, so remove the unneeded setting of that in flock ops in those filesystems as well. Finally, the lease code also sets the fl_owner as if they were owned by the process and not the open file description. This is incorrect as leases have the same ownership semantics as flock locks. Set them the same way. The lease code doesn't actually use the fl_owner value for anything, so this is more for consistency's sake than a bugfix. Reported-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jeff Layton <jlayton@poochiereds.net> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (Staging portion) Acked-by: J. Bruce Fields <bfields@fieldses.org>
2014-05-07ceph: switch to ->write_iter()Al Viro1-31/+26
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>