kernel/linux.git/fs/ceph/locks.c, branch v7.2-rc1

ceph: add client reset state machine and session teardown

2026-06-22T20:45:00+00:00

Add the client-side reset state machine, request gating, and manual session teardown implementation.

Manual reset is an operator-triggered escape hatch for client/MDS stalemates in which caps, locks, or unsafe metadata state stop making forward progress. The reset blocks new metadata work, attempts a bounded best-effort drain of dirty client state while sessions are still alive, and finally asks the MDS to close sessions before tearing local session state down directly.

The reset state machine tracks four phases: IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by schedule_reset() before the workqueue item is dispatched, so that new metadata requests and file-lock acquisitions are gated immediately -- even before the work function begins running. All non-IDLE phases block callers on blocked_wq, preventing races with session teardown.

The drain phase flushes mdlog state, dirty caps, and pending cap releases for a bounded interval. State that still cannot make progress within that interval is discarded during teardown, which is the point of the reset: break the stalemate and allow fresh sessions to rebuild clean state.

The session teardown follows the established check_new_map() forced-close pattern: unregister sessions under mdsc->mutex, then clean up caps and requests under s->s_mutex. Reconnect is not attempted because the MDS only accepts reconnects during its own RECONNECT phase after restart, not from an active client.

Blocked callers are released when reset completes and observe the final result via -EAGAIN (reset failed) or 0 (success). Internal work-function errors such as -ENOMEM are not propagated to unrelated callers like open() or flock(); the detailed error remains in debugfs and tracepoints.

The work function checks st->shutdown before each phase transition (DRAINING, TEARDOWN) so that a concurrent ceph_mdsc_destroy() is not overwritten. If destroy already took ownership, the work function releases session references and returns without touching the state.

The timeout calculation for blocked-request waiters uses max_t() to prevent jiffies underflow when the deadline has already passed.

The close-grace sleep before teardown is a best-effort nudge to let queued REQUEST_CLOSE messages egress; it is not a correctness requirement since the MDS still has session_autoclose as a fallback.

The destroy path marks reset as failed and wakes blocked waiters before cancel_work_sync() so unmount does not stall.

Signed-off-by: Alex Markuze
Reviewed-by: Viacheslav Dubeyko
Signed-off-by: Viacheslav Dubeyko
Signed-off-by: Ilya Dryomov
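
The phase order, the gating rule, and the deadline clamp described in the reset commit above can be illustrated with a small userspace sketch. All identifiers here (demo_reset_phase, demo_reset_next(), demo_reset_blocks_callers(), demo_remaining_ticks()) are hypothetical and are not taken from the patch; the clamp is written as a plain signed comparison standing in for the kernel's max_t(), on the assumption that the waiter computes "deadline minus now" in ticks.

```c
#include <stdbool.h>

/* Hypothetical names: the commit message names the phases, not the identifiers. */
enum demo_reset_phase {
	DEMO_RESET_IDLE,
	DEMO_RESET_QUIESCING,
	DEMO_RESET_DRAINING,
	DEMO_RESET_TEARDOWN,
};

/* Advance one step along IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE. */
static enum demo_reset_phase demo_reset_next(enum demo_reset_phase p)
{
	switch (p) {
	case DEMO_RESET_IDLE:      return DEMO_RESET_QUIESCING;
	case DEMO_RESET_QUIESCING: return DEMO_RESET_DRAINING;
	case DEMO_RESET_DRAINING:  return DEMO_RESET_TEARDOWN;
	case DEMO_RESET_TEARDOWN:  return DEMO_RESET_IDLE;
	}
	return DEMO_RESET_IDLE;
}

/* Every non-IDLE phase gates new metadata requests and lock acquisitions. */
static bool demo_reset_blocks_callers(enum demo_reset_phase p)
{
	return p != DEMO_RESET_IDLE;
}

/*
 * Ticks left until the deadline, clamped to zero. Without the clamp, an
 * already-passed deadline would turn the unsigned difference into a huge
 * timeout instead of "do not wait".
 */
static long demo_remaining_ticks(unsigned long now, unsigned long deadline)
{
	long left = (long)(deadline - now);

	return left > 0 ? left : 0;
}
```

For example, demo_remaining_ticks(150, 100) yields 0 rather than a wrapped-around value, which is the behavior the message attributes to the max_t() guard.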

ceph: convert inode flags to named bit positions and atomic bitops

2026-06-22T20:44:42+00:00

Define named bit-position constants for all CEPH_I_* inode flags and derive the bitmask values from them. This gives every flag a named _BIT constant usable with the test_bit/set_bit/clear_bit family. The intentionally unused bit position 1 is documented inline.

Convert all flag modifications to use atomic bitops (set_bit, clear_bit, test_and_clear_bit). The previous code mixed lockless atomic ops on some flags (ERROR_WRITE, ODIRECT) with non-atomic read-modify-write (|= / &= ~) on other flags sharing the same unsigned long. A concurrent non-atomic RMW can clobber an adjacent lockless atomic update -- for example, a lockless clear_bit(ERROR_WRITE) could be silently resurrected by a concurrent ci->i_ceph_flags |= CEPH_I_FLUSH under the spinlock. Using atomic bitops for all modifications eliminates this class of race entirely.

Flags whose only users are now the _BIT form (ERROR_WRITE, ASYNC_CHECK_CAPS) have their old mask defines removed to document that callers must use the _BIT constant with the set_bit/test_bit family. ERROR_FILELOCK and SHUTDOWN retain their mask defines because they are still used via bitmask tests in lockless readers (ceph_inode_is_shutdown, reconnect_caps_cb).

The direct assignment in ceph_finish_async_create() is converted from i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit(). This inode is I_NEW at this point -- still invisible to other threads and guaranteed to have zero flags from alloc_inode -- so either form is safe, but set_bit() keeps the conversion uniform.

Signed-off-by: Alex Markuze
Reviewed-by: Viacheslav Dubeyko
Signed-off-by: Viacheslav Dubeyko
Signed-off-by: Ilya Dryomov
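
The pattern of deriving masks from named bit positions, and of doing every modification through an atomic read-modify-write, can be sketched in userspace with C11 atomics. This is only an approximation of the kernel's set_bit/clear_bit family: the DEMO_* names and the bit numbers chosen are assumptions for illustration, not the real CEPH_I_* layout (only the "bit 1 is intentionally unused" detail comes from the commit message).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag layout; bit 1 is left unused on purpose. */
enum {
	DEMO_FLUSH_BIT       = 0,
	/* bit position 1 intentionally unused */
	DEMO_ERROR_WRITE_BIT = 2,
};

/* Masks are derived from the bit positions, never written by hand. */
#define DEMO_FLUSH       (1UL << DEMO_FLUSH_BIT)
#define DEMO_ERROR_WRITE (1UL << DEMO_ERROR_WRITE_BIT)

/* Userspace stand-ins for the kernel bitops, built on atomic RMW ops. */
static void demo_set_bit(int nr, atomic_ulong *word)
{
	atomic_fetch_or(word, 1UL << nr);
}

static void demo_clear_bit(int nr, atomic_ulong *word)
{
	atomic_fetch_and(word, ~(1UL << nr));
}

static bool demo_test_bit(int nr, atomic_ulong *word)
{
	return (atomic_load(word) & (1UL << nr)) != 0;
}

/* Clears the bit and reports whether it was set beforehand. */
static bool demo_test_and_clear_bit(int nr, atomic_ulong *word)
{
	unsigned long old = atomic_fetch_and(word, ~(1UL << nr));

	return (old & (1UL << nr)) != 0;
}
```

Because each helper is a single atomic read-modify-write on the shared word, a demo_set_bit(DEMO_FLUSH_BIT, ...) in one thread cannot write back a stale copy of the ERROR_WRITE bit that another thread just cleared, which is the race a plain `flags |= mask` allows.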

ceph: add checking of wait_for_completion_killable() return value

2025-10-08T21:30:46+00:00

The Coverity Scan service has detected a call to wait_for_completion_killable() in ceph_lock_wait_for_completion() whose return value is not checked [1]. The CID 1636232 defect report explains: "If the function returns an error value, the error value may be mistaken for a normal value. In ceph_lock_wait_for_completion(): Value returned from a function is not checked for errors before being used. (CWE-252)".

This patch checks the return value of wait_for_completion_killable() and returns the error code from ceph_lock_wait_for_completion().

[1] https://scan5.scan.coverity.com/#/project-view/64304/10063?selectedIssue=1636232

Signed-off-by: Viacheslav Dubeyko
Reviewed-by: Alex Markuze
Signed-off-by: Ilya Dryomov
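
The shape of this fix, checking a killable wait and propagating its error, can be shown with a userspace stand-in. Everything below is hypothetical: demo_wait_for_completion_killable() only mimics the kernel contract (0 on completion, -ERESTARTSYS when a fatal signal interrupts the wait), DEMO_ERESTARTSYS is defined locally because that errno is kernel-internal, and the real ceph_lock_wait_for_completion() does more work than this sketch.

```c
#include <stdbool.h>

/* Kernel-internal errno value, defined locally for this sketch. */
#define DEMO_ERESTARTSYS 512

/* Hypothetical completion: 'done' stays false if the waiter was killed first. */
struct demo_completion {
	bool done;
};

/*
 * Stand-in for wait_for_completion_killable(): returns 0 when the completion
 * fired, or a negative error when the wait was interrupted by a fatal signal.
 */
static int demo_wait_for_completion_killable(const struct demo_completion *c)
{
	return c->done ? 0 : -DEMO_ERESTARTSYS;
}

/* The fixed pattern: check the return value and hand the error to the caller. */
static int demo_lock_wait_for_completion(const struct demo_completion *c)
{
	int err = demo_wait_for_completion_killable(c);

	if (err)
		return err;	/* previously this result was silently dropped */

	return 0;
}
```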

ceph: adapt to breakup of struct file_lock

2024-02-05T12:11:42+00:00

Most of the existing APIs have remained the same, but subsystems that access file_lock fields directly need to reach into struct file_lock_core now.

Signed-off-by: Jeff Layton
Link: https://lore.kernel.org/r/20240131-flsplit-v3-36-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown
Signed-off-by: Christian Brauner

filelock: split common fields into struct file_lock_core

2024-02-05T12:11:38+00:00

In a future patch, we're going to split file leases into their own structure. Since a lot of the underlying machinery uses the same fields, move those into a new file_lock_core, and embed that inside struct file_lock.

For now, add some macros to ensure that we can continue to build while the conversion is in progress.

Signed-off-by: Jeff Layton
Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown
Signed-off-by: Christian Brauner
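
The embedding described above, together with a transitional alias macro, can be sketched as follows. The demo_ structures are simplified stand-ins rather than the real definitions; the field names are modeled loosely on the flc_ prefix convention and should be treated as assumptions, as should the demo_fl_type alias.

```c
/* Hypothetical, trimmed-down core holding the fields shared by locks and leases. */
struct demo_file_lock_core {
	int flc_type;
	int flc_pid;
	unsigned int flc_flags;
};

/* The lock embeds the core and keeps only its lock-specific fields. */
struct demo_file_lock {
	struct demo_file_lock_core c;
	long long fl_start;
	long long fl_end;
};

/*
 * Transitional alias in the spirit of the commit's build-compat macros:
 * old-style accesses keep compiling while callers are converted.
 */
#define demo_fl_type c.flc_type

/* Converted code reaches into the embedded core explicitly. */
static int demo_lock_type(const struct demo_file_lock *fl)
{
	return fl->c.flc_type;
}
```

With the alias in place, `fl.demo_fl_type` and `fl.c.flc_type` name the same member, so a subsystem such as ceph can be converted in a separate patch without breaking the build in between.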

ceph: convert to using new filelock helpers

2024-02-05T12:11:35+00:00

Convert to using the new file locking helper functions.

Signed-off-by: Jeff Layton
Link: https://lore.kernel.org/r/20240131-flsplit-v3-7-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown
Signed-off-by: Christian Brauner

ceph: print cluster fsid and client global_id in all debug logs

2023-11-03T22:28:33+00:00

Multiple CephFS mounts on a host are increasingly common, so disambiguating messages like this is necessary and will make it easier to debug issues.

At the same time, this improves the debug logs to make troubleshooting easier, for example by printing the ino# instead of only the memory address of the corresponding inode, and by printing dentry names instead of the memory addresses of the corresponding dentries.

Link: https://tracker.ceph.com/issues/61590
Signed-off-by: Xiubo Li
Reviewed-by: Patrick Donnelly
Reviewed-by: Milind Changire
Signed-off-by: Ilya Dryomov

filelock: move file locking definitions to separate header file

2023-01-11T11:52:32+00:00

The file locking definitions have lived in fs.h since the dawn of time, but they are only used by a small subset of the source files that include it.

Move the file locking definitions to a new header file, and add the appropriate #include directives to the source files that need them. By doing this we trim down fs.h a bit and limit the amount of rebuilding that has to be done when we make changes to the file locking APIs.

Reviewed-by: Xiubo Li
Reviewed-by: Christian Brauner (Microsoft)
Reviewed-by: Christoph Hellwig
Reviewed-by: David Howells
Reviewed-by: Russell King (Oracle)
Acked-by: Chuck Lever
Acked-by: Joseph Qi
Acked-by: Steve French
Acked-by: Al Viro
Acked-by: Darrick J. Wong
Signed-off-by: Jeff Layton

ceph: avoid use-after-free in ceph_fl_release_lock()

2023-01-02T11:27:25+00:00

When ceph releases the file_lock, it tries to get the inode pointer from fl->fl_file, whose memory could already have been released by another thread in filp_close(), because the VFS layer does not take a reference on the file for fl->fl_file.

Switch to using ceph's dedicated lock info to track the inode. In ceph_fl_release_lock(), skip all operations if fl->fl_u.ceph.inode is not set, which should be the case for the request file_lock. fl->fl_u.ceph.inode is set when the lock is inserted into the inode lock list, which is when the lock is copied.

Link: https://tracker.ceph.com/issues/57986
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov
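
The ownership rule in this fix (the inode is recorded only when a lock is copied into the inode's list, and release is a no-op for locks that never recorded one) can be modeled in a few lines. The types and helpers below are hypothetical simplifications: a plain counter stands in for whatever reference the real code holds, and none of the names are taken from fs/ceph/locks.c.

```c
#include <stddef.h>

/* Hypothetical inode with a count of locks that currently track it. */
struct demo_inode {
	int filelock_ref;
};

/* Hypothetical lock; 'inode' plays the role of fl->fl_u.ceph.inode. */
struct demo_lock {
	struct demo_inode *inode;
};

/* Copying a lock into the inode's list is the only place 'inode' gets set. */
static void demo_copy_lock(struct demo_lock *dst, struct demo_inode *inode)
{
	dst->inode = inode;
	inode->filelock_ref++;
}

/*
 * Release never looks at the file: a request lock that was never copied has
 * no inode recorded, so there is nothing to undo and nothing to dereference.
 */
static void demo_release_lock(struct demo_lock *fl)
{
	if (!fl->inode)
		return;

	fl->inode->filelock_ref--;
	fl->inode = NULL;
}
```

Releasing an uncopied request lock therefore touches no inode state at all, which is what removes the dependency on a possibly freed fl->fl_file.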

ceph: switch to vfs_inode_has_locks() to fix file lock bug

2023-01-02T11:27:25+00:00

POSIX locks all use the same owner, which is the thread id, and multiple POSIX locks can be merged into a single one, so checking whether the 'file' has locks may fail.

A file where some openers use locking and others don't is a really odd usage pattern, though. Locks are like stoplights -- they only work if everyone pays attention to them.

Just switch ceph_get_caps() to check whether any locks are set on the inode. If there are POSIX/OFD/FLOCK locks on the file at the time, we should set CHECK_FILELOCK, regardless of what fd was used to set the lock.

Fixes: ff5d913dfc71 ("ceph: return -EIO if read/write against filp that lost file locks")
Signed-off-by: Xiubo Li
Reviewed-by: Jeff Layton
Reviewed-by: Ilya Dryomov
Signed-off-by: Ilya Dryomov