| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit a2a513be7139b279f1b5b2cee59c6c4950c34346 upstream.
Ignoring the explicit_open mount option on mount for devices that do not
have a limit on the number of open zones must be done after the mount
options are parsed and set in s_mount_opts. Move the check to ignore
the explicit_open option after the call to zonefs_parse_options() in
zonefs_fill_super().
Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 10e14073107dd0b6d97d9516a02845a8e501c2c9 upstream.
Commit b35250c0816c ("writeback: Protect inode->i_io_list with
inode->i_lock") made inode->i_io_list not only protected by
wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
was missed. Add lock there and also update comment describing
things protected by inode->i_lock. This also fixes a race where
__mark_inode_dirty() could move inode under flush worker's hands
and thus sync(2) could miss writing some inodes.
Fixes: b35250c0816c ("writeback: Protect inode->i_io_list with inode->i_lock")
Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
CC: stable@vger.kernel.org
Signed-off-by: Jchao Sun <sunjunchao2870@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4c14d7043fede258957d7b01da0cad2d9fe3a205 upstream.
Currently, the secondary channels of a multichannel session
also get hostname populated based on the info in primary channel.
However, this will end up with a wrong resolution of hostname to
IP address during reconnect.
This change fixes this by not populating hostname info for all
secondary channels.
Fixes: 5112d80c162f ("cifs: populate server_hostname for extra channels")
Cc: stable@vger.kernel.org
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c36ee7dab7749f7be21f7a72392744490b2a9a2b upstream.
cifs.ko defines two file system types: cifs & smb3, and
__cifs_get_super() was not including smb3 file system type when
looking up superblocks, therefore failing to reconnect tcons in
cifs_tree_connect().
Fix this by calling iterate_supers_type() on both file system types.
Link: https://lore.kernel.org/r/CAFrh3J9soC36+BVuwHB=g9z_KB5Og2+p2_W+BBoBOZveErz14w@mail.gmail.com
Cc: stable@vger.kernel.org
Tested-by: Satadru Pramanik <satadru@gmail.com>
Reported-by: Satadru Pramanik <satadru@gmail.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8ea21823aa584b55ba4b861307093b78054b0c1b upstream.
During reconnects, we check the return value from
cifs_negotiate_protocol, and have handlers for both success
and failures. But if that passes, and cifs_setup_session
returns any errors other than -EACCES, we do not handle
that. This fix adds a handler for that, so that we don't
go ahead and try a tree_connect on a failed session.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7ef93ffccd55fb0ba000ed16ef6a81cd7dee07b5 ]
We should not be including unused smb20 specific code when legacy
support is disabled (CONFIG_CIFS_ALLOW_INSECURE_LEGACY turned
off). For example smb2_operations and smb2_values aren't used
in that case. Over time we can move more and more SMB1/CIFS and SMB2.0
code into the insecure legacy ifdefs
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1b2ba3c5616e17ff951359e25c658a1c3f146f1e ]
Before waiting for a request's safe reply, we will send the mdlog flush
request to the relevant MDS. And this will also flush the mdlog for all
the other unsafe requests in the same session, so we can record the last
session and no need to flush mdlog again in the next loop. But there
still have cases that it may send the mdlog flush requst twice or more,
but that should be not often.
Rename wait_unsafe_requests() to
flush_mdlog_and_wait_mdsc_unsafe_requests() to make it more
descriptive.
[xiubli: fold in MDS request refcount leak fix from Jeff]
URL: https://tracker.ceph.com/issues/55284
URL: https://tracker.ceph.com/issues/55411
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d7a2dc523085f8b8c60548ceedc696934aefeb0e ]
`rctime' has been a pain point in cephfs due to its buggy
nature - inconsistent values reported and those sorts.
Fixing rctime is non-trivial needing an overall redesign
of the entire nested statistics infrastructure.
As a workaround, PR
http://github.com/ceph/ceph/pull/37938
allows this extended attribute to be manually set. This allows
users to "fixup" inconsistent rctime values. While this sounds
messy, its probably the wisest approach allowing users/scripts
to workaround buggy rctime values.
The above PR enables Ceph MDS to allow manually setting
rctime extended attribute with the corresponding user-land
changes. We may as well allow the same to be done via kclient
for parity.
Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5366afc4065075a4456941fbd51c33604d631ee5 ]
When there are bursty connection requests,
RDMA connection event handler is deferred and
Negotiation requests are received even if
connection status is NEW.
To handle it, set the status to CONNECTED
if Negotiation requests are received.
Reported-by: Yufan Chen <wiz.chen@gmail.com>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Tested-by: Yufan Chen <wiz.chen@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1a702dc88e150487c9c173a249b3d236498b9183 ]
Previously the protection of kernfs_pr_cont_buf was piggy backed by
rename_lock, which means that pr_cont() needs to be protected under
rename_lock. This can cause potential circular lock dependencies.
If there is an OOM, we have the following call hierarchy:
-> cpuset_print_current_mems_allowed()
-> pr_cont_cgroup_name()
-> pr_cont_kernfs_name()
pr_cont_kernfs_name() will grab rename_lock and call printk. So we have
the following lock dependencies:
kernfs_rename_lock -> console_sem
Sometimes, printk does a wakeup before releasing console_sem, which has
the dependence chain:
console_sem -> p->pi_lock -> rq->lock
Now, imagine one wants to read cgroup_name under rq->lock, for example,
printing cgroup_name in a tracepoint in the scheduler code. They will
be holding rq->lock and take rename_lock:
rq->lock -> kernfs_rename_lock
Now they will deadlock.
A prevention to this circular lock dependency is to separate the
protection of pr_cont_buf from rename_lock. In principle, rename_lock
is to protect the integrity of cgroup name when copying to buf. Once
pr_cont_buf has got its content, rename_lock can be dropped. So it's
safe to drop rename_lock after kernfs_name_locked (and
kernfs_path_from_node_locked) and rely on a dedicated pr_cont_lock
to protect pr_cont_buf.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220516190951.3144144-1-haoluo@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2d1fe8a86bf5e0663866fd0da83c2af1e1b0e362 ]
In order to garantee migrated data be persisted during checkpoint,
otherwise out-of-order persistency between data and node may cause
data corruption after SPOR.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6949493884fe88500de4af182588e071cf1544ee ]
When doing layoutget as part of the open() compound, we have to be
careful to release the layout locks before we can call any further RPC
calls, such as setattr(). The reason is that those calls could trigger
a recall, which could deadlock.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dc2f78e2d4cc844a1458653d57ce1b54d4a29f21 ]
Syzbot triggers two WARNs in f2fs_is_valid_blkaddr and
__is_bitmap_valid. For example, in f2fs_is_valid_blkaddr,
if type is DATA_GENERIC_ENHANCE or DATA_GENERIC_ENHANCE_READ,
it invokes WARN_ON if blkaddr is not in the right range.
The call trace is as follows:
f2fs_get_node_info+0x45f/0x1070
read_node_page+0x577/0x1190
__get_node_page.part.0+0x9e/0x10e0
__get_node_page
f2fs_get_node_page+0x109/0x180
do_read_inode
f2fs_iget+0x2a5/0x58b0
f2fs_fill_super+0x3b39/0x7ca0
Fix these two WARNs by replacing WARN_ON with dump_stack.
Reported-by: syzbot+763ae12a2ede1d99d4dc@syzkaller.appspotmail.com
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 17eabd42560f4636648ad65ba5b20228071e2363 ]
In AFS, a directory is handled as a file that the client downloads and
parses locally for the purposes of performing lookup and getdents
operations. The in-kernel afs filesystem has a number of functions that
do this.
A directory file is arranged as a series of 2K blocks divided into
32-byte slots, where a directory entry occupies one or more slots, plus
each block starts with one or more metadata blocks.
When parsing a block, if the last slots are occupied by a dirent that
occupies more than a single slot and the file position points at a slot
that's not the initial one, the logic in afs_dir_iterate_block() that
skips over it won't advance the file pointer to the end of it. This
will cause an infinite loop in getdents() as it will keep retrying that
block and failing to advance beyond the final entry.
Fix this by advancing the file pointer if the next entry will be beyond
it when we skip a block.
This was found by the generic/676 xfstest but can also be triggered with
something like:
~/xfstests-dev/src/t_readdir_3 /xfstest.test/z 4000 1
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lore.kernel.org/r/165391973497.110268.2939296942213894166.stgit@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c14adb1cf70a984ed081c67e9d27bc3caad9537c ]
If jffs2_iget() or d_make_root() in jffs2_do_fill_super() returns
an error, we can observe the following kmemleak report:
--------------------------------------------
unreferenced object 0xffff888105a65340 (size 64):
comm "mount", pid 710, jiffies 4302851558 (age 58.239s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff859c45e5>] kmem_cache_alloc_trace+0x475/0x8a0
[<ffffffff86160146>] jffs2_sum_init+0x96/0x1a0
[<ffffffff86140e25>] jffs2_do_mount_fs+0x745/0x2120
[<ffffffff86149fec>] jffs2_do_fill_super+0x35c/0x810
[<ffffffff8614aae9>] jffs2_fill_super+0x2b9/0x3b0
[...]
unreferenced object 0xffff8881bd7f0000 (size 65536):
comm "mount", pid 710, jiffies 4302851558 (age 58.239s)
hex dump (first 32 bytes):
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
backtrace:
[<ffffffff858579ba>] kmalloc_order+0xda/0x110
[<ffffffff85857a11>] kmalloc_order_trace+0x21/0x130
[<ffffffff859c2ed1>] __kmalloc+0x711/0x8a0
[<ffffffff86160189>] jffs2_sum_init+0xd9/0x1a0
[<ffffffff86140e25>] jffs2_do_mount_fs+0x745/0x2120
[<ffffffff86149fec>] jffs2_do_fill_super+0x35c/0x810
[<ffffffff8614aae9>] jffs2_fill_super+0x2b9/0x3b0
[...]
--------------------------------------------
This is because the resources allocated in jffs2_sum_init() are not
released. Call jffs2_sum_exit() to release these resources to solve
the problem.
Fixes: e631ddba5887 ("[JFFS2] Add erase block summary support (mount time improvement)")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d21a580dafc69aa04f46e6099616146a536b0724 ]
The issue happens in a specific path in smb_check_perm_dacl(). When
"id" and "uid" have the same value, the function simply jumps out of
the loop without decrementing the reference count of the object
"posix_acls", which is increased by get_acl() earlier. This may
result in memory leaks.
Fix it by decreasing the reference count of "posix_acls" before
jumping to label "check_access_bits".
Fixes: 777cad1604d6 ("ksmbd: remove select FS_POSIX_ACL in Kconfig")
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f26967b9f7a830e228bb13fb41bd516ddd9d789d upstream.
log_read_rst() returns ENOMEM error when there is not enough memory.
In this case, if info is returned without initialization,
it attempts to kfree the uninitialized info->r_page pointer. This patch
moves the memset initialization code to before log_read_rst() is called.
Reported-by: Gerald Lee <sundaywind2004@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3a761d72fa62eec8913e45d29375344f61706541 upstream.
Make the two locations where exportfs helpers check permission to lookup
a given inode idmapped mount aware by switching it to the lookup_one()
helper. This is a bugfix for the open_by_handle_at() system call which
doesn't take idmapped mounts into account currently. It's not tied to a
specific commit so we'll just Cc stable.
In addition this is required to support idmapped base layers in overlay.
The overlay filesystem uses exportfs to encode and decode file handles
for its index=on mount option and when nfs_export=on.
Cc: <stable@vger.kernel.org>
Cc: <linux-fsdevel@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 00675017e0aeba5305665c52ded4ddce6a4c0231 upstream.
Similar to the addition of lookup_one() add a version of
lookup_one_unlocked() and lookup_one_positive_unlocked() that take
idmapped mounts into account. This is required to port overlay to
support idmapped base layers.
Cc: <linux-fsdevel@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5f41fdaea63ddf96d921ab36b2af4a90ccdb5744 upstream.
Make the test_dummy_encryption mount option require that the encrypt
feature flag be already enabled on the filesystem, rather than
automatically enabling it. Practically, this means that "-O encrypt"
will need to be included in MKFS_OPTIONS when running xfstests with the
test_dummy_encryption mount option. (ext4/053 also needs an update.)
Moreover, as long as the preconditions for test_dummy_encryption are
being tightened anyway, take the opportunity to start rejecting it when
!CONFIG_FS_ENCRYPTION rather than ignoring it.
The motivation for requiring the encrypt feature flag is that:
- Having the filesystem auto-enable feature flags is problematic, as it
bypasses the usual sanity checks. The specific issue which came up
recently is that in kernel versions where ext4 supports casefold but
not encrypt+casefold (v5.1 through v5.10), the kernel will happily add
the encrypt flag to a filesystem that has the casefold flag, making it
unmountable -- but only for subsequent mounts, not the initial one.
This confused the casefold support detection in xfstests, causing
generic/556 to fail rather than be skipped.
- The xfstests-bld test runners (kvm-xfstests et al.) already use the
required mkfs flag, so they will not be affected by this change. Only
users of test_dummy_encryption alone will be affected. But, this
option has always been for testing only, so it should be fine to
require that the few users of this option update their test scripts.
- f2fs already requires it (for its equivalent feature flag).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/20220519204437.61645-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
smb2_compound_op
commit 0a55cf74ffb5d004b93647e4389096880ce37d6b upstream.
There is a race condition in smb2_compound_op:
after_close:
num_rqst++;
if (cfile) {
cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server
cfile = NULL;
This is triggered by smb2_query_path_info operation that happens during
revalidate_dentry. In smb2_query_path_info, get_readable_path is called to
load the cfile, increasing the reference counter. If in the meantime, this
reference becomes the very last, this call to cifsFileInfo_put(cfile) will
trigger a SMB2_CLOSE request sent to the server just before sending this compound
request – and so then the compound request fails either with EBADF/EIO depending
on the timing at the server, because the handle is already closed.
In the first scenario, the race seems to be happening between smb2_query_path_info
triggered by the rename operation, and between “cleanup” of asynchronous writes – while
fsync(fd) likely waits for the asynchronous writes to complete, releasing the writeback
structures can happen after the close(fd) call. So the EBADF/EIO errors will pop up if
the timing is such that:
1) There are still outstanding references after close(fd) in the writeback structures
2) smb2_query_path_info successfully fetches the cfile, increasing the refcounter by 1
3) All writeback structures release the same cfile, reducing refcounter to 1
4) smb2_compound_op is called with that cfile
In the second scenario, the race seems to be similar – here open triggers the
smb2_query_path_info operation, and if all other threads in the meantime decrease the
refcounter to 1 similarly to the first scenario, again SMB2_CLOSE will be sent to the
server just before issuing the compound request. This case is harder to reproduce.
See https://bugzilla.samba.org/show_bug.cgi?id=15051
Cc: stable@vger.kernel.org
Fixes: 8de9e86c67ba ("cifs: create a helper to find a writeable handle by path name")
Signed-off-by: Ondrej Hubsch <ohubsch@purestorage.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ea16567f11018e2f58e72b667b0c803ff92b8153 upstream.
The cephfs kernel client started to show the message:
ceph: mds0 session blocklisted
when mounting a filesystem. This is due to the fact that the session
messages are being incorrectly decoded: the skip needs to take into
account the 'len'.
While there, fixed some whitespaces too.
Cc: stable@vger.kernel.org
Fixes: e1c9788cb397 ("ceph: don't rely on error_string to validate blocklisted session.")
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 376b9133826865568167b4091ef92a68c4622b87 upstream.
outstanding credits must be initialized to 0,
because it means the sum of credits consumed by
in-flight requests.
And outstanding credits must be compared with
total credits in smb2_validate_credit_charge(),
because total credits are the sum of credits
granted by ksmbd.
This patch fix the following error,
while frametest with Windows clients:
Limits exceeding the maximum allowable outstanding requests,
given : 128, pending : 8065
Fixes: b589f5db6d4a ("ksmbd: limits exceeding the maximum allowable outstanding requests")
Cc: stable@vger.kernel.org
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Reported-by: Yufan Chen <wiz.chen@gmail.com>
Tested-by: Yufan Chen <wiz.chen@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 863e0d81b6683c4cbc588ad831f560c90e494bef upstream.
When user_dlm_destroy_lock failed, it didn't clean up the flags it set
before exit. For USER_LOCK_IN_TEARDOWN, if this function fails because of
lock is still in used, next time when unlink invokes this function, it
will return succeed, and then unlink will remove inode and dentry if lock
is not in used(file closed), but the dlm lock is still linked in dlm lock
resource, then when bast come in, it will trigger a panic due to
user-after-free. See the following panic call trace. To fix this,
USER_LOCK_IN_TEARDOWN should be reverted if fail. And also error should
be returned if USER_LOCK_IN_TEARDOWN is set to let user know that unlink
fail.
For the case of ocfs2_dlm_unlock failure, besides USER_LOCK_IN_TEARDOWN,
USER_LOCK_BUSY is also required to be cleared. Even though spin lock is
released in between, but USER_LOCK_IN_TEARDOWN is still set, for
USER_LOCK_BUSY, if before every place that waits on this flag,
USER_LOCK_IN_TEARDOWN is checked to bail out, that will make sure no flow
waits on the busy flag set by user_dlm_destroy_lock(), then we can
simplely revert USER_LOCK_BUSY when ocfs2_dlm_unlock fails. Fix
user_dlm_cluster_lock() which is the only function not following this.
[ 941.336392] (python,26174,16):dlmfs_unlink:562 ERROR: unlink
004fb0000060000b5a90b8c847b72e1, error -16 from destroy
[ 989.757536] ------------[ cut here ]------------
[ 989.757709] kernel BUG at fs/ocfs2/dlmfs/userdlm.c:173!
[ 989.757876] invalid opcode: 0000 [#1] SMP
[ 989.758027] Modules linked in: ksplice_2zhuk2jr_ib_ipoib_new(O)
ksplice_2zhuk2jr(O) mptctl mptbase xen_netback xen_blkback xen_gntalloc
xen_gntdev xen_evtchn cdc_ether usbnet mii ocfs2 jbd2 rpcsec_gss_krb5
auth_rpcgss nfsv4 nfsv3 nfs_acl nfs fscache lockd grace ocfs2_dlmfs
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bnx2fc
fcoe libfcoe libfc scsi_transport_fc sunrpc ipmi_devintf bridge stp llc
rds_rdma rds bonding ib_sdp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm falcon_lsm_serviceable(PE) falcon_nf_netcontain(PE)
mlx4_vnic falcon_kal(E) falcon_lsm_pinned_13402(E) mlx4_ib ib_sa ib_mad
ib_core ib_addr xenfs xen_privcmd dm_multipath iTCO_wdt iTCO_vendor_support
pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core ipmi_ssif i2c_core ipmi_si
ipmi_msghandler
[ 989.760686] ioatdma sg ext3 jbd mbcache sd_mod ahci libahci ixgbe dca ptp
pps_core vxlan udp_tunnel ip6_udp_tunnel megaraid_sas mlx4_core crc32c_intel
be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 mdio
libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi wmi
dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
ksplice_2zhuk2jr_ib_ipoib_old]
[ 989.761987] CPU: 10 PID: 19102 Comm: dlm_thread Tainted: P OE
4.1.12-124.57.1.el6uek.x86_64 #2
[ 989.762290] Hardware name: Oracle Corporation ORACLE SERVER
X5-2/ASM,MOTHERBOARD,1U, BIOS 30350100 06/17/2021
[ 989.762599] task: ffff880178af6200 ti: ffff88017f7c8000 task.ti:
ffff88017f7c8000
[ 989.762848] RIP: e030:[<ffffffffc07d4316>] [<ffffffffc07d4316>]
__user_dlm_queue_lockres.part.4+0x76/0x80 [ocfs2_dlmfs]
[ 989.763185] RSP: e02b:ffff88017f7cbcb8 EFLAGS: 00010246
[ 989.763353] RAX: 0000000000000000 RBX: ffff880174d48008 RCX:
0000000000000003
[ 989.763565] RDX: 0000000000120012 RSI: 0000000000000003 RDI:
ffff880174d48170
[ 989.763778] RBP: ffff88017f7cbcc8 R08: ffff88021f4293b0 R09:
0000000000000000
[ 989.763991] R10: ffff880179c8c000 R11: 0000000000000003 R12:
ffff880174d48008
[ 989.764204] R13: 0000000000000003 R14: ffff880179c8c000 R15:
ffff88021db7a000
[ 989.764422] FS: 0000000000000000(0000) GS:ffff880247480000(0000)
knlGS:ffff880247480000
[ 989.764685] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 989.764865] CR2: ffff8000007f6800 CR3: 0000000001ae0000 CR4:
0000000000042660
[ 989.765081] Stack:
[ 989.765167] 0000000000000003 ffff880174d48040 ffff88017f7cbd18
ffffffffc07d455f
[ 989.765442] ffff88017f7cbd88 ffffffff816fb639 ffff88017f7cbd38
ffff8800361b5600
[ 989.765717] ffff88021db7a000 ffff88021f429380 0000000000000003
ffffffffc0453020
[ 989.765991] Call Trace:
[ 989.766093] [<ffffffffc07d455f>] user_bast+0x5f/0xf0 [ocfs2_dlmfs]
[ 989.766287] [<ffffffff816fb639>] ? schedule_timeout+0x169/0x2d0
[ 989.766475] [<ffffffffc0453020>] ? o2dlm_lock_ast_wrapper+0x20/0x20
[ocfs2_stack_o2cb]
[ 989.766738] [<ffffffffc045303a>] o2dlm_blocking_ast_wrapper+0x1a/0x20
[ocfs2_stack_o2cb]
[ 989.767010] [<ffffffffc0864ec6>] dlm_do_local_bast+0x46/0xe0 [ocfs2_dlm]
[ 989.767217] [<ffffffffc084f5cc>] ? dlm_lockres_calc_usage+0x4c/0x60
[ocfs2_dlm]
[ 989.767466] [<ffffffffc08501f1>] dlm_thread+0xa31/0x1140 [ocfs2_dlm]
[ 989.767662] [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[ 989.767834] [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[ 989.768006] [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[ 989.768178] [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[ 989.768349] [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[ 989.768521] [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[ 989.768693] [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[ 989.768893] [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[ 989.769067] [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[ 989.769241] [<ffffffff810ce4d0>] ? wait_woken+0x90/0x90
[ 989.769411] [<ffffffffc084f7c0>] ? dlm_kick_thread+0x80/0x80 [ocfs2_dlm]
[ 989.769617] [<ffffffff810a8bbb>] kthread+0xcb/0xf0
[ 989.769774] [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[ 989.769945] [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[ 989.770117] [<ffffffff810a8af0>] ? kthread_create_on_node+0x180/0x180
[ 989.770321] [<ffffffff816fdaa1>] ret_from_fork+0x61/0x90
[ 989.770492] [<ffffffff810a8af0>] ? kthread_create_on_node+0x180/0x180
[ 989.770689] Code: d0 00 00 00 f0 45 7d c0 bf 00 20 00 00 48 89 83 c0 00 00
00 48 89 83 c8 00 00 00 e8 55 c1 8c c0 83 4b 04 10 48 83 c4 08 5b 5d c3 <0f>
0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 41 54 53 48 83
[ 989.771892] RIP [<ffffffffc07d4316>]
__user_dlm_queue_lockres.part.4+0x76/0x80 [ocfs2_dlmfs]
[ 989.772174] RSP <ffff88017f7cbcb8>
[ 989.772704] ---[ end trace ebd1e38cebcc93a8 ]---
[ 989.772907] Kernel panic - not syncing: Fatal exception
[ 989.773173] Kernel Offset: disabled
Link: https://lkml.kernel.org/r/20220518235224.87100-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1689c169134f4b5a39156122d799b7dca76d8ddb upstream.
We always call hold_lkb(lkb) if we increment lkb->lkb_wait_count.
So, we always need to call unhold_lkb(lkb) if we decrement
lkb->lkb_wait_count. This patch will add missing unhold_lkb(lkb) if we
decrement lkb->lkb_wait_count. In case of setting lkb->lkb_wait_count to
zero we need to countdown until reaching zero and call unhold_lkb(lkb).
The waiters list unhold_lkb(lkb) can be removed because it's done for
the last lkb_wait_count decrement iteration as it's done in
_remove_from_waiters().
This issue was discovered by a dlm gfs2 test case which use excessively
dlm_unlock(LKF_CANCEL) feature. Probably the lkb->lkb_wait_count value
never reached above 1 if this feature isn't used and so it was not
discovered before.
The testcase ended in a rsb on the rsb keep data structure with a
refcount of 1 but no lkb was associated with it, which is itself
an invalid behaviour. A side effect of that was a condition in which
the dlm was sending remove messages in a looping behaviour. With this
patch that has not been reproduced.
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f6f7418357457ed58cbb020fc97e74d4e0e7b47f upstream.
This patch move the wake_up() call at the point when a remove message
completed. Before it was only when a remove message was going to be
sent. The possible waiter in wait_pending_remove() waits until a remove
is done if the resource name matches with the per ls variable
ls->ls_remove_name. If this is the case we must wait until a pending
remove is done which is indicated if DLM_WAIT_PENDING_COND() returns
false which will always be the case when ls_remove_len and
ls_remove_name are unset to indicate that a remove is not going on
anymore.
Fixes: 21d9ac1a5376 ("fs: dlm: use event based wait for pending remove")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1f4f10845e14690b02410de50d9ea9684625a4ae upstream.
The "sock" variable is not initialized on this error path.
Cc: stable@vger.kernel.org
Fixes: 2dc6b1158c28 ("fs: dlm: introduce generic listen")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 42252d0d2aa9b94d168241710a761588b3959019 upstream.
This patch fixes an invalid read showed by KASAN. A unlock will allocate a
"struct plock_op" and a followed send_op() will append it to a global
send_list data structure. In some cases a followed dev_read() moves it
to recv_list and dev_write() will cast it to "struct plock_xop" and access
fields which are only available in those structures. At this point an
invalid read happens by accessing those fields.
To fix this issue the "callback" field is moved to "struct plock_op" to
indicate that a cast to "plock_xop" is allowed and does the additional
"plock_xop" handling if set.
Example of the KASAN output which showed the invalid read:
[ 2064.296453] ==================================================================
[ 2064.304852] BUG: KASAN: slab-out-of-bounds in dev_write+0x52b/0x5a0 [dlm]
[ 2064.306491] Read of size 8 at addr ffff88800ef227d8 by task dlm_controld/7484
[ 2064.308168]
[ 2064.308575] CPU: 0 PID: 7484 Comm: dlm_controld Kdump: loaded Not tainted 5.14.0+ #9
[ 2064.310292] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 2064.311618] Call Trace:
[ 2064.312218] dump_stack_lvl+0x56/0x7b
[ 2064.313150] print_address_description.constprop.8+0x21/0x150
[ 2064.314578] ? dev_write+0x52b/0x5a0 [dlm]
[ 2064.315610] ? dev_write+0x52b/0x5a0 [dlm]
[ 2064.316595] kasan_report.cold.14+0x7f/0x11b
[ 2064.317674] ? dev_write+0x52b/0x5a0 [dlm]
[ 2064.318687] dev_write+0x52b/0x5a0 [dlm]
[ 2064.319629] ? dev_read+0x4a0/0x4a0 [dlm]
[ 2064.320713] ? bpf_lsm_kernfs_init_security+0x10/0x10
[ 2064.321926] vfs_write+0x17e/0x930
[ 2064.322769] ? __fget_light+0x1aa/0x220
[ 2064.323753] ksys_write+0xf1/0x1c0
[ 2064.324548] ? __ia32_sys_read+0xb0/0xb0
[ 2064.325464] do_syscall_64+0x3a/0x80
[ 2064.326387] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2064.327606] RIP: 0033:0x7f807e4ba96f
[ 2064.328470] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 87 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 7c 87 f8 ff 48
[ 2064.332902] RSP: 002b:00007ffd50cfe6e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 2064.334658] RAX: ffffffffffffffda RBX: 000055cc3886eb30 RCX: 00007f807e4ba96f
[ 2064.336275] RDX: 0000000000000040 RSI: 00007ffd50cfe7e0 RDI: 0000000000000010
[ 2064.337980] RBP: 00007ffd50cfe7e0 R08: 0000000000000000 R09: 0000000000000001
[ 2064.339560] R10: 000055cc3886eb30 R11: 0000000000000293 R12: 000055cc3886eb80
[ 2064.341237] R13: 000055cc3886eb00 R14: 000055cc3886f590 R15: 0000000000000001
[ 2064.342857]
[ 2064.343226] Allocated by task 12438:
[ 2064.344057] kasan_save_stack+0x1c/0x40
[ 2064.345079] __kasan_kmalloc+0x84/0xa0
[ 2064.345933] kmem_cache_alloc_trace+0x13b/0x220
[ 2064.346953] dlm_posix_unlock+0xec/0x720 [dlm]
[ 2064.348811] do_lock_file_wait.part.32+0xca/0x1d0
[ 2064.351070] fcntl_setlk+0x281/0xbc0
[ 2064.352879] do_fcntl+0x5e4/0xfe0
[ 2064.354657] __x64_sys_fcntl+0x11f/0x170
[ 2064.356550] do_syscall_64+0x3a/0x80
[ 2064.358259] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2064.360745]
[ 2064.361511] Last potentially related work creation:
[ 2064.363957] kasan_save_stack+0x1c/0x40
[ 2064.365811] __kasan_record_aux_stack+0xaf/0xc0
[ 2064.368100] call_rcu+0x11b/0xf70
[ 2064.369785] dlm_process_incoming_buffer+0x47d/0xfd0 [dlm]
[ 2064.372404] receive_from_sock+0x290/0x770 [dlm]
[ 2064.374607] process_recv_sockets+0x32/0x40 [dlm]
[ 2064.377290] process_one_work+0x9a8/0x16e0
[ 2064.379357] worker_thread+0x87/0xbf0
[ 2064.381188] kthread+0x3ac/0x490
[ 2064.383460] ret_from_fork+0x22/0x30
[ 2064.385588]
[ 2064.386518] Second to last potentially related work creation:
[ 2064.389219] kasan_save_stack+0x1c/0x40
[ 2064.391043] __kasan_record_aux_stack+0xaf/0xc0
[ 2064.393303] call_rcu+0x11b/0xf70
[ 2064.394885] dlm_process_incoming_buffer+0x47d/0xfd0 [dlm]
[ 2064.397694] receive_from_sock+0x290/0x770 [dlm]
[ 2064.399932] process_recv_sockets+0x32/0x40 [dlm]
[ 2064.402180] process_one_work+0x9a8/0x16e0
[ 2064.404388] worker_thread+0x87/0xbf0
[ 2064.406124] kthread+0x3ac/0x490
[ 2064.408021] ret_from_fork+0x22/0x30
[ 2064.409834]
[ 2064.410599] The buggy address belongs to the object at ffff88800ef22780
[ 2064.410599] which belongs to the cache kmalloc-96 of size 96
[ 2064.416495] The buggy address is located 88 bytes inside of
[ 2064.416495] 96-byte region [ffff88800ef22780, ffff88800ef227e0)
[ 2064.422045] The buggy address belongs to the page:
[ 2064.424635] page:00000000b6bef8bc refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xef22
[ 2064.428970] flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
[ 2064.432515] raw: 000fffffc0000200 ffffea0000d68b80 0000001400000014 ffff888001041780
[ 2064.436110] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
[ 2064.439813] page dumped because: kasan: bad access detected
[ 2064.442548]
[ 2064.443310] Memory state around the buggy address:
[ 2064.445988] ffff88800ef22680: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
[ 2064.449444] ffff88800ef22700: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
[ 2064.452941] >ffff88800ef22780: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
[ 2064.456383] ^
[ 2064.459386] ffff88800ef22800: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc
[ 2064.462788] ffff88800ef22880: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
[ 2064.466239] ==================================================================
reproducer in python:
import argparse
import struct
import fcntl
import os
parser = argparse.ArgumentParser()
parser.add_argument('-f', '--file',
help='file to use fcntl, must be on dlm lock filesystem e.g. gfs2')
args = parser.parse_args()
f = open(args.file, 'wb+')
lockdata = struct.pack('hhllhh', fcntl.F_WRLCK,0,0,0,0,0)
fcntl.fcntl(f, fcntl.F_SETLK, lockdata)
lockdata = struct.pack('hhllhh', fcntl.F_UNLCK,0,0,0,0,0)
fcntl.fcntl(f, fcntl.F_SETLK, lockdata)
Fixes: 586759f03e2e ("gfs2: nfs lock support for gfs2")
Cc: stable@vger.kernel.org
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3ba733f879c2a88910744647e41edeefbc0d92b2 upstream.
A maliciously corrupted filesystem can contain cycles in the h-tree
stored inside a directory. That can easily lead to the kernel corrupting
tree nodes that were already verified under its hands while doing a node
split and consequently accessing unallocated memory. Fix the problem by
verifying traversed block numbers are unique.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518093332.13986-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 46c116b920ebec58031f0a78c5ea9599b0d2a371 upstream.
Before splitting a directory block verify its directory entries are sane
so that the splitting code does not access memory it should not.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518093332.13986-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d36f6ed761b53933b0b4126486c10d3da7751e7f upstream.
Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/extents_status.c:199!
[...]
RIP: 0010:ext4_es_end fs/ext4/extents_status.c:199 [inline]
RIP: 0010:__es_tree_search+0x1e0/0x260 fs/ext4/extents_status.c:217
[...]
Call Trace:
ext4_es_cache_extent+0x109/0x340 fs/ext4/extents_status.c:766
ext4_cache_extents+0x239/0x2e0 fs/ext4/extents.c:561
ext4_find_extent+0x6b7/0xa20 fs/ext4/extents.c:964
ext4_ext_map_blocks+0x16b/0x4b70 fs/ext4/extents.c:4384
ext4_map_blocks+0xe26/0x19f0 fs/ext4/inode.c:567
ext4_getblk+0x320/0x4c0 fs/ext4/inode.c:980
ext4_bread+0x2d/0x170 fs/ext4/inode.c:1031
ext4_quota_read+0x248/0x320 fs/ext4/super.c:6257
v2_read_header+0x78/0x110 fs/quota/quota_v2.c:63
v2_check_quota_file+0x76/0x230 fs/quota/quota_v2.c:82
vfs_load_quota_inode+0x5d1/0x1530 fs/quota/dquot.c:2368
dquot_enable+0x28a/0x330 fs/quota/dquot.c:2490
ext4_quota_enable fs/ext4/super.c:6137 [inline]
ext4_enable_quotas+0x5d7/0x960 fs/ext4/super.c:6163
ext4_fill_super+0xa7c9/0xdc00 fs/ext4/super.c:4754
mount_bdev+0x2e9/0x3b0 fs/super.c:1158
mount_fs+0x4b/0x1e4 fs/super.c:1261
[...]
==================================================================
Above issue may happen as follows:
-------------------------------------
ext4_fill_super
ext4_enable_quotas
ext4_quota_enable
ext4_iget
__ext4_iget
ext4_ext_check_inode
ext4_ext_check
__ext4_ext_check
ext4_valid_extent_entries
Check for overlapping extents does't take effect
dquot_enable
vfs_load_quota_inode
v2_check_quota_file
v2_read_header
ext4_quota_read
ext4_bread
ext4_getblk
ext4_map_blocks
ext4_ext_map_blocks
ext4_find_extent
ext4_cache_extents
ext4_es_cache_extent
ext4_es_cache_extent
__es_tree_search
ext4_es_end
BUG_ON(es->es_lblk + es->es_len < es->es_lblk)
The error ext4 extents is as follows:
0af3 0300 0400 0000 00000000 extent_header
00000000 0100 0000 12000000 extent1
00000000 0100 0000 18000000 extent2
02000000 0400 0000 14000000 extent3
In the ext4_valid_extent_entries function,
if prev is 0, no error is returned even if lblock<=prev.
This was intended to skip the check on the first extent, but
in the error image above, prev=0+1-1=0 when checking the second extent,
so even though lblock<=prev, the function does not return an error.
As a result, bug_ON occurs in __es_tree_search and the system panics.
To solve this problem, we only need to check that:
1. The lblock of the first extent is not less than 0.
2. The lblock of the next extent is not less than
the next block of the previous extent.
The same applies to extent_idx.
Cc: stable@kernel.org
Fixes: 5946d089379a ("ext4: check for overlapping extents in ext4_valid_extent_entries()")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220518120816.1541863-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c878bea3c9d724ddfa05a813f30de3d25a0ba83f upstream.
The EXT4_FC_REPLAY bit in sbi->s_mount_state is used to indicate that
we are in the middle of replay the fast commit journal. This was
actually a mistake, since the sbi->s_mount_info is initialized from
es->s_state. Arguably s_mount_state is misleadingly named, but the
name is historical --- s_mount_state and s_state dates back to ext2.
What should have been used is the ext4_{set,clear,test}_mount_flag()
inline functions, which sets EXT4_MF_* bits in sbi->s_mount_flags.
The problem with using EXT4_FC_REPLAY is that a maliciously corrupted
superblock could result in EXT4_FC_REPLAY getting set in
s_mount_state. This bypasses some sanity checks, and this can trigger
a BUG() in ext4_es_cache_extent(). As a easy-to-backport-fix, filter
out the EXT4_FC_REPLAY bit for now. We should eventually transition
away from EXT4_FC_REPLAY to something like EXT4_MF_REPLAY.
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20220420192312.1655305-1-phind.uet@gmail.com
Link: https://lore.kernel.org/r/20220517174028.942119-1-tytso@mit.edu
Reported-by: syzbot+c7358a3cd05ee786eb31@syzkaller.appspotmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ef09ed5d37b84d18562b30cf7253e57062d0db05 upstream.
we got issue as follows:
EXT4-fs error (device loop0): ext4_mb_generate_buddy:1141: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free cls
------------[ cut here ]------------
kernel BUG at fs/ext4/inode.c:2708!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 2 PID: 2147 Comm: rep Not tainted 5.18.0-rc2-next-20220413+ #155
RIP: 0010:ext4_writepages+0x1977/0x1c10
RSP: 0018:ffff88811d3e7880 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88811c098000
RDX: 0000000000000000 RSI: ffff88811c098000 RDI: 0000000000000002
RBP: ffff888128140f50 R08: ffffffffb1ff6387 R09: 0000000000000000
R10: 0000000000000007 R11: ffffed10250281ea R12: 0000000000000001
R13: 00000000000000a4 R14: ffff88811d3e7bb8 R15: ffff888128141028
FS: 00007f443aed9740(0000) GS:ffff8883aef00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020007200 CR3: 000000011c2a4000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
do_writepages+0x130/0x3a0
filemap_fdatawrite_wbc+0x83/0xa0
filemap_flush+0xab/0xe0
ext4_alloc_da_blocks+0x51/0x120
__ext4_ioctl+0x1534/0x3210
__x64_sys_ioctl+0x12c/0x170
do_syscall_64+0x3b/0x90
It may happen as follows:
1. write inline_data inode
vfs_write
new_sync_write
ext4_file_write_iter
ext4_buffered_write_iter
generic_perform_write
ext4_da_write_begin
ext4_da_write_inline_data_begin -> If inline data size too
small will allocate block to write, then mapping will has
dirty page
ext4_da_convert_inline_data_to_extent ->clear EXT4_STATE_MAY_INLINE_DATA
2. fallocate
do_vfs_ioctl
ioctl_preallocate
vfs_fallocate
ext4_fallocate
ext4_convert_inline_data
ext4_convert_inline_data_nolock
ext4_map_blocks -> fail will goto restore data
ext4_restore_inline_data
ext4_create_inline_data
ext4_write_inline_data
ext4_set_inode_state -> set inode EXT4_STATE_MAY_INLINE_DATA
3. writepages
__ext4_ioctl
ext4_alloc_da_blocks
filemap_flush
filemap_fdatawrite_wbc
do_writepages
ext4_writepages
if (ext4_has_inline_data(inode))
BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA))
The root cause of this issue is we destory inline data until call
ext4_writepages under delay allocation mode. But there maybe already
convert from inline to extent. To solve this issue, we call
filemap_flush first..
Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220516122634.1690462-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c069db76ed7b681c69159f44be96d2137e9ca989 upstream.
If processing the on-disk mount options fails after any memory was
allocated in the ext4_fs_context, e.g. s_qf_names, then this memory is
leaked. Fix this by calling ext4_fc_free() instead of kfree() directly.
Reproducer:
mkfs.ext4 -F /dev/vdc
tune2fs /dev/vdc -E mount_opts=usrjquota=file
echo clear > /sys/kernel/debug/kmemleak
mount /dev/vdc /vdc
echo scan > /sys/kernel/debug/kmemleak
sleep 5
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
Fixes: 7edfd85b1ffd ("ext4: Completely separate options parsing and sb setup")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220513231605.175121-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f4534c9fc94d22383f187b9409abb3f9df2e3db3 upstream.
We got issue as follows:
EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory
EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error
EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory
EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc
------------[ cut here ]------------
WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220
Modules linked in:
CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1
RIP: 0010:ext4_file_write_iter+0x11c9/0x1220
RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282
RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000
RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd
RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f
R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a
R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b
FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
do_iter_readv_writev+0x2e5/0x360
do_iter_write+0x112/0x4c0
do_pwritev+0x1e5/0x390
__x64_sys_pwritev2+0x7e/0xa0
do_syscall_64+0x37/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Above issue may happen as follows:
Assume
inode.i_size=4096
EXT4_I(inode)->i_disksize=4096
step 1: set inode->i_isize = 8192
ext4_setattr
if (attr->ia_size != inode->i_size)
EXT4_I(inode)->i_disksize = attr->ia_size;
rc = ext4_mark_inode_dirty
ext4_reserve_inode_write
ext4_get_inode_loc
__ext4_get_inode_loc
sb_getblk --> return -ENOMEM
...
if (!error) ->will not update i_size
i_size_write(inode, attr->ia_size);
Now:
inode.i_size=4096
EXT4_I(inode)->i_disksize=8192
step 2: Direct write 4096 bytes
ext4_file_write_iter
ext4_dio_write_iter
iomap_dio_rw ->return error
if (extend)
ext4_handle_inode_extension
WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
->Then trigger warning.
To solve above issue, if mark inode dirty failed in ext4_setattr just
set 'EXT4_I(inode)->i_disksize' with old value.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220326065351.761952-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f87c7a4b084afc13190cbb263538e444cb2b392a upstream.
Hulk Robot reported a BUG_ON:
==================================================================
EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0,
block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters
kernel BUG at fs/ext4/ext4_jbd2.c:53!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1
RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline]
RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116
[...]
Call Trace:
ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795
generic_perform_write+0x279/0x3c0 mm/filemap.c:3344
ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270
ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520
do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732
do_iter_write+0x107/0x430 fs/read_write.c:861
vfs_writev fs/read_write.c:934 [inline]
do_pwritev+0x1e5/0x380 fs/read_write.c:1031
[...]
==================================================================
Above issue may happen as follows:
cpu1 cpu2
__________________________|__________________________
do_pwritev
vfs_writev
do_iter_write
ext4_file_write_iter
ext4_buffered_write_iter
generic_perform_write
ext4_da_write_begin
vfs_fallocate
ext4_fallocate
ext4_convert_inline_data
ext4_convert_inline_data_nolock
ext4_destroy_inline_data_nolock
clear EXT4_STATE_MAY_INLINE_DATA
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_regular_allocator
ext4_mb_good_group_nolock
ext4_mb_init_group
ext4_mb_init_cache
ext4_mb_generate_buddy --> error
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_restore_inline_data
set EXT4_STATE_MAY_INLINE_DATA
ext4_block_write_begin
ext4_da_write_end
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_write_inline_data_end
handle=NULL
ext4_journal_stop(handle)
__ext4_journal_stop
ext4_put_nojournal(handle)
ref_cnt = (unsigned long)handle
BUG_ON(ref_cnt == 0) ---> BUG_ON
The lock held by ext4_convert_inline_data is xattr_sem, but the lock
held by generic_perform_write is i_rwsem. Therefore, the two locks can
be concurrent.
To solve above issue, we add inode_lock() for ext4_convert_inline_data().
At the same time, move ext4_convert_inline_data() in front of
ext4_punch_hole(), remove similar handling from ext4_punch_hole().
Fixes: 0c8d414f163f ("ext4: let fallocate handle inline data correctly")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e4e58e5df309d695799c494958962100a4c25039 upstream.
In __ext4_super() we always overwrote the user specified journal_ioprio
value with a default value, expecting parse_apply_sb_mount_options() to
later correctly set ctx->journal_ioprio to the user specified value.
However, if parse_apply_sb_mount_options() returned early because of
empty sbi->es_s->s_mount_opts, the correct journal_ioprio value was
never set.
This patch fixes __ext4_super() to only use the default value if the
user has not specified any value for journal_ioprio.
Similarly, the remount behavior was to either use journal_ioprio
value specified during initial mount, or use the default value
irrespective of the journal_ioprio value specified during remount.
This patch modifies this to first check if a new value for ioprio
has been passed during remount and apply it. If no new value is
passed, use the value specified during initial mount.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20220418083545.45778-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0be698ecbe4471fcad80e81ec6a05001421041b3 upstream.
We got issue as follows:
EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue
ext4_get_first_dir_block: bh->b_data=0xffff88810bee6000 len=34478
ext4_get_first_dir_block: *parent_de=0xffff88810beee6ae bh->b_data=0xffff88810bee6000
ext4_rename_dir_prepare: [1] parent_de=0xffff88810beee6ae
==================================================================
BUG: KASAN: use-after-free in ext4_rename_dir_prepare+0x152/0x220
Read of size 4 at addr ffff88810beee6ae by task rep/1895
CPU: 13 PID: 1895 Comm: rep Not tainted 5.10.0+ #241
Call Trace:
dump_stack+0xbe/0xf9
print_address_description.constprop.0+0x1e/0x220
kasan_report.cold+0x37/0x7f
ext4_rename_dir_prepare+0x152/0x220
ext4_rename+0xf44/0x1ad0
ext4_rename2+0x11c/0x170
vfs_rename+0xa84/0x1440
do_renameat2+0x683/0x8f0
__x64_sys_renameat+0x53/0x60
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f45a6fc41c9
RSP: 002b:00007ffc5a470218 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f45a6fc41c9
RDX: 0000000000000005 RSI: 0000000020000180 RDI: 0000000000000005
RBP: 00007ffc5a470240 R08: 00007ffc5a470160 R09: 0000000020000080
R10: 00000000200001c0 R11: 0000000000000246 R12: 0000000000400bb0
R13: 00007ffc5a470320 R14: 0000000000000000 R15: 0000000000000000
The buggy address belongs to the page:
page:00000000440015ce refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x10beee
flags: 0x200000000000000()
raw: 0200000000000000 ffffea00043ff4c8 ffffea0004325608 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88810beee580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810beee600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88810beee680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff88810beee700: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff88810beee780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Disabling lock debugging due to kernel taint
ext4_rename_dir_prepare: [2] parent_de->inode=3537895424
ext4_rename_dir_prepare: [3] dir=0xffff888124170140
ext4_rename_dir_prepare: [4] ino=2
ext4_rename_dir_prepare: ent->dir->i_ino=2 parent=-757071872
Reason is first directory entry which 'rec_len' is 34478, then will get illegal
parent entry. Now, we do not check directory entry after read directory block
in 'ext4_get_first_dir_block'.
To solve this issue, check directory entry in 'ext4_get_first_dir_block'.
[ Trigger an ext4_error() instead of just warning if the directory is
missing a '.' or '..' entry. Also make sure we return an error code
if the file system is corrupted. -TYT ]
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220414025223.4113128-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d63c00ea435a5352f486c259665a4ced60399421 upstream.
Otherwise nonaligned fstrim calls will works inconveniently for iterative
scanners, for example:
// trim [0,16MB] for group-1, but mark full group as trimmed
fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m
// handle [16MB,16MB] for group-1, do nothing because group already has the flag.
fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m
[ Update function documentation for ext4_trim_all_free -- TYT ]
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 68f4c6eba70df70a720188bce95c85570ddfcc87 upstream.
Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during wb_writeback, which
may cause a potential ABBA dead lock:
wb_writeback fat_file_fsync
blk_start_plug(&plug)
for (;;) {
iter i-1: some reqs have been added into plug->mq_list // LOCK A
iter i:
progress = __writeback_inodes_wb(wb, work)
. writeback_sb_inodes // fat's bdev
. __writeback_single_inode
. . generic_writepages
. . __block_write_full_page
. . . . __generic_file_fsync
. . . . sync_inode_metadata
. . . . writeback_single_inode
. . . . __writeback_single_inode
. . . . fat_write_inode
. . . . __fat_write_inode
. . . . sync_dirty_buffer // fat's bdev
. . . . lock_buffer(bh) // LOCK B
. . . . submit_bh
. . . . blk_mq_get_tag // LOCK A
. . . trylock_buffer(bh) // LOCK B
. . . redirty_page_for_writepage
. . . wbc->pages_skipped++
. . --wbc->nr_to_write
. wrote += write_chunk - wbc.nr_to_write // wrote > 0
. requeue_inode
. redirty_tail_locked
if (progress) // progress > 0
continue;
iter i+1:
queue_io
// similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug) // flush plug won't be called
Above process triggers a hungtask like:
[ 399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
[ 399.046824] Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
[ 399.051539] task:bb state:D stack: 0 pid: 2607 ppid:
2426 flags:0x00004000
[ 399.051556] Call Trace:
[ 399.051570] __schedule+0x480/0x1050
[ 399.051592] schedule+0x92/0x1a0
[ 399.051602] io_schedule+0x22/0x50
[ 399.051613] blk_mq_get_tag+0x1d3/0x3c0
[ 399.051640] __blk_mq_alloc_requests+0x21d/0x3f0
[ 399.051657] blk_mq_submit_bio+0x68d/0xca0
[ 399.051674] __submit_bio+0x1b5/0x2d0
[ 399.051708] submit_bio_noacct+0x34e/0x720
[ 399.051718] submit_bio+0x3b/0x150
[ 399.051725] submit_bh_wbc+0x161/0x230
[ 399.051734] __sync_dirty_buffer+0xd1/0x420
[ 399.051744] sync_dirty_buffer+0x17/0x20
[ 399.051750] __fat_write_inode+0x289/0x310
[ 399.051766] fat_write_inode+0x2a/0xa0
[ 399.051783] __writeback_single_inode+0x53c/0x6f0
[ 399.051795] writeback_single_inode+0x145/0x200
[ 399.051803] sync_inode_metadata+0x45/0x70
[ 399.051856] __generic_file_fsync+0xa3/0x150
[ 399.051880] fat_file_fsync+0x1d/0x80
[ 399.051895] vfs_fsync_range+0x40/0xb0
[ 399.051929] __x64_sys_fsync+0x18/0x30
In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback:
unplug before cond_resched in writeback_sb_inodes") in function
'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted
from write_cache_pages().
Fix it by correcting wrote number according number of skipped pages
in writeback_sb_inodes().
Goto Link to find a reproducer.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837
Cc: stable@vger.kernel.org # v4.3
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220510133805.1988292-1-chengzhihao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 677a82b44ebf263d4f9a0cfbd576a6ade797a07b upstream.
Yanming reported a kernel bug in Bugzilla kernel [1], which can be
reproduced. The bug message is:
The kernel message is shown below:
kernel BUG at fs/inode.c:611!
Call Trace:
evict+0x282/0x4e0
__dentry_kill+0x2b2/0x4d0
dput+0x2dd/0x720
do_renameat2+0x596/0x970
__x64_sys_rename+0x78/0x90
do_syscall_64+0x3b/0x90
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215895
The bug is due to fuzzed inode has both inline_data and encrypted flags.
During f2fs_evict_inode(), as the inode was deleted by rename(), it
will cause inline data conversion due to conflicting flags. The page
cache will be polluted and the panic will be triggered in clear_inode().
Try fixing the bug by doing more sanity checks for inline data inode in
sanity_check_inode().
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 958ed92922028ec67f504dcdc72bfdfd0f43936a upstream.
This patch tries to fix permission consistency issue as all other
mainline filesystems.
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files. This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero, collapse, insert range). Because the call can be used to change
file contents, we should treat it like we do any other modification to a
file -- update the mtime, and drop set[ug]id privileges/capabilities.
The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b5639bb4313b9d455fc9fc4768d23a5e4ca8cb9d upstream.
Tryng to rename a directory that has all following properties fails with
EINVAL and triggers the 'WARN_ON_ONCE(!fscrypt_has_encryption_key(dir))'
in f2fs_match_ci_name():
- The directory is casefolded
- The directory is encrypted
- The directory's encryption key is not yet set up
- The parent directory is *not* encrypted
The problem is incorrect handling of the lookup of ".." to get the
parent reference to update. fscrypt_setup_filename() treats ".." (and
".") specially, as it's never encrypted. It's passed through as-is, and
setting up the directory's key is not attempted. As the name isn't a
no-key name, f2fs treats it as a "normal" name and attempts a casefolded
comparison. That breaks the assumption of the WARN_ON_ONCE() in
f2fs_match_ci_name() which assumes that for encrypted directories,
casefolded comparisons only happen when the directory's key is set up.
We could just remove this WARN_ON_ONCE(). However, since casefolding is
always a no-op on "." and ".." anyway, let's instead just not casefold
these names. This results in the standard bytewise comparison.
Fixes: 7ad08a58bf67 ("f2fs: Handle casefolding with Encryption")
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6b8beca0edd32075a769bfe4178ca00c0dcd22a9 upstream.
As Yanming reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215916
The kernel message is shown below:
kernel BUG at fs/f2fs/segment.c:2560!
Call Trace:
allocate_segment_by_default+0x228/0x440
f2fs_allocate_data_block+0x13d1/0x31f0
do_write_page+0x18d/0x710
f2fs_outplace_write_data+0x151/0x250
f2fs_do_write_data_page+0xef9/0x1980
move_data_page+0x6af/0xbc0
do_garbage_collect+0x312f/0x46f0
f2fs_gc+0x6b0/0x3bc0
f2fs_balance_fs+0x921/0x2260
f2fs_write_single_data_page+0x16be/0x2370
f2fs_write_cache_pages+0x428/0xd00
f2fs_write_data_pages+0x96e/0xd50
do_writepages+0x168/0x550
__writeback_single_inode+0x9f/0x870
writeback_sb_inodes+0x47d/0xb20
__writeback_inodes_wb+0xb2/0x200
wb_writeback+0x4bd/0x660
wb_workfn+0x5f3/0xab0
process_one_work+0x79f/0x13e0
worker_thread+0x89/0xf60
kthread+0x26a/0x300
ret_from_fork+0x22/0x30
RIP: 0010:new_curseg+0xe8d/0x15f0
The root cause is: ckpt.valid_block_count is inconsistent with SIT table,
stat info indicates filesystem has free blocks, but SIT table indicates
filesystem has no free segment.
So that during garbage colloection, it triggers panic when LFS allocator
fails to find free segment.
This patch tries to fix this issue by checking consistency in between
ckpt.valid_block_count and block accounted from SIT.
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6213f5d4d23c50d393a31dc8e351e63a1fd10dbe upstream.
Let's avoid false-alarmed lockdep warning.
[ 58.914674] [T1501146] -> #2 (&sb->s_type->i_mutex_key#20){+.+.}-{3:3}:
[ 58.915975] [T1501146] system_server: down_write+0x7c/0xe0
[ 58.916738] [T1501146] system_server: f2fs_quota_sync+0x60/0x1a8
[ 58.917563] [T1501146] system_server: block_operations+0x16c/0x43c
[ 58.918410] [T1501146] system_server: f2fs_write_checkpoint+0x114/0x318
[ 58.919312] [T1501146] system_server: f2fs_issue_checkpoint+0x178/0x21c
[ 58.920214] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c
[ 58.920999] [T1501146] system_server: f2fs_do_sync_file+0x334/0x738
[ 58.921862] [T1501146] system_server: f2fs_sync_file+0x30/0x48
[ 58.922667] [T1501146] system_server: __arm64_sys_fsync+0x84/0xf8
[ 58.923506] [T1501146] system_server: el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[ 58.924604] [T1501146] system_server: do_el0_svc+0x28/0xa0
[ 58.925366] [T1501146] system_server: el0_svc+0x24/0x38
[ 58.926094] [T1501146] system_server: el0_sync_handler+0x88/0xec
[ 58.926920] [T1501146] system_server: el0_sync+0x1b4/0x1c0
[ 58.927681] [T1501146] -> #1 (&sbi->cp_global_sem){+.+.}-{3:3}:
[ 58.928889] [T1501146] system_server: down_write+0x7c/0xe0
[ 58.929650] [T1501146] system_server: f2fs_write_checkpoint+0xbc/0x318
[ 58.930541] [T1501146] system_server: f2fs_issue_checkpoint+0x178/0x21c
[ 58.931443] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c
[ 58.932226] [T1501146] system_server: sync_filesystem+0xac/0x130
[ 58.933053] [T1501146] system_server: generic_shutdown_super+0x38/0x150
[ 58.933958] [T1501146] system_server: kill_block_super+0x24/0x58
[ 58.934791] [T1501146] system_server: kill_f2fs_super+0xcc/0x124
[ 58.935618] [T1501146] system_server: deactivate_locked_super+0x90/0x120
[ 58.936529] [T1501146] system_server: deactivate_super+0x74/0xac
[ 58.937356] [T1501146] system_server: cleanup_mnt+0x128/0x168
[ 58.938150] [T1501146] system_server: __cleanup_mnt+0x18/0x28
[ 58.938944] [T1501146] system_server: task_work_run+0xb8/0x14c
[ 58.939749] [T1501146] system_server: do_notify_resume+0x114/0x1e8
[ 58.940595] [T1501146] system_server: work_pending+0xc/0x5f0
[ 58.941375] [T1501146] -> #0 (&sbi->gc_lock){+.+.}-{3:3}:
[ 58.942519] [T1501146] system_server: __lock_acquire+0x1270/0x2868
[ 58.943366] [T1501146] system_server: lock_acquire+0x114/0x294
[ 58.944169] [T1501146] system_server: down_write+0x7c/0xe0
[ 58.944930] [T1501146] system_server: f2fs_issue_checkpoint+0x13c/0x21c
[ 58.945831] [T1501146] system_server: f2fs_sync_fs+0x48/0x6c
[ 58.946614] [T1501146] system_server: f2fs_do_sync_file+0x334/0x738
[ 58.947472] [T1501146] system_server: f2fs_ioc_commit_atomic_write+0xc8/0x14c
[ 58.948439] [T1501146] system_server: __f2fs_ioctl+0x674/0x154c
[ 58.949253] [T1501146] system_server: f2fs_ioctl+0x54/0x88
[ 58.950018] [T1501146] system_server: __arm64_sys_ioctl+0xa8/0x110
[ 58.950865] [T1501146] system_server: el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[ 58.951965] [T1501146] system_server: do_el0_svc+0x28/0xa0
[ 58.952727] [T1501146] system_server: el0_svc+0x24/0x38
[ 58.953454] [T1501146] system_server: el0_sync_handler+0x88/0xec
[ 58.954279] [T1501146] system_server: el0_sync+0x1b4/0x1c0
Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cfd66bb715fd11fde3338d0660cffa1396adc27d upstream.
As Yanming reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215914
The root cause is: in a very small sized image, it's very easy to
exceed threshold of foreground GC, if we calculate free space and
dirty data based on section granularity, in corner case,
has_not_enough_free_secs() will always return true, result in
deadloop in f2fs_gc().
So this patch refactors has_not_enough_free_secs() as below to fix
this issue:
1. calculate needed space based on block granularity, and separate
all blocks to two parts, section part, and block part, comparing
section part to free section, and comparing block part to free space
in openned log.
2. account F2FS_DIRTY_NODES, F2FS_DIRTY_IMETA and F2FS_DIRTY_DENTS
as node block consumer;
3. account F2FS_DIRTY_DENTS as data block consumer;
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f2db71053dc0409fae785096ad19cce4c8a95af7 upstream.
As Yanming reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215904
The kernel message is shown below:
kernel BUG at fs/f2fs/inode.c:825!
Call Trace:
evict+0x282/0x4e0
__dentry_kill+0x2b2/0x4d0
shrink_dentry_list+0x17c/0x4f0
shrink_dcache_parent+0x143/0x1e0
do_one_tree+0x9/0x30
shrink_dcache_for_umount+0x51/0x120
generic_shutdown_super+0x5c/0x3a0
kill_block_super+0x90/0xd0
kill_f2fs_super+0x225/0x310
deactivate_locked_super+0x78/0xc0
cleanup_mnt+0x2b7/0x480
task_work_run+0xc8/0x150
exit_to_user_mode_prepare+0x14a/0x150
syscall_exit_to_user_mode+0x1d/0x40
do_syscall_64+0x48/0x90
The root cause is: inode node and dnode node share the same nid,
so during f2fs_evict_inode(), dnode node truncation will invalidate
its NAT entry, so when truncating inode node, it fails due to
invalid NAT entry, result in inode is still marked as dirty, fix
this issue by clearing dirty for inode and setting SBI_NEED_FSCK
flag in filesystem.
output from dump.f2fs:
[print_node_info: 354] Node ID [0xf:15] is inode
i_nid[0] [0x f : 15]
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 25f8236213a91efdf708b9d77e9e51b6fc3e141c upstream.
As Yanming reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215894
I have encountered a bug in F2FS file system in kernel v5.17.
I have uploaded the system call sequence as case.c, and a fuzzed image can
be found in google net disk
The kernel should enable CONFIG_KASAN=y and CONFIG_KASAN_INLINE=y. You can
reproduce the bug by running the following commands:
kernel BUG at fs/f2fs/segment.c:2291!
Call Trace:
f2fs_invalidate_blocks+0x193/0x2d0
f2fs_fallocate+0x2593/0x4a70
vfs_fallocate+0x2a5/0xac0
ksys_fallocate+0x35/0x70
__x64_sys_fallocate+0x8e/0xf0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is, after image was fuzzed, block mapping info in inode
will be inconsistent with SIT table, so in f2fs_fallocate(), it will cause
panic when updating SIT with invalid blkaddr.
Let's fix the issue by adding sanity check on block address before updating
SIT table with it.
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4d17e6fe9293d57081ffdc11e1cf313e25e8fd9e upstream.
As Yanming reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215897
I have encountered a bug in F2FS file system in kernel v5.17.
The kernel should enable CONFIG_KASAN=y and CONFIG_KASAN_INLINE=y. You can
reproduce the bug by running the following commands:
The kernel message is shown below:
kernel BUG at fs/f2fs/f2fs.h:2511!
Call Trace:
f2fs_remove_inode_page+0x2a2/0x830
f2fs_evict_inode+0x9b7/0x1510
evict+0x282/0x4e0
do_unlinkat+0x33a/0x540
__x64_sys_unlinkat+0x8e/0xd0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is: .total_valid_block_count or .total_valid_node_count
could fuzzed to zero, then once dec_valid_node_count() was called, it
will cause BUG_ON(), this patch fixes to print warning info and set
SBI_NEED_FSCK into CP instead of panic.
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 118f09eda21d392e1eeb9f8a4bee044958cccf20 ]
Mark async operations such as RENAME, REMOVE, COMMIT MOVEABLE
for the nfsv4.1+ sessions.
Fixes: 85e39feead948 ("NFSv4.1 identify and mark RPC tasks that can move between transports")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|