Age | Commit message (Collapse) | Author | Files | Lines |
|
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
converts to use iomap interface, it removed trace_erofs_readpage()
tracepoint in the meantime, let's add it back.
Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250708111942.3120926-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
converts to use iomap interface, it removed trace_erofs_readahead()
tracepoint in the meantime, let's add it back.
Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250707084832.2725677-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
In __cachefiles_write(), if the return value of the write operation > 0, it
is set to 0. This makes it impossible to distinguish scenarios where a
partial write has occurred, and will affect the outer calling functions:
1) cachefiles_write_complete() will call "term_func" such as
netfs_write_subrequest_terminated(). When "ret" in __cachefiles_write()
is used as the "transferred_or_error" of this function, it can not
distinguish the amount of data written, makes the WARN meaningless.
2) cachefiles_ondemand_fd_write_iter() can only assume all writes were
successful by default when "ret" is 0, and unconditionally return the full
length specified by user space.
Fix it by modifying "ret" to reflect the actual number of bytes written.
Furthermore, returning a value greater than 0 from __cachefiles_write()
does not affect other call paths, such as cachefiles_issue_write() and
fscache_write().
Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/20250703024418.2809353-1-wozizhi@huaweicloud.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The root inode of /proc having a fixed inode number has been part of the
core kernel ABI since its inception, and recently some userspace
programs (mainly container runtimes) have started to explicitly depend
on this behaviour.
The main reason this is useful to userspace is that by checking that a
suspect /proc handle has fstype PROC_SUPER_MAGIC and is PROCFS_ROOT_INO,
they can then use openat2(RESOLVE_{NO_{XDEV,MAGICLINK},BENEATH}) to
ensure that there isn't a bind-mount that replaces some procfs file with
a different one. This kind of attack has lead to security issues in
container runtimes in the past (such as CVE-2019-19921) and libraries
like libpathrs[1] use this feature of procfs to provide safe procfs
handling functions.
There was also some trailing whitespace in the "struct proc_dir_entry"
initialiser, so fix that up as well.
[1]: https://github.com/openSUSE/libpathrs
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/20250708-uapi-procfs-root-ino-v1-1-6ae61e97c79b@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
evict_inodes() uses list_for_each_entry_safe() to iterate sb->s_inodes
list. However, since we use i_lru list entry for our local temporary
list of inodes to destroy, the inode is guaranteed to stay in
sb->s_inodes list while we hold sb->s_inode_list_lock. So there is no
real need for safe iteration variant and we can use
list_for_each_entry() just fine.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/20250709090635.26319-2-jack@suse.cz
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When commit d3556babd7fa ("ocfs2: fix d_splice_alias() return code
checking") was merged into v3.18-rc3, d_splice_alias() was returning one
of a valid dentry, NULL or an ERR_PTR.
When commit b5ae6b15bd73 ("merge d_materialise_unique() into
d_splice_alias()") was merged into v3.19-rc1, d_splice_alias() started
returning -ELOOP as one of ERR_PTR values.
Now, when syzkaller mounts a crafted ocfs2 filesystem image that hits
d_splice_alias() == -ELOOP case from ocfs2_lookup(), ocfs2_lookup() fails
to handle -ELOOP case and generic_shutdown_super() hits "VFS: Busy inodes
after unmount" message.
Instead of calling ocfs2_dentry_attach_lock() or ocfs2_dentry_attach_gen()
when d_splice_alias() returned an ERR_PTR value, change ocfs2_lookup() to
bail out immediately.
Also, ocfs2_lookup() needs to call dupt() when ocfs2_dentry_attach_lock()
returned an ERR_PTR value.
Link: https://lkml.kernel.org/r/da5be67d-2a0b-4b93-85d6-42f3b7440135@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+1134d3a5b062e9665a7a@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=1134d3a5b062e9665a7a
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tetsuo Handa <penguin-kernel@i-love-sakura.ne.jp>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There are two cleanups for vmcore_add_device_dump(). Return -ENOMEM
directly rather than goto the label to simplify the code and use
scoped_guard() to simplify the lock/unlock code.
Link: https://lkml.kernel.org/r/20250626105440.1053139-1-suhui@nfschina.com
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Suhui <suhui@nfschina.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since lockdep_set_class() uses stringified key name via macro, calling
lockdep_set_class() with an array causes lockdep warning messages to
report variable name than actual index number.
Change ocfs2_init_locked_inode() to pass actual index number for better
readability of lockdep reports. This patch does not change behavior.
Before:
Chain exists of:
&ocfs2_sysfile_lock_key[args->fi_sysfile_type] --> jbd2_handle --> &oi->ip_xattr_sem
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&oi->ip_xattr_sem);
lock(jbd2_handle);
lock(&oi->ip_xattr_sem);
lock(&ocfs2_sysfile_lock_key[args->fi_sysfile_type]);
*** DEADLOCK ***
After:
Chain exists of:
&ocfs2_sysfile_lock_key[EXTENT_ALLOC_SYSTEM_INODE] --> jbd2_handle --> &oi->ip_xattr_sem
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&oi->ip_xattr_sem);
lock(jbd2_handle);
lock(&oi->ip_xattr_sem);
lock(&ocfs2_sysfile_lock_key[EXTENT_ALLOC_SYSTEM_INODE]);
*** DEADLOCK ***
Link: https://lkml.kernel.org/r/29348724-639c-443d-bbce-65c3a0a13a38@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
fsfuzzer may make many invalid access for FAT-fs and generate many kmsg
like "FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)".
For platforms & os that enables hardware serial device whose speed are
slow, this may cause softlockup easily.
So let's ratelimit the error log.
The log as below:
[11916.242560] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.254485] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.266388] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.278287] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.290180] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.302068] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.313962] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.325848] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.337732] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.349619] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.361505] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.373391] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.385272] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.397144] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.409025] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.420909] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.432791] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.444674] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.456558] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.468446] FAT-fs (loop2): error, invalid access to FAT (entry 0x00000ccb)
[11916.480352] watchdog: BUG: soft lockup - CPU#58 stuck for 26s! [cat:2446035]
[11916.480357] Modules linked in: ...
[11916.480503] CPU: 58 PID: 2446035 Comm: cat Kdump: loaded Tainted: ...
[11916.480508] Hardware name: vclusters VSFT5000 B/VSFT5000 B, BIOS ...
[11916.480510] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[11916.480513] pc : console_emit_next_record+0x1b4/0x288
[11916.480524] lr : console_emit_next_record+0x1ac/0x288
[11916.480525] sp : ffff80009bcdae90
[11916.480527] x29: ffff80009bcdaec0 x28: ffff800082513810 x27: 0000000000000001
[11916.480530] x26: 0000000000000001 x25: ffff800081f66000 x24: 0000000000000000
[11916.480533] x23: 0000000000000000 x22: ffff80009bcdaf8f x21: 0000000000000001
[11916.480535] x20: 0000000000000000 x19: ffff800082513810 x18: ffffffffffffffff
[11916.480538] x17: 0000000000000002 x16: 0000000000000001 x15: ffff80009bcdab30
[11916.480541] x14: 0000000000000000 x13: 205d353330363434 x12: 32545b5d36343438
[11916.480543] x11: 652820544146206f x10: 7420737365636361 x9 : ffff800080159a6c
[11916.480546] x8 : 69202c726f727265 x7 : 545b5d3634343836 x6 : 342e36313931315b
[11916.480549] x5 : ffff800082513a01 x4 : ffff80009bcdad31 x3 : 0000000000000000
[11916.480551] x2 : 00000000ffffffff x1 : 0000000001b9b000 x0 : ffff8000836cef00
[11916.480554] Call trace:
[11916.480557] console_emit_next_record+0x1b4/0x288
[11916.480560] console_flush_all+0xcc/0x190
[11916.480563] console_unlock+0x78/0x138
[11916.480565] vprintk_emit+0x1c4/0x210
[11916.480568] vprintk_default+0x40/0x58
[11916.480570] vprintk+0x84/0xc8
[11916.480572] _printk+0x68/0xa0
[11916.480578] _fat_msg+0x6c/0xa0 [fat]
[11916.480593] __fat_fs_error+0xf8/0x118 [fat]
[11916.480601] fat_ent_read+0x164/0x238 [fat]
[11916.480609] fat_get_cluster+0x180/0x2c8 [fat]
[11916.480617] fat_get_mapped_cluster+0xb8/0x170 [fat]
Link: https://lkml.kernel.org/r/20250620020231.9292-1-me@linux.beauty
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The code checks newfe_bh for NULL after it has already been dereferenced
to access b_data. This NULL check is unnecessary for two reasons:
1. If ocfs2_inode_lock() succeeds (returns >= 0), newfe_bh is guaranteed
to be valid.
2. We've already dereferenced newfe_bh to access b_data, so it must be
non-NULL at this point.
Remove the redundant NULL check in the trace_ocfs2_rename_over_existing()
call to improve code clarity.
Link: https://lkml.kernel.org/r/20250617012534.3458669-1-leo.lilong@huawei.com
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The reproducer uses FAULT_INJECTION to make memory allocation fail, which
causes __filemap_get_folio() to fail, when initializing w_folios[i] in
ocfs2_grab_folios_for_write(), it only returns an error code and the value
of w_folios[i] is the error code, which causes
ocfs2_unlock_and_free_folios() to recycle the invalid w_folios[i] when
releasing folios.
Link: https://lkml.kernel.org/r/20250616013140.3602219-1-lizhi.xu@windriver.com
Reported-by: syzbot+c2ea94ae47cd7e3881ec@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c2ea94ae47cd7e3881ec
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove an access to page->mapping and a few calls to the old page-based
APIs. This doesn't support large folios, but it's still a nice
improvement.
Link: https://lkml.kernel.org/r/20250612143903.2849289-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "squashfs: Remove page->mapping references".
We're close to being able to kill page->mapping. These two patches get us
a little bit closer.
This patch (of 2):
Eliminate a reference to page->mapping by passing the inode from the
caller.
Link: https://lkml.kernel.org/r/20250612143903.2849289-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250612143903.2849289-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kstrtol() is better because simple_strtol() ignores overflow. And using
kstrtol() is more concise.
Link: https://lkml.kernel.org/r/20250527092333.1917391-1-suhui@nfschina.com
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All PFN_* pfn_t flags have been removed. Therefore there is no longer a
need for the pfn_t type and all uses can be replaced with normal pfns.
Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Björn Töpel <bjorn@rivosinc.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The dcssblk driver was the last user of FS_DAX_LIMITED. That was marked
broken by 653d7825c149 ("dcssblk: mark DAX broken, remove FS_DAX_LIMITED
support") to allow removal of PFN_SPECIAL. However the FS_DAX_LIMITED
config option itself was not removed, so do that now.
Link: https://lkml.kernel.org/r/b47bf164b4a1013d736fa1a3d501bc8b8e71311f.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Björn Töpel <bjorn@rivosinc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAX was the only thing that created pmd_devmap and pud_devmap entries
however it no longer does as DAX pages are now refcounted normally and
pXd_trans_huge() returns true for those. Therefore checking both
pXd_devmap and pXd_trans_huge() is redundant and the former can be removed
without changing behaviour as it will always be false.
Link: https://lkml.kernel.org/r/d58f089dc16b7feb7c6728164f37dea65d64a0d3.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Björn Töpel <bjorn@rivosinc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
PFN_DEV was used by callers of dax_direct_access() to figure out if the
returned PFN is associated with a page using pfn_t_has_page() or not.
However all DAX PFNs now require an assoicated ZONE_DEVICE page so can
assume a page exists.
Other users of PFN_DEV were setting it before calling vmf_insert_mixed().
This is unnecessary as it is no longer checked, instead relying on
pfn_valid() to determine if there is an associated page or not.
Link: https://lkml.kernel.org/r/74b293aebc21b941090bc3e7aeafa91b57c821a5.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Björn Töpel <bjorn@rivosinc.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: Remove pXX_devmap page table bit and pfn_t type", v3.
All users of dax now require a ZONE_DEVICE page which is properly
refcounted. This means there is no longer any need for the PFN_DEV,
PFN_MAP and PFN_SPECIAL flags. Furthermore the PFN_SG_CHAIN and
PFN_SG_LAST flags never appear to have been used. It is therefore
possible to remove the pfn_t type and replace any usage with raw pfns.
The remaining users of PFN_DEV have simply passed this to
vmf_insert_mixed() to create pte_devmap() mappings. It is unclear why
this was the case but presumably to ensure vm_normal_page() does not
return these pages. These users can be trivially converted to raw pfns
and creating a pXX_special() mapping to ensure vm_normal_page() still
doesn't return these pages.
Now that there are no users of PFN_DEV we can remove the devmap page table
bit and all associated functions and macros, freeing up a software page
table bit.
This patch (of 14):
Currently dax is the only user of pmd and pud mapped ZONE_DEVICE pages.
Therefore page walkers that want to exclude DAX pages can check pmd_devmap
or pud_devmap. However soon dax will no longer set PFN_DEV, meaning dax
pages are mapped as normal pages.
Ensure page walkers that currently use pXd_devmap to skip DAX pages
continue to do so by adding explicit checks of the VMA instead.
Remove vma_is_dax() check from mm/userfaultfd.c as validate_move_areas()
will already skip DAX VMA's on account of them not being anonymous.
The check in userfaultfd_must_wait() is also redundant as
vma_can_userfault() should have already filtered out dax vma's.
For HMM the pud_devmap check seems unnecessary as there is no reason we
shouldn't be able to handle any leaf PUD here so remove it also.
Link: https://lkml.kernel.org/r/cover.176965585864cb8d2cf41464b44dcc0471e643a0.1750323463.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/f0611f6f475f48fcdf34c65084a359aefef4cccc.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Björn Töpel <bjorn@rivosinc.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/memfd: Reserve hugetlb folios before allocation", v4.
There are cases when we try to pin a folio but discover that it has not
been faulted-in. So, we try to allocate it in memfd_alloc_folio() but the
allocation request may not succeed if there are no active reservations in
the system at that instant.
Therefore, making a reservation (by calling hugetlb_reserve_pages())
associated with the allocation will ensure that our request would not fail
due to lack of reservations. This will also ensure that proper
region/subpool accounting is done with our allocation.
This patch (of 3):
Currently, hugetlb_reserve_pages() returns a bool to indicate whether the
reservation map update for the range [from, to] was successful or not.
This is not sufficient for the case where the caller needs to determine
how many entries were updated for the range.
Therefore, have hugetlb_reserve_pages() return the number of entries
updated in the reservation map associated with the range [from, to].
Also, update the callers of hugetlb_reserve_pages() to handle the new
return value.
Link: https://lkml.kernel.org/r/20250618053415.1036185-1-vivek.kasireddy@intel.com
Link: https://lkml.kernel.org/r/20250618053415.1036185-2-vivek.kasireddy@intel.com
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Steve Sistare <steven.sistare@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The core kernel code is currently very inconsistent in its use of
vm_flags_t vs. unsigned long. This prevents us from changing the type of
vm_flags_t in the future and is simply not correct, so correct this.
While this results in rather a lot of churn, it is a critical
pre-requisite for a future planned change to VMA flag type.
Additionally, update VMA userland tests to account for the changes.
To make review easier and to break things into smaller parts, driver and
architecture-specific changes is left for a subsequent commit.
The code has been adjusted to cascade the changes across all calling code
as far as is needed.
We will adjust architecture-specific and driver code in a subsequent patch.
Overall, this patch does not introduce any functional change.
Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Kees Cook <kees@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Retrieve a folio from the pagecache instead of a page and operate on it.
Removes several hidden calls to compound_head() along with calls to
deprecated functions like wait_on_page_writeback() and find_lock_page().
[dan.carpenter@linaro.org: fix NULL vs IS_ERR() bug in ceph_zero_partial_page()]
Link: https://lkml.kernel.org/r/685c1424.050a0220.baa8.d6a1@mx.google.com
Link: https://lkml.kernel.org/r/20250612143443.2848197-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Alex Markuze <amarkuze@redhat.com>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
memzero_page() is the new name for zero_user().
Link: https://lkml.kernel.org/r/20250612143443.2848197-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
et.al
Check pde->proc_ops->proc_lseek directly may cause UAF in rmmod scenario.
It's a gap in proc_reg_open() after commit 654b33ada4ab("proc: fix UAF in
proc_get_inode()"). Followed by AI Viro's suggestion, fix it in same
manner.
Link: https://lkml.kernel.org/r/20250607021353.1127963-1-wangzijie1@honor.com
Fixes: 3f61631d47f1 ("take care to handle NULL ->proc_lseek()")
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET have been unused since
they were added in commit 932b18e0aec6 ("userfaultfd:
linux/userfaultfd_k.h"). Remove them and the associated BUILD_BUG_ON()
checks.
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-4-a7274d3bd5e4@columbia.edu
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
BUG_ON() is deprecated [1]. Convert all the BUG_ON()s and VM_BUG_ON()s to
use VM_WARN_ON_ONCE().
There are a few additional cases that are converted or modified:
- Convert the printk(KERN_WARNING ...) in handle_userfault() to use
pr_warn().
- Convert the WARN_ON_ONCE()s in move_pages() to use VM_WARN_ON_ONCE(),
as the relevant conditions are already checked in validate_range() in
move_pages()'s caller.
- Convert the VM_WARN_ON()'s in move_pages() to VM_WARN_ON_ONCE(). These
cases should never happen and are similar to those in mfill_atomic()
and mfill_atomic_hugetlb(), which were previously BUG_ON()s.
move_pages() was added later than those functions and makes use of
VM_WARN_ON() as a replacement for the deprecated BUG_ON(), but.
VM_WARN_ON_ONCE() is likely a better direct replacement.
- Convert the WARN_ON() for !VM_MAYWRITE in userfaultfd_unregister() and
userfaultfd_register_range() to VM_WARN_ON_ONCE(). This condition is
enforced in userfaultfd_register() so it should never happen, and can
be converted to a debug check.
[1] https://www.kernel.org/doc/html/v6.15/process/coding-style.html#use-warn-rather-than-bug
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-3-a7274d3bd5e4@columbia.edu
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, a VMA registered with a uffd can be unregistered through a
different uffd associated with the same mm_struct.
The existing behavior is slightly broken and may incorrectly reject
unregistering some VMAs due to the following check:
if (!vma_can_userfault(cur, cur->vm_flags, wp_async))
goto out_unlock;
where wp_async is derived from ctx, not from cur. For example, a
file-backed VMA registered with wp_async enabled and UFFD_WP mode cannot
be unregistered through a uffd that does not have wp_async enabled.
Rather than fix this and maintain this odd behavior, make unregistration
stricter by requiring VMAs to be unregistered through the same uffd they
were registered with. Additionally, reorder the BUG() checks to avoid the
aforementioned wp_async issue in them. Convert the existing check to
VM_WARN_ON_ONCE() as BUG_ON() is deprecated.
This change slightly modifies the ABI. It should not be backported to
-stable. It is expected that no one depends on this behavior, and no such
cases are known.
While at it, correct the comment for the no userfaultfd case. This seems
to be a copy-paste artifact from the analogous userfaultfd_register()
check.
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-2-a7274d3bd5e4@columbia.edu
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This field is now only set to one in the i915 gem code that only calls
writeback_iter on it, which ignores the flag. All other checks are thuse
dead code and the field can be removed.
Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the 'RES' field is always 0 displayed by the top
command for some processes, which will cause a lot of confusion for users.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
The main reason is that the batch size of the percpu counter is quite
large on these machines, caching a significant percpu value, since
converting mm's rss stats into percpu_counter by commit f1a7941243c1 ("mm:
convert mm's rss stats into percpu_counter"). Intuitively, the batch
number should be optimized, but on some paths, performance may take
precedence over statistical accuracy. Therefore, introducing a new
interface to add the percpu statistical count and display it to users,
which can remove the confusion. In addition, this change is not expected
to be on a performance-critical path, so the modification should be
acceptable.
In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch(). In percpu_counter_add_batch(), there is
percpu batch caching to avoid 'fbc->lock' contention. This patch changes
task_mem() and task_statm() to get the accurate mm counters under the
'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock
contention due to the percpu batch caching of the mm counters. The
following test also confirm the theoretical analysis.
I run the stress-ng that stresses anon page faults in 32 threads on my 32
cores machine, while simultaneously running a script that starts 32
threads to busy-loop pread each stress-ng thread's /proc/pid/status
interface. From the following data, I did not observe any obvious impact
of this patch on the stress-ng tests.
w/o patch:
stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec
stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle)
stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec
stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec
w/patch:
stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec
stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle)
stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec
stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec
On comparing a very simple app which just allocates & touches some
memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus
tree (4c06e63b9203) I can see that on latest Linus tree the values for
VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while
they do report values on v6.1 and a Linus tree with this patch.
Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by Donet Tom <donettom@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 6cb3b1c2df87 changed how finish_xmote() clears the GLF_LOCK flag,
but it failed to adjust the equivalent code in do_xmote(). Fix that.
Fixes: 6cb3b1c2df87 ("gfs2: Fix additional unlikely request cancelation race")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
There are redundant codes in IS_CUR{SEG,SEC}() macros, let's introduce
inline is_cur{seg,sec}() functions, and use a loop in it for cleanup.
Meanwhile, it enhances expansibility, as it doesn't need to change
is_cur{seg,sec}() when we add a new log header.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
As syzbot [1] reported as below:
R10: 0000000000000100 R11: 0000000000000206 R12: 00007ffe17473450
R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520
</TASK>
---[ end trace 0000000000000000 ]---
==================================================================
BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
Read of size 8 at addr ffff88812d962278 by task syz-executor/564
CPU: 1 PID: 564 Comm: syz-executor Tainted: G W 6.1.129-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack+0x21/0x24 lib/dump_stack.c:88
dump_stack_lvl+0xee/0x158 lib/dump_stack.c:106
print_address_description+0x71/0x210 mm/kasan/report.c:316
print_report+0x4a/0x60 mm/kasan/report.c:427
kasan_report+0x122/0x150 mm/kasan/report.c:531
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:351
__list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
__list_del_entry include/linux/list.h:134 [inline]
list_del_init include/linux/list.h:206 [inline]
f2fs_inode_synced+0xf7/0x2e0 fs/f2fs/super.c:1531
f2fs_update_inode+0x74/0x1c40 fs/f2fs/inode.c:585
f2fs_update_inode_page+0x137/0x170 fs/f2fs/inode.c:703
f2fs_write_inode+0x4ec/0x770 fs/f2fs/inode.c:731
write_inode fs/fs-writeback.c:1460 [inline]
__writeback_single_inode+0x4a0/0xab0 fs/fs-writeback.c:1677
writeback_single_inode+0x221/0x8b0 fs/fs-writeback.c:1733
sync_inode_metadata+0xb6/0x110 fs/fs-writeback.c:2789
f2fs_sync_inode_meta+0x16d/0x2a0 fs/f2fs/checkpoint.c:1159
block_operations fs/f2fs/checkpoint.c:1269 [inline]
f2fs_write_checkpoint+0xca3/0x2100 fs/f2fs/checkpoint.c:1658
kill_f2fs_super+0x231/0x390 fs/f2fs/super.c:4668
deactivate_locked_super+0x98/0x100 fs/super.c:332
deactivate_super+0xaf/0xe0 fs/super.c:363
cleanup_mnt+0x45f/0x4e0 fs/namespace.c:1186
__cleanup_mnt+0x19/0x20 fs/namespace.c:1193
task_work_run+0x1c6/0x230 kernel/task_work.c:203
exit_task_work include/linux/task_work.h:39 [inline]
do_exit+0x9fb/0x2410 kernel/exit.c:871
do_group_exit+0x210/0x2d0 kernel/exit.c:1021
__do_sys_exit_group kernel/exit.c:1032 [inline]
__se_sys_exit_group kernel/exit.c:1030 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1030
x64_sys_call+0x7b4/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
RIP: 0033:0x7f28b1b8e169
Code: Unable to access opcode bytes at 0x7f28b1b8e13f.
RSP: 002b:00007ffe174710a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f28b1c10879 RCX: 00007f28b1b8e169
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000002 R08: 00007ffe1746ee47 R09: 00007ffe17472360
R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffe17472360
R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520
</TASK>
Allocated by task 569:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4b/0x70 mm/kasan/common.c:52
kasan_save_alloc_info+0x25/0x30 mm/kasan/generic.c:505
__kasan_slab_alloc+0x72/0x80 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook+0x4f/0x2c0 mm/slab.h:737
slab_alloc_node mm/slub.c:3398 [inline]
slab_alloc mm/slub.c:3406 [inline]
__kmem_cache_alloc_lru mm/slub.c:3413 [inline]
kmem_cache_alloc_lru+0x104/0x220 mm/slub.c:3429
alloc_inode_sb include/linux/fs.h:3245 [inline]
f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419
alloc_inode fs/inode.c:261 [inline]
iget_locked+0x186/0x880 fs/inode.c:1373
f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483
f2fs_lookup+0x366/0xab0 fs/f2fs/namei.c:487
__lookup_slow+0x2a3/0x3d0 fs/namei.c:1690
lookup_slow+0x57/0x70 fs/namei.c:1707
walk_component+0x2e6/0x410 fs/namei.c:1998
lookup_last fs/namei.c:2455 [inline]
path_lookupat+0x180/0x490 fs/namei.c:2479
filename_lookup+0x1f0/0x500 fs/namei.c:2508
vfs_statx+0x10b/0x660 fs/stat.c:229
vfs_fstatat fs/stat.c:267 [inline]
vfs_lstat include/linux/fs.h:3424 [inline]
__do_sys_newlstat fs/stat.c:423 [inline]
__se_sys_newlstat+0xd5/0x350 fs/stat.c:417
__x64_sys_newlstat+0x5b/0x70 fs/stat.c:417
x64_sys_call+0x393/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:7
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
Freed by task 13:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4b/0x70 mm/kasan/common.c:52
kasan_save_free_info+0x31/0x50 mm/kasan/generic.c:516
____kasan_slab_free+0x132/0x180 mm/kasan/common.c:236
__kasan_slab_free+0x11/0x20 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:177 [inline]
slab_free_hook mm/slub.c:1724 [inline]
slab_free_freelist_hook+0xc2/0x190 mm/slub.c:1750
slab_free mm/slub.c:3661 [inline]
kmem_cache_free+0x12d/0x2a0 mm/slub.c:3683
f2fs_free_inode+0x24/0x30 fs/f2fs/super.c:1562
i_callback+0x4c/0x70 fs/inode.c:250
rcu_do_batch+0x503/0xb80 kernel/rcu/tree.c:2297
rcu_core+0x5a2/0xe70 kernel/rcu/tree.c:2557
rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2574
handle_softirqs+0x178/0x500 kernel/softirq.c:578
run_ksoftirqd+0x28/0x30 kernel/softirq.c:945
smpboot_thread_fn+0x45a/0x8c0 kernel/smpboot.c:164
kthread+0x270/0x310 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
Last potentially related work creation:
kasan_save_stack+0x3a/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xb6/0xc0 mm/kasan/generic.c:486
kasan_record_aux_stack_noalloc+0xb/0x10 mm/kasan/generic.c:496
call_rcu+0xd4/0xf70 kernel/rcu/tree.c:2845
destroy_inode fs/inode.c:316 [inline]
evict+0x7da/0x870 fs/inode.c:720
iput_final fs/inode.c:1834 [inline]
iput+0x62b/0x830 fs/inode.c:1860
do_unlinkat+0x356/0x540 fs/namei.c:4397
__do_sys_unlink fs/namei.c:4438 [inline]
__se_sys_unlink fs/namei.c:4436 [inline]
__x64_sys_unlink+0x49/0x50 fs/namei.c:4436
x64_sys_call+0x958/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:88
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
The buggy address belongs to the object at ffff88812d961f20
which belongs to the cache f2fs_inode_cache of size 1200
The buggy address is located 856 bytes inside of
1200-byte region [ffff88812d961f20, ffff88812d9623d0)
The buggy address belongs to the physical page:
page:ffffea0004b65800 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12d960
head:ffffea0004b65800 order:2 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=1)
raw: 4000000000010200 0000000000000000 dead000000000122 ffff88810a94c500
raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Reclaimable, gfp_mask 0x1d2050(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_RECLAIMABLE), pid 569, tgid 568 (syz.2.16), ts 55943246141, free_ts 0
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x1d0/0x1f0 mm/page_alloc.c:2532
prep_new_page mm/page_alloc.c:2539 [inline]
get_page_from_freelist+0x2e63/0x2ef0 mm/page_alloc.c:4328
__alloc_pages+0x235/0x4b0 mm/page_alloc.c:5605
alloc_slab_page include/linux/gfp.h:-1 [inline]
allocate_slab mm/slub.c:1939 [inline]
new_slab+0xec/0x4b0 mm/slub.c:1992
___slab_alloc+0x6f6/0xb50 mm/slub.c:3180
__slab_alloc+0x5e/0xa0 mm/slub.c:3279
slab_alloc_node mm/slub.c:3364 [inline]
slab_alloc mm/slub.c:3406 [inline]
__kmem_cache_alloc_lru mm/slub.c:3413 [inline]
kmem_cache_alloc_lru+0x13f/0x220 mm/slub.c:3429
alloc_inode_sb include/linux/fs.h:3245 [inline]
f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419
alloc_inode fs/inode.c:261 [inline]
iget_locked+0x186/0x880 fs/inode.c:1373
f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483
f2fs_fill_super+0x3ad7/0x6bb0 fs/f2fs/super.c:4293
mount_bdev+0x2ae/0x3e0 fs/super.c:1443
f2fs_mount+0x34/0x40 fs/f2fs/super.c:4642
legacy_get_tree+0xea/0x190 fs/fs_context.c:632
vfs_get_tree+0x89/0x260 fs/super.c:1573
do_new_mount+0x25a/0xa20 fs/namespace.c:3056
page_owner free stack trace missing
Memory state around the buggy address:
ffff88812d962100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88812d962180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88812d962200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88812d962280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88812d962300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
[1] https://syzkaller.appspot.com/x/report.txt?x=13448368580000
This bug can be reproduced w/ the reproducer [2], once we enable
CONFIG_F2FS_CHECK_FS config, the reproducer will trigger panic as below,
so the direct reason of this bug is the same as the one below patch [3]
fixed.
kernel BUG at fs/f2fs/inode.c:857!
RIP: 0010:f2fs_evict_inode+0x1204/0x1a20
Call Trace:
<TASK>
evict+0x32a/0x7a0
do_unlinkat+0x37b/0x5b0
__x64_sys_unlink+0xad/0x100
do_syscall_64+0x5a/0xb0
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0010:f2fs_evict_inode+0x1204/0x1a20
[2] https://syzkaller.appspot.com/x/repro.c?x=17495ccc580000
[3] https://lore.kernel.org/linux-f2fs-devel/20250702120321.1080759-1-chao@kernel.org
Tracepoints before panic:
f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file1
f2fs_unlink_exit: dev = (7,0), ino = 7, ret = 0
f2fs_evict_inode: dev = (7,0), ino = 7, pino = 3, i_mode = 0x81ed, i_size = 10, i_nlink = 0, i_blocks = 0, i_advise = 0x0
f2fs_truncate_node: dev = (7,0), ino = 7, nid = 8, block_address = 0x3c05
f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file3
f2fs_unlink_exit: dev = (7,0), ino = 8, ret = 0
f2fs_evict_inode: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 9000, i_nlink = 0, i_blocks = 24, i_advise = 0x4
f2fs_truncate: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 0, i_nlink = 0, i_blocks = 24, i_advise = 0x4
f2fs_truncate_blocks_enter: dev = (7,0), ino = 8, i_size = 0, i_blocks = 24, start file offset = 0
f2fs_truncate_blocks_exit: dev = (7,0), ino = 8, ret = -2
The root cause is: in the fuzzed image, dnode #8 belongs to inode #7,
after inode #7 eviction, dnode #8 was dropped.
However there is dirent that has ino #8, so, once we unlink file3, in
f2fs_evict_inode(), both f2fs_truncate() and f2fs_update_inode_page()
will fail due to we can not load node #8, result in we missed to call
f2fs_inode_synced() to clear inode dirty status.
Let's fix this by calling f2fs_inode_synced() in error path of
f2fs_evict_inode().
PS: As I verified, the reproducer [2] can trigger this bug in v6.1.129,
but it failed in v6.16-rc4, this is because the testcase will stop due to
other corruption has been detected by f2fs:
F2FS-fs (loop0): inconsistent node block, node_type:2, nid:8, node_footer[nid:8,ino:8,ofs:0,cpver:5013063228981249506,blkaddr:15366]
F2FS-fs (loop0): f2fs_lookup: inode (ino=9) has zero i_nlink
Fixes: 0f18b462b2e5 ("f2fs: flush inode metadata when checkpoint is doing")
Closes: https://syzkaller.appspot.com/x/report.txt?x=13448368580000
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
syzbot reported an UAF issue as below: [1] [2]
[1] https://syzkaller.appspot.com/text?tag=CrashReport&x=16594c60580000
==================================================================
BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
Read of size 8 at addr ffff888100567dc8 by task kworker/u4:0/8
CPU: 1 PID: 8 Comm: kworker/u4:0 Tainted: G W 6.1.129-syzkaller-00017-g642656a36791 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Workqueue: writeback wb_workfn (flush-7:0)
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x151/0x1b7 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:316 [inline]
print_report+0x158/0x4e0 mm/kasan/report.c:427
kasan_report+0x13c/0x170 mm/kasan/report.c:531
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:351
__list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62
__list_del_entry include/linux/list.h:134 [inline]
list_del_init include/linux/list.h:206 [inline]
f2fs_inode_synced+0x100/0x2e0 fs/f2fs/super.c:1553
f2fs_update_inode+0x72/0x1c40 fs/f2fs/inode.c:588
f2fs_update_inode_page+0x135/0x170 fs/f2fs/inode.c:706
f2fs_write_inode+0x416/0x790 fs/f2fs/inode.c:734
write_inode fs/fs-writeback.c:1460 [inline]
__writeback_single_inode+0x4cf/0xb80 fs/fs-writeback.c:1677
writeback_sb_inodes+0xb32/0x1910 fs/fs-writeback.c:1903
__writeback_inodes_wb+0x118/0x3f0 fs/fs-writeback.c:1974
wb_writeback+0x3da/0xa00 fs/fs-writeback.c:2081
wb_check_background_flush fs/fs-writeback.c:2151 [inline]
wb_do_writeback fs/fs-writeback.c:2239 [inline]
wb_workfn+0xbba/0x1030 fs/fs-writeback.c:2266
process_one_work+0x73d/0xcb0 kernel/workqueue.c:2299
worker_thread+0xa60/0x1260 kernel/workqueue.c:2446
kthread+0x26d/0x300 kernel/kthread.c:386
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>
Allocated by task 298:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4b/0x70 mm/kasan/common.c:52
kasan_save_alloc_info+0x1f/0x30 mm/kasan/generic.c:505
__kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:333
kasan_slab_alloc include/linux/kasan.h:202 [inline]
slab_post_alloc_hook+0x53/0x2c0 mm/slab.h:768
slab_alloc_node mm/slub.c:3421 [inline]
slab_alloc mm/slub.c:3431 [inline]
__kmem_cache_alloc_lru mm/slub.c:3438 [inline]
kmem_cache_alloc_lru+0x102/0x270 mm/slub.c:3454
alloc_inode_sb include/linux/fs.h:3255 [inline]
f2fs_alloc_inode+0x2d/0x350 fs/f2fs/super.c:1437
alloc_inode fs/inode.c:261 [inline]
iget_locked+0x18c/0x7e0 fs/inode.c:1373
f2fs_iget+0x55/0x4ca0 fs/f2fs/inode.c:486
f2fs_lookup+0x3c1/0xb50 fs/f2fs/namei.c:484
__lookup_slow+0x2b9/0x3e0 fs/namei.c:1689
lookup_slow+0x5a/0x80 fs/namei.c:1706
walk_component+0x2e7/0x410 fs/namei.c:1997
lookup_last fs/namei.c:2454 [inline]
path_lookupat+0x16d/0x450 fs/namei.c:2478
filename_lookup+0x251/0x600 fs/namei.c:2507
vfs_statx+0x107/0x4b0 fs/stat.c:229
vfs_fstatat fs/stat.c:267 [inline]
vfs_lstat include/linux/fs.h:3434 [inline]
__do_sys_newlstat fs/stat.c:423 [inline]
__se_sys_newlstat+0xda/0x7c0 fs/stat.c:417
__x64_sys_newlstat+0x5b/0x70 fs/stat.c:417
x64_sys_call+0x52/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:7
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x3b/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
Freed by task 0:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4b/0x70 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x40 mm/kasan/generic.c:516
____kasan_slab_free+0x131/0x180 mm/kasan/common.c:241
__kasan_slab_free+0x11/0x20 mm/kasan/common.c:249
kasan_slab_free include/linux/kasan.h:178 [inline]
slab_free_hook mm/slub.c:1745 [inline]
slab_free_freelist_hook mm/slub.c:1771 [inline]
slab_free mm/slub.c:3686 [inline]
kmem_cache_free+0x291/0x560 mm/slub.c:3711
f2fs_free_inode+0x24/0x30 fs/f2fs/super.c:1584
i_callback+0x4b/0x70 fs/inode.c:250
rcu_do_batch+0x552/0xbe0 kernel/rcu/tree.c:2297
rcu_core+0x502/0xf40 kernel/rcu/tree.c:2557
rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2574
handle_softirqs+0x1db/0x650 kernel/softirq.c:624
__do_softirq kernel/softirq.c:662 [inline]
invoke_softirq kernel/softirq.c:479 [inline]
__irq_exit_rcu+0x52/0xf0 kernel/softirq.c:711
irq_exit_rcu+0x9/0x10 kernel/softirq.c:723
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1118 [inline]
sysvec_apic_timer_interrupt+0xa9/0xc0 arch/x86/kernel/apic/apic.c:1118
asm_sysvec_apic_timer_interrupt+0x1b/0x20 arch/x86/include/asm/idtentry.h:691
Last potentially related work creation:
kasan_save_stack+0x3b/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xb4/0xc0 mm/kasan/generic.c:486
kasan_record_aux_stack_noalloc+0xb/0x10 mm/kasan/generic.c:496
__call_rcu_common kernel/rcu/tree.c:2807 [inline]
call_rcu+0xdc/0x10f0 kernel/rcu/tree.c:2926
destroy_inode fs/inode.c:316 [inline]
evict+0x87d/0x930 fs/inode.c:720
iput_final fs/inode.c:1834 [inline]
iput+0x616/0x690 fs/inode.c:1860
do_unlinkat+0x4e1/0x920 fs/namei.c:4396
__do_sys_unlink fs/namei.c:4437 [inline]
__se_sys_unlink fs/namei.c:4435 [inline]
__x64_sys_unlink+0x49/0x50 fs/namei.c:4435
x64_sys_call+0x289/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:88
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x3b/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x68/0xd2
The buggy address belongs to the object at ffff888100567a10
which belongs to the cache f2fs_inode_cache of size 1360
The buggy address is located 952 bytes inside of
1360-byte region [ffff888100567a10, ffff888100567f60)
The buggy address belongs to the physical page:
page:ffffea0004015800 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100560
head:ffffea0004015800 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=1)
raw: 4000000000010200 0000000000000000 dead000000000122 ffff8881002c4d80
raw: 0000000000000000 0000000080160016 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Reclaimable, gfp_mask 0xd2050(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_RECLAIMABLE), pid 298, tgid 298 (syz-executor330), ts 26489303743, free_ts 0
set_page_owner include/linux/page_owner.h:33 [inline]
post_alloc_hook+0x213/0x220 mm/page_alloc.c:2637
prep_new_page+0x1b/0x110 mm/page_alloc.c:2644
get_page_from_freelist+0x3a98/0x3b10 mm/page_alloc.c:4539
__alloc_pages+0x234/0x610 mm/page_alloc.c:5837
alloc_slab_page+0x6c/0xf0 include/linux/gfp.h:-1
allocate_slab mm/slub.c:1962 [inline]
new_slab+0x90/0x3e0 mm/slub.c:2015
___slab_alloc+0x6f9/0xb80 mm/slub.c:3203
__slab_alloc+0x5d/0xa0 mm/slub.c:3302
slab_alloc_node mm/slub.c:3387 [inline]
slab_alloc mm/slub.c:3431 [inline]
__kmem_cache_alloc_lru mm/slub.c:3438 [inline]
kmem_cache_alloc_lru+0x149/0x270 mm/slub.c:3454
alloc_inode_sb include/linux/fs.h:3255 [inline]
f2fs_alloc_inode+0x2d/0x350 fs/f2fs/super.c:1437
alloc_inode fs/inode.c:261 [inline]
iget_locked+0x18c/0x7e0 fs/inode.c:1373
f2fs_iget+0x55/0x4ca0 fs/f2fs/inode.c:486
f2fs_fill_super+0x5360/0x6dc0 fs/f2fs/super.c:4488
mount_bdev+0x282/0x3b0 fs/super.c:1445
f2fs_mount+0x34/0x40 fs/f2fs/super.c:4743
legacy_get_tree+0xf1/0x190 fs/fs_context.c:632
page_owner free stack trace missing
Memory state around the buggy address:
ffff888100567c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888100567d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888100567d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888100567e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888100567e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
[2] https://syzkaller.appspot.com/text?tag=CrashLog&x=13654c60580000
[ 24.675720][ T28] audit: type=1400 audit(1745327318.732:72): avc: denied { write } for pid=298 comm="syz-executor399" name="/" dev="loop0" ino=3 scontext=root:sysadm_r:sysadm_t tcontext=system_u:object_r:unlabeled_t tclass=dir permissive=1
[ 24.705426][ T296] ------------[ cut here ]------------
[ 24.706608][ T28] audit: type=1400 audit(1745327318.732:73): avc: denied { remove_name } for pid=298 comm="syz-executor399" name="file0" dev="loop0" ino=4 scontext=root:sysadm_r:sysadm_t tcontext=system_u:object_r:unlabeled_t tclass=dir permissive=1
[ 24.711550][ T296] WARNING: CPU: 0 PID: 296 at fs/f2fs/inode.c:847 f2fs_evict_inode+0x1262/0x1540
[ 24.734141][ T28] audit: type=1400 audit(1745327318.732:74): avc: denied { rename } for pid=298 comm="syz-executor399" name="file0" dev="loop0" ino=4 scontext=root:sysadm_r:sysadm_t tcontext=system_u:object_r:unlabeled_t tclass=dir permissive=1
[ 24.742969][ T296] Modules linked in:
[ 24.765201][ T28] audit: type=1400 audit(1745327318.732:75): avc: denied { add_name } for pid=298 comm="syz-executor399" name="bus" scontext=root:sysadm_r:sysadm_t tcontext=system_u:object_r:unlabeled_t tclass=dir permissive=1
[ 24.768847][ T296] CPU: 0 PID: 296 Comm: syz-executor399 Not tainted 6.1.129-syzkaller-00017-g642656a36791 #0
[ 24.799506][ T296] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
[ 24.809401][ T296] RIP: 0010:f2fs_evict_inode+0x1262/0x1540
[ 24.815018][ T296] Code: 34 70 4a ff eb 0d e8 2d 70 4a ff 4d 89 e5 4c 8b 64 24 18 48 8b 5c 24 28 4c 89 e7 e8 78 38 03 00 e9 84 fc ff ff e8 0e 70 4a ff <0f> 0b 4c 89 f7 be 08 00 00 00 e8 7f 21 92 ff f0 41 80 0e 04 e9 61
[ 24.834584][ T296] RSP: 0018:ffffc90000db7a40 EFLAGS: 00010293
[ 24.840465][ T296] RAX: ffffffff822aca42 RBX: 0000000000000002 RCX: ffff888110948000
[ 24.848291][ T296] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[ 24.856064][ T296] RBP: ffffc90000db7bb0 R08: ffffffff822ac6a8 R09: ffffed10200b005d
[ 24.864073][ T296] R10: 0000000000000000 R11: dffffc0000000001 R12: ffff888100580000
[ 24.871812][ T296] R13: dffffc0000000000 R14: ffff88810fef4078 R15: 1ffff920001b6f5c
The root cause is w/ a fuzzed image, f2fs may missed to clear FI_DIRTY_INODE
flag for target inode, after f2fs_evict_inode(), the inode is still linked in
sbi->inode_list[DIRTY_META] global list, once it triggers checkpoint,
f2fs_sync_inode_meta() may access the released inode.
In f2fs_evict_inode(), let's always call f2fs_inode_synced() to clear
FI_DIRTY_INODE flag and drop inode from global dirty list to avoid this
UAF issue.
Fixes: 0f18b462b2e5 ("f2fs: flush inode metadata when checkpoint is doing")
Closes: https://syzkaller.appspot.com/bug?extid=849174b2efaf0d8be6ba
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
options in f2fs_fill_super is alloc by kstrdup:
options = kstrdup((const char *)data, GFP_KERNEL)
sit_bitmap[_mir], nat_bitmap[_mir] are alloc by kmemdup:
sit_i->sit_bitmap = kmemdup(src_bitmap, sit_bitmap_size, GFP_KERNEL);
sit_i->sit_bitmap_mir = kmemdup(src_bitmap,
sit_bitmap_size, GFP_KERNEL);
nm_i->nat_bitmap = kmemdup(version_bitmap, nm_i->bitmap_size,
GFP_KERNEL);
nm_i->nat_bitmap_mir = kmemdup(version_bitmap, nm_i->bitmap_size,
GFP_KERNEL);
write_io is alloc by f2fs_kmalloc:
sbi->write_io[i] = f2fs_kmalloc(sbi,
array_size(n, sizeof(struct f2fs_bio_info))
Use kfree is more efficient.
Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com>
Signed-off-by: peixuan.qiu <peixuan.qiu@transsion.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Get rid of the GIF_ALLOC_FAILED flag; we can now be confident that the
additional consistency check isn't needed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Use the SECTOR_SIZE and SECTOR_SHIFT constants where appropriate instead
of hardcoding their values.
Reported-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Jann Horn points out that epoll is decrementing the ep refcount and then
doing a
mutex_unlock(&ep->mtx);
afterwards. That's very wrong, because it can lead to a use-after-free.
That pattern is actually fine for the very last reference, because the
code in question will delay the actual call to "ep_free(ep)" until after
it has unlocked the mutex.
But it's wrong for the much subtler "next to last" case when somebody
*else* may also be dropping their reference and free the ep while we're
still using the mutex.
Note that this is true even if that other user is also using the same ep
mutex: mutexes, unlike spinlocks, can not be used for object ownership,
even if they guarantee mutual exclusion.
A mutex "unlock" operation is not atomic, and as one user is still
accessing the mutex as part of unlocking it, another user can come in
and get the now released mutex and free the data structure while the
first user is still cleaning up.
See our mutex documentation in Documentation/locking/mutex-design.rst,
in particular the section [1] about semantics:
"mutex_unlock() may access the mutex structure even after it has
internally released the lock already - so it's not safe for
another context to acquire the mutex and assume that the
mutex_unlock() context is not using the structure anymore"
So if we drop our ep ref before the mutex unlock, but we weren't the
last one, we may then unlock the mutex, another user comes in, drops
_their_ reference and releases the 'ep' as it now has no users - all
while the mutex_unlock() is still accessing it.
Fix this by simply moving the ep refcount dropping to outside the mutex:
the refcount itself is atomic, and doesn't need mutex protection (that's
the whole _point_ of refcounts: unlike mutexes, they are inherently
about object lifetimes).
Reported-by: Jann Horn <jannh@google.com>
Link: https://docs.kernel.org/locking/mutex-design.html#semantics [1]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
typechecking is up to users, anyway...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702212616.GI3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
All users outside of fs/debugfs/file.c are gone, in there we can just
fully split the wrappers for full and short cases and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702212419.GG3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
->write() of file_operations that gets used only via debugfs_create_file()
is called only under debugfs_file_get()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20250702211650.GD3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This started showing up more when we started logging the error being
corrected in the journal - but __bch2_fsck_err() could return
transaction restarts before that.
Setting BCH_FS_error incorrectly causes recovery passes to not be
cleared, among other issues.
Fixes: b43f72492768 ("bcachefs: Log fsck errors in the journal")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If ksmbd_iov_pin_rsp return error, use-after-free can happen by
accessing opinfo->state and opinfo_put and ksmbd_fd_put could
called twice.
Reported-by: Ziyan Xu <research@securitygossip.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If the call of ksmbd_vfs_lock_parent() fails, we drop the parent_path
references and return an error. We need to drop the write access we
just got on parent_path->mnt before we drop the mount reference - callers
assume that ksmbd_vfs_kern_path_locked() returns with mount write
access grabbed if and only if it has returned 0.
Fixes: 864fb5d37163 ("ksmbd: fix possible deadlock in smb2_open")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The qp is created by rdma_create_qp() as t->cm_id->qp
and t->qp is just a shortcut.
rdma_destroy_qp() also calls ib_destroy_qp(cm_id->qp) internally,
but it is protected by a mutex, clears the cm_id and also calls
trace_cm_qp_destroy().
This should make the tracing more useful as both
rdma_create_qp() and rdma_destroy_qp() are traces and it makes
the code look more sane as functions from the same layer are used
for the specific qp object.
trace-cmd stream -e rdma_cma:cm_qp_create -e rdma_cma:cm_qp_destroy
shows this now while doing a mount and unmount from a client:
<...>-80 [002] 378.514182: cm_qp_create: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 pd.id=0 qp_type=RC send_wr=867 recv_wr=255 qp_num=1 rc=0
<...>-6283 [001] 381.686172: cm_qp_destroy: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 qp_num=1
Before we only saw the first line.
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Steve French <stfrench@microsoft.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Tom Talpey <tom@talpey.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Since [1], it is possible for filesystems to have blocksize > PAGE_SIZE
of the system.
Remove the assumption and make the check generic for all blocksizes in
generic_check_addressable().
[1] https://lore.kernel.org/linux-xfs/20240822135018.1931258-1-kernel@pankajraghav.com/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/20250630104018.213985-1-p.raghav@samsung.com
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
All filesystems will already check the max and min value of their block
size during their initialization. __getblk_slow() is a very low-level
function to have these checks. Remove them and only check for logical
block size alignment.
As this check with logical block size alignment might never trigger, add
WARN_ON_ONCE() to the check. As WARN_ON_ONCE() will already print the
stack, remove the call to dump_stack().
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/20250626113223.181399-1-p.raghav@samsung.com
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When sysctl_nr_open is set to a very high value (for example, 1073741816
as set by systemd), processes attempting to use file descriptors near
the limit can trigger massive memory allocation attempts that exceed
INT_MAX, resulting in a WARNING in mm/slub.c:
WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
This happens because kvmalloc_array() and kvmalloc() check if the
requested size exceeds INT_MAX and emit a warning when the allocation is
not flagged with __GFP_NOWARN.
Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
- File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
- Multiple bitmaps: ~400MB
- Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
Reproducer:
1. Set /proc/sys/fs/nr_open to 1073741816:
# echo 1073741816 > /proc/sys/fs/nr_open
2. Run a program that uses a high file descriptor:
#include <unistd.h>
#include <sys/resource.h>
int main() {
struct rlimit rlim = {1073741824, 1073741824};
setrlimit(RLIMIT_NOFILE, &rlim);
dup2(2, 1073741880); // Triggers the warning
return 0;
}
3. Observe WARNING in dmesg at mm/slub.c:5027
systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
maximum possible value. The rationale was that systems with memory
control groups (memcg) no longer need separate file descriptor limits
since memory is properly accounted. However, this change overlooked
that:
1. The kernel's allocation functions still enforce INT_MAX as a maximum
size regardless of memcg accounting
2. Programs and tests that legitimately test file descriptor limits can
inadvertently trigger massive allocations
3. The resulting allocations (>8GB) are impractical and will always fail
systemd's algorithm starts with INT_MAX and keeps halving the value
until the kernel accepts it. On most systems, this results in nr_open
being set to 1073741816 (0x3ffffff8), which is just under 1GB of file
descriptors.
While processes rarely use file descriptors near this limit in normal
operation, certain selftests (like
tools/testing/selftests/core/unshare_test.c) and programs that test file
descriptor limits can trigger this issue.
Fix this by adding a check in alloc_fdtable() to ensure the requested
allocation size does not exceed INT_MAX. This causes the operation to
fail with -EMFILE instead of triggering a kernel warning and avoids the
impractical >8GB memory allocation request.
Fixes: 9cfe015aa424 ("get rid of NR_OPEN and introduce a sysctl_nr_open")
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Link: https://lore.kernel.org/20250629074021.1038845-1-sashal@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
And use bt_file for both bdev and shmem backed buftargs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The extra bdev_ is weird, so drop it. Also improve the comment to make
it clear these are the hardware limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
This function and the helpers used by it duplicate the same logic for AGs
and RTGs. Use the xfs_group_type enum to unify both variants.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|