Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 26343ca0df715097065b02a6cddb4a029d5b9327 ]
data_err=abort aborts the journal on I/O errors. However, this option is
meaningless if journal is disabled, so it is rejected in nojournal mode
to reduce unnecessary checks. Also, this option is ignored upon remount.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250122110533.4116662-4-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d7b0befd09320e3356a75cb96541c030515e7f5f ]
A user complained that a message such as:
EXT4-fs (nvme0n1p3): re-mounted UUID ro. Quota mode: none.
implied that the file system was previously mounted read/write and was
now remounted read-only, when it could have been some other mount
state that had changed by the "mount -o remount" operation. Fix this
by only logging "ro"or "r/w" when it has changed.
https://bugzilla.kernel.org/show_bug.cgi?id=219132
Signed-off-by: Nicolas Bretz <bretznic@gmail.com>
Link: https://patch.msgid.link/20250319171011.8372-1-bretznic@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 530fea29ef82e169cd7fe048c2b7baaeb85a0028 ]
Protect ext4_release_dquot against freezing so that we
don't try to start a transaction when FS is frozen, leading
to warnings.
Further, avoid taking the freeze protection if a transaction
is already running so that we don't need end up in a deadlock
as described in
46e294efc355 ext4: fix deadlock with fs freezing and EA inodes
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241121123855.645335-3-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit f87d3af7419307ae26e705a2b2db36140db367a2 upstream.
This fixes an analogus bug that was fixed in xfs in commit
4b8d867ca6e2 ("xfs: don't over-report free space or inodes in
statvfs") where statfs can report misleading / incorrect information
where project quota is enabled, and the free space is less than the
remaining quota.
This commit will resolve a test failure in generic/762 which tests for
this bug.
Cc: stable@kernel.org
Fixes: 689c958cbe6b ("ext4: add project quota support")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ce2f26e73783b4a7c46a86e3af5b5c8de0971790 ]
Presently we always BUG_ON if trying to start a transaction on a journal marked
with JBD2_UNMOUNT, since this should never happen. However, while ltp running
stress tests, it was observed that in case of some error handling paths, it is
possible for update_super_work to start a transaction after the journal is
destroyed eg:
(umount)
ext4_kill_sb
kill_block_super
generic_shutdown_super
sync_filesystem /* commits all txns */
evict_inodes
/* might start a new txn */
ext4_put_super
flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
jbd2_journal_destroy
journal_kill_thread
journal->j_flags |= JBD2_UNMOUNT;
jbd2_journal_commit_transaction
jbd2_journal_get_descriptor_buffer
jbd2_journal_bmap
ext4_journal_bmap
ext4_map_blocks
...
ext4_inode_error
ext4_handle_error
schedule_work(&sbi->s_sb_upd_work)
/* work queue kicks in */
update_super_work
jbd2_journal_start
start_this_handle
BUG_ON(journal->j_flags &
JBD2_UNMOUNT)
Hence, introduce a new mount flag to indicate journal is destroying and only do
a journaled (and deferred) update of sb if this flag is not set. Otherwise, just
fallback to an un-journaled commit.
Further, in the journal destroy path, we have the following sequence:
1. Set mount flag indicating journal is destroying
2. force a commit and wait for it
3. flush pending sb updates
This sequence is important as it ensures that, after this point, there is no sb
update that might be journaled so it is safe to update the sb outside the
journal. (To avoid race discussed in 2d01ddc86606)
Also, we don't need a similar check in ext4_grp_locked_error since it is only
called from mballoc and AFAICT it would be always valid to schedule work here.
Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/9613c465d6ff00cd315602f99283d5f24018c3f7.1742279837.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5a02a6204ca37e7c22fbb55a789c503f05e8e89a ]
Define an ext4 wrapper over jbd2_journal_destroy to make sure we
have consistent behavior during journal destruction. This will also
come useful in the next patch where we add some ext4 specific logic
in the destroy path.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/c3ba78c5c419757e6d5f2d8ebb4a8ce9d21da86a.1742279837.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: ce2f26e73783 ("ext4: avoid journaling sb update on error if journal is destroying")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6b76715d5e41fc332b0b879e66fad6ef3db07a3f ]
After commit d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem
errors") in v6.12-rc1, the 'errors=remount-ro' mode no longer sets
SB_RDONLY on errors, which results in us seeing the filesystem is still
in rw state after errors.
Therefore, after setting EXT4_FLAGS_EMERGENCY_RO, display the emergency_ro
option so that users can query whether the current file system has become
emergency read-only due to errors through commands such as 'mount' or
'cat /proc/fs/ext4/sdx/options'.
Fixes: d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250122114130.229709-7-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8f984530c242c569bafecfa35bce969a9b8fb0dd ]
And after commit 95257987a638 ("ext4: drop EXT4_MF_FS_ABORTED flag") in
v6.6-rc1, the EXT4_FLAGS_SHUTDOWN bit is set in ext4_handle_error() under
errors=remount-ro mode. This causes the read to fail even when the error
is triggered in errors=remount-ro mode.
To correct the behavior under errors=remount-ro, EXT4_FLAGS_SHUTDOWN is
replaced by the newly introduced EXT4_FLAGS_EMERGENCY_RO. This new flag
only prevents writes, matching the previous behavior with SB_RDONLY.
Fixes: 95257987a638 ("ext4: drop EXT4_MF_FS_ABORTED flag")
Closes: https://lore.kernel.org/all/22d652f6-cb3c-43f5-b2fe-0a4bb6516a04@huawei.com/
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250122114130.229709-6-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify pre-content notification support from Jan Kara:
"This introduces a new fsnotify event (FS_PRE_ACCESS) that gets
generated before a file contents is accessed.
The event is synchronous so if there is listener for this event, the
kernel waits for reply. On success the execution continues as usual,
on failure we propagate the error to userspace. This allows userspace
to fill in file content on demand from slow storage. The context in
which the events are generated has been picked so that we don't hold
any locks and thus there's no risk of a deadlock for the userspace
handler.
The new pre-content event is available only for users with global
CAP_SYS_ADMIN capability (similarly to other parts of fanotify
functionality) and it is an administrator responsibility to make sure
the userspace event handler doesn't do stupid stuff that can DoS the
system.
Based on your feedback from the last submission, fsnotify code has
been improved and now file->f_mode encodes whether pre-content event
needs to be generated for the file so the fast path when nobody wants
pre-content event for the file just grows the additional file->f_mode
check. As a bonus this also removes the checks whether the old
FS_ACCESS event needs to be generated from the fast path. Also the
place where the event is generated during page fault has been moved so
now filemap_fault() generates the event if and only if there is no
uptodate folio in the page cache.
Also we have dropped FS_PRE_MODIFY event as current real-world users
of the pre-content functionality don't really use it so let's start
with the minimal useful feature set"
* tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
fanotify: Fix crash in fanotify_init(2)
fs: don't block write during exec on pre-content watched files
fs: enable pre-content events on supported file systems
ext4: add pre-content fsnotify hook for DAX faults
btrfs: disable defrag on pre-content watched files
xfs: add pre-content fsnotify hook for DAX faults
fsnotify: generate pre-content permission event on page fault
mm: don't allow huge faults for files with pre content watches
fanotify: disable readahead if we have pre-content watches
fanotify: allow to set errno in FAN_DENY permission response
fanotify: report file range info with pre-content events
fanotify: introduce FAN_PRE_ACCESS permission event
fsnotify: generate pre-content permission event on truncate
fsnotify: pass optional file access range in pre-content event
fsnotify: introduce pre-content permission events
fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
fanotify: rename a misnamed constant
fanotify: don't skip extra event info if no info_mode is set
fsnotify: check if file is actually being watched for pre-content events on open
fsnotify: opt-in for permission events at file open time
...
|
|
Now that all the code has been added for pre-content events, and the
various file systems that need the page fault hooks for fsnotify have
been updated, add SB_I_ALLOW_HSM to the supported file systems.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/46960dcb2725fa0317895ed66a8409ba1c306a82.1731684329.git.josef@toxicpanda.com
|
|
Now that the crc32c() library function directly takes advantage of
architecture-specific optimizations, it is unnecessary to go through the
crypto API. Just use crc32c(). This is much simpler, and it improves
performance due to eliminating the crypto API overhead.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20241202010844.144356-17-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most
notably in the journaling code, bufered I/O, and compiler warning
cleanups"
* tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits)
jbd2: Fix comment describing journal_init_common()
ext4: prevent an infinite loop in the lazyinit thread
ext4: use struct_size() to improve ext4_htree_store_dirent()
ext4: annotate struct fname with __counted_by()
jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings
ext4: use str_yes_no() helper function
ext4: prevent delalloc to nodelalloc on remount
jbd2: make b_frozen_data allocation always succeed
ext4: cleanup variable name in ext4_fc_del()
ext4: use string choices helpers
jbd2: remove the 'success' parameter from the jbd2_do_replay() function
jbd2: remove useless 'block_error' variable
jbd2: factor out jbd2_do_replay()
jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass()
jbd2: unified release of buffer_head in do_one_pass()
jbd2: remove redundant judgments for check v1 checksum
ext4: use ERR_CAST to return an error-valued pointer
mm: zero range of eof folio exposed by inode size extension
ext4: partial zero eof block on unaligned inode size extension
ext4: disambiguate the return value of ext4_dio_write_end_io()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs untorn write support from Christian Brauner:
"An atomic write is a write issed with torn-write protection. This
means for a power failure or any hardware failure all or none of the
data from the write will be stored, never a mix of old and new data.
This work is already supported for block devices. If a block device is
opened with O_DIRECT and the block device supports atomic write, then
FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block
device.
This contains the work to expand atomic write support to filesystems,
specifically ext4 and XFS. Currently, only support for writing exactly
one filesystem block atomically is added.
Since it's now possible to have filesystem block size > page size for
XFS, it's possible to write 4K+ blocks atomically on x86"
* tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: drop an obsolete comment in iomap_dio_bio_iter
ext4: Do not fallback to buffered-io for DIO atomic write
ext4: Support setting FMODE_CAN_ATOMIC_WRITE
ext4: Check for atomic writes support in write iter
ext4: Add statx support for atomic writes
xfs: Support setting FMODE_CAN_ATOMIC_WRITE
xfs: Validate atomic writes
xfs: Support atomic write for statx
fs: iomap: Atomic write support
fs: Export generic_atomic_write_valid()
block: Add bdev atomic write limits helpers
fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
block/fs: Pass an iocb to generic_atomic_write_valid()
|
|
Use ktime_get_ns instead of ktime_get_real_ns when computing the lr_timeout
not to be affected by system time jumps.
Use a boolean instead of the MAX_JIFFY_OFFSET value to determine whether
the next_wakeup value has been set. Comparing elr->lr_next_sched to
MAX_JIFFY_OFFSET can cause the lazyinit thread to loop indefinitely.
Co-developed-by: Lukas Skupinski <lukas.skupinski@landisgyr.com>
Signed-off-by: Lukas Skupinski <lukas.skupinski@landisgyr.com>
Signed-off-by: Mathieu Othacehe <othacehe@gnu.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241106134741.26948-2-othacehe@gnu.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Implemented the suggested solution mentioned in the bug
https://bugzilla.kernel.org/show_bug.cgi?id=218820
Preventing the disabling of delayed allocation mode on remount.
delalloc to nodelalloc not permitted anymore
nodelalloc to delalloc permitted, not affected
Signed-off-by: Nicolas Bretz <bretznic@gmail.com>
Link: https://patch.msgid.link/20241014034143.59779-1-bretznic@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When I enabled ext4 debug for fault injection testing, I encountered the
following warning:
EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress:
Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051
WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0
The root cause of the issue lies in the improper implementation of ext4's
buffer_head read fault injection. The actual completion of buffer_head
read and the buffer_head fault injection are not atomic, which can lead
to the uptodate flag being cleared on normally used buffer_heads in race
conditions.
[CPU0] [CPU1] [CPU2]
ext4_read_inode_bitmap
ext4_read_bh()
<bh read complete>
ext4_read_inode_bitmap
if (buffer_uptodate(bh))
return bh
jbd2_journal_commit_transaction
__jbd2_journal_refile_buffer
__jbd2_journal_unfile_buffer
__jbd2_journal_temp_unlink_buffer
ext4_simulate_fail_bh()
clear_buffer_uptodate
mark_buffer_dirty
<report warning>
WARN_ON_ONCE(!buffer_uptodate(bh))
The best approach would be to perform fault injection in the IO completion
callback function, rather than after IO completion. However, the IO
completion callback function cannot get the fault injection code in sb.
Fix it by passing the result of fault injection into the bh read function,
we simulate faults within the bh read function itself. This requires adding
an extra parameter to the bh read functions that need fault injection.
Fixes: 46f870d690fe ("ext4: simulate various I/O and checksum errors when reading metadata")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When ctx_set_flags() is unused, it prevents kernel builds
with clang, `make W=1` and CONFIG_WERROR=y:
.../ext4/super.c:2120:1: error: unused function 'ctx_set_flags' [-Werror,-Wunused-function]
2120 | EXT4_SET_CTX(flags); /* set only */
| ^~~~~~~~~~~~~~~~~~~
Fix this by marking ctx_*_flags() with __maybe_unused
(mark both for the sake of symmetry).
See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20240905163229.140522-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This is the logic behavior and one that we would like to verify
using a generic fstest similar to xfs/546.
Link: https://lore.kernel.org/fstests/20240830152648.GE6216@frogsfrogsfrogs/
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240904084657.1062243-1-amir73il@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After commit 21175ca434c5 ("ext4: make prefetch_block_bitmaps default"),
we enable 'prefetch_block_bitmaps' by default, but this is not shown in
the '/proc/fs/ext4/sdx/options' procfs interface.
This makes it impossible to distinguish whether the feature is enabled by
default or not, so 'prefetch_block_bitmaps' is shown in the 'options'
procfs interface when prefetch_block_bitmaps is enabled by default.
This makes it easy to notice changes to the default mount options between
versions through the '/proc/fs/ext4/sdx/options' procfs interface.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241008120134.3758097-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch adds base support for atomic writes via statx getattr.
On bs < ps systems, we can create FS with say bs of 16k. That means
both atomic write min and max unit can be set to 16k for supporting
atomic writes.
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
When we remount filesystem with 'abort' mount option while changing
other mount options as well (as is LTP test doing), we can return error
from the system call after commit d3476f3dad4a ("ext4: don't set
SB_RDONLY after filesystem errors") because the application of mount
option changes detects shutdown filesystem and refuses to do anything.
The behavior of application of other mount options in presence of
'abort' mount option is currently rather arbitary as some mount option
changes are handled before 'abort' and some after it.
Move aborting of the filesystem to the end of remount handling so all
requested changes are properly applied before the filesystem is shutdown
to have a reasonably consistent behavior.
Fixes: d3476f3dad4a ("ext4: don't set SB_RDONLY after filesystem errors")
Reported-by: Jan Stancek <jstancek@redhat.com>
Link: https://lore.kernel.org/all/Zvp6L+oFnfASaoHl@t14s
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Jan Stancek <jstancek@redhat.com>
Link: https://patch.msgid.link/20241004221556.19222-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
find_group_other() and find_group_orlov() read *_lo, *_hi with
ext4_free_inodes_count without additional locking. This can cause
data-race warning, but since the lock is held for most writes and free
inodes value is generally not a problem even if it is incorrect, it is
more appropriate to use READ_ONCE()/WRITE_ONCE() than to add locking.
==================================================================
BUG: KCSAN: data-race in ext4_free_inodes_count / ext4_free_inodes_set
write to 0xffff88810404300e of 2 bytes by task 6254 on cpu 1:
ext4_free_inodes_set+0x1f/0x80 fs/ext4/super.c:405
__ext4_new_inode+0x15ca/0x2200 fs/ext4/ialloc.c:1216
ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391
vfs_symlink+0xca/0x1d0 fs/namei.c:4615
do_symlinkat+0xe3/0x340 fs/namei.c:4641
__do_sys_symlinkat fs/namei.c:4657 [inline]
__se_sys_symlinkat fs/namei.c:4654 [inline]
__x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654
x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x76/0x7e
read to 0xffff88810404300e of 2 bytes by task 6257 on cpu 0:
ext4_free_inodes_count+0x1c/0x80 fs/ext4/super.c:349
find_group_other fs/ext4/ialloc.c:594 [inline]
__ext4_new_inode+0x6ec/0x2200 fs/ext4/ialloc.c:1017
ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391
vfs_symlink+0xca/0x1d0 fs/namei.c:4615
do_symlinkat+0xe3/0x340 fs/namei.c:4641
__do_sys_symlinkat fs/namei.c:4657 [inline]
__se_sys_symlinkat fs/namei.c:4654 [inline]
__x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654
x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Cc: stable@vger.kernel.org
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20241003125337.47283-1-aha310510@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Jeff Layton <jlayton@kernel.org> says:
The VFS has always used coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide when to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g backup
applications).
If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
What we need is a way to only use fine-grained timestamps when they are
being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
as a flag that indicates whether the current timestamps have been
queried via stat() or the like. When it's set, we allow the kernel to
use a fine-grained timestamp iff it's necessary to make the ctime show
a different value.
This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp. A file that is
altered just a bit later can then get a coarse-grained one that appears
older than the earlier fine-grained time. This violates timestamp
ordering guarantees.
To remedy this, keep a global monotonic atomic64_t value that acts as a
timestamp floor. When we go to stamp a file, we first get the latter of
the current floor value and the current coarse-grained time. If the
inode ctime hasn't been queried then we just attempt to stamp it with
that value.
If it has been queried, then first see whether the current coarse time
is later than the existing ctime. If it is, then we accept that value.
If it isn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time, convert it to realtime and try to swap that into the ctime.
We take the result of the ctime swap whether it succeeds or fails, since
either is just as valid.
Filesystems can opt into this by setting the FS_MGTIME fstype flag.
Others should be unaffected (other than being subject to the same floor
value as multigrain filesystems).
* patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org:
tmpfs: add support for multigrain timestamps
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
Documentation: add a new file documenting multigrain timestamps
fs: add percpu counters for significant multigrain timestamp events
fs: tracepoints around multigrain timestamp events
fs: handle delegated timestamps in setattr_copy_mgtime
fs: have setattr_copy handle multigrain timestamps appropriately
fs: add infrastructure for multigrain timestamps
Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.
For ext4, we only need to enable the FS_MGTIME flag.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20241002-mgtime-v10-10-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We disable stripe size in __ext4_fill_super if it is not a multiple of
the cluster ratio however this check is missed when trying to remount.
This can leave us with cases where stripe < cluster_ratio after
remount:set making EXT4_B2C(sbi->s_stripe) become 0 that can cause some
unforeseen bugs like divide by 0.
Fix that by adding the check in remount path as well.
Reported-by: syzbot+1ad8bac5af24d01e2cbd@syzkaller.appspotmail.com
Tested-by: syzbot+1ad8bac5af24d01e2cbd@syzkaller.appspotmail.com
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Fixes: c3defd99d58c ("ext4: treat stripe in block unit")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/3a493bb503c3598e25dcfbed2936bb2dff3fece7.1725002410.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Commit 985b67cd8639 ("ext4: filesystems without casefold feature cannot
be mounted with siphash") properly rejects volumes where
s_def_hash_version is set to DX_HASH_SIPHASH, but the check and the
error message should not look into casefold setup - a filesystem should
never have DX_HASH_SIPHASH as the default hash. Fix it and, since we
are there, move the check to ext4_hash_info_init.
Fixes:985b67cd8639 ("ext4: filesystems without casefold feature cannot
be mounted with siphash")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://patch.msgid.link/87jzg1en6j.fsf_-_@mailhost.krisman.be
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When the filesystem is mounted with errors=remount-ro, we were setting
SB_RDONLY flag to stop all filesystem modifications. We knew this misses
proper locking (sb->s_umount) and does not go through proper filesystem
remount procedure but it has been the way this worked since early ext2
days and it was good enough for catastrophic situation damage
mitigation. Recently, syzbot has found a way (see link) to trigger
warnings in filesystem freezing because the code got confused by
SB_RDONLY changing under its hands. Since these days we set
EXT4_FLAGS_SHUTDOWN on the superblock which is enough to stop all
filesystem modifications, modifying SB_RDONLY shouldn't be needed. So
stop doing that.
Link: https://lore.kernel.org/all/000000000000b90a8e061e21d12f@google.com
Reported-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://patch.msgid.link/20240805201241.27286-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The following kernel trace can be triggered with fstest generic/629 when
executed against a filesystem with fast-commit feature enabled:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 866 Comm: mount Not tainted 6.10.0+ #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x66/0x90
register_lock_class+0x759/0x7d0
__lock_acquire+0x85/0x2630
? __find_get_block+0xb4/0x380
lock_acquire+0xd1/0x2d0
? __ext4_journal_get_write_access+0xd5/0x160
_raw_spin_lock+0x33/0x40
? __ext4_journal_get_write_access+0xd5/0x160
__ext4_journal_get_write_access+0xd5/0x160
ext4_reserve_inode_write+0x61/0xb0
__ext4_mark_inode_dirty+0x79/0x270
? ext4_ext_replay_set_iblocks+0x2f8/0x450
ext4_ext_replay_set_iblocks+0x330/0x450
ext4_fc_replay+0x14c8/0x1540
? jread+0x88/0x2e0
? rcu_is_watching+0x11/0x40
do_one_pass+0x447/0xd00
jbd2_journal_recover+0x139/0x1b0
jbd2_journal_load+0x96/0x390
ext4_load_and_init_journal+0x253/0xd40
ext4_fill_super+0x2cc6/0x3180
...
In the replay path there's an attempt to lock sbi->s_bdev_wb_lock in
function ext4_check_bdev_write_error(). Unfortunately, at this point this
spinlock has not been initialized yet. Moving it's initialization to an
earlier point in __ext4_fill_super() fixes this splat.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Link: https://patch.msgid.link/20240718094356.7863-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Syzbot has found an ODEBUG bug in ext4_fill_super
The del_timer_sync function cancels the s_err_report timer,
which reminds about filesystem errors daily. We should
guarantee the timer is no longer active before kfree(sbi).
When filesystem mounting fails, the flow goes to failed_mount3,
where an error occurs when ext4_stop_mmpd is called, causing
a read I/O failure. This triggers the ext4_handle_error function
that ultimately re-arms the timer,
leaving the s_err_report timer active before kfree(sbi) is called.
Fix the issue by canceling the s_err_report timer after calling ext4_stop_mmpd.
Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com>
Reported-and-tested-by: syzbot+59e0101c430934bc9a36@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=59e0101c430934bc9a36
Link: https://patch.msgid.link/20240715043336.98097-1-shenxiaxi26@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Single characters (line breaks) should be put into a sequence.
Thus use the corresponding function “seq_putc”.
This issue was transformed by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Link: https://patch.msgid.link/076974ab-4da3-4176-89dc-0514e020c276@web.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When mounting the ext4 filesystem, if the default hash version is set to
DX_HASH_SIPHASH but the casefold feature is not set, exit the mounting.
Reported-by: syzbot+340581ba9dceb7e06fb3@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Link: https://patch.msgid.link/20240605012335.44086-1-lizhi.xu@windriver.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Many cleanups and bug fixes in ext4, especially for the fast commit
feature.
Also some performance improvements; in particular, improving IOPS and
throughput on fast devices running Async Direct I/O by up to 20% by
optimizing jbd2_transaction_committed()"
* tag 'ext4_for_linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: make sure the first directory block is not a hole
ext4: check dot and dotdot of dx_root before making dir indexed
ext4: sanity check for NULL pointer after ext4_force_shutdown
jbd2: increase maximum transaction size
jbd2: drop pointless shrinker batch initialization
jbd2: avoid infinite transaction commit loop
jbd2: precompute number of transaction descriptor blocks
jbd2: make jbd2_journal_get_max_txn_bufs() internal
jbd2: avoid mount failed when commit block is partial submitted
ext4: avoid writing unitialized memory to disk in EA inodes
ext4: don't track ranges in fast_commit if inode has inlined data
ext4: fix possible tid_t sequence overflows
ext4: use ext4_update_inode_fsync_trans() helper in inode creation
ext4: add missing MODULE_DESCRIPTION()
jbd2: add missing MODULE_DESCRIPTION()
ext4: use memtostr_pad() for s_volume_name
jbd2: speed up jbd2_transaction_committed()
ext4: make ext4_da_map_blocks() buffer_head unaware
ext4: make ext4_insert_delayed_block() insert multi-blocks
ext4: factor out a helper to check the cluster allocation state
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount API updates from Christian Brauner:
- Add a generic helper to parse uid and gid mount options.
Currently we open-code the same logic in various filesystems which is
error prone, especially since the verification of uid and gid mount
options is a sensitive operation in the face of idmappings.
Add a generic helper and convert all filesystems over to it. Make
sure that filesystems that are mountable in unprivileged containers
verify that the specified uid and gid can be represented in the
owning namespace of the filesystem.
- Convert hostfs to the new mount api.
* tag 'vfs-6.11.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fuse: Convert to new uid/gid option parsing helpers
fuse: verify {g,u}id mount options correctly
fat: Convert to new uid/gid option parsing helpers
fat: Convert to new mount api
fat: move debug into fat_mount_options
vboxsf: Convert to new uid/gid option parsing helpers
tracefs: Convert to new uid/gid option parsing helpers
smb: client: Convert to new uid/gid option parsing helpers
tmpfs: Convert to new uid/gid option parsing helpers
ntfs3: Convert to new uid/gid option parsing helpers
isofs: Convert to new uid/gid option parsing helpers
hugetlbfs: Convert to new uid/gid option parsing helpers
ext4: Convert to new uid/gid option parsing helpers
exfat: Convert to new uid/gid option parsing helpers
efivarfs: Convert to new uid/gid option parsing helpers
debugfs: Convert to new uid/gid option parsing helpers
autofs: Convert to new uid/gid option parsing helpers
fs_parse: add uid & gid option option parsing helpers
hostfs: Add const qualifier to host_root in hostfs_fill_super()
hostfs: convert hostfs to use the new mount API
|
|
Convert to new uid/gid option parsing helpers
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Link: https://lore.kernel.org/r/a84be40d-5110-4dac-83b1-0ea8e043f0fd@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The per-inode i_reserved_data_blocks count the reserved delalloc blocks
in a regular file, it should be zero when destroying the file. The
per-fs s_dirtyclusters_counter count all reserved delalloc blocks in a
filesystem, it also should be zero when umounting the filesystem. Now we
have only an error message if the i_reserved_data_blocks is not zero,
which is unable to be simply captured, so add WARN_ON_ONCE to make it
more visable.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240517124005.347221-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Instead of a bunch of ifdefs, make the unicode built checks part of the
code flow where possible, as requested by Torvalds.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
[eugen.hristev@collabora.com: port to 6.10-rc1]
Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com>
Link: https://lore.kernel.org/r/20240606073353.47130-7-eugen.hristev@collabora.com
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull bdev bd_inode updates from Al Viro:
"Replacement of bdev->bd_inode with sane(r) set of primitives by me and
Yu Kuai"
* tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RIP ->bd_inode
dasd_format(): killing the last remaining user of ->bd_inode
nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode
block/bdev.c: use the knowledge of inode/bdev coallocation
gfs2: more obvious initializations of mapping->host
fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping
blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...
grow_dev_folio(): we only want ->bd_inode->i_mapping there
use ->bd_mapping instead of ->bd_inode->i_mapping
block_device: add a pointer to struct address_space (page cache of bdev)
missing helpers: bdev_unhash(), bdev_drop()
block: move two helpers into bdev.c
block2mtd: prevent direct access of bd_inode
dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode)
blkdev_write_iter(): saner way to get inode and bdev
bcachefs: remove dead function bdev_sectors()
ext4: remove block_device_ejected()
erofs_buf: store address_space instead of inode
erofs: switch erofs_bread() to passing offset instead of block number
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs blocksize updates from Al Viro:
"This gets rid of bogus set_blocksize() uses, switches it over
to be based on a 'struct file *' and verifies that the caller
has the device opened exclusively"
* tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make set_blocksize() fail unless block device is opened exclusive
set_blocksize(): switch to passing struct file *
btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens
swsusp: don't bother with setting block size
zram: don't bother with reopening - just use O_EXCL for open
swapon(2): open swap with O_EXCL
swapon(2)/swapoff(2): don't bother with block size
pktcdvd: sort set_blocksize() calls out
bcache_register(): don't bother with set_blocksize()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
- more folio conversion patches
- add support for FS_IOC_GETFSSYSFSPATH
- mballoc cleaups and add more kunit tests
- sysfs cleanups and bug fixes
- miscellaneous bug fixes and cleanups
* tag 'ext4_for_linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: fix error pointer dereference in ext4_mb_load_buddy_gfp()
jbd2: add prefix 'jbd2' for 'shrink_type'
jbd2: use shrink_type type instead of bool type for __jbd2_journal_clean_checkpoint_list()
ext4: fix uninitialized ratelimit_state->lock access in __ext4_fill_super()
ext4: remove calls to to set/clear the folio error flag
ext4: propagate errors from ext4_sb_bread() in ext4_xattr_block_cache_find()
ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find()
jbd2: remove redundant assignement to variable err
ext4: remove the redundant folio_wait_stable()
ext4: fix potential unnitialized variable
ext4: convert ac_buddy_page to ac_buddy_folio
ext4: convert ac_bitmap_page to ac_bitmap_folio
ext4: convert ext4_mb_init_cache() to take a folio
ext4: convert bd_buddy_page to bd_buddy_folio
ext4: convert bd_bitmap_page to bd_bitmap_folio
ext4: open coding repeated check in next_linear_group
ext4: use correct criteria name instead stale integer number in comment
ext4: call ext4_mb_mark_free_simple to free continuous bits in found chunk
ext4: add test_mb_mark_used_cost to estimate cost of mb_mark_used
ext4: keep "prefetch_grp" and "nr" consistent
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
means we now have six free FMODE_* bits in total (but bit #6
already got used for FMODE_WRITE_RESTRICTED)
- Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)
- Add fd_raw cleanup class so we can make use of automatic cleanup
provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well
- Optimize seq_puts()
- Simplify __seq_puts()
- Add new anon_inode_getfile_fmode() api to allow specifying f_mode
instead of open-coding it in multiple places
- Annotate struct file_handle with __counted_by() and use
struct_size()
- Warn in get_file() whether f_count resurrection from zero is
attempted (epoll/drm discussion)
- Folio-sophize aio
- Export the subvolume id in statx() for both btrfs and bcachefs
- Relax linkat(AT_EMPTY_PATH) requirements
- Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
for dup*() equality replacing kcmp()
Cleanups:
- Compile out swapfile inode checks when swap isn't enabled
- Use (1 << n) notation for FMODE_* bitshifts for clarity
- Remove redundant variable assignment in fs/direct-io
- Cleanup uses of strncpy in orangefs
- Speed up and cleanup writeback
- Move fsparam_string_empty() helper into header since it's currently
open-coded in multiple places
- Add kernel-doc comments to proc_create_net_data_write()
- Don't needlessly read dentry->d_flags twice
Fixes:
- Fix out-of-range warning in nilfs2
- Fix ecryptfs overflow due to wrong encryption packet size
calculation
- Fix overly long line in xfs file_operations (follow-up to FMODE_*
cleanup)
- Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
to FMODE_* cleanup)
- Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
cleanup)
- Fix stable offset api to prevent endless loops
- Fix afs file server rotations
- Prevent xattr node from overflowing the eraseblock in jffs2
- Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
operation instead of .open() operation since this caused userspace
regressions"
* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
afs: Fix fileserver rotation getting stuck
selftests: add F_DUPDFD_QUERY selftests
fcntl: add F_DUPFD_QUERY fcntl()
file: add fd_raw cleanup class
fs: WARN when f_count resurrection is attempted
seq_file: Simplify __seq_puts()
seq_file: Optimize seq_puts()
proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
fs: Create anon_inode_getfile_fmode()
xfs: don't call xfs_file_open from xfs_dir_open
xfs: drop fop_flags for directories
xfs: fix overly long line in the file_operations
shmem: Fix shmem_rename2()
libfs: Add simple_offset_rename() API
libfs: Fix simple_offset_rename_exchange()
jffs2: prevent xattr node from overflowing the eraseblock
vfs, swap: compile out IS_SWAPFILE() on swapless configs
vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
fs/direct-io: remove redundant assignment to variable retval
fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
...
|
|
In the following concurrency we will access the uninitialized rs->lock:
ext4_fill_super
ext4_register_sysfs
// sysfs registered msg_ratelimit_interval_ms
// Other processes modify rs->interval to
// non-zero via msg_ratelimit_interval_ms
ext4_orphan_cleanup
ext4_msg(sb, KERN_INFO, "Errors on filesystem, "
__ext4_msg
___ratelimit(&(EXT4_SB(sb)->s_msg_ratelimit_state)
if (!rs->interval) // do nothing if interval is 0
return 1;
raw_spin_trylock_irqsave(&rs->lock, flags)
raw_spin_trylock(lock)
_raw_spin_trylock
__raw_spin_trylock
spin_acquire(&lock->dep_map, 0, 1, _RET_IP_)
lock_acquire
__lock_acquire
register_lock_class
assign_lock_key
dump_stack();
ratelimit_state_init(&sbi->s_msg_ratelimit_state, 5 * HZ, 10);
raw_spin_lock_init(&rs->lock);
// init rs->lock here
and get the following dump_stack:
=========================================================
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 12 PID: 753 Comm: mount Tainted: G E 6.7.0-rc6-next-20231222 #504
[...]
Call Trace:
dump_stack_lvl+0xc5/0x170
dump_stack+0x18/0x30
register_lock_class+0x740/0x7c0
__lock_acquire+0x69/0x13a0
lock_acquire+0x120/0x450
_raw_spin_trylock+0x98/0xd0
___ratelimit+0xf6/0x220
__ext4_msg+0x7f/0x160 [ext4]
ext4_orphan_cleanup+0x665/0x740 [ext4]
__ext4_fill_super+0x21ea/0x2b10 [ext4]
ext4_fill_super+0x14d/0x360 [ext4]
[...]
=========================================================
Normally interval is 0 until s_msg_ratelimit_state is initialized, so
___ratelimit() does nothing. But registering sysfs precedes initializing
rs->lock, so it is possible to change rs->interval to a non-zero value
via the msg_ratelimit_interval_ms interface of sysfs while rs->lock is
uninitialized, and then a call to ext4_msg triggers the problem by
accessing an uninitialized rs->lock. Therefore register sysfs after all
initializations are complete to avoid such problems.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240102133730.1098120-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Just the low-hanging fruit...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
block_device_ejected() is added by commit bdfe0cbd746a ("Revert
"ext4: remove block_device_ejected"") in 2015. At that time 'bdi->wb'
is destroyed synchronized from del_gendisk(), hence if ext4 is still
mounted, and then mark_buffer_dirty() will reference destroyed 'wb'.
However, such problem doesn't exist anymore:
- commit d03f6cdc1fc4 ("block: Dynamically allocate and refcount
backing_dev_info") switch bdi to use refcounting;
- commit 13eec2363ef0 ("fs: Get proper reference for s_bdi"), will grab
additional reference of bdi while mounting, so that 'bdi->wb' will not
be destroyed until generic_shutdown_super().
Hence remove this dead function block_device_ejected().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20240411145346.2516848-7-viro@zeniv.linux.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
in file.c:
s_last_mounted is marked as __nonstring meaning it does not need to be
NUL-terminated. Let's instead use strtomem_pad() to copy bytes from the
string source to the byte array destination -- while also ensuring to
pad with zeroes.
in ioctl.c:
We can drop the memset and size argument in favor of using the new
2-argument version of strscpy_pad() -- which was introduced with Commit
e6584c3964f2f ("string: Allow 2-argument strscpy()"). This guarantees
NUL-termination and NUL-padding on the destination buffer -- which seems
to be a requirement judging from this comment:
| static int ext4_ioctl_getlabel(struct ext4_sb_info *sbi, char __user *user_label)
| {
| char label[EXT4_LABEL_MAX + 1];
|
| /*
| * EXT4_LABEL_MAX must always be smaller than FSLABEL_MAX because
| * FSLABEL_MAX must include terminating null byte, while s_volume_name
| * does not have to.
| */
in super.c:
s_first_error_func is marked as __nonstring meaning we can take the same
approach as in file.c; just use strtomem_pad()
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240321-strncpy-fs-ext4-file-c-v1-1-36a6a09fef0c@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The new sysfs path ioctl lets us get the /sys/fs/ path for a given
filesystem in a fs agnostic way, potentially nudging us towards
standarizing some of our reporting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Link: https://lore.kernel.org/r/20240315035308.3563511-4-kent.overstreet@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
kfree already checks if its argument is NULL. This fixes two
Coccinelle/coccicheck warnings reported by ifnullfree.cocci.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20240317153638.2136-2-thorsten.blum@toblux.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Currently a device is only really released once the umount returns to
userspace due to how file closing works. That ultimately could cause
an old umount assumption to be violated that concurrent umount and mount
don't fail. So an exclusively held device with a temporary holder should
be yielded before the filesystem is gone. Add a helper that allows
callers to do that. This also allows us to remove the two holder ops
that Linus wasn't excited about.
Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org
Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Since both ext4 and overlayfs define the same macro to specify string
parameters that may allow empty values, define it in an header file so
that this helper can be shared.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Link: https://lore.kernel.org/r/20240312104757.27333-1-luis.henriques@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Ext4 bug fixes and cleanups, plus some additional kunit tests"
* tag 'ext4_for_linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: initialize sbi->s_freeclusters_counter and sbi->s_dirtyclusters_counter before use in kunit test
ext4: hold group lock in ext4 kunit test
ext4: alloc test super block from sget
ext4: kunit: use dynamic inode allocation
ext4: enable meta_bg only when new desc blocks are needed
ext4: remove unused parameter biop in ext4_issue_discard()
ext4: remove SLAB_MEM_SPREAD flag usage
ext4: verify s_clusters_per_group even without bigalloc
ext4: fix corruption during on-line resize
ext4: don't report EOPNOTSUPP errors from discard
ext4: drop duplicate ea_inode handling in ext4_xattr_block_set()
ext4: fold quota accounting into ext4_xattr_inode_lookup_create()
ext4: correct best extent lstart adjustment logic
ext4: forbid commit inconsistent quota data when errors=remount-ro
ext4: add a hint for block bitmap corrupt state in mb_groups
ext4: fix the comment of ext4_map_blocks()/ext4_ext_map_blocks()
ext4: improve error msg for ext4_mb_seq_groups_show
ext4: remove unused buddy_loaded in ext4_mb_seq_groups_show
ext4: Add unit test for ext4_mb_mark_diskspace_used
ext4: Add unit test for mb_free_blocks
...
|