Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"The most important one reverts an improper fix which can cause an
unexpected warning more often on specific images, and another one
fixes LZMA decompression on 32-bit platforms. The others are minor
fixes and cleanups.
- Fix LZMA decompression failure on HIGHMEM platforms
- Revert an inproper fix since it is actually an implementation issue
of vmalloc()
- Avoid a wrong DBG_BUGON since it could be triggered with -EINTR
- Minor cleanups"
* tag 'erofs-for-6.3-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: use wrapper i_blocksize() in erofs_file_read_iter()
erofs: get rid of a useless DBG_BUGON
erofs: Revert "erofs: fix kvcalloc() misuse with __GFP_NOFAIL"
erofs: fix wrong kunmap when using LZMA on HIGHMEM platforms
erofs: mark z_erofs_lzma_init/erofs_pcpubuf_init w/ __init
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Protect NFSD writes against filesystem freezing
- Fix a potential memory leak during server shutdown
* tag 'nfsd-6.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Fix a server shutdown leak
NFSD: Protect against filesystem freezing
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"First batch of fixes. Among them there are two updates to sysfs and
ioctl which are not strictly fixes but are used for testing so there's
no reason to delay them.
- fix block group item corruption after inserting new block group
- fix extent map logging bit not cleared for split maps after
dropping range
- fix calculation of unusable block group space reporting bogus
values due to 32/64b division
- fix unnecessary increment of read error stat on write error
- improve error handling in inode update
- export per-device fsid in DEV_INFO ioctl to distinguish seeding
devices, needed for testing
- allocator size classes:
- fix potential dead lock in size class loading logic
- print sysfs stats for the allocation classes"
* tag 'for-6.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix block group item corruption after inserting new block group
btrfs: fix extent map logging bit not cleared for split maps after dropping range
btrfs: fix percent calculation for bg reclaim message
btrfs: fix unnecessary increment of read error stat on write error
btrfs: handle btrfs_del_item errors in __btrfs_update_delayed_inode
btrfs: ioctl: return device fsid from DEV_INFO ioctl
btrfs: fix potential dead lock in size class loading logic
btrfs: sysfs: add size class stats
|
|
linux/fs.h has a wrapper for this operation.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230306075527.1338-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
`err` could be -EINTR and it should not be the case. Actually such
DBG_BUGON is useless.
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230309053148.9223-2-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Let's revert commit 12724ba38992 ("erofs: fix kvcalloc() misuse with
__GFP_NOFAIL") since kvmalloc() already supports __GFP_NOFAIL in commit
a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc"). So
the original fix was wrong.
Actually there was some issue as [1] discussed, so before that mm fix
is landed, the warn could still happen but applying this commit first
will cause less.
[1] https://lore.kernel.org/r/20230305053035.1911-1-hsiangkao@linux.alibaba.com
Fixes: 12724ba38992 ("erofs: fix kvcalloc() misuse with __GFP_NOFAIL")
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230309053148.9223-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
As the call trace shown, the root cause is kunmap incorrect pages:
BUG: kernel NULL pointer dereference, address: 00000000
CPU: 1 PID: 40 Comm: kworker/u5:0 Not tainted 6.2.0-rc5 #4
Workqueue: erofs_worker z_erofs_decompressqueue_work
EIP: z_erofs_lzma_decompress+0x34b/0x8ac
z_erofs_decompress+0x12/0x14
z_erofs_decompress_queue+0x7e7/0xb1c
z_erofs_decompressqueue_work+0x32/0x60
process_one_work+0x24b/0x4d8
? process_one_work+0x1a4/0x4d8
worker_thread+0x14c/0x3fc
kthread+0xe6/0x10c
? rescuer_thread+0x358/0x358
? kthread_complete_and_exit+0x18/0x18
ret_from_fork+0x1c/0x28
---[ end trace 0000000000000000 ]---
The bug is trivial and should be fixed now. It has no impact on
!HIGHMEM platforms.
Fixes: 622ceaddb764 ("erofs: lzma compression support")
Cc: <stable@vger.kernel.org> # 5.16+
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230305134455.88236-1-hsiangkao@linux.alibaba.com
|
|
They are used during the erofs module init phase. Let's mark it as
__init like any other function.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230303063731.66760-1-frank.li@vivo.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
We can often end up inserting a block group item, for a new block group,
with a wrong value for the used bytes field.
This happens if for the new allocated block group, in the same transaction
that created the block group, we have tasks allocating extents from it as
well as tasks removing extents from it.
For example:
1) Task A creates a metadata block group X;
2) Two extents are allocated from block group X, so its "used" field is
updated to 32K, and its "commit_used" field remains as 0;
3) Transaction commit starts, by some task B, and it enters
btrfs_start_dirty_block_groups(). There it tries to update the block
group item for block group X, which currently has its "used" field with
a value of 32K. But that fails since the block group item was not yet
inserted, and so on failure update_block_group_item() sets the
"commit_used" field of the block group back to 0;
4) The block group item is inserted by task A, when for example
btrfs_create_pending_block_groups() is called when releasing its
transaction handle. This results in insert_block_group_item() inserting
the block group item in the extent tree (or block group tree), with a
"used" field having a value of 32K, but without updating the
"commit_used" field in the block group, which remains with value of 0;
5) The two extents are freed from block X, so its "used" field changes
from 32K to 0;
6) The transaction commit by task B continues, it enters
btrfs_write_dirty_block_groups() which calls update_block_group_item()
for block group X, and there it decides to skip the block group item
update, because "used" has a value of 0 and "commit_used" has a value
of 0 too.
As a result, we end up with a block item having a 32K "used" field but
no extents allocated from it.
When this issue happens, a btrfs check reports an error like this:
[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)
Fix this by making insert_block_group_item() update the block group's
"commit_used" field.
Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Flole observes this WARNING on occasion:
[1210423.486503] WARNING: CPU: 8 PID: 1524732 at fs/ext4/ext4_jbd2.c:75 ext4_journal_check_start+0x68/0xb0
Reported-by: <flole@flole.de>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217123
Fixes: 73da852e3831 ("nfsd: use vfs_iter_read/write")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
range
At btrfs_drop_extent_map_range() we are clearing the EXTENT_FLAG_LOGGING
bit on a 'flags' variable that was not initialized. This makes static
checkers complain about it, so initialize the 'flags' variable before
clearing the bit.
In practice this has no consequences, because EXTENT_FLAG_LOGGING should
not be set when btrfs_drop_extent_map_range() is called, as an fsync locks
the inode in exclusive mode, locks the inode's mmap semaphore in exclusive
mode too and it always flushes all delalloc.
Also add a comment about why we clear EXTENT_FLAG_LOGGING on a copy of the
flags of the split extent map.
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/Y%2FyipSVozUDEZKow@kili/
Fixes: db21370bffbc ("btrfs: drop extent map range more efficiently")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have a report, that the info message for block-group reclaim is
crossing the 100% used mark.
This is happening as we were truncating the divisor for the division
(the block_group->length) to a 32bit value.
Fix this by using div64_u64() to not truncate the divisor.
In the worst case, it can lead to a div by zero error and should be
possible to trigger on 4 disks RAID0, and each device is large enough:
$ mkfs.btrfs -f /dev/test/scratch[1234] -m raid1 -d raid0
btrfs-progs v6.1
[...]
Filesystem size: 40.00GiB
Block group profiles:
Data: RAID0 4.00GiB <<<
Metadata: RAID1 256.00MiB
System: RAID1 8.00MiB
Reported-by: Forza <forza@tnonline.net>
Link: https://lore.kernel.org/linux-btrfs/e99483.c11a58d.1863591ca52@tnonline.net/
Fixes: 5f93e776c673 ("btrfs: zoned: print unusable percentage when reclaiming block groups")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add Qu's note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Current btrfs_log_dev_io_error() increases the read error count even if the
erroneous IO is a WRITE request. This is because it forget to use "else
if", and all the error WRITE requests counts as READ error as there is (of
course) no REQ_RAHEAD bit set.
Fixes: c3a62baf21ad ("btrfs: use chained bios when cloning")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Even if the slot is already read out, we may still need to re-balance
the tree, thus it can cause error in that btrfs_del_item() call and we
need to handle it properly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: void0red <void0red@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently user space utilizes dev info ioctl to grab the info of a
certain devid, this includes its device uuid. But the returned info is
not enough to determine if a device is a seed.
Commit a26d60dedf9a ("btrfs: sysfs: add devinfo/fsid to retrieve actual
fsid from the device") exports the same value in sysfs so this is for
parity with ioctl. Add a new member, fsid, into
btrfs_ioctl_dev_info_args, and populate the member with fsid value.
This should not cause any compatibility problem, following the
combinations:
- Old user space, old kernel
- Old user space, new kernel
User space tool won't even check the new member.
- New user space, old kernel
The kernel won't touch the new member, and user space tool should
zero out its argument, thus the new member is all zero.
User space tool can then know the kernel doesn't support this fsid
reporting, and falls back to whatever they can.
- New user space, new kernel
Go as planned.
Would find the fsid member is no longer zero, and trust its value.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As reported by Filipe, there's a potential deadlock caused by
using btrfs_search_forward on commit_root. The locking there is
unconditional, even if ->skip_locking and ->search_commit_root is set.
It's not meant to be used for commit roots, so it always needs to do
locking.
So if another task is COWing a child node of the same root node and
then needs to wait for block group caching to complete when trying to
allocate a metadata extent, it deadlocks.
For example:
[539604.239315] sysrq: Show Blocked State
[539604.240133] task:kworker/u16:6 state:D stack:0 pid:2119594 ppid:2 flags:0x00004000
[539604.241613] Workqueue: btrfs-cache btrfs_work_helper [btrfs]
[539604.242673] Call Trace:
[539604.243129] <TASK>
[539604.243925] __schedule+0x41d/0xee0
[539604.244797] ? rcu_read_lock_sched_held+0x12/0x70
[539604.245399] ? rwsem_down_read_slowpath+0x185/0x490
[539604.246111] schedule+0x5d/0xf0
[539604.246593] rwsem_down_read_slowpath+0x2da/0x490
[539604.247290] ? rcu_barrier_tasks_trace+0x10/0x20
[539604.248090] __down_read_common+0x3d/0x150
[539604.248702] down_read_nested+0xc3/0x140
[539604.249280] __btrfs_tree_read_lock+0x24/0x100 [btrfs]
[539604.250097] btrfs_read_lock_root_node+0x48/0x60 [btrfs]
[539604.250915] btrfs_search_forward+0x59/0x460 [btrfs]
[539604.251781] ? btrfs_global_root+0x50/0x70 [btrfs]
[539604.252476] caching_thread+0x1be/0x920 [btrfs]
[539604.253167] btrfs_work_helper+0xf6/0x400 [btrfs]
[539604.253848] process_one_work+0x24f/0x5a0
[539604.254476] worker_thread+0x52/0x3b0
[539604.255166] ? __pfx_worker_thread+0x10/0x10
[539604.256047] kthread+0xf0/0x120
[539604.256591] ? __pfx_kthread+0x10/0x10
[539604.257212] ret_from_fork+0x29/0x50
[539604.257822] </TASK>
[539604.258233] task:btrfs-transacti state:D stack:0 pid:2236474 ppid:2 flags:0x00004000
[539604.259802] Call Trace:
[539604.260243] <TASK>
[539604.260615] __schedule+0x41d/0xee0
[539604.261205] ? rcu_read_lock_sched_held+0x12/0x70
[539604.262000] ? rwsem_down_read_slowpath+0x185/0x490
[539604.262822] schedule+0x5d/0xf0
[539604.263374] rwsem_down_read_slowpath+0x2da/0x490
[539604.266228] ? lock_acquire+0x160/0x310
[539604.266917] ? rcu_read_lock_sched_held+0x12/0x70
[539604.267996] ? lock_contended+0x19e/0x500
[539604.268720] __down_read_common+0x3d/0x150
[539604.269400] down_read_nested+0xc3/0x140
[539604.270057] __btrfs_tree_read_lock+0x24/0x100 [btrfs]
[539604.271129] btrfs_read_lock_root_node+0x48/0x60 [btrfs]
[539604.272372] btrfs_search_slot+0x143/0xf70 [btrfs]
[539604.273295] update_block_group_item+0x9e/0x190 [btrfs]
[539604.274282] btrfs_start_dirty_block_groups+0x1c4/0x4f0 [btrfs]
[539604.275381] ? __mutex_unlock_slowpath+0x45/0x280
[539604.276390] btrfs_commit_transaction+0xee/0xed0 [btrfs]
[539604.277391] ? lock_acquire+0x1a4/0x310
[539604.278080] ? start_transaction+0xcb/0x6c0 [btrfs]
[539604.279099] transaction_kthread+0x142/0x1c0 [btrfs]
[539604.279996] ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
[539604.280673] kthread+0xf0/0x120
[539604.281050] ? __pfx_kthread+0x10/0x10
[539604.281496] ret_from_fork+0x29/0x50
[539604.281966] </TASK>
[539604.282255] task:fsstress state:D stack:0 pid:2236483 ppid:1 flags:0x00004006
[539604.283897] Call Trace:
[539604.284700] <TASK>
[539604.285088] __schedule+0x41d/0xee0
[539604.285660] schedule+0x5d/0xf0
[539604.286175] btrfs_wait_block_group_cache_progress+0xf2/0x170 [btrfs]
[539604.287342] ? __pfx_autoremove_wake_function+0x10/0x10
[539604.288450] find_free_extent+0xd93/0x1750 [btrfs]
[539604.289256] ? _raw_spin_unlock+0x29/0x50
[539604.289911] ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
[539604.290843] btrfs_reserve_extent+0x147/0x290 [btrfs]
[539604.291943] btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
[539604.292903] __btrfs_cow_block+0x138/0x580 [btrfs]
[539604.293773] btrfs_cow_block+0x10e/0x240 [btrfs]
[539604.294595] btrfs_search_slot+0x7f3/0xf70 [btrfs]
[539604.295585] btrfs_update_device+0x71/0x1b0 [btrfs]
[539604.296459] btrfs_chunk_alloc_add_chunk_item+0xe0/0x340 [btrfs]
[539604.297489] btrfs_chunk_alloc+0x1bf/0x490 [btrfs]
[539604.298335] find_free_extent+0x6fa/0x1750 [btrfs]
[539604.299174] ? _raw_spin_unlock+0x29/0x50
[539604.299950] ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
[539604.300918] btrfs_reserve_extent+0x147/0x290 [btrfs]
[539604.301797] btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
[539604.303017] ? lock_release+0x224/0x4a0
[539604.303855] __btrfs_cow_block+0x138/0x580 [btrfs]
[539604.304789] btrfs_cow_block+0x10e/0x240 [btrfs]
[539604.305611] btrfs_search_slot+0x7f3/0xf70 [btrfs]
[539604.306682] ? btrfs_global_root+0x50/0x70 [btrfs]
[539604.308198] lookup_inline_extent_backref+0x17b/0x7a0 [btrfs]
[539604.309254] lookup_extent_backref+0x43/0xd0 [btrfs]
[539604.310122] __btrfs_free_extent+0xf8/0x810 [btrfs]
[539604.310874] ? lock_release+0x224/0x4a0
[539604.311724] ? btrfs_merge_delayed_refs+0x17b/0x1d0 [btrfs]
[539604.313023] __btrfs_run_delayed_refs+0x2ba/0x1260 [btrfs]
[539604.314271] btrfs_run_delayed_refs+0x8f/0x1c0 [btrfs]
[539604.315445] ? rcu_read_lock_sched_held+0x12/0x70
[539604.316706] btrfs_commit_transaction+0xa2/0xed0 [btrfs]
[539604.317855] ? do_raw_spin_unlock+0x4b/0xa0
[539604.318544] ? _raw_spin_unlock+0x29/0x50
[539604.319240] create_subvol+0x53d/0x6e0 [btrfs]
[539604.320283] btrfs_mksubvol+0x4f5/0x590 [btrfs]
[539604.321220] __btrfs_ioctl_snap_create+0x11b/0x180 [btrfs]
[539604.322307] btrfs_ioctl_snap_create_v2+0xc6/0x150 [btrfs]
[539604.323295] btrfs_ioctl+0x9f7/0x33e0 [btrfs]
[539604.324331] ? rcu_read_lock_sched_held+0x12/0x70
[539604.325137] ? lock_release+0x224/0x4a0
[539604.325808] ? __x64_sys_ioctl+0x87/0xc0
[539604.326467] __x64_sys_ioctl+0x87/0xc0
[539604.327109] do_syscall_64+0x38/0x90
[539604.327875] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[539604.328792] RIP: 0033:0x7f05a7babaeb
This needs to use regular btrfs_search_slot() with some skip and stop
logic.
Since we only consider five samples (five search slots), don't bother
with the complexity of looking for commit_root_sem contention. If
necessary, it can be added to the load function in between samples.
Reported-by: Filipe Manana <fdmanana@kernel.org>
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H7eKMD44Z1+=Kb-1RFMMeZpAm2fwyO59yeBwCcSOU80Pg@mail.gmail.com/
Fixes: c7eec3d9aa95 ("btrfs: load block group size class when caching")
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that address space operations are merge dfor in-ICB and normal
files, it is more likely some code mistakenly tries to map blocks for
in-ICB files. WARN and return error instead of silently returning
garbage.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
After merging address space operations of normal and in-ICB files,
readahead could get called for in-ICB files which resulted in
udf_get_block() being called for these files. udf_get_block() is not
prepared to be called for in-ICB files and ends up returning garbage
results as it interprets file data as extent list. Fix the problem by
skipping readahead for in-ICB files.
Fixes: 37a8a39f7ad3 ("udf: Switch to single address_space_operations")
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
The patch converting udf_adinicb_writepage() to avoid manually kmapping
the page used memcpy_to_page() however that copies in the wrong
direction (effectively overwriting file data with the old contents).
What we should be using is memcpy_from_page() to copy data from the page
into the inode and then mark inode dirty to store the data.
Fixes: 5cfc45321a6d ("udf: Convert udf_adinicb_writepage() to memcpy_to_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"17 hotfixes.
Eight are for MM and seven are for other parts of the kernel. Seven
are cc:stable and eight address post-6.3 issues or were judged
unsuitable for -stable backporting"
* tag 'mm-hotfixes-stable-2023-03-04-13-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: map Dikshita Agarwal's old address to his current one
mailmap: map Vikash Garodia's old address to his current one
fs/cramfs/inode.c: initialize file_ra_state
fs: hfsplus: fix UAF issue in hfsplus_put_super
panic: fix the panic_print NMI backtrace setting
lib: parser: update documentation for match_NUMBER functions
kasan, x86: don't rename memintrinsics in uninstrumented files
kasan: test: fix test for new meminstrinsic instrumentation
kasan: treat meminstrinsic as builtins in uninstrumented files
kasan: emit different calls for instrumentable memintrinsics
ocfs2: fix non-auto defrag path not working issue
ocfs2: fix defrag path triggering jbd2 ASSERT
mailmap: map Georgi Djakov's old Linaro address to his current one
mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON
lib/zlib: DFLTCC deflate does not write all available bits for Z_NO_FLUSH
mm/damon/paddr: fix missing folio_put()
mm/mremap: fix dup_anon_vma() in vma_merge() case 4
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull more cifs updates from Steve French:
- xfstest generic/208 fix (memory leak)
- minor netfs fix (to address smatch warning)
- a DFS fix for stable
- a reconnect race fix
- two multichannel fixes
- RDMA (smbdirect) fix
- two additional writeback fixes from David
* tag '6.3-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix memory leak in direct I/O
cifs: prevent data race in cifs_reconnect_tcon()
cifs: improve checking of DFS links over STATUS_OBJECT_NAME_INVALID
iov: Fix netfs_extract_user_to_sg()
cifs: Fix cifs_write_back_from_locked_folio()
cifs: reuse cifs_match_ipaddr for comparison of dstaddr too
cifs: match even the scope id for ipv6 addresses
cifs: Fix an uninitialised variable
cifs: Add some missing xas_retry() calls
|
|
file_ra_state_init() assumes that the file_ra_state has been zeroed out.
Fixes a KMSAN used-unintialized issue (at least).
Fixes: cf948cbc35e80 ("cramfs: read_mapping_page() is synchronous")
Reported-by: syzbot <syzbot+8ce7f8308d91e6b8bbe2@syzkaller.appspotmail.com>
Link: https://lkml.kernel.org/r/0000000000008f74e905f56df987@google.com
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nicolas Pitre <nico@fluxnic.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The current hfsplus_put_super first calls hfs_btree_close on
sbi->ext_tree, then invokes iput on sbi->hidden_dir, resulting in an
use-after-free issue in hfsplus_release_folio.
As shown in hfsplus_fill_super, the error handling code also calls iput
before hfs_btree_close.
To fix this error, we move all iput calls before hfsplus_btree_close.
Note that this patch is tested on Syzbot.
Link: https://lkml.kernel.org/r/20230226124948.3175736-1-mudongliangabcd@gmail.com
Reported-by: syzbot+57e3e98f7e3b80f64d56@syzkaller.appspotmail.com
Tested-by: Dongliang Mu <mudongliangabcd@gmail.com>
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull ceph fixes from Ilya Dryomov:
"Two small fixes from Xiubo and myself, marked for stable"
* tag 'ceph-for-6.3-rc1' of https://github.com/ceph/ceph-client:
rbd: avoid use-after-free in do_rbd_add() when rbd_dev_create() fails
ceph: update the time stamps and try to drop the suid/sgid
|
|
When __cifs_readv() and __cifs_writev() extract pages from a user-backed
iterator into a BVEC-type iterator, they set ->bv_need_unpin to note
whether they need to unpin the pages later. However, in both cases they
examine the BVEC-type iterator and not the source iterator - and so
bv_need_unpin doesn't get set and the pages are leaked.
I think this may be responsible for the generic/208 xfstest failing
occasionally with:
WARNING: CPU: 0 PID: 3064 at mm/gup.c:218 try_grab_page+0x65/0x100
RIP: 0010:try_grab_page+0x65/0x100
follow_page_pte+0x1a7/0x570
__get_user_pages+0x1a2/0x650
__gup_longterm_locked+0xdc/0xb50
internal_get_user_pages_fast+0x17f/0x310
pin_user_pages_fast+0x46/0x60
iov_iter_extract_pages+0xc9/0x510
? __kmalloc_large_node+0xb1/0x120
? __kmalloc_node+0xbe/0x130
netfs_extract_user_iter+0xbf/0x200 [netfs]
__cifs_writev+0x150/0x330 [cifs]
vfs_write+0x2a8/0x3c0
ksys_pwrite64+0x65/0xa0
with the page refcount going negative. This is less unlikely than it seems
because the page is being pinned, not simply got, and so the refcount
increased by 1024 each time, and so only needs to be called around ~2097152
for the refcount to go negative.
Further, the test program (aio-dio-invalidate-failure) uses a 32MiB static
buffer and all the PTEs covering it refer to the same page because it's
never written to.
The warning in try_grab_page():
if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
return -ENOMEM;
then trips and prevents us ever using the page again for DIO at least.
Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Link: https://lore.kernel.org/r/CAH2r5mvaTsJ---n=265a4zqRA7pP+o4MJ36WCQUS6oPrOij8cw@mail.gmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Make sure to get an up-to-date TCP_Server_Info::nr_targets value prior
to waiting the server to be reconnected in cifs_reconnect_tcon(). It
is set in cifs_tcp_ses_needs_reconnect() and protected by
TCP_Server_Info::srv_lock.
Create a new cifs_wait_for_server_reconnect() helper that can be used
by both SMB2+ and CIFS reconnect code.
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Do not map STATUS_OBJECT_NAME_INVALID to -EREMOTE under non-DFS
shares, or 'nodfs' mounts or CONFIG_CIFS_DFS_UPCALL=n builds.
Otherwise, in the slow path, get a referral to figure out whether it
is an actual DFS link.
This could be simply reproduced under a non-DFS share by running the
following
$ mount.cifs //srv/share /mnt -o ...
$ cat /mnt/$(printf '\U110000')
cat: '/mnt/'$'\364\220\200\200': Object is remote
Fixes: c877ce47e137 ("cifs: reduce roundtrips on create/qinfo requests")
CC: stable@vger.kernel.org # 6.2
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix the loop check in netfs_extract_user_to_sg() for extraction from
user-backed iterators to do the body if npages > 0, not if npages < 0
(which it can never be).
This isn't currently used by cifs, which only ever extracts data from BVEC,
KVEC and XARRAY iterators at this level, user-backed iterators having being
decanted into BVEC iterators at a higher level to accommodate the work
being done in a kernel thread.
Found by smatch:
fs/netfs/iterator.c:139 netfs_extract_user_to_sg() warn: unsigned 'npages' is never less than zero.
Fixes: 018584697533 ("netfs: Add a function to extract an iterator into a scatterlist")
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202302261115.P3TQi1ZO-lkp@intel.com/
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/r/Y/yYnAhoAYDBKixX@kili
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
cifs_write_back_from_locked_folio() should return the number of bytes read,
but returns the result of ->async_writev(), which will be 0 on success. As
it happens, this doesn't prevent cifs_writepages_region() from working as
it will then examine and ignore the pages that are no longer dirty rather
than just skipping over them.
Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We have two pieces of code that does pretty much the same
comparison. This change reuses cifs_match_ipaddr within
match_address.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
match_address function matches the scope id for ipv6 addresses,
but cifs_match_ipaddr (which is another function used for comparison)
does not use scope id. Doing so with this change.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix an uninitialised variable introduced in cifs.
Fixes: 3d78fe73fa12 ("cifs: Build the RDMA SGE list directly from an iterator")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-rdma@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The xas_for_each loops added into fs/cifs/file.c need to go round again if
indicated by xas_retry().
Fixes: b8713c4dbfa3 ("cifs: Add some helper functions")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Make it possible to see the distribution of size classes for block
groups. Helpful for testing and debugging the allocator w.r.t. to size
classes.
The new stats can be found at the path:
/sys/fs/btrfs/<FSID>/allocation/<bg-type>/size_class
but they will only be non-zero for bg-type = data.
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Back in 2008 we extended the capability bits from 32 to 64, and we did
it by extending the single 32-bit capability word from one word to an
array of two words. It was then obfuscated by hiding the "2" behind two
macro expansions, with the reasoning being that maybe it gets extended
further some day.
That reasoning may have been valid at the time, but the last thing we
want to do is to extend the capability set any more. And the array of
values not only causes source code oddities (with loops to deal with
it), but also results in worse code generation. It's a lose-lose
situation.
So just change the 'u32[2]' into a 'u64' and be done with it.
We still have to deal with the fact that the user space interface is
designed around an array of these 32-bit values, but that was the case
before too, since the array layouts were different (ie user space
doesn't use an array of 32-bit values for individual capability masks,
but an array of 32-bit slices of multiple masks).
So that marshalling of data is actually simplified too, even if it does
remain somewhat obscure and odd.
This was all triggered by my reaction to the new "cap_isidentical()"
introduced recently. By just using a saner data structure, it went from
unsigned __capi;
CAP_FOR_EACH_U32(__capi) {
if (a.cap[__capi] != b.cap[__capi])
return false;
}
return true;
to just being
return a.val == b.val;
instead. Which is rather more obvious both to humans and to compilers.
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull UML updates from Richard Weinberger:
- Add support for rust (yay!)
- Add support for LTO
- Add platform bus support to virtio-pci
- Various virtio fixes
- Coding style, spelling cleanups
* tag 'uml-for-linus-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (27 commits)
Documentation: rust: Fix arch support table
uml: vector: Remove unused definitions VECTOR_{WRITE,HEADERS}
um: virt-pci: properly remove PCI device from bus
um: virtio_uml: move device breaking into workqueue
um: virtio_uml: mark device as unregistered when breaking it
um: virtio_uml: free command if adding to virtqueue failed
UML: define RUNTIME_DISCARD_EXIT
virt-pci: add platform bus support
um-virt-pci: Make max delay configurable
um: virt-pci: implement pcibios_get_phb_of_node()
um: Support LTO
um: put power options in a menu
um: Use CFLAGS_vmlinux
um: Prevent building modules incompatible with MODVERSIONS
um: Avoid pcap multiple definition errors
um: Make the definition of cpu_data more compatible
x86: um: vdso: Add '%rcx' and '%r11' to the syscall clobber list
rust: arch/um: Add support for CONFIG_RUST under x86_64 UML
rust: arch/um: Disable FP/SIMD instruction to match x86
rust: arch/um: Use 'pie' relocation mode under UML
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull jffs2, ubi and ubifs updates from Richard Weinberger:
"JFFS2:
- Fix memory corruption in error path
- Spelling and coding style fixes
UBI:
- Switch to BLK_MQ_F_BLOCKING in ubiblock
- Wire up partent device (for sysfs)
- Multiple UAF bugfixes
- Fix for an infinite loop in WL error path
UBIFS:
- Fix for multiple memory leaks in error paths
- Fixes for wrong space accounting
- Minor cleanups
- Spelling and coding style fixes"
* tag 'ubifs-for-linus-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: (36 commits)
ubi: block: Fix a possible use-after-free bug in ubiblock_create()
ubifs: make kobj_type structures constant
mtd: ubi: block: wire-up device parent
mtd: ubi: wire-up parent MTD device
ubi: use correct names in function kernel-doc comments
ubi: block: set BLK_MQ_F_BLOCKING
jffs2: Fix list_del corruption if compressors initialized failed
jffs2: Use function instead of macro when initialize compressors
jffs2: fix spelling mistake "neccecary"->"necessary"
ubifs: Fix kernel-doc
ubifs: Fix some kernel-doc comments
UBI: Fastmap: Fix kernel-doc
ubi: ubi_wl_put_peb: Fix infinite loop when wear-leveling work failed
ubi: Fix UAF wear-leveling entry in eraseblk_count_seq_show()
ubi: fastmap: Fix missed fm_anchor PEB in wear-leveling after disabling fastmap
ubifs: ubifs_releasepage: Remove ubifs_assert(0) to valid this process
ubifs: ubifs_writepage: Mark page dirty after writing inode failed
ubifs: dirty_cow_znode: Fix memleak in error handling path
ubifs: Re-statistic cleaned znode count if commit failed
ubi: Fix permission display of the debugfs files
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9p updates from Eric Van Hensbergen:
- some fixes and cleanup setting up for a larger set of performance
patches I've been working on
- a contributed fixes relating to 9p/rdma
- some contributed fixes relating to 9p/xen
* tag '9p-6.3-for-linus-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: fix error reporting in v9fs_dir_release
net/9p: fix bug in client create for .L
9p/rdma: unmap receive dma buffer in rdma_request()/post_recv()
9p/xen: fix connection sequence
9p/xen: fix version parsing
fs/9p: Expand setup of writeback cache to all levels
net/9p: Adjust maximum MSIZE to account for p9 header
|
|
Pull jfs update from Dave Kleikamp:
"Just one simple sanity check"
* tag 'jfs-6.3' of https://github.com/kleikamp/linux-shaggy:
fs/jfs: fix shift exponent db_agl2size negative
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- Handle vendor extension and allocation entries as unrecognized benign
secondary entries
- Fix wrong ->i_blocks on devices with non-512 byte sector
- Add the check to avoid returning -EIO from exfat_readdir() at current
position exceeding the directory size
- Fix a bug that reach the end of the directory stream at a position
not aligned with the dentry size
- Redefine DIR_DELETED as 0xFFFFFFF7, the bad cluster number
- Two cleanup fixes and fix cluster leakage in error handling
* tag 'exfat-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix the newly allocated clusters are not freed in error handling
exfat: don't print error log in normal case
exfat: remove unneeded code from exfat_alloc_cluster()
exfat: handle unreconized benign secondary entries
exfat: fix inode->i_blocks for non-512 byte sector size device
exfat: redefine DIR_DELETED as the bad cluster number
exfat: fix reporting fs error when reading dir beyond EOF
exfat: fix unexpected EOF while reading dir
|
|
Pull moar xfs updates from Darrick Wong:
"This contains a fix for a deadlock in the allocator. It continues the
slow march towards being able to offline AGs, and it refactors the
interface to the xfs allocator to be less indirection happy.
Summary:
- Fix a deadlock in the free space allocator due to the AG-walking
algorithm forgetting to follow AG-order locking rules
- Make the inode allocator prefer existing free inodes instead of
failing to allocate new inode chunks when free space is low
- Set minleft correctly when setting allocator parameters for bmap
changes
- Fix uninitialized variable access in the getfsmap code
- Make a distinction between active and passive per-AG structure
references. For now, active references are taken to perform some
work in an AG on behalf of a high level operation; passive
references are used by lower level code to finish operations
started by other threads. Eventually this will become part of
online shrink
- Split out all the different allocator strategies into separate
functions to move us away from design antipattern of filling out a
huge structure for various differentish things and issuing a single
function multiplexing call
- Various cleanups in the filestreams allocator code, which we might
very well want to deprecate instead of continuing
- Fix a bug with the agi rotor code that was introduced earlier in
this series"
* tag 'xfs-6.3-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits)
xfs: restore old agirotor behavior
xfs: fix uninitialized variable access
xfs: refactor the filestreams allocator pick functions
xfs: return a referenced perag from filestreams allocator
xfs: pass perag to filestreams tracing
xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
xfs: track an active perag reference in filestreams
xfs: factor out MRU hit case in xfs_filestream_select_ag
xfs: remove xfs_filestream_select_ag() longest extent check
xfs: merge new filestream AG selection into xfs_filestream_select_ag()
xfs: merge filestream AG lookup into xfs_filestream_select_ag()
xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
xfs: use xfs_bmap_longest_free_extent() in filestreams
xfs: get rid of notinit from xfs_bmap_longest_free_extent
xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
xfs: convert trim to use for_each_perag_range
xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
xfs: fold xfs_alloc_ag_vextent() into callers
xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Improve performance for ext4 by allowing multiple process to perform
direct I/O writes to preallocated blocks by using a shared inode lock
instead of taking an exclusive lock.
In addition, multiple bug fixes and cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix incorrect options show of original mount_opt and extend mount_opt2
ext4: Fix possible corruption when moving a directory
ext4: init error handle resource before init group descriptors
ext4: fix task hung in ext4_xattr_delete_inode
jbd2: fix data missing when reusing bh which is ready to be checkpointed
ext4: update s_journal_inum if it changes after journal replay
ext4: fail ext4_iget if special inode unallocated
ext4: fix function prototype mismatch for ext4_feat_ktype
ext4: remove unnecessary variable initialization
ext4: fix inode tree inconsistency caused by ENOMEM
ext4: refuse to create ea block when umounted
ext4: optimize ea_inode block expansion
ext4: remove dead code in updating backup sb
ext4: dio take shared inode lock when overwriting preallocated blocks
ext4: don't show commit interval if it is zero
ext4: use ext4_fc_tl_mem in fast-commit replay path
ext4: improve xattr consistency checking and error reporting
|
|
In error handling 'free_cluster', before num_alloc clusters allocated,
p_chain->size will not updated and always 0, thus the newly allocated
clusters are not freed.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
When allocating a new cluster, exFAT first allocates from the
next cluster of the last cluster of the file. If the last cluster
of the file is the last cluster of the volume, allocate from the
first cluster. This is a normal case, but the following error log
will be printed. It makes users confused, so this commit removes
the error log.
[1960905.181545] exFAT-fs (sdb1): hint_cluster is invalid (262130)
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
In the removed code, num_clusters is 0, nothing is done in
exfat_chain_cont_cluster(), so it is unneeded, remove it.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
This fixes three issues on move extents ioctl without auto defrag:
a) In ocfs2_find_victim_alloc_group(), we have to convert bits to block
first in case of global bitmap.
b) In ocfs2_probe_alloc_group(), when finding enough bits in block
group bitmap, we have to back off move_len to start pos as well,
otherwise it may corrupt filesystem.
c) In ocfs2_ioctl_move_extents(), set me_threshold both for non-auto
and auto defrag paths. Otherwise it will set move_max_hop to 0 and
finally cause unexpectedly ENOSPC error.
Currently there are no tools triggering the above issues since
defragfs.ocfs2 enables auto defrag by default. Tested with manually
changing defragfs.ocfs2 to run non auto defrag path.
Link: https://lkml.kernel.org/r/20230220050526.22020-1-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
code path:
ocfs2_ioctl_move_extents
ocfs2_move_extents
ocfs2_defrag_extent
__ocfs2_move_extent
+ ocfs2_journal_access_di
+ ocfs2_split_extent //sub-paths call jbd2_journal_restart
+ ocfs2_journal_dirty //crash by jbs2 ASSERT
crash stacks:
PID: 11297 TASK: ffff974a676dcd00 CPU: 67 COMMAND: "defragfs.ocfs2"
#0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01
#1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d
#2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d
#3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f
#4 [ffffb25d8dad3a58] do_trap at ffffffff83833205
#5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6
#6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18
[exception RIP: jbd2_journal_dirty_metadata+0x2ba]
RIP: ffffffffc09ca54a RSP: ffffb25d8dad3b70 RFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9706eedc5248 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff97337029ea28 RDI: ffff9706eedc5250
RBP: ffff9703c3520200 R8: 000000000f46b0b2 R9: 0000000000000000
R10: 0000000000000001 R11: 00000001000000fe R12: ffff97337029ea28
R13: 0000000000000000 R14: ffff9703de59bf60 R15: ffff9706eedc5250
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2]
#8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2]
#9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2]
Analysis
This bug has the same root cause of 'commit 7f27ec978b0e ("ocfs2: call
ocfs2_journal_access_di() before ocfs2_journal_dirty() in
ocfs2_write_end_nolock()")'. For this bug, jbd2_journal_restart() is
called by ocfs2_split_extent() during defragmenting.
How to fix
For ocfs2_split_extent() can handle journal operations totally by itself.
Caller doesn't need to call journal access/dirty pair, and caller only
needs to call journal start/stop pair. The fix method is to remove
journal access/dirty from __ocfs2_move_extent().
The discussion for this patch:
https://oss.oracle.com/pipermail/ocfs2-devel/2023-February/000647.html
Link: https://lkml.kernel.org/r/20230217003717.32469-1-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
access(2) remains commonly used, for example on exec:
access("/etc/ld.so.preload", R_OK)
or when running gcc: strace -c gcc empty.c
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.00 0.000000 0 42 26 access
It falls down to do_faccessat without the AT_EACCESS flag, which in turn
results in allocation of new creds in order to modify fsuid/fsgid and
caps. This is a very expensive process single-threaded and most notably
multi-threaded, with numerous structures getting refed and unrefed on
imminent new cred destruction.
Turns out for typical consumers the resulting creds would be identical
and this can be checked upfront, avoiding the hard work.
An access benchmark plugged into will-it-scale running on Cascade Lake
shows:
test proc before after
access1 1 1310582 2908735 (+121%) # distinct files
access1 24 4716491 63822173 (+1353%) # distinct files
access2 24 2378041 5370335 (+125%) # same file
The above benchmarks are not integrated into will-it-scale, but can be
found in a pull request:
https://github.com/antonblanchard/will-it-scale/pull/36/files
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've got a huge number of patches that improve code
readability along with minor bug fixes, while we've mainly fixed some
critical issues in recently-added per-block age-based extent_cache,
atomic write support, and some folio cases.
Enhancements:
- add sysfs nodes to set last_age_weight and manage
discard_io_aware_gran
- show ipu policy in debugfs
- reduce stack memory cost by using bitfield in struct f2fs_io_info
- introduce trace_f2fs_replace_atomic_write_block
- enhance iostat support and adds flush commands
Bug fixes:
- revert "f2fs: truncate blocks in batch in __complete_revoke_list()"
- fix kernel crash on the atomic write abort flow
- call clear_page_private_reference in .{release,invalid}_folio
- support .migrate_folio for compressed inode
- fix cgroup writeback accounting with fs-layer encryption
- retry to update the inode page given data corruption
- fix kernel crash due to NULL io->bio
- fix some bugs in per-block age-based extent_cache:
- wrong calculation of block age
- update age extent in f2fs_do_zero_range()
- update age extent correctly during truncation"
* tag 'f2fs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (81 commits)
f2fs: drop unnecessary arg for f2fs_ioc_*()
f2fs: Revert "f2fs: truncate blocks in batch in __complete_revoke_list()"
f2fs: synchronize atomic write aborts
f2fs: fix wrong segment count
f2fs: replace si->sbi w/ sbi in stat_show()
f2fs: export ipu policy in debugfs
f2fs: make kobj_type structures constant
f2fs: fix to do sanity check on extent cache correctly
f2fs: add missing description for ipu_policy node
f2fs: fix to set ipu policy
f2fs: fix typos in comments
f2fs: fix kernel crash due to null io->bio
f2fs: use iostat_lat_type directly as a parameter in the iostat_update_and_unbind_ctx()
f2fs: add sysfs nodes to set last_age_weight
f2fs: fix f2fs_show_options to show nogc_merge mount option
f2fs: fix cgroup writeback accounting with fs-layer encryption
f2fs: fix wrong calculation of block age
f2fs: fix to update age extent in f2fs_do_zero_range()
f2fs: fix to update age extent correctly during truncation
f2fs: fix to avoid potential memory corruption in __update_iostat_latency()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC driver updates from Arnd Bergmann:
"As usual, there are lots of minor driver changes across SoC platforms
from NXP, Amlogic, AMD Zynq, Mediatek, Qualcomm, Apple and Samsung.
These usually add support for additional chip variations in existing
drivers, but also add features or bugfixes.
The SCMI firmware subsystem gains a unified raw userspace interface
through debugfs, which can be used for validation purposes.
Newly added drivers include:
- New power management drivers for StarFive JH7110, Allwinner D1 and
Renesas RZ/V2M
- A driver for Qualcomm battery and power supply status
- A SoC device driver for identifying Nuvoton WPCM450 chips
- A regulator coupler driver for Mediatek MT81xxv"
* tag 'soc-drivers-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (165 commits)
power: supply: Introduce Qualcomm PMIC GLINK power supply
soc: apple: rtkit: Do not copy the reg state structure to the stack
soc: sunxi: SUN20I_PPU should depend on PM
memory: renesas-rpc-if: Remove redundant division of dummy
soc: qcom: socinfo: Add IDs for IPQ5332 and its variant
dt-bindings: arm: qcom,ids: Add IDs for IPQ5332 and its variant
dt-bindings: power: qcom,rpmpd: add RPMH_REGULATOR_LEVEL_LOW_SVS_L1
firmware: qcom_scm: Move qcom_scm.h to include/linux/firmware/qcom/
MAINTAINERS: Update qcom CPR maintainer entry
dt-bindings: firmware: document Qualcomm SM8550 SCM
dt-bindings: firmware: qcom,scm: add qcom,scm-sa8775p compatible
soc: qcom: socinfo: Add Soc IDs for IPQ8064 and variants
dt-bindings: arm: qcom,ids: Add Soc IDs for IPQ8064 and variants
soc: qcom: socinfo: Add support for new field in revision 17
soc: qcom: smd-rpm: Add IPQ9574 compatible
soc: qcom: pmic_glink: remove redundant calculation of svid
soc: qcom: stats: Populate all subsystem debugfs files
dt-bindings: soc: qcom,rpmh-rsc: Update to allow for generic nodes
soc: qcom: pmic_glink: add CONFIG_NET/CONFIG_OF dependencies
soc: qcom: pmic_glink: Introduce altmode support
...
|