Age | Commit message (Collapse) | Author | Files | Lines |
|
smb3_decrypt_req() validate if pdu_length is smaller than
smb2_transform_hdr size.
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21589
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Since commit 74d7970febf7 ("ksmbd: fix racy issue from using ->d_parent and
->d_name"), ksmbd can not lookup cross mount points. If last component is
a cross mount point during path lookup, check if it is crossed to follow it
down. And allow path lookup to cross a mount point when a crossmnt
parameter is set to 'yes' in smb.conf.
Cc: stable@vger.kernel.org
Fixes: 74d7970febf7 ("ksmbd: fix racy issue from using ->d_parent and ->d_name")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
During allocations, while looking for preallocations(PA) in the per
inode rbtree, we can't do a direct traversal of the tree because
ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted
and that can cause direct traversal to skip some entries. This was
leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy
our request and ultimately tried to create a new PA that would overlap
with the missed one.
To makes sure we handle that case while still keeping the performance of
the rbtree, we make use of the fact that the only pa that could possibly
overlap the original goal start is the one that satisfies the below
conditions:
1. It must have it's logical start immediately to the left of
(ie less than) original logical start.
2. It must not be deleted
To find this pa we use the following traversal method:
1. Descend into the rbtree normally to find the immediate neighboring
PA. Here we keep descending irrespective of if the PA is deleted or if
it overlaps with our request etc. The goal is to find an immediately
adjacent PA.
2. If the found PA is on right of original goal, use rb_prev() to find
the left adjacent PA.
3. Check if this PA is deleted and keep moving left with rb_prev() until
a non deleted PA is found.
4. This is the PA we are looking for. Now we can check if it can satisfy
the original request and proceed accordingly.
This approach also takes care of having deleted PAs in the tree.
(While we are at it, also fix a possible overflow bug in calculating the
end of a PA)
[1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/
Cc: stable@kernel.org # 6.4
Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In ext4_mb_choose_next_group_best_avail(), we want the start order to be
1 less than goal length and the min_order to be, at max, 1 more than the
original length. This commit fixes an off by one issue that arose due to
the fact that 1 << fls(n) > (n).
After all the processing:
order = 1 order below goal len
min_order = maximum of the three:-
- order - trim_order
- 1 order below B2C(s_stripe)
- 1 order above original len
Cc: stable@kernel.org
Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When run on a file system where the inline_data feature has been
enabled, xfstests generic/269, generic/270, and generic/476 cause ext4
to emit error messages indicating that inline directory entries are
corrupted. This occurs because the inline offset used to locate
inline directory entries in the inode body is not updated when an
xattr in that shared region is deleted and the region is shifted in
memory to recover the space it occupied. If the deleted xattr precedes
the system.data attribute, which points to the inline directory entries,
that attribute will be moved further up in the region. The inline
offset continues to point to whatever is located in system.data's former
location, with unfortunate effects when used to access directory entries
or (presumably) inline data in the inode body.
Cc: stable@kernel.org
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20230522181520.1570360-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
From 2.43 to 2.44
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Dumping the enc/dec keys is a session wide operation.
And it should not matter if the ioctl was run on
a regular file or a directory.
Currently, we obtain the tcon pointer from the
cifs file handle. But since there's no dir open call
in cifs, this is not populated for dirs.
This change allows dumping of session keys using ioctl
even for directories. To do this, we'll now get the
tcon pointer from the superblock, and not from the file
handle.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
commit bd238fb431f3 ("9p: Reorganization of 9p file system code")
left behind this.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
|
|
The 9p code for some reason used to initialize variables outside of the
declaration, e.g. instead of just initializing the variable like this:
int retval = 0
We would be doing this:
int retval;
retval = 0;
This is perfectly fine and the compiler will just optimize dead stores
anyway, but scan-build seems to think this is a problem and there are
many of these warnings making the output of scan-build full of such
warnings:
fs/9p/vfs_inode.c:916:2: warning: Value stored to 'retval' is never read [deadcode.DeadStores]
retval = 0;
^ ~
I have no strong opinion here, but if we want to regularly run
scan-build we should fix these just to silence the messages.
I've confirmed these all are indeed ok to remove.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
|
|
retval from filemap_fdatawrite was immediately overwritten by the
following p9_fid_put: preserve any error in fdatawrite if there
was any first.
This fixes the following scan-build warning:
fs/9p/vfs_dir.c:220:4: warning: Value stored to 'retval' is never read [deadcode.DeadStores]
retval = filemap_fdatawrite(inode->i_mapping);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: 89c58cb395ec ("fs/9p: fix error reporting in v9fs_dir_release")
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
|
|
When using the block group tree feature, this tree is a critical tree just
like the extent, csum and free space trees, and just like them it uses the
delayed refs block reserve.
So take into account the block group tree, and its current size, when
calculating the size for the global reserve.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The zoned mode need to reset a zone before using it. We rely on btrfs's
original discard functionality (discarding unused block group range) to do
the resetting.
While the commit 63a7cb130718 ("btrfs: auto enable discard=async when
possible") made the discard done in an async manner, a zoned reset do not
need to be async, as it is fast enough.
Even worth, delaying zone rests prevents using those zones again. So, let's
disable async discard on the zoned mode.
Fixes: 63a7cb130718 ("btrfs: auto enable discard=async when possible")
CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update message text ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pull iomap fix from Darrick Wong:
"Fix partial write regression.
It turns out that fstests doesn't have any test coverage for short
writes, but LTP does. Fortunately, this was caught right after -rc1
was tagged.
Summary:
- Fix a bug wherein a failed write could clobber short write status"
* tag 'iomap-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: micro optimize the ki_pos assignment in iomap_file_buffered_write
iomap: fix a regression for partial write errors
|
|
Pull xfs fixes from Darrick Wong:
"Flexarray declaration conversions.
This probably should've been done with the merge window open, but I
was not aware that the UBSAN knob would be getting turned up for 6.5,
and the fstests failures due to the kernel warnings are getting in the
way of testing.
Summary:
- Convert all the array[1] declarations into the accepted flex
array[] declarations so that UBSAN and friends will not get
confused"
* tag 'xfs-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: convert flex-array declarations in xfs attr shortform objects
xfs: convert flex-array declarations in xfs attr leaf blocks
xfs: convert flex-array declarations in struct xfs_attrlist*
|
|
There was an invalidate_inode_pages2 added to readonly mmap path
that is unnecessary since that path is only entered when writeback
cache is disabled on mount.
Cc: stable@vger.kernel.org
Fixes: 1543b4c5071c ("fs/9p: remove writeback fid and fix per-file modes")
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
|
|
There were two flags (s_flags and s_cache) which had incorrect signed
type in the parameters of the file cache mode helper function.
Cc: stable@vger.kernel.org
Fixes: 1543b4c5071c ("fs/9p: remove writeback fid and fix per-file modes")
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
|
|
There appears to be a typo in the comparison statement for the logic
which sets a file's cache mode based on mount flags.
Cc: stable@vger.kernel.org
Fixes: 1543b4c5071c ("fs/9p: remove writeback fid and fix per-file modes")
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
|
|
This eliminates a check for shared that was overrestrictive and
prevented read-only mmaps when writeback caches weren't enabled.
Cc: stable@vger.kernel.org
Fixes: 1543b4c5071c ("fs/9p: remove writeback fid and fix per-file modes")
Reported-by: Robert Schwebel <r.schwebel@pengutronix.de>
Closes: https://lore.kernel.org/v9fs/ZK25XZ%2BGpR3KHIB%2F@pengutronix.de
Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Stable fixes:
- fix race between balance and cancel/pause
- various iput() fixes
- fix use-after-free of new block group that became unused
- fix warning when putting transaction with qgroups enabled after
abort
- fix crash in subpage mode when page could be released between map
and map read
- when scrubbing raid56 verify the P/Q stripes unconditionally
- fix minor memory leak in zoned mode when a block group with an
unexpected superblock is found
Regression fixes:
- fix ordered extent split error handling when submitting direct IO
- user irq-safe locking when adding delayed iputs"
* tag 'for-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix warning when putting transaction with qgroups enabled after abort
btrfs: fix ordered extent split error handling in btrfs_dio_submit_io
btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand
btrfs: raid56: always verify the P/Q contents for scrub
btrfs: use irq safe locking when running and adding delayed iputs
btrfs: fix iput() on error pointer after error during orphan cleanup
btrfs: fix double iput() on inode after an error during orphan cleanup
btrfs: zoned: fix memory leak after finding block group with super blocks
btrfs: fix use-after-free of new block group that became unused
btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block
btrfs: fix race between balance and cancel/pause
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Small but important fixes and a trivial cleanup"
* tag 'fuse-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: ioctl: translate ENOSYS in outarg
fuse: revalidate: don't invalidate if interrupted
fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT
fuse: remove duplicate check for nodeid
fuse: add feature flag for expire-only
|
|
If the client is calling TEST_STATEID, then it is because some event
occurred that requires it to check all the stateids for validity and
call FREE_STATEID on the ones that have been revoked. In this case,
either the stateid exists in the list of stateids associated with that
nfs4_client, in which case it should be tested, or it does not. There
are no additional conditions to be considered.
Reported-by: "Frank Ch. Eigler" <fche@redhat.com>
Fixes: 7df302f75ee2 ("NFSD: TEST_STATEID should not return NFS4ERR_STALE_STATEID")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If we have a transaction abort with qgroups enabled we get a warning
triggered when doing the final put on the transaction, like this:
[552.6789] ------------[ cut here ]------------
[552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs]
[552.6817] Modules linked in: btrfs blake2b_generic xor (...)
[552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs]
[552.6821] Code: bd a0 01 00 (...)
[552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286
[552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000
[552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010
[552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20
[552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70
[552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028
[552.6821] FS: 0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000
[552.6821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0
[552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[552.6822] Call Trace:
[552.6822] <TASK>
[552.6822] ? __warn+0x80/0x130
[552.6822] ? btrfs_put_transaction+0x123/0x130 [btrfs]
[552.6824] ? report_bug+0x1f4/0x200
[552.6824] ? handle_bug+0x42/0x70
[552.6824] ? exc_invalid_op+0x14/0x70
[552.6824] ? asm_exc_invalid_op+0x16/0x20
[552.6824] ? btrfs_put_transaction+0x123/0x130 [btrfs]
[552.6826] btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs]
[552.6828] ? _raw_spin_unlock_irqrestore+0x23/0x40
[552.6828] ? try_to_wake_up+0x94/0x5e0
[552.6828] ? __pfx_process_timeout+0x10/0x10
[552.6828] transaction_kthread+0x103/0x1d0 [btrfs]
[552.6830] ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
[552.6832] kthread+0xee/0x120
[552.6832] ? __pfx_kthread+0x10/0x10
[552.6832] ret_from_fork+0x29/0x50
[552.6832] </TASK>
[552.6832] ---[ end trace 0000000000000000 ]---
This corresponds to this line of code:
void btrfs_put_transaction(struct btrfs_transaction *transaction)
{
(...)
WARN_ON(!RB_EMPTY_ROOT(
&transaction->delayed_refs.dirty_extent_root));
(...)
}
The warning happens because btrfs_qgroup_destroy_extent_records(), called
in the transaction abort path, we free all entries from the rbtree
"dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we
don't actually empty the rbtree - it's still pointing to nodes that were
freed.
So set the rbtree's root node to NULL to avoid this warning (assign
RB_ROOT).
Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When the call to btrfs_extract_ordered_extent in btrfs_dio_submit_io
fails to allocate memory for a new ordered_extent, it calls into the
btrfs_dio_end_io for error handling. btrfs_dio_end_io then assumes that
bbio->ordered is set because it is supposed to be at this point, except
for this error handling corner case. Try to not overload the
btrfs_dio_end_io with error handling of a bio in a non-canonical state,
and instead call btrfs_finish_ordered_extent and iomap_dio_bio_end_io
directly for this error case.
Reported-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
Fixes: b41b6f6937dc ("btrfs: use btrfs_finish_ordered_extent to complete direct writes")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While trying to get the subpage blocksize tests running, I hit the
following panic on generic/476
assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
kernel BUG at fs/btrfs/subpage.c:229!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12
Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023
pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : btrfs_subpage_assert+0xbc/0xf0
lr : btrfs_subpage_assert+0xbc/0xf0
Call trace:
btrfs_subpage_assert+0xbc/0xf0
btrfs_subpage_clear_checked+0x38/0xc0
btrfs_page_clear_checked+0x48/0x98
btrfs_truncate_block+0x5d0/0x6a8
btrfs_cont_expand+0x5c/0x528
btrfs_write_check.isra.0+0xf8/0x150
btrfs_buffered_write+0xb4/0x760
btrfs_do_write_iter+0x2f8/0x4b0
btrfs_file_write_iter+0x1c/0x30
do_iter_readv_writev+0xc8/0x158
do_iter_write+0x9c/0x210
vfs_iter_write+0x24/0x40
iter_file_splice_write+0x224/0x390
direct_splice_actor+0x38/0x68
splice_direct_to_actor+0x12c/0x260
do_splice_direct+0x90/0xe8
generic_copy_file_range+0x50/0x90
vfs_copy_file_range+0x29c/0x470
__arm64_sys_copy_file_range+0xcc/0x498
invoke_syscall.constprop.0+0x80/0xd8
do_el0_svc+0x6c/0x168
el0_svc+0x50/0x1b0
el0t_64_sync_handler+0x114/0x120
el0t_64_sync+0x194/0x198
This happens because during btrfs_cont_expand we'll get a page, set it
as mapped, and if it's not Uptodate we'll read it. However between the
read and re-locking the page we could have called release_folio() on the
page, but left the page in the file mapping. release_folio() can clear
the page private, and thus further down we blow up when we go to modify
the subpage bits.
Fix this by putting the set_page_extent_mapped() after the read. This
is safe because read_folio() will call set_page_extent_mapped() before
it does the read, and then if we clear page private but leave it on the
mapping we're completely safe re-setting set_page_extent_mapped(). With
this patch I can now run generic/476 without panicing.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[REGRESSION]
Commit 75b470332965 ("btrfs: raid56: migrate recovery and scrub recovery
path to use error_bitmap") changed the behavior of scrub_rbio().
Initially if we have no error reading the raid bio, we will assign
@need_check to true, then finish_parity_scrub() would later verify the
content of P/Q stripes before writeback.
But after that commit we never verify the content of P/Q stripes and
just writeback them.
This can lead to unrepaired P/Q stripes during scrub, or already
corrupted P/Q copied to the dev-replace target.
[FIX]
The situation is more complex than the regression, in fact the initial
behavior is not 100% correct either.
If we have the following rare case, it can still lead to the same
problem using the old behavior:
0 16K 32K 48K 64K
Data 1: |IIIIIII| |
Data 2: | |
Parity: | |CCCCCCC| |
Where "I" means IO error, "C" means corruption.
In the above case, we're scrubbing the parity stripe, then read out all
the contents of Data 1, Data 2, Parity stripes.
But found IO error in Data 1, which leads to rebuild using Data 2 and
Parity and got the correct data.
In that case, we would not verify if the Parity is correct for range
[16K, 32K).
So here we have to always verify the content of Parity no matter if we
did recovery or not.
This patch would remove the @need_check parameter of
finish_parity_scrub() completely, and would always do the P/Q
verification before writeback.
Fixes: 75b470332965 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap")
CC: stable@vger.kernel.org # 6.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Running delayed iputs, which never happens in an irq context, needs to
lock the spinlock fs_info->delayed_iput_lock. When finishing bios for
data writes (irq context, bio.c) we call btrfs_put_ordered_extent() which
needs to add a delayed iput and for that it needs to acquire the spinlock
fs_info->delayed_iput_lock. Without disabling irqs when running delayed
iputs we can therefore deadlock on that spinlock. The same deadlock can
also happen when adding an inode to the delayed iputs list, since this
can be done outside an irq context as well.
Syzbot recently reported this, which results in the following trace:
================================
WARNING: inconsistent lock state
6.4.0-syzkaller-09904-ga507db1d8fdc #0 Not tainted
--------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
btrfs-cleaner/16079 [HC0[0]:SC0[0]:HE1:SE1] takes:
ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline]
ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
{IN-SOFTIRQ-W} state was registered at:
lock_acquire kernel/locking/lockdep.c:5761 [inline]
lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
__raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
_raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
spin_lock include/linux/spinlock.h:350 [inline]
btrfs_add_delayed_iput+0x128/0x390 fs/btrfs/inode.c:3490
btrfs_put_ordered_extent fs/btrfs/ordered-data.c:559 [inline]
btrfs_put_ordered_extent+0x2f6/0x610 fs/btrfs/ordered-data.c:547
__btrfs_bio_end_io fs/btrfs/bio.c:118 [inline]
__btrfs_bio_end_io+0x136/0x180 fs/btrfs/bio.c:112
btrfs_orig_bbio_end_io+0x86/0x2b0 fs/btrfs/bio.c:163
btrfs_simple_end_io+0x105/0x380 fs/btrfs/bio.c:378
bio_endio+0x589/0x690 block/bio.c:1617
req_bio_endio block/blk-mq.c:766 [inline]
blk_update_request+0x5c5/0x1620 block/blk-mq.c:911
blk_mq_end_request+0x59/0x680 block/blk-mq.c:1032
lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
blk_complete_reqs+0xb3/0xf0 block/blk-mq.c:1110
__do_softirq+0x1d4/0x905 kernel/softirq.c:553
run_ksoftirqd kernel/softirq.c:921 [inline]
run_ksoftirqd+0x31/0x60 kernel/softirq.c:913
smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
kthread+0x344/0x440 kernel/kthread.c:389
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
irq event stamp: 39
hardirqs last enabled at (39): [<ffffffff81d5ebc4>] __do_kmem_cache_free mm/slab.c:3558 [inline]
hardirqs last enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free mm/slab.c:3582 [inline]
hardirqs last enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free+0x244/0x370 mm/slab.c:3575
hardirqs last disabled at (38): [<ffffffff81d5eb5e>] __do_kmem_cache_free mm/slab.c:3553 [inline]
hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free mm/slab.c:3582 [inline]
hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free+0x1de/0x370 mm/slab.c:3575
softirqs last enabled at (0): [<ffffffff814ac99f>] copy_process+0x227f/0x75c0 kernel/fork.c:2448
softirqs last disabled at (0): [<0000000000000000>] 0x0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&fs_info->delayed_iput_lock);
<Interrupt>
lock(&fs_info->delayed_iput_lock);
*** DEADLOCK ***
1 lock held by btrfs-cleaner/16079:
#0: ffff888107804860 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x103/0x4b0 fs/btrfs/disk-io.c:1463
stack backtrace:
CPU: 3 PID: 16079 Comm: btrfs-cleaner Not tainted 6.4.0-syzkaller-09904-ga507db1d8fdc #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_usage_bug kernel/locking/lockdep.c:3978 [inline]
valid_state kernel/locking/lockdep.c:4020 [inline]
mark_lock_irq kernel/locking/lockdep.c:4223 [inline]
mark_lock.part.0+0x1102/0x1960 kernel/locking/lockdep.c:4685
mark_lock kernel/locking/lockdep.c:4649 [inline]
mark_usage kernel/locking/lockdep.c:4598 [inline]
__lock_acquire+0x8e4/0x5e20 kernel/locking/lockdep.c:5098
lock_acquire kernel/locking/lockdep.c:5761 [inline]
lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726
__raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
_raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
spin_lock include/linux/spinlock.h:350 [inline]
btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523
cleaner_kthread+0x2e5/0x4b0 fs/btrfs/disk-io.c:1478
kthread+0x344/0x440 kernel/kthread.c:389
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
So fix this by using spin_lock_irq() and spin_unlock_irq() when running
delayed iputs, and using spin_lock_irqsave() and spin_unlock_irqrestore()
when adding a delayed iput().
Reported-by: syzbot+da501a04be5ff533b102@syzkaller.appspotmail.com
Fixes: ec63b84d4611 ("btrfs: add an ordered_extent pointer to struct btrfs_bio")
Link: https://lore.kernel.org/linux-btrfs/000000000000d5c89a05ffbd39dd@google.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At btrfs_orphan_cleanup(), if we can't find an inode (btrfs_iget() returns
an -ENOENT error pointer), we proceed with 'ret' set to -ENOENT and the
inode pointer set to ERR_PTR(-ENOENT). Later when we proceed to the body
of the following if statement:
if (ret == -ENOENT || inode->i_nlink) {
(...)
trans = btrfs_start_transaction(root, 1);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
iput(inode);
goto out;
}
(...)
ret = btrfs_del_orphan_item(trans, root,
found_key.objectid);
btrfs_end_transaction(trans);
if (ret) {
iput(inode);
goto out;
}
continue;
}
If we get an error from btrfs_start_transaction() or from the call to
btrfs_del_orphan_item() we end calling iput() against an inode pointer
that has a value of ERR_PTR(-ENOENT), resulting in a crash with the
following trace:
[876.667] BUG: kernel NULL pointer dereference, address: 0000000000000096
[876.667] #PF: supervisor read access in kernel mode
[876.667] #PF: error_code(0x0000) - not-present page
[876.667] PGD 0 P4D 0
[876.668] Oops: 0000 [#1] PREEMPT SMP PTI
[876.668] CPU: 0 PID: 2356187 Comm: mount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[876.668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[876.668] RIP: 0010:iput+0xa/0x20
[876.668] Code: ff ff ff 66 (...)
[876.669] RSP: 0018:ffffafa9c0c9f9d0 EFLAGS: 00010282
[876.669] RAX: ffffffffffffffe4 RBX: 000000000009453b RCX: 0000000000000000
[876.669] RDX: 0000000000000001 RSI: ffffafa9c0c9f930 RDI: fffffffffffffffe
[876.669] RBP: ffff95c612f3b800 R08: 0000000000000001 R09: ffffffffffffffe4
[876.670] R10: 00018f2a71010000 R11: 000000000ead96e3 R12: ffff95cb7d6909a0
[876.670] R13: fffffffffffffffe R14: ffff95c60f477000 R15: 00000000ffffffe4
[876.670] FS: 00007f5fbe30a840(0000) GS:ffff95ccdfa00000(0000) knlGS:0000000000000000
[876.670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[876.671] CR2: 0000000000000096 CR3: 000000055e9f6004 CR4: 0000000000370ef0
[876.671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[876.671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[876.672] Call Trace:
[876.744] <TASK>
[876.744] ? __die_body+0x1b/0x60
[876.744] ? page_fault_oops+0x15d/0x450
[876.745] ? __kmem_cache_alloc_node+0x47/0x410
[876.745] ? do_user_addr_fault+0x65/0x8a0
[876.745] ? exc_page_fault+0x74/0x170
[876.746] ? asm_exc_page_fault+0x22/0x30
[876.746] ? iput+0xa/0x20
[876.746] btrfs_orphan_cleanup+0x221/0x330 [btrfs]
[876.746] btrfs_lookup_dentry+0x58f/0x5f0 [btrfs]
[876.747] btrfs_lookup+0xe/0x30 [btrfs]
[876.747] __lookup_slow+0x82/0x130
[876.785] walk_component+0xe5/0x160
[876.786] path_lookupat.isra.0+0x6e/0x150
[876.786] filename_lookup+0xcf/0x1a0
[876.786] ? mod_objcg_state+0xd2/0x360
[876.786] ? obj_cgroup_charge+0xf5/0x110
[876.787] ? should_failslab+0xa/0x20
[876.787] ? kmem_cache_alloc+0x47/0x450
[876.787] vfs_path_lookup+0x51/0x90
[876.788] mount_subtree+0x8d/0x130
[876.788] btrfs_mount+0x149/0x410 [btrfs]
[876.788] ? __kmem_cache_alloc_node+0x47/0x410
[876.788] ? vfs_parse_fs_param+0xc0/0x110
[876.789] legacy_get_tree+0x24/0x50
[876.834] vfs_get_tree+0x22/0xd0
[876.852] path_mount+0x2d8/0x9c0
[876.852] do_mount+0x79/0x90
[876.852] __x64_sys_mount+0x8e/0xd0
[876.853] do_syscall_64+0x38/0x90
[876.899] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[876.958] RIP: 0033:0x7f5fbe50b76a
[876.959] Code: 48 8b 0d a9 (...)
[876.959] RSP: 002b:00007fff01925798 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[876.959] RAX: ffffffffffffffda RBX: 00007f5fbe694264 RCX: 00007f5fbe50b76a
[876.960] RDX: 0000561bde6c8720 RSI: 0000561bde6bdec0 RDI: 0000561bde6c31a0
[876.960] RBP: 0000561bde6bdc70 R08: 0000000000000000 R09: 0000000000000001
[876.960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[876.960] R13: 0000561bde6c31a0 R14: 0000561bde6c8720 R15: 0000561bde6bdc70
[876.960] </TASK>
So fix this by setting 'inode' to NULL whenever we get an error from
btrfs_iget(), and to make the code simpler, stop testing for 'ret' being
-ENOENT to check if we have an inode - instead test for 'inode' being NULL
or not. Having a NULL 'inode' prevents any iput() call from crashing, as
iput() ignores NULL inode pointers. Also, stop testing for a NULL return
value from btrfs_iget() with PTR_ERR_OR_ZERO(), because btrfs_iget() never
returns NULL - in case an inode is not found, it returns ERR_PTR(-ENOENT),
and in case of memory allocation failure, it returns ERR_PTR(-ENOMEM).
We also don't need the extra iput() calls on the error branches for the
btrfs_start_transaction() and btrfs_del_orphan_item() calls, as we have
already called iput() before, so remove them.
Fixes: a13bb2c03848 ("btrfs: add missing iputs on orphan cleanup failure")
CC: stable@vger.kernel.org # 6.4
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At btrfs_orphan_cleanup(), if we were able to find the inode, we do an
iput() on the inode, then if btrfs_drop_verity_items() succeeds and then
either btrfs_start_transaction() or btrfs_del_orphan_item() fail, we do
another iput() in the respective error paths, resulting in an extra iput()
on the inode.
Fix this by setting inode to NULL after the first iput(), as iput()
ignores a NULL inode pointer argument.
Fixes: a13bb2c03848 ("btrfs: add missing iputs on orphan cleanup failure")
CC: stable@vger.kernel.org # 6.4
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At exclude_super_stripes(), if we happen to find a block group that has
super blocks mapped to it and we are on a zoned filesystem, we error out
as this is not supposed to happen, indicating either a bug or maybe some
memory corruption for example. However we are exiting the function without
freeing the memory allocated for the logical address of the super blocks.
Fix this by freeing the logical address.
Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pstore supports compression using a variety of algorithms exposed by the
crypto API. This uses the deprecated comp (as opposed to scomp/acomp)
API, and so we should stop using that, and either move to the new API,
or switch to a different approach entirely.
Given that we only compress ASCII text in pstore, and considering that
this happens when the system is likely to be in an unstable state, the
flexibility that the complex crypto API provides does not outweigh its
impact on the risk that we might encounter additional problems when
trying to commit the kernel log contents to the pstore backend.
So let's switch [back] to the zlib deflate library API, and remove all
the complexity that really has no place in a low-level diagnostic
facility. Note that, while more modern compression algorithms have been
added to the kernel in recent years, the code size of zlib deflate is
substantially smaller than, e.g., zstd, while its performance in terms
of compression ratio is comparable for ASCII text, and speed is deemed
irrelevant in this context.
Note that this means that compressed pstore records may no longer be
accessible after a kernel upgrade, but this has never been part of the
contract. (The choice of compression algorithm is not stored in the
pstore records either)
Tested-by: "Guilherme G. Piccoli" <gpiccoli@igalia.com> # Steam Deck
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230712162332.2670437-3-ardb@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
The worst case compression size used by pstore gives an upper bound for
how much the data might inadvertently *grow* due to encapsulation
overhead if the input is not compressible at all.
Given that pstore records have individual 'compressed' flags, we can
simply store the uncompressed data if compressing it would end up using
more space, making the worst case identical to the uncompressed case.
This means we can just drop all the elaborate logic that reasons about
upper bounds for each individual compression algorithm, and just store
the uncompressed data directly if compression fails for any reason.
Co-developed-by: Kees Cook <keescook@chromium.org>
Tested-by: "Guilherme G. Piccoli" <gpiccoli@igalia.com> # Steam Deck
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230712162332.2670437-2-ardb@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Jan Kara suggested that when one thread is in the middle of freezing a
filesystem, another thread trying to freeze the same fs but with a
different freeze_holder should wait until the freezer reaches either end
state (UNFROZEN or COMPLETE) instead of returning EBUSY immediately.
Neither caller can do anything sensible with this race other than retry
but they cannot really distinguish EBUSY as in "some other holder of the
same type has the sb already frozen" from "freezing raced with holder of
a different type".
Plumb in the extra code needed to wait for the fs freezer to reach an
end state and try the freeze again.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Userspace can freeze a filesystem using the FIFREEZE ioctl or by
suspending the block device; this state persists until userspace thaws
the filesystem with the FITHAW ioctl or resuming the block device.
Since commit 18e9e5104fcd ("Introduce freeze_super and thaw_super for
the fsfreeze ioctl") we only allow the first freeze command to succeed.
The kernel may decide that it is necessary to freeze a filesystem for
its own internal purposes, such as suspends in progress, filesystem fsck
activities, or quiescing a device prior to removal. Userspace thaw
commands must never break a kernel freeze, and kernel thaw commands
shouldn't undo userspace's freeze command.
Introduce a couple of freeze holder flags and wire it into the
sb_writers state. One kernel and one userspace freeze are allowed to
coexist at the same time; the filesystem will not thaw until both are
lifted.
I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but
for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing
behaviors.
Cc: mcgrof@kernel.org
Cc: jack@suse.cz
Cc: hch@infradead.org
Cc: ruansy.fnst@fujitsu.com
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
We have the new value for ki_pos right at hand in iter.pos, so assign
that instead of recalculating it from ret.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>
|
|
When write* wrote some data it should return the amount of written data
and not the error code that caused it to stop. Fix a recent regression
in iomap_file_buffered_write that caused it to return the errno instead.
Fixes: 219580eea1ee ("iomap: update ki_pos in iomap_file_buffered_write")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>
|
|
As of 6.5-rc1, UBSAN trips over the ondisk extended attribute shortform
definitions using an array length of 1 to pretend to be a flex array.
Kernel compilers have to support unbounded array declarations, so let's
correct this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
|
|
As of 6.5-rc1, UBSAN trips over the ondisk extended attribute leaf block
definitions using an array length of 1 to pretend to be a flex array.
Kernel compilers have to support unbounded array declarations, so let's
correct this.
================================================================================
UBSAN: array-index-out-of-bounds in fs/xfs/libxfs/xfs_attr_leaf.c:2535:24
index 2 is out of range for type '__u8 [1]'
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x50
__ubsan_handle_out_of_bounds+0x9c/0xd0
xfs_attr3_leaf_getvalue+0x2ce/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_leaf_get+0x148/0x1c0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_get_ilocked+0xae/0x110 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_get+0xee/0x150 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_xattr_get+0x7d/0xc0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
__vfs_getxattr+0xa3/0x100
vfs_getxattr+0x87/0x1d0
do_getxattr+0x17a/0x220
getxattr+0x89/0xf0
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
|
|
As of 6.5-rc1, UBSAN trips over the attrlist ioctl definitions using an
array length of 1 to pretend to be a flex array. Kernel compilers have
to support unbounded array declarations, so let's correct this. This
may cause friction with userspace header declarations, but suck is life.
================================================================================
UBSAN: array-index-out-of-bounds in fs/xfs/xfs_ioctl.c:345:18
index 1 is out of range for type '__s32 [1]'
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x50
__ubsan_handle_out_of_bounds+0x9c/0xd0
xfs_ioc_attr_put_listent+0x413/0x420 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_list_ilocked+0x170/0x850 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attr_list+0xb7/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_ioc_attr_list+0x13b/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_attrlist_by_handle+0xab/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
xfs_file_ioctl+0x1ff/0x15e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09]
vfs_ioctl+0x1f/0x60
The kernel and xfsprogs code that uses these structures will not have
problems, but the long tail of external user programs might.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
|
|
Pull smb client fixes from Steve French:
- Two reconnect fixes: important fix to address inFlight count to leak
(which can leak credits), and fix for better handling a deleted share
- DFS fix
- SMB1 cleanup fix
- deferred close fix
* tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix mid leak during reconnection after timeout threshold
cifs: is_network_name_deleted should return a bool
smb: client: fix missed ses refcounting
smb: client: Fix -Wstringop-overflow issues
cifs: if deferred close is disabled then close files immediately
|
|
There is a potential deadlock reported by syzbot as below:
======================================================
WARNING: possible circular locking dependency detected
6.4.0-next-20230707-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor330/5073 is trying to acquire lock:
ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: mmap_read_lock_killable include/linux/mmap_lock.h:151 [inline]
ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: get_mmap_lock_carefully mm/memory.c:5293 [inline]
ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: lock_mm_and_find_vma+0x369/0x510 mm/memory.c:5344
but task is already holding lock:
ffff888019f760e0 (&sbi->s_lock){+.+.}-{3:3}, at: exfat_iterate+0x117/0xb50 fs/exfat/dir.c:232
which lock already depends on the new lock.
Chain exists of:
&mm->mmap_lock --> mapping.invalidate_lock#3 --> &sbi->s_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sbi->s_lock);
lock(mapping.invalidate_lock#3);
lock(&sbi->s_lock);
rlock(&mm->mmap_lock);
Let's try to avoid above potential deadlock condition by moving dir_emit*()
out of sbi->s_lock coverage.
Fixes: ca06197382bd ("exfat: add directory operations")
Cc: stable@vger.kernel.org #v5.7+
Reported-by: syzbot+1741a5d9b79989c10bdc@syzkaller.appspotmail.com
Link: https://lore.kernel.org/lkml/00000000000078ee7e060066270b@google.com/T/#u
Tested-by: syzbot+1741a5d9b79989c10bdc@syzkaller.appspotmail.com
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
When the number of responses with status of STATUS_IO_TIMEOUT
exceeds a specified threshold (NUM_STATUS_IO_TIMEOUT), we reconnect
the connection. But we do not return the mid, or the credits
returned for the mid, or reduce the number of in-flight requests.
This bug could result in the server->in_flight count to go bad,
and also cause a leak in the mids.
This change moves the check to a few lines below where the
response is decrypted, even of the response is read from the
transform header. This way, the code for returning the mids
can be reused.
Also, the cifs_reconnect was reconnecting just the transport
connection before. In case of multi-channel, this may not be
what we want to do after several timeouts. Changed that to
reconnect the session and the tree too.
Also renamed NUM_STATUS_IO_TIMEOUT to a more appropriate name
MAX_STATUS_IO_TIMEOUT.
Fixes: 8e670f77c4a5 ("Handle STATUS_IO_TIMEOUT gracefully")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Currently, is_network_name_deleted and it's implementations
do not return anything if the network name did get deleted.
So the function doesn't fully achieve what it advertizes.
Changed the function to return a bool instead. It will now
return true if the error returned is STATUS_NETWORK_NAME_DELETED
and the share (tree id) was found to be connected. It returns
false otherwise.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The d_hash_and_lookup() function returns error pointers or NULL.
Most incorrect error checks were fixed, but the one in int path_pts()
was forgotten.
Fixes: eedf265aa003 ("devpts: Make each mount of devpts an independent filesystem.")
Signed-off-by: Wang Ming <machel@vivo.com>
Message-Id: <20230713120555.7025-1-machel@vivo.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Changing the mode of symlinks is meaningless as the vfs doesn't take the
mode of a symlink into account during path lookup permission checking.
However, the vfs doesn't block mode changes on symlinks. This however,
has lead to an untenable mess roughly classifiable into the following
two categories:
(1) Filesystems that don't implement a i_op->setattr() for symlinks.
Such filesystems may or may not know that without i_op->setattr()
defined, notify_change() falls back to simple_setattr() causing the
inode's mode in the inode cache to be changed.
That's a generic issue as this will affect all non-size changing
inode attributes including ownership changes.
Example: afs
(2) Filesystems that fail with EOPNOTSUPP but change the mode of the
symlink nonetheless.
Some filesystems will happily update the mode of a symlink but still
return EOPNOTSUPP. This is the biggest source of confusion for
userspace.
The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it
comes from filesystems that call posix_acl_chmod(), e.g., btrfs via
if (!err && attr->ia_valid & ATTR_MODE)
err = posix_acl_chmod(idmap, dentry, inode->i_mode);
Filesystems including btrfs don't implement i_op->set_acl() so
posix_acl_chmod() will report EOPNOTSUPP.
When posix_acl_chmod() is called, most filesystems will have
finished updating the inode.
Perversely, this has the consequences that this behavior may depend
on two kconfig options and mount options:
* CONFIG_POSIX_ACL={y,n}
* CONFIG_${FSTYPE}_POSIX_ACL={y,n}
* Opt_acl, Opt_noacl
Example: btrfs, ext4, xfs
The only way to change the mode on a symlink currently involves abusing
an O_PATH file descriptor in the following manner:
fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW);
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
chmod(path, 0000);
But for most major filesystems with POSIX ACL support such as btrfs,
ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with
the mode still updated due to the aforementioned posix_acl_chmod()
nonsense.
So, given that for all major filesystems this would fail with EOPNOTSUPP
and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode
changes on symlinks we should just try and block mode changes on
symlinks directly in the vfs and have a clean break with this nonsense.
If this causes any regressions, we do the next best thing and fix up all
filesystems that do return EOPNOTSUPP with the mode updated to not call
posix_acl_chmod() on symlinks.
But as usual, let's try the clean cut solution first. It's a simple
patch that can be easily reverted. Not marking this for backport as I'll
do that manually if we're reasonably sure that this works and there are
no strong objections.
We could block this in chmod_common() but it's more appropriate to do it
notify_change() as it will also mean that we catch filesystems that
change symlink permissions explicitly or accidently.
Similar proposals were floated in the past as in [3] and [4] and again
recently in [5]. There's also a couple of bugs about this inconsistency
as in [6] and [7].
Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2]
Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4]
Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5]
Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6]
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7]
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread
cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
chmod operations on /proc/thread-self/comm were no longer blocked as
they are on almost all other procfs files.
A very similar situation with /proc/self/environ was used to as a root
exploit a long time ago, but procfs has SB_I_NOEXEC so this is simply a
correctness issue.
Ref: https://lwn.net/Articles/191954/
Ref: 6d76fa58b050 ("Don't allow chmod() on the /proc/<pid>/ files")
Fixes: 1b3044e39a89 ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230713141001.27046-1-cyphar@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
exfat_extract_uni_name copies characters from a given file name entry into
the 'uniname' variable. This variable is actually defined on the stack of
the exfat_readdir() function. According to the definition of
the 'exfat_uni_name' type, the file name should be limited 255 characters
(+ null teminator space), but the exfat_get_uniname_from_ext_entry()
function can write more characters because there is no check if filename
entries exceeds max filename length. This patch add the check not to copy
filename characters when exceeding max filename length.
Cc: stable@vger.kernel.org
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reported-by: Maxim Suhanov <dfirblog@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
All other files in /proc/$PID/ use proc_setattr().
Not using it allows the usage of chmod() on /proc/$PID/net, even on
other processes owned by the same user.
The same would probably also be true for other attributes to be changed.
As this technically represents an ABI change it is not marked for
stable so any unlikely regressions are caught during a full release cycle.
Fixes: e9720acd728a ("[NET]: Make /proc/net a symlink on /proc/self/net (v3)")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/lkml/d0d111ef-edae-4760-83fb-36db84278da1@t-8ch.de/
Fixes: b4844fa0bdb4 ("selftests/nolibc: implement a few tests for various syscalls")
Tested-by: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Message-Id: <20230624-proc-net-setattr-v1-2-73176812adee@weissschuh.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-39-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-38-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Acked-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230705190309.579783-37-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|