summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-07-23ksmbd: fix out of bounds in smb3_decrypt_req()Namjae Jeon1-1/+2
smb3_decrypt_req() validate if pdu_length is smaller than smb2_transform_hdr size. Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21589 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-23ksmbd: check if a mount point is crossed during path lookupNamjae Jeon4-39/+53
Since commit 74d7970febf7 ("ksmbd: fix racy issue from using ->d_parent and ->d_name"), ksmbd can not lookup cross mount points. If last component is a cross mount point during path lookup, check if it is crossed to follow it down. And allow path lookup to cross a mount point when a crossmnt parameter is set to 'yes' in smb.conf. Cc: stable@vger.kernel.org Fixes: 74d7970febf7 ("ksmbd: fix racy issue from using ->d_parent and ->d_name") Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-23ext4: fix rbtree traversal bug in ext4_mb_use_preallocatedOjaswin Mujoo1-27/+131
During allocations, while looking for preallocations(PA) in the per inode rbtree, we can't do a direct traversal of the tree because ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted and that can cause direct traversal to skip some entries. This was leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy our request and ultimately tried to create a new PA that would overlap with the missed one. To makes sure we handle that case while still keeping the performance of the rbtree, we make use of the fact that the only pa that could possibly overlap the original goal start is the one that satisfies the below conditions: 1. It must have it's logical start immediately to the left of (ie less than) original logical start. 2. It must not be deleted To find this pa we use the following traversal method: 1. Descend into the rbtree normally to find the immediate neighboring PA. Here we keep descending irrespective of if the PA is deleted or if it overlaps with our request etc. The goal is to find an immediately adjacent PA. 2. If the found PA is on right of original goal, use rb_prev() to find the left adjacent PA. 3. Check if this PA is deleted and keep moving left with rb_prev() until a non deleted PA is found. 4. This is the PA we are looking for. Now we can check if it can satisfy the original request and proceed accordingly. This approach also takes care of having deleted PAs in the tree. (While we are at it, also fix a possible overflow bug in calculating the end of a PA) [1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/ Cc: stable@kernel.org # 6.4 Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-23ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail()Ojaswin Mujoo1-5/+9
In ext4_mb_choose_next_group_best_avail(), we want the start order to be 1 less than goal length and the min_order to be, at max, 1 more than the original length. This commit fixes an off by one issue that arose due to the fact that 1 << fls(n) > (n). After all the processing: order = 1 order below goal len min_order = maximum of the three:- - order - trim_order - 1 order below B2C(s_stripe) - 1 order above original len Cc: stable@kernel.org Fixes: 33122aa930 ("ext4: Add allocation criteria 1.5 (CR1_5)") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230609103403.112807-1-ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-23ext4: correct inline offset when handling xattrs in inode bodyEric Whitney1-0/+14
When run on a file system where the inline_data feature has been enabled, xfstests generic/269, generic/270, and generic/476 cause ext4 to emit error messages indicating that inline directory entries are corrupted. This occurs because the inline offset used to locate inline directory entries in the inode body is not updated when an xattr in that shared region is deleted and the region is shifted in memory to recover the space it occupied. If the deleted xattr precedes the system.data attribute, which points to the inline directory entries, that attribute will be moved further up in the region. The inline offset continues to point to whatever is located in system.data's former location, with unfortunate effects when used to access directory entries or (presumably) inline data in the inode body. Cc: stable@kernel.org Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20230522181520.1570360-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-07-22cifs: update internal module version number for cifs.koSteve French1-2/+2
From 2.43 to 2.44 Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-22cifs: allow dumping keys for directories tooShyam Prasad N1-4/+13
Dumping the enc/dec keys is a session wide operation. And it should not matter if the ioctl was run on a regular file or a directory. Currently, we obtain the tcon pointer from the cifs file handle. But since there's no dir open call in cifs, this is not populated for dirs. This change allows dumping of session keys using ioctl even for directories. To do this, we'll now get the tcon pointer from the superblock, and not from the file handle. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-20fs/9p: Remove unused extern declarationYueHaibing1-2/+0
commit bd238fb431f3 ("9p: Reorganization of 9p file system code") left behind this. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-07-209p: remove dead stores (variable set again without being read)Dominique Martinet2-7/+0
The 9p code for some reason used to initialize variables outside of the declaration, e.g. instead of just initializing the variable like this: int retval = 0 We would be doing this: int retval; retval = 0; This is perfectly fine and the compiler will just optimize dead stores anyway, but scan-build seems to think this is a problem and there are many of these warnings making the output of scan-build full of such warnings: fs/9p/vfs_inode.c:916:2: warning: Value stored to 'retval' is never read [deadcode.DeadStores] retval = 0; ^ ~ I have no strong opinion here, but if we want to regularly run scan-build we should fix these just to silence the messages. I've confirmed these all are indeed ok to remove. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-07-209p: fix ignored return value in v9fs_dir_releaseDominique Martinet1-2/+3
retval from filemap_fdatawrite was immediately overwritten by the following p9_fid_put: preserve any error in fdatawrite if there was any first. This fixes the following scan-build warning: fs/9p/vfs_dir.c:220:4: warning: Value stored to 'retval' is never read [deadcode.DeadStores] retval = filemap_fdatawrite(inode->i_mapping); ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: 89c58cb395ec ("fs/9p: fix error reporting in v9fs_dir_release") Cc: stable@vger.kernel.org Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-07-20btrfs: account block group tree when calculating global reserve sizeFilipe Manana1-0/+5
When using the block group tree feature, this tree is a critical tree just like the extent, csum and free space trees, and just like them it uses the delayed refs block reserve. So take into account the block group tree, and its current size, when calculating the size for the global reserve. CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-20btrfs: zoned: do not enable async discardNaohiro Aota2-1/+9
The zoned mode need to reset a zone before using it. We rely on btrfs's original discard functionality (discarding unused block group range) to do the resetting. While the commit 63a7cb130718 ("btrfs: auto enable discard=async when possible") made the discard done in an async manner, a zoned reset do not need to be async, as it is fast enough. Even worth, delaying zone rests prevents using those zones again. So, let's disable async discard on the zoned mode. Fixes: 63a7cb130718 ("btrfs: auto enable discard=async when possible") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update message text ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-20Merge tag 'iomap-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-2/+2
Pull iomap fix from Darrick Wong: "Fix partial write regression. It turns out that fstests doesn't have any test coverage for short writes, but LTP does. Fortunately, this was caught right after -rc1 was tagged. Summary: - Fix a bug wherein a failed write could clobber short write status" * tag 'iomap-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: micro optimize the ki_pos assignment in iomap_file_buffered_write iomap: fix a regression for partial write errors
2023-07-20Merge tag 'xfs-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds3-13/+71
Pull xfs fixes from Darrick Wong: "Flexarray declaration conversions. This probably should've been done with the merge window open, but I was not aware that the UBSAN knob would be getting turned up for 6.5, and the fstests failures due to the kernel warnings are getting in the way of testing. Summary: - Convert all the array[1] declarations into the accepted flex array[] declarations so that UBSAN and friends will not get confused" * tag 'xfs-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: convert flex-array declarations in xfs attr shortform objects xfs: convert flex-array declarations in xfs attr leaf blocks xfs: convert flex-array declarations in struct xfs_attrlist*
2023-07-20fs/9p: remove unnecessary invalidate_inode_pages2Eric Van Hensbergen1-1/+0
There was an invalidate_inode_pages2 added to readonly mmap path that is unnecessary since that path is only entered when writeback cache is disabled on mount. Cc: stable@vger.kernel.org Fixes: 1543b4c5071c ("fs/9p: remove writeback fid and fix per-file modes") Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-07-20fs/9p: fix type mismatch in file cache mode helperEric Van Hensbergen1-2/+2
There were two flags (s_flags and s_cache) which had incorrect signed type in the parameters of the file cache mode helper function. Cc: stable@vger.kernel.org Fixes: 1543b4c5071c ("fs/9p: remove writeback fid and fix per-file modes") Reviewed-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-07-20fs/9p: fix typo in comparison logic for cache modeEric Van Hensbergen1-1/+1
There appears to be a typo in the comparison statement for the logic which sets a file's cache mode based on mount flags. Cc: stable@vger.kernel.org Fixes: 1543b4c5071c ("fs/9p: remove writeback fid and fix per-file modes") Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Reviewed-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-07-20fs/9p: remove unnecessary and overrestrictive checkEric Van Hensbergen1-3/+1
This eliminates a check for shared that was overrestrictive and prevented read-only mmaps when writeback caches weren't enabled. Cc: stable@vger.kernel.org Fixes: 1543b4c5071c ("fs/9p: remove writeback fid and fix per-file modes") Reported-by: Robert Schwebel <r.schwebel@pengutronix.de> Closes: https://lore.kernel.org/v9fs/ZK25XZ%2BGpR3KHIB%2F@pengutronix.de Reviewed-by: Dominique Martinet <asmadeus@codewreck.org> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-07-20Merge tag 'for-6.5-rc2-tag' of ↵Linus Torvalds6-46/+79
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Stable fixes: - fix race between balance and cancel/pause - various iput() fixes - fix use-after-free of new block group that became unused - fix warning when putting transaction with qgroups enabled after abort - fix crash in subpage mode when page could be released between map and map read - when scrubbing raid56 verify the P/Q stripes unconditionally - fix minor memory leak in zoned mode when a block group with an unexpected superblock is found Regression fixes: - fix ordered extent split error handling when submitting direct IO - user irq-safe locking when adding delayed iputs" * tag 'for-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix warning when putting transaction with qgroups enabled after abort btrfs: fix ordered extent split error handling in btrfs_dio_submit_io btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand btrfs: raid56: always verify the P/Q contents for scrub btrfs: use irq safe locking when running and adding delayed iputs btrfs: fix iput() on error pointer after error during orphan cleanup btrfs: fix double iput() on inode after an error during orphan cleanup btrfs: zoned: fix memory leak after finding block group with super blocks btrfs: fix use-after-free of new block group that became unused btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block btrfs: fix race between balance and cancel/pause
2023-07-19Merge tag 'fuse-update-6.5' of ↵Linus Torvalds3-13/+20
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Small but important fixes and a trivial cleanup" * tag 'fuse-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: ioctl: translate ENOSYS in outarg fuse: revalidate: don't invalidate if interrupted fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT fuse: remove duplicate check for nodeid fuse: add feature flag for expire-only
2023-07-18nfsd: Remove incorrect check in nfsd4_validate_stateidTrond Myklebust1-2/+0
If the client is calling TEST_STATEID, then it is because some event occurred that requires it to check all the stateids for validity and call FREE_STATEID on the ones that have been revoked. In this case, either the stateid exists in the list of stateids associated with that nfs4_client, in which case it should be tested, or it does not. There are no additional conditions to be considered. Reported-by: "Frank Ch. Eigler" <fche@redhat.com> Fixes: 7df302f75ee2 ("NFSD: TEST_STATEID should not return NFS4ERR_STALE_STATEID") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-07-18btrfs: fix warning when putting transaction with qgroups enabled after abortFilipe Manana1-0/+1
If we have a transaction abort with qgroups enabled we get a warning triggered when doing the final put on the transaction, like this: [552.6789] ------------[ cut here ]------------ [552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs] [552.6817] Modules linked in: btrfs blake2b_generic xor (...) [552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs] [552.6821] Code: bd a0 01 00 (...) [552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286 [552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000 [552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010 [552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20 [552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70 [552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028 [552.6821] FS: 0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000 [552.6821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0 [552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [552.6822] Call Trace: [552.6822] <TASK> [552.6822] ? __warn+0x80/0x130 [552.6822] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6824] ? report_bug+0x1f4/0x200 [552.6824] ? handle_bug+0x42/0x70 [552.6824] ? exc_invalid_op+0x14/0x70 [552.6824] ? asm_exc_invalid_op+0x16/0x20 [552.6824] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6826] btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs] [552.6828] ? _raw_spin_unlock_irqrestore+0x23/0x40 [552.6828] ? try_to_wake_up+0x94/0x5e0 [552.6828] ? __pfx_process_timeout+0x10/0x10 [552.6828] transaction_kthread+0x103/0x1d0 [btrfs] [552.6830] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] [552.6832] kthread+0xee/0x120 [552.6832] ? __pfx_kthread+0x10/0x10 [552.6832] ret_from_fork+0x29/0x50 [552.6832] </TASK> [552.6832] ---[ end trace 0000000000000000 ]--- This corresponds to this line of code: void btrfs_put_transaction(struct btrfs_transaction *transaction) { (...) WARN_ON(!RB_EMPTY_ROOT( &transaction->delayed_refs.dirty_extent_root)); (...) } The warning happens because btrfs_qgroup_destroy_extent_records(), called in the transaction abort path, we free all entries from the rbtree "dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we don't actually empty the rbtree - it's still pointing to nodes that were freed. So set the rbtree's root node to NULL to avoid this warning (assign RB_ROOT). Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: fix ordered extent split error handling in btrfs_dio_submit_ioChristoph Hellwig1-2/+5
When the call to btrfs_extract_ordered_extent in btrfs_dio_submit_io fails to allocate memory for a new ordered_extent, it calls into the btrfs_dio_end_io for error handling. btrfs_dio_end_io then assumes that bbio->ordered is set because it is supposed to be at this point, except for this error handling corner case. Try to not overload the btrfs_dio_end_io with error handling of a bio in a non-canonical state, and instead call btrfs_finish_ordered_extent and iomap_dio_bio_end_io directly for this error case. Reported-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com> Fixes: b41b6f6937dc ("btrfs: use btrfs_finish_ordered_extent to complete direct writes") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expandJosef Bacik1-3/+11
While trying to get the subpage blocksize tests running, I hit the following panic on generic/476 assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229 kernel BUG at fs/btrfs/subpage.c:229! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023 pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : btrfs_subpage_assert+0xbc/0xf0 lr : btrfs_subpage_assert+0xbc/0xf0 Call trace: btrfs_subpage_assert+0xbc/0xf0 btrfs_subpage_clear_checked+0x38/0xc0 btrfs_page_clear_checked+0x48/0x98 btrfs_truncate_block+0x5d0/0x6a8 btrfs_cont_expand+0x5c/0x528 btrfs_write_check.isra.0+0xf8/0x150 btrfs_buffered_write+0xb4/0x760 btrfs_do_write_iter+0x2f8/0x4b0 btrfs_file_write_iter+0x1c/0x30 do_iter_readv_writev+0xc8/0x158 do_iter_write+0x9c/0x210 vfs_iter_write+0x24/0x40 iter_file_splice_write+0x224/0x390 direct_splice_actor+0x38/0x68 splice_direct_to_actor+0x12c/0x260 do_splice_direct+0x90/0xe8 generic_copy_file_range+0x50/0x90 vfs_copy_file_range+0x29c/0x470 __arm64_sys_copy_file_range+0xcc/0x498 invoke_syscall.constprop.0+0x80/0xd8 do_el0_svc+0x6c/0x168 el0_svc+0x50/0x1b0 el0t_64_sync_handler+0x114/0x120 el0t_64_sync+0x194/0x198 This happens because during btrfs_cont_expand we'll get a page, set it as mapped, and if it's not Uptodate we'll read it. However between the read and re-locking the page we could have called release_folio() on the page, but left the page in the file mapping. release_folio() can clear the page private, and thus further down we blow up when we go to modify the subpage bits. Fix this by putting the set_page_extent_mapped() after the read. This is safe because read_folio() will call set_page_extent_mapped() before it does the read, and then if we clear page private but leave it on the mapping we're completely safe re-setting set_page_extent_mapped(). With this patch I can now run generic/476 without panicing. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: raid56: always verify the P/Q contents for scrubQu Wenruo1-8/+3
[REGRESSION] Commit 75b470332965 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") changed the behavior of scrub_rbio(). Initially if we have no error reading the raid bio, we will assign @need_check to true, then finish_parity_scrub() would later verify the content of P/Q stripes before writeback. But after that commit we never verify the content of P/Q stripes and just writeback them. This can lead to unrepaired P/Q stripes during scrub, or already corrupted P/Q copied to the dev-replace target. [FIX] The situation is more complex than the regression, in fact the initial behavior is not 100% correct either. If we have the following rare case, it can still lead to the same problem using the old behavior: 0 16K 32K 48K 64K Data 1: |IIIIIII| | Data 2: | | Parity: | |CCCCCCC| | Where "I" means IO error, "C" means corruption. In the above case, we're scrubbing the parity stripe, then read out all the contents of Data 1, Data 2, Parity stripes. But found IO error in Data 1, which leads to rebuild using Data 2 and Parity and got the correct data. In that case, we would not verify if the Parity is correct for range [16K, 32K). So here we have to always verify the content of Parity no matter if we did recovery or not. This patch would remove the @need_check parameter of finish_parity_scrub() completely, and would always do the P/Q verification before writeback. Fixes: 75b470332965 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") CC: stable@vger.kernel.org # 6.2+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: use irq safe locking when running and adding delayed iputsFilipe Manana1-10/+25
Running delayed iputs, which never happens in an irq context, needs to lock the spinlock fs_info->delayed_iput_lock. When finishing bios for data writes (irq context, bio.c) we call btrfs_put_ordered_extent() which needs to add a delayed iput and for that it needs to acquire the spinlock fs_info->delayed_iput_lock. Without disabling irqs when running delayed iputs we can therefore deadlock on that spinlock. The same deadlock can also happen when adding an inode to the delayed iputs list, since this can be done outside an irq context as well. Syzbot recently reported this, which results in the following trace: ================================ WARNING: inconsistent lock state 6.4.0-syzkaller-09904-ga507db1d8fdc #0 Not tainted -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. btrfs-cleaner/16079 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline] ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523 {IN-SOFTIRQ-W} state was registered at: lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:350 [inline] btrfs_add_delayed_iput+0x128/0x390 fs/btrfs/inode.c:3490 btrfs_put_ordered_extent fs/btrfs/ordered-data.c:559 [inline] btrfs_put_ordered_extent+0x2f6/0x610 fs/btrfs/ordered-data.c:547 __btrfs_bio_end_io fs/btrfs/bio.c:118 [inline] __btrfs_bio_end_io+0x136/0x180 fs/btrfs/bio.c:112 btrfs_orig_bbio_end_io+0x86/0x2b0 fs/btrfs/bio.c:163 btrfs_simple_end_io+0x105/0x380 fs/btrfs/bio.c:378 bio_endio+0x589/0x690 block/bio.c:1617 req_bio_endio block/blk-mq.c:766 [inline] blk_update_request+0x5c5/0x1620 block/blk-mq.c:911 blk_mq_end_request+0x59/0x680 block/blk-mq.c:1032 lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370 blk_complete_reqs+0xb3/0xf0 block/blk-mq.c:1110 __do_softirq+0x1d4/0x905 kernel/softirq.c:553 run_ksoftirqd kernel/softirq.c:921 [inline] run_ksoftirqd+0x31/0x60 kernel/softirq.c:913 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164 kthread+0x344/0x440 kernel/kthread.c:389 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 irq event stamp: 39 hardirqs last enabled at (39): [<ffffffff81d5ebc4>] __do_kmem_cache_free mm/slab.c:3558 [inline] hardirqs last enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free mm/slab.c:3582 [inline] hardirqs last enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free+0x244/0x370 mm/slab.c:3575 hardirqs last disabled at (38): [<ffffffff81d5eb5e>] __do_kmem_cache_free mm/slab.c:3553 [inline] hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free mm/slab.c:3582 [inline] hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free+0x1de/0x370 mm/slab.c:3575 softirqs last enabled at (0): [<ffffffff814ac99f>] copy_process+0x227f/0x75c0 kernel/fork.c:2448 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs_info->delayed_iput_lock); <Interrupt> lock(&fs_info->delayed_iput_lock); *** DEADLOCK *** 1 lock held by btrfs-cleaner/16079: #0: ffff888107804860 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x103/0x4b0 fs/btrfs/disk-io.c:1463 stack backtrace: CPU: 3 PID: 16079 Comm: btrfs-cleaner Not tainted 6.4.0-syzkaller-09904-ga507db1d8fdc #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_usage_bug kernel/locking/lockdep.c:3978 [inline] valid_state kernel/locking/lockdep.c:4020 [inline] mark_lock_irq kernel/locking/lockdep.c:4223 [inline] mark_lock.part.0+0x1102/0x1960 kernel/locking/lockdep.c:4685 mark_lock kernel/locking/lockdep.c:4649 [inline] mark_usage kernel/locking/lockdep.c:4598 [inline] __lock_acquire+0x8e4/0x5e20 kernel/locking/lockdep.c:5098 lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:350 [inline] btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523 cleaner_kthread+0x2e5/0x4b0 fs/btrfs/disk-io.c:1478 kthread+0x344/0x440 kernel/kthread.c:389 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 </TASK> So fix this by using spin_lock_irq() and spin_unlock_irq() when running delayed iputs, and using spin_lock_irqsave() and spin_unlock_irqrestore() when adding a delayed iput(). Reported-by: syzbot+da501a04be5ff533b102@syzkaller.appspotmail.com Fixes: ec63b84d4611 ("btrfs: add an ordered_extent pointer to struct btrfs_bio") Link: https://lore.kernel.org/linux-btrfs/000000000000d5c89a05ffbd39dd@google.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: fix iput() on error pointer after error during orphan cleanupFilipe Manana1-10/+10
At btrfs_orphan_cleanup(), if we can't find an inode (btrfs_iget() returns an -ENOENT error pointer), we proceed with 'ret' set to -ENOENT and the inode pointer set to ERR_PTR(-ENOENT). Later when we proceed to the body of the following if statement: if (ret == -ENOENT || inode->i_nlink) { (...) trans = btrfs_start_transaction(root, 1); if (IS_ERR(trans)) { ret = PTR_ERR(trans); iput(inode); goto out; } (...) ret = btrfs_del_orphan_item(trans, root, found_key.objectid); btrfs_end_transaction(trans); if (ret) { iput(inode); goto out; } continue; } If we get an error from btrfs_start_transaction() or from the call to btrfs_del_orphan_item() we end calling iput() against an inode pointer that has a value of ERR_PTR(-ENOENT), resulting in a crash with the following trace: [876.667] BUG: kernel NULL pointer dereference, address: 0000000000000096 [876.667] #PF: supervisor read access in kernel mode [876.667] #PF: error_code(0x0000) - not-present page [876.667] PGD 0 P4D 0 [876.668] Oops: 0000 [#1] PREEMPT SMP PTI [876.668] CPU: 0 PID: 2356187 Comm: mount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [876.668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [876.668] RIP: 0010:iput+0xa/0x20 [876.668] Code: ff ff ff 66 (...) [876.669] RSP: 0018:ffffafa9c0c9f9d0 EFLAGS: 00010282 [876.669] RAX: ffffffffffffffe4 RBX: 000000000009453b RCX: 0000000000000000 [876.669] RDX: 0000000000000001 RSI: ffffafa9c0c9f930 RDI: fffffffffffffffe [876.669] RBP: ffff95c612f3b800 R08: 0000000000000001 R09: ffffffffffffffe4 [876.670] R10: 00018f2a71010000 R11: 000000000ead96e3 R12: ffff95cb7d6909a0 [876.670] R13: fffffffffffffffe R14: ffff95c60f477000 R15: 00000000ffffffe4 [876.670] FS: 00007f5fbe30a840(0000) GS:ffff95ccdfa00000(0000) knlGS:0000000000000000 [876.670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [876.671] CR2: 0000000000000096 CR3: 000000055e9f6004 CR4: 0000000000370ef0 [876.671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [876.671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [876.672] Call Trace: [876.744] <TASK> [876.744] ? __die_body+0x1b/0x60 [876.744] ? page_fault_oops+0x15d/0x450 [876.745] ? __kmem_cache_alloc_node+0x47/0x410 [876.745] ? do_user_addr_fault+0x65/0x8a0 [876.745] ? exc_page_fault+0x74/0x170 [876.746] ? asm_exc_page_fault+0x22/0x30 [876.746] ? iput+0xa/0x20 [876.746] btrfs_orphan_cleanup+0x221/0x330 [btrfs] [876.746] btrfs_lookup_dentry+0x58f/0x5f0 [btrfs] [876.747] btrfs_lookup+0xe/0x30 [btrfs] [876.747] __lookup_slow+0x82/0x130 [876.785] walk_component+0xe5/0x160 [876.786] path_lookupat.isra.0+0x6e/0x150 [876.786] filename_lookup+0xcf/0x1a0 [876.786] ? mod_objcg_state+0xd2/0x360 [876.786] ? obj_cgroup_charge+0xf5/0x110 [876.787] ? should_failslab+0xa/0x20 [876.787] ? kmem_cache_alloc+0x47/0x450 [876.787] vfs_path_lookup+0x51/0x90 [876.788] mount_subtree+0x8d/0x130 [876.788] btrfs_mount+0x149/0x410 [btrfs] [876.788] ? __kmem_cache_alloc_node+0x47/0x410 [876.788] ? vfs_parse_fs_param+0xc0/0x110 [876.789] legacy_get_tree+0x24/0x50 [876.834] vfs_get_tree+0x22/0xd0 [876.852] path_mount+0x2d8/0x9c0 [876.852] do_mount+0x79/0x90 [876.852] __x64_sys_mount+0x8e/0xd0 [876.853] do_syscall_64+0x38/0x90 [876.899] entry_SYSCALL_64_after_hwframe+0x72/0xdc [876.958] RIP: 0033:0x7f5fbe50b76a [876.959] Code: 48 8b 0d a9 (...) [876.959] RSP: 002b:00007fff01925798 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [876.959] RAX: ffffffffffffffda RBX: 00007f5fbe694264 RCX: 00007f5fbe50b76a [876.960] RDX: 0000561bde6c8720 RSI: 0000561bde6bdec0 RDI: 0000561bde6c31a0 [876.960] RBP: 0000561bde6bdc70 R08: 0000000000000000 R09: 0000000000000001 [876.960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [876.960] R13: 0000561bde6c31a0 R14: 0000561bde6c8720 R15: 0000561bde6bdc70 [876.960] </TASK> So fix this by setting 'inode' to NULL whenever we get an error from btrfs_iget(), and to make the code simpler, stop testing for 'ret' being -ENOENT to check if we have an inode - instead test for 'inode' being NULL or not. Having a NULL 'inode' prevents any iput() call from crashing, as iput() ignores NULL inode pointers. Also, stop testing for a NULL return value from btrfs_iget() with PTR_ERR_OR_ZERO(), because btrfs_iget() never returns NULL - in case an inode is not found, it returns ERR_PTR(-ENOENT), and in case of memory allocation failure, it returns ERR_PTR(-ENOMEM). We also don't need the extra iput() calls on the error branches for the btrfs_start_transaction() and btrfs_del_orphan_item() calls, as we have already called iput() before, so remove them. Fixes: a13bb2c03848 ("btrfs: add missing iputs on orphan cleanup failure") CC: stable@vger.kernel.org # 6.4 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: fix double iput() on inode after an error during orphan cleanupFilipe Manana1-0/+1
At btrfs_orphan_cleanup(), if we were able to find the inode, we do an iput() on the inode, then if btrfs_drop_verity_items() succeeds and then either btrfs_start_transaction() or btrfs_del_orphan_item() fail, we do another iput() in the respective error paths, resulting in an extra iput() on the inode. Fix this by setting inode to NULL after the first iput(), as iput() ignores a NULL inode pointer argument. Fixes: a13bb2c03848 ("btrfs: add missing iputs on orphan cleanup failure") CC: stable@vger.kernel.org # 6.4 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: zoned: fix memory leak after finding block group with super blocksFilipe Manana1-0/+1
At exclude_super_stripes(), if we happen to find a block group that has super blocks mapped to it and we are on a zoned filesystem, we error out as this is not supposed to happen, indicating either a bug or maybe some memory corruption for example. However we are exiting the function without freeing the memory allocated for the logical address of the super blocks. Fix this by freeing the logical address. Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18pstore: Replace crypto API compression with zlib_deflate library callsArd Biesheuvel2-139/+97
Pstore supports compression using a variety of algorithms exposed by the crypto API. This uses the deprecated comp (as opposed to scomp/acomp) API, and so we should stop using that, and either move to the new API, or switch to a different approach entirely. Given that we only compress ASCII text in pstore, and considering that this happens when the system is likely to be in an unstable state, the flexibility that the complex crypto API provides does not outweigh its impact on the risk that we might encounter additional problems when trying to commit the kernel log contents to the pstore backend. So let's switch [back] to the zlib deflate library API, and remove all the complexity that really has no place in a low-level diagnostic facility. Note that, while more modern compression algorithms have been added to the kernel in recent years, the code size of zlib deflate is substantially smaller than, e.g., zstd, while its performance in terms of compression ratio is comparable for ASCII text, and speed is deemed irrelevant in this context. Note that this means that compressed pstore records may no longer be accessible after a kernel upgrade, but this has never been part of the contract. (The choice of compression algorithm is not stored in the pstore records either) Tested-by: "Guilherme G. Piccoli" <gpiccoli@igalia.com> # Steam Deck Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230712162332.2670437-3-ardb@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-18pstore: Remove worst-case compression size logicArd Biesheuvel1-187/+19
The worst case compression size used by pstore gives an upper bound for how much the data might inadvertently *grow* due to encapsulation overhead if the input is not compressible at all. Given that pstore records have individual 'compressed' flags, we can simply store the uncompressed data if compressing it would end up using more space, making the worst case identical to the uncompressed case. This means we can just drop all the elaborate logic that reasons about upper bounds for each individual compression algorithm, and just store the uncompressed data directly if compression fails for any reason. Co-developed-by: Kees Cook <keescook@chromium.org> Tested-by: "Guilherme G. Piccoli" <gpiccoli@igalia.com> # Steam Deck Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230712162332.2670437-2-ardb@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17fs: wait for partially frozen filesystemsDarrick J. Wong1-2/+32
Jan Kara suggested that when one thread is in the middle of freezing a filesystem, another thread trying to freeze the same fs but with a different freeze_holder should wait until the freezer reaches either end state (UNFROZEN or COMPLETE) instead of returning EBUSY immediately. Neither caller can do anything sensible with this race other than retry but they cannot really distinguish EBUSY as in "some other holder of the same type has the sb already frozen" from "freezing raced with holder of a different type". Plumb in the extra code needed to wait for the fs freezer to reach an end state and try the freeze again. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz>
2023-07-17fs: distinguish between user initiated freeze and kernel initiated freezeDarrick J. Wong5-23/+88
Userspace can freeze a filesystem using the FIFREEZE ioctl or by suspending the block device; this state persists until userspace thaws the filesystem with the FITHAW ioctl or resuming the block device. Since commit 18e9e5104fcd ("Introduce freeze_super and thaw_super for the fsfreeze ioctl") we only allow the first freeze command to succeed. The kernel may decide that it is necessary to freeze a filesystem for its own internal purposes, such as suspends in progress, filesystem fsck activities, or quiescing a device prior to removal. Userspace thaw commands must never break a kernel freeze, and kernel thaw commands shouldn't undo userspace's freeze command. Introduce a couple of freeze holder flags and wire it into the sb_writers state. One kernel and one userspace freeze are allowed to coexist at the same time; the filesystem will not thaw until both are lifted. I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing behaviors. Cc: mcgrof@kernel.org Cc: jack@suse.cz Cc: hch@infradead.org Cc: ruansy.fnst@fujitsu.com Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2023-07-17iomap: micro optimize the ki_pos assignment in iomap_file_buffered_writeChristoph Hellwig1-1/+1
We have the new value for ki_pos right at hand in iter.pos, so assign that instead of recalculating it from ret. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>
2023-07-17iomap: fix a regression for partial write errorsChristoph Hellwig1-1/+1
When write* wrote some data it should return the amount of written data and not the error code that caused it to stop. Fix a recent regression in iomap_file_buffered_write that caused it to return the errno instead. Fixes: 219580eea1ee ("iomap: update ki_pos in iomap_file_buffered_write") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>
2023-07-17xfs: convert flex-array declarations in xfs attr shortform objectsDarrick J. Wong2-1/+2
As of 6.5-rc1, UBSAN trips over the ondisk extended attribute shortform definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-17xfs: convert flex-array declarations in xfs attr leaf blocksDarrick J. Wong2-10/+67
As of 6.5-rc1, UBSAN trips over the ondisk extended attribute leaf block definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. ================================================================================ UBSAN: array-index-out-of-bounds in fs/xfs/libxfs/xfs_attr_leaf.c:2535:24 index 2 is out of range for type '__u8 [1]' Call Trace: <TASK> dump_stack_lvl+0x33/0x50 __ubsan_handle_out_of_bounds+0x9c/0xd0 xfs_attr3_leaf_getvalue+0x2ce/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_leaf_get+0x148/0x1c0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_get_ilocked+0xae/0x110 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_get+0xee/0x150 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_xattr_get+0x7d/0xc0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] __vfs_getxattr+0xa3/0x100 vfs_getxattr+0x87/0x1d0 do_getxattr+0x17a/0x220 getxattr+0x89/0xf0 Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-17xfs: convert flex-array declarations in struct xfs_attrlist*Darrick J. Wong1-2/+2
As of 6.5-rc1, UBSAN trips over the attrlist ioctl definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. This may cause friction with userspace header declarations, but suck is life. ================================================================================ UBSAN: array-index-out-of-bounds in fs/xfs/xfs_ioctl.c:345:18 index 1 is out of range for type '__s32 [1]' Call Trace: <TASK> dump_stack_lvl+0x33/0x50 __ubsan_handle_out_of_bounds+0x9c/0xd0 xfs_ioc_attr_put_listent+0x413/0x420 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_list_ilocked+0x170/0x850 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_list+0xb7/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_ioc_attr_list+0x13b/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attrlist_by_handle+0xab/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_file_ioctl+0x1ff/0x15e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] vfs_ioctl+0x1f/0x60 The kernel and xfsprogs code that uses these structures will not have problems, but the long tail of external user programs might. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-16Merge tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds7-31/+43
Pull smb client fixes from Steve French: - Two reconnect fixes: important fix to address inFlight count to leak (which can leak credits), and fix for better handling a deleted share - DFS fix - SMB1 cleanup fix - deferred close fix * tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix mid leak during reconnection after timeout threshold cifs: is_network_name_deleted should return a bool smb: client: fix missed ses refcounting smb: client: Fix -Wstringop-overflow issues cifs: if deferred close is disabled then close files immediately
2023-07-15exfat: release s_lock before calling dir_emit()Sungjong Seo1-15/+12
There is a potential deadlock reported by syzbot as below: ====================================================== WARNING: possible circular locking dependency detected 6.4.0-next-20230707-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor330/5073 is trying to acquire lock: ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: mmap_read_lock_killable include/linux/mmap_lock.h:151 [inline] ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: get_mmap_lock_carefully mm/memory.c:5293 [inline] ffff8880218527a0 (&mm->mmap_lock){++++}-{3:3}, at: lock_mm_and_find_vma+0x369/0x510 mm/memory.c:5344 but task is already holding lock: ffff888019f760e0 (&sbi->s_lock){+.+.}-{3:3}, at: exfat_iterate+0x117/0xb50 fs/exfat/dir.c:232 which lock already depends on the new lock. Chain exists of: &mm->mmap_lock --> mapping.invalidate_lock#3 --> &sbi->s_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sbi->s_lock); lock(mapping.invalidate_lock#3); lock(&sbi->s_lock); rlock(&mm->mmap_lock); Let's try to avoid above potential deadlock condition by moving dir_emit*() out of sbi->s_lock coverage. Fixes: ca06197382bd ("exfat: add directory operations") Cc: stable@vger.kernel.org #v5.7+ Reported-by: syzbot+1741a5d9b79989c10bdc@syzkaller.appspotmail.com Link: https://lore.kernel.org/lkml/00000000000078ee7e060066270b@google.com/T/#u Tested-by: syzbot+1741a5d9b79989c10bdc@syzkaller.appspotmail.com Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2023-07-14cifs: fix mid leak during reconnection after timeout thresholdShyam Prasad N1-4/+15
When the number of responses with status of STATUS_IO_TIMEOUT exceeds a specified threshold (NUM_STATUS_IO_TIMEOUT), we reconnect the connection. But we do not return the mid, or the credits returned for the mid, or reduce the number of in-flight requests. This bug could result in the server->in_flight count to go bad, and also cause a leak in the mids. This change moves the check to a few lines below where the response is decrypted, even of the response is read from the transform header. This way, the code for returning the mids can be reused. Also, the cifs_reconnect was reconnecting just the transport connection before. In case of multi-channel, this may not be what we want to do after several timeouts. Changed that to reconnect the session and the tree too. Also renamed NUM_STATUS_IO_TIMEOUT to a more appropriate name MAX_STATUS_IO_TIMEOUT. Fixes: 8e670f77c4a5 ("Handle STATUS_IO_TIMEOUT gracefully") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-14cifs: is_network_name_deleted should return a boolShyam Prasad N3-7/+14
Currently, is_network_name_deleted and it's implementations do not return anything if the network name did get deleted. So the function doesn't fully achieve what it advertizes. Changed the function to return a bool instead. It will now return true if the error returned is STATUS_NETWORK_NAME_DELETED and the share (tree id) was found to be connected. It returns false otherwise. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-07-14fs: Fix error checking for d_hash_and_lookup()Wang Ming1-1/+1
The d_hash_and_lookup() function returns error pointers or NULL. Most incorrect error checks were fixed, but the one in int path_pts() was forgotten. Fixes: eedf265aa003 ("devpts: Make each mount of devpts an independent filesystem.") Signed-off-by: Wang Ming <machel@vivo.com> Message-Id: <20230713120555.7025-1-machel@vivo.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-14attr: block mode changes of symlinksChristian Brauner1-2/+18
Changing the mode of symlinks is meaningless as the vfs doesn't take the mode of a symlink into account during path lookup permission checking. However, the vfs doesn't block mode changes on symlinks. This however, has lead to an untenable mess roughly classifiable into the following two categories: (1) Filesystems that don't implement a i_op->setattr() for symlinks. Such filesystems may or may not know that without i_op->setattr() defined, notify_change() falls back to simple_setattr() causing the inode's mode in the inode cache to be changed. That's a generic issue as this will affect all non-size changing inode attributes including ownership changes. Example: afs (2) Filesystems that fail with EOPNOTSUPP but change the mode of the symlink nonetheless. Some filesystems will happily update the mode of a symlink but still return EOPNOTSUPP. This is the biggest source of confusion for userspace. The EOPNOTSUPP in this case comes from POSIX ACLs. Specifically it comes from filesystems that call posix_acl_chmod(), e.g., btrfs via if (!err && attr->ia_valid & ATTR_MODE) err = posix_acl_chmod(idmap, dentry, inode->i_mode); Filesystems including btrfs don't implement i_op->set_acl() so posix_acl_chmod() will report EOPNOTSUPP. When posix_acl_chmod() is called, most filesystems will have finished updating the inode. Perversely, this has the consequences that this behavior may depend on two kconfig options and mount options: * CONFIG_POSIX_ACL={y,n} * CONFIG_${FSTYPE}_POSIX_ACL={y,n} * Opt_acl, Opt_noacl Example: btrfs, ext4, xfs The only way to change the mode on a symlink currently involves abusing an O_PATH file descriptor in the following manner: fd = openat(-1, "/path/to/link", O_CLOEXEC | O_PATH | O_NOFOLLOW); char path[PATH_MAX]; snprintf(path, sizeof(path), "/proc/self/fd/%d", fd); chmod(path, 0000); But for most major filesystems with POSIX ACL support such as btrfs, ext4, ceph, tmpfs, xfs and others this will fail with EOPNOTSUPP with the mode still updated due to the aforementioned posix_acl_chmod() nonsense. So, given that for all major filesystems this would fail with EOPNOTSUPP and that both glibc (cf. [1]) and musl (cf. [2]) outright block mode changes on symlinks we should just try and block mode changes on symlinks directly in the vfs and have a clean break with this nonsense. If this causes any regressions, we do the next best thing and fix up all filesystems that do return EOPNOTSUPP with the mode updated to not call posix_acl_chmod() on symlinks. But as usual, let's try the clean cut solution first. It's a simple patch that can be easily reverted. Not marking this for backport as I'll do that manually if we're reasonably sure that this works and there are no strong objections. We could block this in chmod_common() but it's more appropriate to do it notify_change() as it will also mean that we catch filesystems that change symlink permissions explicitly or accidently. Similar proposals were floated in the past as in [3] and [4] and again recently in [5]. There's also a couple of bugs about this inconsistency as in [6] and [7]. Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=99527a3727e44cb8661ee1f743068f108ec93979;hb=HEAD [1] Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c [2] Link: https://lore.kernel.org/all/20200911065733.GA31579@infradead.org [3] Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00518.html [4] Link: https://lore.kernel.org/lkml/87lefmbppo.fsf@oldenburg.str.redhat.com [5] Link: https://sourceware.org/legacy-ml/libc-alpha/2020-02/msg00467.html [6] Link: https://sourceware.org/bugzilla/show_bug.cgi?id=14578#c17 [7] Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org # please backport to all LTSes but not before v6.6-rc2 is tagged Suggested-by: Christoph Hellwig <hch@lst.de> Suggested-by: Florian Weimer <fweimer@redhat.com> Message-Id: <20230712-vfs-chmod-symlinks-v2-1-08cfb92b61dd@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13procfs: block chmod on /proc/thread-self/commAleksa Sarai1-1/+2
Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD, chmod operations on /proc/thread-self/comm were no longer blocked as they are on almost all other procfs files. A very similar situation with /proc/self/environ was used to as a root exploit a long time ago, but procfs has SB_I_NOEXEC so this is simply a correctness issue. Ref: https://lwn.net/Articles/191954/ Ref: 6d76fa58b050 ("Don't allow chmod() on the /proc/<pid>/ files") Fixes: 1b3044e39a89 ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230713141001.27046-1-cyphar@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13exfat: check if filename entries exceeds max filename lengthNamjae Jeon1-2/+7
exfat_extract_uni_name copies characters from a given file name entry into the 'uniname' variable. This variable is actually defined on the stack of the exfat_readdir() function. According to the definition of the 'exfat_uni_name' type, the file name should be limited 255 characters (+ null teminator space), but the exfat_get_uniname_from_ext_entry() function can write more characters because there is no check if filename entries exceeds max filename length. This patch add the check not to copy filename characters when exceeding max filename length. Cc: stable@vger.kernel.org Cc: Yuezhang Mo <Yuezhang.Mo@sony.com> Reported-by: Maxim Suhanov <dfirblog@gmail.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2023-07-13proc: use generic setattr() for /proc/$PID/netThomas Weißschuh1-0/+1
All other files in /proc/$PID/ use proc_setattr(). Not using it allows the usage of chmod() on /proc/$PID/net, even on other processes owned by the same user. The same would probably also be true for other attributes to be changed. As this technically represents an ABI change it is not marked for stable so any unlikely regressions are caught during a full release cycle. Fixes: e9720acd728a ("[NET]: Make /proc/net a symlink on /proc/self/net (v3)") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/lkml/d0d111ef-edae-4760-83fb-36db84278da1@t-8ch.de/ Fixes: b4844fa0bdb4 ("selftests/nolibc: implement a few tests for various syscalls") Tested-by: Zhangjin Wu <falcon@tinylab.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Message-Id: <20230624-proc-net-setattr-v1-2-73176812adee@weissschuh.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13ext2: convert to ctime accessor functionsJeff Layton8-18/+18
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-39-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13exfat: convert to ctime accessor functionsJeff Layton4-19/+15
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-38-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-13erofs: convert to ctime accessor functionsJeff Layton1-8/+4
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Acked-by: Gao Xiang <xiang@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230705190309.579783-37-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>