summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-10-24Merge tag 'for-6.12-rc4-tag' of ↵Linus Torvalds9-34/+45
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - mount option fixes: - fix handling of compression mount options on remount - reject rw remount in case there are options that don't work in read-write mode (like rescue options) - fix zone accounting of unusable space - fix in-memory corruption when merging extent maps - fix delalloc range locking for sector < page - use more convenient default value of drop subtree threshold, clean more subvolumes without the fallback to marking quotas inconsistent - fix smatch warning about incorrect value passed to ERR_PTR * tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item() btrfs: reject ro->rw reconfiguration if there are hard ro requirements btrfs: fix read corruption due to race with extent map merging btrfs: fix the delalloc range locking if sector size < page size btrfs: qgroup: set a more sane default value for subtree drop threshold btrfs: clear force-compress on remount when compress mount option is given btrfs: zoned: fix zone unusable accounting for freed reserved extent
2024-10-24Merge tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggyLinus Torvalds1-1/+1
Pull jfs fix from David Kleikamp: "Fix a regression introduced in 6.12-rc1" * tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggy: jfs: Fix sanity check in dbMount
2024-10-24Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefsLinus Torvalds41-205/+475
Pull bcachefs fixes from Kent Overstreet: "Lots of hotfixes: - transaction restart injection has been shaking out a few things - fix a data corruption in the buffered write path on -ENOSPC, found by xfstests generic/299 - Some small show_options fixes - Repair mismatches in inode hash type, seed: different snapshot versions of an inode must have the same hash/type seed, used for directory entries and xattrs. We were checking the hash seed, but not the type, and a user contributed a filesystem where the hash type on one inode had somehow been flipped; these fixes allow his filesystem to repair. Additionally, the hash type flip made some directory entries invisible, which were then recreated by userspace; so the hash check code now checks for duplicate non dangling dirents, and renames one of them if necessary. - Don't use wait_event_interruptible() in recovery: this fixes some filesystems failing to mount with -ERESTARTSYS - Workaround for kvmalloc not supporting > INT_MAX allocations, causing an -ENOMEM when allocating the sorted array of journal keys: this allows a 75 TB filesystem to mount - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode compat path: this alllows Marcin's filesystem (in use since before 6.7) to repair and mount" * tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits) bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path bcachefs: Mark more errors as AUTOFIX bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations bcachefs: Don't use wait_event_interruptible() in recovery bcachefs: Fix __bch2_fsck_err() warning bcachefs: fsck: Improve hash_check_key() bcachefs: bch2_hash_set_or_get_in_snapshot() bcachefs: Repair mismatches in inode hash seed, type bcachefs: Add hash seed, type to inode_to_text() bcachefs: INODE_STR_HASH() for bch_inode_unpacked bcachefs: Run in-kernel offline fsck without ratelimit errors bcachefs: skip mount option handle for empty string. bcachefs: fix incorrect show_options results bcachefs: Fix data corruption on -ENOSPC in buffered write path bcachefs: bch2_folio_reservation_get_partial() is now better behaved bcachefs: fix disk reservation accounting in bch2_folio_reservation_get() bcachefS: ec: fix data type on stripe deletion bcachefs: Don't use commit_do() unnecessarily bcachefs: handle restarts in bch2_bucket_io_time_reset() bcachefs: fix restart handling in __bch2_resume_logged_op_finsert() ...
2024-10-24Revert "9p: Enable multipage folios"Dominique Martinet1-1/+0
This reverts commit 1325e4a91a405f88f1b18626904d37860a4f9069. using multipage folios apparently break some madvise operations like MADV_PAGEOUT which do not reliably unload the specified page anymore, Revert the patch until that is figured out. Reported-by: Andrii Nakryiko <andrii@kernel.org> Fixes: 1325e4a91a40 ("9p: Enable multipage folios") Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-22jfs: Fix sanity check in dbMountDave Kleikamp1-1/+1
MAXAG is a legitimate value for bmp->db_numag Fixes: e63866a47556 ("jfs: fix out-of-bounds in dbNextAG() and diAlloc()") Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2024-10-22btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()Yue Haibing2-7/+4
The ret may be zero in btrfs_search_dir_index_item() and should not passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0. This fixes smatch warnings: fs/btrfs/dir-item.c:353 btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR' Fixes: 9dcbe16fccbb ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: reject ro->rw reconfiguration if there are hard ro requirementsQu Wenruo1-2/+1
[BUG] Syzbot reports the following crash: BTRFS info (device loop0 state MCS): disabling free space tree BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1) BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2) Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline] RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041 Call Trace: <TASK> btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530 btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312 btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012 btrfs_remount_rw fs/btrfs/super.c:1309 [inline] btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534 btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline] btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline] btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115 vfs_get_tree+0x90/0x2b0 fs/super.c:1800 do_new_mount+0x2be/0xb40 fs/namespace.c:3472 do_mount fs/namespace.c:3812 [inline] __do_sys_mount fs/namespace.c:4020 [inline] __se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [CAUSE] To support mounting different subvolume with different RO/RW flags for the new mount APIs, btrfs introduced two workaround to support this feature: - Skip mount option/feature checks if we are mounting a different subvolume - Reconfigure the fs to RW if the initial mount is RO Combining these two, we can have the following sequence: - Mount the fs ro,rescue=all,clear_cache,space_cache=v1 rescue=all will mark the fs as hard read-only, so no v2 cache clearing will happen. - Mount a subvolume rw of the same fs. We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY because our new fc is RW, different from the original fs. Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag first so that we can grab the existing fs_info. Then we reconfigure the fs to RW. - During reconfiguration, option/features check is skipped This means we will restart the v2 cache clearing, and convert back to v1 cache. This will trigger fs writes, and since the original fs has "rescue=all" option, it skips the csum tree read. And eventually causing NULL pointer dereference in super block writeback. [FIX] For reconfiguration caused by different subvolume RO/RW flags, ensure we always run btrfs_check_options() to ensure we have proper hard RO requirements met. In fact the function btrfs_check_options() doesn't really do many complex checks, but hard RO requirement and some feature dependency checks, thus there is no special reason not to do the check for mount reconfiguration. Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/ Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: fix read corruption due to race with extent map mergingBoris Burkov1-15/+16
In debugging some corrupt squashfs files, we observed symptoms of corrupt page cache pages but correct on-disk contents. Further investigation revealed that the exact symptom was a correct page followed by an incorrect, duplicate, page. This got us thinking about extent maps. commit ac05ca913e9f ("Btrfs: fix race between using extent maps and merging them") enforces a reference count on the primary `em` extent_map being merged, as that one gets modified. However, since, commit 3d2ac9922465 ("btrfs: introduce new members for extent_map") both 'em' and 'merge' get modified, which started modifying 'merge' and thus introduced the same race. We were able to reproduce this by looping the affected squashfs workload in parallel on a bunch of separate btrfs-es while also dropping caches. We are still working on a simple enough reproducer to make into an fstest. The simplest fix is to stop modifying 'merge', which is not essential, as it is dropped immediately after the merge. This behavior is simply a consequence of the order of the two extent maps being important in computing the new values. Modify merge_ondisk_extents to take prev and next by const* and also take a third merged parameter that it puts the results in. Note that this introduces the rather odd behavior of passing 'em' to merge_ondisk_extents as a const * and as a regular ptr. Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: fix the delalloc range locking if sector size < page sizeQu Wenruo1-8/+9
Inside lock_delalloc_folios(), there are several problems related to sector size < page size handling: - Set the writer locks without checking if the folio is still valid We call btrfs_folio_start_writer_lock() just like it's folio_lock(). But since the folio may not even be the folio of the current mapping, we can easily screw up the folio->private. - The range is not clamped inside the page This means we can over write other bitmaps if the start/len is not properly handled, and trigger the btrfs_subpage_assert(). - @processed_end is always rounded up to page end If the delalloc range is not page aligned, and we need to retry (returning -EAGAIN), then we will unlock to the page end. Thankfully this is not a huge problem, as now btrfs_folio_end_writer_lock() can handle range larger than the locked range, and only unlock what is already locked. Fix all these problems by: - Lock and check the folio first, then call btrfs_folio_set_writer_lock() So that if we got a folio not belonging to the inode, we won't touch folio->private. - Properly truncate the range inside the page - Update @processed_end to the locked range end Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: qgroup: set a more sane default value for subtree drop thresholdQu Wenruo3-2/+4
Since commit 011b46c30476 ("btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can automatically skip large subtree scan at the cost of marking qgroup inconsistent. It's designed to address the final performance problem of snapshot drop with qgroup enabled, but to be safe the default value is BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value to make it work. I'd say it's not a good idea to rely on user space tool to set this default value, especially when some operations (snapshot dropping) can be triggered immediately after mount, leaving a very small window to that that sysfs interface. So instead of disabling this new feature by default, enable it with a low threshold (3), so that large subvolume tree drop at mount time won't cause huge qgroup workload. CC: stable@vger.kernel.org # 6.1 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: clear force-compress on remount when compress mount option is givenFilipe Manana1-0/+9
After the migration to use fs context for processing mount options we had a slight change in the semantics for remounting a filesystem that was mounted with compress-force. Before we could clear compress-force by passing only "-o compress[=algo]" during a remount, but after that change that does not work anymore, force-compress is still present and one needs to pass "-o compress-force=no,compress[=algo]" to the mount command. Example, when running on a kernel 6.8+: $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) On a 6.7 kernel (or older): $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) So update btrfs_parse_param() to clear "compress-force" when "compress" is given, providing the same semantics as kernel 6.7 and older. Reported-by: Roman Mamedov <rm@romanrm.net> Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/ CC: stable@vger.kernel.org # 6.8+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22KVM: selftests: Fix build on on non-x86 architecturesMark Brown1-1/+3
Commit 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX instructions") unconditionally added -march=x86-64-v2 to the CFLAGS used to build the KVM selftests which does not work on non-x86 architectures: cc1: error: unknown value ‘x86-64-v2’ for ‘-march’ Fix this by making the addition of this x86 specific command line flag conditional on building for x86. Fixes: 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX instructions") Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-229p: fix slab cache name creation for realLinus Torvalds1-1/+3
This was attempted by using the dev_name in the slab cache name, but as Omar Sandoval pointed out, that can be an arbitrary string, eg something like "/dev/root". Which in turn trips verify_dirent_name(), which fails if a filename contains a slash. So just make it use a sequence counter, and make it an atomic_t to avoid any possible races or locking issues. Reported-and-tested-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/all/ZxafcO8KWMlXaeWE@telecaster.dhcp.thefacebook.com/ Fixes: 79efebae4afc ("9p: Avoid creating multiple slab caches with the same name") Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds25-103/+277
Pull kvm fixes from Paolo Bonzini: "ARM64: - Fix the guest view of the ID registers, making the relevant fields writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1) - Correcly expose S1PIE to guests, fixing a regression introduced in 6.12-rc1 with the S1POE support - Fix the recycling of stage-2 shadow MMUs by tracking the context (are we allowed to block or not) as well as the recycling state - Address a couple of issues with the vgic when userspace misconfigures the emulation, resulting in various splats. Headaches courtesy of our Syzkaller friends - Stop wasting space in the HYP idmap, as we are dangerously close to the 4kB limit, and this has already exploded in -next - Fix another race in vgic_init() - Fix a UBSAN error when faking the cache topology with MTE enabled RISCV: - RISCV: KVM: use raw_spinlock for critical section in imsic x86: - A bandaid for lack of XCR0 setup in selftests, which causes trouble if the compiler is configured to have x86-64-v3 (with AVX) as the default ISA. Proper XCR0 setup will come in the next merge window. - Fix an issue where KVM would not ignore low bits of the nested CR3 and potentially leak up to 31 bytes out of the guest memory's bounds - Fix case in which an out-of-date cached value for the segments could by returned by KVM_GET_SREGS. - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL - Override MTRR state for KVM confidential guests, making it WB by default as is already the case for Hyper-V guests. Generic: - Remove a couple of unused functions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits) RISCV: KVM: use raw_spinlock for critical section in imsic KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups KVM: selftests: x86: Avoid using SSE/AVX instructions KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset() KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range() KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot x86/kvm: Override default caching mode for SEV-SNP and TDX KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic KVM: Remove unused kvm_vcpu_gfn_to_pfn KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS KVM: arm64: Fix shift-out-of-bounds bug KVM: arm64: Shave a few bytes from the EL2 idmap code KVM: arm64: Don't eagerly teardown the vgic on init error KVM: arm64: Expose S1PIE to guests KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule KVM: arm64: nv: Punt stage-2 recycling to a vCPU request KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed ...
2024-10-21Merge tag 'probes-fixes-v6.12-rc4' of ↵Linus Torvalds1-3/+6
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull uprobe fix from Masami Hiramatsu: - uprobe: avoid out-of-bounds memory access of fetching args Uprobe trace events can cause out-of-bounds memory access when fetching user-space data which is bigger than one page, because it does not check the local CPU buffer size when reading the data. This checks the read data size and cut it down to the local CPU buffer size. * tag 'probes-fixes-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: uprobe: avoid out-of-bounds memory access of fetching args
2024-10-21Merge tag 'vfs-6.12-rc5.fixes' of ↵Linus Torvalds12-67/+95
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "afs: - Fix a lock recursion in afs_wake_up_async_call() on ->notify_lock netfs: - Drop the references to a folio immediately after the folio has been extracted to prevent races with future I/O collection - Fix a documenation build error - Downgrade the i_rwsem for buffered writes to fix a cifs reported performance regression when switching to netfslib vfs: - Explicitly return -E2BIG from openat2() if the specified size is unexpectedly large. This aligns openat2() with other extensible struct based system calls - When copying a mount namespace ensure that we only try to remove the new copy from the mount namespace rbtree if it has already been added to it nilfs: - Clear the buffer delay flag when clearing the buffer state clags when a buffer head is discarded to prevent a kernel OOPs ocfs2: - Fix an unitialized value warning in ocfs2_setattr() proc: - Fix a kernel doc warning" * tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: proc: Fix W=1 build kernel-doc warning afs: Fix lock recursion fs: Fix uninitialized value issue in from_kuid and from_kgid fs: don't try and remove empty rbtree node netfs: Downgrade i_rwsem for a buffered write nilfs2: fix kernel bug due to missing clearing of buffer delay flag openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) netfs: fix documentation build error netfs: In readahead, put the folio refs as soon extracted
2024-10-21Merge tag 'v6.12-p4' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "Fix a regression in mpi that broke RSA" * tag 'v6.12-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: lib/mpi - Fix an "Uninitialized scalar variable" issue
2024-10-21uprobe: avoid out-of-bounds memory access of fetching argsQiao Ma1-3/+6
Uprobe needs to fetch args into a percpu buffer, and then copy to ring buffer to avoid non-atomic context problem. Sometimes user-space strings, arrays can be very large, but the size of percpu buffer is only page size. And store_trace_args() won't check whether these data exceeds a single page or not, caused out-of-bounds memory access. It could be reproduced by following steps: 1. build kernel with CONFIG_KASAN enabled 2. save follow program as test.c ``` \#include <stdio.h> \#include <stdlib.h> \#include <string.h> // If string length large than MAX_STRING_SIZE, the fetch_store_strlen() // will return 0, cause __get_data_size() return shorter size, and // store_trace_args() will not trigger out-of-bounds access. // So make string length less than 4096. \#define STRLEN 4093 void generate_string(char *str, int n) { int i; for (i = 0; i < n; ++i) { char c = i % 26 + 'a'; str[i] = c; } str[n-1] = '\0'; } void print_string(char *str) { printf("%s\n", str); } int main() { char tmp[STRLEN]; generate_string(tmp, STRLEN); print_string(tmp); return 0; } ``` 3. compile program `gcc -o test test.c` 4. get the offset of `print_string()` ``` objdump -t test | grep -w print_string 0000000000401199 g F .text 000000000000001b print_string ``` 5. configure uprobe with offset 0x1199 ``` off=0x1199 cd /sys/kernel/debug/tracing/ echo "p /root/test:${off} arg1=+0(%di):ustring arg2=\$comm arg3=+0(%di):ustring" > uprobe_events echo 1 > events/uprobes/enable echo 1 > tracing_on ``` 6. run `test`, and kasan will report error. ================================================================== BUG: KASAN: use-after-free in strncpy_from_user+0x1d6/0x1f0 Write of size 8 at addr ffff88812311c004 by task test/499CPU: 0 UID: 0 PID: 499 Comm: test Not tainted 6.12.0-rc3+ #18 Hardware name: Red Hat KVM, BIOS 1.16.0-4.al8 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x55/0x70 print_address_description.constprop.0+0x27/0x310 kasan_report+0x10f/0x120 ? strncpy_from_user+0x1d6/0x1f0 strncpy_from_user+0x1d6/0x1f0 ? rmqueue.constprop.0+0x70d/0x2ad0 process_fetch_insn+0xb26/0x1470 ? __pfx_process_fetch_insn+0x10/0x10 ? _raw_spin_lock+0x85/0xe0 ? __pfx__raw_spin_lock+0x10/0x10 ? __pte_offset_map+0x1f/0x2d0 ? unwind_next_frame+0xc5f/0x1f80 ? arch_stack_walk+0x68/0xf0 ? is_bpf_text_address+0x23/0x30 ? kernel_text_address.part.0+0xbb/0xd0 ? __kernel_text_address+0x66/0xb0 ? unwind_get_return_address+0x5e/0xa0 ? __pfx_stack_trace_consume_entry+0x10/0x10 ? arch_stack_walk+0xa2/0xf0 ? _raw_spin_lock_irqsave+0x8b/0xf0 ? __pfx__raw_spin_lock_irqsave+0x10/0x10 ? depot_alloc_stack+0x4c/0x1f0 ? _raw_spin_unlock_irqrestore+0xe/0x30 ? stack_depot_save_flags+0x35d/0x4f0 ? kasan_save_stack+0x34/0x50 ? kasan_save_stack+0x24/0x50 ? mutex_lock+0x91/0xe0 ? __pfx_mutex_lock+0x10/0x10 prepare_uprobe_buffer.part.0+0x2cd/0x500 uprobe_dispatcher+0x2c3/0x6a0 ? __pfx_uprobe_dispatcher+0x10/0x10 ? __kasan_slab_alloc+0x4d/0x90 handler_chain+0xdd/0x3e0 handle_swbp+0x26e/0x3d0 ? __pfx_handle_swbp+0x10/0x10 ? uprobe_pre_sstep_notifier+0x151/0x1b0 irqentry_exit_to_user_mode+0xe2/0x1b0 asm_exc_int3+0x39/0x40 RIP: 0033:0x401199 Code: 01 c2 0f b6 45 fb 88 02 83 45 fc 01 8b 45 fc 3b 45 e4 7c b7 8b 45 e4 48 98 48 8d 50 ff 48 8b 45 e8 48 01 d0 ce RSP: 002b:00007ffdf00576a8 EFLAGS: 00000206 RAX: 00007ffdf00576b0 RBX: 0000000000000000 RCX: 0000000000000ff2 RDX: 0000000000000ffc RSI: 0000000000000ffd RDI: 00007ffdf00576b0 RBP: 00007ffdf00586b0 R08: 00007feb2f9c0d20 R09: 00007feb2f9c0d20 R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000401040 R13: 00007ffdf0058780 R14: 0000000000000000 R15: 0000000000000000 </TASK> This commit enforces the buffer's maxlen less than a page-size to avoid store_trace_args() out-of-memory access. Link: https://lore.kernel.org/all/20241015060148.1108331-1-mqaio@linux.alibaba.com/ Fixes: dcad1a204f72 ("tracing/uprobes: Fetch args before reserving a ring buffer") Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-10-21Linux 6.12-rc4v6.12-rc4Linus Torvalds1-1/+1
2024-10-21bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode pathKent Overstreet1-0/+2
This fixes a fsck bug on a very old filesystem (pre mainline merge). Fixes: 72350ee0ea22 ("bcachefs: Kill snapshot arg to fsck_write_inode()") Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-21bcachefs: Mark more errors as AUTOFIXKent Overstreet1-2/+2
Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-21Merge tag 'for-net-2024-10-16' of ↵Linus Torvalds4-25/+14
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Pull bluetooth fixes from Luiz Augusto Von Dentz: - ISO: Fix multiple init when debugfs is disabled - Call iso_exit() on module unload - Remove debugfs directory on module init failure - btusb: Fix not being able to reconnect after suspend - btusb: Fix regression with fake CSR controllers 0a12:0001 - bnep: fix wild-memory-access in proto_unregister Note: normally the bluetooth fixes go through the networking tree, but this missed the weekly merge, and two of the commits fix regressions that have caused a fair amount of noise and have now hit stable too: https://lore.kernel.org/all/4e1977ca-6166-4891-965e-34a6f319035f@leemhuis.info/ So I'm pulling it directly just to expedite things and not miss yet another -rc release. This is not meant to become a new pattern. * tag 'for-net-2024-10-16' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: btusb: Fix regression with fake CSR controllers 0a12:0001 Bluetooth: bnep: fix wild-memory-access in proto_unregister Bluetooth: btusb: Fix not being able to reconnect after suspend Bluetooth: Remove debugfs directory on module init failure Bluetooth: Call iso_exit() on module unload Bluetooth: ISO: Fix multiple init when debugfs is disabled
2024-10-20Merge tag 'pinctrl-v6.12-2' of ↵Linus Torvalds8-13/+23
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: "Mostly error path fixes, but one pretty serious interrupt problem in the Ocelot driver as well: - Fix two error paths and a missing semicolon in the Intel driver - Add a missing ACPI ID for the Intel Panther Lake - Check return value of devm_kasprintf() in the Apple and STM32 drivers - Add a missing mutex_destroy() in the aw9523 driver - Fix a double free in cv1800_pctrl_dt_node_to_map() in the Sophgo driver - Fix a double free in ma35_pinctrl_dt_node_to_map_func() in the Nuvoton driver - Fix a bug in the Ocelot interrupt handler making the system hang" * tag 'pinctrl-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: ocelot: fix system hang on level based interrupts pinctrl: nuvoton: fix a double free in ma35_pinctrl_dt_node_to_map_func() pinctrl: sophgo: fix double free in cv1800_pctrl_dt_node_to_map() pinctrl: intel: platform: Add Panther Lake to the list of supported pinctrl: aw9523: add missing mutex_destroy pinctrl: stm32: check devm_kasprintf() returned value pinctrl: apple: check devm_kasprintf() returned value pinctrl: intel: platform: use semicolon instead of comma in ncommunities assignment pinctrl: intel: platform: fix error path in device_for_each_child_node()
2024-10-20bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocationsKent Overstreet1-1/+14
kvmalloc() doesn't support allocations > INT_MAX, but vmalloc() does - the limit should be lifted, but we can work around this for now. A user with a 75 TB filesystem reported the following journal replay error: https://github.com/koverstreet/bcachefs/issues/769 In journal replay we have to sort and dedup all the keys from the journal, which means we need a large contiguous allocation. Given that the user has 128GB of ram, the 2GB limit on allocation size has become far too small. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20bcachefs: Don't use wait_event_interruptible() in recoveryKent Overstreet3-6/+8
Fix a bug where mount was failing with -ERESTARTSYS: https://github.com/koverstreet/bcachefs/issues/741 We only want the interruptible wait when called from fsync. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20bcachefs: Fix __bch2_fsck_err() warningKent Overstreet1-1/+4
We only warn about having a btree_trans that wasn't passed in if we'll be prompting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20Merge tag 'char-misc-6.12-rc4' of ↵Linus Torvalds23-267/+117
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull misc driver fixes from Greg KH: "Here are a number of small char/misc/iio driver fixes for 6.12-rc4: - loads of small iio driver fixes for reported problems - parport driver out-of-bounds fix - Kconfig description and MAINTAINERS file updates All of these, except for the Kconfig and MAINTAINERS file updates have been in linux-next all week. Those other two are just documentation changes and will have no runtime issues and were merged on Friday" * tag 'char-misc-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (39 commits) misc: rtsx: list supported models in Kconfig help MAINTAINERS: Remove some entries due to various compliance requirements. misc: microchip: pci1xxxx: add support for NVMEM_DEVID_AUTO for OTP device misc: microchip: pci1xxxx: add support for NVMEM_DEVID_AUTO for EEPROM device parport: Proper fix for array out-of-bounds access iio: frequency: admv4420: fix missing select REMAP_SPI in Kconfig iio: frequency: {admv4420,adrf6780}: format Kconfig entries iio: adc: ad4695: Add missing Kconfig select iio: adc: ti-ads8688: add missing select IIO_(TRIGGERED_)BUFFER in Kconfig iio: hid-sensors: Fix an error handling path in _hid_sensor_set_report_latency() iioc: dac: ltc2664: Fix span variable usage in ltc2664_channel_config() iio: dac: stm32-dac-core: add missing select REGMAP_MMIO in Kconfig iio: dac: ltc1660: add missing select REGMAP_SPI in Kconfig iio: dac: ad5770r: add missing select REGMAP_SPI in Kconfig iio: amplifiers: ada4250: add missing select REGMAP_SPI in Kconfig iio: frequency: adf4377: add missing select REMAP_SPI in Kconfig iio: resolver: ad2s1210: add missing select (TRIGGERED_)BUFFER in Kconfig iio: resolver: ad2s1210 add missing select REGMAP in Kconfig iio: proximity: mb1232: add missing select IIO_(TRIGGERED_)BUFFER in Kconfig iio: pressure: bm1390: add missing select IIO_(TRIGGERED_)BUFFER in Kconfig ...
2024-10-20Merge tag 'tty-6.12-rc4' of ↵Linus Torvalds5-57/+67
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver fixes from Greg KH: "Here are some small tty and serial driver fixes for 6.12-rc4: - qcom-geni serial driver fixes, wow what a mess of a UART chip that thing is... - vt infoleak fix for odd font sizes - imx serial driver bugfix - yet-another n_gsm ldisc bugfix, slowly chipping down the issues in that piece of code All of these have been in linux-next for over a week with no reported issues" * tag 'tty-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: serial: qcom-geni: rename suspend functions serial: qcom-geni: drop unused receive parameter serial: qcom-geni: drop flip buffer WARN() serial: qcom-geni: fix rx cancel dma status bit serial: qcom-geni: fix receiver enable serial: qcom-geni: fix dma rx cancellation serial: qcom-geni: fix shutdown race serial: qcom-geni: revert broken hibernation support serial: qcom-geni: fix polled console initialisation serial: imx: Update mctrl old_status on RTSD interrupt tty: n_gsm: Fix use-after-free in gsm_cleanup_mux vt: prevent kernel-infoleak in con_font_get()
2024-10-20Merge tag 'usb-6.12-rc4' of ↵Linus Torvalds14-58/+151
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB driver fixes from Greg KH: "Here are some small USB driver fixes and new device ids for 6.12-rc4: - xhci driver fixes for a number of reported issues - new usb-serial driver ids - dwc3 driver fixes for reported problems. - usb gadget driver fixes for reported problems - typec driver fixes - MAINTAINER file updates All of these have been in linux-next this week with no reported issues" * tag 'usb-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: USB: serial: option: add Telit FN920C04 MBIM compositions USB: serial: option: add support for Quectel EG916Q-GL xhci: dbc: honor usb transfer size boundaries. usb: xhci: Fix handling errors mid TD followed by other errors xhci: Mitigate failed set dequeue pointer commands xhci: Fix incorrect stream context type macro USB: gadget: dummy-hcd: Fix "task hung" problem usb: gadget: f_uac2: fix return value for UAC2_ATTRIBUTE_STRING store usb: dwc3: core: Fix system suspend on TI AM62 platforms xhci: tegra: fix checked USB2 port number usb: dwc3: Wait for EndXfer completion before restoring GUSB2PHYCFG usb: typec: qcom-pmic-typec: fix sink status being overwritten with RP_DEF usb: typec: altmode should keep reference to parent MAINTAINERS: usb: raw-gadget: add bug tracker link MAINTAINERS: Add an entry for the LJCA drivers
2024-10-20Merge tag 'x86_urgent_for_v6.12_rc4' of ↵Linus Torvalds7-16/+47
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Explicitly disable the TSC deadline timer when going idle to address some CPU errata in that area - Do not apply the Zenbleed fix on anything else except AMD Zen2 on the late microcode loading path - Clear CPU buffers later in the NMI exit path on 32-bit to avoid register clearing while they still contain sensitive data, for the RDFS mitigation - Do not clobber EFLAGS.ZF with VERW on the opportunistic SYSRET exit path on 32-bit - Fix parsing issues of memory bandwidth specification in sysfs for resctrl's memory bandwidth allocation feature - Other small cleanups and improvements * tag 'x86_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/apic: Always explicitly disarm TSC-deadline timer x86/CPU/AMD: Only apply Zenbleed fix for Zen2 during late microcode load x86/bugs: Use code segment selector for VERW operand x86/entry_32: Clear CPU buffers after register restore in NMI return x86/entry_32: Do not clobber user EFLAGS.ZF x86/resctrl: Annotate get_mem_config() functions as __init x86/resctrl: Avoid overflow in MB settings in bw_validate() x86/amd_nb: Add new PCI ID for AMD family 1Ah model 20h
2024-10-20Merge tag 'irq_urgent_for_v6.12_rc4' of ↵Linus Torvalds8-32/+73
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Fix a case for sifive-plic where an interrupt gets disabled *and* masked and remains masked when it gets reenabled later - Plug a small race in GIC-v4 where userspace can force an affinity change of a virtual CPU (vPE) in its unmapping path - Do not mix the two sets of ocelot irqchip's registers in the mask calculation of the main interrupt sticky register - Other smaller fixlets and cleanups * tag 'irq_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/renesas-rzg2l: Fix missing put_device irqchip/riscv-intc: Fix SMP=n boot with ACPI irqchip/sifive-plic: Unmask interrupt in plic_irq_enable() irqchip/gic-v4: Don't allow a VMOVP on a dying VPE irqchip/sifive-plic: Return error code on failure irqchip/riscv-imsic: Fix output text of base address irqchip/ocelot: Comment sticky register clearing code irqchip/ocelot: Fix trigger register address irqchip: Remove obsolete config ARM_GIC_V3_ITS_PCI
2024-10-20Merge tag 'sched_urgent_for_v6.12_rc4' of ↵Linus Torvalds17-72/+146
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduling fixes from Borislav Petkov: - Add PREEMPT_RT maintainers - Fix another aspect of delayed dequeued tasks wrt determining their state, i.e., whether they're runnable or blocked - Handle delayed dequeued tasks and their migration wrt PSI properly - Fix the situation where a delayed dequeue task gets enqueued into a new class, which should not happen - Fix a case where memory allocation would happen while the runqueue lock is held, which is a no-no - Do not over-schedule when tasks with shorter slices preempt the currently running task - Make sure delayed to deque entities are properly handled before unthrottling - Other smaller cleanups and improvements * tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Add an entry for PREEMPT_RT. sched/fair: Fix external p->on_rq users sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug sched/core: Dequeue PSI signals for blocked tasks that are delayed sched: Fix delayed_dequeue vs switched_from_fair() sched/core: Disable page allocation in task_tick_mm_cid() sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl() sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running sched: Fix sched_delayed vs cfs_bandwidth
2024-10-20Merge tag 'for-linus-6.12a-rc4-tag' of ↵Linus Torvalds5-12/+44
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fix from Juergen Gross: "A single fix for a build failure introduced this merge window" * tag 'for-linus-6.12a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: Remove dependency between pciback and privcmd
2024-10-20Merge tag 'dma-mapping-6.12-2024-10-20' of ↵Linus Torvalds1-8/+8
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fix from Christoph Hellwig: "Just another small tracing fix from Sean" * tag 'dma-mapping-6.12-2024-10-20' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: fix tracing dma_alloc/free with vmalloc'd memory
2024-10-20Merge tag 'kvmarm-fixes-6.12-3' of ↵Paolo Bonzini6-27/+49
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.12, take #3 - Stop wasting space in the HYP idmap, as we are dangerously close to the 4kB limit, and this has already exploded in -next - Fix another race in vgic_init() - Fix a UBSAN error when faking the cache topology with MTE enabled
2024-10-20Merge tag 'kvmarm-fixes-6.12-2' of ↵Paolo Bonzini10-35/+183
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.12, take #2 - Fix the guest view of the ID registers, making the relevant fields writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1) - Correcly expose S1PIE to guests, fixing a regression introduced in 6.12-rc1 with the S1POE support - Fix the recycling of stage-2 shadow MMUs by tracking the context (are we allowed to block or not) as well as the recycling state - Address a couple of issues with the vgic when userspace misconfigures the emulation, resulting in various splats. Headaches courtesy of our Syzkaller friends
2024-10-20RISCV: KVM: use raw_spinlock for critical section in imsicCyan Yang1-4/+4
For the external interrupt updating procedure in imsic, there was a spinlock to protect it already. But since it should not be preempted in any cases, we should turn to use raw_spinlock to prevent any preemption in case PREEMPT_RT was enabled. Signed-off-by: Cyan Yang <cyan.yang@sifive.com> Reviewed-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Message-ID: <20240919160126.44487-1-cyan.yang@sifive.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookupsSean Christopherson1-1/+1
When looking for a "mangled", i.e. dynamic, CPUID entry, terminate the walk based on the number of array _entries_, not the size in bytes of the array. Iterating based on the total size of the array can result in false passes, e.g. if the random data beyond the array happens to match a CPUID entry's function and index. Fixes: fb18d053b7f8 ("selftest: kvm: x86: test KVM_GET_CPUID2 and guest visible CPUIDs against KVM_GET_SUPPORTED_CPUID") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-ID: <20241003234337.273364-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: selftests: x86: Avoid using SSE/AVX instructionsVitaly Kuznetsov1-0/+1
Some distros switched gcc to '-march=x86-64-v3' by default and while it's hard to find a CPU which doesn't support it today, many KVM selftests fail with ==== Test Assertion Failure ==== lib/x86_64/processor.c:570: Unhandled exception in guest pid=72747 tid=72747 errno=4 - Interrupted system call Unhandled exception '0x6' at guest RIP '0x4104f7' The failure is easy to reproduce elsewhere with $ make clean && CFLAGS='-march=x86-64-v3' make -j && ./x86_64/kvm_pv_test The root cause of the problem seems to be that with '-march=x86-64-v3' GCC uses AVX* instructions (VMOVQ in the example above) and without prior XSETBV() in the guest this results in #UD. It is certainly possible to add it there, e.g. the following saves the day as well: Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-ID: <20240920154422.2890096-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memorySean Christopherson1-1/+5
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits 4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't enforce 32-byte alignment of nCR3. In the absolute worst case scenario, failure to ignore bits 4:0 can result in an out-of-bounds read, e.g. if the target page is at the end of a memslot, and the VMM isn't using guard pages. Per the APM: The CR3 register points to the base address of the page-directory-pointer table. The page-directory-pointer table is aligned on a 32-byte boundary, with the low 5 address bits 4:0 assumed to be 0. And the SDM's much more explicit: 4:0 Ignored Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow that is broken. Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory") Reported-by: Kirk Swidowski <swidowski@google.com> Cc: Andy Nguyen <theflow@google.com> Cc: 3pvd <3pvd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009140838.1036226-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()Maxim Levitsky1-3/+3
Reset the segment cache after segment initialization in vmx_vcpu_reset() to harden KVM against caching stale/uninitialized data. Without the recent fix to bypass the cache in kvm_arch_vcpu_put(), the following scenario is possible: - vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset(). - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES. - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache. As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type. Invalidating the cache after writing the VMCS doesn't address the general issue of cache accesses from IRQ context being unsafe, but it does prevent KVM from clobbering the VMCS, i.e. mitigates the harm done _if_ KVM has a bug that results in an unsafe cache access. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields") [sean: rework changelog to account for previous patch] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009175002.1118178-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALLSean Christopherson1-7/+9
Massage the documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL to call out that it applies to moved memslots as well as deleted memslots, to avoid KVM's "fast zap" terminology (which has no meaning for userspace), and to reword the documented targeted zap behavior to specifically say that KVM _may_ zap a subset of all SPTEs. As evidenced by the fix to zap non-leafs SPTEs with gPTEs, formally documenting KVM's exact internal behavior is risky and unnecessary. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009192345.1148353-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()Sean Christopherson1-0/+11
Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either mmu_invalidate_in_progress is elevated, or that the range is being zapped due to memslot removal (loosely detected by slots_lock being held). Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe as KVM's page fault path snapshots state before acquiring mmu_lock, and thus can create SPTEs with stale information if vCPUs aren't forced to retry faults (due to seeing an in-progress or past MMU invalidation). Memslot removal is a special case, as the memslot is retrieved outside of mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and instead relies on SRCU synchronization to ensure any in-flight page faults are fully resolved before zapping SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009192345.1148353-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslotSean Christopherson1-10/+6
When performing a targeted zap on memslot removal, zap only MMU pages that shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't exist, because it doesn't do what most people would it expect it to do. The "round gfn for level" adjustment that is done for direct SPs (no gPTE) means that the exact gfn comparison will not get a match, even when a SP does "cover" a gfn, or was even created specifically for a gfn. For memslot deletion specifically, KVM's behavior will vary significantly based on the size and alignment of a memslot, and in weird ways. E.g. for a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if it's only 4KiB aligned. And as described below, zapping SPs in the aligned case overzaps for direct MMUs, as odds are good the upper-level SPs are serving other memslots. To iterate over all potentially-relevant gfns, KVM would need to make a pass over the hash table for each level, with the gfn used for lookup rounded for said level. And then check that the SP is of the correct level, too, e.g. to avoid over-zapping. But even then, KVM would massively overzap, as processing every level is all but guaranteed to zap SPs that serve other memslots, especially if the memslot being removed is relatively small. KVM could mitigate that issue by processing only levels that can be possible guest huge pages, i.e. are less likely to be re-used for other memslot, but while somewhat logical, that's quite arbitrary and would be a bit of a mess to implement. So, zap only SPs with gPTEs, as the resulting behavior is easy to describe, is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that absolutely must be zapped. Cc: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241009192345.1148353-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20x86/kvm: Override default caching mode for SEV-SNP and TDXKirill A. Shutemov1-0/+4
AMD SEV-SNP and Intel TDX have limited access to MTRR: either it is not advertised in CPUID or it cannot be programmed (on TDX, due to #VE on CR0.CD clear). This results in guests using uncached mappings where it shouldn't and pmd/pud_set_huge() failures due to non-uniform memory type reported by mtrr_type_lookup(). Override MTRR state, making it WB by default as the kernel does for Hyper-V guests. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Binbin Wu <binbin.wu@intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241015095818.357915-1-kirill.shutemov@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomicDr. David Alan Gilbert3-8/+1
The last use of kvm_vcpu_gfn_to_pfn_atomic was removed by commit 1bbc60d0c7e5 ("KVM: x86/mmu: Remove MMU auditing") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Message-ID: <20241001141354.18009-3-linux@treblig.org> [Adjust Documentation/virt/kvm/locking.rst. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: Remove unused kvm_vcpu_gfn_to_pfnDr. David Alan Gilbert2-7/+0
The last use of kvm_vcpu_gfn_to_pfn was removed by commit b1624f99aa8f ("KVM: Remove kvm_vcpu_gfn_to_page() and kvm_vcpu_gpa_to_page()") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Message-ID: <20241001141354.18009-2-linux@treblig.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20Merge tag 'io_uring-6.12-20241019' of git://git.kernel.dk/linuxLinus Torvalds1-1/+1
Pull one more io_uring fix from Jens Axboe: "Fix for a regression introduced in 6.12-rc2, where a condition check was negated and hence -EAGAIN would bubble back up up to userspace rather than trigger a retry condition" * tag 'io_uring-6.12-20241019' of git://git.kernel.dk/linux: io_uring/rw: fix wrong NOWAIT check in io_rw_init_file()
2024-10-19Merge tag 'scsi-fixes' of ↵Linus Torvalds5-42/+45
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Fixes all in drivers. The largest is the mpi3mr which corrects a phy count limit that should only apply to the controller but was being incorrectly applied to expander phys" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: target: core: Fix null-ptr-deref in target_alloc_device() scsi: mpi3mr: Validate SAS port assignments scsi: ufs: core: Set SDEV_OFFLINE when UFS is shut down scsi: ufs: core: Requeue aborted request scsi: ufs: core: Fix the issue of ICU failure
2024-10-19Merge tag 'ftrace-v6.12-rc3' of ↵Linus Torvalds1-8/+23
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: "A couple of fixes to function graph infrastructure: - Fix allocation of idle shadow stack allocation during hotplug If function graph tracing is started when a CPU is offline, if it were come online during the trace then the idle task that represents the CPU will not get a shadow stack allocated for it. This means all function graph hooks that happen while that idle task is running (including in interrupt mode) will have all its events dropped. Switch over to the CPU hotplug mechanism that will have any newly brought on line CPU get a callback that can allocate the shadow stack for its idle task. - Fix allocation size of the ret_stack_list array When function graph tracing converted over to allowing more than one user at a time, it had to convert its shadow stack from an array of ret_stack structures to an array of unsigned longs. The shadow stacks are allocated in batches of 32 at a time and assigned to every running task. The batch is held by the ret_stack_list array. But when the conversion happened, instead of allocating an array of 32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE). This ret_stack_list gets passed to a function that iterates over what it believes is its size defined by the FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32). Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise this would have been an array overflow. This still should be fixed and the ret_stack_list should be allocated to the size it is expected to be as someday it may end up being bigger than SHADOW_STACK_SIZE" * tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fgraph: Allocate ret_stack_list with proper size fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks