summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-07-27hostfs: fix the host directory parse when mounting.Hongbo Li1-10/+55
hostfs not keep the host directory when mounting. When the host directory is none (default), fc->source is used as the host root directory, and this is wrong. Here we use `parse_monolithic` to handle the old mount path for parsing the root directory. For new mount path, The `parse_param` is used for the host directory parse. Reported-and-tested-by: Maciej Żenczykowski <maze@google.com> Fixes: cd140ce9f611 ("hostfs: convert hostfs to use the new mount API") Link: https://lore.kernel.org/all/CANP3RGceNzwdb7w=vPf5=7BCid5HVQDmz1K5kC9JG42+HVAh_g@mail.gmail.com/ Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20240725065130.1821964-1-lihongbo22@huawei.com [brauner: minor fixes] Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-27fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNTSeth Forshee (DigitalOcean)1-0/+11
Christian noticed that it is possible for a privileged user to mount most filesystems with a non-initial user namespace in sb->s_user_ns. When fsopen() is called in a non-init namespace the caller's namespace is recorded in fs_context->user_ns. If the returned file descriptor is then passed to a process priviliged in init_user_ns, that process can call fsconfig(fd_fs, FSCONFIG_CMD_CREATE), creating a new superblock with sb->s_user_ns set to the namespace of the process which called fsopen(). This is problematic. We cannot assume that any filesystem which does not set FS_USERNS_MOUNT has been written with a non-initial s_user_ns in mind, increasing the risk for bugs and security issues. Prevent this by returning EPERM from sget_fc() when FS_USERNS_MOUNT is not set for the filesystem and a non-initial user namespace will be used. sget() does not need to be updated as it always uses the user namespace of the current context, or the initial user namespace if SB_SUBMOUNT is set. Fixes: cb50b348c71f ("convenience helpers: vfs_get_super() and sget_fc()") Reported-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/20240724-s_user_ns-fix-v1-1-895d07c94701@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-27ALSA: firewire-lib: fix wrong value as length of header for CIP_NO_HEADER caseTakashi Sakamoto1-2/+1
In a commit 1d717123bb1a ("ALSA: firewire-lib: Avoid -Wflex-array-member-not-at-end warning"), DEFINE_FLEX() macro was used to handle variable length of array for header field in struct fw_iso_packet structure. The usage of macro has a side effect that the designated initializer assigns the count of array to the given field. Therefore CIP_HEADER_QUADLETS (=2) is assigned to struct fw_iso_packet.header, while the original designated initializer assigns zero to all fields. With CIP_NO_HEADER flag, the change causes invalid length of header in isochronous packet for 1394 OHCI IT context. This bug affects all of devices supported by ALSA fireface driver; RME Fireface 400, 800, UCX, UFX, and 802. This commit fixes the bug by replacing it with the alternative version of macro which corresponds no initializer. Cc: stable@vger.kernel.org Fixes: 1d717123bb1a ("ALSA: firewire-lib: Avoid -Wflex-array-member-not-at-end warning") Reported-by: Edmund Raile <edmund.raile@proton.me> Closes: https://lore.kernel.org/r/rrufondjeynlkx2lniot26ablsltnynfaq2gnqvbiso7ds32il@qk4r6xps7jh2/ Reviewed-by: Takashi Iwai <tiwai@suse.de> Link: https://lore.kernel.org/r/20240725155640.128442-1-o-takashi@sakamocchi.jp Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
2024-07-27Revert "firewire: Annotate struct fw_iso_packet with __counted_by()"Takashi Sakamoto1-3/+2
This reverts commit d3155742db89df3b3c96da383c400e6ff4d23c25. The header_length field is byte unit, thus it can not express the number of elements in header field. It seems that the argument for counted_by attribute can have no arithmetic expression, therefore this commit just reverts the issued commit. Suggested-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20240725161648.130404-1-o-takashi@sakamocchi.jp Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
2024-07-27minmax: avoid overly complicated constant expressions in VM codeLinus Torvalds2-2/+9
The minmax infrastructure is overkill for simple constants, and can cause huge expansions because those simple constants are then used by other things. For example, 'pageblock_order' is a core VM constant, but because it was implemented using 'min_t()' and all the type-checking that involves, it actually expanded to something like 2.5kB of preprocessor noise. And when that simple constant was then used inside other expansions: #define pageblock_nr_pages (1UL << pageblock_order) #define pageblock_start_pfn(pfn) ALIGN_DOWN((pfn), pageblock_nr_pages) and we then use that inside a 'max()' macro: case ISOLATE_SUCCESS: update_cached = false; last_migrated_pfn = max(cc->zone->zone_start_pfn, pageblock_start_pfn(cc->migrate_pfn - 1)); the end result was that one statement expanding to 253kB in size. There are probably other cases of this, but this one case certainly stood out. I've added 'MIN_T()' and 'MAX_T()' macros for this kind of "core simple constant with specific type" use. These macros skip the type checking, and as such need to be very sparingly used only for obvious cases that have active issues like this. Reported-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/all/36aa2cad-1db1-4abf-8dd2-fb20484aabc3@lucifer.local/ Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-27minmax: avoid overly complex min()/max() macro arguments in xenLinus Torvalds1-2/+3
We have some very fancy min/max macros that have tons of sanity checking to warn about mixed signedness etc. This is all things that a sane compiler should warn about, but there are no sane compiler interfaces for this, and '-Wsign-compare' is broken [1] and not useful. So then we compensate (some would say over-compensate) by doing the checks manually with some truly horrid macro games. And no, we can't just use __builtin_types_compatible_p(), because the whole question of "does it make sense to compare these two values" is a lot more complicated than that. For example, it makes a ton of sense to compare unsigned values with simple constants like "5", even if that is indeed a signed type. So we have these very strange macros to try to make sensible type checking decisions on the arguments to 'min()' and 'max()'. But that can cause enormous code expansion if the min()/max() macros are used with complicated expressions, and particularly if you nest these things so that you get the first big expansion then expanded again. The xen setup.c file ended up ballooning to over 50MB of preprocessed noise that takes 15s to compile (obviously depending on the build host), largely due to one single line. So let's split that one single line to just be simpler. I think it ends up being more legible to humans too at the same time. Now that single file compiles in under a second. Reported-and-reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/all/c83c17bb-be75-4c67-979d-54eee38774c6@lucifer.local/ Link: https://staticthinking.wordpress.com/2023/07/25/wsign-compare-is-garbage/ [1] Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-27nilfs2: handle inconsistent state in nilfs_btnode_create_block()Ryusuke Konishi2-7/+22
Syzbot reported that a buffer state inconsistency was detected in nilfs_btnode_create_block(), triggering a kernel bug. It is not appropriate to treat this inconsistency as a bug; it can occur if the argument block address (the buffer index of the newly created block) is a virtual block number and has been reallocated due to corruption of the bitmap used to manage its allocation state. So, modify nilfs_btnode_create_block() and its callers to treat it as a possible filesystem error, rather than triggering a kernel bug. Link: https://lkml.kernel.org/r/20240725052007.4562-1-konishi.ryusuke@gmail.com Fixes: a60be987d45d ("nilfs2: B-tree node cache") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+89cc4f2324ed37988b60@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60 Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27selftests/mm: skip test for non-LPA2 and non-LVA systemsDev Jain1-1/+15
Post my improvement of the test in e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing"): The test begins to fail on 4k and 16k pages, on non-LPA2 systems. To reduce noise in the CI systems, let us skip the test when higher address space is not implemented. Link: https://lkml.kernel.org/r/20240718052504.356517-1-dev.jain@arm.com Fixes: e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing") Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27mm/page_alloc: fix pcp->count race between drain_pages_zone() vs ↵Li Zhijian1-7/+11
__rmqueue_pcplist() It's expected that no page should be left in pcp_list after calling zone_pcp_disable() in offline_pages(). Previously, it's observed that offline_pages() gets stuck [1] due to some pages remaining in pcp_list. Cause: There is a race condition between drain_pages_zone() and __rmqueue_pcplist() involving the pcp->count variable. See below scenario: CPU0 CPU1 ---------------- --------------- spin_lock(&pcp->lock); __rmqueue_pcplist() { zone_pcp_disable() { /* list is empty */ if (list_empty(list)) { /* add pages to pcp_list */ alloced = rmqueue_bulk() mutex_lock(&pcp_batch_high_lock) ... __drain_all_pages() { drain_pages_zone() { /* read pcp->count, it's 0 here */ count = READ_ONCE(pcp->count) /* 0 means nothing to drain */ /* update pcp->count */ pcp->count += alloced << order; ... ... spin_unlock(&pcp->lock); In this case, after calling zone_pcp_disable() though, there are still some pages in pcp_list. And these pages in pcp_list are neither movable nor isolated, offline_pages() gets stuck as a result. Solution: Expand the scope of the pcp->lock to also protect pcp->count in drain_pages_zone(), to ensure no pages are left in the pcp list after zone_pcp_disable() [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/ Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_nodeRoman Gushchin1-0/+1
Oliver Sand reported a performance regression caused by commit 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node"), which puts some fields of the mem_cgroup_per_node structure under the CONFIG_MEMCG_V1 config option. Apparently it causes a false cache sharing between lruvec and lru_zone_size members of the structure. Fix it by adding an explicit padding after the lruvec member. Even though the padding is not required with CONFIG_MEMCG_V1 set, it seems like the introduced memory overhead is not significant enough to warrant another divergence in the mem_cgroup_per_node layout, so the padding is added unconditionally. Link: https://lkml.kernel.org/r/20240723171244.747521-1-roman.gushchin@linux.dev Fixes: 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407121335.31a10cb6-oliver.sang@intel.com Tested-by: Oliver Sang <oliver.sang@intel.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27alloc_tag: outline and export free_reserved_page()Suren Baghdasaryan2-15/+18
Outline and export free_reserved_page() because modules use it and it in turn uses page_ext_{get|put} which should not be exported. The same result could be obtained by outlining {get|put}_page_tag_ref() but that would have higher performance impact as these functions are used in more performance critical paths. Link: https://lkml.kernel.org/r/20240717212844.2749975-1-surenb@google.com Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407080044.DWMC9N9I-lkp@intel.com/ Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27decompress_bunzip2: fix rare decompression failureRoss Lagerwall1-1/+2
The decompression code parses a huffman tree and counts the number of symbols for a given bit length. In rare cases, there may be >= 256 symbols with a given bit length, causing the unsigned char to overflow. This causes a decompression failure later when the code tries and fails to find the bit length for a given symbol. Since the maximum number of symbols is 258, use unsigned short instead. Link: https://lkml.kernel.org/r/20240717162016.1514077-1-ross.lagerwall@citrix.com Fixes: bc22c17e12c1 ("bzip2/lzma: library support for gzip, bzip2 and lzma decompression") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Cc: Alain Knaff <alain@knaff.lu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27mm/huge_memory: avoid PMD-size page cache if neededGavin Shan2-5/+19
xarray can't support arbitrary page cache size. the largest and supported page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d90642a71 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However, it's possible to have 512MB page cache in the huge memory's collapsing path on ARM64 system whose base page size is 64KB. 512MB page cache is breaking the limitation and a warning is raised when the xarray entry is split as shown in the following example. [root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize KernelPageSize: 64 kB [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c : int main(int argc, char **argv) { const char *filename = TEST_XFS_FILENAME; int fd = 0; void *buf = (void *)-1, *p; int pgsize = getpagesize(); int ret = 0; if (pgsize != 0x10000) { fprintf(stdout, "System with 64KB base page size is required!\n"); return -EPERM; } system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb"); system("echo 1 > /proc/sys/vm/drop_caches"); /* Open the xfs file */ fd = open(filename, O_RDONLY); assert(fd > 0); /* Create VMA */ buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0); assert(buf != (void *)-1); fprintf(stdout, "mapped buffer at 0x%p\n", buf); /* Populate VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ); assert(ret == 0); /* Collapse VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE); if (ret) { fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno); goto out; } /* Split xarray entry. Write permission is needed */ munmap(buf, TEST_MEM_SIZE); buf = (void *)-1; close(fd); fd = open(filename, O_RDWR); assert(fd > 0); fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, TEST_MEM_SIZE - pgsize, pgsize); out: if (buf != (void *)-1) munmap(buf, TEST_MEM_SIZE); if (fd > 0) close(fd); return ret; } [root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test [root@dhcp-10-26-1-207 ~]# /tmp/test ------------[ cut here ]------------ WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \ xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \ sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x780 sp : ffff8000ac32f660 x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0 x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000 x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8 x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x780 truncate_inode_partial_folio+0xdc/0x160 truncate_inode_pages_range+0x1b4/0x4a8 truncate_pagecache_range+0x84/0xa0 xfs_flush_unmap_range+0x70/0x90 [xfs] xfs_file_fallocate+0xfc/0x4d8 [xfs] vfs_fallocate+0x124/0x2f0 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 Fix it by correcting the supported page cache orders, different sets for DAX and other files. With it corrected, 512MB page cache becomes disallowed on all non-DAX files on ARM64 system where the base page size is 64KB. After this patch is applied, the test program fails with error -EINVAL returned from __thp_vma_allowable_orders() and the madvise() system call to collapse the page caches. Link: https://lkml.kernel.org/r/20240715000423.316491-1-gshan@redhat.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit ↵Yang Shi1-1/+1
machines Yves-Alexis Perez reported commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT. !CONFIG_64BIT should cover all 32 bit machines. [1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240712155855.1130330-1-yang@os.amperecomputing.com Fixes: 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: Yves-Alexis Perez <corsac@debian.org> Tested-by: Yves-Alexis Perez <corsac@debian.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Salvatore Bonaccorso <carnil@debian.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> [6.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27mm: fix old/young bit handling in the faulting pathRam Tummala1-1/+1
Commit 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()") replaced do_set_pte() with set_pte_range() and that introduced a regression in the following faulting path of non-anonymous vmas which caused the PTE for the faulting address to be marked as old instead of young. handle_pte_fault() do_pte_missing() do_fault() do_read_fault() || do_cow_fault() || do_shared_fault() finish_fault() set_pte_range() The polarity of prefault calculation is incorrect. This leads to prefault being incorrectly set for the faulting address. The following check will incorrectly mark the PTE old rather than young. On some architectures this will cause a double fault to mark it young when the access is retried. if (prefault && arch_wants_old_prefaulted_pte()) entry = pte_mkold(entry); On a subsequent fault on the same address, the faulting path will see a non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE will then be correctly marked young in handle_pte_fault() itself. Due to this bug, performance degradation in the fault handling path will be observed due to unnecessary double faulting. Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com Fixes: 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()") Signed-off-by: Ram Tummala <rtummala@nvidia.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27dt-bindings: arm: update James Clark's email addressJames Clark2-2/+2
My new address is james.clark@linaro.org Link: https://lkml.kernel.org/r/20240709102512.31212-3-james.clark@linaro.org Signed-off-by: James Clark <james.clark@linaro.org> Cc: Bjorn Andersson <quic_bjorande@quicinc.com> Cc: Conor Dooley <conor+dt@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Geliang Tang <geliang@kernel.org> Cc: Hao Zhang <quic_hazha@quicinc.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org> Cc: Mao Jinlong <quic_jinlmao@quicinc.com> Cc: Matthieu Baerts <matttbe@kernel.org> Cc: Matt Ranostay <matt@ranostay.sg> Cc: Mike Leach <mike.leach@linaro.org> Cc: Oleksij Rempel <o.rempel@pengutronix.de> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-27MAINTAINERS: mailmap: update James Clark's email addressJames Clark2-2/+3
My new address is james.clark@linaro.org Link: https://lkml.kernel.org/r/20240709102512.31212-2-james.clark@linaro.org Signed-off-by: James Clark <james.clark@linaro.org> Cc: Bjorn Andersson <quic_bjorande@quicinc.com> Cc: Conor Dooley <conor+dt@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Geliang Tang <geliang@kernel.org> Cc: Hao Zhang <quic_hazha@quicinc.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Krzysztof Kozlowski <krzk+dt@kernel.org> Cc: Mao Jinlong <quic_jinlmao@quicinc.com> Cc: Matthieu Baerts <matttbe@kernel.org> Cc: Matt Ranostay <matt@ranostay.sg> Cc: Mike Leach <mike.leach@linaro.org> Cc: Oleksij Rempel <o.rempel@pengutronix.de> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-26dt-bindings: iio: adc: ad7192: Fix 'single-channel' constraintsRob Herring (Arm)1-3/+2
The 'single-channel' property is an uint32, not an array, so 'items' is an incorrect constraint. This didn't matter until dtschema recently changed how properties are decoded. This results in this warning: Documentation/devicetree/bindings/iio/adc/adi,ad7192.example.dtb: adc@0: \ channel@1:single-channel: 1 is not of type 'array' Fixes: caf7b7632b8d ("dt-bindings: iio: adc: ad7192: Add AD7194 support") Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20240723230904.1299744-1-robh@kernel.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2024-07-26tools/power turbostat: version 2024.07.26Len Brown1-53/+52
Release 2024.07.26: Enable turbostat extensions to add both perf and PMT (Intel Platform Monitoring Technology) counters from the cmdline. Demonstrate PMT access with built-in support for Meteor Lake's Die%c6 counter. This commit: Clean up white-space nits introduced since version 2024.05.10 Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26tools/power turbostat: Include umask=%x in perf counter's configPatryk Wlazlyn1-10/+50
Some counters, like cpu/cache-misses/, expose and require umask=%x parameter alongside event=%x in the sysfs perf counter's event file. This change make sure we parse and use it when opening user added counters. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26tools/power turbostat: Document PMT in turbostat.8Patryk Wlazlyn1-0/+65
Add a general description of the user interface for adding PMT counters with the new --add pmt,... option. Provide a complete example for requesting two counters. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26tools/power turbostat: Add MTL's PMT DC6 builtin counterPatryk Wlazlyn1-1/+69
Provide a definition for metadata that allows reading DC6 residency counter via PMT and exposes it as a builtin counter. Note that this residency counter is updated and read via entirely different mechanisms vs the MSR-based residency counters. On MTL processors, there are times when Die%c6 will report above 100%. This is still useful, but don't expect 3 digits of precision... Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26tools/power turbostat: Add early support for PMT countersPatryk Wlazlyn1-2/+766
Allows users to read Intel PMT (Platform Monitoring Technology) counters, providing interface similar to one used to add MSR and perf counters. Because PMT is exposed as a raw MMIO range, without metadata, user has to supply the necessary information to find and correctly display the requested counter. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26Merge tag 'auxdisplay-for-v6.11-tag1' of ↵Linus Torvalds7-4/+17
git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k Pull auxdisplay updates from Geert Uytterhoeven: - add support for configuring the boot message on line displays - miscellaneous fixes and improvements * tag 'auxdisplay-for-v6.11-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: auxdisplay: ht16k33: Drop reference after LED registration auxdisplay: Use sizeof(*pointer) instead of sizeof(type) auxdisplay: hd44780: add missing MODULE_DESCRIPTION() macro auxdisplay: linedisp: add missing MODULE_DESCRIPTION() macro auxdisplay: linedisp: Support configuring the boot message auxdisplay: charlcd: Provide a forward declaration
2024-07-26Merge tag 'sound-fix-6.11-rc1' of ↵Linus Torvalds17-46/+434
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "A collection of fixes gathered since the previous pull. We see a bit large LOCs at a HD-audio quirk, but that's only bulk COEF data, hence it's safe to take. In addition to that, there were two minor fixes for MIDI 2.0 handling for ALSA core, and the rest are all rather random small and device-specific fixes" * tag 'sound-fix-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ASoC: fsl-asoc-card: Dynamically allocate memory for snd_soc_dai_link_components ASoC: amd: yc: Support mic on Lenovo Thinkpad E16 Gen 2 ALSA: hda/realtek: Implement sound init sequence for Samsung Galaxy Book3 Pro 360 ALSA: hda/realtek: cs35l41: Fixup remaining asus strix models ASoC: SOF: ipc4-topology: Preserve the DMA Link ID for ChainDMA on unprepare ASoC: SOF: ipc4-topology: Only handle dai_config with HW_PARAMS for ChainDMA ALSA: ump: Force 1 Group for MIDI1 FBs ALSA: ump: Don't update FB name for static blocks ALSA: usb-audio: Add a quirk for Sonix HD USB Camera ASoC: TAS2781: Fix tasdev_load_calibrated_data() ASoC: tegra: select CONFIG_SND_SIMPLE_CARD_UTILS ASoC: Intel: use soc_intel_is_byt_cr() only when IOSF_MBI is reachable ALSA: usb-audio: Move HD Webcam quirk to the right place ALSA: hda: tas2781: mark const variables as __maybe_unused ALSA: usb-audio: Fix microphone sound on HD webcam. ASoC: sof: amd: fix for firmware reload failure in Vangogh platform ASoC: Intel: Fix RT5650 SSP lookup ASOC: SOF: Intel: hda-loader: only wait for HDaudio IOC for IPC4 devices ASoC: SOF: imx8m: Fix DSP control regmap retrieval
2024-07-26Merge tag 'drm-next-2024-07-26' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds56-197/+690
Pull drm fixes from Dave Airlie: "Fixes for rc1, mostly amdgpu, i915 and xe, with some other misc ones, doesn't seem to be anything too serious. amdgpu: - Bump driver version for GFX12 DCC - DC documention warning fixes - VCN unified queue power fix - SMU fix - RAS fix - Display corruption fix - SDMA 5.2 workaround - GFX12 fixes - Uninitialized variable fix - VCN/JPEG 4.0.3 fixes - Misc display fixes - RAS fixes - VCN4/5 harvest fix - GPU reset fix i915: - Reset intel_dp->link_trained before retraining the link - Don't switch the LTTPR mode on an active link - Do not consider preemption during execlists_dequeue for gen8 - Allow NULL memory region xe: - xe_exec ioctl minor fix on sync entry cleanup upon error - SRIOV: limit VF LMEM provisioning - Wedge mode fixes v3d: - fix indirect dispatch on newer v3d revs panel: - fix panel backlight bindings" * tag 'drm-next-2024-07-26' of https://gitlab.freedesktop.org/drm/kernel: (39 commits) drm/amdgpu: reset vm state machine after gpu reset(vram lost) drm/amdgpu: add missed harvest check for VCN IP v4/v5 drm/amdgpu: Fix eeprom max record count drm/amdgpu: fix ras UE error injection failure issue drm/amd/display: Remove ASSERT if significance is zero in math_ceil2 drm/amd/display: Check for NULL pointer drm/amdgpu/vcn: Use offsets local to VCN/JPEG in VF drm/amdgpu: Add empty HDP flush function to VCN v4.0.3 drm/amdgpu: Add empty HDP flush function to JPEG v4.0.3 drm/amd/amdgpu: Fix uninitialized variable warnings drm/amdgpu: Fix atomics on GFX12 drm/amdgpu/sdma5.2: Update wptr registers as well as doorbell drm/i915: Allow NULL memory region drm/i915/gt: Do not consider preemption during execlists_dequeue for gen8 dt-bindings: display: panel: samsung,atna33xc20: Document ATNA45AF01 drm/xe: Don't suspend device upon wedge drm/xe: Wedge the entire device drm/xe/pf: Limit fair VF LMEM provisioning drm/xe/exec: Fix minor bug related to xe_sync_entry_cleanup drm/amd/display: fix corruption with high refresh rates on DCN 3.0 ...
2024-07-26tools/power turbostat: Add selftests for added perf countersPatryk Wlazlyn1-0/+178
Test adds several perf counters from msr, cstate_core and cstate_pkg groups and checks if the columns for those counters show up. The test skips the counters that are not present. It is not an error, but the test may not be as exhaustive. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26tools/power turbostat: Add selftests for SMI, APERF and MPERF countersPatryk Wlazlyn1-0/+157
The test requests BICs that are dependent on SMI, APERF and MPERF counters and checks if the columns show up in the output and the turbostat doesn't crash. Read the counters in both --no-msr and --no-perf mode. The test skips counters that are not present or user does not have permissions to read. It is not an error, but the test may not be as exhaustive. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26Merge tag 's390-6.11-2' of ↵Linus Torvalds52-525/+746
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Vasily Gorbik: - Fix KMSAN build breakage caused by the conflict between s390 and mm-stable trees - Add KMSAN page markers for ptdump - Add runtime constant support - Fix __pa/__va for modules under non-GPL licenses by exporting necessary vm_layout struct with EXPORT_SYMBOL to prevent linkage problems - Fix an endless loop in the CF_DIAG event stop in the CPU Measurement Counter Facility code when the counter set size is zero - Remove the PROTECTED_VIRTUALIZATION_GUEST config option and enable its functionality by default - Support allocation of multiple MSI interrupts per device and improve logging of architecture-specific limitations - Add support for lowcore relocation as a debugging feature to catch all null ptr dereferences in the kernel address space, improving detection beyond the current implementation's limited write access protection - Clean up and rework CPU alternatives to allow for callbacks and early patching for the lowcore relocation * tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits) s390: Remove protvirt and kvm config guards for uv code s390/boot: Add cmdline option to relocate lowcore s390/kdump: Make kdump ready for lowcore relocation s390/entry: Make system_call() ready for lowcore relocation s390/entry: Make ret_from_fork() ready for lowcore relocation s390/entry: Make __switch_to() ready for lowcore relocation s390/entry: Make restart_int_handler() ready for lowcore relocation s390/entry: Make mchk_int_handler() ready for lowcore relocation s390/entry: Make int handlers ready for lowcore relocation s390/entry: Make pgm_check_handler() ready for lowcore relocation s390/entry: Add base register to CHECK_VMAP_STACK/CHECK_STACK macro s390/entry: Add base register to SIEEXIT macro s390/entry: Add base register to MBEAR macro s390/entry: Make __sie64a() ready for lowcore relocation s390/head64: Make startup code ready for lowcore relocation s390: Add infrastructure to patch lowcore accesses s390/atomic_ops: Disable flag outputs constraint for GCC versions below 14.2.0 s390/entry: Move SIE indicator flag to thread info s390/nmi: Simplify ptregs setup s390/alternatives: Remove alternative facility list ...
2024-07-26tools/power turbostat: Move verbose counter messages to level 2Patryk Wlazlyn1-13/+11
Printing information about the source and value during initialization and reading of the counter for each cpu, while useful when debugging, results in too verbose output. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26tools/power turbostat: Move debug prints from stdout to stderrPatryk Wlazlyn1-8/+9
This leaves the stdout cleaner, having only counter data. It makes it easier for programs to parse the output of turbostat, for example selftests. Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2024-07-26Merge tag 'arm64-fixes' of ↵Linus Torvalds6-9/+31
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The usual summary below, but the main fix is for the fast GUP lockless page-table walk when we have a combination of compile-time and run-time folding of the p4d and the pud respectively. - Remove some redundant Kconfig conditionals - Fix string output in ptrace selftest - Fix fast GUP crashes in some page-table configurations - Remove obsolete linker option when building the vDSO - Fix some sysreg field definitions for the GIC" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: mm: Fix lockless walks with static and dynamic page-table folding arm64/sysreg: Correct the values for GICv4.1 arm64/vdso: Remove --hash-style=sysv kselftest: missing arg in ptrace.c arm64/Kconfig: Remove redundant 'if HAVE_FUNCTION_GRAPH_TRACER' arm64: remove redundant 'if HAVE_ARCH_KASAN' in Kconfig
2024-07-26smb3: add dynamic trace point for session setup key expired failuresSteve French2-1/+47
There are cases where services need to remount (or change their credentials files) when keys have expired, but it can be helpful to have a dynamic trace point to make it easier to notify the service to refresh the storage account key. Here is sample output, one from mount with bad password, one from a reconnect where the password has been changed or expired and reconnect fails (requiring remount with new storage account key) TASK-PID CPU# ||||| TIMESTAMP FUNCTION | | | ||||| | | mount.cifs-11362 [000] ..... 6000.241620: smb3_key_expired: rc=-13 user=testpassu conn_id=0x2 server=localhost addr=127.0.0.1:445 kworker/4:0-8458 [004] ..... 6044.892283: smb3_key_expired: rc=-13 user=testpassu conn_id=0x3 server=localhost addr=127.0.0.1:445 Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-07-26Merge tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds6-22/+34
Pull ceph updates from Ilya Dryomov: "A small patchset to address bogus I/O errors and ultimately an assertion failure in the face of watch errors with -o exclusive mappings in RBD marked for stable and some assorted CephFS fixes" * tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client: rbd: don't assume rbd_is_lock_owner() for exclusive mappings rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait ceph: fix incorrect kmalloc size of pagevec mempool ceph: periodically flush the cap releases ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch() ceph: use cap_wait_list only if debugfs is enabled
2024-07-26smb3: add four dynamic tracepoints for copy_file_range and reflinkSteve French2-1/+67
Add more dynamic tracepoints to help debug copy_file_range (copychunk) and clone_range ("duplicate extents"). These are tracepoints for entering the function and completing without error. For example: "trace-cmd record -e smb3_copychunk_enter -e smb3_copychunk_done" or "trace-cmd record -e smb3_clone_enter -e smb3_clone_done" Here is sample output: TASK-PID CPU# ||||| TIMESTAMP FUNCTION | | | ||||| | | cp-5964 [005] ..... 2176.168977: smb3_clone_enter: xid=17 sid=0xeb275be4 tid=0x7ffa7cdb source fid=0x1ed02e15 source offset=0x0 target fid=0x1ed02e15 target offset=0x0 len=0xa0000 cp-5964 [005] ..... 2176.170668: smb3_clone_done: xid=17 sid=0xeb275be4 tid=0x7ffa7cdb source fid=0x1ed02e15 source offset=0x0 target fid=0x1ed02e15 target offset=0x0 len=0xa0000 Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-07-26smb3: add dynamic tracepoint for reflink errorsSteve French2-0/+62
There are cases where debugging clone_range ("smb2_duplicate_extents" function) and in the future copy_range ("smb2_copychunk_range") can be helpful. Add dynamic trace points for any errors in clone, and a followon patch will add them for copychunk. "trace-cmd record -e smb3_clone_err" Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-07-26Merge tag 'erofs-for-6.11-rc1-2' of ↵Linus Torvalds5-21/+49
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull more erofs updates from Gao Xiang: - Support STATX_DIOALIGN and FS_IOC_GETFSSYSFSPATH - Fix a race of LZ4 decompression due to recent refactoring - Another multi-page folio adaption in erofs_bread() * tag 'erofs-for-6.11-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: convert comma to semicolon erofs: support multi-page folios for erofs_bread() erofs: add support for FS_IOC_GETFSSYSFSPATH erofs: fix race in z_erofs_get_gbuf() erofs: support STATX_DIOALIGN
2024-07-26Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-2/+6
Pull struct file leak fixes from Al Viro: "a couple of leaks on failure exits missing fdput()" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: lirc: rc_dev_get_from_fd(): fix file leak powerpc: fix a file leak in kvm_vcpu_ioctl_enable_cap()
2024-07-26arm64: allow installing compressed image by defaultLinus Torvalds2-2/+19
On arm64 we build compressed images, but "make install" by default will install the old non-compressed one. To actually get the compressed image install, you need to use "make zinstall", which is not the usual way to install a kernel. Which may not sound like much of an issue, but when you deal with multiple architectures (and years of your fingers knowing the regular "make install" incantation), this inconsistency is pretty annoying. But as Will Deacon says: "Sadly, bootloaders being as top quality as you might expect, I don't think we're in a position to rely on decompressor support across the board. Our Image.gz is literally just that -- we don't have a built-in decompressor (nor do I think we want to rush into that again after the fun we had on arm32) and the recent EFI zboot support solves that problem for platforms using EFI. Changing the default 'install' target terrifies me. There are bound to be folks with embedded boards who've scripted this and we could really ruin their day if we quietly give them a compressed kernel that their bootloader doesn't know how to handle :/" So make this conditional on a new "COMPRESSED_INSTALL" option. Cc: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-26Merge tag 'bitmap-6.11-rc1' of https://github.com:/norov/linuxLinus Torvalds10-70/+54
Pull bitmap updates from Yury Norov: "Random fixes" * tag 'bitmap-6.11-rc1' of https://github.com:/norov/linux: riscv: Remove unnecessary int cast in variable_fls() radix tree test suite: put definition of bitmap_clear() into lib/bitmap.c bitops: Add a comment explaining the double underscore macros lib: bitmap: add missing MODULE_DESCRIPTION() macros cpumask: introduce assign_cpu() macro
2024-07-26io_uring/napi: pass ktime to io_napi_adjust_timeoutPavel Begunkov3-17/+11
Pass the waiting time for __io_napi_adjust_timeout as ktime and get rid of all timespec64 conversions. It's especially simpler since the caller already have a ktime. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4f5b8e8eed4f53a1879e031a6712b25381adc23d.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-26io_uring/napi: use ktime in busy pollingPavel Begunkov4-24/+30
It's more natural to use ktime/ns instead of keeping around usec, especially since we're comparing it against user provided timers, so convert napi busy poll internal handling to ktime. It's also nicer since the type (ktime_t vs unsigned long) now tells the unit of measure. Keep everything as ktime, which we convert to/from micro seconds for IORING_[UN]REGISTER_NAPI. The net/ busy polling works seems to work with usec, however it's not real usec as shift by 10 is used to get it from nsecs, see busy_loop_current_time(), so it's easy to get truncated nsec back and we get back better precision. Note, we can further improve it later by removing the truncation and maybe convincing net/ to use ktime/ns instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/95e7ec8d095069a3ed5d40a4bc6f8b586698bc7e.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-26regulator: Further restrict RZG2L USB VBCTRL regulator dependenciesMark Brown1-1/+1
Since the regulator can't be used without the USB controller also tighten the dependency to match, as well as the default. Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://patch.msgid.link/20240726-regulator-restrict-rzg2l-v1-1-640e508896e2@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
2024-07-26Merge tag 'nvme-6.11-2024-07-26' of git://git.infradead.org/nvme into block-6.11Jens Axboe5-8/+30
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.11 - Fix request without payloads cleanup (Leon) - Use new protection information format (Francis) - Improved debug message for lost pci link (Bart) - Another apst quirk (Wang) - Use appropriate sysfs api for printing chars (Markus)" * tag 'nvme-6.11-2024-07-26' of git://git.infradead.org/nvme: nvme-pci: add missing condition check for existence of mapped data nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE nvme-pci: Fix the instructions for disabling power management nvme: remove redundant bdev local variable nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens() nvme/pci: Add APST quirk for Lenovo N60z laptop
2024-07-26RISC-V: Provide the frequency of time CSR via hwprobePalmer Dabbelt4-1/+9
The RISC-V architecture makes a real time counter CSR (via RDTIME instruction) available for applications in U-mode but there is no architected mechanism for an application to discover the frequency the counter is running at. Some applications (e.g., DPDK) use the time counter for basic performance analysis as well as fine grained time-keeping. Add support to the hwprobe system call to export the time CSR frequency to code running in U-mode. Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com> Reviewed-by: Evan Green <evan@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Punit Agrawal <punit.agrawal@bytedance.com> Link: https://lore.kernel.org/r/20240702033731.71955-2-cuiyunhui@bytedance.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-26riscv: Extend sv39 linear mapping max size to 128GStuart Menefy2-6/+7
This harmonizes all virtual addressing modes which can now all map (PGDIR_SIZE * PTRS_PER_PGD) / 4 of physical memory. The RISCV implementation of KASAN requires that the boundary between shallow mappings are aligned on an 8G boundary. In this case we need VMALLOC_START to be 8G aligned. So although we only need to move the start of the linear mapping down by 4GiB to allow 128GiB to be mapped, we actually move it down by 8GiB (creating a 4GiB hole between the linear mapping and KASAN shadow space) to maintain the alignment requirement. Signed-off-by: Stuart Menefy <stuart.menefy@codasip.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20240630110550.1731929-1-stuart.menefy@codasip.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-26Merge patch series "RISC-V: Select ACPI PPTT drivers"Palmer Dabbelt2-7/+29
This series adds support for ACPI PPTT via cacheinfo. * b4-shazam-merge: RISC-V: Select ACPI PPTT drivers riscv: cacheinfo: initialize cacheinfo's level and type from ACPI PPTT riscv: cacheinfo: remove the useless input parameter (node) of ci_leaf_init() Link: https://lore.kernel.org/r/20240617131425.7526-1-cuiyunhui@bytedance.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-26Merge patch "Enable SPCR table for console output on RISC-V"Palmer Dabbelt2-1/+12
Sia Jee Heng <jeeheng.sia@starfivetech.com> says: The ACPI SPCR code has been used to enable console output for ARM64 and X86. The same code can be reused for RISC-V. Furthermore, SPCR table is mandated for headless system as outlined in the RISC-V BRS Specification, chapter 6. * b4-shazam-merge: RISC-V: ACPI: Enable SPCR table for console output on RISC-V Link: https://lore.kernel.org/r/20240502073751.102093-1-jeeheng.sia@starfivetech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-26riscv: enable HAVE_ARCH_STACKLEAKJisheng Zhang4-1/+8
Add support for the stackleak feature. Whenever the kernel returns to user space the kernel stack is filled with a poison value. At the same time, disables the plugin in EFI stub code because EFI stub is out of scope for the protection. Tested on qemu and milkv duo: / # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT [ 38.675575] lkdtm: Performing direct entry STACKLEAK_ERASING [ 38.678448] lkdtm: stackleak stack usage: [ 38.678448] high offset: 288 bytes [ 38.678448] current: 496 bytes [ 38.678448] lowest: 1328 bytes [ 38.678448] tracked: 1328 bytes [ 38.678448] untracked: 448 bytes [ 38.678448] poisoned: 14312 bytes [ 38.678448] low offset: 8 bytes [ 38.689887] lkdtm: OK: the rest of the thread stack is properly erased Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20240623235316.2010-1-jszhang@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-26riscv: signal: Remove unlikely() from WARN_ON() conditionZhongqiu Han1-1/+1
"WARN_ON(unlikely(x))" is excessive. WARN_ON() already uses unlikely() internally. Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com> Reviewed-by: Bjorn Andersson <quic_bjorande@quicinc.com> Reviewed-by: Andy Chiu <andy.chiu@sifive.com> Link: https://lore.kernel.org/r/20240620033434.3778156-1-quic_zhonhan@quicinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>