summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-04-17MAINTAINERS: net: Add Oleksij to pse-pd maintainersKory Maincent (Dent Project)1-0/+7
Oleksij was the first to add support for pse-pd net subsystem. Add himself to the maintainers seems logical. Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20240414-feature_poe-v8-1-e4bf1e860da5@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17selftests: drv-net: add config for netdevsimJakub Kicinski1-0/+2
Real driver testing will obviously require enabling more options, but will require more manual setup in the first place. For CIs running purely software tests we need to enable netdevsim. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240416004556.1618804-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17selftests: drv-net: add stdout to the command failed exceptionJakub Kicinski1-1/+2
ping prints all the info to stdout. To make debug easier capture stdout in the Exception raised when command unexpectedly fails. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240416004556.1618804-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17ip6_vti: fix memleak on netns dismantleFlorian Westphal1-0/+1
kmemleak reports net_device resources are no longer released, restore needs_free_netdev toggle. Sample backtrace: unreferenced object 0xffff88810874f000 (size 4096): [..] [<00000000a2b8af8b>] __kmalloc_node+0x209/0x290 [<0000000040b0a1a9>] alloc_netdev_mqs+0x58/0x470 [<00000000b4be1e78>] vti6_init_net+0x94/0x230 [<000000008830c1ea>] ops_init+0x32/0xc0 [<000000006a26fa8f>] setup_net+0x134/0x2e0 [..] Cc: Breno Leitao <leitao@debian.org> Fixes: a9b2d55a8f1e ("ip6_vti: Do not use custom stat allocator") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240415122346.26503-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17dt-bindings: net: nxp,dwmac-imx: allow nvmem cells propertyPeng Fan1-0/+4
Allow nvmem-cells and nvmem-cell-names to get mac_address from onchip fuse. Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20240415103621.1644735-1-peng.fan@oss.nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17net: ipa: Remove unnecessary print function dev_err()Jiapeng Chong1-5/+1
The print function dev_err() is redundant because platform_get_irq_byname() already prints an error. ./drivers/net/ipa/ipa_interrupt.c:300:2-9: line 300 is redundant because platform_get_irq() already prints an error. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8756 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240415031456.10805-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17net/handshake: remove redundant assignment to variable retColin Ian King1-1/+0
The variable is being assigned an value and then is being re-assigned a new value in the next statement. The assignment is redundant and can be removed. Cleans up clang scan build warning: net/handshake/tlshd.c:216:2: warning: Value stored to 'ret' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415100713.483399-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17tun: limit printing rate when illegal packet received by tun devLei Chen1-8/+10
vhost_worker will call tun call backs to receive packets. If too many illegal packets arrives, tun_do_read will keep dumping packet contents. When console is enabled, it will costs much more cpu time to dump packet and soft lockup will be detected. net_ratelimit mechanism can be used to limit the dumping rate. PID: 33036 TASK: ffff949da6f20000 CPU: 23 COMMAND: "vhost-32980" #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253 #1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3 #2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e #3 [fffffe00003fced0] do_nmi at ffffffff8922660d #4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663 [exception RIP: io_serial_in+20] RIP: ffffffff89792594 RSP: ffffa655314979e8 RFLAGS: 00000002 RAX: ffffffff89792500 RBX: ffffffff8af428a0 RCX: 0000000000000000 RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff8af428a0 RBP: 0000000000002710 R8: 0000000000000004 R9: 000000000000000f R10: 0000000000000000 R11: ffffffff8acbf64f R12: 0000000000000020 R13: ffffffff8acbf698 R14: 0000000000000058 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffffa655314979e8] io_serial_in at ffffffff89792594 #6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470 #7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6 #8 [ffffa65531497a20] uart_console_write at ffffffff8978b605 #9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558 #10 [ffffa65531497ac8] console_unlock at ffffffff89316124 #11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07 #12 [ffffa65531497b68] printk at ffffffff89318306 #13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765 #14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun] #15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun] #16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net] #17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost] #18 [ffffa65531497f10] kthread at ffffffff892d2e72 #19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f Fixes: ef3db4a59542 ("tun: avoid BUG, dump packet on GSO errors") Signed-off-by: Lei Chen <lei.chen@smartx.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20240415020247.2207781-1-lei.chen@smartx.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17net: dsa: microchip: drop unneeded MODULE_ALIASKrzysztof Kozlowski1-7/+0
The ID table already has respective entry and MODULE_DEVICE_TABLE and creates proper alias for SPI driver. Having another MODULE_ALIAS causes the alias to be duplicated. Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Link: https://lore.kernel.org/r/20240414154929.127045-1-krzk@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-17bcachefs: make sure to release last journal pin in replayKent Overstreet1-1/+4
This fixes a deadlock when journal replay has many keys to insert that were from fsck, not the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17bcachefs: node scan: ignore multiple nodes with same seq if interiorKent Overstreet1-0/+2
Interior nodes are not really needed, when we have to scan - but if this pops up for leaf nodes we'll need a real heuristic. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17bcachefs: Fix format specifier in validate_bset_keys()Nathan Chancellor1-1/+1
When building for 32-bit platforms, for which size_t is 'unsigned int', there is a warning from a format string in validate_bset_keys(): fs/bcachefs/btree_io.c: In function 'validate_bset_keys': fs/bcachefs/btree_io.c:891:34: error: format '%lu' expects argument of type 'long unsigned int', but argument 12 has type 'unsigned int' [-Werror=format=] 891 | "bad k->u64s %u (min %u max %lu)", k->u64s, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/bcachefs/btree_io.c:603:32: note: in definition of macro 'btree_err' 603 | msg, ##__VA_ARGS__); \ | ^~~ fs/bcachefs/btree_io.c:887:21: note: in expansion of macro 'btree_err_on' 887 | if (btree_err_on(!bkeyp_u64s_valid(&b->format, k), | ^~~~~~~~~~~~ fs/bcachefs/btree_io.c:891:64: note: format string is defined here 891 | "bad k->u64s %u (min %u max %lu)", k->u64s, | ~~^ | | | long unsigned int | %u cc1: all warnings being treated as errors BKEY_U64s is size_t so the entire expression is promoted to size_t. Use the '%zu' specifier so that there is no warning regardless of the width of size_t. Fixes: 031ad9e7dbd1 ("bcachefs: Check for packed bkeys that are too big") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404130747.wH6Dd23p-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202404131536.HdAMBOVc-lkp@intel.com/ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINEKent Overstreet3-3/+19
We need to initialize the stdio redirects before they're used. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17nilfs2: fix OOB in nilfs_set_de_typeJeongjun Park1-1/+1
The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file is defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function, which uses this array, specifies the index to read from the array in the same way as "(mode & S_IFMT) >> S_SHIFT". static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode *inode) { umode_t mode = inode->i_mode; de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob } However, when the index is determined this way, an out-of-bounds (OOB) error occurs by referring to an index that is 1 larger than the array size when the condition "mode & S_IFMT == S_IFMT" is satisfied. Therefore, a patch to resize the nilfs_type_by_mode array should be applied to prevent OOB errors. Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com Reported-by: syzbot+2e22057de05b9f3b30d8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8 Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17MAINTAINERS: update Naoya Horiguchi's email addressNaoya Horiguchi2-2/+3
My old NEC address has been removed, so update MAINTAINERS and .mailmap to map it to my gmail address. Link: https://lkml.kernel.org/r/20240412181720.18452-1-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi <nao.horiguchi@gmail.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17fork: defer linking file vma until vma is fully initializedMiaohe Lin1-16/+17
Thorvald reported a WARNING [1]. And the root cause is below race: CPU 1 CPU 2 fork hugetlbfs_fallocate dup_mmap hugetlbfs_punch_hole i_mmap_lock_write(mapping); vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree. i_mmap_unlock_write(mapping); hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem! i_mmap_lock_write(mapping); hugetlb_vmdelete_list vma_interval_tree_foreach hugetlb_vma_trylock_write -- Vma_lock is cleared. tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem! hugetlb_vma_unlock_write -- Vma_lock is assigned!!! i_mmap_unlock_write(mapping); hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside i_mmap_rwsem lock while vma lock can be used in the same time. Fix this by deferring linking file vma until vma is fully initialized. Those vmas should be initialized first before they can be used. Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reported-by: Thorvald Natvig <thorvald@google.com> Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1] Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Tycho Andersen <tandersen@netflix.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm/shmem: inline shmem_is_huge() for disabled transparent hugepagesSumanth Korikkar2-6/+9
In order to minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y), compiler might choose to make a regular function call (out-of-line) for shmem_is_huge() instead of inlining it. When transparent hugepages are disabled (CONFIG_TRANSPARENT_HUGEPAGE=n), it can cause compilation error. mm/shmem.c: In function `shmem_getattr': ./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG' 383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) | ^~~~~~~~~ mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE' 1148 | stat->blksize = HPAGE_PMD_SIZE; To prevent the possible error, always inline shmem_is_huge() when transparent hugepages are disabled. Link: https://lkml.kernel.org/r/20240409155407.2322714-1-sumanthk@linux.ibm.com Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm,page_owner: defer enablement of static branchOscar Salvador1-1/+1
Kefeng Wang reported that he was seeing some memory leaks with kmemleak with page_owner enabled. The reason is that we enable the page_owner_inited static branch and then proceed with the linking of stack_list struct to dummy_stack, which means that exists a race window between these two steps where we can have pages already being allocated calling add_stack_record_to_list(), allocating objects and linking them to stack_list, but then we set stack_list pointing to dummy_stack in init_page_owner. Which means that the objects that have been allocated during that time window are unreferenced and lost. Fix this by deferring the enablement of the branch until we have properly set up the list. Link: https://lkml.kernel.org/r/20240409131715.13632-1-osalvador@suse.de Fixes: 4bedfb314bdd ("mm,page_owner: maintain own list of stack_records structs") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Closes: https://lore.kernel.org/linux-mm/74b147b0-718d-4d50-be75-d6afc801cd24@huawei.com/ Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17Squashfs: check the inode number is not the invalid value of zeroPhillip Lougher1-1/+4
Syskiller has produced an out of bounds access in fill_meta_index(). That out of bounds access is ultimately caused because the inode has an inode number with the invalid value of zero, which was not checked. The reason this causes the out of bounds access is due to following sequence of events: 1. Fill_meta_index() is called to allocate (via empty_meta_index()) and fill a metadata index. It however suffers a data read error and aborts, invalidating the newly returned empty metadata index. It does this by setting the inode number of the index to zero, which means unused (zero is not a valid inode number). 2. When fill_meta_index() is subsequently called again on another read operation, locate_meta_index() returns the previous index because it matches the inode number of 0. Because this index has been returned it is expected to have been filled, and because it hasn't been, an out of bounds access is performed. This patch adds a sanity check which checks that the inode number is not zero when the inode is created and returns -EINVAL if it is. [phillip@squashfs.org.uk: whitespace fix] Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com> Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport@ubisectech.com/ Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm,swapops: update check in is_pfn_swap_entry for hwpoison entriesOscar Salvador1-32/+33
Tony reported that the Machine check recovery was broken in v6.9-rc1, as he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to DRAM. After some more digging and debugging on his side, he realized that this went back to v6.1, with the introduction of 'commit 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'. That commit, among other things, introduced swp_offset_pfn(), replacing hwpoison_entry_to_pfn() in its favour. The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but is_pfn_swap_entry() never got updated to cover hwpoison entries, which means that we would hit the VM_BUG_ON whenever we would call swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM set. Fix this by updating the check to cover hwpoison entries as well, and update the comment while we are it. Link: https://lkml.kernel.org/r/20240407130537.16977-1-osalvador@suse.de Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: Tony Luck <tony.luck@intel.com> Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/ Tested-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: <stable@vger.kernel.org> [6.1.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabledMiaohe Lin1-3/+15
When I did hard offline test with hugetlb pages, below deadlock occurs: ====================================================== WARNING: possible circular locking dependency detected 6.8.0-11409-gf6cef5f8c37f #1 Not tainted ------------------------------------------------------ bash/46904 is trying to acquire lock: ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60 but task is already holding lock: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (pcp_batch_high_lock){+.+.}-{3:3}: __mutex_lock+0x6c/0x770 page_alloc_cpu_online+0x3c/0x70 cpuhp_invoke_callback+0x397/0x5f0 __cpuhp_invoke_callback_range+0x71/0xe0 _cpu_up+0xeb/0x210 cpu_up+0x91/0xe0 cpuhp_bringup_mask+0x49/0xb0 bringup_nonboot_cpus+0xb7/0xe0 smp_init+0x25/0xa0 kernel_init_freeable+0x15f/0x3e0 kernel_init+0x15/0x1b0 ret_from_fork+0x2f/0x50 ret_from_fork_asm+0x1a/0x30 -> #0 (cpu_hotplug_lock){++++}-{0:0}: __lock_acquire+0x1298/0x1cd0 lock_acquire+0xc0/0x2b0 cpus_read_lock+0x2a/0xc0 static_key_slow_dec+0x16/0x60 __hugetlb_vmemmap_restore_folio+0x1b9/0x200 dissolve_free_huge_page+0x211/0x260 __page_handle_poison+0x45/0xc0 memory_failure+0x65e/0xc70 hard_offline_page_store+0x55/0xa0 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xca/0x1e0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pcp_batch_high_lock); lock(cpu_hotplug_lock); lock(pcp_batch_high_lock); rlock(cpu_hotplug_lock); *** DEADLOCK *** 5 locks held by bash/46904: #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0 #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0 #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0 #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70 #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 stack backtrace: CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x68/0xa0 check_noncircular+0x129/0x140 __lock_acquire+0x1298/0x1cd0 lock_acquire+0xc0/0x2b0 cpus_read_lock+0x2a/0xc0 static_key_slow_dec+0x16/0x60 __hugetlb_vmemmap_restore_folio+0x1b9/0x200 dissolve_free_huge_page+0x211/0x260 __page_handle_poison+0x45/0xc0 memory_failure+0x65e/0xc70 hard_offline_page_store+0x55/0xa0 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xca/0x1e0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 RIP: 0033:0x7fc862314887 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887 RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001 RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00 In short, below scene breaks the lock dependency chain: memory_failure __page_handle_poison zone_pcp_disable -- lock(pcp_batch_high_lock) dissolve_free_huge_page __hugetlb_vmemmap_restore_folio static_key_slow_dec cpus_read_lock -- rlock(cpu_hotplug_lock) Fix this by calling drain_all_pages() instead. This issue won't occur until commit a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key"). As it introduced rlock(cpu_hotplug_lock) in dissolve_free_huge_page() code path while lock(pcp_batch_high_lock) is already in the __page_handle_poison(). [linmiaohe@huawei.com: extend comment per Oscar] [akpm@linux-foundation.org: reflow block comment] Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com Fixes: a6b40850c442 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm/userfaultfd: allow hugetlb change protection upon poison entryPeter Xu1-3/+7
After UFFDIO_POISON, there can be two kinds of hugetlb pte markers, either the POISON one or UFFD_WP one. Allow change protection to run on a poisoned marker just like !hugetlb cases, ignoring the marker irrelevant of the permission. Here the two bits are mutual exclusive. For example, when install a poisoned entry it must not be UFFD_WP already (by checking pte_none() before such install). And it also means if UFFD_WP is set there must have no POISON bit set. It makes sense because UFFD_WP is a bit to reflect permission, and permissions do not apply if the pte is poisoned and destined to sigbus. So here we simply check uffd_wp bit set first, do nothing otherwise. Attach the Fixes to UFFDIO_POISON work, as before that it should not be possible to have poison entry for hugetlb (e.g., hugetlb doesn't do swap, so no chance of swapin errors). Link: https://lkml.kernel.org/r/20240405231920.1772199-1-peterx@redhat.com Link: https://lore.kernel.org/r/000000000000920d5e0615602dd1@google.com Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: syzbot+b07c8ac8eee3d4d8440f@syzkaller.appspotmail.com Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm,page_owner: fix printing of stack recordsOscar Salvador1-3/+1
When seq_* code sees that its buffer overflowed, it re-allocates a bigger onecand calls seq_operations->start() callback again. stack_start() naively though that if it got called again, it meant that the old record got already printed so it returned the next object, but that is not true. The consequence of that is that every time stack_stop() -> stack_start() get called because we needed a bigger buffer, stack_start() will skip entries, and those will not be printed. Fix it by not advancing to the next object in stack_start(). Link: https://lkml.kernel.org/r/20240404070702.2744-5-osalvador@suse.de Fixes: 765973a09803 ("mm,page_owner: display all stacks and their count") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Potapenko <glider@google.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm,page_owner: fix accounting of pages when migratingOscar Salvador1-0/+15
Upon migration, new allocated pages are being given the handle of the old pages. This is problematic because it means that for the stack which allocated the old page, we will be substracting the old page + the new one when that page is freed, creating an accounting imbalance. There is an interest in keeping it that way, as otherwise the output will biased towards migration stacks should those operations occur often, but that is not really helpful. The link from the new page to the old stack is being performed by calling __update_page_owner_handle() in __folio_copy_owner(). The only thing that is left is to link the migrate stack to the old page, so the old page will be subtracted from the migrate stack, avoiding by doing so any possible imbalance. Link: https://lkml.kernel.org/r/20240404070702.2744-4-osalvador@suse.de Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Potapenko <glider@google.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm,page_owner: fix refcount imbalanceOscar Salvador2-49/+58
Current code does not contemplate scenarios were an allocation and free operation on the same pages do not handle it in the same amount at once. To give an example, page_alloc_exact(), where we will allocate a page of enough order to stafisfy the size request, but we will free the remainings right away. In the above example, we will increment the stack_record refcount only once, but we will decrease it the same number of times as number of unused pages we have to free. This will lead to a warning because of refcount imbalance. Fix this by recording the number of base pages in the refcount field. Link: https://lkml.kernel.org/r/20240404070702.2744-3-osalvador@suse.de Reported-by: syzbot+41bbfdb8d41003d12c0f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/00000000000090e8ff0613eda0e5@google.com Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm,page_owner: update metadata for tail pagesOscar Salvador1-63/+74
Patch series "page_owner: Fix refcount imbalance and print fixup", v4. This series consists of a refactoring/correctness of updating the metadata of tail pages, a couple of fixups for the refcounting part and a fixup for the stack_start() function. From this series on, instead of counting the stacks, we count the outstanding nr_base_pages each stack has, which gives us a much better memory overview. The other fixup is for the migration part. A more detailed explanation can be found in the changelog of the respective patches. This patch (of 4): __set_page_owner_handle() and __reset_page_owner() update the metadata of all pages when the page is of a higher-order, but we miss to do the same when the pages are migrated. __folio_copy_owner() only updates the metadata of the head page, meaning that the information stored in the first page and the tail pages will not match. Strictly speaking that is not a big problem because 1) we do not print tail pages and 2) upon splitting all tail pages will inherit the metadata of the head page, but it is better to have all metadata in check should there be any problem, so it can ease debugging. For that purpose, a couple of helpers are created __update_page_owner_handle() which updates the metadata on allocation, and __update_page_owner_free_handle() which does the same when the page is freed. __folio_copy_owner() will make use of both as it needs to entirely replace the page_owner metadata for the new page. Link: https://lkml.kernel.org/r/20240404070702.2744-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20240404070702.2744-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVELokesh Gidra1-3/+3
Commit d7a08838ab74 ("mm: userfaultfd: fix unexpected change to src_folio when UFFDIO_MOVE fails") moved the src_folio->{mapping, index} changing to after clearing the page-table and ensuring that it's not pinned. This avoids failure of swapout+migration and possibly memory corruption. However, the commit missed fixing it in the huge-page case. Link: https://lkml.kernel.org/r/20240404171726.2302435-1-lokeshgidra@google.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Lokesh Gidra <lokeshgidra@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-17mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properlyDavid Hildenbrand3-41/+40
Darrick reports that in some cases where pread() would fail with -EIO and mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ / MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT. While the madvise() call can be interrupted by a signal, this is not the desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave like page faults in that case: fail and not retry forever. A reproducer can be found at [1]. The reason is that __get_user_pages(), as called by faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way: it will simply return 0 when VM_FAULT_RETRY happened, making madvise_populate()->faultin_vma_page_range() retry again and again, never setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages(). __get_user_pages_locked() does what we want, but duplicating that logic in faultin_vma_page_range() feels wrong. So let's use __get_user_pages_locked() instead, that will detect VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT from faultin_page() to __get_user_pages(), all the way to madvise_populate(). But, there is an issue: __get_user_pages_locked() will end up re-taking the MM lock and then __get_user_pages() will do another VMA lookup. In the meantime, the VMA layout could have changed and we'd fail with different error codes than we'd want to. As __get_user_pages() will currently do a new VMA lookup either way, let it do the VMA handling in a different way, controlled by a new FOLL_MADV_POPULATE flag, effectively moving these checks from madvise_populate() + faultin_page_range() in there. With this change, Darricks reproducer properly fails with -EFAULT, as documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE. [1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/ Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Darrick J. Wong <djwong@kernel.org> Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/ Cc: Darrick J. Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-16drm/panel: novatek-nt36682e: don't unregister DSI deviceDmitry Baryshkov1-2/+0
The DSI device for the panel was registered by the DSI host, so it is an error to unregister it from the panel driver. Drop the call to mipi_dsi_device_unregister(). Fixes: ea4f9975625a ("drm/panel: Add support for Novatek NT36672E panel driver") Reviewed-by: Jessica Zhang <quic_jesszhan@quicinc.com> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20240404-drop-panel-unregister-v1-2-9f56953c5fb9@linaro.org
2024-04-16drm/panel: visionox-rm69299: don't unregister DSI deviceDmitry Baryshkov1-2/+0
The DSI device for the panel was registered by the DSI host, so it is an error to unregister it from the panel driver. Drop the call to mipi_dsi_device_unregister(). Fixes: c7f66d32dd43 ("drm/panel: add support for rm69299 visionox panel") Reviewed-by: Jessica Zhang <quic_jesszhan@quicinc.com> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20240404-drop-panel-unregister-v1-1-9f56953c5fb9@linaro.org
2024-04-16ice: Fix checking for unsupported keys on non-tunnel deviceMarcin Szycik1-1/+4
Add missing FLOW_DISSECTOR_KEY_ENC_* checks to TC flower filter parsing. Without these checks, it would be possible to add filters with tunnel options on non-tunnel devices. enc_* options are only valid for tunnel devices. Example: devlink dev eswitch set $PF1_PCI mode switchdev echo 1 > /sys/class/net/$PF1/device/sriov_numvfs tc qdisc add dev $VF1_PR ingress ethtool -K $PF1 hw-tc-offload on tc filter add dev $VF1_PR ingress flower enc_ttl 12 skip_sw action drop Fixes: 9e300987d4a8 ("ice: VXLAN and Geneve TC support") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-16ice: tc: allow zero flags in parsing tc flowerMichal Swiatkowski1-1/+1
The check for flags is done to not pass empty lookups to adding switch rule functions. Since metadata is always added to lookups there is no need to check against the flag. It is also fixing the problem with such rule: $ tc filter add dev gtp_dev ingress protocol ip prio 0 flower \ enc_dst_port 2123 action drop Switch block in case of GTP can't parse the destination port, because it should always be set to GTP specific value. The same with ethertype. The result is that there is no other matching criteria than GTP tunnel. In this case flags is 0, rule can't be added only because of defensive check against flags. Fixes: 9a225f81f540 ("ice: Support GTP-U and GTP-C offload in switchdev") Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-16ice: tc: check src_vsi in case of traffic from VFMichal Swiatkowski1-0/+8
In case of traffic going from the VF (so ingress for port representor) source VSI should be consider during packet classification. It is needed for hardware to not match packets from different ports with filters added on other port. It is only for "from VF" traffic, because other traffic direction doesn't have source VSI. Set correct ::src_vsi in rule_info to pass it to the hardware filter. For example this rule should drop only ipv4 packets from eth10, not from the others VF PRs. It is needed to check source VSI in this case. $tc filter add dev eth10 ingress protocol ip flower skip_sw action drop Fixes: 0d08a441fb1a ("ice: ndo_setup_tc implementation for PF") Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-16Merge tag 'kvm-x86-fixes-6.9-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini15-89/+194
- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Disable support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken and can leak host LBRs to the guest. - Fix a bug where KVM neglects to set the enable bits for general purpose counters in PERF_GLOBAL_CTRL when initializing the virtual PMU. Both Intel and AMD architectures require the bits to be set at RESET in order for v2 PMUs to be backwards compatible with software that was written for v1 PMUs, i.e. for software that will never manually set the global enables. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the virtual LBR perf event, i.e. KVM will always fail to create LBR events on such CPUs. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test. - Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow root due KVM unnecessarily clobbering root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1 hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU to run L2). For simplicity, KVM always disables PML when running L2, but the TDP MMU wasn't accounting for root-specific conditions that force write- protect based dirty logging.
2024-04-16bpftool: Address minor issues in bash completionQuentin Monnet1-36/+25
This commit contains a series of clean-ups and fixes for bpftool's bash completion file: - Make sure all local variables are declared as such. - Make sure variables are initialised before being read. - Update ELF section ("maps" -> ".maps") for looking up map names in object files. - Fix call to _init_completion. - Move definition for MAP_TYPE and PROG_TYPE higher up in the scope to avoid defining them multiple times, reuse MAP_TYPE where relevant. - Simplify completion for "duration" keyword in "bpftool prog profile". - Fix completion for "bpftool struct_ops register" and "bpftool link (pin|detach)" where we would repeatedly suggest file names instead of suggesting just one name. - Fix completion for "bpftool iter pin ... map MAP" to account for the "map" keyword. - Add missing "detach" suggestion for "bpftool link". Signed-off-by: Quentin Monnet <qmo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240413011427.14402-3-qmo@kernel.org
2024-04-16bpftool: Update documentation where progs/maps can be passed by nameQuentin Monnet4-4/+4
When using references to BPF programs, bpftool supports passing programs by name on the command line. The manual pages for "bpftool prog" and "bpftool map" (for prog_array updates) mention it, but we have a few additional subcommands that support referencing programs by name but do not mention it in their documentation. Let's update the pages for subcommands "btf", "cgroup", and "net". Similarly, we can reference maps by name when passing them to "bpftool prog load", so we update the page for "bpftool prog" as well. Signed-off-by: Quentin Monnet <qmo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240413011427.14402-2-qmo@kernel.org
2024-04-16bpf: Harden and/or/xor value tracking in verifierHarishankar Vishwanathan1-54/+40
This patch addresses a latent unsoundness issue in the scalar(32)_min_max_and/or/xor functions. While it is not a bugfix, it ensures that the functions produce sound outputs for all inputs. The issue occurs in these functions when setting signed bounds. The following example illustrates the issue for scalar_min_max_and(), but it applies to the other functions. In scalar_min_max_and() the following clause is executed when ANDing positive numbers: /* ANDing two positives gives a positive, so safe to * cast result into s64. */ dst_reg->smin_value = dst_reg->umin_value; dst_reg->smax_value = dst_reg->umax_value; However, if umin_value and umax_value of dst_reg cross the sign boundary (i.e., if (s64)dst_reg->umin_value > (s64)dst_reg->umax_value), then we will end up with smin_value > smax_value, which is unsound. Previous works [1, 2] have discovered and reported this issue. Our tool Agni [2, 3] consideres it a false positive. This is because, during the verification of the abstract operator scalar_min_max_and(), Agni restricts its inputs to those passing through reg_bounds_sync(). This mimics real-world verifier behavior, as reg_bounds_sync() is invariably executed at the tail of every abstract operator. Therefore, such behavior is unlikely in an actual verifier execution. However, it is still unsound for an abstract operator to set signed bounds such that smin_value > smax_value. This patch fixes it, making the abstract operator sound for all (well-formed) inputs. It is worth noting that while the previous code updated the signed bounds (using the output unsigned bounds) only when the *input signed* bounds were positive, the new code updates them whenever the *output unsigned* bounds do not cross the sign boundary. An alternative approach to fix this latent unsoundness would be to unconditionally set the signed bounds to unbounded [S64_MIN, S64_MAX], and let reg_bounds_sync() refine the signed bounds using the unsigned bounds and the tnum. We found that our approach produces more precise (tighter) bounds. For example, consider these inputs to BPF_AND: /* dst_reg */ var_off.value: 8608032320201083347 var_off.mask: 615339716653692460 smin_value: 8070450532247928832 smax_value: 8070450532247928832 umin_value: 13206380674380886586 umax_value: 13206380674380886586 s32_min_value: -2110561598 s32_max_value: -133438816 u32_min_value: 4135055354 u32_max_value: 4135055354 /* src_reg */ var_off.value: 8584102546103074815 var_off.mask: 9862641527606476800 smin_value: 2920655011908158522 smax_value: 7495731535348625717 umin_value: 7001104867969363969 umax_value: 8584102543730304042 s32_min_value: -2097116671 s32_max_value: 71704632 u32_min_value: 1047457619 u32_max_value: 4268683090 After going through tnum_and() -> scalar32_min_max_and() -> scalar_min_max_and() -> reg_bounds_sync(), our patch produces the following bounds for s32: s32_min_value: -1263875629 s32_max_value: -159911942 Whereas, setting the signed bounds to unbounded in scalar_min_max_and() produces: s32_min_value: -1263875629 s32_max_value: -1 As observed, our patch produces a tighter s32 bound. We also confirmed using Agni and SMT verification that our patch always produces signed bounds that are equal to or more precise than setting the signed bounds to unbounded in scalar_min_max_and(). [1] https://sanjit-bhat.github.io/assets/pdf/ebpf-verifier-range-analysis22.pdf [2] https://link.springer.com/chapter/10.1007/978-3-031-37709-9_12 [3] https://github.com/bpfverif/agni Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu> Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240402212039.51815-1-harishankar.vishwanathan@gmail.com Link: https://lore.kernel.org/bpf/20240416115303.331688-1-harishankar.vishwanathan@gmail.com
2024-04-16wifi: wilc1000: convert list management to RCUAlexis Lothoré5-64/+45
wilc1000 driver currently uses SRCU API to manage vif list. There is no real reason to use SRCU API here, and it makes things slightly more complicated (we need to handle a SRCU index as well as a srcu_struct) than classical RCU. Switch SRCU APIs to RCU APIs. Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://msgid.link/20240410-wilc_srcu_to_rcu-v1-2-a9ff5b10feaa@bootlin.com
2024-04-16wifi: wilc1000: set atomic flag on kmemdup in srcu critical sectionAlexis Lothoré1-1/+1
In order to prepare wilc1000 driver switch from SRCU to RCU, make sure that allocators are provided with the GFP_ATOMIC flag when called in a critical read section. Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://msgid.link/20240410-wilc_srcu_to_rcu-v1-1-a9ff5b10feaa@bootlin.com
2024-04-16wifi: ath11k: use RCU when accessing struct inet6_dev::ac_listKalle Valo1-1/+3
Commit c3718936ec47 ("ipv6: anycast: complete RCU handling of struct ifacaddr6") converted struct inet6_dev::ac_list to use RCU but missed that ath11k also accesses this list. Now sparse warns: drivers/net/wireless/ath/ath11k/mac.c:9145:21: warning: incorrect type in assignment (different address spaces) drivers/net/wireless/ath/ath11k/mac.c:9145:21: expected struct ifacaddr6 *ifaca6 drivers/net/wireless/ath/ath11k/mac.c:9145:21: got struct ifacaddr6 [noderef] __rcu *ac_list drivers/net/wireless/ath/ath11k/mac.c:9145:53: warning: incorrect type in assignment (different address spaces) drivers/net/wireless/ath/ath11k/mac.c:9145:53: expected struct ifacaddr6 *ifaca6 drivers/net/wireless/ath/ath11k/mac.c:9145:53: got struct ifacaddr6 [noderef] __rcu *aca_next Fix it by using rtnl_dereference(). Also add a note that read_lock_bh() calls rcu_read_lock() which I was not aware of. Tested-on: WCN6855 hw2.0 PCI WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37 Fixes: c3718936ec47 ("ipv6: anycast: complete RCU handling of struct ifacaddr6") Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com> Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://msgid.link/20240411165516.4070649-2-kvalo@kernel.org
2024-04-16bpf, tests: Fix typos in commentsChen Pei1-1/+1
Currently, there are two comments with same name "64-bit ATOMIC magnitudes", the second one should be "32-bit ATOMIC magnitudes" based on the context. Signed-off-by: Chen Pei <cp0613@linux.alibaba.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240415081928.17440-1-cp0613@linux.alibaba.com
2024-04-16btf: Avoid weak external referencesArd Biesheuvel2-5/+8
If the BTF code is enabled in the build configuration, the start/stop BTF markers are guaranteed to exist. Only when CONFIG_DEBUG_INFO_BTF=n, the references in btf_parse_vmlinux() will remain unsatisfied, relying on the weak linkage of the external references to avoid breaking the build. Avoid GOT based relocations to these markers in the final executable by dropping the weak attribute and instead, make btf_parse_vmlinux() return ERR_PTR(-ENOENT) directly if CONFIG_DEBUG_INFO_BTF is not enabled to begin with. The compiler will drop any subsequent references to __start_BTF and __stop_BTF in that case, allowing the link to succeed. Note that Clang will notice that taking the address of __start_BTF can no longer yield NULL, so testing for that condition becomes unnecessary. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240415162041.2491523-8-ardb+git@google.com
2024-04-16Merge branch 'net-stmmac-fix-mac-capabilities-procedure'Paolo Abeni7-29/+32
Serge Semin says: ==================== net: stmmac: Fix MAC-capabilities procedure The series got born as a result of the discussions around the recent Yanteng' series adding the Loongson LS7A1000, LS2K1000, LS7A2000, LS2K2000 MACs support: Link: https://lore.kernel.org/netdev/fu3f6uoakylnb6eijllakeu5i4okcyqq7sfafhp5efaocbsrwe@w74xe7gb6x7p In particular the Yanteng' patchset needed to implement the Loongson MAC-specific constraints applied to the link speed and link duplex mode. As a result of the discussion with Russel the next preliminary patch was born: Link: https://lore.kernel.org/netdev/df31e8bcf74b3b4ddb7ddf5a1c371390f16a2ad5.1712917541.git.siyanteng@loongson.cn The patch above was a temporal solution utilized by Yanteng for further developments and to move on with the on-going review. This patchset is a refactored version of that single patch with formatting required for the fixes patches. In particular the series starts with fixing the half-duplex-less constraint currently applied for all IP-cores. In fact it's specific for the DW QoS Eth only (DW GMAC v4.x/v5.x). The next patch fixes the MAC-capabilities setting up during the active Tx/Rx queues re-initialization procedure. Particularly the procedure missed the max-speed limit thus possibly activating speeds prohibited on the respective platforms. Third patch fixes the incorrect MAC-capabilities initialization for DW MAC100, DW XGMAC and DW XLGMAC devices by moving the correct initialization to the IP-core specific setup() methods. That's it for now. Thanks for review and testing in advance. Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com> Cc: Simon Horman <horms@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Chen-Yu Tsai <wens@csie.org> Cc: Jernej Skrabec <jernej.skrabec@gmail.com> Cc: Samuel Holland <samuel@sholland.org> Cc: netdev@vger.kernel.org Cc: linux-stm32@st-md-mailman.stormreply.com Cc: linux-arm-kernel@lists.infradead.org Cc: linux-sunxi@lists.linux.dev Cc: linux-kernel@vger.kernel.org ==================== Link: https://lore.kernel.org/r/20240412180340.7965-1-fancer.lancer@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-16net: stmmac: Fix IP-cores specific MAC capabilitiesSerge Semin7-19/+23
Here is the list of the MAC capabilities specific to the particular DW MAC IP-cores currently supported by the driver: DW MAC100: MAC_ASYM_PAUSE | MAC_SYM_PAUSE | MAC_10 | MAC_100 DW GMAC: MAC_ASYM_PAUSE | MAC_SYM_PAUSE | MAC_10 | MAC_100 | MAC_1000 Allwinner sun8i MAC: MAC_ASYM_PAUSE | MAC_SYM_PAUSE | MAC_10 | MAC_100 | MAC_1000 DW QoS Eth: MAC_ASYM_PAUSE | MAC_SYM_PAUSE | MAC_10 | MAC_100 | MAC_1000 | MAC_2500FD if there is more than 1 active Tx/Rx queues: MAC_ASYM_PAUSE | MAC_SYM_PAUSE | MAC_10FD | MAC_100FD | MAC_1000FD | MAC_2500FD DW XGMAC: MAC_ASYM_PAUSE | MAC_SYM_PAUSE | MAC_1000FD | MAC_2500FD | MAC_5000FD | MAC_10000FD DW XLGMAC: MAC_ASYM_PAUSE | MAC_SYM_PAUSE | MAC_1000FD | MAC_2500FD | MAC_5000FD | MAC_10000FD | MAC_25000FD | MAC_40000FD | MAC_50000FD | MAC_100000FD As you can see there are only two common capabilities: MAC_ASYM_PAUSE | MAC_SYM_PAUSE. Meanwhile what is currently implemented defines 10/100/1000 link speeds for all IP-cores, which is definitely incorrect for DW MAC100, DW XGMAC and DW XLGMAC devices. Seeing the flow-control is implemented as a callback for each MAC IP-core (see dwmac100_flow_ctrl(), dwmac1000_flow_ctrl(), sun8i_dwmac_flow_ctrl(), etc) and since the MAC-specific setup() method is supposed to be called for each available DW MAC-based device, the capabilities initialization can be freely moved to these setup() functions, thus correctly setting up the MAC-capabilities for each IP-core (including the Allwinner Sun8i). A new stmmac_link::caps field was specifically introduced for that so to have all link-specific info preserved in a single structure. Note the suggested change fixes three earlier commits at a time. The commit 5b0d7d7da64b ("net: stmmac: Add the missing speeds that XGMAC supports") permitted the 10-100 link speeds and 1G half-duplex mode for DW XGMAC IP-core even though it doesn't support them. The commit df7699c70c1b ("net: stmmac: Do not cut down 1G modes") incorrectly added the MAC1000 capability to the DW MAC100 IP-core. Similarly to the DW XGMAC the commit 8a880936e902 ("net: stmmac: Add XLGMII support") incorrectly permitted the 10-100 link speeds and 1G half-duplex mode for DW XLGMAC IP-core. Fixes: 5b0d7d7da64b ("net: stmmac: Add the missing speeds that XGMAC supports") Fixes: df7699c70c1b ("net: stmmac: Do not cut down 1G modes") Fixes: 8a880936e902 ("net: stmmac: Add XLGMII support") Suggested-by: Russell King (Oracle) <linux@armlinux.org.uk> Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Romain Gantois <romain.gantois@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-16net: stmmac: Fix max-speed being ignored on queue re-initSerge Semin1-0/+5
It's possible to have the maximum link speed being artificially limited on the platform-specific basis. It's done either by setting up the plat_stmmacenet_data::max_speed field or by specifying the "max-speed" DT-property. In such cases it's required that any specific MAC-capabilities re-initializations would take the limit into account. In particular the link speed capabilities may change during the number of active Tx/Rx queues re-initialization. But the currently implemented procedure doesn't take the speed limit into account. Fix that by calling phylink_limit_mac_speed() in the stmmac_reinit_queues() method if the speed limitation was required in the same way as it's done in the stmmac_phy_setup() function. Fixes: 95201f36f395 ("net: stmmac: update MAC capabilities when tx queues are updated") Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Romain Gantois <romain.gantois@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-16net: stmmac: Apply half-duplex-less constraint for DW QoS Eth onlySerge Semin2-16/+10
There are three DW MAC IP-cores which can have the multiple Tx/Rx queues enabled: DW GMAC v3.7+ with AV feature, DW QoS Eth v4.x/v5.x, DW XGMAC/XLGMAC Based on the respective HW databooks, only the DW QoS Eth IP-core doesn't support the half-duplex link mode in case if more than one queues enabled: "In multiple queue/channel configurations, for half-duplex operation, enable only the Q0/CH0 on Tx and Rx. For single queue/channel in full-duplex operation, any queue/channel can be enabled." The rest of the IP-cores don't have such constraint. Thus in order to have the constraint applied for the DW QoS Eth MACs only, let's move the it' implementation to the respective MAC-capabilities getter and make sure the getter is called in the queues re-init procedure. Fixes: b6cfffa7ad92 ("stmmac: fix DMA channel hang in half-duplex mode") Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Romain Gantois <romain.gantois@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-16platform/x86/intel-uncore-freq: Increase minor number supportSrinivas Pandruvada1-2/+2
No new changes will be added for minor version 2. Change the minor version number to 2 and stop displaying log message for unsupported minor version 2. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://lore.kernel.org/r/20240415220625.2828339-1-srinivas.pandruvada@linux.intel.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-04-16platform/x86: ISST: Add Granite Rapids-D to HPM CPU listSrinivas Pandruvada1-0/+1
Add Granite Rapids-D to hpm_cpu_ids, so that MSR 0x54 can be used. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://lore.kernel.org/r/20240415212853.2820470-1-srinivas.pandruvada@linux.intel.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-04-16platform/x86/amd: pmf: Add quirk for ROG Zephyrus G14Mario Limonciello1-0/+8
ROG Zephyrus G14 advertises support for SPS notifications to the BIOS but doesn't actually use them. Instead the asus-nb-wmi driver utilizes such events. Add a quirk to prevent the system from registering for ACPI platform profile when this system is found to avoid conflicts. Reported-by: al0uette@outlook.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218685 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20240410140956.385-3-mario.limonciello@amd.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-04-16platform/x86/amd: pmf: Add infrastructure for quirking supported funcsMario Limonciello5-2/+52
In the event of a BIOS bug add infrastructure that will be utilized to override the return value for supported_funcs to avoid enabling features. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20240410140956.385-2-mario.limonciello@amd.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>