summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2025-01-16mailmap: update entry for Ethan Carter EdwardsEthan Carter Edwards1-0/+1
Map old gmail + name to my current full name and email. Link: https://lkml.kernel.org/r/xbfkmvmp4wyxrvlan57bjnul5icrwfyt67vnhhw2cyr5rzbnee@mfvihhd6s7l5 Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-16mm: zswap: move allocations during CPU init outside the lockYosry Ahmed1-18/+24
In zswap_cpu_comp_prepare(), allocations are made and assigned to various members of acomp_ctx under acomp_ctx->mutex. However, allocations may recurse into zswap through reclaim, trying to acquire the same mutex and deadlocking. Move the allocations before the mutex critical section. Only the initialization of acomp_ctx needs to be done with the mutex held. Link: https://lkml.kernel.org/r/20250113214458.2123410-1-yosryahmed@google.com Fixes: 12dcb0ef5406 ("mm: zswap: properly synchronize freeing resources during CPU hotunplug") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-16mm: khugepaged: fix call hpage_collapse_scan_file() for anonymous vmaLiu Shixin1-2/+2
syzkaller reported such a BUG_ON(): ------------[ cut here ]------------ kernel BUG at mm/khugepaged.c:1835! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP ... CPU: 6 UID: 0 PID: 8009 Comm: syz.15.106 Kdump: loaded Tainted: G W 6.13.0-rc6 #22 Tainted: [W]=WARN Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : collapse_file+0xa44/0x1400 lr : collapse_file+0x88/0x1400 sp : ffff80008afe3a60 ... Call trace: collapse_file+0xa44/0x1400 (P) hpage_collapse_scan_file+0x278/0x400 madvise_collapse+0x1bc/0x678 madvise_vma_behavior+0x32c/0x448 madvise_walk_vmas.constprop.0+0xbc/0x140 do_madvise.part.0+0xdc/0x2c8 __arm64_sys_madvise+0x68/0x88 invoke_syscall+0x50/0x120 el0_svc_common.constprop.0+0xc8/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x34/0x128 el0t_64_sync_handler+0xc8/0xd0 el0t_64_sync+0x190/0x198 This indicates that the pgoff is unaligned. After analysis, I confirm the vma is mapped to /dev/zero. Such a vma certainly has vm_file, but it is set to anonymous by mmap_zero(). So even if it's mmapped by 2m-unaligned, it can pass the check in thp_vma_allowable_order() as it is an anonymous-mmap, but then be collapsed as a file-mmap. It seems the problem has existed for a long time, but actually, since we have khugepaged_max_ptes_none check before, we will skip collapse it as it is /dev/zero and so has no present page. But commit d8ea7cc8547c limit the check for only khugepaged, so the BUG_ON() can be triggered by madvise_collapse(). Add vma_is_anonymous() check to make such vma be processed by hpage_collapse_scan_pmd(). Link: https://lkml.kernel.org/r/20250111034511.2223353-1-liushixin2@huawei.com Fixes: d8ea7cc8547c ("mm/khugepaged: add flag to predicate khugepaged-only behavior") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mattew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-16mm: shmem: use signed int for version handling in casefold optionKaran Sanghavi1-1/+1
Fixes an issue where the use of an unsigned data type in `shmem_parse_opt_casefold()` caused incorrect evaluation of negative conditions. Link: https://lkml.kernel.org/r/20250111-unsignedcompare1601569-v3-1-c861b4221831@gmail.com Fixes: 58e55efd6c72 ("tmpfs: Add casefold lookup support") Reviewed-by: André Almeida <andrealmeid@igalia.com> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: Karan Sanghavi <karansanghvi98@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickens <hughd@google.com> Cc: Shuah khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-16alloc_tag: skip pgalloc_tag_swap if profiling is disabledSuren Baghdasaryan1-0/+3
When memory allocation profiling is disabled, there is no need to swap allocation tags during migration. Skip it to avoid unnecessary overhead. Once I added these checks, the overhead of the mode when memory profiling is enabled but turned off went down by about 50%. Link: https://lkml.kernel.org/r/20241226211639.1357704-2-surenb@google.com Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-16mm: page_alloc: fix missed updates of lowmem_reserve in ↵zihan zhou1-0/+3
adjust_managed_page_count In the kernel, the zone's lowmem_reserve and _watermark, and the global variable 'totalreserve_pages' depend on the value of managed_pages, but after running adjust_managed_page_count, these values aren't updated, which causes some problems. For example, in a system with six 1GB large pages, we found that the value of protection in zoneinfo (zone->lowmem_reserve), is not right. Its value seems to be calculated from the initial managed_pages, but after the managed_pages changed, was not updated. Only after reading the file /proc/sys/vm/lowmem_reserve_ratio, updates happen. read file /proc/sys/vm/lowmem_reserve_ratio: lowmem_reserve_ratio_sysctl_handler ----setup_per_zone_lowmem_reserve --------calculate_totalreserve_pages protection changed after reading file: [root@test ~]# cat /proc/zoneinfo | grep protection protection: (0, 2719, 57360, 0) protection: (0, 0, 54640, 0) protection: (0, 0, 0, 0) protection: (0, 0, 0, 0) [root@test ~]# cat /proc/sys/vm/lowmem_reserve_ratio 256 256 32 0 [root@test ~]# cat /proc/zoneinfo | grep protection protection: (0, 2735, 63524, 0) protection: (0, 0, 60788, 0) protection: (0, 0, 0, 0) protection: (0, 0, 0, 0) lowmem_reserve increased also makes the totalreserve_pages increased, which causes a decrease in available memory. The one above is just a test machine, and the increase is not significant. On our online machine, the reserved memory will increase by several GB due to reading this file. It is clearly unreasonable to cause a sharp drop in available memory just by reading a file. In this patch, we update reserve memory when update managed_pages, The size of reserved memory becomes stable. But it seems that the _watermark should also be updated along with the managed_pages. We have not done it because we are unsure if it is reasonable to set the watermark through the initial managed_pages. If it is not reasonable, we will propose new patch. Link: https://lkml.kernel.org/r/20241225021034.45693-1-15645113830zzh@gmail.com Signed-off-by: zihan zhou <15645113830zzh@gmail.com> Signed-off-by: yaowenchao <yaowenchao@jd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13fs/proc: fix softlockup in __read_vmcore (part 2)Rik van Riel1-0/+2
Since commit 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") the number of softlockups in __read_vmcore at kdump time have gone down, but they still happen sometimes. In a memory constrained environment like the kdump image, a softlockup is not just a harmless message, but it can interfere with things like RCU freeing memory, causing the crashdump to get stuck. The second loop in __read_vmcore has a lot more opportunities for natural sleep points, like scheduling out while waiting for a data write to happen, but apparently that is not always enough. Add a cond_resched() to the second loop in __read_vmcore to (hopefully) get rid of the softlockups. Link: https://lkml.kernel.org/r/20250110102821.2a37581b@fangorn Fixes: 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Breno Leitao <leitao@debian.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: fix assertion in folio_end_read()Matthew Wilcox (Oracle)1-1/+1
We only need to assert that the uptodate flag is clear if we're going to set it. This hasn't been a problem before now because we have only used folio_end_read() when completing with an error, but it's convenient to use it in squashfs if we discover the folio is already uptodate. Link: https://lkml.kernel.org/r/20250110163300.3346321-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.Donet Tom1-0/+3
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and pgdemote_khugepaged stats in vmstat are not being updated. Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") moved the pgdemote vmstat update from demote_folio_list() to shrink_inactive_list(), which is in the normal LRU path. As a result, the pgdemote stats are updated correctly for the normal LRU but not for MGLRU. To address this, we have added the pgdemote stat update in the evict_folios() function, which is in the MGLRU path. With this patch, the pgdemote stats will now be updated correctly when MGLRU is enabled. Without this patch vmstat output when MGLRU is enabled ====================================================== pgdemote_kswapd 0 pgdemote_direct 0 pgdemote_khugepaged 0 With this patch vmstat output when MGLRU is enabled =================================================== pgdemote_kswapd 43234 pgdemote_direct 4691 pgdemote_khugepaged 0 Link: https://lkml.kernel.org/r/20250109060540.451261-1-donettom@linux.ibm.com Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13vmstat: disable vmstat_work on vmstat_cpu_down_prep()Koichiro Den1-2/+13
The upstream commit adcfb264c3ed ("vmstat: disable vmstat_work on vmstat_cpu_down_prep()") introduced another warning during the boot phase so was soon reverted on upstream by commit cd6313beaeae ("Revert "vmstat: disable vmstat_work on vmstat_cpu_down_prep()""). This commit resolves it and reattempts the original fix. Even after mm/vmstat:online teardown, shepherd may still queue work for the dying cpu until the cpu is removed from online mask. While it's quite rare, this means that after unbind_workers() unbinds a per-cpu kworker, it potentially runs vmstat_update for the dying CPU on an irrelevant cpu before entering atomic AP states. When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the backtrace. BUG: using smp_processor_id() in preemptible [00000000] code: \ kworker/7:3/1702 caller is refresh_cpu_vm_stats+0x235/0x5f0 CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G Tainted: [N]=TEST Workqueue: mm_percpu_wq vmstat_update Call Trace: <TASK> dump_stack_lvl+0x8d/0xb0 check_preemption_disabled+0xce/0xe0 refresh_cpu_vm_stats+0x235/0x5f0 vmstat_update+0x17/0xa0 process_one_work+0x869/0x1aa0 worker_thread+0x5e5/0x1100 kthread+0x29e/0x380 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> So, for mm/vmstat:online, disable vmstat_work reliably on teardown and symmetrically enable it on startup. For secondary CPUs during CPU hotplug scenarios, ensure the delayed work is disabled immediately after the initialization. These CPUs are not yet online when start_shepherd_timer() runs on boot CPU. vmstat_cpu_online() will enable the work for them. Link: https://lkml.kernel.org/r/20250108042807.3429745-1-koichiro.den@canonical.com Signed-off-by: Huacai Chen <chenhuacai@kernel.org> Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Suggested-by: Huacai Chen <chenhuacai@kernel.org> Tested-by: Charalampos Mitrodimas <charmitro@posteo.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13zram: fix potential UAF of zram tableKairui Song1-0/+1
If zram_meta_alloc failed early, it frees allocated zram->table without setting it NULL. Which will potentially cause zram_meta_free to access the table if user reset an failed and uninitialized device. Link: https://lkml.kernel.org/r/20250107065446.86928-1-ryncsn@gmail.com Fixes: 74363ec674cb ("zram: fix uninitialized ZRAM not releasing backing device") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13selftests/mm: set allocated memory to non-zero content in cow testRyan Roberts1-4/+4
After commit b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp"), cow test cases involving swapping out THPs via madvise(MADV_PAGEOUT) started to be skipped due to the subsequent check via pagemap determining that the memory was not actually swapped out. Logs similar to this were emitted: ... # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (16 kB) ok 2 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with single PTE of swapped-out THP (16 kB) ok 3 # SKIP MADV_PAGEOUT did not work, is swap enabled? # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (32 kB) ok 4 # SKIP MADV_PAGEOUT did not work, is swap enabled? ... The commit in question introduces the behaviour of scanning THPs and if their content is predominantly zero, it splits them and replaces the pages which are wholly zero with the zero page. These cow test cases were getting caught up in this. So let's avoid that by filling the contents of all allocated memory with a non-zero value. With this in place, the tests are passing again. Link: https://lkml.kernel.org/r/20250107142555.1870101-1-ryan.roberts@arm.com Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: clear uffd-wp PTE/PMD state on mremap()Ryan Roberts4-2/+68
When mremap()ing a memory region previously registered with userfaultfd as write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in flag clearing leads to a mismatch between the vma flags (which have uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to trigger a warning in page_table_check_pte_flags() due to setting the pte to writable while uffd-wp is still set. Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any such mremap() so that the values are consistent with the existing clearing of VM_UFFD_WP. Be careful to clear the logical flag regardless of its physical form; a PTE bit, a swap PTE bit, or a PTE marker. Cover PTE, huge PMD and hugetlb paths. Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/ Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl") Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13module: fix writing of livepatch relocations in ROX textPetr Pavlu1-1/+2
A livepatch module can contain a special relocation section .klp.rela.<objname>.<secname> to apply its relocations at the appropriate time and to additionally access local and unexported symbols. When <objname> points to another module, such relocations are processed separately from the regular module relocation process. For instance, only when the target <objname> actually becomes loaded. With CONFIG_STRICT_MODULE_RWX, when the livepatch core decides to apply these relocations, their processing results in the following bug: [ 25.827238] BUG: unable to handle page fault for address: 00000000000012ba [ 25.827819] #PF: supervisor read access in kernel mode [ 25.828153] #PF: error_code(0x0000) - not-present page [ 25.828588] PGD 0 P4D 0 [ 25.829063] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI [ 25.829742] CPU: 2 UID: 0 PID: 452 Comm: insmod Tainted: G O K 6.13.0-rc4-00078-g059dd502b263 #7820 [ 25.830417] Tainted: [O]=OOT_MODULE, [K]=LIVEPATCH [ 25.830768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 25.831651] RIP: 0010:memcmp+0x24/0x60 [ 25.832190] Code: [...] [ 25.833378] RSP: 0018:ffffa40b403a3ae8 EFLAGS: 00000246 [ 25.833637] RAX: 0000000000000000 RBX: ffff93bc81d8e700 RCX: ffffffffc0202000 [ 25.834072] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 00000000000012ba [ 25.834548] RBP: ffffa40b403a3b68 R08: ffffa40b403a3b30 R09: 0000004a00000002 [ 25.835088] R10: ffffffffffffd222 R11: f000000000000000 R12: 0000000000000000 [ 25.835666] R13: ffffffffc02032ba R14: ffffffffc007d1e0 R15: 0000000000000004 [ 25.836139] FS: 00007fecef8c3080(0000) GS:ffff93bc8f900000(0000) knlGS:0000000000000000 [ 25.836519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 25.836977] CR2: 00000000000012ba CR3: 0000000002f24000 CR4: 00000000000006f0 [ 25.837442] Call Trace: [ 25.838297] <TASK> [ 25.841083] __write_relocate_add.constprop.0+0xc7/0x2b0 [ 25.841701] apply_relocate_add+0x75/0xa0 [ 25.841973] klp_write_section_relocs+0x10e/0x140 [ 25.842304] klp_write_object_relocs+0x70/0xa0 [ 25.842682] klp_init_object_loaded+0x21/0xf0 [ 25.842972] klp_enable_patch+0x43d/0x900 [ 25.843572] do_one_initcall+0x4c/0x220 [ 25.844186] do_init_module+0x6a/0x260 [ 25.844423] init_module_from_file+0x9c/0xe0 [ 25.844702] idempotent_init_module+0x172/0x270 [ 25.845008] __x64_sys_finit_module+0x69/0xc0 [ 25.845253] do_syscall_64+0x9e/0x1a0 [ 25.845498] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 25.846056] RIP: 0033:0x7fecef9eb25d [ 25.846444] Code: [...] [ 25.847563] RSP: 002b:00007ffd0c5d6de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 25.848082] RAX: ffffffffffffffda RBX: 000055b03f05e470 RCX: 00007fecef9eb25d [ 25.848456] RDX: 0000000000000000 RSI: 000055b001e74e52 RDI: 0000000000000003 [ 25.848969] RBP: 00007ffd0c5d6ea0 R08: 0000000000000040 R09: 0000000000004100 [ 25.849411] R10: 00007fecefac7b20 R11: 0000000000000246 R12: 000055b001e74e52 [ 25.849905] R13: 0000000000000000 R14: 000055b03f05e440 R15: 0000000000000000 [ 25.850336] </TASK> [ 25.850553] Modules linked in: deku(OK+) uinput [ 25.851408] CR2: 00000000000012ba [ 25.852085] ---[ end trace 0000000000000000 ]--- The problem is that the .klp.rela.<objname>.<secname> relocations are processed after the module was already formed and mod->rw_copy was reset. However, the code in __write_relocate_add() calls module_writable_address() which translates the target address 'loc' still to 'loc + (mem->rw_copy - mem->base)', with mem->rw_copy now being 0. Fix the problem by returning directly 'loc' in module_writable_address() when the module is already formed. Function __write_relocate_add() knows to use text_poke() in such a case. Link: https://lkml.kernel.org/r/20250107153507.14733-1-petr.pavlu@suse.com Fixes: 0c133b1e78cd ("module: prepare to handle ROX allocations for text") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reported-by: Marek Maslanka <mmaslanka@google.com> Closes: https://lore.kernel.org/linux-modules/CAGcaFA2hdThQV6mjD_1_U+GNHThv84+MQvMWLgEuX+LVbAyDxg@mail.gmail.com/ Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: zswap: properly synchronize freeing resources during CPU hotunplugYosry Ahmed1-14/+44
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the current CPU at the beginning of the operation is retrieved and used throughout. However, since neither preemption nor migration are disabled, it is possible that the operation continues on a different CPU. If the original CPU is hotunplugged while the acomp_ctx is still in use, we run into a UAF bug as some of the resources attached to the acomp_ctx are freed during hotunplug in zswap_cpu_comp_dead() (i.e. acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp). The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") when the switch to the crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was retrieved using get_cpu_ptr() which disables preemption and makes sure the CPU cannot go away from under us. Preemption cannot be disabled with the crypto_acomp API as a sleepable context is needed. Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating and freeing resources with compression/decompression paths. Make sure that acomp_ctx.req is NULL when the resources are freed. In the compression/decompression paths, check if acomp_ctx.req is NULL after acquiring the mutex (meaning the CPU was offlined) and retry on the new CPU. The initialization of acomp_ctx.mutex is moved from the CPU hotplug callback to the pool initialization where it belongs (where the mutex is allocated). In addition to adding clarity, this makes sure that CPU hotplug cannot reinitialize a mutex that is already locked by compression/decompression. Previously a fix was attempted by holding cpus_read_lock() [1]. This would have caused a potential deadlock as it is possible for code already holding the lock to fall into reclaim and enter zswap (causing a deadlock). A fix was also attempted using SRCU for synchronization, but Johannes pointed out that synchronize_srcu() cannot be used in CPU hotplug notifiers [2]. Alternative fixes that were considered/attempted and could have worked: - Refcounting the per-CPU acomp_ctx. This involves complexity in handling the race between the refcount dropping to zero in zswap_[de]compress() and the refcount being re-initialized when the CPU is onlined. - Disabling migration before getting the per-CPU acomp_ctx [3], but that's discouraged and is a much bigger hammer than needed, and could result in subtle performance issues. [1]https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@google.com/ [2]https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@google.com/ [3]https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@google.com/ [yosryahmed@google.com: remove comment] Link: https://lkml.kernel.org/r/CAJD7tkaxS1wjn+swugt8QCvQ-rVF5RZnjxwPGX17k8x9zSManA@mail.gmail.com Link: https://lkml.kernel.org/r/20250108222441.3622031-1-yosryahmed@google.com Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/ Reported-by: Sam Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@mail.gmail.com/ Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13Revert "mm: zswap: fix race between [de]compression and CPU hotunplug"Yosry Ahmed1-16/+3
This reverts commit eaebeb93922ca6ab0dd92027b73d0112701706ef. Commit eaebeb93922c ("mm: zswap: fix race between [de]compression and CPU hotunplug") used the CPU hotplug lock in zswap compress/decompress operations to protect against a race with CPU hotunplug making some per-CPU resources go away. However, zswap compress/decompress can be reached through reclaim while the lock is held, resulting in a potential deadlock as reported by syzbot: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Not tainted ------------------------------------------------------ kswapd0/89 is trying to acquire lock: ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: acomp_ctx_get_cpu mm/zswap.c:886 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_compress mm/zswap.c:908 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store_page mm/zswap.c:1439 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 but task is already holding lock: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline] ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 __fs_reclaim_acquire mm/page_alloc.c:3853 [inline] fs_reclaim_acquire+0x88/0x130 mm/page_alloc.c:3867 might_alloc include/linux/sched/mm.h:318 [inline] slab_pre_alloc_hook mm/slub.c:4070 [inline] slab_alloc_node mm/slub.c:4148 [inline] __kmalloc_cache_node_noprof+0x40/0x3a0 mm/slub.c:4337 kmalloc_node_noprof include/linux/slab.h:924 [inline] alloc_worker kernel/workqueue.c:2638 [inline] create_worker+0x11b/0x720 kernel/workqueue.c:2781 workqueue_prepare_cpu+0xe3/0x170 kernel/workqueue.c:6628 cpuhp_invoke_callback+0x48d/0x830 kernel/cpu.c:194 __cpuhp_invoke_callback_range kernel/cpu.c:965 [inline] cpuhp_invoke_callback_range kernel/cpu.c:989 [inline] cpuhp_up_callbacks kernel/cpu.c:1020 [inline] _cpu_up+0x2b3/0x580 kernel/cpu.c:1690 cpu_up+0x184/0x230 kernel/cpu.c:1722 cpuhp_bringup_mask+0xdf/0x260 kernel/cpu.c:1788 cpuhp_bringup_cpus_parallel+0xf9/0x160 kernel/cpu.c:1878 bringup_nonboot_cpus+0x2b/0x50 kernel/cpu.c:1892 smp_init+0x34/0x150 kernel/smp.c:1009 kernel_init_freeable+0x417/0x5d0 init/main.c:1569 kernel_init+0x1d/0x2b0 init/main.c:1466 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #0 (cpu_hotplug_lock){++++}-{0:0}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] cpus_read_lock+0x42/0x150 kernel/cpu.c:490 acomp_ctx_get_cpu mm/zswap.c:886 [inline] zswap_compress mm/zswap.c:908 [inline] zswap_store_page mm/zswap.c:1439 [inline] zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 swap_writepage+0x647/0xce0 mm/page_io.c:279 shmem_writepage+0x1248/0x1610 mm/shmem.c:1579 pageout mm/vmscan.c:696 [inline] shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374 shrink_inactive_list mm/vmscan.c:1967 [inline] shrink_list mm/vmscan.c:2205 [inline] shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734 mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575 mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline] memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362 balance_pgdat mm/vmscan.c:6975 [inline] kswapd+0x17b3/0x2f30 mm/vmscan.c:7253 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(cpu_hotplug_lock); lock(fs_reclaim); rlock(cpu_hotplug_lock); *** DEADLOCK *** 1 lock held by kswapd0/89: #0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline] #0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253 stack backtrace: CPU: 0 UID: 0 PID: 89 Comm: kswapd0 Not tainted 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] cpus_read_lock+0x42/0x150 kernel/cpu.c:490 acomp_ctx_get_cpu mm/zswap.c:886 [inline] zswap_compress mm/zswap.c:908 [inline] zswap_store_page mm/zswap.c:1439 [inline] zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 swap_writepage+0x647/0xce0 mm/page_io.c:279 shmem_writepage+0x1248/0x1610 mm/shmem.c:1579 pageout mm/vmscan.c:696 [inline] shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374 shrink_inactive_list mm/vmscan.c:1967 [inline] shrink_list mm/vmscan.c:2205 [inline] shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734 mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575 mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline] memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362 balance_pgdat mm/vmscan.c:6975 [inline] kswapd+0x17b3/0x2f30 mm/vmscan.c:7253 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Revert the change. A different fix for the race with CPU hotunplug will follow. Link: https://lkml.kernel.org/r/20250107222236.2715883-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sam Sun <samsun1006219@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13hugetlb: fix NULL pointer dereference in trace_hugetlbfs_alloc_inodeMuchun Song1-1/+1
hugetlb_file_setup() will pass a NULL @dir to hugetlbfs_get_inode(), so we will access a NULL pointer for @dir. Fix it and set __entry->dr to 0 if @dir is NULL. Because ->i_ino cannot be 0 (see get_next_ino()), there is no confusing if user sees a 0 inode number. Link: https://lkml.kernel.org/r/20250106033118.4640-1-songmuchun@bytedance.com Fixes: 318580ad7f28 ("hugetlbfs: support tracepoint") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reported-by: Cheung Wall <zzqq0103.hey@gmail.com> Closes: https://lore.kernel.org/linux-mm/02858D60-43C1-4863-A84F-3C76A8AF1F15@linux.dev/T/# Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Cc: cheung wall <zzqq0103.hey@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: fix div by zero in bdi_ratio_from_pagesStefan Roesch1-2/+8
During testing it has been detected, that it is possible to get div by zero error in bdi_set_min_bytes. The error is caused by the function bdi_ratio_from_pages(). bdi_ratio_from_pages() calls global_dirty_limits. If the dirty threshold is 0, the div by zero is raised. This can happen if the root user is setting: echo 0 > /proc/sys/vm/dirty_ratio The following is a test case: echo 0 > /proc/sys/vm/dirty_ratio cd /sys/class/bdi/<device> echo 1 > strict_limit echo 8192 > min_bytes ==> error is raised. The problem is addressed by returning -EINVAL if dirty_ratio or dirty_bytes is set to 0. [shr@devkernel.io: check for -EINVAL in bdi_set_min_bytes() and bdi_set_max_bytes()] Link: https://lkml.kernel.org/r/20250108014723.166637-1-shr@devkernel.io [shr@devkernel.io: v3] Link: https://lkml.kernel.org/r/20250109063411.6591-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20250104012037.159386-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reported-by: cheung wall <zzqq0103.hey@gmail.com> Closes: https://lore.kernel.org/linux-mm/87pll35yd0.fsf@devkernel.io/T/#t Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Qiang Zhang <zzqq0103.hey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13x86/execmem: fix ROX cache usage in Xen PV guestsJuergen Gross1-1/+2
The recently introduced ROX cache for modules is assuming large page support in 64-bit mode without testing the related feature bit. This results in breakage when running as a Xen PV guest, as in this mode large pages are not supported. Fix that by testing the X86_FEATURE_PSE capability when deciding whether to enable the ROX cache. Link: https://lkml.kernel.org/r/20250103065631.26459-1-jgross@suse.com Fixes: 2e45474ab14f ("execmem: add support for cache of large ROX pages") Signed-off-by: Juergen Gross <jgross@suse.com> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13filemap: avoid truncating 64-bit offset to 32 bitsMarco Nelissen1-1/+1
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a 64-bit value to 32 bits, leading to a possible infinite loop when writing to an xfs filesystem. Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com Fixes: 54fa39ac2e00 ("iomap: use mapping_seek_hole_data") Signed-off-by: Marco Nelissen <marco.nelissen@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13tools: fix atomic_set() definition to set the value correctlySuren Baghdasaryan2-2/+2
Currently vma test is failing because of the new vma_assert_attached() assertion. The check is failing because previous refcount_set() inside vma_mark_attached() is a NoOp. Fix the definition of atomic_set() to correctly set the value of the atomic. Link: https://lkml.kernel.org/r/20241227222220.1726384-1-surenb@google.com Fixes: 9325b8b5a1cb ("tools: add skeleton code for userland testing of VMA logic") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/mempolicy: count MPOL_WEIGHTED_INTERLEAVE to "interleave_hit"Honggyu Kim1-1/+2
Commit fa3bea4e1f82 introduced MPOL_WEIGHTED_INTERLEAVE but it missed adding its counter to "interleave_hit" of numastat, which is located at /sys/devices/system/node/nodeN/ directory. It'd be better to add weighted interleving counter info to the existing "interleave_hit" instead of introducing a new counter "weighted_interleave_hit". Link: https://lkml.kernel.org/r/20241227095737.645-1-honggyu.kim@sk.com Fixes: fa3bea4e1f82 ("mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Yunjeong Mun <yunjeong.mun@sk.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13scripts/decode_stacktrace.sh: fix decoding of lines with an additional infoLuca Ceresoli1-2/+14
Since commit bdf8eafbf7f5 ("arm64: stacktrace: report source of unwind data") a stack trace line can contain an additional info field that was not present before, in the form of one or more letters in parentheses. E.g.: [ 504.517915] led_sysfs_enable+0x54/0x80 (P) ^^^ When this is present, decode_stacktrace decodes the line incorrectly: [ 504.517915] led_sysfs_enable+0x54/0x80 P Extend parsing to decode it correctly: [ 504.517915] led_sysfs_enable (drivers/leds/led-core.c:455 (discriminator 7)) (P) The regex to match such lines assumes the info can be extended in the future to other uppercase characters, and will need to be extended in case other characters will be used. Using a much more generic regex might incur in false positives, so this looked like a good tradeoff. Link: https://lkml.kernel.org/r/20241230-decode_stacktrace-fix-info-v1-1-984910659173@bootlin.com Fixes: bdf8eafbf7f5 ("arm64: stacktrace: report source of unwind data") Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Puranjay Mohan <puranjay@kernel.org> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/kmemleak: fix percpu memory leak detection failureGuo Weikang1-1/+1
kmemleak_alloc_percpu gives an incorrect min_count parameter, causing percpu memory to be considered a gray object. Link: https://lkml.kernel.org/r/20241227092311.3572500-1-guoweikang.kernel@gmail.com Fixes: 8c8685928910 ("mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space") Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com> Acked-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-06Revert "vmstat: disable vmstat_work on vmstat_cpu_down_prep()"Linus Torvalds1-2/+1
This reverts commit adcfb264c3ed51fbbf5068ddf10d309a63683868. It turns out this just causes a different warning splat instead that seems to be much easier to trigger, so let's revert ASAP. Reported-and-bisected-by: Borislav Petkov <bp@alien8.de> Tested-by: Breno Leitao <leitao@debian.org> Reported-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/all/20250106131817.GAZ3vYGVr3-hWFFPLj@fat_crate.local/ Cc: Koichiro Den <koichiro.den@canonical.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-06Linux 6.13-rc6Linus Torvalds1-1/+1
2025-01-05Merge tag 'kbuild-fixes-v6.13-3' of ↵Linus Torvalds5-34/+46
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Fix escaping of '$' in scripts/mksysmap - Fix a modpost crash observed with the latest binutils - Fix 'provides' in the linux-api-headers pacman package * tag 'kbuild-fixes-v6.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: pacman-pkg: provide versioned linux-api-headers package modpost: work around unaligned data access error modpost: refactor do_vmbus_entry() modpost: fix the missed iteration for the max bit in do_input() scripts/mksysmap: Fix escape chars '$'
2025-01-05Merge tag 'mm-hotfixes-stable-2025-01-04-18-02' of ↵Linus Torvalds29-69/+212
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM. The usual bunch of singletons and two doubletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits) MAINTAINERS: change Arınç _NAL's name and email address scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity mm/util: make memdup_user_nul() similar to memdup_user() mm, madvise: fix potential workingset node list_lru leaks mm/damon/core: fix ignored quota goals and filters of newly committed schemes mm/damon/core: fix new damon_target objects leaks on damon_commit_targets() mm/list_lru: fix false warning of negative counter vmstat: disable vmstat_work on vmstat_cpu_down_prep() mm: shmem: fix the update of 'shmem_falloc->nr_unswapped' mm: shmem: fix incorrect index alignment for within_size policy percpu: remove intermediate variable in PERCPU_PTR() mm: zswap: fix race between [de]compression and CPU hotunplug ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit kcov: mark in_softirq_really() as __always_inline docs: mm: fix the incorrect 'FileHugeMapped' field mailmap: modify the entry for Mathieu Othacehe mm/kmemleak: fix sleeping function called from invalid context at print message mm: hugetlb: independent PMD page table shared count maple_tree: reload mas before the second call for mas_empty_area ...
2025-01-05Merge tag 'clk-fixes-for-linus' of ↵Linus Torvalds2-2/+14
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "A randconfig build fix and a performance fix: - Fix the CONFIG_RESET_CONTROLLER=n path signature of clk_imx8mp_audiomix_reset_controller_register() to appease randconfig - Speed up the sdhci clk on TH1520 by a factor of 4 by adding a fixed factor clk" * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: clk-imx8mp-audiomix: fix function signature clk: thead: Fix TH1520 emmc and shdci clock rate
2025-01-05kbuild: pacman-pkg: provide versioned linux-api-headers packageThomas Weißschuh1-1/+1
The Arch Linux glibc package contains a versioned dependency on "linux-api-headers". If the linux-api-headers package provided by pacman-pkg does not specify an explicit version this dependency is not satisfied. Fix the dependency by providing an explicit version. Fixes: c8578539deba ("kbuild: add script and target to generate pacman package") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-01-04Merge tag 'linux-watchdog-6.13-rc6' of ↵Linus Torvalds1-1/+1
git://www.linux-watchdog.org/linux-watchdog Pull watchdog fix from Wim Van Sebroeck: - fix error message during stm32 driver probe * tag 'linux-watchdog-6.13-rc6' of git://www.linux-watchdog.org/linux-watchdog: watchdog: stm32_iwdg: fix error message during driver probe
2025-01-04Merge tag 'sched_ext-for-6.13-rc5-fixes' of ↵Linus Torvalds17-29/+37
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix a bug where bpf_iter_scx_dsq_new() was not initializing the iterator's flags and could inadvertently enable e.g. reverse iteration - Fix a bug where scx_ops_bypass() could call irq_restore twice - Add Andrea and Changwoo as maintainers for better review coverage - selftests and tools/sched_ext build and other fixes * tag 'sched_ext-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix dsq_local_on selftest sched_ext: initialize kit->cursor.flags sched_ext: Fix invalid irq restore in scx_ops_bypass() MAINTAINERS: add me as reviewer for sched_ext MAINTAINERS: add self as reviewer for sched_ext scx: Fix maximal BPF selftest prog sched_ext: fix application of sizeof to pointer selftests/sched_ext: fix build after renames in sched_ext API sched_ext: Add __weak to fix the build errors
2025-01-04Merge tag 'wq-for-6.13-rc5-fixes' of ↵Linus Torvalds2-11/+30
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Suppress a corner case spurious flush dependency warning - Two trivial changes * tag 'wq-for-6.13-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: add printf attribute to __alloc_workqueue() workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker rust: add safety comment in workqueue traits
2025-01-04Merge tag 'block-6.13-20250103' of git://git.kernel.dk/linuxLinus Torvalds10-82/+109
Pull block fixes from Jens Axboe: "Collection of fixes for block. Particularly the target name overflow has been a bit annoying, as it results in overwriting random memory and hence shows up as triggering various other bugs. - NVMe pull request via Keith: - Fix device specific quirk for PRP list alignment (Robert) - Fix target name overflow (Leo) - Fix target write granularity (Luis) - Fix target sleeping in atomic context (Nilay) - Remove unnecessary tcp queue teardown (Chunguang) - Simple cdrom typo fix" * tag 'block-6.13-20250103' of git://git.kernel.dk/linux: cdrom: Fix typo, 'devicen' to 'device' nvme-tcp: remove nvme_tcp_destroy_io_queues() nvmet-loop: avoid using mutex in IO hotpath nvmet: propagate npwg topology nvmet: Don't overflow subsysnqn nvme-pci: 512 byte aligned dma pool segment quirk
2025-01-04Merge tag 'io_uring-6.13-20250103' of git://git.kernel.dk/linuxLinus Torvalds4-15/+37
Pull io_uring fixes from Jens Axboe: - Fix an issue with the read multishot support and posting of CQEs from io-wq context - Fix a regression introduced in this cycle, where making the timeout lock a raw one uncovered another locking dependency. As a result, move the timeout flushing outside of the timeout lock, punting them to a local list first - Fix use of an uninitialized variable in io_async_msghdr. Doesn't really matter functionally, but silences a valid KMSAN complaint that it's not always initialized - Fix use of incrementally provided buffers for read on non-pollable files, where the buffer always gets committed upfront. Unfortunately the buffer address isn't resolved first, so the read ends up using the updated rather than the current value * tag 'io_uring-6.13-20250103' of git://git.kernel.dk/linux: io_uring/kbuf: use pre-committed buffer address for non-pollable file io_uring/net: always initialize kmsg->msg.msg_inq upfront io_uring/timeout: flush timeouts outside of the timeout lock io_uring/rw: fix downgraded mshot read
2025-01-04Merge tag 'net-6.13-rc6' of ↵Linus Torvalds59-381/+849
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireles and netfilter. Nothing major here. Over the last two weeks we gathered only around two-thirds of our normal weekly fix count, but delaying sending these until -rc7 seemed like a really bad idea. AFAIK we have no bugs under investigation. One or two reverts for stuff for which we haven't gotten a proper fix will likely come in the next PR. Current release - fix to a fix: - netfilter: nft_set_hash: unaligned atomic read on struct nft_set_ext - eth: gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup Previous releases - regressions: - net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets - mptcp: - fix sleeping rcvmsg sleeping forever after bad recvbuffer adjust - fix TCP options overflow - prevent excessive coalescing on receive, fix throughput - net: fix memory leak in tcp_conn_request() if map insertion fails - wifi: cw1200: fix potential NULL dereference after conversion to GPIO descriptors - phy: micrel: dynamically control external clock of KSZ PHY, fix suspend behavior Previous releases - always broken: - af_packet: fix VLAN handling with MSG_PEEK - net: restrict SO_REUSEPORT to inet sockets - netdev-genl: avoid empty messages in NAPI get - dsa: microchip: fix set_ageing_time function on KSZ9477 and LAN937X - eth: - gve: XDP fixes around transmit, queue wakeup etc. - ti: icssg-prueth: fix firmware load sequence to prevent time jump which breaks timesync related operations Misc: - netlink: specs: mptcp: add missing attr and improve documentation" * tag 'net-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits) net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_init net: ti: icssg-prueth: Fix firmware load sequence. mptcp: prevent excessive coalescing on receive mptcp: don't always assume copied data in mptcp_cleanup_rbuf() mptcp: fix recvbuffer adjust on sleeping rcvmsg ila: serialize calls to nf_register_net_hooks() af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK af_packet: fix vlan_get_tci() vs MSG_PEEK net: wwan: iosm: Properly check for valid exec stage in ipc_mmio_init() net: restrict SO_REUSEPORT to inet sockets net: reenable NETIF_F_IPV6_CSUM offload for BIG TCP packets net: sfc: Correct key_len for efx_tc_ct_zone_ht_params net: wwan: t7xx: Fix FSM command timeout issue sky2: Add device ID 11ab:4373 for Marvell 88E8075 mptcp: fix TCP options overflow. net: mv643xx_eth: fix an OF node reference leak gve: trigger RX NAPI instead of TX NAPI in gve_xsk_wakeup eth: bcmsysport: fix call balance of priv->clk handling routines net: llc: reset skb->transport_header netlink: specs: mptcp: fix missing doc ...
2025-01-04Merge tag 'nios2_update_for_v6.14' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux Pull nios2 fixlet from Dinh Nguyen: - Use str_yes_no() helper function * tag 'nios2_update_for_v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux: nios2: Use str_yes_no() helper in show_cpuinfo()
2025-01-03Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds29-159/+321
Pull rdma fixes from Jason Gunthorpe: "A lot of fixes accumulated over the holiday break: - Static tool fixes, value is already proven to be NULL, possible integer overflow - Many bnxt_re fixes: - Crashes due to a mismatch in the maximum SGE list size - Don't waste memory for user QPs by creating kernel-only structures - Fix compatability issues with older HW in some of the new HW features recently introduced: RTS->RTS feature, work around 9096 - Do not allow destroy_qp to fail - Validate QP MTU against device limits - Add missing validation on madatory QP attributes for RTR->RTS - Report port_num in query_qp as required by the spec - Fix creation of QPs of the maximum queue size, and in the variable mode - Allow all QPs to be used on newer HW by limiting a work around only to HW it affects - Use the correct MSN table size for variable mode QPs - Add missing locking in create_qp() accessing the qp_tbl - Form WQE buffers correctly when some of the buffers are 0 hop - Don't crash on QP destroy if the userspace doesn't setup the dip_ctx - Add the missing QP flush handler call on the DWQE path to avoid hanging on error recovery - Consistently use ENXIO for return codes if the devices is fatally errored - Try again to fix VLAN support on iwarp, previous fix was reverted due to breaking other cards - Correct error path return code for rdma netlink events - Remove the seperate net_device pointer in siw and rxe which syzkaller found a way to UAF - Fix a UAF of a stack ib_sge in rtrs - Fix a regression where old mlx5 devices and FW were wrongly activing new device features and failing" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits) RDMA/mlx5: Enable multiplane mode only when it is supported RDMA/bnxt_re: Fix error recovery sequence RDMA/rtrs: Ensure 'ib_sge list' is accessible RDMA/rxe: Remove the direct link to net_device RDMA/hns: Fix missing flush CQE for DWQE RDMA/hns: Fix warning storm caused by invalid input in IO path RDMA/hns: Fix accessing invalid dip_ctx during destroying QP RDMA/hns: Fix mapping error of zero-hop WQE buffer RDMA/bnxt_re: Fix the locking while accessing the QP table RDMA/bnxt_re: Fix MSN table size for variable wqe mode RDMA/bnxt_re: Add send queue size check for variable wqe RDMA/bnxt_re: Disable use of reserved wqes RDMA/bnxt_re: Fix max_qp_wrs reported RDMA/siw: Remove direct link to net_device RDMA/nldev: Set error code in rdma_nl_notify_event RDMA/bnxt_re: Fix reporting hw_ver in query_device RDMA/bnxt_re: Fix to export port num to ib_query_qp RDMA/bnxt_re: Fix setting mandatory attributes for modify_qp RDMA/bnxt_re: Add check for path mtu in modify_qp RDMA/bnxt_re: Fix the check for 9060 condition ...
2025-01-03Merge tag 'pinctrl-v6.13-2' of ↵Linus Torvalds2-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: - A small Kconfig fixup for the i.MX. In principle this could come in from the SoC tree but the bug was introduced from the pin control tree so let's fix it from here. - Fix a sleep in atomic context in the MCP23xxx GPIO expander by disabling the regmap locking and using explicit mutex locks. * tag 'pinctrl-v6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: mcp23s08: Fix sleeping in atomic context due to regmap locking ARM: imx: Re-introduce the PINCTRL selection
2025-01-03Merge tag 'sound-6.13-rc6' of ↵Linus Torvalds7-12/+25
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "The first new year pull request: no surprises, all small fixes, including: - Follow-up fixes for the new compress-offload API extension - A couple of fixes for MIDI 2.0 UMP handling - A trivial race fix for OSS sequencer emulation ioctls - USB-audio and HD-audio fixes / quirks" * tag 'sound-6.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: seq: Check UMP support for midi_version change ALSA hda/realtek: Add quirk for Framework F111:000C Revert "ALSA: ump: Don't enumeration invalid groups for legacy rawmidi" ALSA: seq: oss: Fix races at processing SysEx messages ALSA: compress_offload: fix remaining descriptor races in sound/core/compress_offload.c ALSA: compress_offload: Drop unneeded no_free_ptr() ALSA: hda/tas2781: Ignore SUBSYS_ID not found for tas2563 projects ALSA: usb-audio: US16x08: Initialize array before use
2025-01-03Merge tag 'drm-fixes-2025-01-03' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds13-111/+112
Pull drm fixes from Dave Airlie: "Happy New Year. It was fairly quiet for holidays period, certainly nothing that worth getting off the couch before I needed to, this is for the past two weeks, i915, xe and some adv7511, I expect we will see some amdgpu etc happening next week, but otherwise all quiet. i915: - Fix C10 pll programming sequence [cx0_phy] - Fix power gate sequence. [dg1] xe: - uapi: Revert some devcoredump file format changes breaking a mesa debug tool - Fixes around waits when moving to system - Fix a typo when checking for LMEM provisioning - Fix a fault on fd close after unbind - A couple of OA fixes squashed for stable backporting adv7511: - fix UAF - drop single lane support - audio infoframe fix" * tag 'drm-fixes-2025-01-03' of https://gitlab.freedesktop.org/drm/kernel: xe/oa: Fix query mode of operation for OAR/OAC drm/i915/dg1: Fix power gate sequence. drm/i915/cx0_phy: Fix C10 pll programming sequence drm/xe: Fix fault on fd close after unbind drm/xe/pf: Use correct function to check LMEM provisioning drm/xe: Wait for migration job before unmapping pages drm/xe: Use non-interruptible wait when moving BO to system drm/xe: Revert some changes that break a mesa debug tool drm: adv7511: Drop dsi single lane support dt-bindings: display: adi,adv7533: Drop single lane support drm: adv7511: Fix use-after-free in adv7533_attach_dsi() drm/bridge: adv7511_audio: Update Audio InfoFrame properly
2025-01-03Merge tag 'ftrace-v6.13-rc5-2' of ↵Linus Torvalds2-7/+3
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: - Add needed READ_ONCE() around access to the fgraph array element The updates to the fgraph array can happen when callbacks are registered and unregistered. The __ftrace_return_to_handler() can handle reading either the old value or the new value. But once it reads that value it must stay consistent otherwise the check that looks to see if the value is a stub may show false, but if the compiler decides to re-read after that check, it can be true which can cause the code to crash later on. - Make function profiler use the top level ops for filtering again When function graph became available for instances, its filter ops became independent from the top level set_ftrace_filter. In the process the function profiler received its own filter ops as well. But the function profiler uses the top level set_ftrace_filter file and does not have one of its own. In giving it its own filter ops, it lost any user interface it once had. Make it use the top level set_ftrace_filter file again. This fixes a regression. * tag 'ftrace-v6.13-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix function profiler's filtering functionality fgraph: Add READ_ONCE() when accessing fgraph_array[]
2025-01-03io_uring/kbuf: use pre-committed buffer address for non-pollable fileJens Axboe1-1/+3
For non-pollable files, buffer ring consumption will commit upfront. This is fine, but io_ring_buffer_select() will return the address of the buffer after having committed it. For incrementally consumed buffers, this is incorrect as it will modify the buffer address. Store the pre-committed value and return that. If that isn't done, then the initial part of the buffer is not used and the application will correctly assume the content arrived at the start of the userspace buffer, but the kernel will have put it later in the buffer. Or it can cause a spurious -EFAULT returned in the CQE, depending on the buffer size. As bounds are suitably checked for doing the actual IO, no adverse side effects are possible - it's just a data misplacement within the existing buffer. Reported-by: Gwendal Fernet <gwendalfernet@gmail.com> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-03RDMA/mlx5: Enable multiplane mode only when it is supportedMark Zhang2-2/+4
Driver queries vport_cxt.num_plane and enables multiplane when it is greater then 0, but some old FWs (versions from x.40.1000 till x.42.1000), report vport_cxt.num_plane = 1 unexpectedly. Fix it by querying num_plane only when HCA_CAP2.multiplane bit is set. Fixes: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port") Link: https://patch.msgid.link/r/1ef901acdf564716fcf550453cf5e94f343777ec.1734610916.git.leon@kernel.org Cc: stable@vger.kernel.org Reported-by: Francesco Poli <invernomuto@paranoici.org> Closes: https://lore.kernel.org/all/nvs4i2v7o6vn6zhmtq4sgazy2hu5kiulukxcntdelggmznnl7h@so3oul6uwgbl/ Signed-off-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-01-03Merge branch 'net-iep-clock-module-fixes'David S. Miller7-121/+244
Meghana Malladi says: ==================== IEP clock module bug fixes This series has some bug fixes for IEP module needed by PPS and timesync operations. Patch 1/2 fixes firmware load sequence to run all the firmwares when either of the ethernet interfaces is up. Move all the code common for firmware bringup under common functions. Patch 2/2 fixes distorted PPS signal when the ethernet interfaces are brough down and up. This patch also fixes enabling PPS signal after bringing the interface up, without disabling PPS. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-03net: ti: icssg-prueth: Fix clearing of IEP_CMP_CFG registers during iep_initMeghana Malladi1-0/+8
When ICSSG interfaces are brought down and brought up again, the pru cores are shut down and booted again, flushing out all the memories and start again in a clean state. Hence it is expected that the IEP_CMP_CFG register needs to be flushed during iep_init() to ensure that the existing residual configuration doesn't cause any unusual behavior. If the register is not cleared, existing IEP_CMP_CFG set for CMP1 will result in SYNC0_OUT signal based on the SYNC_OUT register values. After bringing the interface up, calling PPS enable doesn't work as the driver believes PPS is already enabled, (iep->pps_enabled is not cleared during interface bring down) and driver will just return true even though there is no signal. Fix this by disabling pps and perout. Fixes: c1e0230eeaab ("net: ti: icss-iep: Add IEP driver") Signed-off-by: Meghana Malladi <m-malladi@ti.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-03net: ti: icssg-prueth: Fix firmware load sequence.MD Danish Anwar6-121/+236
Timesync related operations are ran in PRU0 cores for both ICSSG SLICE0 and SLICE1. Currently whenever any ICSSG interface comes up we load the respective firmwares to PRU cores and whenever interface goes down, we stop the resective cores. Due to this, when SLICE0 goes down while SLICE1 is still active, PRU0 firmwares are unloaded and PRU0 core is stopped. This results in clock jump for SLICE1 interface as the timesync related operations are no longer running. As there are interdependencies between SLICE0 and SLICE1 firmwares, fix this by running both PRU0 and PRU1 firmwares as long as at least 1 ICSSG interface is up. Add new flag in prueth struct to check if all firmwares are running and remove the old flag (fw_running). Use emacs_initialized as reference count to load the firmwares for the first and last interface up/down. Moving init_emac_mode and fw_offload_mode API outside of icssg_config to icssg_common_start API as they need to be called only once per firmware boot. Change prueth_emac_restart() to return error code and add error prints inside the caller of this functions in case of any failures. Move prueth_emac_stop() from common to sr1 driver. sr1 and sr2 drivers have different logic handling for stopping the firmwares. While sr1 driver is dependent on emac structure to stop the corresponding pru cores for that slice, for sr2 all the pru cores of both the slices are stopped and is not dependent on emac. So the prueth_emac_stop() function is no longer common and can be moved to sr1 driver. Fixes: c1e0230eeaab ("net: ti: icss-iep: Add IEP driver") Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: Meghana Malladi <m-malladi@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-03Merge branch 'mptcp-rx-path-fixes'Jakub Kicinski1-11/+12
Matthieu Baerts says: ==================== mptcp: rx path fixes Here are 3 different fixes, all related to the MPTCP receive buffer: - Patch 1: fix receive buffer space when recvmsg() blocks after receiving some data. For a fix introduced in v6.12, backported to v6.1. - Patch 2: mptcp_cleanup_rbuf() can be called when no data has been copied. For 5.11. - Patch 3: prevent excessive coalescing on receive, which can affect the throughput badly. It looks better to wait a bit before backporting this one to stable versions, to get more results. For 5.10. ==================== Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-0-8608af434ceb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-03mptcp: prevent excessive coalescing on receivePaolo Abeni1-0/+1
Currently the skb size after coalescing is only limited by the skb layout (the skb must not carry frag_list). A single coalesced skb covering several MSS can potentially fill completely the receive buffer. In such a case, the snd win will zero until the receive buffer will be empty again, affecting tput badly. Fixes: 8268ed4c9d19 ("mptcp: introduce and use mptcp_try_coalesce()") Cc: stable@vger.kernel.org # please delay 2 weeks after 6.13-final release Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-3-8608af434ceb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-03mptcp: don't always assume copied data in mptcp_cleanup_rbuf()Paolo Abeni1-9/+9
Under some corner cases the MPTCP protocol can end-up invoking mptcp_cleanup_rbuf() when no data has been copied, but such helper assumes the opposite condition. Explicitly drop such assumption and performs the costly call only when strictly needed - before releasing the msk socket lock. Fixes: fd8976790a6c ("mptcp: be careful on MPTCP-level ack.") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-2-8608af434ceb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>