path: root/include
Age | Commit message | Author | Files | Lines
40 hours | mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection | Joshua Hahn | 1 | -1/+1
commit 0acc67c4030c39f39ac90413cc5d0abddd3a9527 upstream.

Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v5.

Motivation & Approach
=====================

While testing workloads with high sustained memory pressure on large machines in the Meta fleet (1Tb memory, 316 CPUs), we saw an unexpectedly high number of softlockups. Further investigation showed that the zone lock in free_pcppages_bulk was being held for a long time, and was called to free 2k+ pages over 100 times just during boot.

This causes starvation in other processes for the zone lock, which can lead to the system stalling as multiple threads cannot make progress without the locks. We can see these issues manifesting as warnings:

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: hardirqs softirqs csw/system
[ 4512.638793] rcu: number: 0 145 0
[ 4512.651177] rcu: cputime: 30 10410 174 ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)

While these warnings don't indicate a crash or a kernel panic, they do point to the underlying issue of lock contention. To prevent starvation in both locks, batch the freeing of pages using pcp->batch.

Because free_pcppages_bulk is called with the pcp lock and acquires the zone lock, relinquishing and reacquiring the locks are only effective when both of them are broken together (unless the system was built with queued spinlocks). Thus, instead of modifying free_pcppages_bulk to break both locks, batch the freeing from its callers instead.

A similar fix has been implemented in the Meta fleet, and we have seen significantly fewer softlockups.

Testing
=======

The following are a few synthetic benchmarks, made on three machines. The first is a large machine with 754GiB memory and 316 processors. The second is a relatively smaller machine with 251GiB memory and 176 processors. The third and final is the smallest of the three, which has 62GiB memory and 36 processors. On all machines, I kick off a kernel build with -j$(nproc). Negative delta is better (faster compilation).

Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta(%)  |
+------------+---------------+-----------+
| real       | 0.8070        | - 1.4865  |
| user       | 0.2823        | + 0.4081  |
| sys        | 5.0267        | -11.8737  |
+------------+---------------+-----------+

Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real       | 0.2806        | +0.0351  |
| user       | 0.0994        | +0.3170  |
| sys        | 0.6229        | -0.6277  |
+------------+---------------+----------+

Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real       | 0.1503        | -2.6585  |
| user       | 0.0431        | -2.2984  |
| sys        | 0.1870        | -3.2013  |
+------------+---------------+----------+

Here, variation is the coefficient of variation, i.e. standard deviation / mean. Based on these results, it seems like there are varying degrees to how much lock contention this reduces. For the largest and smallest machines that I ran the tests on, it seems like there is quite a significant reduction. There are also some performance increases visible from userspace.
Interestingly, the performance gains don't scale with the size of the machine, but rather there seems to be a dip in the gain there is for the medium-sized machine. One possible theory is that because the high watermark depends on both memory and the number of local CPUs, what impacts zone contention the most is not these individual values, but rather the ratio of mem:processors. This patch (of 5): Currently, refresh_cpu_vm_stats returns an int, indicating how many changes were made during its updates. Using this information, callers like vmstat_update can heuristically determine if more work will be done in the future. However, all of refresh_cpu_vm_stats's callers either (a) ignore the result, only caring about performing the updates, or (b) only care about whether changes were made, but not *how many* changes were made. Simplify the code by returning a bool instead to indicate if updates were made. In addition, simplify fold_diff and decay_pcp_high to return a bool for the same reason. Link: https://lkml.kernel.org/r/20251014145011.3427205-1-joshua.hahnjy@gmail.com Link: https://lkml.kernel.org/r/20251014145011.3427205-2-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Chris Mason <clm@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 038a102535eb ("mm/page_alloc: prevent pcp corruption with SMP=n") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
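A minimal sketch of the caller-side batching idea described by this series, assuming the caller already holds the pcp lock via pcp_spin_lock(); it is illustrative only, not the upstream diff, and the surrounding variables (count, zone, pcp, pindex) come from the caller's context:

while (count) {
        int todo = min(count, pcp->batch);

        free_pcppages_bulk(zone, todo, pcp, pindex);
        count -= todo;

        /* Lock break: free_pcppages_bulk() takes and drops the zone lock
         * itself, so dropping the pcp lock here lets other CPUs progress. */
        pcp_spin_unlock(pcp);
        pcp_spin_lock(pcp);
}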
40 hours | HID: intel-ish-hid: Use dedicated unbound workqueues to prevent resume blocking | Zhang Lixu | 1 | -0/+2
commit 0d30dae38fe01cd1de358c6039a0b1184689fe51 upstream. During suspend/resume tests with S2IDLE, some ISH functional failures were observed because of delay in executing ISH resume handler. Here schedule_work() is used from resume handler to do actual work. schedule_work() uses system_wq, which is a per CPU work queue. Although the queuing is not bound to a CPU, but it prefers local CPU of the caller, unless prohibited. Users of this work queue are not supposed to queue long running work. But in practice, there are scenarios where long running work items are queued on other unbound workqueues, occupying the CPU. As a result, the ISH resume handler may not get a chance to execute in a timely manner. In one scenario, one of the ish_resume_handler() executions was delayed nearly 1 second because another work item on an unbound workqueue occupied the same CPU. This delay causes ISH functionality failures. A similar issue was previously observed where the ISH HID driver timed out while getting the HID descriptor during S4 resume in the recovery kernel, likely caused by the same workqueue contention problem. Create dedicated unbound workqueues for all ISH operations to allow work items to execute on any available CPU, eliminating CPU-specific bottlenecks and improving resume reliability under varying system loads. Also ISH has three different components, a bus driver which implements ISH protocols, a PCI interface layer and HID interface. Use one dedicated work queue for all of them. Signed-off-by: Zhang Lixu <lixu.zhang@intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
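A rough sketch of the dedicated unbound workqueue approach; the names ishtp_wq, ishtp_wq_init and ish_resume_work are illustrative, not the driver's actual identifiers:

static struct workqueue_struct *ishtp_wq;
static struct work_struct ish_resume_work;

static int ishtp_wq_init(void)
{
        /* WQ_UNBOUND: work items may run on any available CPU instead of
         * preferring the submitter's CPU. */
        ishtp_wq = alloc_workqueue("ishtp", WQ_UNBOUND, 0);
        return ishtp_wq ? 0 : -ENOMEM;
}

static int ish_resume(struct device *dev)
{
        /* Instead of schedule_work(&ish_resume_work) on system_wq. */
        queue_work(ishtp_wq, &ish_resume_work);
        return 0;
}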
40 hours | iommu/sva: invalidate stale IOTLB entries for kernel address space | Lu Baolu | 1 | -0/+4
commit e37d5a2d60a338c5917c45296bac65da1382eda5 upstream. Introduce a new IOMMU interface to flush IOTLB paging cache entries for the CPU kernel address space. This interface is invoked from the x86 architecture code that manages combined user and kernel page tables, specifically before any kernel page table page is freed and reused. This addresses the main issue with vfree() which is a common occurrence and can be triggered by unprivileged users. While this resolves the primary problem, it doesn't address some extremely rare case related to memory unplug of memory that was present as reserved memory at boot, which cannot be triggered by unprivileged users. The discussion can be found at the link below. Enable SVA on x86 architecture since the IOMMU can now receive notification to flush the paging cache before freeing the CPU kernel page table pages. Link: https://lkml.kernel.org/r/20251022082635.2462433-9-baolu.lu@linux.intel.com Link: https://lore.kernel.org/linux-iommu/04983c62-3b1d-40d4-93ae-34ca04b827e5@intel.com/ Co-developed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Suggested-by: Jann Horn <jannh@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
40 hours | mm: introduce deferred freeing for kernel page tables | Dave Hansen | 1 | -3/+13
commit 5ba2f0a1556479638ac11a3c201421f5515e89f5 upstream. This introduces a conditional asynchronous mechanism, enabled by CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the freeing of pages that are used as page tables for kernel address mappings. These pages are now queued to a work struct instead of being freed immediately. This deferred freeing allows for batch-freeing of page tables, providing a safe context for performing a single expensive operation (TLB flush) for a batch of kernel page tables instead of performing that expensive operation for each page table. Link: https://lkml.kernel.org/r/20251022082635.2462433-8-baolu.lu@linux.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
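A simplified sketch of the deferred-freeing mechanism the changelog describes, assuming page-table pages can be chained through page->lru; all names are illustrative and the real code operates on ptdescs behind CONFIG_ASYNC_KERNEL_PGTABLE_FREE:

static LIST_HEAD(kernel_pt_free_list);
static DEFINE_SPINLOCK(kernel_pt_free_lock);

static void kernel_pt_free_work_fn(struct work_struct *work)
{
        struct page *page, *next;
        LIST_HEAD(batch);

        spin_lock(&kernel_pt_free_lock);
        list_splice_init(&kernel_pt_free_list, &batch);
        spin_unlock(&kernel_pt_free_lock);

        /* One expensive operation (e.g. a TLB/IOTLB flush) for the whole
         * batch would be issued here, instead of once per page table. */

        list_for_each_entry_safe(page, next, &batch, lru) {
                list_del(&page->lru);
                __free_page(page);
        }
}
static DECLARE_WORK(kernel_pt_free_work, kernel_pt_free_work_fn);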
40 hours | mm: introduce pure page table freeing function | Dave Hansen | 1 | -3/+8
commit 01894295672335ff304beed4359f30d14d5765f2 upstream. The pages used for ptdescs are currently freed back to the allocator in a single location. They will shortly be freed from a second location. Create a simple helper that just frees them back to the allocator. Link: https://lkml.kernel.org/r/20251022082635.2462433-6-baolu.lu@linux.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
40 hours | mm: actually mark kernel page table pages | Dave Hansen | 2 | -0/+21
commit 977870522af34359b461060597ee3a86f27450d6 upstream. Now that the API is in place, mark kernel page table pages just after they are allocated. Unmark them just before they are freed. Note: Unconditionally clearing the 'kernel' marking (via ptdesc_clear_kernel()) would be functionally identical to what is here. But having the if() makes it logically clear that this function can be used for kernel and non-kernel page tables. Link: https://lkml.kernel.org/r/20251022082635.2462433-4-baolu.lu@linux.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
40 hours | mm: add a ptdesc flag to mark kernel page tables | Dave Hansen | 1 | -0/+41
commit 27bfafac65d87c58639f5d7af1353ec1e7886963 upstream. The page tables used to map the kernel and userspace often have very different handling rules. There are frequently *_kernel() variants of functions just for kernel page tables. That's not great and has lead to code duplication. Instead of having completely separate call paths, allow a 'ptdesc' to be marked as being for kernel mappings. Introduce helpers to set and clear this status. Note: this uses the PG_referenced bit. Page flags are a great fit for this since it is truly a single bit of information. Use PG_referenced itself because it's a fairly benign flag (as opposed to things like PG_lock). It's also (according to Willy) unlikely to go away any time soon. PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to be cleared before freeing the page, and pages coming out of the allocator should have it cleared. Regardless, introduce an API to clear it anyway. Having symmetry in the API makes it easier to change the underlying implementation later, like if there was a need to move to a PAGE_FLAGS_CHECK_AT_FREE bit. Link: https://lkml.kernel.org/r/20251022082635.2462433-3-baolu.lu@linux.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yi Lai <yi1.lai@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
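A sketch of what such helpers can look like on top of PG_referenced; the real API and naming live in the ptdesc code and may differ in detail:

static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
{
        folio_set_referenced(ptdesc_folio(ptdesc));
}

static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
{
        folio_clear_referenced(ptdesc_folio(ptdesc));
}

static inline bool ptdesc_test_kernel(struct ptdesc *ptdesc)
{
        return folio_test_referenced(ptdesc_folio(ptdesc));
}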
40 hours | ext4: fix ext4_tune_sb_params padding | Arnd Bergmann | 1 | -1/+1
commit cd16edba1c6a24af138e1a5ded2711231fffa99f upstream. The padding at the end of struct ext4_tune_sb_params is architecture specific and in particular is different between x86-32 and x86-64, since the __u64 member only enforces struct alignment on the latter. This shows up as a new warning when test-building the headers with -Wpadded: include/linux/ext4.h:144:1: error: padding struct size to alignment boundary with 4 bytes [-Werror=padded] All members inside the structure are naturally aligned, so the only difference here is the amount of padding at the end. Make the padding explicit, to have a consistent sizeof(struct ext4_tune_sb_params) of 232 on all architectures and avoid adding compat ioctl handling for EXT4_IOC_GET_TUNE_SB_PARAM/EXT4_IOC_SET_TUNE_SB_PARAM. This is an ABI break on x86-32 but hopefully this can go into 6.18.y early enough as a fixup so no actual users will be affected. Alternatively, the kernel could handle the ioctl commands for both sizes (232 and 228 bytes) on all architectures. Fixes: 04a91570ac67 ("ext4: implemet new ioctls to set and get superblock parameters") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20251204101914.1037148-1-arnd@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
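The alignment effect can be seen on a hypothetical struct (not the real ext4_tune_sb_params layout): a __u64 member forces 8-byte struct alignment on x86-64 but only 4-byte alignment on x86-32, so the implicit tail padding, and therefore sizeof(), differs unless the padding is spelled out:

struct example {
        __u64 a;
        __u32 b;
        __u32 pad;      /* explicit: sizeof() is 16 on both x86-32 and
                         * x86-64.  Without this member, sizeof() is 16 on
                         * x86-64 (rounded up to the 8-byte alignment of the
                         * __u64) but 12 on x86-32, where __u64 is only
                         * 4-byte aligned. */
};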
40 hours | usb: core: add USB_QUIRK_NO_BOS for devices that hang on BOS descriptor | Johannes Brüderl | 1 | -0/+3
commit 2740ac33c87b3d0dfa022efd6ba04c6261b1abbd upstream. Add USB_QUIRK_NO_BOS quirk flag to skip requesting the BOS descriptor for devices that cannot handle it. Add Elgato 4K X (0fd9:009b) to the quirk table. This device hangs when the BOS descriptor is requested at SuperSpeed Plus (10Gbps). Link: https://bugzilla.kernel.org/show_bug.cgi?id=220027 Cc: stable <stable@kernel.org> Signed-off-by: Johannes Brüderl <johannes.bruederl@gmail.com> Link: https://patch.msgid.link/20251207090220.14807-1-johannes.bruederl@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
40 hours | ALSA: pcm: Improve the fix for race of buffer access at PCM OSS layer | Jaroslav Kysela | 1 | -1/+1
commit 47c27c9c9c720bc93fdc69605d0ecd9382e99047 upstream. Handle the error code from snd_pcm_buffer_access_lock() in snd_pcm_runtime_buffer_set_silence() function. Found by Alexandros Panagiotou <apanagio@redhat.com> Fixes: 93a81ca06577 ("ALSA: pcm: Fix race of buffer access at PCM OSS layer") Cc: stable@vger.kernel.org # 6.15 Signed-off-by: Jaroslav Kysela <perex@perex.cz> Link: https://patch.msgid.link/20260107213642.332954-1-perex@perex.cz Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
40 hours | scsi: core: Fix error handler encryption support | Brian Kao | 1 | -0/+6
commit 9a49157deeb23581fc5c8189b486340d7343264a upstream. Some low-level drivers (LLD) access block layer crypto fields, such as rq->crypt_keyslot and rq->crypt_ctx within `struct request`, to configure hardware for inline encryption. However, SCSI Error Handling (EH) commands (e.g., TEST UNIT READY, START STOP UNIT) should not involve any encryption setup. To prevent drivers from erroneously applying crypto settings during EH, this patch saves the original values of rq->crypt_keyslot and rq->crypt_ctx before an EH command is prepared via scsi_eh_prep_cmnd(). These fields in the 'struct request' are then set to NULL. The original values are restored in scsi_eh_restore_cmnd() after the EH command completes. This ensures that the block layer crypto context does not leak into EH command execution. Signed-off-by: Brian Kao <powenkao@google.com> Link: https://patch.msgid.link/20251218031726.2642834-1-powenkao@google.com Cc: stable@vger.kernel.org Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
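A sketch of the save/clear/restore pattern the changelog describes; rq->crypt_ctx and rq->crypt_keyslot are the block-layer fields named above, while stashing them in struct scsi_eh_save (ses) is an assumption about how the patch stores the originals:

/* scsi_eh_prep_cmnd(): hide the crypto context from the LLD so the EH
 * command (e.g. TEST UNIT READY) is not set up for inline encryption. */
ses->crypt_ctx = rq->crypt_ctx;
ses->crypt_keyslot = rq->crypt_keyslot;
rq->crypt_ctx = NULL;
rq->crypt_keyslot = NULL;

/* scsi_eh_restore_cmnd(): restore the context for the original command. */
rq->crypt_ctx = ses->crypt_ctx;
rq->crypt_keyslot = ses->crypt_keyslot;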
40 hours | mm, kfence: describe @slab parameter in __kfence_obj_info() | Bagas Sanjaya | 1 | -0/+1
[ Upstream commit 6cfab50e1440fde19af7c614aacd85e11aa4dcea ] Sphinx reports kernel-doc warning: WARNING: ./include/linux/kfence.h:220 function parameter 'slab' not described in '__kfence_obj_info' Fix it by describing @slab parameter. Link: https://lkml.kernel.org/r/20251219014006.16328-6-bagasdotme@gmail.com Fixes: 2dfe63e61cc3 ("mm, kfence: support kmem_dump_obj() for KFENCE objects") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Marco Elver <elver@google.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hours | textsearch: describe @list member in ts_ops search | Bagas Sanjaya | 1 | -0/+1
[ Upstream commit f26528478bb102c28e7ac0cbfc8ec8185afdafc7 ] Sphinx reports kernel-doc warning: WARNING: ./include/linux/textsearch.h:49 struct member 'list' not described in 'ts_ops' Describe @list member to fix it. Link: https://lkml.kernel.org/r/20251219014006.16328-4-bagasdotme@gmail.com Fixes: 2de4ff7bd658 ("[LIB]: Textsearch infrastructure.") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hours | mm: describe @flags parameter in memalloc_flags_save() | Bagas Sanjaya | 1 | -0/+1
[ Upstream commit e2fb7836b01747815f8bb94981c35f2688afb120 ] Patch series "mm kernel-doc fixes". Here are kernel-doc fixes for mm subsystem. I'm also including textsearch fix since there's currently no maintainer for include/linux/textsearch.h (get_maintainer.pl only shows LKML). This patch (of 4): Sphinx reports kernel-doc warning: WARNING: ./include/linux/sched/mm.h:332 function parameter 'flags' not described in 'memalloc_flags_save' Describe @flags to fix it. Link: https://lkml.kernel.org/r/20251219014006.16328-2-bagasdotme@gmail.com Link: https://lkml.kernel.org/r/20251219014006.16328-3-bagasdotme@gmail.com Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Fixes: 3f6d5e6a468d ("mm: introduce memalloc_flags_{save,restore}") Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
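The fix takes the shape of a kernel-doc line for the parameter; the wording below is illustrative, not the committed text:

/**
 * memalloc_flags_save - set PF_* memalloc flags on the current task
 * @flags: the PF_MEMALLOC* bits to set; the previous flags are returned so
 *         they can later be passed to memalloc_flags_restore()
 */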
40 hours | net: airoha: Fix typo in airoha_ppe_setup_tc_block_cb definition | Lorenzo Bianconi | 1 | -2/+2
[ Upstream commit dfdf774656205515b2d6ad94fce63c7ccbe92d91 ] Fix Typo in airoha_ppe_dev_setup_tc_block_cb routine definition when CONFIG_NET_AIROHA is not enabled. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202601090517.Fj6v501r-lkp@intel.com/ Fixes: f45fc18b6de04 ("net: airoha: Add airoha_ppe_dev struct definition") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260109-airoha_ppe_dev_setup_tc_block_cb-typo-v1-1-282e8834a9f9@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hours | ipv4: ip_tunnel: spread netdev_lockdep_set_classes() | Eric Dumazet | 1 | -1/+12
[ Upstream commit 872ac785e7680dac9ec7f8c5ccd4f667f49d6997 ] Inspired by yet another syzbot report. IPv6 tunnels call netdev_lockdep_set_classes() for each tunnel type, while IPv4 currently centralizes netdev_lockdep_set_classes() call from ip_tunnel_init(). Make ip_tunnel_init() a macro, so that we have different lockdep classes per tunnel type. Fixes: 0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers") Reported-by: syzbot+1240b33467289f5ab50b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/695d439f.050a0220.1c677c.0347.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260106172426.1760721-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
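A sketch of the macro trick: because netdev_lockdep_set_classes() defines static lockdep keys at its expansion site, turning ip_tunnel_init() into a macro gives each tunnel type its own keys. __ip_tunnel_init below is an assumed name for the out-of-line part:

#define ip_tunnel_init(dev)                     \
({                                              \
        netdev_lockdep_set_classes(dev);        \
        __ip_tunnel_init(dev);                  \
})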
40 hours | PM: EM: Fix incorrect description of the cost field in struct em_perf_state | Yaxiong Tian | 1 | -1/+1
[ Upstream commit 54b603f2db6b95495bc33a8f2bde80f044baff9a ] Due to commit 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division"), the logic for energy consumption calculation has been modified. The actual calculation of cost is 10 * power * max_frequency / frequency instead of power * max_frequency / frequency. Therefore, the comment for cost has been updated to reflect the correct content. Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division") Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> [ rjw: Added Fixes: tag ] Link: https://patch.msgid.link/20251230061534.816894-1-tianyaxiong@kylinos.cn Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hours | drm/bridge: dw-hdmi-qp: Fix spurious IRQ on resume | Sebastian Reichel | 1 | -0/+1
[ Upstream commit 14adddc65340f2034751c95616861c0e888e2bb1 ]

After resume from suspend to RAM, the following splash is generated if the HDMI driver is probed (independent of a connected cable):

[ 1194.484052] irq 80: nobody cared (try booting with the "irqpoll" option)
[ 1194.484074] CPU: 0 UID: 0 PID: 627 Comm: rtcwake Not tainted 6.17.0-rc7-g96f1a11414b3 #1 PREEMPT
[ 1194.484082] Hardware name: Rockchip RK3576 EVB V10 Board (DT)
[ 1194.484085] Call trace:
[ 1194.484087] ... (stripped)
[ 1194.484283] handlers:
[ 1194.484285] [<00000000bc363dcb>] dw_hdmi_qp_main_hardirq [dw_hdmi_qp]
[ 1194.484302] Disabling IRQ #80

Apparently the HDMI IP is losing part of its state while the system is suspended and generates spurious interrupts during resume. The bug has not yet been noticed, as system suspend does not yet work properly on upstream kernel with either the Rockchip RK3588 or RK3576 platform.

Fixes: 128a9bf8ace2 ("drm/rockchip: Add basic RK3588 HDMI output support")
Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20251014-rockchip-hdmi-suspend-fix-v1-1-983fcbf44839@collabora.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hours | NFS: Fix a deadlock involving nfs_release_folio() | Trond Myklebust | 1 | -0/+1
[ Upstream commit cce0be6eb4971456b703aaeafd571650d314bcca ] Wang Zhaolong reports a deadlock involving NFSv4.1 state recovery waiting on kthreadd, which is attempting to reclaim memory by calling nfs_release_folio(). The latter cannot make progress due to state recovery being needed. It seems that the only safe thing to do here is to kick off a writeback of the folio, without waiting for completion, or else kicking off an asynchronous commit. Reported-by: Wang Zhaolong <wangzhaolong@huaweicloud.com> Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
7 days | netfilter: nf_tables: avoid chain re-validation if possible | Florian Westphal | 1 | -8/+26
[ Upstream commit 8e1a1bc4f5a42747c08130b8242ebebd1210b32f ]

Hamza Mahfooz reports cpu soft lock-ups in nft_chain_validate():

watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [iptables-nft-re:37547]
[..]
RIP: 0010:nft_chain_validate+0xcb/0x110 [nf_tables]
[..]
 nft_immediate_validate+0x36/0x50 [nf_tables]
 nft_chain_validate+0xc9/0x110 [nf_tables]
 nft_immediate_validate+0x36/0x50 [nf_tables]
 nft_chain_validate+0xc9/0x110 [nf_tables]
 nft_immediate_validate+0x36/0x50 [nf_tables]
 nft_chain_validate+0xc9/0x110 [nf_tables]
 nft_immediate_validate+0x36/0x50 [nf_tables]
 nft_chain_validate+0xc9/0x110 [nf_tables]
 nft_immediate_validate+0x36/0x50 [nf_tables]
 nft_chain_validate+0xc9/0x110 [nf_tables]
 nft_immediate_validate+0x36/0x50 [nf_tables]
 nft_chain_validate+0xc9/0x110 [nf_tables]
 nft_immediate_validate+0x36/0x50 [nf_tables]
 nft_chain_validate+0xc9/0x110 [nf_tables]
 nft_table_validate+0x6b/0xb0 [nf_tables]
 nf_tables_validate+0x8b/0xa0 [nf_tables]
 nf_tables_commit+0x1df/0x1eb0 [nf_tables]
[..]

Currently nf_tables will traverse the entire table (chain graph), starting from the entry points (base chains), exploring all possible paths (chain jumps). But there are cases where we could avoid revalidation. Consider:

 1 input -> j2 -> j3
 2 input -> j2 -> j3
 3 input -> j1 -> j2 -> j3

Then the second rule does not need to revalidate j2, and, by extension j3, because this was already checked during validation of the first rule. We need to validate it only for rule 3.

This is needed because chain loop detection also ensures we do not exceed the jump stack: Just because we know that j2 is cycle free, its last jump might now exceed the allowed stack size. We also need to update all reachable chains with the new largest observed call depth.

Care has to be taken to revalidate even if the chain depth won't be an issue: chain validation also ensures that expressions are not called from invalid base chains. For example, the masquerade expression can only be called from NAT postrouting base chains. Therefore we also need to keep record of the base chain context (type, hooknum) and revalidate if the chain becomes reachable from a different hook location.

Reported-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Closes: https://lore.kernel.org/netfilter-devel/20251118221735.GA5477@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
Tested-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
7 days | net: airoha: Fix npu rx DMA definitions | Lorenzo Bianconi | 1 | -4/+4
[ Upstream commit a7fc8c641cab855824c45e5e8877e40fd528b5df ] Fix typos in npu rx DMA descriptor definitions. Fixes: b3ef7bdec66fb ("net: airoha: Add airoha_offload.h header") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260102-airoha-npu-dma-rx-def-fixes-v1-1-205fc6bf7d94@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
7 days | netdev: preserve NETIF_F_ALL_FOR_ALL across TSO updates | Di Zhu | 1 | -1/+2
[ Upstream commit 02d1e1a3f9239cdb3ecf2c6d365fb959d1bf39df ] Directly increment the TSO features incurs a side effect: it will also directly clear the flags in NETIF_F_ALL_FOR_ALL on the master device, which can cause issues such as the inability to enable the nocache copy feature on the bonding driver. The fix is to include NETIF_F_ALL_FOR_ALL in the update mask, thereby preventing it from being cleared. Fixes: b0ce3508b25e ("bonding: allow TSO being set on bonding master") Signed-off-by: Di Zhu <zhud@hygon.cn> Link: https://patch.msgid.link/20251224012224.56185-1-zhud@hygon.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
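The idea, sketched with netdev_increment_features(); the actual call site touched by the patch is simplified here and the surrounding variable names are assumptions:

/* Keep NETIF_F_ALL_FOR_ALL bits (e.g. the one behind nocache copy) in the
 * mask so updating TSO bits does not clear them on the master device. */
features = netdev_increment_features(features, NETIF_F_ALL_TSO,
                                     NETIF_F_ALL_TSO | NETIF_F_ALL_FOR_ALL);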
7 days | btrfs: fix NULL dereference on root when tracing inode eviction | Miquel Sabaté Solà | 1 | -1/+2
[ Upstream commit f157dd661339fc6f5f2b574fe2429c43bd309534 ] When evicting an inode the first thing we do is to setup tracing for it, which implies fetching the root's id. But in btrfs_evict_inode() the root might be NULL, as implied in the next check that we do in btrfs_evict_inode(). Hence, we either should set the ->root_objectid to 0 in case the root is NULL, or we move tracing setup after checking that the root is not NULL. Setting the rootid to 0 at least gives us the possibility to trace this call even in the case when the root is NULL, so that's the solution taken here. Fixes: 1abe9b8a138c ("Btrfs: add initial tracepoint support for btrfs") Reported-by: syzbot+d991fea1b4b23b1f6bf8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d991fea1b4b23b1f6bf8 Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
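A sketch of the chosen fix inside the tracepoint's assignment (the surrounding TP_fast_assign() context is omitted); recording 0 when the root is not yet set avoids the NULL dereference:

__entry->root_objectid = BTRFS_I(inode)->root ?
                         btrfs_root_id(BTRFS_I(inode)->root) : 0;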
7 days | drm/atomic-helper: Export and namespace some functions | Linus Walleij | 1 | -0/+22
commit d1c7dc57ff2400b141e6582a8d2dc5170108cf81 upstream. Export and namespace those not prefixed with drm_* so it becomes possible to write custom commit tail functions in individual drivers using the helper infrastructure. Tested-by: Marek Vasut <marek.vasut+renesas@mailbox.org> Reviewed-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> Cc: stable@vger.kernel.org # v6.17+ Fixes: c9b1150a68d9 ("drm/atomic-helper: Re-order bridge chain pre-enable and post-disable") Reviewed-by: Aradhya Bhatia <aradhya.bhatia@linux.dev> Reviewed-by: Linus Walleij <linusw@kernel.org> Tested-by: Linus Walleij <linusw@kernel.org> Signed-off-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20251205-drm-seq-fix-v1-3-fda68fa1b3de@ideasonboard.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 daysRevert "drm/atomic-helper: Re-order bridge chain pre-enable and post-disable"Tomi Valkeinen1-183/+66
commit c1ef9a6cabb34dbc09e31417b0c0a672fe0de13a upstream. This reverts commit c9b1150a68d9362a0827609fc0dc1664c0d8bfe1. Changing the enable/disable sequence has caused regressions on multiple platforms: R-Car, MCDE, Rockchip. A series (see link below) was sent to fix these, but it was decided that it's better to revert the original patch and change the enable/disable sequence only in the tidss driver. Reverting this commit breaks tidss's DSI and OLDI outputs, which will be fixed in the following commits. Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> Link: https://lore.kernel.org/all/20251202-mcde-drm-regression-thirdfix-v6-0-f1bffd4ec0fa%40kernel.org/ Fixes: c9b1150a68d9 ("drm/atomic-helper: Re-order bridge chain pre-enable and post-disable") Cc: stable@vger.kernel.org # v6.17+ Reviewed-by: Aradhya Bhatia <aradhya.bhatia@linux.dev> Reviewed-by: Maxime Ripard <mripard@kernel.org> Reviewed-by: Linus Walleij <linusw@kernel.org> Tested-by: Linus Walleij <linusw@kernel.org> Signed-off-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20251205-drm-seq-fix-v1-1-fda68fa1b3de@ideasonboard.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 days | tracing: Add recursion protection in kernel stack trace recording | Steven Rostedt | 1 | -0/+9
commit 5f1ef0dfcb5b7f4a91a9b0e0ba533efd9f7e2cdb upstream. A bug was reported about an infinite recursion caused by tracing the rcu events with the kernel stack trace trigger enabled. The stack trace code called back into RCU which then called the stack trace again. Expand the ftrace recursion protection to add a set of bits to protect events from recursion. Each bit represents the context that the event is in (normal, softirq, interrupt and NMI). Have the stack trace code use the interrupt context to protect against recursion. Note, the bug showed an issue in both the RCU code as well as the tracing stacktrace code. This only handles the tracing stack trace side of the bug. The RCU fix will be handled separately. Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home Reported-by: Yao Kai <yaokai34@huawei.com> Tested-by: Yao Kai <yaokai34@huawei.com> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
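A simplified sketch of per-context recursion guarding; the names are illustrative, and a real implementation keeps the bits per task or per CPU rather than in a single global word:

static unsigned long stacktrace_recursion;

static void record_kernel_stacktrace(void)
{
        int bit = interrupt_context_level();    /* 0=task, 1=softirq,
                                                   2=hardirq, 3=NMI */

        if (test_and_set_bit(bit, &stacktrace_recursion))
                return;         /* already recording in this context */

        /* ... capture and save the stack trace ... */

        clear_bit(bit, &stacktrace_recursion);
}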
7 days | NFSD: Remove NFSERR_EAGAIN | Chuck Lever | 2 | -3/+0
commit c6c209ceb87f64a6ceebe61761951dcbbf4a0baa upstream. I haven't found an NFSERR_EAGAIN in RFCs 1094, 1813, 7530, or 8881. None of these RFCs have an NFS status code that match the numeric value "11". Based on the meaning of the EAGAIN errno, I presume the use of this status in NFSD means NFS4ERR_DELAY. So replace the one usage of nfserr_eagain, and remove it from NFSD's NFS status conversion tables. As far as I can tell, NFSERR_EAGAIN has existed since the pre-git era, but was not actually used by any code until commit f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed."), at which time it become possible for NFSD to return a status code of 11 (which is not valid NFS protocol). Fixes: f4e44b393389 ("NFSD: delay unmount source's export after inter-server copy completed.") Cc: stable@vger.kernel.org Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
13 days | sched/fair: Proportional newidle balance | Peter Zijlstra | 1 | -0/+3
commit 33cf66d88306663d16e4759e9d24766b0aaa2e17 upstream. Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/S Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 | vfio/pci: Disable qword access to the PCI ROM bar | Kevin Tian | 1 | -1/+9
[ Upstream commit dc85a46928c41423ad89869baf05a589e2975575 ] Commit 2b938e3db335 ("vfio/pci: Enable iowrite64 and ioread64 for vfio pci") enables qword access to the PCI bar resources. However certain devices (e.g. Intel X710) are observed with problem upon qword accesses to the rom bar, e.g. triggering PCI aer errors. This is triggered by Qemu which caches the rom content by simply does a pread() of the remaining size until it gets the full contents. The other bars would only perform operations at the same access width as their guest drivers. Instead of trying to identify all broken devices, universally disable qword access to the rom bar i.e. going back to the old way which worked reliably for years. Reported-by: Farrah Chen <farrah.chen@intel.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220740 Fixes: 2b938e3db335 ("vfio/pci: Enable iowrite64 and ioread64 for vfio pci") Cc: stable@vger.kernel.org Signed-off-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Link: https://lore.kernel.org/r/20251218081650.555015-2-kevin.tian@intel.com Signed-off-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 | drm/pagemap, drm/xe: Ensure that the devmem allocation is idle before use | Thomas Hellström | 1 | -3/+14
commit 754c23238438600e9236719f7e67aff2c4d02093 upstream. In situations where no system memory is migrated to devmem, and in upcoming patches where another GPU is performing the migration to the newly allocated devmem buffer, there is nothing to ensure any ongoing clear to the devmem allocation or async eviction from the devmem allocation is complete. Address that by passing a struct dma_fence down to the copy functions, and ensure it is waited for before migration is marked complete. v3: - New patch. v4: - Update the logic used for determining when to wait for the pre_migrate_fence. - Update the logic used for determining when to warn for the pre_migrate_fence since the scheduler fences apparently can signal out-of-order. v5: - Fix a UAF (CI) - Remove references to source P2P migration (Himal) - Put the pre_migrate_fence after migration. v6: - Pipeline the pre_migrate_fence dependency (Matt Brost) Fixes: c5b3eb5a906c ("drm/xe: Add GPUSVM device memory copy vfunc functions") Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.15+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> # For merging through drm-xe. Link: https://patch.msgid.link/20251219113320.183860-4-thomas.hellstrom@linux.intel.com (cherry picked from commit 16b5ad31952476fb925c401897fc171cd37f536b) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 | drm/buddy: Separate clear and dirty free block trees | Arunpravin Paneer Selvam | 1 | -1/+1
commit d4cd665c98c144dd6ad5d66d30396e13d23118c9 upstream. Maintain two separate RB trees per order - one for clear (zeroed) blocks and another for dirty (uncleared) blocks. This separation improves code clarity and makes it more obvious which tree is being searched during allocation. It also improves scalability and efficiency when searching for a specific type of block, avoiding unnecessary checks and making the allocator more predictable under fragmentation. The changes have been validated using the existing drm_buddy_test KUnit test cases, along with selected graphics workloads, to ensure correctness and avoid regressions. v2: Missed adding the suggested-by tag. Added it in v2. v3(Matthew): - Remove the double underscores from the internal functions. - Rename the internal functions to have less generic names. - Fix the error handling code. - Pass tree argument for the tree macro. - Use the existing dirty/free bit instead of new tree field. - Make free_trees[] instead of clear_tree and dirty_tree for more cleaner approach. v4: - A bug was reported by Intel CI and it is fixed by Matthew Auld. - Replace the get_root function with &mm->free_trees[tree][order] (Matthew) - Remove the unnecessary rbtree_is_empty() check (Matthew) - Remove the unnecessary get_tree_for_flags() function. - Rename get_tree_for_block() name with get_block_tree() for more clarity. v5(Jani Nikula): - Don't use static inline in .c files. - enum free_tree and enumerator names are quite generic for a header and usage and the whole enum should be an implementation detail. v6: - Rewrite the __force_merge() function using the rb_last() and rb_prev(). v7(Matthew): - Replace the open-coded tree iteration for loops with the for_each_free_tree() macro throughout the code. - Fixed out_free_roots to prevent double decrement of i, addressing potential crash. - Replaced enum drm_buddy_free_tree with unsigned int in for_each_free_tree loops. Cc: stable@vger.kernel.org Fixes: a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality") Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Suggested-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4260 Link: https://lore.kernel.org/r/20251006095124.1663-2-Arunpravin.PaneerSelvam@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 | drm/buddy: Optimize free block management with RB tree | Arunpravin Paneer Selvam | 1 | -3/+8
commit c178e534fff1d5a74da80ea03b20e2b948a00113 upstream. Replace the freelist (O(n)) used for free block management with a red-black tree, providing more efficient O(log n) search, insert, and delete operations. This improves scalability and performance when managing large numbers of free blocks per order (e.g., hundreds or thousands). In the VK-CTS memory stress subtest, the buddy manager merges fragmented memory and inserts freed blocks into the freelist. Since freelist insertion is O(n), this becomes a bottleneck as fragmentation increases. Benchmarking shows list_insert_sorted() consumes ~52.69% CPU with the freelist, compared to just 0.03% with the RB tree (rbtree_insert.isra.0), despite performing the same sorted insert. This also improves performance in heavily fragmented workloads, such as games or graphics tests that stress memory. As the buddy allocator evolves with new features such as clear-page tracking, the resulting fragmentation and complexity have grown. These RB-tree based design changes are introduced to address that growth and ensure the allocator continues to perform efficiently under fragmented conditions. The RB tree implementation with separate clear/dirty trees provides: - O(n log n) aggregate complexity for all operations instead of O(n^2) - Elimination of soft lockups and system instability - Improved code maintainability and clarity - Better scalability for large memory systems - Predictable performance under fragmentation v3(Matthew): - Remove RB_EMPTY_NODE check in force_merge function. - Rename rb for loop macros to have less generic names and move to .c file. - Make the rb node rb and link field as union. v4(Jani Nikula): - The kernel-doc comment should be "/**" - Move all the rbtree macros to rbtree.h and add parens to ensure correct precedence. v5: - Remove the inline in a .c file (Jani Nikula). v6(Peter Zijlstra): - Add rb_add() function replacing the existing rbtree_insert() code. v7: - A full walk iteration in rbtree is slower than the list (Peter Zijlstra). - The existing rbtree_postorder_for_each_entry_safe macro should be used in scenarios where traversal order is not a critical factor (Christian). v8(Matthew): - Remove the rbtree_is_empty() check in this patch as well. Cc: stable@vger.kernel.org Fixes: a68c7eaa7a8f ("drm/amdgpu: Enable clear page functionality") Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20251006095124.1663-1-Arunpravin.PaneerSelvam@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 | mm: consider non-anon swap cache folios in folio_expected_ref_count() | Bijan Tabatabai | 1 | -4/+4
commit f183663901f21fe0fba8bd31ae894bc529709ee0 upstream.

Currently, folio_expected_ref_count() only adds references for the swap cache if the folio is anonymous. However, according to the comment above the definition of PG_swapcache in enum pageflags, shmem folios can also have PG_swapcache set. This patch makes sure references for the swap cache are added if folio_test_swapcache(folio) is true.

This issue was found when trying to hot-unplug memory in a QEMU/KVM virtual machine. When initiating hot-unplug when most of the guest memory is allocated, hot-unplug hangs partway through removal due to migration failures. The following message would be printed several times, and would be printed again about every five seconds:

[ 49.641309] migrating pfn b12f25 failed ret:7
[ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25
[ 49.641311] aops:swap_aops
[ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3)
[ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000
[ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000
[ 49.641315] page dumped because: migration failure

When debugging this, I found that these migration failures were due to __migrate_folio() returning -EAGAIN for a small set of folios because the expected reference count it calculates via folio_expected_ref_count() is one less than the actual reference count of the folios. Furthermore, all of the affected folios were not anonymous, but had the PG_swapcache flag set, inspiring this patch. After applying this patch, the memory hot-unplug behaves as expected.

I tested this on a machine running Ubuntu 24.04 with kernel version 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the mm-unstable branch as of Dec 16, 2025 was also tested and behaves the same) and 48GB of memory. The libvirt XML definition for the VM can be found at [1]. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in the guest kernel so the hot-pluggable memory is automatically onlined. Below are the steps to reproduce this behavior:

1) Define and start the virtual machine
   host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1]
   host$ virsh -c qemu:///system start test_vm
2) Setup swap in the guest
   guest$ sudo fallocate -l 32G /swapfile
   guest$ sudo chmod 0600 /swapfile
   guest$ sudo mkswap /swapfile
   guest$ sudo swapon /swapfile
3) Use alloc_data [2] to allocate most of the remaining guest memory
   guest$ ./alloc_data 45
4) In a separate guest terminal, monitor the amount of used memory
   guest$ watch -n1 free -h
5) When alloc_data has finished allocating, initiate the memory hot-unplug using the provided xml file [3]
   host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live

After initiating the memory hot-unplug, you should see the amount of available memory in the guest decrease, and the amount of used swap data increase. If everything works as expected, when all of the memory is unplugged, there should be around 8.5-9GB of data in swap. If the unplugging is unsuccessful, the amount of used swap data will settle below that. If that happens, you should be able to see log messages in dmesg similar to the one posted above.
Link: https://lkml.kernel.org/r/20251216200727.2360228-1-bijan311@gmail.com Link: https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml [1] Link: https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c [2] Link: https://github.com/BijanT/linux_patch_files/blob/main/remove.xml [3] Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation") Signed-off-by: Bijan Tabatabai <bijan311@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shivank Garg <shivankg@amd.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kairui Song <ryncsn@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
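The accounting change, roughly (see folio_expected_ref_count() in include/linux/mm.h for the real code): count one reference per page for the swap cache whenever the folio is in it, not only when it is anonymous.

if (folio_test_swapcache(folio))
        ref_count += 1 << folio_order(folio);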
2026-01-08 | kernel/kexec: change the prototype of kimage_map_segment() | Pingfan Liu | 1 | -2/+2
commit fe55ea85939efcbf0e6baa234f0d70acb79e7b58 upstream. The kexec segment index will be required to extract the corresponding information for that segment in kimage_map_segment(). Additionally, kexec_segment already holds the kexec relocation destination address and size. Therefore, the prototype of kimage_map_segment() can be changed. Link: https://lkml.kernel.org/r/20251216014852.8737-1-piliu@redhat.com Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 | kasan: refactor pcpu kasan vmalloc unpoison | Maciej Wieczor-Retman | 1 | -0/+15
commit 6f13db031e27e88213381039032a9cc061578ea6 upstream. A KASAN tag mismatch, possibly causing a kernel panic, can be observed on systems with a tag-based KASAN enabled and with multiple NUMA nodes. It was reported on arm64 and reproduced on x86. It can be explained in the following points: 1. There can be more than one virtual memory chunk. 2. Chunk's base address has a tag. 3. The base address points at the first chunk and thus inherits the tag of the first chunk. 4. The subsequent chunks will be accessed with the tag from the first chunk. 5. Thus, the subsequent chunks need to have their tag set to match that of the first chunk. Refactor code by reusing __kasan_unpoison_vmalloc in a new helper in preparation for the actual fix. Link: https://lkml.kernel.org/r/eb61d93b907e262eefcaa130261a08bcb6c5ce51.1764874575.git.m.wieczorretman@pm.me Fixes: 1d96320f8d53 ("kasan, vmalloc: add vmalloc tagging for SW_TAGS") Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08 | mm/kasan: fix incorrect unpoisoning in vrealloc for KASAN | Jiayuan Chen | 1 | -0/+1
commit 007f5da43b3d0ecff972e2616062b8da1f862f5e upstream. Patch series "kasan: vmalloc: Fixes for the percpu allocator and vrealloc", v3. Patches fix two issues related to KASAN and vmalloc. The first one, a KASAN tag mismatch, possibly resulting in a kernel panic, can be observed on systems with a tag-based KASAN enabled and with multiple NUMA nodes. Initially it was only noticed on x86 [1] but later a similar issue was also reported on arm64 [2]. Specifically the problem is related to how vm_structs interact with pcpu_chunks - both when they are allocated, assigned and when pcpu_chunk addresses are derived. When vm_structs are allocated they are unpoisoned, each with a different random tag, if vmalloc support is enabled along the KASAN mode. Later when first pcpu chunk is allocated it gets its 'base_addr' field set to the first allocated vm_struct. With that it inherits that vm_struct's tag. When pcpu_chunk addresses are later derived (by pcpu_chunk_addr(), for example in pcpu_alloc_noprof()) the base_addr field is used and offsets are added to it. If the initial conditions are satisfied then some of the offsets will point into memory allocated with a different vm_struct. So while the lower bits will get accurately derived the tag bits in the top of the pointer won't match the shadow memory contents. The solution (proposed at v2 of the x86 KASAN series [3]) is to unpoison the vm_structs with the same tag when allocating them for the per cpu allocator (in pcpu_get_vm_areas()). The second one reported by syzkaller [4] is related to vrealloc and happens because of random tag generation when unpoisoning memory without allocating new pages. This breaks shadow memory tracking and needs to reuse the existing tag instead of generating a new one. At the same time an inconsistency in used flags is corrected. This patch (of 3): Syzkaller reported a memory out-of-bounds bug [4]. This patch fixes two issues: 1. In vrealloc the KASAN_VMALLOC_VM_ALLOC flag is missing when unpoisoning the extended region. This flag is required to correctly associate the allocation with KASAN's vmalloc tracking. Note: In contrast, vzalloc (via __vmalloc_node_range_noprof) explicitly sets KASAN_VMALLOC_VM_ALLOC and calls kasan_unpoison_vmalloc() with it. vrealloc must behave consistently -- especially when reusing existing vmalloc regions -- to ensure KASAN can track allocations correctly. 2. When vrealloc reuses an existing vmalloc region (without allocating new pages) KASAN generates a new tag, which breaks tag-based memory access tracking. Introduce KASAN_VMALLOC_KEEP_TAG, a new KASAN flag that allows reusing the tag already attached to the pointer, ensuring consistent tag behavior during reallocation. Pass KASAN_VMALLOC_KEEP_TAG and KASAN_VMALLOC_VM_ALLOC to the kasan_unpoison_vmalloc inside vrealloc_node_align_noprof(). 
Link: https://lkml.kernel.org/r/cover.1765978969.git.m.wieczorretman@pm.me Link: https://lkml.kernel.org/r/38dece0a4074c43e48150d1e242f8242c73bf1a5.1764874575.git.m.wieczorretman@pm.me Link: https://lore.kernel.org/all/e7e04692866d02e6d3b32bb43b998e5d17092ba4.1738686764.git.maciej.wieczor-retman@intel.com/ [1] Link: https://lore.kernel.org/all/aMUrW1Znp1GEj7St@MiWiFi-R3L-srv/ [2] Link: https://lore.kernel.org/all/CAPAsAGxDRv_uFeMYu9TwhBVWHCCtkSxoWY4xmFB_vowMbi8raw@mail.gmail.com/ [3] Link: https://syzkaller.appspot.com/bug?extid=997752115a851cb0cf36 [4] Fixes: a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Co-developed-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Reported-by: syzbot+997752115a851cb0cf36@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e243a2.050a0220.1696c6.007d.GAE@google.com/T/ Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08compiler_types.h: add "auto" as a macro for "__auto_type"H. Peter Anvin1-0/+13
commit 2fb6915fa22dc5524d704afba58a13305dd9f533 upstream. "auto" was defined as a keyword back in the K&R days, but as a storage type specifier. No one ever used it, since it was and is the default storage type for local variables. C++11 recycled the keyword to allow a type to be declared based on the type of an initializer. This was finally adopted into standard C in C23. gcc and clang provide the "__auto_type" alias keyword as an extension for pre-C23; however, there is no reason to pollute the bulk of the source base with this temporary keyword; instead define "auto" as a macro unless the compiler is running in C23+ mode. This macro is added in <linux/compiler_types.h> because that header is included in some of the tools headers, whereas <linux/compiler.h> is not, as it has a bunch of very kernel-specific things in it. [ Cc: stable to reduce potential backporting burden. ] Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Cc: <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
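A minimal sketch of the guard described above, assuming a plain __STDC_VERSION__ check; the actual condition in <linux/compiler_types.h> may differ (for instance to cover C++ or host-tool builds).

	/* pre-C23: provide "auto" via the compiler extension C23 standardized */
	#if !defined(__cplusplus) && (__STDC_VERSION__ < 202311L)
	# define auto __auto_type
	#endif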
2026-01-08kunit: Enforce task execution in {soft,hard}irq contextsDavid Gow1-20/+33
[ Upstream commit c31f4aa8fed048fa70e742c4bb49bb48dc489ab3 ] The kunit_run_irq_test() helper allows a function to be run in hardirq and softirq contexts (in addition to the task context). It does this by running the user-provided function concurrently in the three contexts, until either a timeout has expired or a number of iterations have completed in the normal task context. However, on setups where the initialisation of the hardirq and softirq contexts (or, indeed, the scheduling of those tasks) is significantly slower than the function execution, it's possible for that number of iterations to be exceeded before any runs in irq contexts actually occur. This occurs with the polyval.test_polyval_preparekey_in_irqs test, which runs 20000 iterations of the relatively fast preparekey function, and therefore fails often under many UML, 32-bit arm, m68k and other environments. Instead, ensure that the max_iterations limit counts executions in all three contexts, and requires at least one of each. This will cause the test to continue iterating until at least the irq contexts have been tested, or the 1s wall-clock limit has been exceeded. This causes the test to pass in all of my environments. In so doing, we also update the task counters to atomic ints, to better match both the 'int' max_iterations input, and to ensure they are correctly updated across contexts. Finally, we also fix a few potential assertion messages to be less-specific to the original crypto usecases. Fixes: 950a81224e8b ("lib/crypto: tests: Add hash-test-template.h and gen-hash-testvecs.py") Signed-off-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/r/20251219085259.1163048-1-davidgow@google.com Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
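A hedged sketch of the counting scheme described above; the structure and names inside kunit_run_irq_test() are assumptions, and the separate 1s wall-clock cutoff mentioned in the text is not shown.

	static atomic_t task_iters, softirq_iters, hardirq_iters;

	static bool irq_test_finished(int max_iterations)
	{
		int total = atomic_read(&task_iters) +
			    atomic_read(&softirq_iters) +
			    atomic_read(&hardirq_iters);

		/* count executions from all three contexts, and require at
		 * least one softirq and one hardirq run before stopping */
		return total >= max_iterations &&
		       atomic_read(&softirq_iters) > 0 &&
		       atomic_read(&hardirq_iters) > 0;
	}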
2026-01-08RDMA/irdma: Fix irdma_alloc_ucontext_resp paddingArnd Bergmann1-1/+1
[ Upstream commit d95e99a74eaf35c070f5939295331e5d7857c723 ] A recent commit modified struct irdma_alloc_ucontext_resp by adding a member with implicit padding in front of it, though this does not change the offset of the data members except on m68k. Reported by scripts/check-uapi.sh: ==== ABI differences detected in include/rdma/irdma-abi.h from 1dd7bde2e91c -> HEAD ==== [C] 'struct irdma_alloc_ucontext_resp' changed: type size changed from 704 to 640 (in bits) 1 data member deletion: '__u8 rsvd3[2]', at offset 640 (in bits) at irdma-abi.h:61:1 1 data member insertion: '__u8 revd3[2]', at offset 592 (in bits) at irdma-abi.h:60:1 Change the size back to the previous version, and remove the implicit padding by making it explicit and matching what x86-64 would do by placing the max_hw_srq_quanta member in a naturally aligned location. Fixes: 563e1feb5f6e ("RDMA/irdma: Add SRQ support") Link: https://patch.msgid.link/r/20251208133849.315451-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
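An illustration of the pattern only, not the actual irdma-abi.h contents: m68k aligns __u32/__u64 to 2 bytes, so padding that is implicit on other architectures does not exist there and later members shift; spelling the padding out keeps the UAPI layout identical everywhere.

	struct example_resp {
		__u32 a;
		__u16 b;
		__u8  rsvd[2];	/* was implicit padding on most arches, absent on m68k */
		__u32 c;	/* now at offset 8 on every architecture */
	};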
2026-01-08RDMA/ucma: Fix rdma_ucm_query_ib_service_resp struct paddingArnd Bergmann1-1/+3
[ Upstream commit 2dc675f614850b80deab7cf6d12902636ed8a7f4 ] On a few 32-bit architectures, the newly added ib_user_service_rec structure is not 64-bit aligned the way it is on most regular ones. Add explicit padding into the rdma_ucm_query_ib_service_resp and rdma_ucm_resolve_ib_service structures that embed it, so that the layout is compatible across all of them. This is an ABI change on i386, aligning it with x86_64 and the other 64-bit architectures to avoid having to use a compat ioctl handler. Fixes: 810f874eda8e ("RDMA/ucma: Support query resolved service records") Link: https://patch.msgid.link/r/20251208133311.313977-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-01-08genalloc.h: fix htmldocs warningAndrew Morton1-0/+1
[ Upstream commit 5393802c94e0ab1295c04c94c57bcb00222d4674 ] WARNING: include/linux/genalloc.h:52 function parameter 'start_addr' not described in 'genpool_algo_t' Fixes: 52fbf1134d47 ("lib/genalloc.c: fix allocation of aligned buffer from non-aligned chunk") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lkml.kernel.org/r/20251127130624.563597e3@canb.auug.org.au Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Alexey Skidanov <alexey.skidanov@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
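The fix is a one-line kernel-doc addition for the callback typedef; a hedged sketch of the shape of that line (the upstream wording may differ):

	 * @start_addr: start address of the chunk the allocation is made from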
2026-01-08net: dsa: properly keep track of conduit referenceVladimir Oltean1-0/+1
[ Upstream commit 06e219f6a706c367c93051f408ac61417643d2f9 ] Problem description ------------------- DSA has a mumbo-jumbo of reference handling of the conduit net device and its kobject which, sadly, is just wrong and doesn't make sense. There are two distinct problems. 1. The OF path, which uses of_find_net_device_by_node(), never releases the elevated refcount on the conduit's kobject. Nominally, the OF and non-OF paths should result in objects having identical reference counts taken, and it is already suspicious that dsa_dev_to_net_device() has a put_device() call which is missing in dsa_port_parse_of(), but we can actually even verify that an issue exists. With CONFIG_DEBUG_KOBJECT_RELEASE=y, if we run this command "before" and "after" applying this patch: (unbind the conduit driver for net device eno2) echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind we see these lines in the output diff which appear only with the patch applied: kobject: 'eno2' (ffff002009a3a6b8): kobject_release, parent 0000000000000000 (delayed 1000) kobject: '109' (ffff0020099d59a0): kobject_release, parent 0000000000000000 (delayed 1000) 2. After we find the conduit interface one way (OF) or another (non-OF), it can get unregistered at any time, and DSA remains with a long-lived, but in this case stale, cpu_dp->conduit pointer. Holding the net device's underlying kobject isn't actually of much help, it just prevents it from being freed (but we never need that kobject directly). What helps us to prevent the net device from being unregistered is the parallel netdev reference mechanism (dev_hold() and dev_put()). Actually we actually use that netdev tracker mechanism implicitly on user ports since commit 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings"), via netdev_upper_dev_link(). But time still passes at DSA switch probe time between the initial of_find_net_device_by_node() code and the user port creation time, time during which the conduit could unregister itself and DSA wouldn't know about it. So we have to run of_find_net_device_by_node() under rtnl_lock() to prevent that from happening, and release the lock only with the netdev tracker having acquired the reference. Do we need to keep the reference until dsa_unregister_switch() / dsa_switch_shutdown()? 1: Maybe yes. A switch device will still be registered even if all user ports failed to probe, see commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal"), and the cpu_dp->conduit pointers remain valid. I haven't audited all call paths to see whether they will actually use the conduit in lack of any user port, but if they do, it seems safer to not rely on user ports for that reference. 2. Definitely yes. We support changing the conduit which a user port is associated to, and we can get into a situation where we've moved all user ports away from a conduit, thus no longer hold any reference to it via the net device tracker. But we shouldn't let it go nonetheless - see the next change in relation to dsa_tree_find_first_conduit() and LAG conduits which disappear. We have to be prepared to return to the physical conduit, so the CPU port must explicitly keep another reference to it. This is also to say: the user ports and their CPU ports may not always keep a reference to the same conduit net device, and both are needed. As for the conduit's kobject for the /sys/class/net/ entry, we don't care about it, we can release it as soon as we hold the net device object itself. 
History and blame attribution ----------------------------- The code has been refactored so many times, it is very difficult to follow and properly attribute a blame, but I'll try to make a short history which I hope to be correct. We have two distinct probing paths: - one for OF, introduced in 2016 in commit 83c0afaec7b7 ("net: dsa: Add new binding implementation") - one for non-OF, introduced in 2017 in commit 71e0bbde0d88 ("net: dsa: Add support for platform data") These are both complete rewrites of the original probing paths (which used struct dsa_switch_driver and other weird stuff, instead of regular devices on their respective buses for register access, like MDIO, SPI, I2C etc): - one for OF, introduced in 2013 in commit 5e95329b701c ("dsa: add device tree bindings to register DSA switches") - one for non-OF, introduced in 2008 in commit 91da11f870f0 ("net: Distributed Switch Architecture protocol support") except for tiny bits and pieces like dsa_dev_to_net_device() which were seemingly carried over since the original commit, and used to this day. The point is that the original probing paths received a fix in 2015 in the form of commit 679fb46c5785 ("net: dsa: Add missing master netdev dev_put() calls"), but the fix never made it into the "new" (dsa2) probing paths that can still be traced to today, and the fixed probing path was later deleted in 2019 in commit 93e86b3bc842 ("net: dsa: Remove legacy probing support"). That is to say, the new probing paths were never quite correct in this area. The existence of the legacy probing support which was deleted in 2019 explains why dsa_dev_to_net_device() returns a conduit with elevated refcount (because it was supposed to be released during dsa_remove_dst()). After the removal of the legacy code, the only user of dsa_dev_to_net_device() calls dev_put(conduit) immediately after this function returns. This pattern makes no sense today, and can only be interpreted historically to understand why dev_hold() was there in the first place. Change details -------------- Today we have a better netdev tracking infrastructure which we should use. Logically netdev_hold() belongs in common code (dsa_port_parse_cpu(), where dp->conduit is assigned), but there is a tradeoff to be made with the rtnl_lock() section which would become a bit too long if we did that - dsa_port_parse_cpu() also calls request_module(). So we duplicate a bit of logic in order for the callers of dsa_port_parse_cpu() to be the ones responsible of holding the conduit reference and releasing it on error. This shortens the rtnl_lock() section significantly. In the dsa_switch_probe() error path, dsa_switch_release_ports() will be called in a number of situations, one being where dsa_port_parse_cpu() maybe didn't get the chance to run at all (a different port failed earlier, etc). So we have to test for the conduit being NULL prior to calling netdev_put(). There have still been so many transformations to the code since the blamed commits (rename master -> conduit, commit 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown")), that it only makes sense to fix the code using the best methods available today and see how it can be backported to stable later. I suspect the fix cannot even be backported to kernels which lack dsa_switch_shutdown(), and I suspect this is also maybe why the long-lived conduit reference didn't make it into the new DSA probing paths at the time (problems during shutdown). 
Because dsa_dev_to_net_device() has a single call site and has to be changed anyway, the logic was just absorbed into the non-OF dsa_port_parse(). Tested on the ocelot/felix switch and on dsa_loop, both on the NXP LS1028A with CONFIG_DEBUG_KOBJECT_RELEASE=y. Reported-by: Ma Ke <make24@iscas.ac.cn> Closes: https://lore.kernel.org/netdev/20251214131204.4684-1-make24@iscas.ac.cn/ Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Fixes: 71e0bbde0d88 ("net: dsa: Add support for platform data") Reviewed-by: Jonas Gorski <jonas.gorski@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20251215150236.3931670-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
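A hedged sketch of the resulting hold/put pattern. The tracker field name and the surrounding variables are assumptions; the helpers (of_find_net_device_by_node(), netdev_hold(), netdev_put()), the rtnl_lock() placement and the NULL check on teardown follow the text.

	/* probe: find the conduit and take a tracked reference before it can
	 * be unregistered underneath us */
	rtnl_lock();
	conduit = of_find_net_device_by_node(ethernet_np);
	if (conduit)
		netdev_hold(conduit, &dp->conduit_tracker, GFP_KERNEL);
	rtnl_unlock();

	/* teardown / error path: the CPU port may never have parsed a conduit */
	if (dp->conduit)
		netdev_put(dp->conduit, &dp->conduit_tracker);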
2026-01-08drm/edid: add DRM_EDID_IDENT_INIT() to initialize struct drm_edid_identJani Nikula1-0/+6
[ Upstream commit 8b61583f993589a64c061aa91b44f5bd350d90a5 ] Add a convenience helper for initializing struct drm_edid_ident. Cc: Tiago Martins Araújo <tiago.martins.araujo@gmail.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Tested-by: Tiago Martins Araújo <tiago.martins.araujo@gmail.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/710b2ac6a211606ec1f90afa57b79e8c7375a27e.1761681968.git.jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com> Stable-dep-of: 83cbb4d33dc2 ("drm/displayid: add quirk to ignore DisplayID checksum errors") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-08mm/huge_memory: merge uniform_split_supported() and non_uniform_split_supported()Wei Yang1-5/+3
[ Upstream commit 8a0e4bdddd1c998b894d879a1d22f1e745606215 ] uniform_split_supported() and non_uniform_split_supported() share largely similar logic. The only functional difference is that uniform_split_supported() includes an additional check on the requested @new_order. The reason for this check comes from the following two aspects: * some file systems or the swap cache just support order-0 folios * the behavioral difference between uniform/non-uniform split The behavioral difference between uniform split and non-uniform: * uniform split splits the folio directly to @new_order * non-uniform split creates after-split folios with orders from folio_order(folio) - 1 to new_order. This means for non-uniform split or !new_order split we should check the file system and swap cache respectively. This commit unifies the logic and merges the two functions into a single combined helper, removing redundant code and simplifying the split support checking mechanism. Link: https://lkml.kernel.org/r/20251106034155.21398-3-richard.weiyang@gmail.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ split_type => uniform_split and replaced SPLIT_TYPE_NON_UNIFORM checks ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
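A standalone illustration (not kernel code) of the behavioural difference spelled out above, for an order-9 folio split down to order 2:

	#include <stdio.h>

	int main(void)
	{
		int order = 9, new_order = 2, i;

		/* uniform split: everything goes straight to the target order */
		printf("uniform: %d folios of order %d\n",
		       1 << (order - new_order), new_order);

		/* non-uniform split: one folio per order from order - 1 of the
		 * original down to the target order, plus one extra folio at
		 * the target order */
		printf("non-uniform:");
		for (i = order - 1; i >= new_order; i--)
			printf(" order-%d", i);
		printf(" order-%d\n", new_order);
		return 0;
	}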
2026-01-02mptcp: pm: ignore unknown endpoint flagsMatthieu Baerts (NGI0)1-0/+1
commit 0ace3297a7301911e52d8195cb1006414897c859 upstream. Before this patch, the kernel was saving any flags set by the userspace, even unknown ones. This doesn't cause critical issues because the kernel only looks at specific ones. But on the other hand, endpoint dumps could make the userspace believe that recent flags are supported on older kernel versions. Instead, ignore all unknown flags when parsing them. By doing that, the userspace can continue to set unsupported flags, but it now has a way to verify what the kernel supports. Note that it sounds better to keep accepting unsupported flags so as not to change the behaviour; that also eases things on the userspace side when adding "optional" endpoint types only supported by newer kernel versions, without having to deal with the different kernel versions. A note for the backports: there will be conflicts in mptcp.h on older versions not having the mentioned flags; the new line should still be added last, and the '5' needs to be adapted to have the same value as the last entry. Fixes: 01cacb00b35c ("mptcp: add netlink-based PM") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251205-net-mptcp-misc-fixes-6-19-rc1-v1-1-9e4781a6c1b8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
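A hedged sketch of the idea: mask out unknown bits when parsing an endpoint. The mask name, its width (inferred from the backport note about the value '5') and the parsing site are assumptions, not the actual mptcp.h or PM netlink code.

	/* all flag bits this kernel knows about (assumed name and width) */
	#define MPTCP_PM_ADDR_FLAG_MASK		GENMASK(5, 0)

		/* when parsing an endpoint from the netlink attribute */
		entry->flags = nla_get_u32(attr) & MPTCP_PM_ADDR_FLAG_MASK;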
2026-01-02mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destructionHarry Yoo1-0/+7
commit 0f35040de59371ad542b915d7b91176c9910dadc upstream. Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab caches when a cache is destroyed. This is unnecessary; only the RCU sheaves belonging to the cache being destroyed need to be flushed. As suggested by Vlastimil Babka, introduce a weaker form of kvfree_rcu_barrier() that operates on a specific slab cache. Factor out flush_rcu_sheaves_on_cache() from flush_all_rcu_sheaves() and call it from flush_all_rcu_sheaves() and kvfree_rcu_barrier_on_cache(). Call kvfree_rcu_barrier_on_cache() instead of kvfree_rcu_barrier() on cache destruction. The performance benefit is evaluated on a 12-core, 24-thread AMD Ryzen 5900X machine (1 socket), by loading the slub_kunit module. Before: Total calls: 19 Average latency (us): 18127 Total time (us): 344414 After: Total calls: 19 Average latency (us): 10066 Total time (us): 191264 Two performance regressions have been reported: - stress module loader test's runtime increases by 50-60% (Daniel) - internal graphics test's runtime on Tegra234 increases by 35% (Jon) They are fixed by this change. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Fixes: ec66e0d59952 ("slab: add sheaf support for batching kfree_rcu() operations") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz Reported-and-tested-by: Daniel Gomez <da.gomez@samsung.com> Closes: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org Reported-and-tested-by: Jon Hunter <jonathanh@nvidia.com> Closes: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20251207154148.117723-1-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
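A hedged sketch of the factoring described above. Locking is approximated and the extra work the real kvfree_rcu_barrier() family does (waiting for pending kfree_rcu() batches) is elided; only the split into a per-cache helper follows the text.

	void flush_all_rcu_sheaves(void)
	{
		struct kmem_cache *s;

		mutex_lock(&slab_mutex);
		list_for_each_entry(s, &slab_caches, list)
			flush_rcu_sheaves_on_cache(s);	/* factored-out helper */
		mutex_unlock(&slab_mutex);
	}

	void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
	{
		/* flush only this cache's RCU sheaves instead of every cache's */
		flush_rcu_sheaves_on_cache(s);
	}

Cache destruction then calls kvfree_rcu_barrier_on_cache(s) where it previously called the global kvfree_rcu_barrier().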
2026-01-02dma-mapping: Fix DMA_BIT_MASK() macro being brokenHans de Goede1-1/+1
commit 31b931bebd11a0f00967114f62c8c38952f483e5 upstream. After commit a50f7456f853 ("dma-mapping: Allow use of DMA_BIT_MASK(64) in global scope"), the DMA_BIT_MASK() macro is broken when passed non-trivial expressions for the value of 'n'. This is caused by the new version missing parentheses around 'n' when evaluating 'n'. One example of this breakage is the IPU6 driver now crashing due to it getting DMA addresses with address bit 32 set even though it has tried to set a 32 bit DMA mask. The IPU6 CSI2 engine has a DMA mask of either 31 or 32 bits depending on whether it is in secure mode or not, and it sets this mask like this: mmu_info->aperture_end = (dma_addr_t)DMA_BIT_MASK(isp->secure_mode ? IPU6_MMU_ADDR_BITS : IPU6_MMU_ADDR_BITS_NON_SECURE); So the 'n' argument here is "isp->secure_mode ? IPU6_MMU_ADDR_BITS : IPU6_MMU_ADDR_BITS_NON_SECURE" which gets expanded into: isp->secure_mode ? IPU6_MMU_ADDR_BITS : IPU6_MMU_ADDR_BITS_NON_SECURE - 1 With the -1 only being applied in the non-secure case, causing the secure-mode mask to be 1 bit too large. Fixes: a50f7456f853 ("dma-mapping: Allow use of DMA_BIT_MASK(64) in global scope") Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: James Clark <james.clark@linaro.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251207184756.97904-1-johannes.goede@oss.qualcomm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
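The precedence pitfall, shown with a made-up mask macro rather than the actual DMA_BIT_MASK() definition: when the macro argument is not parenthesized, operators in the macro body bind to only part of a ?: argument.

	#define MASK_BAD(n)	((2ULL << (n - 1)) - 1)		/* 'n' not parenthesized */
	#define MASK_GOOD(n)	((2ULL << ((n) - 1)) - 1)

	/* MASK_BAD(cond ? 32 : 31) expands to ((2ULL << (cond ? 32 : 31 - 1)) - 1),
	 * so the "- 1" applies only to the last operand of the ?:, giving a mask
	 * one bit too large when cond is true. MASK_GOOD() expands as intended. */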
2026-01-02crash: let architecture decide crash memory export to iomem_resourceSourabh Jain1-0/+6
commit adc15829fb73e402903b7030729263b6ee4a7232 upstream. With the generic crashkernel reservation, the kernel emits the following warning on powerpc: WARNING: CPU: 0 PID: 1 at arch/powerpc/mm/mem.c:341 add_system_ram_resources+0xfc/0x180 Modules linked in: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-auto-12607-g5472d60c129f #1 VOLUNTARY Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries NIP: c00000000201de3c LR: c00000000201de34 CTR: 0000000000000000 REGS: c000000127cef8a0 TRAP: 0700 Not tainted (6.17.0-auto-12607-g5472d60c129f) MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 84000840 XER: 20040010 CFAR: c00000000017eed0 IRQMASK: 0 GPR00: c00000000201de34 c000000127cefb40 c0000000016a8100 0000000000000001 GPR04: c00000012005aa00 0000000020000000 c000000002b705c8 0000000000000000 GPR08: 000000007fffffff fffffffffffffff0 c000000002db8100 000000011fffffff GPR12: c00000000201dd40 c000000002ff0000 c0000000000112bc 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 0000000000000000 c0000000015a3808 GPR24: c00000000200468c c000000001699888 0000000000000106 c0000000020d1950 GPR28: c0000000014683f8 0000000081000200 c0000000015c1868 c000000002b9f710 NIP [c00000000201de3c] add_system_ram_resources+0xfc/0x180 LR [c00000000201de34] add_system_ram_resources+0xf4/0x180 Call Trace: add_system_ram_resources+0xf4/0x180 (unreliable) do_one_initcall+0x60/0x36c do_initcalls+0x120/0x220 kernel_init_freeable+0x23c/0x390 kernel_init+0x34/0x26c ret_from_kernel_user_thread+0x14/0x1c This warning occurs due to a conflict between crashkernel and System RAM iomem resources. The generic crashkernel reservation adds the crashkernel memory range to /proc/iomem during early initialization. Later, all memblock ranges are added to /proc/iomem as System RAM. If the crashkernel region overlaps with any memblock range, it causes a conflict while adding those memblock regions as iomem resources, triggering the above warning. The conflicting memblock regions are then omitted from /proc/iomem. For example, if the following crashkernel region is added to /proc/iomem: 20000000-11fffffff : Crash kernel then the following memblock regions System RAM regions fail to be inserted: 00000000-7fffffff : System RAM 80000000-257fffffff : System RAM Fix this by not adding the crashkernel memory to /proc/iomem on powerpc. Introduce an architecture hook to let each architecture decide whether to export the crashkernel region to /proc/iomem. For more info checkout commit c40dd2f766440 ("powerpc: Add System RAM to /proc/iomem") and commit bce074bdbc36 ("powerpc: insert System RAM resource to prevent crashkernel conflict") Note: Before switching to the generic crashkernel reservation, powerpc never exported the crashkernel region to /proc/iomem. Link: https://lkml.kernel.org/r/20251016142831.144515-1-sourabhjain@linux.ibm.com Fixes: e3185ee438c2 ("powerpc/crash: use generic crashkernel reservation"). 
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/all/90937fe0-2e76-4c82-b27e-7b8a7fe3ac69@linux.ibm.com/ Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
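A hedged sketch of what such an architecture hook can look like; the hook name below is an assumption (the actual identifier added by the patch is not visible in this log), while the #ifndef-override pattern and insert_resource() are standard kernel idioms.

	/* generic header: overridable default, export the region as before */
	#ifndef arch_crash_res_in_iomem
	static inline bool arch_crash_res_in_iomem(void)
	{
		return true;
	}
	#endif

		/* in the generic crashkernel reservation path */
		if (arch_crash_res_in_iomem())
			insert_resource(&iomem_resource, &crashk_res);

powerpc would then override the hook to return false, since its System RAM resources already cover the reserved range.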
2026-01-02tpm2-sessions: Fix out of range indexing in name_sizeJarkko Sakkinen1-5/+8
commit 6e9722e9a7bfe1bbad649937c811076acf86e1fd upstream. 'name_size' does not have any range checks; it just directly indexes with TPM_ALG_ID, which could at worst lead to memory corruption. Address the issue by only processing known values and returning -EINVAL for unrecognized values. Also make 'tpm_buf_append_name' and 'tpm_buf_fill_hmac_session' fallible so that errors are detected before causing any spurious TPM traffic. Also end the authorization session on failure in both of the functions, as the session state would then by definition be corrupted. Cc: stable@vger.kernel.org # v6.10+ Fixes: 1085b8276bb4 ("tpm: Add the rest of the session HMAC API") Reviewed-by: Jonathan McDowell <noodles@meta.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
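A hedged sketch of the bounds-safe shape described above; the simplified prototype, the algorithm list and the digest-size constants are assumptions, only the "return -EINVAL for unrecognized values" behaviour comes from the text.

	static int name_size(u16 alg)
	{
		/* a TPM2 name is a 2-byte algorithm id followed by its digest */
		switch (alg) {
		case TPM_ALG_SHA1:
			return 2 + SHA1_DIGEST_SIZE;
		case TPM_ALG_SHA256:
			return 2 + SHA256_DIGEST_SIZE;
		case TPM_ALG_SHA384:
			return 2 + SHA384_DIGEST_SIZE;
		case TPM_ALG_SHA512:
			return 2 + SHA512_DIGEST_SIZE;
		default:
			return -EINVAL;	/* unknown algorithm: refuse, don't index */
		}
	}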
2026-01-02media: v4l2-mem2mem: Fix outdated documentationLaurent Pinchart1-2/+1
commit 082b86919b7a94de01d849021b4da820a6cb89dc upstream. Commit cbd9463da1b1 ("media: v4l2-mem2mem: Avoid calling .device_run in v4l2_m2m_job_finish") deferred calls to .device_run() to a work queue to avoid recursive calls when a job is finished right away from .device_run(). It failed to update the v4l2_m2m_job_finish() documentation that still states the function must not be called from .device_run(). Fix it. Fixes: cbd9463da1b1 ("media: v4l2-mem2mem: Avoid calling .device_run in v4l2_m2m_job_finish") Cc: stable@vger.kernel.org Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>