summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2025-11-24block: make REQ_OP_ZONE_OPEN a write operationDamien Le Moal1-5/+5
commit 19de03b312d69a7e9bacb51c806c6e3f4207376c upstream. A REQ_OP_OPEN_ZONE request changes the condition of a sequential zone of a zoned block device to the explicitly open condition (BLK_ZONE_COND_EXP_OPEN). As such, it should be considered a write operation. Change this operation code to be an odd number to reflect this. The following operation numbers are changed to keep the numbering compact. No problems were reported without this change as this operation has no data. However, this unifies the zone operation to reflect that they modify the device state and also allows strengthening checks in the block layer, e.g. checking if this operation is not issued against a read-only device. Fixes: 6c1b1da58f8c ("block: add zone open, close and finish operations") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24block: fix op_is_zone_mgmt() to handle REQ_OP_ZONE_RESET_ALLDamien Le Moal1-0/+1
commit 12a1c9353c47c0fb3464eba2d78cdf649dee1cf7 upstream. REQ_OP_ZONE_RESET_ALL is a zone management request. Fix op_is_zone_mgmt() to return true for that operation, like it already does for REQ_OP_ZONE_RESET. While no problems were reported without this fix, this change allows strengthening checks in various block device drivers (scsi sd, virtioblk, DM) where op_is_zone_mgmt() is used to verify that a zone management command is not being issued to a regular block device. Fixes: 6c1b1da58f8c ("block: add zone open, close and finish operations") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24fbcon: Set fb_display[i]->mode to NULL when the mode is releasedQuanmin Yan1-0/+2
commit a1f3058930745d2b938b6b4f5bd9630dc74b26b7 upstream. Recently, we discovered the following issue through syzkaller: BUG: KASAN: slab-use-after-free in fb_mode_is_equal+0x285/0x2f0 Read of size 4 at addr ff11000001b3c69c by task syz.xxx ... Call Trace: <TASK> dump_stack_lvl+0xab/0xe0 print_address_description.constprop.0+0x2c/0x390 print_report+0xb9/0x280 kasan_report+0xb8/0xf0 fb_mode_is_equal+0x285/0x2f0 fbcon_mode_deleted+0x129/0x180 fb_set_var+0xe7f/0x11d0 do_fb_ioctl+0x6a0/0x750 fb_ioctl+0xe0/0x140 __x64_sys_ioctl+0x193/0x210 do_syscall_64+0x5f/0x9c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Based on experimentation and analysis, during framebuffer unregistration, only the memory of fb_info->modelist is freed, without setting the corresponding fb_display[i]->mode to NULL for the freed modes. This leads to UAF issues during subsequent accesses. Here's an example of reproduction steps: 1. With /dev/fb0 already registered in the system, load a kernel module to register a new device /dev/fb1; 2. Set fb1's mode to the global fb_display[] array (via FBIOPUT_CON2FBMAP); 3. Switch console from fb to VGA (to allow normal rmmod of the ko); 4. Unload the kernel module, at this point fb1's modelist is freed, leaving a wild pointer in fb_display[]; 5. Trigger the bug via system calls through fb0 attempting to delete a mode from fb0. Add a check in do_unregister_framebuffer(): if the mode to be freed exists in fb_display[], set the corresponding mode pointer to NULL. Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Helge Deller <deller@gmx.de> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-02gpio: regmap: add the .fixed_direction_output configuration parameterIoana Ciornei1-0/+5
[ Upstream commit 00aaae60faf554c27c95e93d47f200a93ff266ef ] There are GPIO controllers such as the one present in the LX2160ARDB QIXIS FPGA which have fixed-direction input and output GPIO lines mixed together in a single register. This cannot be modeled using the gpio-regmap as-is since there is no way to present the true direction of a GPIO line. In order to make this use case possible, add a new configuration parameter - fixed_direction_output - into the gpio_regmap_config structure. This will enable user drivers to provide a bitmap that represents the fixed direction of the GPIO lines. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Michael Walle <mwalle@kernel.org> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Stable-dep-of: 2ba5772e530f ("gpio: idio-16: Define fixed direction of the GPIO lines") Signed-off-by: William Breathitt Gray <wbg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-02gpio: regmap: Allow to allocate regmap-irq deviceMathieu Dubois-Briand1-0/+11
[ Upstream commit 553b75d4bfe9264f631d459fe9996744e0672b0e ] GPIO controller often have support for IRQ: allow to easily allocate both gpio-regmap and regmap-irq in one operation. Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Signed-off-by: Mathieu Dubois-Briand <mathieu.dubois-briand@bootlin.com> Link: https://lore.kernel.org/r/20250824-mdb-max7360-support-v14-5-435cfda2b1ea@bootlin.com Signed-off-by: Lee Jones <lee@kernel.org> Stable-dep-of: 2ba5772e530f ("gpio: idio-16: Define fixed direction of the GPIO lines") Signed-off-by: William Breathitt Gray <wbg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-02bits: introduce fixed-type GENMASK_U*()Vincent Mailhol2-1/+30
[ Upstream commit 19408200c094858d952a90bf4977733dc89a4df5 ] Add GENMASK_TYPE() which generalizes __GENMASK() to support different types, and implement fixed-types versions of GENMASK() based on it. The fixed-type version allows more strict checks to the min/max values accepted, which is useful for defining registers like implemented by i915 and xe drivers with their REG_GENMASK*() macros. The strict checks rely on shift-count-overflow compiler check to fail the build if a number outside of the range allowed is passed. Example: #define FOO_MASK GENMASK_U32(33, 4) will generate a warning like: include/linux/bits.h:51:27: error: right shift count >= width of type [-Werror=shift-count-overflow] 51 | type_max(t) >> (BITS_PER_TYPE(t) - 1 - (h))))) | ^~ The result is casted to the corresponding fixed width type. For example, GENMASK_U8() returns an u8. Note that because of the C promotion rules, GENMASK_U8() and GENMASK_U16() will immediately be promoted to int if used in an expression. Regardless, the main goal is not to get the correct type, but rather to enforce more checks at compile time. While GENMASK_TYPE() is crafted to cover all variants, including the already existing GENMASK(), GENMASK_ULL() and GENMASK_U128(), for the moment, only use it for the newly introduced GENMASK_U*(). The consolidation will be done in a separate change. Co-developed-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Stable-dep-of: 2ba5772e530f ("gpio: idio-16: Define fixed direction of the GPIO lines") Signed-off-by: William Breathitt Gray <wbg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-02bits: add comments and newlines to #if, #else and #endif directivesVincent Mailhol1-2/+6
[ Upstream commit 31299a5e0211241171b2222c5633aad4763bf700 ] This is a preparation for the upcoming GENMASK_U*() and BIT_U*() changes. After introducing those new macros, there will be a lot of scrolling between the #if, #else and #endif. Add a comment to the #else and #endif preprocessor macros to help keep track of which context we are in. Also, add new lines to better visually separate the non-asm and asm sections. Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Stable-dep-of: 2ba5772e530f ("gpio: idio-16: Define fixed direction of the GPIO lines") Signed-off-by: William Breathitt Gray <wbg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-02audit: record fanotify event regardless of presence of rulesRichard Guy Briggs1-1/+1
[ Upstream commit ce8370e2e62a903e18be7dd0e0be2eee079501e1 ] When no audit rules are in place, fanotify event results are unconditionally dropped due to an explicit check for the existence of any audit rules. Given this is a report from another security sub-system, allow it to be recorded regardless of the existence of any audit rules. To test, install and run the fapolicyd daemon with default config. Then as an unprivileged user, create and run a very simple binary that should be denied. Then check for an event with ausearch -m FANOTIFY -ts recent Link: https://issues.redhat.com/browse/RHEL-9065 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23mm/ksm: fix flag-dropping behavior in ksm_madviseJakub Acs1-1/+1
commit f04aad36a07cc17b7a5d5b9a2d386ce6fae63e93 upstream. syzkaller discovered the following crash: (kernel BUG) [ 44.607039] ------------[ cut here ]------------ [ 44.607422] kernel BUG at mm/userfaultfd.c:2067! [ 44.608148] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI [ 44.608814] CPU: 1 UID: 0 PID: 2475 Comm: reproducer Not tainted 6.16.0-rc6 #1 PREEMPT(none) [ 44.609635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 44.610695] RIP: 0010:userfaultfd_release_all+0x3a8/0x460 <snip other registers, drop unreliable trace> [ 44.617726] Call Trace: [ 44.617926] <TASK> [ 44.619284] userfaultfd_release+0xef/0x1b0 [ 44.620976] __fput+0x3f9/0xb60 [ 44.621240] fput_close_sync+0x110/0x210 [ 44.622222] __x64_sys_close+0x8f/0x120 [ 44.622530] do_syscall_64+0x5b/0x2f0 [ 44.622840] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 44.623244] RIP: 0033:0x7f365bb3f227 Kernel panics because it detects UFFD inconsistency during userfaultfd_release_all(). Specifically, a VMA which has a valid pointer to vma->vm_userfaultfd_ctx, but no UFFD flags in vma->vm_flags. The inconsistency is caused in ksm_madvise(): when user calls madvise() with MADV_UNMEARGEABLE on a VMA that is registered for UFFD in MINOR mode, it accidentally clears all flags stored in the upper 32 bits of vma->vm_flags. Assuming x86_64 kernel build, unsigned long is 64-bit and unsigned int and int are 32-bit wide. This setup causes the following mishap during the &= ~VM_MERGEABLE assignment. VM_MERGEABLE is a 32-bit constant of type unsigned int, 0x8000'0000. After ~ is applied, it becomes 0x7fff'ffff unsigned int, which is then promoted to unsigned long before the & operation. This promotion fills upper 32 bits with leading 0s, as we're doing unsigned conversion (and even for a signed conversion, this wouldn't help as the leading bit is 0). & operation thus ends up AND-ing vm_flags with 0x0000'0000'7fff'ffff instead of intended 0xffff'ffff'7fff'ffff and hence accidentally clears the upper 32-bits of its value. Fix it by changing `VM_MERGEABLE` constant to unsigned long, using the BIT() macro. Note: other VM_* flags are not affected: This only happens to the VM_MERGEABLE flag, as the other VM_* flags are all constants of type int and after ~ operation, they end up with leading 1 and are thus converted to unsigned long with leading 1s. Note 2: After commit 31defc3b01d9 ("userfaultfd: remove (VM_)BUG_ON()s"), this is no longer a kernel BUG, but a WARNING at the same place: [ 45.595973] WARNING: CPU: 1 PID: 2474 at mm/userfaultfd.c:2067 but the root-cause (flag-drop) remains the same. [akpm@linux-foundation.org: rust bindgen wasn't able to handle BIT(), from Miguel] Link: https://lore.kernel.org/oe-kbuild-all/202510030449.VfSaAjvd-lkp@intel.com/ Link: https://lkml.kernel.org/r/20251001090353.57523-2-acsjakub@amazon.de Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode") Signed-off-by: Jakub Acs <acsjakub@amazon.de> Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: SeongJae Park <sj@kernel.org> Tested-by: Alice Ryhl <aliceryhl@google.com> Tested-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Xu Xin <xu.xin16@zte.com.cn> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [acsjakub@amazon.de: adapt rust bindgen const to older versions] Signed-off-by: Jakub Acs <acsjakub@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23PCI: Add PCI_VDEVICE_SUB helper macroPiotr Kwapulinski1-0/+14
[ Upstream commit 208fff3f567e2a3c3e7e4788845e90245c3891b4 ] PCI_VDEVICE_SUB generates the pci_device_id struct layout for the specific PCI device/subdevice. Private data may follow the output. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Stable-dep-of: a7075f501bd3 ("ixgbevf: fix mailbox API compatibility by negotiating supported features") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23quota: remove unneeded return value of register_quota_formatKemeng Shi1-1/+1
[ Upstream commit a838e5dca63d1dc701e63b2b1176943c57485c45 ] The register_quota_format always returns 0, simply remove unneeded return value. Link: https://patch.msgid.link/20240715130534.2112678-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jan Kara <jack@suse.cz> Stable-dep-of: 72b7ceca857f ("fs: quota: create dedicated workqueue for quota_release_work") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23PM: runtime: Add new devm functionsBence Csókás1-0/+4
[ Upstream commit 73db799bf5efc5a04654bb3ff6c9bf63a0dfa473 ] Add `devm_pm_runtime_set_active_enabled()` and `devm_pm_runtime_get_noresume()` for simplifying common cases in drivers. Signed-off-by: Bence Csókás <csokas.bence@prolan.hu> Link: https://patch.msgid.link/20250327195928.680771-3-csokas.bence@prolan.hu Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 0792c1984a45 ("iio: imu: inv_icm42600: Simplify pm_runtime setup") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23usb: gadget: Introduce free_usb_request helperKuen-Han Tsai1-0/+23
[ Upstream commit 201c53c687f2b55a7cc6d9f4000af4797860174b ] Introduce the free_usb_request() function that frees both the request's buffer and the request itself. This function serves as the cleanup callback for DEFINE_FREE() to enable automatic, scope-based cleanup for usb_request pointers. Signed-off-by: Kuen-Han Tsai <khtsai@google.com> Link: https://lore.kernel.org/r/20250916-ready-v1-2-4997bf277548@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20250916-ready-v1-2-4997bf277548@google.com Stable-dep-of: 75a5b8d4ddd4 ("usb: gadget: f_ncm: Refactor bind path to use __free()") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23usb: gadget: Store endpoint pointer in usb_requestKuen-Han Tsai1-0/+2
[ Upstream commit bfb1d99d969fe3b892db30848aeebfa19d21f57f ] Gadget function drivers often have goto-based error handling in their bind paths, which can be bug-prone. Refactoring these paths to use __free() scope-based cleanup is desirable, but currently blocked. The blocker is that usb_ep_free_request(ep, req) requires two parameters, while the __free() mechanism can only pass a pointer to the request itself. Store an endpoint pointer in the struct usb_request. The pointer is populated centrally in usb_ep_alloc_request() on every successful allocation, making the request object self-contained. Signed-off-by: Kuen-Han Tsai <khtsai@google.com> Link: https://lore.kernel.org/r/20250916-ready-v1-1-4997bf277548@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20250916-ready-v1-1-4997bf277548@google.com Stable-dep-of: 75a5b8d4ddd4 ("usb: gadget: f_ncm: Refactor bind path to use __free()") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23cpufreq: CPPC: Avoid using CPUFREQ_ETERNAL as transition delayRafael J. Wysocki1-0/+3
[ Upstream commit f965d111e68f4a993cc44d487d416e3d954eea11 ] If cppc_get_transition_latency() returns CPUFREQ_ETERNAL to indicate a failure to retrieve the transition latency value from the platform firmware, the CPPC cpufreq driver will use that value (converted to microseconds) as the policy transition delay, but it is way too large for any practical use. Address this by making the driver use the cpufreq's default transition latency value (in microseconds) as the transition delay if CPUFREQ_ETERNAL is returned by cppc_get_transition_latency(). Fixes: d4f3388afd48 ("cpufreq / CPPC: Set platform specific transition_delay_us") Cc: 5.19+ <stable@vger.kernel.org> # 5.19 Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> [ added CPUFREQ_DEFAULT_TRANSITION_LATENCY_NS definition to include/linux/cpufreq.h ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19rseq: Protect event mask against membarrier IPIThomas Gleixner1-3/+8
[ Upstream commit 6eb350a2233100a283f882c023e5ad426d0ed63b ] rseq_need_restart() reads and clears task::rseq_event_mask with preemption disabled to guard against the scheduler. But membarrier() uses an IPI and sets the PREEMPT bit in the event mask from the IPI, which leaves that RMW operation unprotected. Use guard(irq) if CONFIG_MEMBARRIER is enabled to fix that. Fixes: 2a36ab717e8f ("rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org [ Applied changes to include/linux/sched.h instead of include/linux/rseq.h ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19mm/ksm: fix incorrect KSM counter handling in mm_struct during forkDonet Tom1-0/+6
[ Upstream commit 4d6fc29f36341d7795db1d1819b4c15fe9be7b23 ] Patch series "mm/ksm: Fix incorrect accounting of KSM counters during fork", v3. The first patch in this series fixes the incorrect accounting of KSM counters such as ksm_merging_pages, ksm_rmap_items, and the global ksm_zero_pages during fork. The following patch add a selftest to verify the ksm_merging_pages counter was updated correctly during fork. Test Results ============ Without the first patch ----------------------- # [RUN] test_fork_ksm_merging_page_count not ok 10 ksm_merging_page in child: 32 With the first patch -------------------- # [RUN] test_fork_ksm_merging_page_count ok 10 ksm_merging_pages is not inherited after fork This patch (of 2): Currently, the KSM-related counters in `mm_struct`, such as `ksm_merging_pages`, `ksm_rmap_items`, and `ksm_zero_pages`, are inherited by the child process during fork. This results in inconsistent accounting. When a process uses KSM, identical pages are merged and an rmap item is created for each merged page. The `ksm_merging_pages` and `ksm_rmap_items` counters are updated accordingly. However, after a fork, these counters are copied to the child while the corresponding rmap items are not. As a result, when the child later triggers an unmerge, there are no rmap items present in the child, so the counters remain stale, leading to incorrect accounting. A similar issue exists with `ksm_zero_pages`, which maintains both a global counter and a per-process counter. During fork, the per-process counter is inherited by the child, but the global counter is not incremented. Since the child also references zero pages, the global counter should be updated as well. Otherwise, during zero-page unmerge, both the global and per-process counters are decremented, causing the global counter to become inconsistent. To fix this, ksm_merging_pages and ksm_rmap_items are reset to 0 during fork, and the global ksm_zero_pages counter is updated with the per-process ksm_zero_pages value inherited by the child. This ensures that KSM statistics remain accurate and reflect the activity of each process correctly. Link: https://lkml.kernel.org/r/cover.1758648700.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/7b9870eb67ccc0d79593940d9dbd4a0b39b5d396.1758648700.git.donettom@linux.ibm.com Fixes: 7609385337a4 ("ksm: count ksm merging pages for each process") Fixes: cb4df4cae4f2 ("ksm: count allocated ksm rmap_items for each process") Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aboorva Devarajan <aboorvad@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ changed mm_flags_test() to test_bit() ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19iio: frequency: adf4350: Fix ADF4350_REG3_12BIT_CLKDIV_MODEMichael Hennerich1-1/+1
commit 1d8fdabe19267338f29b58f968499e5b55e6a3b6 upstream. The clk div bits (2 bits wide) do not start in bit 16 but in bit 15. Fix it accordingly. Fixes: e31166f0fd48 ("iio: frequency: New driver for Analog Devices ADF4350/ADF4351 Wideband Synthesizers") Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Nuno Sá <nuno.sa@analog.com> Link: https://patch.msgid.link/20250829-adf4350-fix-v2-2-0bf543ba797d@analog.com Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15bpf: Enforce expected_attach_type for tailcall compatibilityDaniel Borkmann1-0/+1
[ Upstream commit 4540aed51b12bc13364149bf95f6ecef013197c0 ] Yinhao et al. recently reported: Our fuzzer tool discovered an uninitialized pointer issue in the bpf_prog_test_run_xdp() function within the Linux kernel's BPF subsystem. This leads to a NULL pointer dereference when a BPF program attempts to deference the txq member of struct xdp_buff object. The test initializes two programs of BPF_PROG_TYPE_XDP: progA acts as the entry point for bpf_prog_test_run_xdp() and its expected_attach_type can neither be of be BPF_XDP_DEVMAP nor BPF_XDP_CPUMAP. progA calls into a slot of a tailcall map it owns. progB's expected_attach_type must be BPF_XDP_DEVMAP to pass xdp_is_valid_access() validation. The program returns struct xdp_md's egress_ifindex, and the latter is only allowed to be accessed under mentioned expected_attach_type. progB is then inserted into the tailcall which progA calls. The underlying issue goes beyond XDP though. Another example are programs of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR. sock_addr_is_valid_access() as well as sock_addr_func_proto() have different logic depending on the programs' expected_attach_type. Similarly, a program attached to BPF_CGROUP_INET4_GETPEERNAME should not be allowed doing a tailcall into a program which calls bpf_bind() out of BPF which is only enabled for BPF_CGROUP_INET4_CONNECT. In short, specifying expected_attach_type allows to open up additional functionality or restrictions beyond what the basic bpf_prog_type enables. The use of tailcalls must not violate these constraints. Fix it by enforcing expected_attach_type in __bpf_prog_map_compatible(). Note that we only enforce this for tailcall maps, but not for BPF devmaps or cpumaps: There, the programs are invoked through dev_map_bpf_prog_run*() and cpu_map_bpf_prog_run*() which set up a new environment / context and therefore these situations are not prone to this issue. Fixes: 5e43f899b03a ("bpf: Check attach type at prog load time") Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250926171201.188490-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15once: fix race by moving DO_ONCE to separate sectionQi Xi1-2/+2
[ Upstream commit edcc8a38b5ac1a3dbd05e113a38a25b937ebefe5 ] The commit c2c60ea37e5b ("once: use __section(".data.once")") moved DO_ONCE's ___done variable to .data.once section, which conflicts with DO_ONCE_LITE() that also uses the same section. This creates a race condition when clear_warn_once is used: Thread 1 (DO_ONCE) Thread 2 (DO_ONCE) __do_once_start read ___done (false) acquire once_lock execute func __do_once_done write ___done (true) __do_once_start release once_lock // Thread 3 clear_warn_once reset ___done read ___done (false) acquire once_lock execute func schedule once_work __do_once_done once_deferred: OK write ___done (true) static_branch_disable release once_lock schedule once_work once_deferred: BUG_ON(!static_key_enabled) DO_ONCE_LITE() in once_lite.h is used by WARN_ON_ONCE() and other warning macros. Keep its ___done flag in the .data..once section and allow resetting by clear_warn_once, as originally intended. In contrast, DO_ONCE() is used for functions like get_random_once() and relies on its ___done flag for internal synchronization. We should not reset DO_ONCE() by clear_warn_once. Fix it by isolating DO_ONCE's ___done into a separate .data..do_once section, shielding it from clear_warn_once. Fixes: c2c60ea37e5b ("once: use __section(".data.once")") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Qi Xi <xiqi2@huawei.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-12driver core/PM: Set power.no_callbacks along with power.no_pmRafael J. Wysocki1-0/+3
commit c2ce2453413d429e302659abc5ace634e873f6f5 upstream. Devices with power.no_pm set are not expected to need any power management at all, so modify device_set_pm_not_required() to set power.no_callbacks for them too in case runtime PM will be enabled for any of them (which in principle may be done for convenience if such a device participates in a dependency chain). Since device_set_pm_not_required() must be called before device_add() or it would not have any effect, it can update power.no_callbacks without locking, unlike pm_runtime_no_callbacks() that can be called after registering the target device. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: stable <stable@kernel.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/1950054.tdWV9SEqCh@rafael.j.wysocki Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax.h: remove some #defines that are only expanded onceDavid Laight1-8/+6
[ Upstream commit 2b97aaf74ed534fb838d09867d09a3ca5d795208 ] The bodies of __signed_type_use() and __unsigned_type_use() are much the same size as their names - so put the bodies in the only line that expands them. Similarly __signed_type() is defined separately for 64bit and then used exactly once just below. Change the test for __signed_type from CONFIG_64BIT to one based on gcc defined macros so that the code is valid if it gets used outside of a kernel build. Link: https://lkml.kernel.org/r/9386d1ebb8974fbabbed2635160c3975@AcuMS.aculab.com Signed-off-by: David Laight <david.laight@aculab.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax.h: simplify the variants of clamp()David Laight1-12/+12
[ Upstream commit 495bba17cdf95e9703af1b8ef773c55ef0dfe703 ] Always pass a 'type' through to __clamp_once(), pass '__auto_type' from clamp() itself. The expansion of __types_ok3() is reasonable so it isn't worth the added complexity of avoiding it when a fixed type is used for all three values. Link: https://lkml.kernel.org/r/8f69f4deac014f558bab186444bac2e8@AcuMS.aculab.com Signed-off-by: David Laight <david.laight@aculab.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax.h: move all the clamp() definitions after the min/max() onesDavid Laight1-58/+51
[ Upstream commit c3939872ee4a6b8bdcd0e813c66823b31e6e26f7 ] At some point the definitions for clamp() got added in the middle of the ones for min() and max(). Re-order the definitions so they are more sensibly grouped. Link: https://lkml.kernel.org/r/8bb285818e4846469121c8abc3dfb6e2@AcuMS.aculab.com Signed-off-by: David Laight <david.laight@aculab.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()David Laight1-2/+1
[ Upstream commit a5743f32baec4728711bbc01d6ac2b33d4c67040 ] Use BUILD_BUG_ON_MSG(statically_true(ulo > uhi), ...) for the sanity check of the bounds in clamp(). Gives better error coverage and one less expansion of the arguments. Link: https://lkml.kernel.org/r/34d53778977747f19cce2abb287bb3e6@AcuMS.aculab.com Signed-off-by: David Laight <david.laight@aculab.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax.h: reduce the #define expansion of min(), max() and clamp()David Laight1-12/+12
[ Upstream commit b280bb27a9f7c91ddab730e1ad91a9c18a051f41 ] Since the test for signed values being non-negative only relies on __builtion_constant_p() (not is_constexpr()) it can use the 'ux' variable instead of the caller supplied expression. This means that the #define parameters are only expanded twice. Once in the code and once quoted in the error message. Link: https://lkml.kernel.org/r/051afc171806425da991908ed8688a98@AcuMS.aculab.com Signed-off-by: David Laight <david.laight@aculab.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax.h: update some commentsDavid Laight1-29/+24
[ Upstream commit 10666e99204818ef45c702469488353b5bb09ec7 ] - Change three to several. - Remove the comment about retaining constant expressions, no longer true. - Realign to nearer 80 columns and break on major punctiation. - Add a leading comment to the block before __signed_type() and __is_nonneg() Otherwise the block explaining the cast is a bit 'floating'. Reword the rest of that comment to improve readability. Link: https://lkml.kernel.org/r/85b050c81c1d4076aeb91a6cded45fee@AcuMS.aculab.com Signed-off-by: David Laight <david.laight@aculab.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax.h: add whitespace around operators and after commasDavid Laight1-17/+17
[ Upstream commit 71ee9b16251ea4bf7c1fe222517c82bdb3220acc ] Patch series "minmax.h: Cleanups and minor optimisations". Some tidyups and minor changes to minmax.h. This patch (of 7): Link: https://lkml.kernel.org/r/c50365d214e04f9ba256d417c8bebbc0@AcuMS.aculab.com Link: https://lkml.kernel.org/r/f04b2e1310244f62826267346fde0553@AcuMS.aculab.com Signed-off-by: David Laight <david.laight@aculab.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Arnd Bergmann <arnd@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax: fix up min3() and max3() tooLinus Torvalds1-2/+10
[ Upstream commit 21b136cc63d2a9ddd60d4699552b69c214b32964 ] David Laight pointed out that we should deal with the min3() and max3() mess too, which still does excessive expansion. And our current macros are actually rather broken. In particular, the macros did this: #define min3(x, y, z) min((typeof(x))min(x, y), z) #define max3(x, y, z) max((typeof(x))max(x, y), z) and that not only is a nested expansion of possibly very complex arguments with all that involves, the typing with that "typeof()" cast is completely wrong. For example, imagine what happens in max3() if 'x' happens to be a 'unsigned char', but 'y' and 'z' are 'unsigned long'. The types are compatible, and there's no warning - but the result is just random garbage. No, I don't think we've ever hit that issue in practice, but since we now have sane infrastructure for doing this right, let's just use it. It fixes any excessive expansion, and also avoids these kinds of broken type issues. Requested-by: David Laight <David.Laight@aculab.com> Acked-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax: improve macro expansion and type checkingLinus Torvalds2-15/+68
[ Upstream commit 22f5468731491e53356ba7c028f0fdea20b18e2c ] This clarifies the rules for min()/max()/clamp() type checking and makes them a much more efficient macro expansion. In particular, we now look at the type and range of the inputs to see whether they work together, generating a mask of acceptable comparisons, and then just verifying that the inputs have a shared case: - an expression with a signed type can be used for (1) signed comparisons (2) unsigned comparisons if it is statically known to have a non-negative value - an expression with an unsigned type can be used for (3) unsigned comparison (4) signed comparisons if the type is smaller than 'int' and thus the C integer promotion rules will make it signed anyway Here rule (1) and (3) are obvious, and rule (2) is important in order to allow obvious trivial constants to be used together with unsigned values. Rule (4) is not necessarily a good idea, but matches what we used to do, and we have extant cases of this situation in the kernel. Notably with bcachefs having an expression like min(bch2_bucket_sectors_dirty(a), ca->mi.bucket_size) where bch2_bucket_sectors_dirty() returns an 's64', and 'ca->mi.bucket_size' is of type 'u16'. Technically that bcachefs comparison is clearly sensible on a C type level, because the 'u16' will go through the normal C integer promotion, and become 'int', and then we're comparing two signed values and everything looks sane. However, it's not entirely clear that a 'min(s64,u16)' operation makes a lot of conceptual sense, and it's possible that we will remove rule (4). After all, the _reason_ we have these complicated type checks is exactly that the C type promotion rules are not very intuitive. But at least for now the rule is in place for backwards compatibility. Also note that rule (2) existed before, but is hugely relaxed by this commit. It used to be true only for the simplest compile-time non-negative integer constants. The new macro model will allow cases where the compiler can trivially see that an expression is non-negative even if it isn't necessarily a constant. For example, the amdgpu driver does min_t(size_t, sizeof(fru_info->serial), pia[addr] & 0x3F)); because our old 'min()' macro would see that 'pia[addr] & 0x3F' is of type 'int' and clearly not a C constant expression, so doing a 'min()' with a 'size_t' is a signedness violation. Our new 'min()' macro still sees that 'pia[addr] & 0x3F' is of type 'int', but is smart enough to also see that it is clearly non-negative, and thus would allow that case without any complaints. Cc: Arnd Bergmann <arnd@kernel.org> Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax: simplify min()/max()/clamp() implementationLinus Torvalds1-23/+20
[ Upstream commit dc1c8034e31b14a2e5e212104ec508aec44ce1b9 ] Now that we no longer have any C constant expression contexts (ie array size declarations or static initializers) that use min() or max(), we can simpify the implementation by not having to worry about the result staying as a C constant expression. So now we can unconditionally just use temporary variables of the right type, and get rid of the excessive expansion that used to come from the use of __builtin_choose_expr(__is_constexpr(...), .. to pick the specialized code for constant expressions. Another expansion simplification is to pass the temporary variables (in addition to the original expression) to our __types_ok() macro. That may superficially look like it complicates the macro, but when we only want the type of the expression, expanding the temporary variable names is much simpler and smaller than expanding the potentially complicated original expression. As a result, on my machine, doing a $ time make drivers/staging/media/atomisp/pci/isp/kernels/ynr/ynr_1.0/ia_css_ynr.host.i goes from real 0m16.621s user 0m15.360s sys 0m1.221s to real 0m2.532s user 0m2.091s sys 0m0.452s because the token expansion goes down dramatically. In particular, the longest line expansion (which was line 71 of that 'ia_css_ynr.host.c' file) shrinks from 23,338kB (yes, 23MB for one single line) to "just" 1,444kB (now "only" 1.4MB). And yes, that line is still the line from hell, because it's doing multiple levels of "min()/max()" expansion thanks to some of them being hidden inside the uDIGIT_FITTING() macro. Lorenzo has a nice cleanup patch that makes that driver use inline functions instead of macros for sDIGIT_FITTING() and uDIGIT_FITTING(), which will fix that line once and for all, but the 16-fold reduction in this case does show why we need to simplify these helpers. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02minmax: make generic MIN() and MAX() macros available everywhereLinus Torvalds1-0/+2
[ Upstream commit 1a251f52cfdc417c84411a056bc142cbd77baef4 ] This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros. These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches: - trivial duplicated macro definitions have been removed Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically. - non-trivial duplicated macro definitions are guarded with #ifndef This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case. - strange use case #1 A couple of drivers decided that the way they want to describe their versioning is with #define MAJ 1 #define MIN 2 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) which adds zero value and I just did my Alexander the Great impersonation, and rewrote that pointless Gordian knot as #define DRV_VERSION "1.2" instead. - strange use case #2 A couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were re-written as C enum's instead. The new function-line macros only expand when followed by an open parenthesis, and thus don't clash with enum use. Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02mm: folio_may_be_lru_cached() unless folio_test_large()Hugh Dickins1-0/+10
[ Upstream commit 2da6de30e60dd9bb14600eff1cc99df2fa2ddae3 ] mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as a large folio is added: so collect_longterm_unpinnable_folios() just wastes effort when calling lru_add_drain[_all]() on a large folio. But although there is good reason not to batch up PMD-sized folios, we might well benefit from batching a small number of low-order mTHPs (though unclear how that "small number" limitation will be implemented). So ask if folio_may_be_lru_cached() rather than !folio_test_large(), to insulate those particular checks from future change. Name preferred to "folio_is_batchable" because large folios can well be put on a batch: it's just the per-CPU LRU caches, drained much later, which need care. Marked for stable, to counter the increase in lru_add_drain_all()s from "mm/gup: check ref_count instead of lru before migration". Link: https://lkml.kernel.org/r/57d2eaf8-3607-f318-e0c5-be02dce61ad0@google.com Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region") Signed-off-by: Hugh Dickins <hughd@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Resolved conflicts in mm/swap.c ] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02mm: add folio_expected_ref_count() for reference count calculationShivank Garg1-0/+55
[ Upstream commit 86ebd50224c0734d965843260d0dc057a9431c61 ] Patch series " JFS: Implement migrate_folio for jfs_metapage_aops" v5. This patchset addresses a warning that occurs during memory compaction due to JFS's missing migrate_folio operation. The warning was introduced by commit 7ee3647243e5 ("migrate: Remove call to ->writepage") which added explicit warnings when filesystem don't implement migrate_folio. The syzbot reported following [1]: jfs_metapage_aops does not implement migrate_folio WARNING: CPU: 1 PID: 5861 at mm/migrate.c:955 fallback_migrate_folio mm/migrate.c:953 [inline] WARNING: CPU: 1 PID: 5861 at mm/migrate.c:955 move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 Modules linked in: CPU: 1 UID: 0 PID: 5861 Comm: syz-executor280 Not tainted 6.15.0-rc1-next-20250411-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 RIP: 0010:fallback_migrate_folio mm/migrate.c:953 [inline] RIP: 0010:move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 To fix this issue, this series implement metapage_migrate_folio() for JFS which handles both single and multiple metapages per page configurations. While most filesystems leverage existing migration implementations like filemap_migrate_folio(), buffer_migrate_folio_norefs() or buffer_migrate_folio() (which internally used folio_expected_refs()), JFS's metapage architecture requires special handling of its private data during migration. To support this, this series introduce the folio_expected_ref_count(), which calculates external references to a folio from page/swap cache, private data, and page table mappings. This standardized implementation replaces the previous ad-hoc folio_expected_refs() function and enables JFS to accurately determine whether a folio has unexpected references before attempting migration. Implement folio_expected_ref_count() to calculate expected folio reference counts from: - Page/swap cache (1 per page) - Private data (1) - Page table mappings (1 per map) While originally needed for page migration operations, this improved implementation standardizes reference counting by consolidating all refcount contributors into a single, reusable function that can benefit any subsystem needing to detect unexpected references to folios. The folio_expected_ref_count() returns the sum of these external references without including any reference the caller itself might hold. Callers comparing against the actual folio_ref_count() must account for their own references separately. Link: https://syzkaller.appspot.com/bug?extid=8bb6fd945af4e0ad9299 [1] Link: https://lkml.kernel.org/r/20250430100150.279751-1-shivankg@amd.com Link: https://lkml.kernel.org/r/20250430100150.279751-2-shivankg@amd.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Co-developed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 98c6d259319e ("mm/gup: check ref_count instead of lru before migration") [ Take the new function in mm.h, removing "const" from its parameter to stop build warnings; but avoid all the conflicts of using it in mm/migrate.c. ] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-25minmax: simplify and clarify min_t()/max_t() implementationLinus Torvalds1-8/+11
[ Upstream commit 017fa3e89187848fd056af757769c9e66ac3e93d ] This simplifies the min_t() and max_t() macros by no longer making them work in the context of a C constant expression. That means that you can no longer use them for static initializers or for array sizes in type definitions, but there were only a couple of such uses, and all of them were converted (famous last words) to use MIN_T/MAX_T instead. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25minmax: avoid overly complicated constant expressions in VM codeLinus Torvalds2-1/+8
[ Upstream commit 3a7e02c040b130b5545e4b115aada7bacd80a2b6 ] The minmax infrastructure is overkill for simple constants, and can cause huge expansions because those simple constants are then used by other things. For example, 'pageblock_order' is a core VM constant, but because it was implemented using 'min_t()' and all the type-checking that involves, it actually expanded to something like 2.5kB of preprocessor noise. And when that simple constant was then used inside other expansions: #define pageblock_nr_pages (1UL << pageblock_order) #define pageblock_start_pfn(pfn) ALIGN_DOWN((pfn), pageblock_nr_pages) and we then use that inside a 'max()' macro: case ISOLATE_SUCCESS: update_cached = false; last_migrated_pfn = max(cc->zone->zone_start_pfn, pageblock_start_pfn(cc->migrate_pfn - 1)); the end result was that one statement expanding to 253kB in size. There are probably other cases of this, but this one case certainly stood out. I've added 'MIN_T()' and 'MAX_T()' macros for this kind of "core simple constant with specific type" use. These macros skip the type checking, and as such need to be very sparingly used only for obvious cases that have active issues like this. Reported-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/all/36aa2cad-1db1-4abf-8dd2-fb20484aabc3@lucifer.local/ Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25net/mlx5e: Harden uplink netdev access against device unbindJianbo Liu1-0/+1
[ Upstream commit 6b4be64fd9fec16418f365c2d8e47a7566e9eba5 ] The function mlx5_uplink_netdev_get() gets the uplink netdevice pointer from mdev->mlx5e_res.uplink_netdev. However, the netdevice can be removed and its pointer cleared when unbound from the mlx5_core.eth driver. This results in a NULL pointer, causing a kernel panic. BUG: unable to handle page fault for address: 0000000000001300 at RIP: 0010:mlx5e_vport_rep_load+0x22a/0x270 [mlx5_core] Call Trace: <TASK> mlx5_esw_offloads_rep_load+0x68/0xe0 [mlx5_core] esw_offloads_enable+0x593/0x910 [mlx5_core] mlx5_eswitch_enable_locked+0x341/0x420 [mlx5_core] mlx5_devlink_eswitch_mode_set+0x17e/0x3a0 [mlx5_core] devlink_nl_eswitch_set_doit+0x60/0xd0 genl_family_rcv_msg_doit+0xe0/0x130 genl_rcv_msg+0x183/0x290 netlink_rcv_skb+0x4b/0xf0 genl_rcv+0x24/0x40 netlink_unicast+0x255/0x380 netlink_sendmsg+0x1f3/0x420 __sock_sendmsg+0x38/0x60 __sys_sendto+0x119/0x180 do_syscall_64+0x53/0x1d0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Ensure the pointer is valid before use by checking it for NULL. If it is valid, immediately call netdev_hold() to take a reference, and preventing the netdevice from being freed while it is in use. Fixes: 7a9fb35e8c3a ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1757939074-617281-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-19compiler-clang.h: define __SANITIZE_*__ macros only when undefinedNathan Chancellor1-5/+24
commit 3fac212fe489aa0dbe8d80a42a7809840ca7b0f9 upstream. Clang 22 recently added support for defining __SANITIZE__ macros similar to GCC [1], which causes warnings (or errors with CONFIG_WERROR=y or W=e) with the existing defines that the kernel creates to emulate this behavior with existing clang versions. In file included from <built-in>:3: In file included from include/linux/compiler_types.h:171: include/linux/compiler-clang.h:37:9: error: '__SANITIZE_THREAD__' macro redefined [-Werror,-Wmacro-redefined] 37 | #define __SANITIZE_THREAD__ | ^ <built-in>:352:9: note: previous definition is here 352 | #define __SANITIZE_THREAD__ 1 | ^ Refactor compiler-clang.h to only define the sanitizer macros when they are undefined and adjust the rest of the code to use these macros for checking if the sanitizers are enabled, clearing up the warnings and allowing the kernel to easily drop these defines when the minimum supported version of LLVM for building the kernel becomes 22.0.0 or newer. Link: https://lkml.kernel.org/r/20250902-clang-update-sanitize-defines-v1-1-cf3702ca3d92@kernel.org Link: https://github.com/llvm/llvm-project/commit/568c23bbd3303518c5056d7f03444dae4fdc8a9c [1] Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Justin Stitt <justinstitt@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Bill Wendling <morbo@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-19mm: introduce and use {pgd,p4d}_populate_kernel()Harry Yoo2-6/+36
commit f2d2f9598ebb0158a3fe17cda0106d7752e654a2 upstream. Introduce and use {pgd,p4d}_populate_kernel() in core MM code when populating PGD and P4D entries for the kernel address space. These helpers ensure proper synchronization of page tables when updating the kernel portion of top-level page tables. Until now, the kernel has relied on each architecture to handle synchronization of top-level page tables in an ad-hoc manner. For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes"). However, this approach has proven fragile for following reasons: 1) It is easy to forget to perform the necessary page table synchronization when introducing new changes. For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps") overlooked the need to synchronize page tables for the vmemmap area. 2) It is also easy to overlook that the vmemmap and direct mapping areas must not be accessed before explicit page table synchronization. For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")) caused crashes by accessing the vmemmap area before calling sync_global_pgds(). To address this, as suggested by Dave Hansen, introduce _kernel() variants of the page table population helpers, which invoke architecture-specific hooks to properly synchronize page tables. These are introduced in a new header file, include/linux/pgalloc.h, so they can be called from common code. They reuse existing infrastructure for vmalloc and ioremap. Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK, and the actual synchronization is performed by arch_sync_kernel_mappings(). This change currently targets only x86_64, so only PGD and P4D level helpers are introduced. Currently, these helpers are no-ops since no architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK. In theory, PUD and PMD level helpers can be added later if needed by other architectures. For now, 32-bit architectures (x86-32 and arm) only handle PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect them unless we introduce a PMD level helper. [harry.yoo@oracle.com: fix KASAN build error due to p*d_populate_kernel()] Link: https://lkml.kernel.org/r/20250822020727.202749-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250818020206.4517-3-harry.yoo@oracle.com Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: bibo mao <maobibo@loongson.cn> Cc: Borislav Betkov <bp@alien8.de> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Adjust context ] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-11x86/vmscape: Enable the mitigationPawan Gupta1-0/+1
Commit 556c1ad666ad90c50ec8fccb930dd5046cfbecfb upstream. Enable the previously added mitigation for VMscape. Add the cmdline vmscape={off|ibpb|force} and sysfs reporting. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09PCI/MSI: Add an option to write MSIX ENTRY_DATA before any readsJonathan Currier1-0/+2
[ Upstream commit cf761e3dacc6ad5f65a4886d00da1f9681e6805a ] Commit 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries") introduced a readl() from ENTRY_VECTOR_CTRL before the writel() to ENTRY_DATA. This is correct, however some hardware, like the Sun Neptune chips, the NIU module, will cause an error and/or fatal trap if any MSIX table entry is read before the corresponding ENTRY_DATA field is written to. Add an optional early writel() in msix_prepare_msi_desc(). Fixes: 7d5ec3d36123 ("PCI/MSI: Mask all unused MSI-X entries") Signed-off-by: Jonathan Currier <dullfire@yahoo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241117234843.19236-2-dullfire@yahoo.com [ Adjust context ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09mm: move page table sync declarations to linux/pgtable.hHarry Yoo2-16/+16
commit 7cc183f2e67d19b03ee5c13a6664b8c6cc37ff9d upstream. During our internal testing, we started observing intermittent boot failures when the machine uses 4-level paging and has a large amount of persistent memory: BUG: unable to handle page fault for address: ffffe70000000034 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP NOPTI RIP: 0010:__init_single_page+0x9/0x6d Call Trace: <TASK> __init_zone_device_page+0x17/0x5d memmap_init_zone_device+0x154/0x1bb pagemap_range+0x2e0/0x40f memremap_pages+0x10b/0x2f0 devm_memremap_pages+0x1e/0x60 dev_dax_probe+0xce/0x2ec [device_dax] dax_bus_probe+0x6d/0xc9 [... snip ...] </TASK> It turns out that the kernel panics while initializing vmemmap (struct page array) when the vmemmap region spans two PGD entries, because the new PGD entry is only installed in init_mm.pgd, but not in the page tables of other tasks. And looking at __populate_section_memmap(): if (vmemmap_can_optimize(altmap, pgmap)) // does not sync top level page tables r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap); else // sync top level page tables in x86 r = vmemmap_populate(start, end, nid, altmap); In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c synchronizes the top level page table (See commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so that all tasks in the system can see the new vmemmap area. However, when vmemmap_can_optimize() returns true, the optimized path skips synchronization of top-level page tables. This is because vmemmap_populate_compound_pages() is implemented in core MM code, which does not handle synchronization of the top-level page tables. Instead, the core MM has historically relied on each architecture to perform this synchronization manually. We're not the first party to encounter a crash caused by not-sync'd top level page tables: earlier this year, Gwan-gyeong Mun attempted to address the issue [1] [2] after hitting a kernel panic when x86 code accessed the vmemmap area before the corresponding top-level entries were synced. At that time, the issue was believed to be triggered only when struct page was enlarged for debugging purposes, and the patch did not get further updates. It turns out that current approach of relying on each arch to handle the page table sync manually is fragile because 1) it's easy to forget to sync the top level page table, and 2) it's also easy to overlook that the kernel should not access the vmemmap and direct mapping areas before the sync. # The solution: Make page table sync more code robust and harder to miss To address this, Dave Hansen suggested [3] [4] introducing {pgd,p4d}_populate_kernel() for updating kernel portion of the page tables and allow each architecture to explicitly perform synchronization when installing top-level entries. With this approach, we no longer need to worry about missing the sync step, reducing the risk of future regressions. The new interface reuses existing ARCH_PAGE_TABLE_SYNC_MASK, PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by vmalloc and ioremap to synchronize page tables. pgd_populate_kernel() looks like this: static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd, p4d_t *p4d) { pgd_populate(&init_mm, pgd, p4d); if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED) arch_sync_kernel_mappings(addr, addr); } It is worth noting that vmalloc() and apply_to_range() carefully synchronizes page tables by calling p*d_alloc_track() and arch_sync_kernel_mappings(), and thus they are not affected by this patch series. This series was hugely inspired by Dave Hansen's suggestion and hence added Suggested-by: Dave Hansen. Cc stable because lack of this series opens the door to intermittent boot failures. This patch (of 3): Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to linux/pgtable.h so that they can be used outside of vmalloc and ioremap. Link: https://lkml.kernel.org/r/20250818020206.4517-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250818020206.4517-2-harry.yoo@oracle.com Link: https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com [1] Link: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com [2] Link: https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com [3] Link: https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com [4] Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: bibo mao <maobibo@loongson.cn> Cc: Borislav Betkov <bp@alien8.de> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09bpf: Fix oob access in cgroup local storageDaniel Borkmann1-0/+1
[ Upstream commit abad3d0bad72a52137e0c350c59542d75ae4f513 ] Lonial reported that an out-of-bounds access in cgroup local storage can be crafted via tail calls. Given two programs each utilizing a cgroup local storage with a different value size, and one program doing a tail call into the other. The verifier will validate each of the indivial programs just fine. However, in the runtime context the bpf_cg_run_ctx holds an bpf_prog_array_item which contains the BPF program as well as any cgroup local storage flavor the program uses. Helpers such as bpf_get_local_storage() pick this up from the runtime context: ctx = container_of(current->bpf_ctx, struct bpf_cg_run_ctx, run_ctx); storage = ctx->prog_item->cgroup_storage[stype]; if (stype == BPF_CGROUP_STORAGE_SHARED) ptr = &READ_ONCE(storage->buf)->data[0]; else ptr = this_cpu_ptr(storage->percpu_buf); For the second program which was called from the originally attached one, this means bpf_get_local_storage() will pick up the former program's map, not its own. With mismatching sizes, this can result in an unintended out-of-bounds access. To fix this issue, we need to extend bpf_map_owner with an array of storage_cookie[] to match on i) the exact maps from the original program if the second program was using bpf_get_local_storage(), or ii) allow the tail call combination if the second program was not using any of the cgroup local storage maps. Fixes: 7d9c3427894f ("bpf: Make cgroup storages shared between programs on the same cgroup") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-09bpf: Move bpf map owner out of common structDaniel Borkmann1-12/+24
[ Upstream commit fd1c98f0ef5cbcec842209776505d9e70d8fcd53 ] Given this is only relevant for BPF tail call maps, it is adding up space and penalizing other map types. We also need to extend this with further objects to track / compare to. Therefore, lets move this out into a separate structure and dynamically allocate it only for BPF tail call maps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-09bpf: Move cgroup iterator helpers to bpf.hDaniel Borkmann2-13/+14
[ Upstream commit 9621e60f59eae87eb9ffe88d90f24f391a1ef0f0 ] Move them into bpf.h given we also need them in core code. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: abad3d0bad72 ("bpf: Fix oob access in cgroup local storage") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-09bpf: Add cookie object to bpf mapsDaniel Borkmann1-0/+1
[ Upstream commit 12df58ad294253ac1d8df0c9bb9cf726397a671d ] Add a cookie to BPF maps to uniquely identify BPF maps for the timespan when the node is up. This is different to comparing a pointer or BPF map id which could get rolled over and reused. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250730234733.530041-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04net/mlx5: Add device cap for supporting hot reset in sync reset flowMoshe Shemesh1-2/+9
[ Upstream commit 9947204cdad97d22d171039019a4aad4d6899cdd ] New devices with new FW can support sync reset for firmware activate using hot reset. Add capability for supporting it and add MFRL field to query from FW which type of PCI reset method to use while handling sync reset events. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20240911201757.1505453-10-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 902a8bc23a24 ("net/mlx5: Fix lockdep assertion on sync reset unload event") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control().Kuniyuki Iwashima1-0/+1
[ Upstream commit ec79003c5f9d2c7f9576fc69b8dbda80305cbe3a ] syzbot reported the splat below. [0] When atmtcp_v_open() or atmtcp_v_close() is called via connect() or close(), atmtcp_send_control() is called to send an in-kernel special message. The message has ATMTCP_HDR_MAGIC in atmtcp_control.hdr.length. Also, a pointer of struct atm_vcc is set to atmtcp_control.vcc. The notable thing is struct atmtcp_control is uAPI but has a space for an in-kernel pointer. struct atmtcp_control { struct atmtcp_hdr hdr; /* must be first */ ... atm_kptr_t vcc; /* both directions */ ... } __ATM_API_ALIGN; typedef struct { unsigned char _[8]; } __ATM_API_ALIGN atm_kptr_t; The special message is processed in atmtcp_recv_control() called from atmtcp_c_send(). atmtcp_c_send() is vcc->dev->ops->send() and called from 2 paths: 1. .ndo_start_xmit() (vcc->send() == atm_send_aal0()) 2. vcc_sendmsg() The problem is sendmsg() does not validate the message length and userspace can abuse atmtcp_recv_control() to overwrite any kptr by atmtcp_control. Let's add a new ->pre_send() hook to validate messages from sendmsg(). [0]: Oops: general protection fault, probably for non-canonical address 0xdffffc00200000ab: 0000 [#1] SMP KASAN PTI KASAN: probably user-memory-access in range [0x0000000100000558-0x000000010000055f] CPU: 0 UID: 0 PID: 5865 Comm: syz-executor331 Not tainted 6.17.0-rc1-syzkaller-00215-gbab3ce404553 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:atmtcp_recv_control drivers/atm/atmtcp.c:93 [inline] RIP: 0010:atmtcp_c_send+0x1da/0x950 drivers/atm/atmtcp.c:297 Code: 4d 8d 75 1a 4c 89 f0 48 c1 e8 03 42 0f b6 04 20 84 c0 0f 85 15 06 00 00 41 0f b7 1e 4d 8d b7 60 05 00 00 4c 89 f0 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 13 06 00 00 66 41 89 1e 4d 8d 75 1c 4c RSP: 0018:ffffc90003f5f810 EFLAGS: 00010203 RAX: 00000000200000ab RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88802a510000 RSI: 00000000ffffffff RDI: ffff888030a6068c RBP: ffff88802699fb40 R08: ffff888030a606eb R09: 1ffff1100614c0dd R10: dffffc0000000000 R11: ffffffff8718fc40 R12: dffffc0000000000 R13: ffff888030a60680 R14: 000000010000055f R15: 00000000ffffffff FS: 00007f8d7e9236c0(0000) GS:ffff888125c1c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000045ad50 CR3: 0000000075bde000 CR4: 00000000003526f0 Call Trace: <TASK> vcc_sendmsg+0xa10/0xc60 net/atm/common.c:645 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:729 ____sys_sendmsg+0x505/0x830 net/socket.c:2614 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2668 __sys_sendmsg net/socket.c:2700 [inline] __do_sys_sendmsg net/socket.c:2705 [inline] __se_sys_sendmsg net/socket.c:2703 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2703 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f8d7e96a4a9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f8d7e923198 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f8d7e9f4308 RCX: 00007f8d7e96a4a9 RDX: 0000000000000000 RSI: 0000200000000240 RDI: 0000000000000005 RBP: 00007f8d7e9f4300 R08: 65732f636f72702f R09: 65732f636f72702f R10: 65732f636f72702f R11: 0000000000000246 R12: 00007f8d7e9c10ac R13: 00007f8d7e9231a0 R14: 0000200000000200 R15: 0000200000000250 </TASK> Modules linked in: Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+1741b56d54536f4ec349@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68a6767c.050a0220.3d78fd.0011.GAE@google.com/ Tested-by: syzbot+1741b56d54536f4ec349@syzkaller.appspotmail.com Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250821021901.2814721-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04NFS: Fix a race when updating an existing writeTrond Myklebust1-0/+1
commit 76d2e3890fb169168c73f2e4f8375c7cc24a765e upstream. After nfs_lock_and_join_requests() tests for whether the request is still attached to the mapping, nothing prevents a call to nfs_inode_remove_request() from succeeding until we actually lock the page group. The reason is that whoever called nfs_inode_remove_request() doesn't necessarily have a lock on the page group head. So in order to avoid races, let's take the page group lock earlier in nfs_lock_and_join_requests(), and hold it across the removal of the request in nfs_inode_remove_request(). Reported-by: Jeff Layton <jlayton@kernel.org> Tested-by: Joe Quanaim <jdq@meta.com> Tested-by: Andrew Steffen <aksteffen@meta.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: bd37d6fce184 ("NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-04nfs: fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requestsChristoph Hellwig1-1/+0
commit 25edbcac6e32eab345e470d56ca9974a577b878b upstream. Fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requests to prepare for future changes to this code, and move the helpers to write.c as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>