Age | Commit message | Author | Files | Lines
2025-05-28f2fs: add a method for calculating the remaining blocks in the current ↵yohan.joung1-4/+19
segment in LFS mode. In LFS mode, the previous segment cannot use invalid blocks, so the remaining blocks from the next_blkoff of the current segment to the end of the section are calculated. Signed-off-by: yohan.joung <yohan.joung@sk.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
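To make the accounting concrete, here is a minimal sketch of the calculation the commit describes, written with illustrative names rather than the actual f2fs symbols (the segment/section geometry is assumed to be passed in directly):

static unsigned int lfs_blocks_left_in_section(unsigned int next_blkoff,
                                               unsigned int seg_idx_in_sec,
                                               unsigned int segs_per_sec,
                                               unsigned int blks_per_seg)
{
        /* Blocks before next_blkoff cannot be reused in LFS mode, so only
         * the tail of the current segment still counts as free ... */
        unsigned int left_in_cur_seg = blks_per_seg - next_blkoff;
        /* ... plus every block in the not-yet-used segments of the section. */
        unsigned int untouched_segs = segs_per_sec - seg_idx_in_sec - 1;

        return left_in_cur_seg + untouched_segs * blks_per_seg;
}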
2025-05-28KVM: s390: Simplify and move pv codeClaudio Imbrenda11-182/+133
All functions in kvm/gmap.c fit better in kvm/pv.c instead. Move and rename them appropriately, then delete the now empty kvm/gmap.c and kvm/gmap.h. Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20250528095502.226213-5-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20250528095502.226213-5-imbrenda@linux.ibm.com>
2025-05-28KVM: s390: Refactor and split some gmap helpersClaudio Imbrenda8-187/+274
Refactor some gmap functions; move the implementation into a separate file with only helper functions. The new helper functions work on vm addresses, leaving all gmap logic in the gmap functions, which mostly become just wrappers. The whole gmap handling is going to be moved inside KVM soon, but the helper functions need to touch core mm functions, and thus need to stay in the core of kernel. Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20250528095502.226213-4-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20250528095502.226213-4-imbrenda@linux.ibm.com>
2025-05-28KVM: s390: Remove unneeded srcu lockClaudio Imbrenda1-4/+2
All paths leading to handle_essa() already hold the kvm->srcu. Remove unneeded srcu locking from handle_essa(). Add lockdep assertion to make sure we will always be holding kvm->srcu when entering handle_essa(). Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Link: https://lore.kernel.org/r/20250528095502.226213-3-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20250528095502.226213-3-imbrenda@linux.ibm.com>
2025-05-28s390: Remove unneeded includesClaudio Imbrenda8-7/+2
Many files don't need to include asm/tlb.h or asm/gmap.h. On the other hand, asm/tlb.h does need to include asm/gmap.h. Remove all unneeded includes so that asm/tlb.h is not directly used by s390 arch code anymore. Remove asm/gmap.h from a few other files as well, so that now only KVM code, mm/gmap.c, and asm/tlb.h include it. Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Link: https://lore.kernel.org/r/20250528095502.226213-2-imbrenda@linux.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20250528095502.226213-2-imbrenda@linux.ibm.com>
2025-05-28s390/uv: Improve splitting of large folios that cannot be split while dirtyDavid Hildenbrand1-6/+60
Currently, starting a PV VM on an iomap-based filesystem with large folio support, such as XFS, will not work. We'll be stuck in unpack_one()->gmap_make_secure(), because we can't seem to make progress splitting the large folio. The problem is that we require a writable PTE but a writable PTE under such filesystems will imply a dirty folio. So whenever we have a writable PTE, we'll have a dirty folio, and dirty iomap folios cannot currently get split, because split_folio()->split_huge_page_to_list_to_order()->filemap_release_folio() will fail in iomap_release_folio(). So we will not make any progress splitting such large folios. Until dirty folios can be split more reliably, let's manually trigger writeback of the problematic folio using filemap_write_and_wait_range(), and retry the split immediately afterwards exactly once, before looking up the folio again. Should this logic be part of split_folio()? Likely not; most split users don't have to split so eagerly to make any progress. For now, this seems to affect xfs, zonefs and erofs, and this patch makes it work again (tested on xfs only). While this could be considered a fix for commit 6795801366da ("xfs: Support large folios"), commit df2f9708ff1f ("zonefs: enable support for large folios") and commit ce529cc25b18 ("erofs: enable large folios for iomap mode"), before commit eef88fe45ac9 ("s390/uv: Split large folios in gmap_make_secure()"), we did not try splitting large folios at all. So it's all rather part of making SE compatible with file systems that support large folios. But to have some "Fixes:" tag, let's just use eef88fe45ac9. Not CCing stable, because there are a lot of dependencies, and it simply not working is not critical in stable kernels. Reported-by: Sebastian Mitterle <smitterl@redhat.com> Closes: https://issues.redhat.com/browse/RHEL-58218 Fixes: eef88fe45ac9 ("s390/uv: Split large folios in gmap_make_secure()") Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20250516123946.1648026-4-david@redhat.com Message-ID: <20250516123946.1648026-4-david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-05-28Merge tag 'audit-pr-20250527' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: - Always record AUDIT_ANOM events when auditing is enabled. Prior to this patch we only recorded AUDIT_ANOM events if auditing was enabled and the admin/distro had explicitly configured audit beyond the defaults. Considering that AUDIT_ANOM events are anomalous events considered to be "security relevant", it seems wise to record these events as long as auditing is enabled, even if the system is running with a default audit configuration. - Mark the audit_log_vformat() function with the __printf() attribute to quiet GCC. * tag 'audit-pr-20250527' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: record AUDIT_ANOM_* events regardless of presence of rules audit: mark audit_log_vformat() with __printf() attribute
2025-05-28Merge tag 'selinux-pr-20250527' of ↵Linus Torvalds11-85/+232
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: - Reduce the SELinux impact on path walks. Add a small directory access cache to the per-task SELinux state. This cache allows SELinux to cache the most recently used directory access decisions in order to avoid repeatedly querying the AVC on path walks where the majority of the directories have similar security contexts/labels. My performance measurements are crude, but prior to this patch the time spent in SELinux code on a 'make allmodconfig' run was 103% that of __d_lookup_rcu(), and with this patch the time spent in SELinux code dropped to 63% of __d_lookup_rcu(), a ~40% improvement. Additional improvements can be expected in the future, but those will require additional SELinux policy/toolchain support. - Add support for wildcards in genfscon policy statements. This patch allows for wildcards in the genfscon path matching logic as opposed to the prefix matching that was used prior to this change. Adding wildcard support allows for more expressive and efficient path matching in the policy which is especially helpful for sysfs, and has resulted in a ~15% boot time reduction in Android. SELinux policies can opt into wildcard matching by using the "genfs_seclabel_wildcard" policy capability. - Unify the error/OOM handling of the SELinux network caches. A failure to allocate memory for the SELinux network caches isn't fatal as the object label can still be safely returned to the caller, it simply means that we cannot add the new data to the cache, at least temporarily. This patch corrects this behavior for the InfiniBand cache and does some minor cleanup. - Minor improvements around constification, 'likely' annotations, and removal of bogus comments. * tag 'selinux-pr-20250527' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: fix the kdoc header for task_avdcache_update selinux: remove a duplicated include selinux: reduce path walk overhead selinux: support wildcard match in genfscon selinux: drop copy-paste comment selinux: unify OOM handling in network hashtables selinux: add likely hints for fast paths selinux: contify network namespace pointer selinux: constify network address pointer
2025-05-28Merge tag 'lsm-pr-20250527' of ↵Linus Torvalds2-24/+24
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm update from Paul Moore: "One minor LSM framework patch to move the selinux_netlink_send() hook under the CONFIG_SECURITY_NETWORK Kconfig knob" * tag 'lsm-pr-20250527' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lsm: Move security_netlink_send to under CONFIG_SECURITY_NETWORK
2025-05-28Merge tag 'integrity-v6.16' of ↵Linus Torvalds8-34/+283
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity updates from Mimi Zohar: "Carrying the IMA measurement list across kexec is not a new feature, but is updated to address a couple of issues: - Carrying the IMA measurement list across kexec required knowing apriori all the file measurements between the "kexec load" and "kexec execute" in order to measure them before the "kexec load". Any delay between the "kexec load" and "kexec exec" exacerbated the problem. - Any file measurements post "kexec load" were not carried across kexec, resulting in the measurement list being out of sync with the TPM PCR. With these changes, the buffer for the IMA measurement list is still allocated at "kexec load", but copying the IMA measurement list is deferred to after quiescing the TPM. Two new kexec critical data records are defined" * tag 'integrity-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: ima: do not copy measurement list to kdump kernel ima: measure kexec load and exec events as critical data ima: make the kexec extra memory configurable ima: verify if the segment size has changed ima: kexec: move IMA log copy from kexec load to execute ima: kexec: define functions to copy IMA log at soft boot ima: kexec: skip IMA segment validation after kexec soft reboot kexec: define functions to map and unmap segments ima: define and call ima_alloc_kexec_file_buf() ima: rename variable the seq_file "file" to "ima_kexec_file"
2025-05-28Merge tag 'Smack-for-6.16' of https://github.com/cschaufler/smack-nextLinus Torvalds1-7/+5
Pull smack update from Casey Schaufler: "One trivial kernel doc fix" * tag 'Smack-for-6.16' of https://github.com/cschaufler/smack-next: security/smack/smackfs: small kernel-doc fixes
2025-05-28Merge tag 'hardening-v6.16-rc1' of ↵Linus Torvalds30-169/+459
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Update overflow helpers to ease refactoring of on-stack flex array instances (Gustavo A. R. Silva, Kees Cook) - lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo) - Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr) - Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh) - Add missed designated initializers now exposed by fixed randstruct (Nathan Chancellor, Kees Cook) - Document compilers versions for __builtin_dynamic_object_size - Remove ARM_SSP_PER_TASK GCC plugin - Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST builds - Kbuild: induce full rebuilds when dependencies change with GCC plugins, the Clang sanitizer .scl file, or the randstruct seed. - Kbuild: Switch from -Wvla to -Wvla-larger-than=1 - Correct several __nonstring uses for -Wunterminated-string-initialization * tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits) Revert "hardening: Disable GCC randstruct for COMPILE_TEST" lib/tests: randstruct: Add deep function pointer layout test lib/tests: Add randstruct KUnit test randstruct: gcc-plugin: Remove bogus void member net: qede: Initialize qede_ll_ops with designated initializer scsi: qedf: Use designated initializer for struct qed_fcoe_cb_ops md/bcache: Mark __nonstring look-up table integer-wrap: Force full rebuild when .scl file changes randstruct: Force full rebuild when seed changes gcc-plugins: Force full rebuild when plugins change kbuild: Switch from -Wvla to -Wvla-larger-than=1 hardening: simplify CONFIG_CC_HAS_COUNTED_BY overflow: Fix direct struct member initialization in _DEFINE_FLEX() kunit/overflow: Add tests for STACK_FLEX_ARRAY_SIZE() helper overflow: Add STACK_FLEX_ARRAY_SIZE() helper input/joystick: magellan: Mark __nonstring look-up table const watchdog: exar: Shorten identity name to fit correctly mod_devicetable: Enlarge the maximum platform_device_id name length overflow: Clarify expectations for getting DEFINE_FLEX variable sizes compiler_types: Identify compiler versions for __builtin_dynamic_object_size ...
2025-05-28Merge tag 'seccomp-v6.16-rc1' of ↵Linus Torvalds2-9/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: - selftest fixes for arm32 (Neill Kapron, Terry Tritton) - documentation typo fix (Sumanth Gavini) * tag 'seccomp-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: selftests: seccomp: Fix "performace" to "performance" selftests/seccomp: fix negative_ENOSYS tracer tests on arm32 selftests/seccomp: fix syscall_restart test for arm compat
2025-05-28dt-bindings: timer: Add fsl,vf610-pit.yamlFrank Li1-0/+54
Add binding doc fsl,vf610-pit.yaml to fix below CHECK_DTB warnings: arch/arm/boot/dts/nxp/vf/vf610m4-colibri.dtb: /soc/bus@40000000/pit@40037000: failed to match any schema with compatible: ['fsl,vf610-pit'] Signed-off-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20250522205710.502779-1-Frank.Li@nxp.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2025-05-28dt-bindings: gpu: mali-bifrost: Add compatible for RZ/G3E SoCTommaso Merciai1-0/+2
Add a compatible string for the Renesas RZ/G3E SoC variants that include a Mali-G52 GPU. These variants share the same restrictions on interrupts, clocks, and power domains as the RZ/G2L SoC, so extend the existing schema validation accordingly. Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Tommaso Merciai <tommaso.merciai.xr@bp.renesas.com> Link: https://lore.kernel.org/r/20250528073040.904033-1-tommaso.merciai.xr@bp.renesas.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2025-05-28ASoC: dt-bindings: qcom,sm8250: Add Fairphone 5 sound cardLuca Weiss1-0/+1
Document the bindings for the sound card on Fairphone 5 which uses the older non-audioreach audio architecture. Acked-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Link: https://lore.kernel.org/r/20250507-fp5-dp-sound-v4-1-4098e918a29e@fairphone.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2025-05-28s390/uv: Always return 0 from s390_wiggle_split_folio() if successfulDavid Hildenbrand1-10/+12
Let's consistently return 0 if the operation was successful, and just detect ourselves whether splitting is required -- folio_test_large() is a cheap operation. Update the documentation. Should we simply always return -EAGAIN instead of 0, so we don't have to handle it in the caller? Not sure, staring at the documentation, this way looks a bit cleaner. Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20250516123946.1648026-3-david@redhat.com Message-ID: <20250516123946.1648026-3-david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-05-28s390/uv: Don't return 0 from make_hva_secure() if the operation was not ↵David Hildenbrand1-1/+4
successful If s390_wiggle_split_folio() returns 0 because splitting a large folio succeeded, we will return 0 from make_hva_secure() even though a retry is required. Return -EAGAIN in that case. Otherwise, we'll return 0 from gmap_make_secure(), and consequently from unpack_one(). In kvm_s390_pv_unpack(), we assume that unpacking succeeded and skip unpacking this page. Later on, we run into issues and fail booting the VM. So far, this issue was only observed with follow-up patches where we split large pagecache XFS folios. Maybe it can also be triggered with shmem? We'll cleanup s390_wiggle_split_folio() a bit next, to also return 0 if no split was required. Fixes: d8dfda5af0be ("KVM: s390: pv: fix race when making a page secure") Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20250516123946.1648026-2-david@redhat.com Message-ID: <20250516123946.1648026-2-david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
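A hedged sketch of the behaviour described above; the helper name is taken from the text, while the surrounding call site and variable names are assumptions:

/* Sketch: a successful split still means the caller has to look the folio
 * up again, so a 0 return from the split helper must not be reported as
 * overall success. */
rc = s390_wiggle_split_folio(mm, folio, true);
if (!rc)
        rc = -EAGAIN;   /* split worked, but the operation must be retried */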
2025-05-28Merge branch 'kvm-lockdep-common' into HEADPaolo Bonzini13-170/+131
Introduce new mutex locking functions mutex_trylock_nest_lock() and mutex_lock_killable_nest_lock() and use them to clean up locking of all vCPUs for a VM. For x86, this removes some complex code that was used instead of lockdep's "nest_lock" feature. For ARM and RISC-V, this removes a lockdep warning when the VM is configured to have more than MAX_LOCK_DEPTH vCPUs, and removes a fair amount of duplicate code by sharing the logic across all architectures. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-28rust: add helper for mutex_trylockPaolo Bonzini1-0/+5
After commit c5b6ababd21a ("locking/mutex: implement mutex_trylock_nested", currently in the KVM tree) mutex_trylock() will be a macro when lockdep is enabled. Rust therefore needs the corresponding helper. Just add it and the rust/bindings/bindings_helpers_generated.rs Makefile rules will do their thing. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250528083431.1875345-1-pbonzini@redhat.com> Acked-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni24-61/+210
Merge in late fixes to prepare for the 6.16 net-next PR. No conflicts nor adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28selftests/bpf: Fix bpf selftest build warningSaket Kumar Bhaskar1-3/+3
On linux-next, build for bpf selftest displays a warning: Warning: Kernel ABI header at 'tools/include/uapi/linux/if_xdp.h' differs from latest version at 'include/uapi/linux/if_xdp.h'. Commit 8066e388be48 ("net: add UAPI to the header guard in various network headers") changed the header guard from _LINUX_IF_XDP_H to _UAPI_LINUX_IF_XDP_H in include/uapi/linux/if_xdp.h. To resolve the warning, update tools/include/uapi/linux/if_xdp.h to align with the changes in include/uapi/linux/if_xdp.h Fixes: 8066e388be48 ("net: add UAPI to the header guard in various network headers") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/all/c2bc466d-dff2-4d0d-a797-9af7f676c065@linux.ibm.com/ Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20250527054138.1086006-1-skb99@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28selftests: netfilter: Fix skip of wildcard interface testPhil Sutter1-2/+5
The script is supposed to skip wildcard interface testing if unsupported by the host's nft tool. The failing check caused script abort due to 'set -e' though. Fix this by running the potentially failing nft command inside the if-conditional pipe. Fixes: 73db1b5dab6f ("selftests: netfilter: Torture nftables netdev hooks") Signed-off-by: Phil Sutter <phil@nwl.cc> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Link: https://patch.msgid.link/20250527094117.18589-1-phil@nwl.cc Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28iomap: don't lose folio dropbehind state for overwritesJens Axboe3-3/+26
DONTCACHE I/O must have the completion punted to a workqueue, just like what is done for unwritten extents, as the completion needs task context to perform the invalidation of the folio(s). However, if writeback is started off filemap_fdatawrite_range() off generic_sync() and it's an overwrite, then the DONTCACHE marking gets lost as iomap_add_to_ioend() doesn't look at the folio being added and no further state is passed down to help it know that this is a dropbehind/DONTCACHE write. Check if the folio being added is marked as dropbehind, and set IOMAP_IOEND_DONTCACHE if that is the case. Then XFS can factor this into the decision making of completion context in xfs_submit_ioend(). Additionally include this ioend flag in the NOMERGE flags, to avoid mixing it with unrelated IO. Since this is the 3rd flag that will cause XFS to punt the completion to a workqueue, add a helper so that each one of them can get appropriately commented. This fixes extra page cache being instantiated when the write performed is an overwrite, rather than newly instantiated blocks. Fixes: b2cd5ae693a3 ("iomap: make buffered writes work with RWF_DONTCACHE") Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/5153f6e8-274d-4546-bf55-30a5018e0d03@kernel.dk Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
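As a sketch, the core of the change could look like the following in the writeback path; IOMAP_IOEND_DONTCACHE is the flag named above, while the exact variable names and the placement inside iomap_add_to_ioend() are assumptions:

/* Sketch: propagate the folio's dropbehind state into the ioend flags so
 * XFS can punt I/O completion for DONTCACHE writes to a workqueue. */
if (folio_test_dropbehind(folio))
        ioend_flags |= IOMAP_IOEND_DONTCACHE;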
2025-05-28virtio: reject shm region if length is zeroSami Uddin1-0/+2
Prevent usage of shared memory regions where the length is zero, as such configurations are not valid and may lead to unexpected behavior. Signed-off-by: Sami Uddin <sami.md.ko@gmail.com> Message-Id: <20250511222153.2332-1-sami.md.ko@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
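A minimal illustration of the added validation, with the surrounding helper left hypothetical:

/* Sketch: a shared memory region whose length is zero is not valid and is
 * rejected before any caller can map or use it. */
if (!len)
        return false;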
2025-05-28net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 framesHoratiu Vultur1-1/+3
We have noticed that when PHY timestamping is enabled, L2 frames seem to be modified by having 2 bytes overwritten with a value of 0. The place where these 2 bytes land seems to be random (or I couldn't find a pattern). In most cases userspace can ignore these frames, but if, for example, those 2 bytes fall in the correction field there is nothing to be done. This seems to happen when configuring the HW for IPv4 even though that flow is not enabled. These 2 bytes correspond to the UDPv4 checksum, and once we no longer enable clearing the checksum when using L2 frames, the frames don't seem to be changed anymore. Fixes: 7d272e63e0979d ("net: phy: mscc: timestamping and PHC support") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://patch.msgid.link/20250523082716.2935895-1-horatiu.vultur@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28net: openvswitch: Fix the dead loop of MPLS parseFaicker Mo1-1/+1
An unexpected MPLS packet may not end with a bottom-of-stack label. When there are many label stack entries, the label count value wraps around, a dead loop occurs, and finally a soft lockup/CPU stall is hit. stack backtrace:
UBSAN: array-index-out-of-bounds in /build/linux-0Pa0xK/linux-5.15.0/net/openvswitch/flow.c:662:26
index -1 is out of range for type '__be32 [3]'
CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Tainted: G OE 5.15.0-121-generic #131-Ubuntu
Hardware name: Dell Inc. PowerEdge C6420/0JP9TF, BIOS 2.12.2 07/14/2021
Call Trace:
 <IRQ>
 show_stack+0x52/0x5c
 dump_stack_lvl+0x4a/0x63
 dump_stack+0x10/0x16
 ubsan_epilogue+0x9/0x36
 __ubsan_handle_out_of_bounds.cold+0x44/0x49
 key_extract_l3l4+0x82a/0x840 [openvswitch]
 ? kfree_skbmem+0x52/0xa0
 key_extract+0x9c/0x2b0 [openvswitch]
 ovs_flow_key_extract+0x124/0x350 [openvswitch]
 ovs_vport_receive+0x61/0xd0 [openvswitch]
 ? kernel_init_free_pages.part.0+0x4a/0x70
 ? get_page_from_freelist+0x353/0x540
 netdev_port_receive+0xc4/0x180 [openvswitch]
 ? netdev_port_receive+0x180/0x180 [openvswitch]
 netdev_frame_hook+0x1f/0x40 [openvswitch]
 __netif_receive_skb_core.constprop.0+0x23a/0xf00
 __netif_receive_skb_list_core+0xfa/0x240
 netif_receive_skb_list_internal+0x18e/0x2a0
 napi_complete_done+0x7a/0x1c0
 bnxt_poll+0x155/0x1c0 [bnxt_en]
 __napi_poll+0x30/0x180
 net_rx_action+0x126/0x280
 ? bnxt_msix+0x67/0x80 [bnxt_en]
 handle_softirqs+0xda/0x2d0
 irq_exit_rcu+0x96/0xc0
 common_interrupt+0x8e/0xa0
 </IRQ>
Fixes: fbdcdd78da7c ("Change in Openvswitch to support MPLS label depth of 3 in ingress direction") Signed-off-by: Faicker Mo <faicker.mo@zenlayer.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/259D3404-575D-4A6D-B263-1DF59A67CF89@zenlayer.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
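A hedged sketch of the kind of bounded parse loop the fix enforces; the names below mirror generic MPLS definitions, not the exact openvswitch code:

/* Sketch: never read more than MPLS_LABEL_DEPTH label stack entries, even
 * if no bottom-of-stack (BOS) bit is ever seen, so a malformed packet
 * cannot wrap the label counter and loop forever. */
for (label_count = 0; label_count < MPLS_LABEL_DEPTH; label_count++) {
        __be32 lse;

        if (skb_copy_bits(skb, offset, &lse, sizeof(lse)) < 0)
                break;
        key->mpls.lse[label_count] = lse;
        offset += sizeof(lse);
        if (lse & htonl(MPLS_LS_S_MASK))        /* bottom of stack reached */
                break;
}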
2025-05-28calipso: Don't call calipso functions for AF_INET sk.Kuniyuki Iwashima1-0/+3
syzkaller reported a null-ptr-deref in txopt_get(). [0] The offset 0x70 was of struct ipv6_txoptions in struct ipv6_pinfo, so struct ipv6_pinfo was NULL there. However, this never happens for IPv6 sockets as inet_sk(sk)->pinet6 is always set in inet6_create(), meaning the socket was not IPv6 one. The root cause is missing validation in netlbl_conn_setattr(). netlbl_conn_setattr() switches branches based on struct sockaddr.sa_family, which is passed from userspace. However, netlbl_conn_setattr() does not check if the address family matches the socket. The syzkaller must have called connect() for an IPv6 address on an IPv4 socket. We have a proper validation in tcp_v[46]_connect(), but security_socket_connect() is called in the earlier stage. Let's copy the validation to netlbl_conn_setattr(). [0]: Oops: general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077] CPU: 2 UID: 0 PID: 12928 Comm: syz.9.1677 Not tainted 6.12.0 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:txopt_get include/net/ipv6.h:390 [inline] RIP: 0010: Code: 02 00 00 49 8b ac 24 f8 02 00 00 e8 84 69 2a fd e8 ff 00 16 fd 48 8d 7d 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 53 02 00 00 48 8b 6d 70 48 85 ed 0f 84 ab 01 00 RSP: 0018:ffff88811b8afc48 EFLAGS: 00010212 RAX: dffffc0000000000 RBX: 1ffff11023715f8a RCX: ffffffff841ab00c RDX: 000000000000000e RSI: ffffc90007d9e000 RDI: 0000000000000070 RBP: 0000000000000000 R08: ffffed1023715f9d R09: ffffed1023715f9e R10: ffffed1023715f9d R11: 0000000000000003 R12: ffff888123075f00 R13: ffff88810245bd80 R14: ffff888113646780 R15: ffff888100578a80 FS: 00007f9019bd7640(0000) GS:ffff8882d2d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f901b927bac CR3: 0000000104788003 CR4: 0000000000770ef0 PKRU: 80000000 Call Trace: <TASK> calipso_sock_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:557 netlbl_conn_setattr+0x10c/0x280 net/netlabel/netlabel_kapi.c:1177 selinux_netlbl_socket_connect_helper+0xd3/0x1b0 security/selinux/netlabel.c:569 selinux_netlbl_socket_connect_locked security/selinux/netlabel.c:597 [inline] selinux_netlbl_socket_connect+0xb6/0x100 security/selinux/netlabel.c:615 selinux_socket_connect+0x5f/0x80 security/selinux/hooks.c:4931 security_socket_connect+0x50/0xa0 security/security.c:4598 __sys_connect_file+0xa4/0x190 net/socket.c:2067 __sys_connect+0x12c/0x170 net/socket.c:2088 __do_sys_connect net/socket.c:2098 [inline] __se_sys_connect net/socket.c:2095 [inline] __x64_sys_connect+0x73/0xb0 net/socket.c:2095 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xaa/0x1b0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f901b61a12d Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f9019bd6fa8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 00007f901b925fa0 RCX: 00007f901b61a12d RDX: 000000000000001c RSI: 0000200000000140 RDI: 0000000000000003 RBP: 00007f901b701505 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f901b5b62a0 R15: 00007f9019bb7000 </TASK> Modules linked in: Fixes: ceba1832b1b2 ("calipso: Set the 
calipso socket label to match the secattr.") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: John Cheung <john.cs.hey@gmail.com> Closes: https://lore.kernel.org/netdev/CAP=Rh=M1LzunrcQB1fSGauMrJrhL6GGps5cPAKzHJXj6GQV+-g@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250522221858.91240-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
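A sketch of the validation copied into netlbl_conn_setattr(), as described above; the error code and exact control flow are assumptions:

/* Sketch: make sure the userspace-supplied address family matches the
 * socket before dispatching to the CALIPSO (IPv6-only) handlers. */
switch (addr->sa_family) {
case AF_INET6:
        if (sk->sk_family != AF_INET6) {
                ret_val = -EAFNOSUPPORT;
                goto conn_setattr_return;
        }
        /* ... existing CALIPSO handling ... */
        break;
}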
2025-05-28Merge branch ↵Paolo Abeni2-1/+43
'net_sched-hfsc-address-reentrant-enqueue-adding-class-to-eltree-twice' Pedro Tammela says: ==================== net_sched: hfsc: Address reentrant enqueue adding class to eltree twice Savino says: "We are writing to report that this recent patch (141d34391abbb315d68556b7c67ad97885407547) can be bypassed, and a UAF can still occur when HFSC is utilized with NETEM. The patch only checks the cl->cl_nactive field to determine whether it is the first insertion or not, but this field is only incremented by init_vf. By using HFSC_RSC (which uses init_ed), it is possible to bypass the check and insert the class twice in the eltree. Under normal conditions, this would lead to an infinite loop in hfsc_dequeue for the reasons we already explained in this report. However, if TBF is added as root qdisc and it is configured with a very low rate, it can be utilized to prevent packets from being dequeued. This behavior can be exploited to perform subsequent insertions in the HFSC eltree and cause a UAF." To fix both the UAF and the infinite loop, with netem as an hfsc child, check explicitly in hfsc_enqueue whether the class is already in the eltree whenever the HFSC_RSC flag is set. Also add a TDC test to reproduce the UAF scenario. ==================== Link: https://patch.msgid.link/20250522181448.1439717-1-pctammela@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28selftests/tc-testing: Add a test for HFSC eltree double add with reentrant ↵Pedro Tammela1-0/+35
enqueue behaviour on netem Reproduce the UAF scenario where netem is a child of HFSC and HFSC is configured to use the eltree. In such case, this TDC test would cause the HFSC class to be added to the eltree twice resulting in a UAF. Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Link: https://patch.msgid.link/20250522181448.1439717-3-pctammela@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28net_sched: hfsc: Address reentrant enqueue adding class to eltree twicePedro Tammela1-1/+8
Savino says: "We are writing to report that this recent patch (141d34391abbb315d68556b7c67ad97885407547) [1] can be bypassed, and a UAF can still occur when HFSC is utilized with NETEM. The patch only checks the cl->cl_nactive field to determine whether it is the first insertion or not [2], but this field is only incremented by init_vf [3]. By using HFSC_RSC (which uses init_ed) [4], it is possible to bypass the check and insert the class twice in the eltree. Under normal conditions, this would lead to an infinite loop in hfsc_dequeue for the reasons we already explained in this report [5]. However, if TBF is added as root qdisc and it is configured with a very low rate, it can be utilized to prevent packets from being dequeued. This behavior can be exploited to perform subsequent insertions in the HFSC eltree and cause a UAF." To fix both the UAF and the infinite loop, with netem as an hfsc child, check explicitly in hfsc_enqueue whether the class is already in the eltree whenever the HFSC_RSC flag is set. [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=141d34391abbb315d68556b7c67ad97885407547 [2] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L1572 [3] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L677 [4] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L1574 [5] https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/T/#u Fixes: 37d9cf1a3ce3 ("sched: Fix detection of empty queues in child qdiscs") Reported-by: Savino Dicanosa <savy@syst3mfailure.io> Reported-by: William Liu <will@willsroot.io> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Link: https://patch.msgid.link/20250522181448.1439717-2-pctammela@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
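A hedged sketch of the guard in hfsc_enqueue(); eltree_queued() is a hypothetical helper standing in for however the real code detects that the class already sits on the eltree:

/* Sketch: with real-time service curves (HFSC_RSC) configured, only insert
 * the class into the eltree if it is not already queued there, so a
 * reentrant enqueue through netem cannot add it twice. */
if (first && (cl->cl_flags & HFSC_RSC) && !eltree_queued(cl))
        init_ed(cl, len);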
2025-05-28octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callbackHariprasad Kelam1-3/+1
This patch addresses the below issues: 1. Active traffic on the leaf node must be stopped before its send queue is reassigned to the parent. This patch resolves the issue by marking the node as 'Inner'. 2. During a system reboot, the interface receives TC_HTB_LEAF_DEL and TC_HTB_LEAF_DEL_LAST callbacks to delete its HTB queues. In the case of TC_HTB_LEAF_DEL_LAST, although the same send queue is reassigned to the parent, the current logic still attempts to update the real number of queues, leading to the below warning: New queues can't be registered after device unregistration. WARNING: CPU: 0 PID: 6475 at net/core/net-sysfs.c:1714 netdev_queue_update_kobjects+0x1e4/0x200 Fixes: 5e6808b4c68d ("octeontx2-pf: Add support for HTB offload") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250522115842.1499666-1-hkelam@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28octeontx2-pf: QOS: Perform cache sync on send queue teardownHariprasad Kelam1-0/+22
QOS is designed to create a new send queue whenever a class is created, ensuring proper shaping and scheduling. However, when multiple send queues are created and deleted in a loop, SMMU errors are observed. This patch addresses the issue by performing a data cache sync during the teardown of QOS send queues. Fixes: ab6dddd2a669 ("octeontx2-pf: qos send queues management") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250522094742.1498295-1-hkelam@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28net: mana: Add support for Multi Vports on Bare metalHaiyang Zhang2-9/+19
To support Multi Vports on Bare metal, increase the device config response version. And, skip the register HW vport, and register filter steps, when the Bare metal hostmode is set. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1747671636-5810-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28Merge tag 'sched_ext-for-6.16' of ↵Linus Torvalds17-995/+1668
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - More in-kernel idle CPU selection improvements. Expand topology awareness coverage add scx_bpf_select_cpu_and() to allow more flexibility. The idle CPU selection kfuncs can now be called from unlocked contexts too. - A bunch of reorganization changes to lay the foundation for multiple hierarchical scheduler support. This isn't ready yet and the included changes don't make meaningful behavior differences. One notable change is replacing some static_key tests with dynamic tests as the test results may differ depending on the scheduler instance. This isn't expected to cause meaningful performance difference. - Other minor and doc updates. - There were multiple patches in for-6.15-fixes which conflicted with changes in for-6.16. for-6.15-fixes were pulled three times into for-6.16 to resolve the conflicts. * tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (49 commits) sched_ext: Call ops.update_idle() after updating builtin idle bits sched_ext, docs: convert mentions of "CFS" to "fair-class scheduler" selftests/sched_ext: Update test enq_select_cpu_fails sched_ext: idle: Consolidate default idle CPU selection kfuncs selftests/sched_ext: Add test for scx_bpf_select_cpu_and() via test_run sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked context sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and() sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c sched_ext, docs: add label sched_ext: Explain the temporary situation around scx_root dereferences sched_ext: Add @sch to SCX_CALL_OP*() sched_ext: Cleanup [__]scx_exit/error*() sched_ext: Add @sch to SCX_CALL_OP*() sched_ext: Clean up scx_root usages Documentation: scheduler: Changed lowercase acronyms to uppercase sched_ext: Avoid NULL scx_root deref in __scx_exit() sched_ext: Add RCU protection to scx_root in DSQ iterator sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn() sched_ext: Move disable machinery into scx_sched sched_ext: Move event_stats_cpu into scx_sched ...
2025-05-28Merge tag 'cgroup-for-6.16' of ↵Linus Torvalds14-342/+665
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - cgroup rstat shared the tracking tree across all controllers with the rationale being that a cgroup which is using one resource is likely to be using other resources at the same time (ie. if something is allocating memory, it's probably consuming CPU cycles). However, this turned out to not scale very well especially with memcg using rstat for internal operations which made memcg stat read and flush patterns substantially different from other controllers. JP Kobryn split the rstat tree per controller. - cgroup BPF support was hooking into cgroup init/exit paths directly. Convert them to use a notifier chain instead so that other usages can be added easily. The two of the patches which implement this are mislabeled as belonging to sched_ext instead of cgroup. Sorry. - Relatively minor cpuset updates - Documentation updates * tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits) sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier sched_ext: Introduce cgroup_lifetime_notifier cgroup: Minor reorganization of cgroup_create() cgroup, docs: cpu controller's interaction with various scheduling policies cgroup, docs: convert space indentation to tab indentation cgroup: avoid per-cpu allocation of size zero rstat cpu locks cgroup, docs: be specific about bandwidth control of rt processes cgroup: document the rstat per-cpu initialization cgroup: helper for checking rstat participation of css cgroup: use subsystem-specific rstat locks to avoid contention cgroup: use separate rstat trees for each subsystem cgroup: compare css to cgroup::self in helper for distingushing css cgroup: warn on rstat usage by early init subsystems cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask() cgroup/rstat: Improve cgroup_rstat_push_children() documentation cgroup: fix goto ordering in cgroup_init() cgroup: fix pointer check in css_rstat_init() cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs cgroup/cpuset: Fix obsolete comment in cpuset_css_offline() cgroup/cpuset: Always use cpu_active_mask ...
2025-05-28Merge tag 'wq-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2-2/+15
Pull workqueue updates from Tejun Heo: "Fix statistic update race condition and a couple documentation updates" * tag 'wq-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix typo in comment workqueue: Fix race condition in wq->stats incrementation workqueue: Better document teardown for delayed_work
2025-05-28Merge tag 'sysctl-6.16-rc1' of ↵Linus Torvalds11-208/+265
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Move kern_table members out of kernel/sysctl.c Moved a subset (tracing, panic, signal, stack_tracer and sparc) out of the kern_table array. The goal is for kern_table to only have sysctl elements. All this increases modularity by placing the ctl_tables closer to where they are used while reducing the chances of merge conflicts in kernel/sysctl.c. - Fixed sysctl unit test panic by relocating it to selftests * tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: Close test ctl_headers with a for loop sysctl: call sysctl tests with a for loop sysctl: Add 0012 to test the u8 range check sysctl: move u8 register test to lib/test_sysctl.c sparc: mv sparc sysctls into their own file under arch/sparc/kernel stack_tracer: move sysctl registration to kernel/trace/trace_stack.c tracing: Move trace sysctls into trace.c signal: Move signal ctl tables into signal.c panic: Move panic ctl tables into panic.c
2025-05-28Merge tag 'm68k-for-v6.16-tag1' of ↵Linus Torvalds14-50/+2
git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k Pull m68k updates from Geert Uytterhoeven: - One more strscpy() conversion - Fix detection of real Mac II - defconfig updates * tag 'm68k-for-v6.16-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: m68k: defconfig: Update defconfigs for v6.15-rc1 m68k: mac: Fix macintosh_config for Mac II m68k: Replace strcpy() with strscpy() in hardware_proc_show()
2025-05-28Merge tag 'for-linus-6.16-rc1-tag' of ↵Linus Torvalds4-6/+32
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - A fix for running as a Xen dom0 on the iMX8QXP Arm platform - An update of the xen.config adding XEN_UNPOPULATED_ALLOC for better support of PVH dom0 - A fix of the Xen balloon driver when running without CONFIG_XEN_UNPOPULATED_ALLOC - A fix of the dm_op Xen hypercall on Arm needed to pass user space buffers to the hypervisor in certain configurations * tag 'for-linus-6.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/arm: call uaccess_ttbr0_enable for dm_op hypercall xen/x86: fix initial memory balloon target xen: enable XEN_UNPOPULATED_ALLOC as part of xen.config xen: swiotlb: Wire up map_resource callback
2025-05-28bpf, arm64: Remove unused-but-set function and variable.Alexei Starovoitov1-19/+2
Remove unused-but-set function and variable to fix the build warning: arch/arm64/net/bpf_jit_comp.c: In function 'arch_bpf_trampoline_size': 2547 | int nregs, ret; | ^~~~~ Fixes: 9014cf56f13d ("bpf, arm64: Support up to 12 function arguments") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/bpf/20250528002704.21197-1-alexei.starovoitov@gmail.com Closes: https://lore.kernel.org/oe-kbuild-all/202505280643.h0qYcSCM-lkp@intel.com/
2025-05-28Merge tag 'dma-mapping-6.16-2025-05-26' of ↵Linus Torvalds10-201/+764
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping updates from Marek Szyprowski: "New two step DMA mapping API, which is a first step on a long path to provide alternatives to scatterlist and to remove hacks, abuses and design mistakes related to scatterlists. This new approach optimizes some calls to the DMA-IOMMU layer and cache maintenance by batching them, reduces memory usage as there is no need to store mapped DMA addresses to unmap them, and reduces some function call overhead. It is a combined effort of many people, led and developed by Christoph Hellwig and Leon Romanovsky" * tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: docs: core-api: document the IOVA-based API dma-mapping: add a dma_need_unmap helper dma-mapping: Implement link/unlink ranges API iommu/dma: Factor out a iommu_dma_map_swiotlb helper dma-mapping: Provide an interface to allow allocate IOVA iommu: add kernel-doc for iommu_unmap_fast iommu: generalize the batched sync after map interface dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h PCI/P2PDMA: Refactor the p2pdma mapping helpers
2025-05-28llist: make llist_add_batch() a static inlineJens Axboe2-25/+20
The function is small enough that it should be, and it's a (very) hot path for io_uring. Doing this actually reduces my vmlinux text size for my standard build/test box.

Before:
axboe@r7625 ~/g/linux (test)> size vmlinux
    text     data      bss      dec     hex  filename
19892174  5938310  2470432 28300916 1afd674  vmlinux

After:
axboe@r7625 ~/g/linux (test)> size vmlinux
    text     data      bss      dec     hex  filename
19891878  5938310  2470436 28300624 1afd550  vmlinux

Link: https://lkml.kernel.org/r/f1d104c6-7ac8-457a-a53d-6bb741421b2f@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
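For reference, a sketch of what the inlined form looks like; the exact cmpxchg flavor and comments are assumptions, but the lock-free push is the same operation lib/llist.c performed before:

static inline bool llist_add_batch(struct llist_node *new_first,
                                   struct llist_node *new_last,
                                   struct llist_head *head)
{
        struct llist_node *first = READ_ONCE(head->first);

        do {
                /* chain the new batch in front of the current first node */
                new_last->next = first;
        } while (!try_cmpxchg(&head->first, &first, new_first));

        return !first;  /* true if the list was previously empty */
}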
2025-05-28delayacct: remove redundant code and adjust indentationWang Yaxin1-35/+16
Remove redundant code and adjust indentation of xxx_delay_max/min. Link: https://lkml.kernel.org/r/20250521093157668iQrhhcMjA-th5LQf4-A3c@zte.com.cn Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-28squashfs: add optional full compressed block cachingChanho Min2-0/+49
The commit 93e72b3c612adcaca1 ("squashfs: migrate from ll_rw_block usage to BIO") removed caching of compressed blocks in SquashFS, causing fio performance regression in workloads with repeated file reads. Without caching, every read triggers disk I/O, severely impacting performance in tools like fio. This patch introduces a new CONFIG_SQUASHFS_COMP_CACHE_FULL Kconfig option to enable caching of all compressed blocks, restoring performance to pre-BIO migration levels. When enabled, all pages in a BIO are cached in the page cache, reducing disk I/O for repeated reads. The fio test results with this patch confirm the performance restoration: For example, fio tests (iodepth=1, numjobs=1, ioengine=psync) show a notable performance restoration: Disable CONFIG_SQUASHFS_COMP_CACHE_FULL: IOPS=815, BW=102MiB/s (107MB/s)(6113MiB/60001msec) Enable CONFIG_SQUASHFS_COMP_CACHE_FULL: IOPS=2223, BW=278MiB/s (291MB/s)(16.3GiB/59999msec) The tradeoff is increased memory usage due to caching all compressed blocks. The CONFIG_SQUASHFS_COMP_CACHE_FULL option allows users to enable this feature selectively, balancing performance and memory usage for workloads with frequent repeated reads. Link: https://lkml.kernel.org/r/20250521072559.2389-1-chanho.min@lge.com Signed-off-by: Chanho Min <chanho.min@lge.com> Reviewed-by Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-28crash_dump, nvme: select CONFIGFS_FS as built-inArnd Bergmann3-3/+8
Configfs can be configured as a loadable module, which causes a link-time failure for dm-crypt crash dump support: crash_dump_dm_crypt.c:(.text+0x3a4): undefined reference to `config_item_init_type_name' aarch64-linux-ld: kernel/crash_dump_dm_crypt.o: in function `configfs_dmcrypt_keys_init': crash_dump_dm_crypt.c:(.init.text+0x90): undefined reference to `config_group_init' aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xb4): undefined reference to `configfs_register_subsystem' aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xd8): undefined reference to `configfs_unregister_subsystem' This could be avoided with a dependency on CONFIGFS_FS=y, but the dependency has an additional problem of causing Kconfig dependency loops since most other uses select the symbol. Using a simple 'select CONFIGFS_FS' here in turn fails with CONFIG_DM_CRYPT=m, because that still only causes configfs to be a loadable module. The only version I found that fixes this reliably uses an additional Kconfig symbol to ensure the 'select' actually turns on configfs as builtin, with two additional changes to avoid dependency loops with nvme and sysfs. There is no compile-time dependency between configfs and sysfs, so selecting configfs from a driver with sysfs disabled does not cause link failures, only the default /sys/kernel/config mount point will not be created. Link: https://lkml.kernel.org/r/20250521160359.2132363-1-arnd@kernel.org Fixes: 6b23858fd63b ("crash_dump: make dm crypt keys persist for the kdump kernel") Fixes: 1fb470408497 ("nvme-loop: add configfs dependency") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Breno Leitao <leitao@debian.org> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Coiby Xu <coxu@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-28mm: pcp: increase pcp->free_count threshold to trigger free_highNikhil Dhama1-1/+1
In old pcp design, pcp->free_factor gets incremented in nr_pcp_free() which is invoked by free_pcppages_bulk(). So, it used to increase free_factor by 1 only when we try to reduce the size of pcp list and free_high used to trigger only for order > 0 and order < costly_order and pcp->free_factor > 0. For iperf3 I noticed that with older design in kernel v6.6, pcp list was drained mostly when pcp->count > high (more often when count goes above 530). and most of the time pcp->free_factor was 0, triggering very few high order flushes. But this is changed in the current design, introduced in commit 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing"), where pcp->free_factor is changed to pcp->free_count to keep track of the number of pages freed contiguously. In this design, pcp->free_count is incremented on every deallocation, irrespective of whether pcp list was reduced or not. And logic to trigger free_high is if pcp->free_count goes above batch (which is 63) and there are two contiguous page free without any allocation. With this design, for iperf3, pcp list is getting flushed more frequently because free_high heuristics is triggered more often now. I observed that high order pcp list is drained as soon as both count and free_count goes above 63. Due to this more aggressive high order flushing, applications doing contiguous high order allocation will require to go to global list more frequently. On a 2-node AMD machine with 384 vCPUs on each node, connected via Mellonox connectX-7, I am seeing a ~30% performance reduction if we scale number of iperf3 client/server pairs from 32 to 64. Though this new design reduced the time to detect high order flushes, but for application which are allocating high order pages more frequently it may be flushing the high order list pre-maturely. This motivates towards tuning on how late or early we should flush high order lists. So, in this patch, we increased the pcp->free_count threshold to trigger free_high from "batch" to "batch + pcp->high_min / 2" as suggested by Ying [1], In the original pcp->free_factor solution, free_high is triggered for contiguous freeing with size ranging from "batch" to "pcp->high + batch". So, the average value is "batch + pcp->high / 2". While in the pcp->free_count solution, free_high will be triggered for contiguous freeing with size "batch". So, to restore the original behavior, we can use the threshold "batch + pcp->high_min / 2" This new threshold keeps high order pages in pcp list for a longer duration which can help the application doing high order allocations frequently. 
With this patch, performance of iperf3 is restored and scores for other benchmarks on the same machine are as follows:

                       iperf3   lmbench3    netperf             kbuild
                                (AF_UNIX)   (SCTP_STREAM_MANY)
v6.6  vanilla (base)    100      100         100                 100
v6.12 vanilla            69      113          98.5                98.8
v6.12 + this patch      100      110.3       100.2                99.3

netperf-tcp:
                     6.12                 6.12
                     vanilla              this_patch
Hmean     64       732.14 (   0.00%)     730.45 (  -0.23%)
Hmean    128      1417.46 (   0.00%)    1419.44 (   0.14%)
Hmean    256      2679.67 (   0.00%)    2676.45 (  -0.12%)
Hmean   1024      8328.52 (   0.00%)    8339.34 (   0.13%)
Hmean   2048     12716.98 (   0.00%)   12743.68 (   0.21%)
Hmean   3312     15787.79 (   0.00%)   15887.25 (   0.63%)
Hmean   4096     17311.91 (   0.00%)   17332.68 (   0.12%)
Hmean   8192     20310.73 (   0.00%)   20465.09 (   0.76%)

Link: https://lore.kernel.org/all/875xjmuiup.fsf@DESKTOP-5N7EMDA/ [1] Link: https://lkml.kernel.org/r/20250407105219.55351-1-nikhil.dhama@amd.com Fixes: 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing") Signed-off-by: Nikhil Dhama <nikhil.dhama@amd.com> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Bharata B Rao <bharata@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
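A sketch of the new trigger condition using the field names from the description above (the surrounding free_high logic is simplified away):

/* Sketch: free_high now fires only after the run of contiguous frees
 * exceeds batch + pcp->high_min / 2, rather than just batch. */
static bool pcp_free_high_threshold_reached(struct per_cpu_pages *pcp, int batch)
{
        return pcp->free_count > batch + pcp->high_min / 2;
}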
2025-05-28mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()Fan Ni1-11/+13
In __unmap_hugepage_range(), the "page" pointer always points to the first page of a huge page, which guarantees there is a folio associated with it. Convert the "page" pointer to use folio. Link: https://lkml.kernel.org/r/20250505182345.506888-6-nifan.cxl@gmail.com Signed-off-by: Fan Ni <fan.ni@samsung.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-28mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of pageFan Ni2-11/+11
The function __unmap_hugepage_range() has two kinds of users: 1) unmap_hugepage_range(), which passes in the head page of a folio. Since unmap_hugepage_range() already takes folio and there are no other uses of the folio struct in the function, it is natural for __unmap_hugepage_range() to take folio also. 2) All other uses, which pass in NULL pointer. In both cases, we can pass in folio. Refactor __unmap_hugepage_range() to take folio. Link: https://lkml.kernel.org/r/20250505182345.506888-5-nifan.cxl@gmail.com Signed-off-by: Fan Ni <fan.ni@samsung.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-28mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of pageFan Ni2-5/+6
The function unmap_hugepage_range() has two kinds of users: 1) unmap_ref_private(), which passes in the head page of a folio. Since unmap_ref_private() already takes folio and there are no other uses of the folio struct in the function, it is natural for unmap_hugepage_range() to take folio also. 2) All other uses, which pass in NULL pointer. In both cases, we can pass in folio. Refactor unmap_hugepage_range() to take folio. Link: https://lkml.kernel.org/r/20250505182345.506888-4-nifan.cxl@gmail.com Signed-off-by: Fan Ni <fan.ni@samsung.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>