summaryrefslogtreecommitdiff
path: root/include/uapi/linux
AgeCommit message (Collapse)AuthorFilesLines
2025-06-23io_uring/netcmd: add tx timestamping cmd supportPavel Begunkov1-0/+16
Add a new socket command which returns tx time stamps to the user. It provide an alternative to the existing error queue recvmsg interface. The command works in a polled multishot mode, which means io_uring will poll the socket and keep posting timestamps until the request is cancelled or fails in any other way (e.g. with no space in the CQ). It reuses the net infra and grabs timestamps from the socket's error queue. The command requires IORING_SETUP_CQE32. All non-final CQEs (marked with IORING_CQE_F_MORE) have cqe->res set to the tskey, and the upper 16 bits of cqe->flags keep tstype (i.e. offset by IORING_CQE_BUFFER_SHIFT). The timevalue is store in the upper part of the extended CQE. The final completion won't have IORING_CQE_F_MORE and will have cqe->res storing 0/error. Suggested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/92ee66e6b33b8de062a977843d825f58f21ecd37.1750065793.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23io_uring/nop: add IORING_NOP_TW completion flagJens Axboe1-0/+1
To test and profile the overhead of io_uring task_work and the various types of it, add IORING_NOP_TW which tells nop to signal completions through task_work rather than complete them inline. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-23fs: introduce FALLOC_FL_WRITE_ZEROES to fallocateZhang Yi1-0/+17
With the development of flash-based storage devices, we can quickly write zeros to SSDs using the WRITE_ZERO command if the devices do not actually write physical zeroes to the media. Therefore, we can use this command to quickly preallocate a real all-zero file with written extents. This approach should be beneficial for subsequent pure overwriting within this file, as it can save on block allocation and, consequently, significant metadata changes, which should greatly improve overwrite performance on certain filesystems. Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to fallocate. This flag is used to convert a specified range of a file to zeros by issuing a zeroing operation. Blocks should be allocated for the regions that span holes in the file, and the entire range is converted to written extents. If the underlying device supports the actual offload write zeroes command, the process of zeroing out operation can be accelerated. If it does not, we currently don't prevent the file system from writing actual zeros to the device. This provides users with a new method to quickly generate a zeroed file, users no longer need to write zero data to create a file with written extents. Users can determine whether a disk supports the unmap write zeroes feature through querying this sysfs interface: /sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes Users can also enable or disable the unmap write zeroes operation through this sysfs interface: /sys/block/<disk>/queue/write_zeroes_unmap_max_bytes Finally, this flag cannot be specified in conjunction with the FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is not permitted. In addition, filesystems that always require out-of-place writes should not support this flag since they still need to allocated new blocks during subsequent overwrites. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-7-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+22
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some missing synchronisation - A small fix for the irqbypass hook fixes, tightening the check and ensuring that we only deal with MSI for both the old and the new route entry - Rework the way the shadow LRs are addressed in a nesting configuration, plugging an embarrassing bug as well as simplifying the whole process - Add yet another fix for the dreaded arch_timer_edge_cases selftest RISC-V: - Fix the size parameter check in SBI SFENCE calls - Don't treat SBI HFENCE calls as NOPs x86 TDX: - Complete API for handling complex TDVMCALLs in userspace. This was delayed because the spec lacked a way for userspace to deny supporting these calls; the new exit code is now approved" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: TDX: Exit to userspace for GetTdVmCallInfo KVM: TDX: Handle TDG.VP.VMCALL<GetQuote> KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs KVM: arm64: VHE: Centralize ISBs when returning to host KVM: arm64: Remove cpacr_clear_set() KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd() KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync() KVM: arm64: Reorganise CPTR trap manipulation KVM: arm64: VHE: Synchronize CPTR trap deactivation KVM: arm64: VHE: Synchronize restore of host debug registers KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases KVM: arm64: Explicitly treat routing entry type changes as changes KVM: arm64: nv: Fix tracking of shadow list registers RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
2025-06-21Merge tag 'perf-tools-fixes-for-v6.16-1-2025-06-20' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools Pull perf tools fixes from Arnaldo Carvalho de Melo: - Fix some file descriptor leaks that stand out with recent changes to 'perf list' - Fix prctl include to fix building 'perf bench futex' hash with musl libc - Restrict 'perf test' uniquifying entry to machines with 'uncore_imc' PMUs - Document new output fields (op, cache, mem, dtlb, snoop) used with 'perf mem' - Synchronize kernel header copies * tag 'perf-tools-fixes-for-v6.16-1-2025-06-20' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: tools headers x86 cpufeatures: Sync with the kernel sources perf bench futex: Fix prctl include in musl libc perf test: Directory file descriptor leak perf evsel: Missed close() when probing hybrid core PMUs tools headers: Synchronize linux/bits.h with the kernel sources tools arch amd ibs: Sync ibs.h with the kernel sources tools arch x86: Sync the msr-index.h copy with the kernel sources tools headers: Syncronize linux/build_bug.h with the kernel sources tools headers: Update the copy of x86's mem{cpy,set}_64.S used in 'perf bench' tools headers UAPI: Sync linux/kvm.h with the kernel sources tools headers UAPI: Sync the drm/drm.h with the kernel sources perf beauty: Update copy of linux/socket.h with the kernel sources tools headers UAPI: Sync kvm header with the kernel sources tools headers x86 svm: Sync svm headers with the kernel sources tools headers UAPI: Sync KVM's vmx.h header with the kernel sources tools kvm headers arm64: Update KVM header from the kernel sources tools headers UAPI: Sync linux/prctl.h with the kernel sources to pick FUTEX knob perf mem: Document new output fields (op, cache, mem, dtlb, snoop) tools headers: Update the fs headers with the kernel sources perf test: Restrict uniquifying test to machines with 'uncore_imc'
2025-06-20KVM: TDX: Exit to userspace for SetupEventNotifyInterruptPaolo Bonzini1-0/+4
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Exit to userspace for GetTdVmCallInfoBinbin Wu1-0/+5
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX, to allow userspace to provide information about the support of TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API. GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>, <ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>, <Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and <Instruction.WRMSR>. They must be supported by VMM to support TDX guests. For GetTdVmCallInfo - When leaf (r12) to enumerate TDVMCALL functionality is set to 0, successful execution indicates all GHCI base TDVMCALLs listed above are supported. Update the KVM TDX document with the set of the GHCI base APIs. - When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it indicates the TDX guest is querying the supported TDVMCALLs beyond the GHCI base TDVMCALLs. Exit to userspace to let userspace set the TDVMCALL sub-function bit(s) accordingly to the leaf outputs. KVM could set the TDVMCALL bit(s) supported by itself when the TDVMCALLs don't need support from userspace after returning from userspace and before entering guest. Currently, no such TDVMCALLs implemented, KVM just sets the values returned from userspace. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>Binbin Wu1-0/+17
Handle TDVMCALL for GetQuote to generate a TD-Quote. GetQuote is a doorbell-like interface used by TDX guests to request VMM to generate a TD-Quote signed by a service hosting TD-Quoting Enclave operating on the host. A TDX guest passes a TD Report (TDREPORT_STRUCT) in a shared-memory area as parameter. Host VMM can access it and queue the operation for a service hosting TD-Quoting enclave. When completed, the Quote is returned via the same shared-memory area. KVM only checks the GPA from the TDX guest has the shared-bit set and drops the shared-bit before exiting to userspace to avoid bleeding the shared-bit into KVM's exit ABI. KVM forwards the request to userspace VMM (e.g. QEMU) and userspace VMM queues the operation asynchronously. KVM sets the return code according to the 'ret' field set by userspace to notify the TDX guest whether the request has been queued successfully or not. When the request has been queued successfully, the TDX guest can poll the status field in the shared-memory area to check whether the Quote generation is completed or not. When completed, the generated Quote is returned via the same buffer. Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is required to handle the KVM exit reason as the initial support for TDX, by reentering KVM to ensure that the TDVMCALL is complete. While at it, add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20HID: core: Add bus define for SoundWire busCharles Keepax1-0/+1
SDCA (SoundWire Device Class for Audio) uses HID to convey input events from peripheral devices. Add a bus define for the SoundWire bus to prepare support for this. Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Shuming Fan <shumingf@realtek.com> Acked-by: Jiri Kosina <jkosina@suse.com> Link: https://patch.msgid.link/20250616114907.855452-1-shumingf@realtek.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-06-20wifi: cfg80211: Add support for link reconfiguration negotiation offload to ↵Kavita Kavita1-1/+5
driver In the case of SME-in-driver, the driver can internally choose to update the links based on the AP MLD recommendation and do link reconfiguration negotiation with AP MLD. (e.g., After the driver processing the BSS Transition Management request frame received from the AP MLD with Neighbor Report containing Multi-Link element with recommended links information chooses to do link reconfiguration negotiation with AP MLD). To support this, extend cfg80211_mlo_reconf_add_done() and NL80211_CMD_ASSOC_MLO_RECONF to indicate added links information for driver-initiated link reconfiguration requests. For removed links, the driver indicates links information using the NL80211_CMD_LINKS_REMOVED event for driver-initiated cases, the same as supplicant initiated cases. For the driver-initiated case, cfg80211 will receive link reconfiguration result asynchronously from driver so holding BSSes of the accepted add links is needed in the event path. Also, no need of unhold call for the rejected add link BSSes since there was no hold call happened previously. Once the supplicant receives the NL80211_CMD_ASSOC_MLO_RECONF event, it needs to process the information about newly added links and install per-link group keys (e.g., GTK/IGTK/BIGTK etc.). In case of the SME-in-driver, using a vendor interface etc. to notify the supplicant to initiate a link reconfiguration request and then supplicant sending command to the cfg80211 can lead to race conditions. The correct design to avoid this is that the driver indicates the cfg80211 directly with the results of the link reconfiguration negotiation. Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com> Link: https://patch.msgid.link/20250604105757.2542-3-quic_kkavita@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-20wifi: cfg80211: Improve the documentation for NL80211_CMD_ASSOC_MLO_RECONFKavita Kavita1-1/+5
The existing documentation for the NL80211_CMD_ASSOC_MLO_RECONF does not clearly explain handling of link reconfiguration request results from the driver. Add documentation to explain that the command is used as an event to notify userspace about added links information, and that the existing NL80211_CMD_LINKS_REMOVED command is used to notify userspace about removed links information. Signed-off-by: Kavita Kavita <quic_kkavita@quicinc.com> Link: https://patch.msgid.link/20250604105757.2542-2-quic_kkavita@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-8/+4
Cross-merge networking fixes after downstream PR (net-6.16-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19Merge tag 'net-6.16-rc3' of ↵Linus Torvalds2-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from wireless. The ath12k fix to avoid FW crashes requires adding support for a number of new FW commands so it's quite large in terms of LoC. The rest is relatively small. Current release - fix to a fix: - ptp: fix breakage after ptp_vclock_in_use() rework Current release - regressions: - openvswitch: allocate struct ovs_pcpu_storage dynamically, static allocation may exhaust module loader limit on smaller systems Previous releases - regressions: - tcp: fix tcp_packet_delayed() for peers with no selective ACK support Previous releases - always broken: - wifi: ath12k: don't activate more links than firmware supports - tcp: make sure sockets open via passive TFO have valid NAPI ID - eth: bnxt_en: update MRU and RSS table of RSS contexts on queue reset, prevent Rx queues from silently hanging after queue reset - NFC: uart: set tty->disc_data only in success path" * tag 'net-6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits) net: airoha: Differentiate hwfd buffer size for QDMA0 and QDMA1 net: airoha: Compute number of descriptors according to reserved memory size tools: ynl: fix mixing ops and notifications on one socket net: atm: fix /proc/net/atm/lec handling net: atm: add lec_mutex mlxbf_gige: return EPROBE_DEFER if PHY IRQ is not available net: airoha: Always check return value from airoha_ppe_foe_get_entry() NFC: nci: uart: Set tty->disc_data only in success path calipso: Fix null-ptr-deref in calipso_req_{set,del}attr(). MAINTAINERS: Remove Shannon Nelson from MAINTAINERS file net: lan743x: fix potential out-of-bounds write in lan743x_ptp_io_event_clock_get() eth: fbnic: avoid double free when failing to DMA-map FW msg tcp: fix passive TFO socket having invalid NAPI ID selftests: net: add test for passive TFO socket NAPI ID selftests: net: add passive TFO test binary selftests: netdevsim: improve lib.sh include in peer.sh tipc: fix null-ptr-deref when acquiring remote ip of ethernet bearer Octeontx2-pf: Fix Backpresure configuration net: ftgmac100: select FIXED_PHY net: ethtool: remove duplicate defines for family info ...
2025-06-19time: Introduce auxiliary POSIX clocksAnna-Maria Behnsen1-0/+11
To support auxiliary timekeeping and the related user space interfaces, it's required to define a clock ID range for them. Reserve 8 auxiliary clock IDs after the regular timekeeping clock ID space. This is the maximum number of auxiliary clocks the kernel can support. The actual number of supported clocks depends obviously on the presence of related devices and might be constraint by the available VDSO space. Add the corresponding timekeeper IDs as well. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.905800695@linutronix.de
2025-06-19net: ethtool: Add PSE port priority support featureKory Maincent (Dent Project)1-0/+2
This patch expands the status information provided by ethtool for PSE c33 with current port priority and max port priority. It also adds a call to pse_ethtool_set_prio() to configure the PSE port priority. Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-8-78a1a645e2ee@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19net: pse-pd: Add support for budget evaluation strategiesKory Maincent (Dent Project)1-0/+18
This patch introduces the ability to configure the PSE PI budget evaluation strategies. Budget evaluation strategies is utilized by PSE controllers to determine which ports to turn off first in scenarios such as power budget exceedance. The pis_prio_max value is used to define the maximum priority level supported by the controller. Both the current priority and the maximum priority are exposed to the user through the pse_ethtool_get_status call. This patch add support for two mode of budget evaluation strategies. 1. Static Method: This method involves distributing power based on PD classification. It’s straightforward and stable, the PSE core keeping track of the budget and subtracting the power requested by each PD’s class. Advantages: Every PD gets its promised power at any time, which guarantees reliability. Disadvantages: PD classification steps are large, meaning devices request much more power than they actually need. As a result, the power supply may only operate at, say, 50% capacity, which is inefficient and wastes money. Priority max value is matching the number of PSE PIs within the PSE. 2. Dynamic Method: To address the inefficiencies of the static method, vendors like Microchip have introduced dynamic power budgeting, as seen in the PD692x0 firmware. This method monitors the current consumption per port and subtracts it from the available power budget. When the budget is exceeded, lower-priority ports are shut down. Advantages: This method optimizes resource utilization, saving costs. Disadvantages: Low-priority devices may experience instability. Priority max value is set by the PSE controller driver. For now, budget evaluation methods are not configurable and cannot be mixed. They are hardcoded in the PSE driver itself, as no current PSE controller supports both methods. Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-7-78a1a645e2ee@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19net: ethtool: Add support for new power domains index descriptionKory Maincent (Dent Project)1-0/+1
Report the index of the newly introduced PSE power domain to the user, enabling improved management of the power budget for PSE devices. Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-5-78a1a645e2ee@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19net: pse-pd: Add support for reporting eventsKory Maincent (Dent Project)1-0/+19
Add support for devm_pse_irq_helper() to register PSE interrupts and report events such as over-current or over-temperature conditions. This follows a similar approach to the regulator API but also sends notifications using a dedicated PSE ethtool netlink socket. Signed-off-by: Kory Maincent (Dent Project) <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250617-feature_poe_port_prio-v14-2-78a1a645e2ee@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19net: ethtool: remove duplicate defines for family infoJakub Kicinski2-6/+2
Commit under fixes switched to uAPI generation from the YAML spec. A number of custom defines were left behind, mostly for commands very hard to express in YAML spec. Among what was left behind was the name and version of the generic netlink family. Problem is that the codegen always outputs those values so we ended up with a duplicated, differently named set of defines. Provide naming info in YAML and remove the incorrect defines. Fixes: 8d0580c6ebdd ("ethtool: regenerate uapi header from the spec") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250617202240.811179-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18media: rockchip: rkisp1: Add support for Wide Dynamic RangeJai Luthra1-1/+94
RKISP supports a basic Wide Dynamic Range (WDR) module since the first iteration (v1.0) of the ISP. Add support for enabling and configuring it using extensible parameters. Also, to ease programming, switch to using macro variables for defining the tonemapping curve register addresses. Reviewed-by: Stefan Klug <stefan.klug@ideasonboard.com> Tested-by: Stefan Klug <stefan.klug@ideasonboard.com> Reviewed-by: Paul Elder <paul.elder@ideasonboard.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Jai Luthra <jai.luthra@ideasonboard.com> Link: https://lore.kernel.org/r/20250610-wdr-latest-v4-1-b69d0ac17ce9@ideasonboard.com Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
2025-06-18media: rkisp1: Add RKISP1_CID_SUPPORTED_PARAMS_BLOCKS controlStefan Klug2-0/+17
Add a RKISP1_CID_SUPPORTED_PARAMS_BLOCKS V4L2 control to be able to query the parameters blocks supported by the current kernel on the current hardware from user space. Signed-off-by: Stefan Klug <stefan.klug@ideasonboard.com> Reviewed-by: Paul Elder <paul.elder@ideasonboard.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Link: https://lore.kernel.org/r/20250523-supported-params-and-wdr-v3-2-7283b8536694@ideasonboard.com Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
2025-06-18vxlan: Support MC routing in the underlayPetr Machata1-0/+1
Locally-generated MC packets have so far not been subject to MC routing. Instead an MC-enabled installation would maintain the MC routing tables, and separately from that the list of interfaces to send packets to as part of the VXLAN FDB and MDB. In a previous patch, a ip_mr_output() and ip6_mr_output() routines were added for IPv4 and IPv6. All locally generated MC traffic is now passed through these functions. For reasons of backward compatibility, an SKB (IPCB / IP6CB) flag guards the actual MC routing. This patch adds logic to set the flag, and the UAPI to enable the behavior. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/d899655bb7e9b2521ee8c793e67056b9fd02ba12.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17tools headers: Synchronize linux/bits.h with the kernel sourcesArnaldo Carvalho de Melo1-2/+2
To pick up the changes in this cset: 1e7933a575ed8af4 ("uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"") 5b572e8a9f3dcd6e ("bits: introduce fixed-type BIT_U*()") 19408200c094858d ("bits: introduce fixed-type GENMASK_U*()") 31299a5e02112411 ("bits: add comments and newlines to #if, #else and #endif directives") This addresses these perf build warnings: Warning: Kernel ABI header differences: diff -u tools/include/linux/bits.h include/linux/bits.h Please see tools/include/uapi/README for further details. Acked-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Cc: I Hsin Cheng <richard120310@gmail.com> Cc: Yury Norov <yury.norov@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Yury Norov <yury.norov@gmail.com> Link: https://lore.kernel.org/r/aEr0ZJ60EbshEy6p@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-06-17tty: vt: use _IO() to define ioctl numbersJiri Slaby (SUSE)1-17/+17
_IO*() is the proper way of defining ioctl numbers. All these vt numbers were synthetically built up the same way the _IO() macro does. So instead of implicit hex numbers, use _IO() properly. To not change the pre-existing numbers, use only _IO() (and not _IOR() or _IOW()). The latter would change the numbers indeed. Objdump of vt_ioctl.o reveals no difference with this patch. Again, VT_GETCONSIZECSRPOS already uses _IOR(), so everything is paved for this patch. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: Nicolas Pitre <nico@fluxnic.net> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20250611100319.186924-8-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-17tty: vt: use sane types for userspace APIJiri Slaby (SUSE)1-22/+22
As discussed earlier (see the Link below), use the preferred ioctl types in vt.h (__u8, __u16, ...). These types are already used for the new VT_GETCONSIZECSRPOS. Therefore, the necessary includes are already present. Since now, the types are used for every structure defined in the header now. Note the kernel is built with -funsigned-char, therefore 'char' becomes '__u8' in here. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: Nicolas Pitre <nico@fluxnic.net> Link: https://lore.kernel.org/all/p7p83sq1-4ro2-o924-s9o2-30spr74n076o@syhkavp.arg/ Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Link: https://lore.kernel.org/r/20250611100319.186924-7-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-14dpll: add phase-offset-monitor feature to netlink specArkadiusz Kubalewski1-0/+12
Add enum dpll_feature_state for control over features. Add dpll device level attribute: DPLL_A_PHASE_OFFSET_MONITOR - to allow control over a phase offset monitor feature. Attribute is present and shall return current state of a feature (enum dpll_feature_state), if the device driver provides such capability, otherwie attribute shall not be present. Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Milena Olech <milena.olech@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20250612152835.1703397-2-arkadiusz.kubalewski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-13syscall_user_dispatch: Add PR_SYS_DISPATCH_INCLUSIVE_ONDmitry Vyukov1-1/+6
There are two possible scenarios for syscall filtering: - having a trusted/allowed range of PCs, and intercepting everything else - or the opposite: a single untrusted/intercepted range and allowing everything else (this is relevant for any kind of sandboxing scenario, or monitoring behavior of a single library) The current API only allows the former use case due to allowed range wrap-around check. Add PR_SYS_DISPATCH_INCLUSIVE_ON that enables the second use case. Add PR_SYS_DISPATCH_EXCLUSIVE_ON alias for PR_SYS_DISPATCH_ON to make it clear how it's different from the new PR_SYS_DISPATCH_INCLUSIVE_ON. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/97947cc8e205ff49675826d7b0327ef2e2c66eea.1747839857.git.dvyukov@google.com
2025-06-12Merge tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linuxLinus Torvalds1-2/+2
Pull bitmap fix from Yury Norov: "Fix for __GENMASK() and __GENMASK_ULL() in UAPI" * tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linux: uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again
2025-06-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-1/+23
Cross-merge networking fixes after downstream PR (net-6.16-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12coredump: allow for flexible coredump handlingChristian Brauner1-0/+104
Extend the coredump socket to allow the coredump server to tell the kernel how to process individual coredumps. When the crashing task connects to the coredump socket the kernel will send a struct coredump_req to the coredump server. The kernel will set the size member of struct coredump_req allowing the coredump server how much data can be read. The coredump server uses MSG_PEEK to peek the size of struct coredump_req. If the kernel uses a newer struct coredump_req the coredump server just reads the size it knows and discard any remaining bytes in the buffer. If the kernel uses an older struct coredump_req the coredump server just reads the size the kernel knows. The returned struct coredump_req will inform the coredump server what features the kernel supports. The coredump_req->mask member is set to the currently know features. The coredump server may only use features whose bits were raised by the kernel in coredump_req->mask. In response to a coredump_req from the kernel the coredump server sends a struct coredump_ack to the kernel. The kernel informs the coredump server what version of struct coredump_ack it supports by setting struct coredump_req->size_ack to the size it knows about. The coredump server may only send as many bytes as coredump_req->size_ack indicates (a smaller size is fine of course). The coredump server must set coredump_ack->size accordingly. The coredump server sets the features it wants to use in struct coredump_ack->mask. Only bits returned in struct coredump_req->mask may be used. In case an invalid struct coredump_ack is sent to the kernel a non-zero u32 integer is sent indicating the reason for the failure. If it was successful a zero u32 integer is sent. In the initial version the following features are supported in coredump_{req,ack}->mask: * COREDUMP_KERNEL The kernel will write the coredump data to the socket. * COREDUMP_USERSPACE The kernel will not write coredump data but will indicate to the parent that a coredump has been generated. This is used when userspace generates its own coredumps. * COREDUMP_REJECT The kernel will skip generating a coredump for this task. * COREDUMP_WAIT The kernel will prevent the task from exiting until the coredump server has shutdown the socket connection. The flexible coredump socket can be enabled by using the "@@" prefix instead of the single "@" prefix for the regular coredump socket: @@/run/systemd/coredump.socket will enable flexible coredump handling. Current kernels already enforce that "@" must be followed by "/" and will reject anything else. So extending this is backward and forward compatible. Link: https://lore.kernel.org/20250603-work-coredump-socket-protocol-v2-1-05a5f0c18ecc@kernel.org Acked-by: Lennart Poettering <lennart@poettering.net> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-11mntns: use stable inode number for initial mount nsChristian Brauner1-0/+1
Apart from the network and mount namespace all other namespaces expose a stable inode number and userspace has been relying on that for a very long time now. It's very much heavily used API. Align the mount namespace and use a stable inode number from the reserved procfs inode number space so this is consistent across all namespaces. Link: https://lore.kernel.org/20250606-work-nsfs-v1-3-b8749c9a8844@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-11netns: use stable inode number for initial mount nsChristian Brauner1-0/+1
Apart from the network and mount namespace all other namespaces expose a stable inode number and userspace has been relying on that for a very long time now. It's very much heavily used API. Align the network namespace and use a stable inode number from the reserved procfs inode number space so this is consistent across all namespaces. Link: https://lore.kernel.org/20250606-work-nsfs-v1-2-b8749c9a8844@kernel.org Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-11nsfs: move root inode number to uapiChristian Brauner1-0/+9
Userspace relies on the root inode numbers to identify the initial namespaces. That's already a hard dependency. So we cannot change that anymore. Move the initial inode numbers to a public header. Link: https://github.com/systemd/systemd/commit/d293fade24b34ccc2f5716b0ff5513e9533cf0c4 Link: https://lore.kernel.org/20250606-work-nsfs-v1-1-b8749c9a8844@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-11uapi: in6: restore visibility of most IPv6 socket optionsJakub Kicinski1-2/+2
A decade ago commit 6d08acd2d32e ("in6: fix conflict with glibc") hid the definitions of IPV6 options, because GCC was complaining about duplicates. The commit did not list the warnings seen, but trying to recreate them now I think they are (building iproute2): In file included from ./include/uapi/rdma/rdma_user_cm.h:39, from rdma.h:16, from res.h:9, from res-ctx.c:7: ../include/uapi/linux/in6.h:171:9: warning: ‘IPV6_ADD_MEMBERSHIP’ redefined 171 | #define IPV6_ADD_MEMBERSHIP 20 | ^~~~~~~~~~~~~~~~~~~ In file included from /usr/include/netinet/in.h:37, from rdma.h:13: /usr/include/bits/in.h:233:10: note: this is the location of the previous definition 233 | # define IPV6_ADD_MEMBERSHIP IPV6_JOIN_GROUP | ^~~~~~~~~~~~~~~~~~~ ../include/uapi/linux/in6.h:172:9: warning: ‘IPV6_DROP_MEMBERSHIP’ redefined 172 | #define IPV6_DROP_MEMBERSHIP 21 | ^~~~~~~~~~~~~~~~~~~~ /usr/include/bits/in.h:234:10: note: this is the location of the previous definition 234 | # define IPV6_DROP_MEMBERSHIP IPV6_LEAVE_GROUP | ^~~~~~~~~~~~~~~~~~~~ Compilers don't complain about redefinition if the defines are identical, but here we have the kernel using the literal value, and glibc using an indirection (defining to a name of another define, with the same numerical value). Problem is, the commit in question hid all the IPV6 socket options, and glibc has a pretty sparse list. For instance it lacks Flow Label related options. Willem called this out in commit 3fb321fde22d ("selftests/net: ipv6 flowlabel"): /* uapi/glibc weirdness may leave this undefined */ #ifndef IPV6_FLOWINFO #define IPV6_FLOWINFO 11 #endif More interestingly some applications (socat) use a #ifdef IPV6_FLOWINFO to gate compilation of thier rudimentary flow label support. (For added confusion socat misspells it as IPV4_FLOWINFO in some places.) Hide only the two defines we know glibc has a problem with. If we discover more warnings we can hide more but we should avoid covering the entire block of defines for "IPV6 socket options". Link: https://patch.msgid.link/20250609143933.1654417-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10bpf: adjust path to trace_output sample eBPF programTobias Klauser1-1/+1
The sample file was renamed from trace_output_kern.c to trace_output.bpf.c in commit d4fffba4d04b ("samples/bpf: Change _kern suffix to .bpf with syscall tracing program"). Adjust the path in the documentation comment for bpf_perf_event_output. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Link: https://lore.kernel.org/r/20250610140756.16332-1-tklauser@distanz.ch Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-10bpf: Add cookie to tracing bpf_link_infoTao Chen1-0/+2
bpf_tramp_link includes cookie info, we can add it in bpf_link_info. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-1-chen.dylane@linux.dev
2025-06-10bpf: Implement mprog API on top of existing cgroup progsYonghong Song1-0/+7
Current cgroup prog ordering is appending at attachment time. This is not ideal. In some cases, users want specific ordering at a particular cgroup level. To address this, the existing mprog API seems an ideal solution with supporting BPF_F_BEFORE and BPF_F_AFTER flags. But there are a few obstacles to directly use kernel mprog interface. Currently cgroup bpf progs already support prog attach/detach/replace and link-based attach/detach/replace. For example, in struct bpf_prog_array_item, the cgroup_storage field needs to be together with bpf prog. But the mprog API struct bpf_mprog_fp only has bpf_prog as the member, which makes it difficult to use kernel mprog interface. In another case, the current cgroup prog detach tries to use the same flag as in attach. This is different from mprog kernel interface which uses flags passed from user space. So to avoid modifying existing behavior, I made the following changes to support mprog API for cgroup progs: - The support is for prog list at cgroup level. Cross-level prog list (a.k.a. effective prog list) is not supported. - Previously, BPF_F_PREORDER is supported only for prog attach, now BPF_F_PREORDER is also supported by link-based attach. - For attach, BPF_F_BEFORE/BPF_F_AFTER/BPF_F_ID/BPF_F_LINK is supported similar to kernel mprog but with different implementation. - For detach and replace, use the existing implementation. - For attach, detach and replace, the revision for a particular prog list, associated with a particular attach type, will be updated by increasing count by 1. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606163141.2428937-1-yonghong.song@linux.dev
2025-06-06Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linuxLinus Torvalds1-0/+9
Pull more block updates from Jens Axboe: - NVMe pull request via Christoph: - TCP error handling fix (Shin'ichiro Kawasaki) - TCP I/O stall handling fixes (Hannes Reinecke) - fix command limits status code (Keith Busch) - support vectored buffers also for passthrough (Pavel Begunkov) - spelling fixes (Yi Zhang) - MD pull request via Yu: - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10 - fix max_write_behind setting for dm-raid - some minor cleanups - Integrity data direction fix and cleanup - bcache NULL pointer fix - Fix for loop missing write start/end handling - Decouple hardware queues and IO threads in ublk - Slew of ublk selftests additions and updates * tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits) nvme: spelling fixes nvme-tcp: fix I/O stalls on congested sockets nvme-tcp: sanitize request list handling nvme-tcp: remove tag set when second admin queue config fails nvme: enable vectored registered bufs for passthrough cmds nvme: fix implicit bool to flags conversion nvme: fix command limits status code selftests: ublk: kublk: improve behavior on init failure block: flip iter directions in blk_rq_integrity_map_user() block: drop direction param from bio_integrity_copy_user() selftests: ublk: cover PER_IO_DAEMON in more stress tests Documentation: ublk: document UBLK_F_PER_IO_DAEMON selftests: ublk: add stress test for per io daemons selftests: ublk: add functional test for per io daemons selftests: ublk: kublk: decouple ublk_queues from ublk server threads selftests: ublk: kublk: move per-thread data out of ublk_queue selftests: ublk: kublk: lift queue initialization out of thread selftests: ublk: kublk: tie sqe allocation to io instead of queue selftests: ublk: kublk: plumb q_id in io_uring user_data ublk: have a per-io daemon instead of a per-queue daemon ...
2025-06-06Merge tag 'usb-6.16-rc1' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB / Thunderbolt updates from Greg KH: "Here is the big set of USB and Thunderbolt changes for 6.16-rc1. Included in here are the following: - USB offload support for audio devices. I think this takes the record for the most number of patch series (30+) over the longest period of time (2+ years) to get merged properly. Many props go to Wesley Cheng for seeing this effort through, they took a major out-of-tree hacked-up-monstrosity that was created by multiple vendors for their specific devices, got it all merged into a semi-coherent set of changes, and got all of the different major subsystems to agree on how this should be implemented both with changes to their code as well as userspace apis, AND wrangled the hardware companies into agreeing to go forward with this, despite making them all redo work they had already done in their private device trees. This feature offers major power savings on embedded devices where a USB audio stream can continue to flow while the rest of the system is sleeping, something that devices running on battery power really care about. There are still some more small tweaks left to be done here, and those patches are still out for review and arguing among the different hardware companies, but this is a major step forward and a great example of how to do upstream development well. - small number of thunderbolt fixes and updates, things seem to be slowing down here (famous last words...) - xhci refactors and reworking to try to handle some rough corner cases in some hardware implementations where things don't always work properly - typec driver updates - USB3 power management reworking and updates - Removal of some old and orphaned UDC gadget drivers that had not been used in a very long time, dropping over 11 thousand lines from the tree, always a nice thing, making up for the 12k lines added for the USB offload feature. - lots of little updates and fixes in different drivers All of these have been in linux-next for over 2 weeks, the USB offload logic has been in there for 8 weeks now, with no reported issues" * tag 'usb-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (172 commits) ALSA: usb-audio: qcom: fix USB_XHCI dependency ASoC: qdsp6: fix compile-testing without CONFIG_OF usb: misc: onboard_usb_dev: fix build warning for CONFIG_USB_ONBOARD_DEV_USB5744=n usb: typec: tipd: fix typo in TPS_STATUS_HIGH_VOLAGE_WARNING macro USB: typec: fix const issue in typec_match() USB: gadget: udc: fix const issue in gadget_match_driver() USB: gadget: fix up const issue with struct usb_function_instance USB: serial: pl2303: add new chip PL2303GC-Q20 and PL2303GT-2AB USB: serial: bus: fix const issue in usb_serial_device_match() usb: usbtmc: Fix timeout value in get_stb usb: usbtmc: Fix read_stb function and get_stb ioctl ALSA: qc_audio_offload: try to reduce address space confusion ALSA: qc_audio_offload: avoid leaking xfer_buf allocation ALSA: qc_audio_offload: rename dma/iova/va/cpu/phys variables ALSA: usb-audio: qcom: Fix an error handling path in qc_usb_audio_probe() usb: misc: onboard_usb_dev: Fix usb5744 initialization sequence dt-bindings: usb: ti,usb8041: Add binding for TI USB8044 hub controller usb: misc: onboard_usb_dev: Add support for TI TUSB8044 hub usb: gadget: lpc32xx_udc: Use USB API functions rather than constants usb: gadget: epautoconf: Use USB API functions rather than constants ...
2025-06-06Merge tag 'tty-6.16-rc1' of ↵Linus Torvalds2-0/+12
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial updates from Greg KH: "Here is the big set of tty and serial driver changes for 6.16-rc1. A little more churn than normal in this portion of the kernel for this development cycle, Jiri and Nicholas were busy with cleanups and reviews and fixes for the vt unicode handling logic which composed most of the overall work in here. Major changes are: - vt unicode changes/reverts/changes from Nicholas. This should help out a lot with screen readers and others that rely on vt console support - lock guard additions to the core tty/serial code to clean up lots of error handling logic - 8250 driver updates and fixes - device tree conversions to yaml - sh-sci driver updates - other small cleanups and updates for serial drivers and tty core portions All of these have been in linux-next for 2 weeks with no reported issues" * tag 'tty-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (105 commits) tty: serial: 8250_omap: fix TX with DMA for am33xx vt: add VT_GETCONSIZECSRPOS to retrieve console size and cursor position vt: bracketed paste support vt: remove VT_RESIZE and VT_RESIZEX from vt_compat_ioctl() vt: process the full-width ASCII fallback range programmatically vt: make use of ucs_get_fallback() when glyph is unavailable vt: add ucs_get_fallback() vt: create ucs_fallback_table.h_shipped with gen_ucs_fallback_table.py vt: introduce gen_ucs_fallback_table.py to create ucs_fallback_table.h vt: move glyph determination to a separate function vt: make sure displayed double-width characters are remembered as such vt: ucs.c: fix misappropriate in_range() usage serial: max3100: Replace open-coded parity calculation with parity8() dt-bindings: serial: 8250_omap: Drop redundant properties dt-bindings: serial: Convert socionext,milbeaut-usio-uart to DT schema dt-bindings: serial: Convert microchip,pic32mzda-uart to DT schema dt-bindings: serial: Convert arm,sbsa-uart to DT schema dt-bindings: serial: Convert snps,arc-uart to DT schema dt-bindings: serial: Convert marvell,armada-3700-uart to DT schema dt-bindings: serial: Convert lantiq,asc to DT schema ...
2025-06-06uapi: bitops: use UAPI-safe variant of BITS_PER_LONG againThomas Weißschuh1-2/+2
Commit 1e7933a575ed ("uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"") did not take in account that the usage of BITS_PER_LONG in __GENMASK() was changed to __BITS_PER_LONG for UAPI-safety in commit 3c7a8e190bc5 ("uapi: introduce uapi-friendly macros for GENMASK"). BITS_PER_LONG can not be used in UAPI headers as it derives from the kernel configuration and not from the current compiler invocation. When building compat userspace code or a compat vDSO its value will be incorrect. Switch back to __BITS_PER_LONG. Fixes: 1e7933a575ed ("uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)"") Cc: stable@vger.kernel.org Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
2025-06-05Merge tag 'net-6.16-rc1' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from CAN, wireless, Bluetooth, and Netfilter. Current release - regressions: - Revert "kunit: configs: Enable CONFIG_INIT_STACK_ALL_PATTERN in all_tests", makes kunit error out if compiler is old - wifi: iwlwifi: mvm: fix assert on suspend - rxrpc: fix return from none_validate_challenge() Current release - new code bugs: - ovpn: couple of fixes for socket cleanup and UDP-tunnel teardown - can: kvaser_pciefd: refine error prone echo_skb_max handling logic - fix net_devmem_bind_dmabuf() stub when DEVMEM not compiled - eth: airoha: fixes for config / accel in bridge mode Previous releases - regressions: - Bluetooth: hci_qca: move the SoC type check to the right place, fix GPIO integration - prevent a NULL deref in rtnl_create_link() after locking changes - fix udp gso skb_segment after pull from frag_list - hv_netvsc: fix potential deadlock in netvsc_vf_setxdp() Previous releases - always broken: - netfilter: - nf_nat: also check reverse tuple to obtain clashing entry - nf_set_pipapo_avx2: fix initial map fill (zeroing) - fix the helper for incremental update of packet checksums after modifying the IP address, used by ILA and BPF - eth: - stmmac: prevent div by 0 when clock rate is misconfigured - ice: fix Tx scheduler handling of XDP and changing queue count - eth: fix support for the RGMII interface when delays configured" * tag 'net-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits) calipso: unlock rcu before returning -EAFNOSUPPORT seg6: Fix validation of nexthop addresses net: prevent a NULL deref in rtnl_create_link() net: annotate data-races around cleanup_net_task selftests: drv-net: tso: make bkg() wait for socat to quit selftests: drv-net: tso: fix the GRE device name selftests: drv-net: add configs for the TSO test wireguard: device: enable threaded NAPI netlink: specs: rt-link: decode ip6gre netlink: specs: rt-link: add missing byte-order properties net: wwan: mhi_wwan_mbim: use correct mux_id for multiplexing wifi: cfg80211/mac80211: correctly parse S1G beacon optional elements net: dsa: b53: do not touch DLL_IQQD on bcm53115 net: dsa: b53: allow RGMII for bcm63xx RGMII ports net: dsa: b53: do not configure bcm63xx's IMP port interface net: dsa: b53: do not enable RGMII delay on bcm63xx net: dsa: b53: do not enable EEE on bcm63xx net: ti: icssg-prueth: Fix swapped TX stats for MII interfaces. selftests: netfilter: nft_nat.sh: add test for reverse clash with nat netfilter: nf_nat: also check reverse tuple to obtain clashing entry ...
2025-06-05bpf: Add cookie to raw_tp bpf_link_infoTao Chen1-0/+2
After commit 68ca5d4eebb8 ("bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs"), we can show the cookie in bpf_link_info like kprobe etc. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250603154309.3063644-1-chen.dylane@linux.dev
2025-06-04Merge tag 'pci-v6.16-changes' of ↵Linus Torvalds1-1/+11
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Print the actual delay time in pci_bridge_wait_for_secondary_bus() instead of assuming it was 1000ms (Wilfred Mallawa) - Revert 'iommu/amd: Prevent binding other PCI drivers to IOMMU PCI devices', which broke resume from system sleep on AMD platforms and has been fixed by other commits (Lukas Wunner) Resource management: - Remove mtip32xx use of pcim_iounmap_regions(), which is deprecated and unnecessary (Philipp Stanner) - Remove pcim_iounmap_regions() and pcim_request_region_exclusive() and related flags since all uses have been removed (Philipp Stanner) - Rework devres 'request' functions so they are no longer 'hybrid', i.e., their behavior no longer depends on whether pcim_enable_device or pci_enable_device() was used, and remove related code (Philipp Stanner) - Warn (not BUG()) about failure to assign optional resources (Ilpo Järvinen) Error handling: - Log the DPC Error Source ID only when it's actually valid (when ERR_FATAL or ERR_NONFATAL was received from a downstream device) and decode into bus/device/function (Bjorn Helgaas) - Determine AER log level once and save it so all related messages use the same level (Karolina Stolarek) - Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable Errors (Karolina Stolarek) - Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs controls on interval and burst count, to avoid flooding logs and RCU stall warnings (Jon Pan-Doh) Power management: - Increment PM usage counter when probing reset methods so we don't try to read config space of a powered-off device (Alex Williamson) - Set all devices to D0 during enumeration to ensure ACPI opregion is connected via _REG (Mario Limonciello) Power control: - Rename pwrctrl Kconfig symbols from 'PWRCTL' to 'PWRCTRL' to match the filename paths. Retain old deprecated symbols for compatibility, except for the pwrctrl slot driver (PCI_PWRCTRL_SLOT) (Johan Hovold) - When unregistering pwrctrl, cancel outstanding rescan work before cleaning up data structures to avoid use-after-free issues (Brian Norris) Bandwidth control: - Simplify link bandwidth controller by replacing the count of Link Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN flag (Ilpo Järvinen) - Update the Link Speed after retraining, since the Link Speed may have changed (Ilpo Järvinen) PCIe native device hotplug: - Ignore Presence Detect Changed caused by DPC. pciehp already ignores Link Down/Up events caused by DPC, but on slots using in-band presence detect, DPC causes a spurious Presence Detect Changed event (Lukas Wunner) - Ignore Link Down/Up caused by Secondary Bus Reset. On hotplug ports using in-band presence detect, the reset causes a Presence Detect Changed event, which mistakenly caused teardown and re-enumeration of the device. Drivers may need to annotate code that resets their device (Lukas Wunner) Virtualization: - Add an ACS quirk for Loongson Root Ports that don't advertise ACS but don't allow peer-to-peer transactions between Root Ports; the quirk allows each Root Port to be in a separate IOMMU group (Huacai Chen) Endpoint framework: - For fixed-size BARs, retain both the actual size and the possibly larger size allocated to accommodate iATU alignment requirements (Jerome Brunet) - Simplify ctrl/SPAD space allocation and avoid allocating more space than needed (Jerome Brunet) - Correct MSI-X PBA offset calculations for DesignWare and Cadence endpoint controllers (Niklas Cassel) - Align the return value (number of interrupts) encoding for pci_epc_get_msi()/pci_epc_ops::get_msi() and pci_epc_get_msix()/pci_epc_ops::get_msix() (Niklas Cassel) - Align the nr_irqs parameter encoding for pci_epc_set_msi()/pci_epc_ops::set_msi() and pci_epc_set_msix()/pci_epc_ops::set_msix() (Niklas Cassel) Common host controller library: - Convert pci-host-common to a library so platforms that don't need native host controller drivers don't need to include these helper functions (Manivannan Sadhasivam) Apple PCIe controller driver: - Extract ECAM bridge creation helper from pci_host_common_probe() to separate driver-specific things like MSI from PCI things (Marc Zyngier) - Dynamically allocate RID-to_SID bitmap to prepare for SoCs with varying capabilities (Marc Zyngier) - Skip ports disabled in DT when setting up ports (Janne Grunau) - Add t6020 compatible string (Alyssa Rosenzweig) - Add T602x PCIe support (Hector Martin) - Directly set/clear INTx mask bits because T602x dropped the accessors that could do this without locking (Marc Zyngier) - Move port PHY registers to their own reg items to accommodate T602x, which moves them around; retain default offsets for existing DTs that lack phy%d entries with the reg offsets (Hector Martin) - Stop polling for core refclk, which doesn't work on T602x and the bootloader has already done anyway (Hector Martin) - Use gpiod_set_value_cansleep() when asserting PERST# in probe because we're allowed to sleep there (Hector Martin) Cadence PCIe controller driver: - Drop a runtime PM 'put' to resolve a runtime atomic count underflow (Hans Zhang) - Make the cadence core buildable as a module (Kishon Vijay Abraham I) - Add cdns_pcie_host_disable() and cdns_pcie_ep_disable() for use by loadable drivers when they are removed (Siddharth Vadapalli) Freescale i.MX6 PCIe controller driver: - Apply link training workaround only on IMX6Q, IMX6SX, IMX6SP (Richard Zhu) - Remove redundant dw_pcie_wait_for_link() from imx_pcie_start_link(); since the DWC core does this, imx6 only needs it when retraining for a faster link speed (Richard Zhu) - Toggle i.MX95 core reset to align with PHY powerup (Richard Zhu) - Set SYS_AUX_PWR_DET to work around i.MX95 ERR051624 erratum: in some cases, the controller can't exit 'L23 Ready' through Beacon or PERST# deassertion (Richard Zhu) - Clear GEN3_ZRXDC_NONCOMPL to work around i.MX95 ERR051586 erratum: controller can't meet 2.5 GT/s ZRX-DC timing when operating at 8 GT/s, causing timeouts in L1 (Richard Zhu) - Wait for i.MX95 PLL lock before enabling controller (Richard Zhu) - Save/restore i.MX95 LUT for suspend/resume (Richard Zhu) Mobiveil PCIe controller driver: - Return bool (not int) for link-up check in mobiveil_pab_ops.link_up() and layerscape-gen4, mobiveil (Hans Zhang) NVIDIA Tegra194 PCIe controller driver: - Create debugfs directory for 'aspm_state_cnt' only when CONFIG_PCIEASPM is enabled, since there are no other entries (Hans Zhang) Qualcomm PCIe controller driver: - Add OF support for parsing DT 'eq-presets-<N>gts' property for lane equalization presets (Krishna Chaitanya Chundru) - Read Maximum Link Width from the Link Capabilities register if DT lacks 'num-lanes' property (Krishna Chaitanya Chundru) - Add Physical Layer 64 GT/s Capability ID and register offsets for 8, 32, and 64 GT/s lane equalization registers (Krishna Chaitanya Chundru) - Add generic dwc support for configuring lane equalization presets (Krishna Chaitanya Chundru) - Add DT and driver support for PCIe on IPQ5018 SoC (Nitheesh Sekar) Renesas R-Car PCIe controller driver: - Describe endpoint BAR 4 as being fixed size (Jerome Brunet) - Document how to obtain R-Car V4H (r8a779g0) controller firmware (Yoshihiro Shimoda) Rockchip PCIe controller driver: - Reorder rockchip_pci_core_rsts because reset_control_bulk_deassert() deasserts in reverse order, to fix a link training regression (Jensen Huang) - Mark RK3399 as being capable of raising INTx interrupts (Niklas Cassel) Rockchip DesignWare PCIe controller driver: - Check only PCIE_LINKUP, not LTSSM status, to determine whether the link is up (Shawn Lin) - Increase N_FTS (used in L0s->L0 transitions) and enable ASPM L0s for Root Complex and Endpoint modes (Shawn Lin) - Hide the broken ATS Capability in rockchip_pcie_ep_init() instead of rockchip_pcie_ep_pre_init() so it stays hidden after PERST# resets non-sticky registers (Shawn Lin) - Call phy_power_off() before phy_exit() in rockchip_pcie_phy_deinit() (Diederik de Haas) Synopsys DesignWare PCIe controller driver: - Set PORT_LOGIC_LINK_WIDTH to one lane to make initial link training more robust; this will not affect the intended link width if all lanes are functional (Wenbin Yao) - Return bool (not int) for link-up check in dw_pcie_ops.link_up() and armada8k, dra7xx, dw-rockchip, exynos, histb, keembay, keystone, kirin, meson, qcom, qcom-ep, rcar_gen4, spear13xx, tegra194, uniphier, visconti (Hans Zhang) - Add debugfs support for exposing DWC device-specific PTM context (Manivannan Sadhasivam) TI J721E PCIe driver: - Make j721e buildable as a loadable and removable module (Siddharth Vadapalli) - Fix j721e host/endpoint dependencies that result in link failures in some configs (Arnd Bergmann) Device tree bindings: - Add qcom DT binding for 'global' interrupt (PCIe controller and link-specific events) for ipq8074, ipq8074-gen3, ipq6018, sa8775p, sc7280, sc8180x sdm845, sm8150, sm8250, sm8350 (Manivannan Sadhasivam) - Add qcom DT binding for 8 MSI SPI interrupts for msm8998, ipq8074, ipq8074-gen3, ipq6018 (Manivannan Sadhasivam) - Add dw rockchip DT binding for rk3576 and rk3562 (Kever Yang) - Correct indentation and style of examples in brcm,stb-pcie, cdns,cdns-pcie-ep, intel,keembay-pcie-ep, intel,keembay-pcie, microchip,pcie-host, rcar-pci-ep, rcar-pci-host, xilinx-versal-cpm (Krzysztof Kozlowski) - Convert Marvell EBU (dove, kirkwood, armada-370, armada-xp) and armada8k from text to schema DT bindings (Rob Herring) - Remove obsolete .txt DT bindings for content that has been moved to schemas (Rob Herring) - Add qcom DT binding for MHI registers in IPQ5332, IPQ6018, IPQ8074 and IPQ9574 (Varadarajan Narayanan) - Convert v3,v360epc-pci from text to DT schema binding (Rob Herring) - Change microchip,pcie-host DT binding to be 'dma-noncoherent' since PolarFire may be configured that way (Conor Dooley) Miscellaneous: - Drop 'pci' suffix from intel_mid_pci.c filename to match similar files (Andy Shevchenko) - All platforms with PCI have an MMU, so add PCI Kconfig dependency on MMU to simplify build testing and avoid inadvertent build regressions (Arnd Bergmann) - Update Krzysztof Wilczyński's email address in MAINTAINERS (Krzysztof Wilczyński) - Update Manivannan Sadhasivam's email address in MAINTAINERS (Manivannan Sadhasivam)" * tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (147 commits) MAINTAINERS: Update Manivannan Sadhasivam email address PCI: j721e: Fix host/endpoint dependencies PCI: j721e: Add support to build as a loadable module PCI: cadence-ep: Introduce cdns_pcie_ep_disable() helper for cleanup PCI: cadence-host: Introduce cdns_pcie_host_disable() helper for cleanup PCI: cadence: Add support to build pcie-cadence library as a kernel module MAINTAINERS: Update Krzysztof Wilczyński email address PCI: Remove unnecessary linesplit in __pci_setup_bridge() PCI: WARN (not BUG()) when we fail to assign optional resources PCI: Remove unused pci_printk() PCI: qcom: Replace PERST# sleep time with proper macro PCI: dw-rockchip: Replace PERST# sleep time with proper macro PCI: host-common: Convert to library for host controller drivers PCI/ERR: Remove misleading TODO regarding kernel panic PCI: cadence: Remove duplicate message code definitions PCI: endpoint: Align pci_epc_set_msix(), pci_epc_ops::set_msix() nr_irqs encoding PCI: endpoint: Align pci_epc_set_msi(), pci_epc_ops::set_msi() nr_irqs encoding PCI: endpoint: Align pci_epc_get_msix(), pci_epc_ops::get_msix() return value encoding PCI: endpoint: Align pci_epc_get_msi(), pci_epc_ops::get_msi() return value encoding PCI: cadence-ep: Correct PBA offset in .set_msix() callback ...
2025-06-04Merge tag 'for-6.16/dm-changes' of ↵Linus Torvalds1-2/+7
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - better error handling when reloading a table - use use generic disable_* functions instead of open coding them - lock queue limits when reading them - remove unneeded kvfree from alloc_targets - fix BLK_FEAT_ATOMIC_WRITES - pass through operations on wrapped inline crypto keys - dm-verity: - use softirq context only when !need_resched() - fix a memory leak if some arguments are specified multiple times - dm-mpath: - interface for explicit probing of active paths - replace spin_lock_irqsave with spin_lock_irq - dm-delay: don't busy-wait in kthread - dm-bufio: remove maximum age based eviction - dm-flakey: various fixes - vdo indexer: don't read request structure after enqueuing - dm-zone: Use bdev_*() helper functions where applicable - dm-mirror: fix a tiny race condition - dm-stripe: small code cleanup * tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits) dm-stripe: small code cleanup dm-verity: fix a memory leak if some arguments are specified multiple times dm-mirror: fix a tiny race condition dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock dm mpath: replace spin_lock_irqsave with spin_lock_irq dm-mpath: Don't grab work_mutex while probing paths dm-zone: Use bdev_*() helper functions where applicable dm vdo indexer: don't read request structure after enqueuing dm: pass through operations on wrapped inline crypto keys blk-crypto: export wrapped key functions dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits dm mpath: Interface for explicit probing of active paths dm: Allow .prepare_ioctl to handle ioctls directly dm-flakey: make corrupting read bios work dm-flakey: remove useless ERROR_READS check in flakey_end_io dm-flakey: error all IOs when num_features is absent dm-flakey: Clean up parsing messages dm: remove unneeded kvfree from alloc_targets dm-bufio: remove maximum age based eviction dm-verity: use softirq context only when !need_resched() ...
2025-06-03Merge tag 'fuse-update-6.16' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Remove tmp page copying in writeback path (Joanne). This removes ~300 lines and with that a lot of complexity related to avoiding reclaim related deadlock. The old mechanism is replaced with a mapping flag that tells the MM not to block reclaim waiting for writeback to complete. The MM parts have been reviewed/acked by respective maintainers. - Convert more code to handle large folios (Joanne). This still just adds the code to deal with large folios and does not enable them yet. - Allow invalidating all cached lookups atomically (Luis Henriques). This feature is useful for CernVMFS, which currently does this iteratively. - Align write prefaulting in fuse with generic one (Dave Hansen) - Fix race causing invalid data to be cached when setting attributes on different nodes of a distributed fs (Guang Yuan Wu) - Update documentation for passthrough (Chen Linxuan) - Add fdinfo about the device number associated with an opened /dev/fuse instance (Chen Linxuan) - Increase readdir buffer size (Miklos). This depends on a patch to VFS readdir code that was already merged through Christians tree. - Optimize io-uring request expiration (Joanne) - Misc cleanups * tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) fuse: increase readdir buffer size readdir: supply dir_context.count as readdir buffer size hint fuse: don't allow signals to interrupt getdents copying fuse: support large folios for writeback fuse: support large folios for readahead fuse: support large folios for queued writes fuse: support large folios for stores fuse: support large folios for symlinks fuse: support large folios for folio reads fuse: support large folios for writethrough writes fuse: refactor fuse_fill_write_pages() fuse: support large folios for retrieves fuse: support copying large folios fs: fuse: add dev id to /dev/fuse fdinfo docs: filesystems: add fuse-passthrough.rst MAINTAINERS: update filter of FUSE documentation fuse: fix race between concurrent setattrs from multiple nodes fuse: remove tmp folio for writebacks and internal rb tree mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings fuse: optimize over-io-uring request expiration check ...
2025-06-01Merge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds2-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
2025-05-31ublk: have a per-io daemon instead of a per-queue daemonUday Shankar1-0/+9
Currently, ublk_drv associates to each hardware queue (hctx) a unique task (called the queue's ubq_daemon) which is allowed to issue COMMIT_AND_FETCH commands against the hctx. If any other task attempts to do so, the command fails immediately with EINVAL. When considered together with the block layer architecture, the result is that for each CPU C on the system, there is a unique ublk server thread which is allowed to handle I/O submitted on CPU C. This can lead to suboptimal performance under imbalanced load generation. For an extreme example, suppose all the load is generated on CPUs mapping to a single ublk server thread. Then that thread may be fully utilized and become the bottleneck in the system, while other ublk server threads are totally idle. This issue can also be addressed directly in the ublk server without kernel support by having threads dequeue I/Os and pass them around to ensure even load. But this solution requires inter-thread communication at least twice for each I/O (submission and completion), which is generally a bad pattern for performance. The problem gets even worse with zero copy, as more inter-thread communication would be required to have the buffer register/unregister calls to come from the correct thread. Therefore, address this issue in ublk_drv by allowing each I/O to have its own daemon task. Two I/Os in the same queue are now allowed to be serviced by different daemon tasks - this was not possible before. Imbalanced load can then be balanced across all ublk server threads by having the ublk server threads issue FETCH_REQs in a round-robin manner. As a small toy example, consider a system with a single ublk device having 2 queues, each of depth 4. A ublk server having 4 threads could issue its FETCH_REQs against this device as follows (where each entry is the qid,tag pair that the FETCH_REQ targets): ublk server thread: T0 T1 T2 T3 0,0 0,1 0,2 0,3 1,3 1,0 1,1 1,2 This setup allows for load that is concentrated on one hctx/ublk_queue to be spread out across all ublk server threads, alleviating the issue described above. Add the new UBLK_F_PER_IO_DAEMON feature to ublk_drv, which ublk servers can use to essentially test for the presence of this change and tailor their behavior accordingly. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-1-e9d3b119336a@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31bpf: Fix L4 csum update on IPv6 in CHECKSUM_COMPLETEPaul Chaignon1-0/+2
In Cilium, we use bpf_csum_diff + bpf_l4_csum_replace to, among other things, update the L4 checksum after reverse SNATing IPv6 packets. That use case is however not currently supported and leads to invalid skb->csum values in some cases. This patch adds support for IPv6 address changes in bpf_l4_csum_update via a new flag. When calling bpf_l4_csum_replace in Cilium, it ends up calling inet_proto_csum_replace_by_diff: 1: void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb, 2: __wsum diff, bool pseudohdr) 3: { 4: if (skb->ip_summed != CHECKSUM_PARTIAL) { 5: csum_replace_by_diff(sum, diff); 6: if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr) 7: skb->csum = ~csum_sub(diff, skb->csum); 8: } else if (pseudohdr) { 9: *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum))); 10: } 11: } The bug happens when we're in the CHECKSUM_COMPLETE state. We've just updated one of the IPv6 addresses. The helper now updates the L4 header checksum on line 5. Next, it updates skb->csum on line 7. It shouldn't. For an IPv6 packet, the updates of the IPv6 address and of the L4 checksum will cancel each other. The checksums are set such that computing a checksum over the packet including its checksum will result in a sum of 0. So the same is true here when we update the L4 checksum on line 5. We'll update it as to cancel the previous IPv6 address update. Hence skb->csum should remain untouched in this case. The same bug doesn't affect IPv4 packets because, in that case, three fields are updated: the IPv4 address, the IP checksum, and the L4 checksum. The change to the IPv4 address and one of the checksums still cancel each other in skb->csum, but we're left with one checksum update and should therefore update skb->csum accordingly. That's exactly what inet_proto_csum_replace_by_diff does. This special case for IPv6 L4 checksums is also described atop inet_proto_csum_replace16, the function we should be using in this case. This patch introduces a new bpf_l4_csum_replace flag, BPF_F_IPV6, to indicate that we're updating the L4 checksum of an IPv6 packet. When the flag is set, inet_proto_csum_replace_by_diff will skip the skb->csum update. Fixes: 7d672345ed295 ("bpf: add generic bpf_csum_diff helper") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/96a6bc3a443e6f0b21ff7b7834000e17fb549e05.1748509484.git.paul.chaignon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-29Merge tag 'platform-drivers-x86-v6.16-1' of ↵Linus Torvalds1-0/+26
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform drivers updates from Ilpo Järvinen: "The changes are mostly business as usual. Besides pdx86 changes, there are a few power supply changes needed for related pdx86 features, move of oxpec driver from hwmon (oxp-sensors) to pdx86, and one FW version warning to hid-asus. Highlights: - alienware-wmi-wmax: - Add HWMON support - Add ABI and admin-guide documentation - Expose GPIO debug methods through debug FS - Support manual fan control and "custom" thermal profile - amd/hsmp: - Add sysfs files to show HSMP telemetry - Report power readings and limits via hwmon - amd/isp4: Add AMD ISP platform config for OV05C10 - asus-wmi: - Refactor Ally suspend/resume to work better with older FW - hid-asus: check ROG Ally MCU version and warn about old FW versions - dasharo-acpi: - Add driver for Dasharo devices supporting fans and temperatures monitoring - dell-ddv: - Expose the battery health and manufacture date to userspace using power supply extensions - Implement the battery matching algorithm - dell-pc: - Improve error propagation - Use faux device - int3472: - Add delays to avoid GPIO regulator spikes - Add handshake pin support - Make regulator supply name configurable and allow registering more than 1 GPIO regulator - Map mt9m114 powerdown pin to powerenable - intel/pmc: Add separate SSRAM Telemetry driver - intel-uncore-freq: Add attributes to show agent types and die ID - ISST: - Support SST-TF revision 2 (allows more cores per bucket) - Support SST-PP revision 2 (fabric 1 frequencies) - Remove unnecessary SST MSRs restore (the package retains MSRs despite CPU offlining) - mellanox: Add support for SN2201, SN4280, SN5610, and SN5640 - mellanox: mlxbf-pmc: Support additional PMC blocks - oxpec: - Add OneXFly variants - Add support for charge limit, charge thresholds, and turbo LED - Distinguish current X1 variants to avoid unwanted matching to new variants - Follow hwmon conventions - Move from hwmon/oxp-sensors to platform/x86 to match the enlarged scope - power supply: - Add inhibit-charge-awake (needed by oxpec) - Add additional battery health status values ("blown fuse" and "cell imbalance") (needed by dell-ddv) - powerwell-ec: Add driver for Portwell EC supporting GPIO and watchdog - thinkpad-acpi: Support camera shutter switch hotkey - tuxedo: Add virtual LampArray for TUXEDO NB04 devices - tools/power/x86/intel-speed-select: - Support displaying SST-PP revision 2 fields - Skip uncore frequency update on newer generations of CPUs - Miscellaneous cleanups / refactoring / improvements" * tag 'platform-drivers-x86-v6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (112 commits) thermal/drivers/acerhdf: Constify struct thermal_zone_device_ops platform/x86/amd/hsmp: fix building with CONFIG_HWMON=m platform/x86: asus-wmi: fix build without CONFIG_SUSPEND docs: ABI: Fix "aassociated" to "associated" platform/x86: Add AMD ISP platform config for OV05C10 Documentation: admin-guide: pm: Add documentation for die_id platform/x86/intel-uncore-freq: Add attributes to show die_id platform/x86/intel: power-domains: Add interface to get Linux die ID Documentation: admin-guide: pm: Add documentation for agent_types platform/x86/intel-uncore-freq: Add attributes to show agent types platform/x86/tuxedo: Prevent invalid Kconfig state platform/x86: dell-ddv: Expose the battery health to userspace platform/x86: dell-ddv: Expose the battery manufacture date to userspace platform/x86: dell-ddv: Implement the battery matching algorithm power: supply: core: Add additional health status values platform/x86/amd/hsmp: acpi: Add sysfs files to display HSMP telemetry platform/x86/amd/hsmp: Report power via hwmon sensors platform/x86/amd/hsmp: Use a single DRIVER_VERSION for all hsmp modules platform/mellanox: mlxreg-dpu: Fix smatch warnings platform: mellanox: nvsw-sn2200: Fix .items in nvsw_sn2201_busbar_hotplug ...