summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2025-09-02mtd: rawnand: s3c2410: Drop driver (no actual S3C64xx user)Krzysztof Kozlowski1-70/+0
The s3c2410 NAND driver still supports S3C64xx platform, which in general is supported in the kernel. There are however no references of "s3c6400-nand" platform device ID or "s3c24xx-nand" driver, thus this driver cannot be instantiated for S3C64xx platform and is basically unused. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2025-09-02dmaengine: Fix dma_async_tx_descriptor->tx_submit documentationNathan Lynch1-1/+1
Commit 790fb9956eea ("linux/dmaengine.h: fix a few kernel-doc warnings") inserted new documentation for @desc_free in the middle of @tx_submit's description. Put @tx_submit's description back together, matching the indentation style of the rest of the documentation for dma_async_tx_descriptor. Fixes: 790fb9956eea ("linux/dmaengine.h: fix a few kernel-doc warnings") Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Nathan Lynch <nathan.lynch@amd.com> Link: https://lore.kernel.org/r/20250826-dma_async_tx_desc-tx_submit-doc-fix-v1-1-18a4b51697db@amd.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2025-09-02dmaengine: sh: setup_xref error handlingThomas Andreatta1-1/+1
This patch modifies the type of setup_xref from void to int and handles errors since the function can fail. `setup_xref` now returns the (eventual) error from `dmae_set_dmars`|`dmae_set_chcr`, while `shdma_tx_submit` handles the result, removing the chunks from the queue and marking PM as idle in case of an error. Signed-off-by: Thomas Andreatta <thomas.andreatta2000@gmail.com> Link: https://lore.kernel.org/r/20250827152442.90962-1-thomas.andreatta2000@gmail.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2025-09-01copy_process: pass clone_flags as u64 across calltreeSimon Schuster16-28/+28
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit, with a new type of u64 for the flags. However, for most consumers of clone_flags the interface was not changed from the previous type of unsigned long. While this works fine as long as none of the new 64-bit flag bits (CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still undesirable in terms of the principle of least surprise. Thus, this commit fixes all relevant interfaces of callees to sys_clone3/copy_process (excluding the architecture-specific copy_thread) to consistently pass clone_flags as u64, so that no truncation to 32-bit integers occurs on 32-bit architectures. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01fs: remove vfs_ioctl exportGreg Kroah-Hartman1-2/+0
vfs_ioctl() is no longer called by anything outside of fs/ioctl.c, so remove the global symbol and export as it is not needed. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/2025083038-carving-amuck-a4ae@gregkh Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01Merge tag 'fuse-fixes-6.17-rc5' of ↵Christian Brauner8-61/+29
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse into vfs.fixes fuse fixes for 6.17-rc5 * tag 'fuse-fixes-6.17-rc5' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (6 commits) fuse: Block access to folio overlimit fuse: fix fuseblk i_blkbits for iomap partial writes fuse: reflect cached blocksize if blocksize was changed fuse: prevent overflow in copy_file_range return value fuse: check if copy_file_range() returns larger than requested size fuse: do not allow mapping a non-regular backing file Link: https://lore.kernel.org/CAJfpeguEVMMyw_zCb+hbOuSxdE2Z3Raw=SJsq=Y56Ae6dn2W3g@mail.gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01fs: add an icount_read helperJosef Bacik1-0/+5
Instead of doing direct access to ->i_count, add a helper to handle this. This will make it easier to convert i_count to a refcount later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/9bc62a84c6b9d6337781203f60837bd98fbc4a96.1756222464.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-31iio: frequency: adf4350: Fix ADF4350_REG3_12BIT_CLKDIV_MODEMichael Hennerich1-1/+1
The clk div bits (2 bits wide) do not start in bit 16 but in bit 15. Fix it accordingly. Fixes: e31166f0fd48 ("iio: frequency: New driver for Analog Devices ADF4350/ADF4351 Wideband Synthesizers") Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Nuno Sá <nuno.sa@analog.com> Link: https://patch.msgid.link/20250829-adf4350-fix-v2-2-0bf543ba797d@analog.com Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2025-08-30audit: add record for multiple object contextsCasey Schaufler1-0/+7
Create a new audit record AUDIT_MAC_OBJ_CONTEXTS. An example of the MAC_OBJ_CONTEXTS record is: type=MAC_OBJ_CONTEXTS msg=audit(1601152467.009:1050): obj_selinux=unconfined_u:object_r:user_home_t:s0 When an audit event includes a AUDIT_MAC_OBJ_CONTEXTS record the "obj=" field in other records in the event will be "obj=?". An AUDIT_MAC_OBJ_CONTEXTS record is supplied when the system has multiple security modules that may make access decisions based on an object security context. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subj tweak, audit example readability indents] Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30audit: add record for multiple task security contextsCasey Schaufler1-0/+16
Replace the single skb pointer in an audit_buffer with a list of skb pointers. Add the audit_stamp information to the audit_buffer as there's no guarantee that there will be an audit_context containing the stamp associated with the event. At audit_log_end() time create auxiliary records as have been added to the list. Functions are created to manage the skb list in the audit_buffer. Create a new audit record AUDIT_MAC_TASK_CONTEXTS. An example of the MAC_TASK_CONTEXTS record is: type=MAC_TASK_CONTEXTS msg=audit(1600880931.832:113) subj_apparmor=unconfined subj_smack=_ When an audit event includes a AUDIT_MAC_TASK_CONTEXTS record the "subj=" field in other records in the event will be "subj=?". An AUDIT_MAC_TASK_CONTEXTS record is supplied when the system has multiple security modules that may make access decisions based on a subject security context. Refactor audit_log_task_context(), creating a new audit_log_subj_ctx(). This is used in netlabel auditing to provide multiple subject security contexts as necessary. Suggested-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subj tweak, audit example readability indents] Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30lsm: security_lsmblob_to_secctx module selectionCasey Schaufler1-2/+4
Add a parameter lsmid to security_lsmblob_to_secctx() to identify which of the security modules that may be active should provide the security context. If the value of lsmid is LSM_ID_UNDEF the first LSM providing a hook is used. security_secid_to_secctx() is unchanged, and will always report the first LSM providing a hook. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subj tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30inet_diag: avoid cache line misses in inet_diag_bc_sk()Eric Dumazet1-0/+5
inet_diag_bc_sk() pulls five cache lines per socket, while most filters only need the two first ones. Add three booleans to struct inet_diag_dump_data, that are selectively set if a filter needs specific socket fields. - mark_needed /* INET_DIAG_BC_MARK_COND present. */ - cgroup_needed /* INET_DIAG_BC_CGROUP_COND present. */ - userlocks_needed /* INET_DIAG_BC_AUTO present. */ This removes millions of cache lines misses per ss invocation when simple filters are specified on busy servers. offsetof(struct sock, sk_userlocks) = 0xf3 offsetof(struct sock, sk_mark) = 0x20c offsetof(struct sock, sk_cgrp_data) = 0x298 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-30inet_diag: change inet_diag_bc_sk() first argumentEric Dumazet1-1/+1
We want to have access to the inet_diag_dump_data structure in the following patch. This patch removes duplication in callers. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250828102738.2065992-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29pppoe: remove rwlock usageQingfang Deng1-1/+1
Like ppp_generic.c, convert the PPPoE socket hash table to use RCU for lookups and a spinlock for updates. This removes rwlock usage and allows lockless readers on the fast path. - Mark hash table and list pointers as __rcu. - Use spin_lock() to protect writers. - Readers use rcu_dereference() under rcu_read_lock(). All known callers of get_item() already hold the RCU read lock, so no additional locking is needed. - get_item() now uses refcount_inc_not_zero() instead of sock_hold() to safely take a reference. This prevents crashes if a socket is already in the process of being freed (sk_refcnt == 0). - Set SOCK_RCU_FREE to defer socket freeing until after an RCU grace period. - Move skb_queue_purge() into sk_destruct callback to ensure purge happens after an RCU grace period. Signed-off-by: Qingfang Deng <dqfext@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250828012018.15922-1-dqfext@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski13-65/+38
Cross-merge networking fixes after downstream PR (net-6.17-rc4). No conflicts. Adjacent changes: drivers/net/ethernet/intel/idpf/idpf_txrx.c 02614eee26fb ("idpf: do not linearize big TSO packets") 6c4e68480238 ("idpf: remove obsolete stashing code") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-29Add RWF_NOSIGNAL flag for pwritev2Lauri Vasama1-0/+1
For a user mode library to avoid generating SIGPIPE signals (e.g. because this behaviour is not portable across operating systems) is cumbersome. It is generally bad form to change the process-wide signal mask in a library, so a local solution is needed instead. For I/O performed directly using system calls (synchronous or readiness based asynchronous) this currently involves applying a thread-specific signal mask before the operation and reverting it afterwards. This can be avoided when it is known that the file descriptor refers to neither a pipe nor a socket, but a conservative implementation must always apply the mask. This incurs the cost of two additional system calls. In the case of sockets, the existing MSG_NOSIGNAL flag can be used with send. For asynchronous I/O performed using io_uring, currently the only option (apart from MSG_NOSIGNAL for sockets), is to mask SIGPIPE entirely in the call to io_uring_enter. Thankfully io_uring_enter takes a signal mask, so only a single syscall is needed. However, copying the signal mask on every call incurs a non-zero performance penalty. Furthermore, this mask applies to all completions, meaning that if the non-signaling behaviour is desired only for some subset of operations, the desired signals must be raised manually from user-mode depending on the completed operation. Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets. Signed-off-by: Lauri Vasama <git@vasama.org> Link: https://lore.kernel.org/20250827133901.1820771-1-git@vasama.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-29fs: make the i_state flags an enumJosef Bacik1-112/+119
Adjusting i_state flags always means updating the values manually. Bring these forward into the 2020's and make a nice clean macro for defining the i_state values as an enum, providing __ variants for the cases where we need the bit position instead of the actual value, and leaving the actual NAME as the 1U << bit value. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/0da9348da6ece0dce12fccec07b1dd2b8e4cfdab.1756222464.git.josef@toxicpanda.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-29drivers: firmware: xilinx: Switch to new family code in ↵Jay Buddhabhatti1-13/+2
zynqmp_pm_get_family_info() Currently, the family code and subfamily code are derived from the PMC_TAP_IDCODE register. Versal, Versal NET share the same family code. Also some platforms share the same subfamily code, making it difficult to distinguish between platforms. Update zynqmp_pm_get_family_info() to use IDs derived from the compatible string instead of silicon ID codes derived from PMC_TAP_IDCODE register. Signed-off-by: Jay Buddhabhatti <jay.buddhabhatti@amd.com> Link: https://lore.kernel.org/r/20250701123851.1314531-4-jay.buddhabhatti@amd.com Signed-off-by: Michal Simek <michal.simek@amd.com>
2025-08-29drivers: firmware: xilinx: Add unique family code for all platformsJay Buddhabhatti1-0/+5
The family code is currently derived from the PMC_TAP_IDCODE register value, but there are issues where Versal, Versal NET, and future platforms share the same family code. Additionally for some platforms have identical subfamily code, making it challenging to differentiate between platforms based on the family and subfamily codes. To resolve this, a new family code member is added to the platform data, initialized with unique values. This change enables better platform distinction via the compatible string. Signed-off-by: Jay Buddhabhatti <jay.buddhabhatti@amd.com> Link: https://lore.kernel.org/r/20250701123851.1314531-3-jay.buddhabhatti@amd.com Signed-off-by: Michal Simek <michal.simek@amd.com>
2025-08-29firmware: xilinx: Add debugfs support for PM_GET_NODE_STATUSMadhav Bhatt1-1/+11
Add new debug interface to support PM_GET_NODE_STATUS to get the node information like requirements and usage. The debugfs firmware driver interface is only meant for testing and debugging EEMI APIs. Hence, it is by-default disabled in production systems. Signed-off-by: Madhav Bhatt <madhav.bhatt@amd.com> Link: https://lore.kernel.org/r/20250417094543.3873507-1-madhav.bhatt@amd.com Signed-off-by: Michal Simek <michal.simek@amd.com>
2025-08-29Merge tag 'net-6.17-rc4' of ↵Linus Torvalds2-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from Bluetooth. Current release - regressions: - ipv4: fix regression in local-broadcast routes - vsock: fix error-handling regression introduced in v6.17-rc1 Previous releases - regressions: - bluetooth: - mark connection as closed during suspend disconnect - fix set_local_name race condition - eth: - ice: fix NULL pointer dereference on reset - mlx5: fix memory leak in hws_pool_buddy_init error path - bnxt_en: fix stats context reservation logic - hv: fix loss of receive events from host during channel open Previous releases - always broken: - page_pool: fix incorrect mp_ops error handling - sctp: initialize more fields in sctp_v6_from_sk() - eth: - octeontx2-vf: fix max packet length errors - idpf: fix Tx flow scheduling to avoid Tx timeouts - bnxt_en: fix memory corruption during ifdown - ice: fix incorrect counter for buffer allocation failures - mlx5: fix lockdep assertion on sync reset unload event - fbnic: fixup rtnl_lock and devl_lock handling - xgmac: do not enable RX FIFO overflow interrupts - phy: mscc: fix when PTP clock is register and unregister Misc: - add Telit Cinterion LE910C4-WWX new compositions" * tag 'net-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (60 commits) net: ipv4: fix regression in local-broadcast routes net: macb: Disable clocks once fbnic: Move phylink resume out of service_task and into open/close fbnic: Fixup rtnl_lock and devl_lock handling related to mailbox code net: rose: fix a typo in rose_clear_routes() l2tp: do not use sock_hold() in pppol2tp_session_get_sock() sctp: initialize more fields in sctp_v6_from_sk() MAINTAINERS: rmnet: Update email addresses net: rose: include node references in rose_neigh refcount net: rose: convert 'use' field to refcount_t net: rose: split remove and free operations in rose_remove_neigh() net: hv_netvsc: fix loss of early receive events from host during channel open. net: stmmac: Set CIC bit only for TX queues with COE net: stmmac: xgmac: Correct supported speed modes net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts net/mlx5e: Set local Xoff after FW update net/mlx5e: Update and set Xon/Xoff upon port speed set net/mlx5e: Update and set Xon/Xoff upon MTU set net/mlx5: Prevent flow steering mode changes in switchdev mode net/mlx5: Nack sync reset when SFs are present ...
2025-08-29Merge tag 'dma-mapping-6.17-2025-08-28' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: - another small fix for arm64 systems with memory encryption (Shanker Donthineni) - fix for arm32 systems with non-standard CMA configuration (Oreoluwa Babatunde) * tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
2025-08-29Merge tag 'fixes-2025-08-28' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fixes from Mike Rapoport: - printk cleanups in memblock and numa_memblks - update kernel-doc for MEMBLOCK_RSRV_NOINIT to be more accurate and detailed * tag 'fixes-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: fix kernel-doc for MEMBLOCK_RSRV_NOINIT mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversion mm/numa_memblks: Use pr_debug instead of printk(KERN_DEBUG)
2025-08-28x86/apic: Add new driver for Secure AVICNeeraj Upadhyay1-0/+8
The Secure AVIC feature provides SEV-SNP guests hardware acceleration for performance sensitive APIC accesses while securely managing the guest-owned APIC state through the use of a private APIC backing page. This helps prevent the hypervisor from generating unexpected interrupts for a vCPU or otherwise violate architectural assumptions around the APIC behavior. Add a new x2APIC driver that will serve as the base of the Secure AVIC support. It is initially the same as the x2APIC physical driver (without IPI callbacks), but will be modified as features are implemented. As the new driver does not implement Secure AVIC features yet, if the hypervisor sets the Secure AVIC bit in SEV_STATUS, maintain the existing behavior to enforce the guest termination. [ bp: Massage commit message. ] Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828070334.208401-2-Neeraj.Upadhyay@amd.com
2025-08-28mtd: spinand: add support for FudanMicro FM25S01ATianling Shen1-0/+1
Add support for FudanMicro FM25S01A SPI NAND. Datasheet: http://eng.fmsh.com/nvm/FM25S01A_ds_eng.pdf Signed-off-by: Tianling Shen <cnsztl@gmail.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2025-08-28mtd: nand: qpic-common: remove a bunch of unused definesGabor Juhos1-14/+0
A bunch of definitions in the 'nand-qpic-common.h' header became unused after the conversion of the 'qcom_nandc' and 'spi-qpic-snand' drivers to use the FIELD_PREP() macro, so remove those. No functional changes. Signed-off-by: Gabor Juhos <j4g8y7@gmail.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2025-08-28inet: raw: add drop_counters to raw socketsEric Dumazet1-1/+1
When a packet flood hits one or more RAW sockets, many cpus have to update sk->sk_drops. This slows down other cpus, because currently sk_drops is in sock_write_rx group. Add a socket_drop_counters structure to raw sockets. Using dedicated cache lines to hold drop counters makes sure that consumers no longer suffer from false sharing if/when producers only change sk->sk_drops. This adds 128 bytes per RAW socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28udp: add drop_counters to udp socketEric Dumazet1-0/+1
When a packet flood hits one or more UDP sockets, many cpus have to update sk->sk_drops. This slows down other cpus, because currently sk_drops is in sock_write_rx group. Add a socket_drop_counters structure to udp sockets. Using dedicated cache lines to hold drop counters makes sure that consumers no longer suffer from false sharing if/when producers only change sk->sk_drops. This adds 128 bytes per UDP socket. Tested with the following stress test, sending about 11 Mpps to a dual socket AMD EPYC 7B13 64-Core. super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120 Note: due to socket lookup, only one UDP socket is receiving packets on DUT. Then measure receiver (DUT) behavior. We can see both consumer and BH handlers can process more packets per second. Before: nstat -n ; sleep 1 ; nstat | grep Udp Udp6InDatagrams 615091 0.0 Udp6InErrors 3904277 0.0 Udp6RcvbufErrors 3904277 0.0 After: nstat -n ; sleep 1 ; nstat | grep Udp Udp6InDatagrams 816281 0.0 Udp6InErrors 7497093 0.0 Udp6RcvbufErrors 7497093 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-5-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28net: add sk_drops_skbadd() helperEric Dumazet1-1/+1
Existing sk_drops_add() helper is renamed to sk_drops_skbadd(). Add sk_drops_add() and convert sk_drops_inc() to use it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-28iopoll: Reorder the timeout handling in poll_timeout_us()Ville Syrjälä1-13/+11
Currently poll_timeout_us() evaluates 'op' and 'cond' twice within the loop, once at the start, and a second time after the timeout check. While it's probably not a big deal to do it twice almost back to back, it does make the macro a bit messy. Simplify the implementation to evaluate the timeout at the very start, then follow up with 'op'/'cond', and finally check if the timeout did in fact happen or not. For good measure throw in a compiler barrier between the timeout and 'op'/'cond' evaluations to make sure the compiler can't reoder the operations (which could cause false positive timeouts). The similar i915 __wait_for() macro already has the barrier, though there it is between the 'op' and 'cond' evaluations, which seems like it could still allow 'op' and the timeout evaluations to get reordered incorrectly. I suppose the ktime_get() might itself act as a sufficient barrier here, but better safe than sorry I guess. Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Dibin Moolakadan Subrahmanian <dibin.moolakadan.subrahmanian@intel.com> Cc: Imre Deak <imre.deak@intel.com> Cc: David Laight <david.laight.linux@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Matt Wagantall <mattw@codeaurora.org> Cc: Dejin Zheng <zhengdejin5@gmail.com> Cc: intel-gfx@lists.freedesktop.org Cc: intel-xe@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://lore.kernel.org/r/20250826121859.15497-3-ville.syrjala@linux.intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-08-28iopoll: Avoid evaluating 'cond' twice in poll_timeout_us()Ville Syrjälä1-4/+18
Currently poll_timeout_us() evaluates 'cond' twice at the end of the success case. This not desirable in case 'cond' itself is expensive. Avoid the double evaluation by tracking the return value in a variable. Need to use a triple undescore '___ret' name to avoid a conflict with an existing double undescore '__ret' variable in the regmap code. Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Dibin Moolakadan Subrahmanian <dibin.moolakadan.subrahmanian@intel.com> Cc: Imre Deak <imre.deak@intel.com> Cc: David Laight <david.laight.linux@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Matt Wagantall <mattw@codeaurora.org> Cc: Dejin Zheng <zhengdejin5@gmail.com> Cc: intel-gfx@lists.freedesktop.org Cc: intel-xe@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://lore.kernel.org/r/20250826121859.15497-2-ville.syrjala@linux.intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-08-28iopoll: Generalize read_poll_timeout() into poll_timeout_us()Ville Syrjälä1-32/+78
While read_poll_timeout() & co. were originally introduced just for simple I/O usage scenarios they have since been generalized to be useful in more cases. However the interface is very cumbersome to use in the general case. Attempt to make it more flexible by combining the 'op', 'var' and 'args' parameter into just a single 'op' that the caller can fully specify. For example i915 has one case where one might currently have to write something like: ret = read_poll_timeout(drm_dp_dpcd_read_byte, err, err || (status & mask), 0 * 1000, 200 * 1000, false, aux, DP_FEC_STATUS, &status); which is practically illegible, but with the adjusted macro we do: ret = poll_timeout_us(err = drm_dp_dpcd_read_byte(aux, DP_FEC_STATUS, &status), err || (status & mask), 0 * 1000, 200 * 1000, false); which much easier to understand. One could even combine the 'op' and 'cond' parameters into one, but that might make the caller a bit too unwieldly with assignments and checks being done on the same statement. This makes poll_timeout_us() closer to the i915 __wait_for() macro, with the main difference being that __wait_for() uses expenential backoff as opposed to the fixed polling interval used by poll_timeout_us(). Eventually we might be able to switch (at least most of) i915 to use poll_timeout_us(). v2: Fix typos (Jani) Fix delay_us docs for poll_timeout_us_atomic() (Jani) Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Dibin Moolakadan Subrahmanian <dibin.moolakadan.subrahmanian@intel.com> Cc: Imre Deak <imre.deak@intel.com> Cc: David Laight <david.laight.linux@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Matt Wagantall <mattw@codeaurora.org> Cc: Dejin Zheng <zhengdejin5@gmail.com> Cc: intel-gfx@lists.freedesktop.org Cc: intel-xe@lists.freedesktop.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://lore.kernel.org/r/20250826121859.15497-1-ville.syrjala@linux.intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-08-28mm: introduce and use {pgd,p4d}_populate_kernel()Harry Yoo2-6/+36
Introduce and use {pgd,p4d}_populate_kernel() in core MM code when populating PGD and P4D entries for the kernel address space. These helpers ensure proper synchronization of page tables when updating the kernel portion of top-level page tables. Until now, the kernel has relied on each architecture to handle synchronization of top-level page tables in an ad-hoc manner. For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes"). However, this approach has proven fragile for following reasons: 1) It is easy to forget to perform the necessary page table synchronization when introducing new changes. For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps") overlooked the need to synchronize page tables for the vmemmap area. 2) It is also easy to overlook that the vmemmap and direct mapping areas must not be accessed before explicit page table synchronization. For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")) caused crashes by accessing the vmemmap area before calling sync_global_pgds(). To address this, as suggested by Dave Hansen, introduce _kernel() variants of the page table population helpers, which invoke architecture-specific hooks to properly synchronize page tables. These are introduced in a new header file, include/linux/pgalloc.h, so they can be called from common code. They reuse existing infrastructure for vmalloc and ioremap. Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK, and the actual synchronization is performed by arch_sync_kernel_mappings(). This change currently targets only x86_64, so only PGD and P4D level helpers are introduced. Currently, these helpers are no-ops since no architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK. In theory, PUD and PMD level helpers can be added later if needed by other architectures. For now, 32-bit architectures (x86-32 and arm) only handle PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect them unless we introduce a PMD level helper. [harry.yoo@oracle.com: fix KASAN build error due to p*d_populate_kernel()] Link: https://lkml.kernel.org/r/20250822020727.202749-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250818020206.4517-3-harry.yoo@oracle.com Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: bibo mao <maobibo@loongson.cn> Cc: Borislav Betkov <bp@alien8.de> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-28mm: move page table sync declarations to linux/pgtable.hHarry Yoo2-16/+16
During our internal testing, we started observing intermittent boot failures when the machine uses 4-level paging and has a large amount of persistent memory: BUG: unable to handle page fault for address: ffffe70000000034 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP NOPTI RIP: 0010:__init_single_page+0x9/0x6d Call Trace: <TASK> __init_zone_device_page+0x17/0x5d memmap_init_zone_device+0x154/0x1bb pagemap_range+0x2e0/0x40f memremap_pages+0x10b/0x2f0 devm_memremap_pages+0x1e/0x60 dev_dax_probe+0xce/0x2ec [device_dax] dax_bus_probe+0x6d/0xc9 [... snip ...] </TASK> It turns out that the kernel panics while initializing vmemmap (struct page array) when the vmemmap region spans two PGD entries, because the new PGD entry is only installed in init_mm.pgd, but not in the page tables of other tasks. And looking at __populate_section_memmap(): if (vmemmap_can_optimize(altmap, pgmap)) // does not sync top level page tables r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap); else // sync top level page tables in x86 r = vmemmap_populate(start, end, nid, altmap); In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c synchronizes the top level page table (See commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so that all tasks in the system can see the new vmemmap area. However, when vmemmap_can_optimize() returns true, the optimized path skips synchronization of top-level page tables. This is because vmemmap_populate_compound_pages() is implemented in core MM code, which does not handle synchronization of the top-level page tables. Instead, the core MM has historically relied on each architecture to perform this synchronization manually. We're not the first party to encounter a crash caused by not-sync'd top level page tables: earlier this year, Gwan-gyeong Mun attempted to address the issue [1] [2] after hitting a kernel panic when x86 code accessed the vmemmap area before the corresponding top-level entries were synced. At that time, the issue was believed to be triggered only when struct page was enlarged for debugging purposes, and the patch did not get further updates. It turns out that current approach of relying on each arch to handle the page table sync manually is fragile because 1) it's easy to forget to sync the top level page table, and 2) it's also easy to overlook that the kernel should not access the vmemmap and direct mapping areas before the sync. # The solution: Make page table sync more code robust and harder to miss To address this, Dave Hansen suggested [3] [4] introducing {pgd,p4d}_populate_kernel() for updating kernel portion of the page tables and allow each architecture to explicitly perform synchronization when installing top-level entries. With this approach, we no longer need to worry about missing the sync step, reducing the risk of future regressions. The new interface reuses existing ARCH_PAGE_TABLE_SYNC_MASK, PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by vmalloc and ioremap to synchronize page tables. pgd_populate_kernel() looks like this: static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd, p4d_t *p4d) { pgd_populate(&init_mm, pgd, p4d); if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED) arch_sync_kernel_mappings(addr, addr); } It is worth noting that vmalloc() and apply_to_range() carefully synchronizes page tables by calling p*d_alloc_track() and arch_sync_kernel_mappings(), and thus they are not affected by this patch series. This series was hugely inspired by Dave Hansen's suggestion and hence added Suggested-by: Dave Hansen. Cc stable because lack of this series opens the door to intermittent boot failures. This patch (of 3): Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to linux/pgtable.h so that they can be used outside of vmalloc and ioremap. Link: https://lkml.kernel.org/r/20250818020206.4517-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250818020206.4517-2-harry.yoo@oracle.com Link: https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com [1] Link: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com [2] Link: https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com [3] Link: https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com [4] Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: bibo mao <maobibo@loongson.cn> Cc: Borislav Betkov <bp@alien8.de> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-28kexec: add KEXEC_FILE_NO_CMA as a legal flagBrian Mak1-1/+2
Commit 07d24902977e ("kexec: enable CMA based contiguous allocation") introduces logic to use CMA-based allocation in kexec by default. As part of the changes, it introduces a kexec_file_load flag to disable the use of CMA allocations from userspace. However, this flag is broken since it is missing from the list of legal flags for kexec_file_load. kexec_file_load returns EINVAL when attempting to use the flag. Fix this by adding the KEXEC_FILE_NO_CMA flag to the list of legal flags for kexec_file_load. Without this fix, kexec_file_load syscall will failed and return '-EINVAL' when KEXEC_FILE_NO_CMA is specified. Link: https://lkml.kernel.org/r/20250805211527.122367-2-makb@juniper.net Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Brian Mak <makb@juniper.net> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Young <dyoung@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-08-28bpf: Improve the general precision of tnum_mulNandakumar Edamana1-0/+3
Drop the value-mask decomposition technique and adopt straightforward long-multiplication with a twist: when LSB(a) is uncertain, find the two partial products (for LSB(a) = known 0 and LSB(a) = known 1) and take a union. Experiment shows that applying this technique in long multiplication improves the precision in a significant number of cases (at the cost of losing precision in a relatively lower number of cases). Signed-off-by: Nandakumar Edamana <nandakumar@nandakumar.co.in> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20250826034524.2159515-1-nandakumar@nandakumar.co.in
2025-08-27Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-2/+0
Pull virtio/vhost fixes from Michael Tsirkin: "More small fixes. Most notably this fixes a messed up ioctl number, and a regression in shmem affecting drm users" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio_net: adjust the execution order of function `virtnet_close` during freeze virtio_input: Improve freeze handling vhost: Fix ioctl # for VHOST_[GS]ET_FORK_FROM_OWNER Revert "virtio: reject shm region if length is zero" vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put() virtio_pci: Fix misleading comment for queue vector
2025-08-27Merge tag 'media/v6.17-2' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: - drop the redundant pm_runtime_mark_last_busy() in rkvdec - fix probing error handling in rkvdec - fix an issue affecting lt6911uxe/lt6911uxc related to CSI-2 GPIO pins in int3472 * tag 'media/v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: media: Remove redundant pm_runtime_mark_last_busy() calls platform/x86: int3472: add hpd pin support media: rkvdec: Remove redundant pm_runtime_mark_last_busy() calls media: rkvdec: Fix an error handling path in rkvdec_probe() media: rkvdec: Fix a NULL vs IS_ERR() bug in probe()
2025-08-27pinctrl: Add pinctrl_pm_select_init_state helper functionChristian Bruel1-0/+10
If a platform requires an initial pinctrl state during probing, this helper function provides the client with access to the same initial state. eg: xxx_suspend_noirq ... pinctrl_pm_select_sleep_state xxx resume_noirq pinctrl_pm_select_init_state ... pinctrl_pm_select_default_state Signed-off-by: Christian Bruel <christian.bruel@foss.st.com> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://patch.msgid.link/20250820075411.1178729-3-christian.bruel@foss.st.com
2025-08-27mm: remove BDI_CAP_WRITEBACK_ACCTJoanne Koong1-3/+1
There are no users of BDI_CAP_WRITEBACK_ACCT now that fuse doesn't do its own writeback accounting. This commit removes BDI_CAP_WRITEBACK_ACCT. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27fuse: use default writeback accountingJoanne Koong1-10/+0
commit 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree") removed temp folios for dirty page writeback. Consequently, fuse can now use the default writeback accounting. With switching fuse to use default writeback accounting, there are some added benefits. This updates wb->writeback_inodes tracking as well now and updates writeback throughput estimates after writeback completion. This commit also removes inc_wb_stat() and dec_wb_stat(). These have no callers anymore now that fuse does not call them. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27sched/wait: Add wait_event_state_exclusive()Sergey Senozhatsky1-0/+12
Allows exclusive waits with a custom @state. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27KVM: guest_memfd: Track guest_memfd mmap support in memslotFuad Tabba1-1/+10
Add a new internal flag, KVM_MEMSLOT_GMEM_ONLY, to the top half of memslot->flags (which makes it strictly for KVM's internal use). This flag tracks when a guest_memfd-backed memory slot supports host userspace mmap operations, which implies that all memory, not just private memory for CoCo VMs, is consumed through guest_memfd: "gmem only". This optimization avoids repeatedly checking the underlying guest_memfd file for mmap support, which would otherwise require taking and releasing a reference on the file for each check. By caching this information directly in the memslot, we reduce overhead and simplify the logic involved in handling guest_memfd-backed pages for host mappings. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: guest_memfd: Add plumbing to host to map guest_memfd pagesFuad Tabba1-0/+4
Introduce the core infrastructure to enable host userspace to mmap() guest_memfd-backed memory. This is needed for several evolving KVM use cases: * Non-CoCo VM backing: Allows VMMs like Firecracker to run guests entirely backed by guest_memfd, even for non-CoCo VMs [1]. This provides a unified memory management model and simplifies guest memory handling. * Direct map removal for enhanced security: This is an important step for direct map removal of guest memory [2]. By allowing host userspace to fault in guest_memfd pages directly, we can avoid maintaining host kernel direct maps of guest memory. This provides additional hardening against Spectre-like transient execution attacks by removing a potential attack surface within the kernel. * Future guest_memfd features: This also lays the groundwork for future enhancements to guest_memfd, such as supporting huge pages and enabling in-place sharing of guest memory with the host for CoCo platforms that permit it [3]. Enable the basic mmap and fault handling logic within guest_memfd, but hold off on allow userspace to actually do mmap() until the architecture support is also in place. [1] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding [2] https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com [3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/T/#u Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Co-developed-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit buildsFuad Tabba1-7/+2
Enable KVM_GUEST_MEMFD for all KVM x86 64-bit builds, i.e. for "default" VM types when running on 64-bit KVM. This will allow using guest_memfd to back non-private memory for all VM shapes, by supporting mmap() on guest_memfd. Opportunistically clean up various conditionals that become tautologies once x86 selects KVM_GUEST_MEMFD more broadly. Specifically, because SW protected VMs, SEV, and TDX are all 64-bit only, private memory no longer needs to take explicit dependencies on KVM_GUEST_MEMFD, because it is effectively a prerequisite. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: Fix comment that refers to kvm uapi header pathFuad Tabba1-1/+1
The comment that points to the path where the user-visible memslot flags are refers to an outdated path and has a typo. Update the comment to refer to the correct path. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: Fix comments that refer to slots_lockFuad Tabba1-1/+1
Fix comments so that they refer to slots_lock instead of slots_locks (remove trailing s). Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()Fuad Tabba1-1/+1
Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve clarity and accurately reflect its purpose. The function kvm_slot_can_be_private() was previously used to check if a given kvm_memory_slot is backed by guest_memfd. However, its name implied that the memory in such a slot was exclusively "private". As guest_memfd support expands to include non-private memory (e.g., shared host mappings), it's important to remove this association. The new name, kvm_slot_has_gmem(), states that the slot is backed by guest_memfd without making assumptions about the memory's privacy attributes. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_HAVE_KVM_ARCH_GMEM_POPULATEFuad Tabba1-1/+1
The original name was vague regarding its functionality. This Kconfig option specifically enables and gates the kvm_gmem_populate() function, which is responsible for populating a GPA range with guest data. The new name, HAVE_KVM_ARCH_GMEM_POPULATE, describes the purpose of the option: to enable arch-specific guest_memfd population mechanisms. It also follows the same pattern as the other HAVE_KVM_ARCH_* configuration options. This improves clarity for developers and ensures the name accurately reflects the functionality it controls, especially as guest_memfd support expands beyond purely "private" memory scenarios. Temporarily keep KVM_GENERIC_PRIVATE_MEM as an x86-only config so as to minimize churn, and to hopefully make it easier to see what features require HAVE_KVM_ARCH_GMEM_POPULATE. On that note, omit GMEM_POPULATE for KVM_X86_SW_PROTECTED_VM, as regular ol' memset() suffices for software-protected VMs. As for KVM_GENERIC_PRIVATE_MEM, a future change will select KVM_GUEST_MEMFD for all 64-bit KVM builds, at which point the intermediate config will become obsolete and can/will be dropped. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFDFuad Tabba1-7/+7
Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables guest_memfd in general, which is not exclusively for private memory. Subsequent patches in this series will add guest_memfd support for non-CoCo VMs, whose memory is not private. Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately reflects its broader scope as the main Kconfig option for all guest_memfd-backed memory. This provides clearer semantics for the option and avoids confusion as new features are introduced. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>