summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2025-07-29Merge tag 'vfs-6.17-rc1.super' of ↵Linus Torvalds2-0/+20
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock callback update from Christian Brauner: "Currently all filesystems which implement super_operations::shutdown() can not afford losing a device. Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the involved filesystem. But it will no longer be the case, as multi-device filesystems like btrfs can handle certain device loss without the need to shutdown the whole filesystem. To allow those multi-device filesystems to be integrated to use fs_holder_ops: - Add a new super_operations::remove_bdev() callback - Try ->remove_bdev() callback first inside fs_bdev_mark_dead(). If the callback returned 0, meaning the fs can handling the device loss, then exit without doing anything else. If there is no such callback or the callback returned non-zero value, continue to shutdown the filesystem as usual. This means the new remove_bdev() should only do the check on whether the operation can continue, and if so do the fs specific handlings. The shutdown handling should still be handled by the existing ->shutdown() callback. For all existing filesystems with shutdown callback, there is no change to the code nor behavior. Btrfs is going to implement both the ->remove_bdev() and ->shutdown() callbacks soon" * tag 'vfs-6.17-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add a new remove_bdev() callback
2025-07-29Merge tag 'vfs-6.17-rc1.fileattr' of ↵Linus Torvalds64-410/+752
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fileattr updates from Christian Brauner: "This introduces the new file_getattr() and file_setattr() system calls after lengthy discussions. Both system calls serve as successors and extensible companions to the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have started to show their age in addition to being named in a way that makes it easy to conflate them with extended attribute related operations. These syscalls allow userspace to set filesystem inode attributes on special files. One of the usage examples is the XFS quota projects. XFS has project quotas which could be attached to a directory. All new inodes in these directories inherit project ID set on parent directory. The project is created from userspace by opening and calling FS_IOC_FSSETXATTR on each inode. This is not possible for special files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left with empty project ID. Those inodes then are not shown in the quota accounting but still exist in the directory. This is not critical but in the case when special files are created in the directory with already existing project quota, these new inodes inherit extended attributes. This creates a mix of special files with and without attributes. Moreover, special files with attributes don't have a possibility to become clear or change the attributes. This, in turn, prevents userspace from re-creating quota project on these existing files. In addition, these new system calls allow the implementation of additional attributes that we couldn't or didn't want to fit into the legacy ioctls anymore" * tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: tighten a sanity check in file_attr_to_fileattr() tree-wide: s/struct fileattr/struct file_kattr/g fs: introduce file_getattr and file_setattr syscalls fs: prepare for extending file_get/setattr() fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP selinux: implement inode_file_[g|s]etattr hooks lsm: introduce new hooks for setting/getting inode fsxattr fs: split fileattr related helpers into separate file
2025-07-29Merge tag 'vfs-6.17-rc1.integrity' of ↵Linus Torvalds14-31/+209
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs 'protection info' updates from Christian Brauner: "This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and protection info (PI) capabilities. This ioctl returns information about the files integrity profile. This is useful for userspace applications to understand a files end-to-end data protection support and configure the I/O accordingly. For now this interface is only supported by block devices. However the design and placement of this ioctl in generic FS ioctl space allows us to extend it to work over files as well. This maybe useful when filesystems start supporting PI-aware layouts. A new structure struct logical_block_metadata_cap is introduced, which contains the following fields: - lbmd_flags: bitmask of logical block metadata capability flags - lbmd_interval: the amount of data described by each unit of logical block metadata - lbmd_size: size in bytes of the logical block metadata associated with each interval - lbmd_opaque_size: size in bytes of the opaque block tag associated with each interval - lbmd_opaque_offset: offset in bytes of the opaque block tag within the logical block metadata - lbmd_pi_size: size in bytes of the T10 PI tuple associated with each interval - lbmd_pi_offset: offset in bytes of T10 PI tuple within the logical block metadata - lbmd_pi_guard_tag_type: T10 PI guard tag type - lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag - lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag - lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag The internal logic to fetch the capability is encapsulated in a helper function blk_get_meta_cap(), which uses the blk_integrity profile associated with the device. The ioctl returns -EOPNOTSUPP, if CONFIG_BLK_DEV_INTEGRITY is not enabled" * tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl() fs: add ioctl to query metadata and protection info capabilities nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE block: introduce pi_tuple_size field in blk_integrity block: rename tuple_size field in blk_integrity to metadata_size
2025-07-29Merge tag 'vfs-6.17-rc1.rust' of ↵Linus Torvalds4-42/+41
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs rust updates from Christian Brauner: - Allow poll_table pointers to be NULL - Add Rust files to vfs MAINTAINERS entry * tag 'vfs-6.17-rc1.rust' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: add Rust files to MAINTAINERS poll: rust: allow poll_table ptrs to be null
2025-07-29Merge tag 'vfs-6.17-rc1.bpf' of ↵Linus Torvalds11-33/+486
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs bpf updates from Christian Brauner: "These changes allow bpf to read extended attributes from cgroupfs. This is useful in redirecting AF_UNIX socket connections based on cgroup membership of the socket. One use-case is the ability to implement log namespaces in systemd so services and containers are redirected to different journals" * tag 'vfs-6.17-rc1.bpf' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/kernfs: test xattr retrieval selftests/bpf: Add tests for bpf_cgroup_read_xattr bpf: Mark cgroup_subsys_state->cgroup RCU safe bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node kernfs: remove iattr_mutex
2025-07-29Merge tag 'vfs-6.17-rc1.pidfs' of ↵Linus Torvalds26-381/+894
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - persistent info Persist exit and coredump information independent of whether anyone currently holds a pidfd for the struct pid. The current scheme allocated pidfs dentries on-demand repeatedly. This scheme is reaching it's limits as it makes it impossible to pin information that needs to be available after the task has exited or coredumped and that should not be lost simply because the pidfd got closed temporarily. The next opener should still see the stashed information. This is also a prerequisite for supporting extended attributes on pidfds to allow attaching meta information to them. If someone opens a pidfd for a struct pid a pidfs dentry is allocated and stashed in pid->stashed. Once the last pidfd for the struct pid is closed the pidfs dentry is released and removed from pid->stashed. So if 10 callers create a pidfs dentry for the same struct pid sequentially, i.e., each closing the pidfd before the other creates a new one then a new pidfs dentry is allocated every time. Because multiple tasks acquiring and releasing a pidfd for the same struct pid can race with each another a task may still find a valid pidfs entry from the previous task in pid->stashed and reuse it. Or it might find a dead dentry in there and fail to reuse it and so stashes a new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry but only one will succeed, the other ones will put their dentry. The current scheme aims to ensure that a pidfs dentry for a struct pid can only be created if the task is still alive or if a pidfs dentry already existed before the task was reaped and so exit information has been was stashed in the pidfs inode. That's great except that it's buggy. If a pidfs dentry is stashed in pid->stashed after pidfs_exit() but before __unhash_process() is called we will return a pidfd for a reaped task without exit information being available. The pidfds_pid_valid() check does not guard against this race as it doens't sync at all with pidfs_exit(). The pid_has_task() check might be successful simply because we're before __unhash_process() but after pidfs_exit(). Introduce a new scheme where the lifetime of information associated with a pidfs entry (coredump and exit information) isn't bound to the lifetime of the pidfs inode but the struct pid itself. The first time a pidfs dentry is allocated for a struct pid a struct pidfs_attr will be allocated which will be used to store exit and coredump information. If all pidfs for the pidfs dentry are closed the dentry and inode can be cleaned up but the struct pidfs_attr will stick until the struct pid itself is freed. This will ensure minimal memory usage while persisting relevant information. The new scheme has various advantages. First, it allows to close the race where we end up handing out a pidfd for a reaped task for which no exit information is available. Second, it minimizes memory usage. Third, it allows to remove complex lifetime tracking via dentries when registering a struct pid with pidfs. There's no need to get or put a reference. Instead, the lifetime of exit and coredump information associated with a struct pid is bound to the lifetime of struct pid itself. - extended attributes Now that we have a way to persist information for pidfs dentries we can start supporting extended attributes on pidfds. This will allow userspace to attach meta information to tasks. One natural extension would be to introduce a custom pidfs.* extended attribute space and allow for the inheritance of extended attributes across fork() and exec(). The first simple scheme will allow privileged userspace to set trusted extended attributes on pidfs inodes. - Allow autonomous pidfs file handles Various filesystems such as pidfs and drm support opening file handles without having to require a file descriptor to identify the filesystem. The filesystem are global single instances and can be trivially identified solely on the information encoded in the file handle. This makes it possible to not have to keep or acquire a sentinal file descriptor just to pass it to open_by_handle_at() to identify the filesystem. That's especially useful when such sentinel file descriptor cannot or should not be acquired. For pidfs this means a file handle can function as full replacement for storing a pid in a file. Instead a file handle can be stored and reopened purely based on the file handle. Such autonomous file handles can be opened with or without specifying a a file descriptor. If no proper file descriptor is used the FD_PIDFS_ROOT sentinel must be passed. This allows us to define further special negative fd sentinels in the future. Userspace can trivially test for support by trying to open the file handle with an invalid file descriptor. - Allow pidfds for reaped tasks with SCM_PIDFD messages This is a logical continuation of the earlier work to create pidfds for reaped tasks through the SO_PEERPIDFD socket option merged in 923ea4d4482b ("Merge patch series "net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid""). - Two minor fixes: * Fold fs_struct->{lock,seq} into a seqlock * Don't bother with path_{get,put}() in unix_open_file() * tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits) don't bother with path_get()/path_put() in unix_open_file() fold fs_struct->{lock,seq} into a seqlock selftests: net: extend SCM_PIDFD test to cover stale pidfds af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD af_unix: stash pidfs dentry when needed af_unix/scm: fix whitespace errors af_unix: introduce and use scm_replace_pid() helper af_unix: introduce unix_skb_to_scm helper af_unix: rework unix_maybe_add_creds() to allow sleep selftests/pidfd: decode pidfd file handles withou having to specify an fd fhandle, pidfs: support open_by_handle_at() purely based on file handle uapi/fcntl: add FD_PIDFS_ROOT uapi/fcntl: add FD_INVALID fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP uapi/fcntl: mark range as reserved fhandle: reflow get_path_anchor() pidfs: add pidfs_root_path() helper fhandle: rename to get_path_anchor() fhandle: hoist copy_from_user() above get_path_from_fd() fhandle: raise FILEID_IS_DIR in handle_type ...
2025-07-28rv: Add opid per-cpu monitorGabriele Monaco9-0/+399
Add a per-cpu monitor as part of the sched model: * opid: operations with preemption and irq disabled Monitor to ensure wakeup and need_resched occur with irq and preemption disabled or in irq handlers. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20250728135022.255578-10-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Acked-by: Nam Cao <namcao@linutronix.de> Tested-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28rv: Add nrp and sssw per-task monitorsGabriele Monaco15-0/+728
Add 2 per-task monitors as part of the sched model: * nrp: need-resched preempts Monitor to ensure preemption requires need resched. * sssw: set state sleep and wakeup Monitor to ensure sched_set_state to sleepable leads to sleeping and sleeping tasks require wakeup. Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20250728135022.255578-9-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Acked-by: Nam Cao <namcao@linutronix.de> Tested-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28rv: Replace tss and sncid monitors with more complete stsGabriele Monaco18-410/+385
The tss monitor currently guarantees task switches can happen only while scheduling, whereas the sncid monitor enforces scheduling occurs with interrupt disabled. Replace the monitors with a more comprehensive specification which implies both but also ensures that: * each scheduler call disable interrupts to switch * each task switch happens with interrupts disabled Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Nam Cao <namcao@linutronix.de> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20250728135022.255578-8-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28sched: Adapt sched tracepoints for RV task modelGabriele Monaco8-18/+34
Add the following tracepoint: * sched_set_need_resched(tsk, cpu, tif) Called when a task is set the need resched [lazy] flag Remove the unused ip parameter from sched_entry and sched_exit and alter sched_entry to have a value of preempt consistent with the one used in sched_switch. Also adapt all monitors using sched_{entry,exit} to avoid breaking build. These tracepoints are useful to describe the Linux task model and are adapted from the patches by Daniel Bristot de Oliveira (https://bristot.me/linux-task-model/). Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Nam Cao <namcao@linutronix.de> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20250728135022.255578-7-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28rv: Retry when da monitor detects race conditionsGabriele Monaco4-54/+85
DA monitor can be accessed from multiple cores simultaneously, this is likely, for instance when dealing with per-task monitors reacting on events that do not always occur on the CPU where the task is running. This can cause race conditions where two events change the next state and we see inconsistent values. E.g.: [62] event_srs: 27: sleepable x sched_wakeup -> running (final) [63] event_srs: 27: sleepable x sched_set_state_sleepable -> sleepable [63] error_srs: 27: event sched_switch_suspend not expected in the state running In this case the monitor fails because the event on CPU 62 wins against the one on CPU 63, although the correct state should have been sleepable, since the task get suspended. Detect if the current state was modified by using try_cmpxchg while storing the next value. If it was, try again reading the current state. After a maximum number of failed retries, react by calling a special tracepoint, print on the console and reset the monitor. Remove the functions da_monitor_curr_state() and da_monitor_set_state() as they only hide the underlying implementation in this case. Monitors where this type of condition can occur must be able to account for racing events in any possible order, as we cannot know the winner. Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20250728135022.255578-6-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Reviewed-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28rv: Adjust monitor dependenciesGabriele Monaco4-4/+4
RV monitors relying on the preemptirqs tracepoints are set as dependent on PREEMPT_TRACER and IRQSOFF_TRACER. In fact, those configurations do enable the tracepoints but are not the minimal configurations enabling them, which are TRACE_PREEMPT_TOGGLE and TRACE_IRQFLAGS (not selectable manually). Set TRACE_PREEMPT_TOGGLE and TRACE_IRQFLAGS as dependencies for monitors. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20250728135022.255578-5-gmonaco@redhat.com Fixes: fbe6c09b7eb4 ("rv: Add scpd, snep and sncid per-cpu monitors") Acked-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28rv: Use strings in da monitors tracepointsGabriele Monaco1-38/+38
Using DA monitors tracepoints with KASAN enabled triggers the following warning: BUG: KASAN: global-out-of-bounds in do_trace_event_raw_event_event_da_monitor+0xd6/0x1a0 Read of size 32 at addr ffffffffaada8980 by task ... Call Trace: <TASK> [...] do_trace_event_raw_event_event_da_monitor+0xd6/0x1a0 ? __pfx_do_trace_event_raw_event_event_da_monitor+0x10/0x10 ? trace_event_sncid+0x83/0x200 trace_event_sncid+0x163/0x200 [...] The buggy address belongs to the variable: automaton_snep+0x4e0/0x5e0 This is caused by the tracepoints reading 32 bytes __array instead of __string from the automata definition. Such strings are literals and reading 32 bytes ends up in out of bound memory accesses (e.g. the next automaton's data in this case). The error is harmless as, while printing the string, we stop at the null terminator, but it should still be fixed. Use the __string facilities while defining the tracepoints to avoid reading out of bound memory. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20250728135022.255578-4-gmonaco@redhat.com Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros") Reviewed-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28rv: Remove trailing whitespace from tracepoint stringGabriele Monaco1-4/+4
RV event tracepoints print a line with the format: "event_xyz: S0 x event -> S1 " "event_xyz: S1 x event -> S0 (final)" While printing an event leading to a non-final state, the line has a trailing white space (visible above before the closing "). Adapt the format string not to print the trailing whitespace if we are not printing "(final)". Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20250728135022.255578-3-gmonaco@redhat.com Reviewed-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28rv: Add da_handle_start_run_event_ to per-task monitorsGabriele Monaco1-0/+24
The RV da_monitor API allows to start monitors in two ways: da_handle_start_event_NAME and da_handle_start_run_event_NAME. The former is used when the event is followed by the initial state of the module, so we ignore the event but we know the monitor is in the initial state and can start monitoring, the latter can be used if the event can only occur in the initial state, so we do handle the event as if the monitor was in the initial state. This latter API is defined for implicit monitors but not per-task ones. Define da_handle_start_run_event_NAME macro also for per-task monitors. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tomas Glozar <tglozar@redhat.com> Cc: Juri Lelli <jlelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Link: https://lore.kernel.org/20250728135022.255578-2-gmonaco@redhat.com Reviewed-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-28Merge tag 'vfs-6.17-rc1.mmap_prepare' of ↵Linus Torvalds64-187/+281
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mmap_prepare updates from Christian Brauner: "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). This is preferred to the existing f_op->mmap() hook as it does require a VMA to be established yet, thus allowing the mmap logic to invoke this hook far, far earlier, prior to inserting a VMA into the virtual address space, or performing any other heavy handed operations. This allows for much simpler unwinding on error, and for there to be a single attempt at merging a VMA rather than having to possibly reattempt a merge based on potentially altered VMA state. Far more importantly, it prevents inappropriate manipulation of incompletely initialised VMA state, which is something that has been the cause of bugs and complexity in the past. The intent is to gradually deprecate f_op->mmap, and in that vein this series coverts the majority of file systems to using f_op->mmap_prepare. Prerequisite steps are taken - firstly ensuring all checks for mmap capabilities use the file_has_valid_mmap_hooks() helper rather than directly checking for f_op->mmap (which is now not a valid check) and secondly updating daxdev_mapping_supported() to not require a VMA parameter to allow ext4 and xfs to be converted. Commit bb666b7c2707 ("mm: add mmap_prepare() compatibility layer for nested file systems") handles the nasty edge-case of nested file systems like overlayfs, which introduces a compatibility shim to allow f_op->mmap_prepare() to be invoked from an f_op->mmap() callback. This allows for nested filesystems to continue to function correctly with all file systems regardless of which callback is used. Once we finally convert all file systems, this shim can be removed. As a result, ecryptfs, fuse, and overlayfs remain unaltered so they can nest all other file systems. We additionally do not update resctl - as this requires an update to remap_pfn_range() (or an alternative to it) which we defer to a later series, equally we do not update cramfs which needs a mixed mapping insertion with the same issue, nor do we update procfs, hugetlbfs, syfs or kernfs all of which require VMAs for internal state and hooks. We shall return to all of these later" * tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: update porting, vfs documentation to describe mmap_prepare() fs: replace mmap hook with .mmap_prepare for simple mappings fs: convert most other generic_file_*mmap() users to .mmap_prepare() fs: convert simple use of generic_file_*_mmap() to .mmap_prepare() mm/filemap: introduce generic_file_*_mmap_prepare() helpers fs/xfs: transition from deprecated .mmap hook to .mmap_prepare fs/ext4: transition from deprecated .mmap hook to .mmap_prepare fs/dax: make it possible to check dev dax support without a VMA fs: consistently use can_mmap_file() helper mm/nommu: use file_has_valid_mmap_hooks() helper mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
2025-07-28drm/amdgpu: update mmhub 4.1.0 client id mappingsAlex Deucher1-21/+13
Update the client id mapping so the correct clients get printed when there is a mmhub page fault. Tested-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-28Merge tag 'vfs-6.17-rc1.fallocate' of ↵Linus Torvalds14-44/+212
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fallocate updates from Christian Brauner: "fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may leads to significant write amplification and performance degradation in synchronous write mode. At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth. Now that more and more flash-based storage devices are available it is possible to efficiently write zeros to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command does not write actual data to the device, instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth. This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices. fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes" * tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: add FALLOC_FL_WRITE_ZEROES support block: add FALLOC_FL_WRITE_ZEROES support block: factor out common part in blkdev_fallocate() fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate dm: clear unmap write zeroes limits when disabling write zeroes scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP nvmet: set WZDS and DRB if device enables unmap write zeroes operation nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
2025-07-28Merge tag 'vfs-6.17-rc1.async.dir' of ↵Linus Torvalds4-39/+17
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull async directory updates from Christian Brauner: "This contains preparatory changes for the asynchronous directory locking scheme. While the locking scheme is still very much controversial and we're still far away from landing any actual changes in that area the preparatory work that we've been upstreaming for a while now has been very useful. This is another set of minor changes and cleanups" * tag 'vfs-6.17-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exportfs: use lookup_one_unlocked() coda: use iterate_dir() in coda_readdir() VFS: Minor fixes for porting.rst VFS: merge lookup_one_qstr_excl_raw() back into lookup_one_qstr_excl()
2025-07-28drm/amd/display: Allow DCN301 to clear update flagsIvan Lipski1-1/+2
[Why & How] Not letting DCN301 to clear after surface/stream update results in artifacts when switching between active overlay planes. The issue is known and has been solved initially. See below: (https://gitlab.freedesktop.org/drm/amd/-/issues/3441) Fixes: f354556e29f4 ("drm/amd/display: limit clear_update_flags t dcn32 and above") Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amd/display: Pass up errors for reset GPU that fails to init HWMario Limonciello1-1/+3
[Why] If a GPU is in reset and the hardware fails to initialize the rest of the resume sequence shouldn't be run. [How] Pass error code up to caller of dm_resume(). Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-28drm/amd/display: Only finalize atomic_obj if it was initializedMario Limonciello1-1/+2
[Why] If amdgpu_dm failed to initalize before amdgpu_dm_initialize_drm_device() completed then freeing atomic_obj will lead to list corruption. [How] Check if atomic_obj state is initialized before trying to free. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amd/display: Avoid configuring PSR granularity if PSR-SU not supportedMario Limonciello1-2/+4
[Why] If PSR-SU is disabled on the link, then configuring su_y granularity in mod_power_calc_psr_configs() can lead to assertions in psr_su_set_dsc_slice_height(). [How] Check the PSR version in amdgpu_dm_link_setup_psr() to determine whether or not to configure granularity. Reviewed-by: Sun peng (Leo) Li <sunpeng.li@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amd/display: Disable dsc_power_gate for dcn314 by defaultRoman Li1-0/+1
[Why] "REG_WAIT timeout 1us * 1000 tries - dcn314_dsc_pg_control line" warnings seen after resuming from s2idle. DCN314 has issues with DSC power gating that cause REG_WAIT timeouts when attempting to power down DSC blocks. [How] Disable dsc_power_gate for dcn314 by default. Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Signed-off-by: Roman Li <Roman.Li@amd.com> Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amdgpu: add kicker fws loading for gfx12/smu14/psp14Frank Min5-10/+29
1. Add kicker firmwares loading for gfx12/smu14/psp14 2. Register additional MODULE_FIRMWARE entries for kicker fws - gc_12_0_1_rlc_kicker.bin - gc_12_0_1_imu_kicker.bin - psp_14_0_3_sos_kicker.bin - psp_14_0_3_ta_kicker.bin - smu_14_0_3_kicker.bin Signed-off-by: Frank Min <Frank.Min@amd.com> Reviewed-by: Gui Chengming <Jack.Gui@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-28drm/amd/amdgpu: fix missing lock for cper.ring->rptr/wptr accessYang Wang1-2/+4
Add lock protection for 'ring->wptr'/'ring->rptr' to ensure the correct execution. Fixes: 8652920d2c00 ("drm/amdgpu: add mutex lock for cper ring") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-28drm/amd/display: Fix misuse of /** to /* in 'dce_i2c_hw.c'Srinivasan Shanmugam1-1/+7
Fix the comment style before cntl_stuck_hw_workaround() by replacing '/**' with '/*' since it is not a kdoc comment. Fixes the below with gcc W=1: display/dc/dce/dce_i2c_hw.c:380: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * If we boot without an HDMI display, the I2C engine does not get initialized Fixes: 04d57f4462a6 ("drm/amd/display: Workaround for stuck I2C arbitrage") Cc: Alvin Lee <alvin.lee2@amd.com> Cc: Dominik Kaszewski <dominik.kaszewski@amd.com> Cc: Ivan Lipski <ivan.lipski@amd.com> Cc: Harry Wentland <harry.wentland@amd.com> Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: Alex Hung <alex.hung@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amd/display: fix initial backlight brightness calculationLauri Tirkkonen1-2/+2
DIV_ROUND_CLOSEST(x, 100) returns either 0 or 1 if 0<x<=100, so the division needs to be performed after the multiplication and not the other way around, to properly scale the value. Fixes: 8b5f3a229a70 ("drm/amd/display: Fix default DC and AC levels") Signed-off-by: Lauri Tirkkonen <lauri@hacktheplanet.fi> Cc: stable@vger.kernel.org Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/aH2Q_HJvxKbW74vU@hacktheplanet.fi Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amdgpu: Avoid extra evict-restore process.Gang Ba1-4/+2
If vm belongs to another process, this is fclose after fork, wait may enable signaling KFD eviction fence and cause parent process queue evicted. [677852.634569] amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu] [677852.634814] __dma_fence_enable_signaling+0x3e/0xe0 [677852.634820] dma_fence_wait_timeout+0x3a/0x140 [677852.634825] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl] [677852.634831] amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu] [677852.635026] amdgpu_flush+0x34/0x50 [amdgpu] [677852.635208] filp_flush+0x38/0x90 [677852.635213] filp_close+0x14/0x30 [677852.635216] do_close_on_exec+0xdd/0x130 [677852.635221] begin_new_exec+0x1da/0x490 [677852.635225] load_elf_binary+0x307/0xea0 [677852.635231] ? srso_alias_return_thunk+0x5/0xfbef5 [677852.635235] ? ima_bprm_check+0xa2/0xd0 [677852.635240] search_binary_handler+0xda/0x260 [677852.635245] exec_binprm+0x58/0x1a0 [677852.635249] bprm_execve.part.0+0x16f/0x210 [677852.635254] bprm_execve+0x45/0x80 [677852.635257] do_execveat_common.isra.0+0x190/0x200 Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Gang Ba <Gang.Ba@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-28drm/amdgpu: track whether a queue is a kernel queue in amdgpu_mqd_propAlex Deucher2-0/+2
Used to to set the MQD appropriately for each queue type. Kernel queues have additional privileges. Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.16.x
2025-07-28drm/amdgpu: check if hubbub is NULL in debugfs/amdgpu_dm_capabilitiesPeter Shkenev1-1/+1
HUBBUB structure is not initialized on DCE hardware, so check if it is NULL to avoid null dereference while accessing amdgpu_dm_capabilities file in debugfs. Signed-off-by: Peter Shkenev <mustela@erminea.space> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-28drm/amdgpu: Initialize data to NULL in imu_v12_0_program_rlc_ram()Nathan Chancellor1-1/+1
After a recent change in clang to expose uninitialized warnings from const variables and pointers [1], there is a warning in imu_v12_0_program_rlc_ram() because data is passed uninitialized to program_imu_rlc_ram(): drivers/gpu/drm/amd/amdgpu/imu_v12_0.c:374:30: error: variable 'data' is uninitialized when used here [-Werror,-Wuninitialized] 374 | program_imu_rlc_ram(adev, data, (const u32)size); | ^~~~ As this warning happens early in clang's frontend, it does not realize that due to the assignment of r to -EINVAL, program_imu_rlc_ram() is never actually called, and even if it were, data would not be dereferenced because size is 0. Just initialize data to NULL to silence the warning, as the commit that added program_imu_rlc_ram() mentioned it would eventually be used over the old method, at which point data can be properly initialized and used. Cc: stable@vger.kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/2107 Fixes: 56159fffaab5 ("drm/amdgpu: use new method to program rlc ram") Link: https://github.com/llvm/llvm-project/commit/2464313eef01c5b1edf0eccf57a32cdee01472c7 [1] Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28drm/amd/display: Fix divide by zero when calculating min ODM factorDillon Varone1-14/+15
[WHY&HOW] If the debug option is set to disable_dsc the max slice width and/or dispclk can be zero. This causes a divide by zero when calculating the min ODM combine factor. Add a check to ensure they are valid first. Reviewed-by: Wenjing Liu <wenjing.liu@amd.com> Signed-off-by: Dillon Varone <dillon.varone@amd.com> Signed-off-by: Wayne Lin <wayne.lin@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2025-07-28Merge tag 'vfs-6.17-rc1.nsfs' of ↵Linus Torvalds6-13/+47
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains namespace updates. This time specifically for nsfs: - Userspace heavily relies on the root inode numbers for namespaces to identify the initial namespaces. That's already a hard dependency. So we cannot change that anymore. Move the initial inode numbers to a public header and align the only two namespaces that currently don't do that with all the other namespaces. - The root inode of /proc having a fixed inode number has been part of the core kernel ABI since its inception, and recently some userspace programs (mainly container runtimes) have started to explicitly depend on this behaviour. The main reason this is useful to userspace is that by checking that a suspect /proc handle has fstype PROC_SUPER_MAGIC and is PROCFS_ROOT_INO, they can then use openat2() together with RESOLVE_{NO_{XDEV,MAGICLINK},BENEATH} to ensure that there isn't a bind-mount that replaces some procfs file with a different one. This kind of attack has lead to security issues in container runtimes in the past (such as CVE-2019-19921) and libraries like libpathrs[1] use this feature of procfs to provide safe procfs handling functions" * tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: uapi: export PROCFS_ROOT_INO mntns: use stable inode number for initial mount ns netns: use stable inode number for initial mount ns nsfs: move root inode number to uapi
2025-07-28Merge tag 'vfs-6.17-rc1.ovl' of ↵Linus Torvalds14-225/+302
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull overlayfs updates from Christian Brauner: "This contains overlayfs updates for this cycle. The changes for overlayfs in here are primarily focussed on preparing for some proposed changes to directory locking. Overlayfs currently will sometimes lock a directory on the upper filesystem and do a few different things while holding the lock. This is incompatible with the new potential scheme. This series narrows the region of code protected by the directory lock, taking it multiple times when necessary. This theoretically opens up the possibilty of other changes happening on the upper filesytem between the unlock and the lock. To some extent the patches guard against that by checking the dentries still have the expect parent after retaking the lock. In general, concurrent changes to the upper and lower filesystems aren't supported properly anyway" * tag 'vfs-6.17-rc1.ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits) ovl: properly print correct variable ovl: rename ovl_cleanup_unlocked() to ovl_cleanup() ovl: change ovl_create_real() to receive dentry parent ovl: narrow locking in ovl_check_rename_whiteout() ovl: narrow locking in ovl_whiteout() ovl: change ovl_cleanup_and_whiteout() to take rename lock as needed ovl: narrow locking on ovl_remove_and_whiteout() ovl: change ovl_workdir_cleanup() to take dir lock as needed. ovl: narrow locking in ovl_workdir_cleanup_recurse() ovl: narrow locking in ovl_indexdir_cleanup() ovl: narrow locking in ovl_workdir_create() ovl: narrow locking in ovl_cleanup_index() ovl: narrow locking in ovl_cleanup_whiteouts() ovl: narrow locking in ovl_rename() ovl: simplify gotos in ovl_rename() ovl: narrow locking in ovl_create_over_whiteout() ovl: narrow locking in ovl_clear_empty() ovl: narrow locking in ovl_create_upper() ovl: narrow the locked region in ovl_copy_up_workdir() ovl: Call ovl_create_temp() without lock held. ...
2025-07-28Merge tag 'vfs-6.17-rc1.coredump' of ↵Linus Torvalds14-603/+2242
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull coredump updates from Christian Brauner: "This contains an extension to the coredump socket and a proper rework of the coredump code. - This extends the coredump socket to allow the coredump server to tell the kernel how to process individual coredumps. This allows for fine-grained coredump management. Userspace can decide to just let the kernel write out the coredump, or generate the coredump itself, or just reject it. * COREDUMP_KERNEL The kernel will write the coredump data to the socket. * COREDUMP_USERSPACE The kernel will not write coredump data but will indicate to the parent that a coredump has been generated. This is used when userspace generates its own coredumps. * COREDUMP_REJECT The kernel will skip generating a coredump for this task. * COREDUMP_WAIT The kernel will prevent the task from exiting until the coredump server has shutdown the socket connection. The flexible coredump socket can be enabled by using the "@@" prefix instead of the single "@" prefix for the regular coredump socket: @@/run/systemd/coredump.socket - Cleanup the coredump code properly while we have to touch it anyway. Split out each coredump mode in a separate helper so it's easy to grasp what is going on and make the code easier to follow. The core coredump function should now be very trivial to follow" * tag 'vfs-6.17-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (31 commits) cleanup: add a scoped version of CLASS() coredump: add coredump_skip() helper coredump: avoid pointless variable coredump: order auto cleanup variables at the top coredump: add coredump_cleanup() coredump: auto cleanup prepare_creds() cred: add auto cleanup method coredump: directly return coredump: auto cleanup argv coredump: add coredump_write() coredump: use a single helper for the socket coredump: move pipe specific file check into coredump_pipe() coredump: split pipe coredumping into coredump_pipe() coredump: move core_pipe_count to global variable coredump: prepare to simplify exit paths coredump: split file coredumping into coredump_file() coredump: rename do_coredump() to vfs_coredump() selftests/coredump: make sure invalid paths are rejected coredump: validate socket path in coredump_parse() coredump: don't allow ".." in coredump socket path ...
2025-07-28Merge tag 'vfs-6.17-rc1.misc' of ↵Linus Torvalds88-457/+520
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc VFS updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add ext4 IOCB_DONTCACHE support This refactors the address_space_operations write_begin() and write_end() callbacks to take const struct kiocb * as their first argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate to the filesystem's buffered I/O path. Ext4 is updated to implement handling of the IOCB_DONTCACHE flag and advertises support via the FOP_DONTCACHE file operation flag. Additionally, the i915 driver's shmem write paths are updated to bypass the legacy write_begin/write_end interface in favor of directly calling write_iter() with a constructed synchronous kiocb. Another i915 change replaces a manual write loop with kernel_write() during GEM shmem object creation. Cleanups: - don't duplicate vfs_open() in kernel_file_open() - proc_fd_getattr(): don't bother with S_ISDIR() check - fs/ecryptfs: replace snprintf with sysfs_emit in show function - vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() - filelock: add new locks_wake_up_waiter() helper - fs: Remove three arguments from block_write_end() - VFS: change old_dir and new_dir in struct renamedata to dentrys - netfs: Remove unused declaration netfs_queue_write_request() Fixes: - eventpoll: Fix semi-unbounded recursion - eventpoll: fix sphinx documentation build warning - fs/read_write: Fix spelling typo - fs: annotate data race between poll_schedule_timeout() and pollwake() - fs/pipe: set FMODE_NOWAIT in create_pipe_files() - docs/vfs: update references to i_mutex to i_rwsem - fs/buffer: remove comment about hard sectorsize - fs/buffer: remove the min and max limit checks in __getblk_slow() - fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable - fs_context: fix parameter name in infofc() macro - fs: Prevent file descriptor table allocations exceeding INT_MAX" * tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits) netfs: Remove unused declaration netfs_queue_write_request() eventpoll: fix sphinx documentation build warning ext4: support uncached buffered I/O mm/pagemap: add write_begin_get_folio() helper function fs: change write_begin/write_end interface to take struct kiocb * drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter drm/i915: Use kernel_write() in shmem object create eventpoll: Fix semi-unbounded recursion vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable fs/buffer: remove the min and max limit checks in __getblk_slow() fs: Prevent file descriptor table allocations exceeding INT_MAX fs: Remove three arguments from block_write_end() fs/ecryptfs: replace snprintf with sysfs_emit in show function fs: annotate suspected data race between poll_schedule_timeout() and pollwake() docs/vfs: update references to i_mutex to i_rwsem fs/buffer: remove comment about hard sectorsize fs_context: fix parameter name in infofc() macro VFS: change old_dir and new_dir in struct renamedata to dentrys proc_fd_getattr(): don't bother with S_ISDIR() check ...
2025-07-28ASoC: SOF: amd: acp-loader: Use GFP_KERNEL for DMA allocations in resume contextMuhammad Usama Anjum1-3/+3
Replace GFP_ATOMIC with GFP_KERNEL for dma_alloc_coherent() calls. This change improves memory allocation reliability during firmware loading, particularly during system resume when memory pressure is high. Because of using GFP_KERNEL, reclaim can happen which can reduce the probability of failure. Fixes memory allocation failures observed during system resume with fragmented memory conditions. snd_sof_amd_vangogh 0000:04:00.5: error: failed to load DSP firmware after resume -12 Fixes: 145d7e5ae8f4e ("ASoC: SOF: amd: add option to use sram for data bin loading") Fixes: 7e51a9e38ab20 ("ASoC: SOF: amd: Add fw loader and renoir dsp ops to load firmware") Cc: stable@vger.kernel.org Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://patch.msgid.link/20250725190254.1081184-1-usama.anjum@collabora.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-07-28Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds10-807/+1216
Pull vfs mount updates from Al Viro: - mount hash conflicts rudiments are gone now - we do not allow multiple mounts with the same parent/mountpoint to be hashed at the same time. - 'struct mount' changes: - mnt_umounting is gone - mnt_slave_list/mnt_slave is an hlist now - overmounts are kept track of by explicit pointer in mount - a bunch of flags moved out of mnt_flags to a new field, with only namespace_sem for protection - mnt_expiry is protected by mount_lock now (instead of namespace_sem) - MNT_LOCKED is used only for mounts that need to remain attached to their parents to prevent mountpoint exposure - no more overloading it for absolute roots - all mnt_list uses are transient now - it's used only to represent temporary sets during umount_tree() - mount refcounting change: children no longer pin parents for any mounts, whether they'd passed through umount_tree() or not - 'struct mountpoint' changes: - refcount is no more; what matters is ->m_list emptiness - instead of temporary bumping the refcount, we insert a new object (pinned_mountpoint) into ->m_list - new calling conventions for lock_mount() and friends - do_move_mount()/attach_recursive_mnt() seriously cleaned up - globals in fs/pnode.c are gone - propagate_mnt(), change_mnt_propagation() and propagate_umount() cleaned up (in the last case - pretty much completely rewritten). - freeing of emptied mnt_namespace is done in namespace_unlock(). For one thing, there are subtle ordering requirements there; for another it simplifies cleanups. - assorted cleanups - restore the machinery for long-term mounts from accumulated bitrot. This is going to get a followup come next cycle, when the change of vfs_fs_parse_string() calling conventions goes into -next * tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (48 commits) statmount_mnt_basic(): simplify the logics for group id invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED() get rid of CL_SHARE_TO_SLAVE take freeing of emptied mnt_namespace to namespace_unlock() copy_tree(): don't link the mounts via mnt_list change_mnt_propagation(): move ->mnt_master assignment into MS_SLAVE case mnt_slave_list/mnt_slave: turn into hlist_head/hlist_node turn do_make_slave() into transfer_propagation() do_make_slave(): choose new master sanely change_mnt_propagation(): do_make_slave() is a no-op unless IS_MNT_SHARED() change_mnt_propagation() cleanups, step 1 propagate_mnt(): fix comment and convert to kernel-doc, while we are at it propagate_mnt(): get rid of last_dest fs/pnode.c: get rid of globals propagate_one(): fold into the sole caller propagate_one(): separate the "what should be the master for this copy" part propagate_one(): separate the "do we need secondary here?" logics propagate_mnt(): handle all peer groups in the same loop propagate_one(): get rid of dest_master mount: separate the flags accessed only under namespace_sem ...
2025-07-28Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-11/+3
Pull CLASS(fd) update from Al Viro: "A missing bit of commit 66635b077624 ("assorted variants of irqfd setup: convert to CLASS(fd)") from a year ago. mshv_eventfd would've been covered by that, but it had forked slightly before that series and got merged into mainline later" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mshv_eventfd: convert to CLASS(fd)
2025-07-28Merge tag 'pull-ceph-d_name-fixes' of ↵Linus Torvalds5-86/+43
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ceph dentry->d_name fixes from Al Viro: "Stuff that had fallen through the cracks back in February; ceph folks tested that pile and said they prefer to have it go through my tree..." * tag 'pull-ceph-d_name-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ceph: fix a race with rename() in ceph_mdsc_build_path() prep for ceph_encode_encrypted_fname() fixes [ceph] parse_longname(): strrchr() expects NUL-terminated string
2025-07-28Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds5-21/+11
Pull misc VFS updates from Al Viro: "VFS-related cleanups in various places (mostly of the "that really can't happen" or "there's a better way to do it" variety)" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: gpib: use file_inode() binder_ioctl_write_read(): simplify control flow a bit secretmem: move setting O_LARGEFILE and bumping users' count to the place where we create the file apparmor: file never has NULL f_path.mnt landlock: opened file never has a negative dentry
2025-07-28Merge tag 'pull-securityfs' of ↵Linus Torvalds9-251/+97
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull securityfs updates from Al Viro: "Securityfs cleanups and fixes: - one extra reference is enough to pin a dentry down; no need for two. Switch to regular scheme, similar to shmem, debugfs, etc. This fixes a securityfs_recursive_remove() dentry leak, among other things. - we need to have the filesystem pinned to prevent the contents disappearing; what we do not need is pinning it for each file. Doing that only for files and directories in the root is enough. - the previous two changes allow us to get rid of the racy kludges in efi_secret_unlink(), where we can use simple_unlink() instead of securityfs_remove(). Which does not require unlocking and relocking the parent, with all deadlocks that invites. - Make securityfs_remove() take the entire subtree out, turning securityfs_recursive_remove() into its alias. Makes a lot more sense for callers and fixes a mount leak, while we are at it. - Making securityfs_remove() remove the entire subtree allows for much simpler life in most of the users - efi_secret, ima_fs, evm, ipe, tmp get cleaner. I hadn't touched apparmor use of securityfs, but I suspect that it would be useful there as well" * tag 'pull-securityfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: tpm: don't bother with removal of files in directory we'll be removing ipe: don't bother with removal of files in directory we'll be removing evm_secfs: clear securityfs interactions ima_fs: get rid of lookup-by-dentry stuff ima_fs: don't bother with removal of files in directory we'll be removing efi_secret: clean securityfs use up make securityfs_remove() remove the entire subtree fix locking in efi_secret_unlink() securityfs: pin filesystem only for objects directly in root securityfs: don't pin dentries twice, once is enough...
2025-07-28bpf: Fix various typos in verifier.c commentsSuchit Karunakaran1-5/+5
This patch fixes several minor typos in comments within the BPF verifier. No changes in functionality. Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com> Link: https://lore.kernel.org/r/20250727081754.15986-1-suchitkarunakaran@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28Merge branch 'bpf-improve-64bits-bounds-refinement'Alexei Starovoitov3-1/+186
Paul Chaignon says: ==================== bpf: Improve 64bits bounds refinement This patchset improves the 64bits bounds refinement when the s64 ranges crosses the sign boundary. The first patch explains the small addition to __reg64_deduce_bounds. The last one explains why we need a third round of __reg_deduce_bounds. The third patch adds a selftest with a more complete example of the impact on verification. The second and fourth patches update the existing selftests to take the new refinement into account. This patchset should reduce the number of kernel warnings hit by syzkaller due to invariant violations [1]. It was also tested with Agni [2] (and Cilium's CI for good measure). Link: https://syzkaller.appspot.com/bug?extid=c711ce17dd78e5d4fdcf [1] Link: https://github.com/bpfverif/agni [2] Changes in v4: - Fixed outdated test comment, noticed by Eduard. - Rebased. Changes in v3: - Added a 5th patch to call __reg_deduce_bounds a third time in reg_bounds_sync following tests from Eduard. - Fixed broken indentations in the first patch. Changes in v2 (all on Eduard's suggestions): - Added two tests to ensure we cover all cases of u64/s64 overlap. - Improved tests to check deduced ranges with __msg. - Improved code comments. ==================== Link: https://patch.msgid.link/cover.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28bpf: Add third round of bounds deductionPaul Chaignon2-1/+2
Commit d7f008738171 ("bpf: try harder to deduce register bounds from different numeric domains") added a second call to __reg_deduce_bounds in reg_bounds_sync because a single call wasn't enough to converge to a fixed point in terms of register bounds. With patch "bpf: Improve bounds when s64 crosses sign boundary" from this series, Eduard noticed that calling __reg_deduce_bounds twice isn't enough anymore to converge. The first selftest added in "selftests/bpf: Test cross-sign 64bits range refinement" highlights the need for a third call to __reg_deduce_bounds. After instruction 7, reg_bounds_sync performs the following bounds deduction: reg_bounds_sync entry: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146) __update_reg_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146) __reg_deduce_bounds: __reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e) __reg64_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_deduce_mixed_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e) __reg_deduce_bounds: __reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e) __reg64_deduce_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_deduce_mixed_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_bound_offset: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff)) __update_reg_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff)) In particular, notice how: 1. In the first call to __reg_deduce_bounds, __reg32_deduce_bounds learns new u32 bounds. 2. __reg64_deduce_bounds is unable to improve bounds at this point. 3. __reg_deduce_mixed_bounds derives new u64 bounds from the u32 bounds. 4. In the second call to __reg_deduce_bounds, __reg64_deduce_bounds improves the smax and umin bounds thanks to patch "bpf: Improve bounds when s64 crosses sign boundary" from this series. 5. Subsequent functions are unable to improve the ranges further (only tnums). Yet, a better smin32 bound could be learned from the smin bound. __reg32_deduce_bounds is able to improve smin32 from smin, but for that we need a third call to __reg_deduce_bounds. As discussed in [1], there may be a better way to organize the deduction rules to learn the same information with less calls to the same functions. Such an optimization requires further analysis and is orthogonal to the present patchset. Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1] Acked-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/79619d3b42e5525e0e174ed534b75879a5ba15de.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28selftests/bpf: Test invariants on JSLT crossing signPaul Chaignon1-1/+1
The improvement of the u64/s64 range refinement fixed the invariant violation that was happening on this test for BPF_JSLT when crossing the sign boundary. After this patch, we have one test remaining with a known invariant violation. It's the same test as fixed here but for 32 bits ranges. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ad046fb0016428f1a33c3b81617aabf31b51183f.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28selftests/bpf: Test cross-sign 64bits range refinementPaul Chaignon1-0/+118
This patch adds coverage for the new cross-sign 64bits range refinement logic. The three tests cover the cases when the u64 and s64 ranges overlap (1) in the negative portion of s64, (2) in the positive portion of s64, and (3) in both portions. The first test is a simplified version of a BPF program generated by syzkaller that caused an invariant violation [1]. It looks like syzkaller could not extract the reproducer itself (and therefore didn't report it to the mailing list), but I was able to extract it from the console logs of a crash. The principle is similar to the invariant violation described in commit 6279846b9b25 ("bpf: Forget ranges when refining tnum after JSET"): the verifier walks a dead branch, uses the condition to refine ranges, and ends up with inconsistent ranges. In this case, the dead branch is when we fallthrough on both jumps. The new refinement logic improves the bounds such that the second jump is properly detected as always-taken and the verifier doesn't end up walking a dead branch. The second and third tests are inspired by the first, but rely on condition jumps to prepare the bounds instead of ALU instructions. An R10 write is used to trigger a verifier error when the bounds can't be refined. Link: https://syzkaller.appspot.com/bug?extid=c711ce17dd78e5d4fdcf [1] Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/a0e17b00dab8dabcfa6f8384e7e151186efedfdd.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28selftests/bpf: Update reg_bound range refinement logicPaul Chaignon1-0/+14
This patch updates the range refinement logic in the reg_bound test to match the new logic from the previous commit. Without this change, tests would fail because we end with more precise ranges than the tests expect. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/b7f6b1fbe03373cca4e1bb6a113035a6cd2b3ff7.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-28bpf: Improve bounds when s64 crosses sign boundaryPaul Chaignon1-0/+52
__reg64_deduce_bounds currently improves the s64 range using the u64 range and vice versa, but only if it doesn't cross the sign boundary. This patch improves __reg64_deduce_bounds to cover the case where the s64 range crosses the sign boundary but overlaps with the u64 range on only one end. In that case, we can improve both ranges. Consider the following example, with the s64 range crossing the sign boundary: 0 U64_MAX | [xxxxxxxxxxxxxx u64 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxx s64 range xxxxxxxxx] [xxxxxxx| 0 S64_MAX S64_MIN -1 The u64 range overlaps only with positive portion of the s64 range. We can thus derive the following new s64 and u64 ranges. 0 U64_MAX | [xxxxxx u64 range xxxxx] | |----------------------------|----------------------------| | [xxxxxx s64 range xxxxx] | 0 S64_MAX S64_MIN -1 The same logic can probably apply to the s32/u32 ranges, but this patch doesn't implement that change. In addition to the selftests, the __reg64_deduce_bounds change was also tested with Agni, the formal verification tool for the range analysis [1]. Link: https://github.com/bpfverif/agni [1] Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/933bd9ce1f36ded5559f92fdc09e5dbc823fa245.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>