summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2023-08-16sysctl: Add size to register_sysctlJoel Granados1-2/+8
This commit adds table_size to register_sysctl in preparation for the removal of the sentinel elements in the ctl_table arrays (last empty markers). And though we do *not* remove any sentinels in this commit, we set things up by either passing the table_size explicitly or using ARRAY_SIZE on the ctl_table arrays. We replace the register_syctl function with a macro that will add the ARRAY_SIZE to the new register_sysctl_sz function. In this way the callers that are already using an array of ctl_table structs do not change. For the callers that pass a ctl_table array pointer, we pass the table_size to register_sysctl_sz instead of the macro. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-16sysctl: Add a size arg to __register_sysctl_tableJoel Granados1-1/+1
We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we *do* make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size *does* however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-16sysctl: Add ctl_table_size to ctl_table_headerJoel Granados1-2/+12
The new ctl_table_size element will hold the size of the ctl_table arrays contained in the ctl_table_header. This value should eventually be passed by the callers to the sysctl register infrastructure. And while this commit introduces the variable, it does not set nor use it because that requires case by case considerations for each caller. It provides two important things: (1) A place to put the result of the ctl_table array calculation when it gets introduced for each caller. And (2) the size that will be used as the additional stopping criteria in the list_for_each_table_entry macro (to be added when all the callers are migrated) Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-16list: Introduce CONFIG_LIST_HARDENEDMarco Elver1-6/+58
Numerous production kernel configs (see [1, 2]) are choosing to enable CONFIG_DEBUG_LIST, which is also being recommended by KSPP for hardened configs [3]. The motivation behind this is that the option can be used as a security hardening feature (e.g. CVE-2019-2215 and CVE-2019-2025 are mitigated by the option [4]). The feature has never been designed with performance in mind, yet common list manipulation is happening across hot paths all over the kernel. Introduce CONFIG_LIST_HARDENED, which performs list pointer checking inline, and only upon list corruption calls the reporting slow path. To generate optimal machine code with CONFIG_LIST_HARDENED: 1. Elide checking for pointer values which upon dereference would result in an immediate access fault (i.e. minimal hardening checks). The trade-off is lower-quality error reports. 2. Use the __preserve_most function attribute (available with Clang, but not yet with GCC) to minimize the code footprint for calling the reporting slow path. As a result, function size of callers is reduced by avoiding saving registers before calling the rarely called reporting slow path. Note that all TUs in lib/Makefile already disable function tracing, including list_debug.c, and __preserve_most's implied notrace has no effect in this case. 3. Because the inline checks are a subset of the full set of checks in __list_*_valid_or_report(), always return false if the inline checks failed. This avoids redundant compare and conditional branch right after return from the slow path. As a side-effect of the checks being inline, if the compiler can prove some condition to always be true, it can completely elide some checks. Since DEBUG_LIST is functionally a superset of LIST_HARDENED, the Kconfig variables are changed to reflect that: DEBUG_LIST selects LIST_HARDENED, whereas LIST_HARDENED itself has no dependency on DEBUG_LIST. Running netperf with CONFIG_LIST_HARDENED (using a Clang compiler with "preserve_most") shows throughput improvements, in my case of ~7% on average (up to 20-30% on some test cases). Link: https://r.android.com/1266735 [1] Link: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/main/config [2] Link: https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings [3] Link: https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html [4] Signed-off-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20230811151847.1594958-3-elver@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-16list_debug: Introduce inline wrappers for debug checksMarco Elver1-4/+33
Turn the list debug checking functions __list_*_valid() into inline functions that wrap the out-of-line functions. Care is taken to ensure the inline wrappers are always inlined, so that additional compiler instrumentation (such as sanitizers) does not result in redundant outlining. This change is preparation for performing checks in the inline wrappers. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20230811151847.1594958-2-elver@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-16compiler_types: Introduce the Clang __preserve_most function attributeMarco Elver1-0/+28
[1]: "On X86-64 and AArch64 targets, this attribute changes the calling convention of a function. The preserve_most calling convention attempts to make the code in the caller as unintrusive as possible. This convention behaves identically to the C calling convention on how arguments and return values are passed, but it uses a different set of caller/callee-saved registers. This alleviates the burden of saving and recovering a large register set before and after the call in the caller. If the arguments are passed in callee-saved registers, then they will be preserved by the callee across the call. This doesn't apply for values returned in callee-saved registers. * On X86-64 the callee preserves all general purpose registers, except for R11. R11 can be used as a scratch register. Floating-point registers (XMMs/YMMs) are not preserved and need to be saved by the caller. * On AArch64 the callee preserve all general purpose registers, except x0-X8 and X16-X18." [1] https://clang.llvm.org/docs/AttributeReference.html#preserve-most Introduce the attribute to compiler_types.h as __preserve_most. Use of this attribute results in better code generation for calls to very rarely called functions, such as error-reporting functions, or rarely executed slow paths. Beware that the attribute conflicts with instrumentation calls inserted on function entry which do not use __preserve_most themselves. Notably, function tracing which assumes the normal C calling convention for the given architecture. Where the attribute is supported, __preserve_most will imply notrace. It is recommended to restrict use of the attribute to functions that should or already disable tracing. Note: The additional preprocessor check against architecture should not be necessary if __has_attribute() only returns true where supported; also see https://github.com/ClangBuiltLinux/linux/issues/1908. But until __has_attribute() does the right thing, we also guard by known-supported architectures to avoid build warnings on other architectures. The attribute may be supported by a future GCC version (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899). Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230811151847.1594958-1-elver@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-08-15perf/smmuv3: Enable HiSilicon Erratum 162001900 quirk for HIP08/09Yicong Yang1-0/+1
Some HiSilicon SMMU PMCG suffers the erratum 162001900 that the PMU disable control sometimes fail to disable the counters. This will lead to error or inaccurate data since before we enable the counters the counter's still counting for the event used in last perf session. This patch tries to fix this by hardening the global disable process. Before disable the PMU, writing an invalid event type (0xffff) to focibly stop the counters. Correspondingly restore each events on pmu::pmu_enable(). Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20230814124012.58013-1-yangyicong@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2023-08-15vfs: fix up the assert in i_readcount_decMateusz Guzik1-2/+1
Drops a race where 2 threads could spot a positive value and both proceed to dec to -1, without reporting anything. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Message-Id: <20230811194814.1612336-1-mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-15vfs, security: Fix automount superblock LSM init problem, preventing NFS sb ↵David Howells2-0/+7
sharing When NFS superblocks are created by automounting, their LSM parameters aren't set in the fs_context struct prior to sget_fc() being called, leading to failure to match existing superblocks. This bug leads to messages like the following appearing in dmesg when fscache is enabled: NFS: Cache volume key already in use (nfs,4.2,2,108,106a8c0,1,,,,100000,100000,2ee,3a98,1d4c,3a98,1) Fix this by adding a new LSM hook to load fc->security for submount creation. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/165962680944.3334508.6610023900349142034.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/165962729225.3357250.14350728846471527137.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/165970659095.2812394.6868894171102318796.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/166133579016.3678898.6283195019480567275.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/217595.1662033775@warthog.procyon.org.uk/ # v5 Fixes: 9bc61ab18b1d ("vfs: Introduce fs_context, switch vfs_kern_mount() to it.") Fixes: 779df6a5480f ("NFS: Ensure security label is set for root inode") Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: "Christian Brauner (Microsoft)" <brauner@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230808-master-v9-1-e0ecde888221@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-15bpf: Document struct bpf_struct_ops fieldsDavid Vernet1-0/+47
Subsystems that want to implement a struct bpf_struct_ops structure to enable struct_ops maps must currently reverse engineer how the structure works. Given that this is meant to be a way for subsystem maintainers to extend their subsystems using BPF, let's document it to make it a bit easier on them. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230814185908.700553-3-void@manifault.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-15torture: Add a kthread-creation callback to _torture_create_kthread()Paul E. McKenney1-2/+5
This commit adds a kthread-creation callback to the _torture_create_kthread() function, which allows callers of a new torture_create_kthread_cb() macro to specify a function to be invoked after the kthread is created but before it is awakened for the first time. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: kernel-team@android.com Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: John Stultz <jstultz@google.com>
2023-08-15block: Bring back zero_fill_bio_iterKent Overstreet1-1/+6
This reverts 6f822e1b5d9dda3d20e87365de138046e3baa03a - this helper is used by bcachefs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Link: https://lore.kernel.org/r/20230813182636.2966159-4-kent.overstreet@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-15block: Add some exports for bcachefsKent Overstreet1-0/+1
- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path - blk_status_to_str() - error messages - bio_add_folio() - this should definitely be exported for everyone, it's the modern version of bio_add_page() Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: linux-block@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20230813182636.2966159-2-kent.overstreet@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-15net/mlx5: Remove unused MAX HCA capabilitiesShay Drory2-53/+0
Each device cap has two modes: MAX and CUR. The driver maintains a cache of both modes of the capabilities. For most device caps, the MAX cap mode is never used. Hence, remove all driver queries of the MAX mode of the said caps as well as their helper MACROs. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-15net/mlx5: Remove unused CAPsShay Drory2-59/+1
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but there isn't any user who requires them. As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used. Thus, drop all usages and definitions of the mentioned caps above. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-15net/mlx5: Check with FW that sync reset completed successfullyMoshe Shemesh1-1/+2
Even if the PF driver had no error on his part of the sync reset flow, the firmware can see wider picture as it syncs all the PFs in the flow. So add at end of sync reset flow check with firmware by reading MFRL register and initialization segment that the flow had no issue from firmware point of view too. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14fs: add FSCONFIG_CMD_CREATE_EXCLChristian Brauner1-0/+1
Summary ======= This introduces FSCONFIG_CMD_CREATE_EXCL which will allows userspace to implement something like mount -t ext4 --exclusive /dev/sda /B which fails if a superblock for the requested filesystem does already exist: Before this patch ----------------- $ sudo ./move-mount -f xfs -o source=/dev/sda4 /A Requesting filesystem type xfs Mount options requested: source=/dev/sda4 Attaching mount at /A Moving single attached mount Setting key(source) with val(/dev/sda4) $ sudo ./move-mount -f xfs -o source=/dev/sda4 /B Requesting filesystem type xfs Mount options requested: source=/dev/sda4 Attaching mount at /B Moving single attached mount Setting key(source) with val(/dev/sda4) After this patch with --exclusive as a switch for FSCONFIG_CMD_CREATE_EXCL -------------------------------------------------------------------------- $ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /A Requesting filesystem type xfs Request exclusive superblock creation Mount options requested: source=/dev/sda4 Attaching mount at /A Moving single attached mount Setting key(source) with val(/dev/sda4) $ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /B Requesting filesystem type xfs Request exclusive superblock creation Mount options requested: source=/dev/sda4 Attaching mount at /B Moving single attached mount Setting key(source) with val(/dev/sda4) Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed Details ======= As mentioned on the list (cf. [1]-[3]) mount requests like mount -t ext4 /dev/sda /A are ambigous for userspace. Either a new superblock has been created and mounted or an existing superblock has been reused and a bind-mount has been created. This becomes clear in the following example where two processes create the same mount for the same block device: P1 P2 fd_fs = fsopen("ext4"); fd_fs = fsopen("ext4"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "dax", "always"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "resuid", "1000"); // wins and creates superblock fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) // finds compatible superblock of P1 // spins until P1 sets SB_BORN and grabs a reference fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) fd_mnt1 = fsmount(fd_fs); fd_mnt2 = fsmount(fd_fs); move_mount(fd_mnt1, "/A") move_mount(fd_mnt2, "/B") Not just does P2 get a bind-mount but the mount options that P2 requestes are silently ignored. The VFS itself doesn't, can't and shouldn't enforce filesystem specific mount option compatibility. It only enforces incompatibility for read-only <-> read-write transitions: mount -t ext4 /dev/sda /A mount -t ext4 -o ro /dev/sda /B The read-only request will fail with EBUSY as the VFS can't just silently transition a superblock from read-write to read-only or vica versa without risking security issues. To userspace this silent superblock reuse can become a security issue in because there is currently no straightforward way for userspace to know that they did indeed manage to create a new superblock and didn't just reuse an existing one. This adds a new FSCONFIG_CMD_CREATE_EXCL command to fsconfig() that returns EBUSY if an existing superblock would be reused. Userspace that needs to be sure that it did create a new superblock with the requested mount options can request superblock creation using this command. If the command succeeds they can be sure that they did create a new superblock with the requested mount options. This requires the new mount api. With the old mount api it would be necessary to plumb this through every legacy filesystem's file_system_type->mount() method. If they want this feature they are most welcome to switch to the new mount api. Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on each high-level superblock creation helper: (1) get_tree_nodev() Always allocate new superblock. Hence, FSCONFIG_CMD_CREATE and FSCONFIG_CMD_CREATE_EXCL are equivalent. The binderfs or overlayfs filesystems are examples. (4) get_tree_keyed() Finds an existing superblock based on sb->s_fs_info. Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY. The mqueue or nfsd filesystems are examples. (2) get_tree_bdev() This effectively works like get_tree_keyed(). The ext4 or xfs filesystems are examples. (3) get_tree_single() Only one superblock of this filesystem type can ever exist. Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY. The securityfs or configfs filesystems are examples. Note that some single-instance filesystems never destroy the superblock once it has been created during the first mount. For example, if securityfs has been mounted at least onces then the created superblock will never be destroyed again as long as there is still an LSM making use it. Consequently, even if securityfs is unmounted and the superblock seemingly destroyed it really isn't which means that FSCONFIG_CMD_CREATE_EXCL will continue rejecting reusing an existing superblock. This is acceptable thugh since special purpose filesystems such as this shouldn't have a need to use FSCONFIG_CMD_CREATE_EXCL anyway and if they do it's probably to make sure that mount options aren't ignored. Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on filesystems that make use of the low-level sget_fc() helper directly. They're all effectively variants on get_tree_keyed(), get_tree_bdev(), or get_tree_nodev(): (5) mtd_get_sb() Similar logic to get_tree_keyed(). (6) afs_get_tree() Similar logic to get_tree_keyed(). (7) ceph_get_tree() Similar logic to get_tree_keyed(). Already explicitly allows forcing the allocation of a new superblock via CEPH_OPT_NOSHARE. This turns it into get_tree_nodev(). (8) fuse_get_tree_submount() Similar logic to get_tree_nodev(). (9) fuse_get_tree() Forces reuse of existing FUSE superblock. Forces reuse of existing superblock if passed in file refers to an existing FUSE connection. If FSCONFIG_CMD_CREATE_EXCL is specified together with an fd referring to an existing FUSE connections this would cause the superblock reusal to fail. If reusing is the intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified. (10) fuse_get_tree() -> get_tree_nodev() Same logic as in get_tree_nodev(). (11) fuse_get_tree() -> get_tree_bdev() Same logic as in get_tree_bdev(). (12) virtio_fs_get_tree() Same logic as get_tree_keyed(). (13) gfs2_meta_get_tree() Forces reuse of existing gfs2 superblock. Mounting gfs2meta enforces that a gf2s superblock must already exist. If not, it will error out. Consequently, mounting gfs2meta with FSCONFIG_CMD_CREATE_EXCL would always fail. If reusing is the intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified. (14) kernfs_get_tree() Similar logic to get_tree_keyed(). (15) nfs_get_tree_common() Similar logic to get_tree_keyed(). Already explicitly allows forcing the allocation of a new superblock via NFS_MOUNT_UNSHARED. This effectively turns it into get_tree_nodev(). Link: [1] https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner Link: [2] https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner Link: [3] https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230802-vfs-super-exclusive-v2-4-95dc4e41b870@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14super: remove get_tree_single_reconf()Christian Brauner1-3/+0
The get_tree_single_reconf() helper isn't used anywhere. Remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230802-vfs-super-exclusive-v2-1-95dc4e41b870@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14net: phy: Introduce PSGMII PHY interface modeGabor Juhos1-0/+4
The PSGMII interface is similar to QSGMII. The main difference is that the PSGMII interface combines five SGMII lines into a single link while in QSGMII only four lines are combined. Similarly to the QSGMII, this interface mode might also needs special handling within the MAC driver. It is commonly used by Qualcomm with their QCA807x PHY series and modern WiSoC-s. Add definitions for the PHY layer to allow to express this type of connection between the MAC and PHY. Signed-off-by: Gabor Juhos <j4g8y7@gmail.com> Signed-off-by: Robert Marko <robert.marko@sartura.hr> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-12locking: remove spin_lock_prefetchMateusz Guzik1-6/+1
The only remaining consumer is new_inode, where it showed up in 2001 as commit c37fa164f793 ("v2.4.9.9 -> v2.4.9.10") in a historical repo [1] with a changelog which does not mention it. Since then the line got only touched up to keep compiling. While it may have been of benefit back in the day, it is guaranteed to at best not get in the way in the multicore setting -- as the code performs *a lot* of work between the prefetch and actual lock acquire, any contention means the cacheline is already invalid by the time the routine calls spin_lock(). It adds spurious traffic, for short. On top of it prefetch is notoriously tricky to use for single-threaded purposes, making it questionable from the get go. As such, remove it. I admit upfront I did not see value in benchmarking this change, but I can do it if that is deemed appropriate. Removal from new_inode and of the entire thing are in the same patch as requested by Linus, so whatever weird looks can be directed at that guy. Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/fs/inode.c?id=c37fa164f793735b32aa3f53154ff1a7659e6442 [1] Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-08-12Merge tag 'x86_bugs_for_v6.5_rc6' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mitigation fixes from Borislav Petkov: "The first set of fallout fixes after the embargo madness. There will be another set next week too. - A first series of cleanups/unifications and documentation improvements to the SRSO and GDS mitigations code which got postponed to after the embargo date - Fix the SRSO aliasing addresses assertion so that the LLVM linker can parse it too" * tag 'x86_bugs_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: driver core: cpu: Fix the fallback cpu_show_gds() name x86: Move gds_ucode_mitigated() declaration to header x86/speculation: Add cpu_show_gds() prototype driver core: cpu: Make cpu_show_not_affected() static x86/srso: Fix build breakage with the LLVM linker Documentation/srso: Document IBPB aspect and fix formatting driver core: cpu: Unify redundant silly stubs Documentation/hw-vuln: Unify filename specification in index
2023-08-12bpf: Remove unused declaration bpf_link_new_file()Yue Haibing1-1/+0
Commit a3b80e107894 ("bpf: Allocate ID for bpf_link") removed the implementation but not the declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20230809140556.45836-1-yuehaibing@huawei.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-11Merge tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linuxLinus Torvalds2-2/+1
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Fixes for request_queue state (Ming) - Another uuid quirk (August) - RCU poll fix for NVMe (Ming) - Fix for an IO stall with polled IO (me) - Fix for blk-iocost stats enable/disable accounting (Chengming) - Regression fix for large pages for zram (Christoph) * tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux: nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll blk-iocost: fix queue stats accounting block: don't make REQ_POLLED imply REQ_NOWAIT block: get rid of unused plug->nowait flag zram: take device and not only bvec offset into account nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G nvme-rdma: fix potential unbalanced freeze & unfreeze nvme-tcp: fix potential unbalanced freeze & unfreeze nvme: fix possible hang when removing a controller during error recovery
2023-08-11fs: export fs_holder_opsChristoph Hellwig1-0/+2
Export fs_holder_ops so that file systems that open additional block devices can use it as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230802154131.2221419-9-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11fs: add infrastructure for multigrain timestampsJeff Layton1-2/+44
The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. POSIX generally mandates that when the the mtime changes, the ctime must also change. The kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Use the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Later patches will convert individual filesystems to use the new infrastructure. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-9-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11fs: drop the timespec64 argument from update_timeJeff Layton1-2/+2
Now that all of the update_time operations are prepared for it, we can drop the timespec64 argument from the update_time operation. Do that and remove it from some associated functions like inode_update_time and inode_needs_update_time. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+2
Merge net again, after pulling in x86/bugs fixes to clang linking errors. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-11Merge branch 'x86/bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipJakub Kicinski1-0/+2
Cross merge x86 fixes to fix clang linking errors: ld.lld: error: ./arch/x86/kernel/vmlinux.lds:221: at least one side of the expression must be absolute These will hopefully be downstream by the time we ship the next batch of fixes. * 'x86/bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Move gds_ucode_mitigated() declaration to header x86/speculation: Add cpu_show_gds() prototype driver core: cpu: Make cpu_show_not_affected() static x86/srso: Fix build breakage with the LLVM linker Documentation/srso: Document IBPB aspect and fix formatting driver core: cpu: Unify redundant silly stubs Documentation/hw-vuln: Unify filename specification in index Link: https://lore.kernel.org/all/CAHk-=wj_b+FGTnevQSBAtCWuhCk=0oQ_THvthBW2hzqpOTLFmg@mail.gmail.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-11net: phy: phy_device: Call into the PHY driver to set LED offloadAndrew Lunn1-0/+33
Linux LEDs can be requested to perform hardware accelerated blinking to indicate link, RX, TX etc. Pass the rules for blinking to the PHY driver, if it implements the ops needed to determine if a given pattern can be offloaded, to offload it, and what the current offload is. Additionally implement the op needed to get what device the LED is for. Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Daniel Golle <daniel@makrotopia.org> Link: https://lore.kernel.org/r/20230808210436.838995-3-andrew@lunn.ch Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-11net: stmmac: add new mode parameter for fix_mac_speedShenwei Wang1-1/+1
A mode parameter has been added to the callback function of fix_mac_speed to indicate the physical layer type. The mode can be one the following: MLO_AN_PHY - Conventional PHY MLO_AN_FIXED - Fixed-link mode MLO_AN_INBAND - In-band protocol Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com> Link: https://lore.kernel.org/r/20230807160716.259072-2-shenwei.wang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-11Merge tag 'for-netdev' of ↵Jakub Kicinski3-7/+9
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2023-08-09 We've added 19 non-merge commits during the last 6 day(s) which contain a total of 25 files changed, 369 insertions(+), 141 deletions(-). The main changes are: 1) Fix array-index-out-of-bounds access when detaching from an already empty mprog entry from Daniel Borkmann. 2) Adjust bpf selftest because of a recent llvm change related to the cpu-v4 ISA from Eduard Zingerman. 3) Add uprobe support for the bpf_get_func_ip helper from Jiri Olsa. 4) Fix a KASAN splat due to the kernel incorrectly accepted an invalid program using the recent cpu-v4 instruction from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: bpf: btf: Remove two unused function declarations bpf: lru: Remove unused declaration bpf_lru_promote() selftests/bpf: relax expected log messages to allow emitting BPF_ST selftests/bpf: remove duplicated functions bpf, docs: Fix small typo and define semantics of sign extension selftests/bpf: Add bpf_get_func_ip test for uprobe inside function selftests/bpf: Add bpf_get_func_ip tests for uprobe on function entry bpf: Add support for bpf_get_func_ip helper for uprobe program selftests/bpf: Add a movsx selftest for sign-extension of R10 bpf: Fix an incorrect verification success with movsx insn bpf, docs: Formalize type notation and function semantics in ISA standard bpf: change bpf_alu_sign_string and bpf_movsx_string to static libbpf: Use local includes inside the library bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR. bpf: fix inconsistent return types of bpf_xdp_copy_buf(). selftests/bpf: fix the incorrect verification of port numbers. selftests/bpf: Add test for detachment on empty mprog entry bpf: Fix mprog detachment for empty mprog entry bpf: bpf_struct_ops: Remove unnecessary initial values of variables ==================== Link: https://lore.kernel.org/r/20230810055123.109578-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski7-8/+25
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/intel/igc/igc_main.c 06b412589eef ("igc: Add lock to safeguard global Qbv variables") d3750076d464 ("igc: Add TransmissionOverrun counter") drivers/net/ethernet/microsoft/mana/mana_en.c a7dfeda6fdec ("net: mana: Fix MANA VF unload when hardware is unresponsive") a9ca9f9ceff3 ("page_pool: split types and declarations from page_pool.h") 92272ec4107e ("eth: add missing xdp.h includes in drivers") net/mptcp/protocol.h 511b90e39250 ("mptcp: fix disconnect vs accept race") b8dc6d6ce931 ("mptcp: fix rcv buffer auto-tuning") tools/testing/selftests/net/mptcp/mptcp_join.sh c8c101ae390a ("selftests: mptcp: join: fix 'implicit EP' test") 03668c65d153 ("selftests: mptcp: join: rework detailed report") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10Merge tag 'net-6.5-rc6' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, wireless and bpf. Still trending up in size but the good news is that the "current" regressions are resolved, AFAIK. We're getting weirdly many fixes for Wake-on-LAN and suspend/resume handling on embedded this week (most not merged yet), not sure why. But those are all for older bugs. Current release - regressions: - tls: set MSG_SPLICE_PAGES consistently when handing encrypted data over to TCP Current release - new code bugs: - eth: mlx5: correct IDs on VFs internal to the device (IPU) Previous releases - regressions: - phy: at803x: fix WoL support / reporting on AR8032 - bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from slaves, leading to BUG_ON() - tun: prevent tun_build_skb() from exceeding the packet size limit - wifi: rtw89: fix 8852AE disconnection caused by RX full flags - eth/PCI: enetc: fix probing after 6fffbc7ae137 ("PCI: Honor firmware's device disabled status"), keep PCI devices around even if they are disabled / not going to be probed to be able to apply quirks on them - eth: prestera: fix handling IPv4 routes with nexthop IDs Previous releases - always broken: - netfilter: re-work garbage collection to avoid races between user-facing API and timeouts - tunnels: fix generating ipv4 PMTU error on non-linear skbs - nexthop: fix infinite nexthop bucket dump when using maximum nexthop ID - wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems() Misc: - unix: use consistent error code in SO_PEERPIDFD - ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for upcoming IETF RFC" * tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits) net: hns3: fix strscpy causing content truncation issue net: tls: set MSG_SPLICE_PAGES consistently ibmvnic: Ensure login failure recovery is safe from other resets ibmvnic: Do partial reset on login failure ibmvnic: Handle DMA unmapping of login buffs in release functions ibmvnic: Unmap DMA login rsp buffer on send login fail ibmvnic: Enforce stronger sanity checks on login response net: mana: Fix MANA VF unload when hardware is unresponsive netfilter: nf_tables: remove busy mark and gc batch API netfilter: nft_set_hash: mark set element as dead when deleting from packet path netfilter: nf_tables: adapt set backend to use GC transaction API netfilter: nf_tables: GC transaction API to avoid race with control plane selftests/bpf: Add sockmap test for redirecting partial skb data selftests/bpf: fix a CI failure caused by vsock sockmap test bpf, sockmap: Fix bug that strp_done cannot be called bpf, sockmap: Fix map type error in sock_map_del_link xsk: fix refcount underflow in error path ipv6: adjust ndisc_is_useropt() to also return true for PIO selftests: forwarding: bridge_mdb: Make test more robust selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet ...
2023-08-10x86/speculation: Add cpu_show_gds() prototypeArnd Bergmann1-0/+2
The newly added function has two definitions but no prototypes: drivers/base/cpu.c:605:16: error: no previous prototype for 'cpu_show_gds' [-Werror=missing-prototypes] Add a declaration next to the other ones for this file to avoid the warning. Fixes: 8974eb588283b ("x86/speculation: Add Gather Data Sampling mitigation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Cc: stable@kernel.org Link: https://lore.kernel.org/all/20230809130530.1913368-1-arnd%40kernel.org
2023-08-10x86/amd_nb: Add PCI IDs for AMD Family 1Ah-based modelsAvadhut Naik1-0/+2
Add new PCI Device IDs required to support AMD's new Family 1Ah-based models 00h-1Fh, 20h and 40h-4Fh. [ bp: Zap a useless sentence. ] Co-developed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230809035244.2722455-2-avadhut.naik@amd.com
2023-08-10tmpfs,xattr: enable limited user extended attributesHugh Dickins1-1/+2
Enable "user." extended attributes on tmpfs, limiting them by tracking the space they occupy, and deducting that space from the limited ispace (unless tmpfs mounted with nr_inodes=0 to leave that ispace unlimited). tmpfs inodes and simple xattrs are both unswappable, and have to be in lowmem on a 32-bit highmem kernel: so the ispace limit is appropriate for xattrs, without any need for a further mount option. Add simple_xattr_space() to give approximate but deterministic estimate of the space taken up by each xattr: with simple_xattrs_free() outputting the space freed if required (but kernfs and even some tmpfs usages do not require that, so don't waste time on strlen'ing if not needed). Security and trusted xattrs were already supported: for consistency and simplicity, account them from the same pool; though there's a small risk that a tmpfs with enough space before would now be considered too small. When extended attributes are used, "df -i" does show more IUsed and less IFree than can be explained by the inodes: document that (manpage later). xfstests tests/generic which were not run on tmpfs before but now pass: 020 037 062 070 077 097 103 117 337 377 454 486 523 533 611 618 728 with no new failures. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Message-Id: <2e63b26e-df46-5baa-c7d6-f9a8dd3282c5@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10fs: export setup_bdev_superChristoph Hellwig1-0/+2
We'll want to use setup_bdev_super instead of duplicating it in nilfs2. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230802154131.2221419-2-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10bpf, sockmap: Fix bug that strp_done cannot be calledXu Kuohai1-0/+1
strp_done is only called when psock->progs.stream_parser is not NULL, but stream_parser was set to NULL by sk_psock_stop_strp(), called by sk_psock_drop() earlier. So, strp_done can never be called. Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock. Change the condition for calling strp_done from judging whether stream_parser is set to judging whether this flag is set. This flag is only set once when strp_init() succeeds, and will never be cleared later. Fixes: c0d95d3380ee ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-10net: ptp: create a mock-up PTP Hardware Clock driverVladimir Oltean1-0/+38
There are several cases where virtual net devices may benefit from having a PTP clock, and these have to do with testing. I can see at least netdevsim and veth as potential users of a common mock-up PTP hardware clock driver. The proposed idea is to create an object which emulates PTP clock operations on top of the unadjustable CLOCK_MONOTONIC_RAW plus a software-controlled time domain via a timecounter/cyclecounter and then link that PHC to the netdevsim device. The driver is fully functional for its intended purpose, and it successfully passes the PTP selftests. $ cd tools/testing/selftests/ptp/ $ ./phc.sh /dev/ptp2 TEST: settime [ OK ] TEST: adjtime [ OK ] TEST: adjfreq [ OK ] Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-7-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10net/mlx5: Expose NIC temperature via hardware monitoring kernel APIAdham Faris2-2/+15
Expose NIC temperature by implementing hwmon kernel API, which turns current thermal zone kernel API to redundant. For each one of the supported and exposed thermal diode sensors, expose the following attributes: 1) Input temperature. 2) Highest temperature. 3) Temperature label: Depends on the firmware capability, if firmware doesn't support sensors naming, the fallback naming convention would be: "sensorX", where X is the HW spec (MTMP register) sensor index. 4) Temperature critical max value: refers to the high threshold of Warning Event. Will be exposed as `tempY_crit` hwmon attribute (RO attribute). For example for ConnectX5 HCA's this temperature value will be 105 Celsius, 10 degrees lower than the HW shutdown temperature). 5) Temperature reset history: resets highest temperature. For example, for dualport ConnectX5 NIC with a single IC thermal diode sensor will have 2 hwmon directories (one for each PCI function) under "/sys/class/hwmon/hwmon[X,Y]". Listing one of the directories above (hwmonX/Y) generates the corresponding output below: $ grep -H -d skip . /sys/class/hwmon/hwmon0/* Output ======================================================================= /sys/class/hwmon/hwmon0/name:mlx5 /sys/class/hwmon/hwmon0/temp1_crit:105000 /sys/class/hwmon/hwmon0/temp1_highest:48000 /sys/class/hwmon/hwmon0/temp1_input:46000 /sys/class/hwmon/hwmon0/temp1_label:asic grep: /sys/class/hwmon/hwmon0/temp1_reset_history: Permission denied In addition, displaying the sensors data via lm_sensors generates the corresponding output below: $ sensors Output ======================================================================= mlx5-pci-0800 Adapter: PCI adapter asic: +46.0°C (crit = +105.0°C, highest = +48.0°C) mlx5-pci-0801 Adapter: PCI adapter asic: +46.0°C (crit = +105.0°C, highest = +48.0°C) CC: Jean Delvare <jdelvare@suse.com> Signed-off-by: Adham Faris <afaris@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230807180507.22984-3-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10net: annotate data-races around sock->opsEric Dumazet1-1/+1
IPV6_ADDRFORM socket option is evil, because it can change sock->ops while other threads might read it. Same issue for sk->sk_family being set to AF_INET. Adding READ_ONCE() over sock->ops reads is needed for sockets that might be impacted by IPV6_ADDRFORM. Note that mptcp_is_tcpsk() can also overwrite sock->ops. Adding annotations for all sk->sk_family reads will require more patches :/ BUG: KCSAN: data-race in ____sys_sendmsg / do_ipv6_setsockopt write to 0xffff888109f24ca0 of 8 bytes by task 4470 on cpu 0: do_ipv6_setsockopt+0x2c5e/0x2ce0 net/ipv6/ipv6_sockglue.c:491 ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012 udpv6_setsockopt+0x95/0xa0 net/ipv6/udp.c:1690 sock_common_setsockopt+0x61/0x70 net/core/sock.c:3663 __sys_setsockopt+0x1c3/0x230 net/socket.c:2273 __do_sys_setsockopt net/socket.c:2284 [inline] __se_sys_setsockopt net/socket.c:2281 [inline] __x64_sys_setsockopt+0x66/0x80 net/socket.c:2281 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888109f24ca0 of 8 bytes by task 4469 on cpu 1: sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x349/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmmsg+0x263/0x500 net/socket.c:2643 __do_sys_sendmmsg net/socket.c:2672 [inline] __se_sys_sendmmsg net/socket.c:2669 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2669 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0xffffffff850e32b8 -> 0xffffffff850da890 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 4469 Comm: syz-executor.1 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023 Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230808135809.2300241-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10block: don't make REQ_POLLED imply REQ_NOWAITJens Axboe1-1/+1
Normally these two flags do go together, as the issuer of polled IO generally cannot wait for resources that will get freed as part of IO completion. This is because that very task is the one that will complete the request and free those resources, hence that would introduce a deadlock. But it is possible to have someone else issue the polled IO, eg via io_uring if the request is punted to io-wq. For that case, it's fine to have the task block on IO submission, as it is not the same task that will be completing the IO. It's completely up to the caller to ask for both polled and nowait IO separately! If we don't allow polled IO where IOCB_NOWAIT isn't set in the kiocb, then we can run into repeated -EAGAIN submissions and not make any progress. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09Merge tag 'nf-next-2023-08-08' of ↵Jakub Kicinski2-5/+0
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter updates for net-next First 4 Patches, from Yue Haibing, remove unused prototypes in various netfilter headers. Last patch makes nfnetlink_log to always include a packet timestamp, up to now it was only included if the skb had assigned previously. From Maciej Żenczykowski. * tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nfnetlink_log: always add a timestamp netfilter: h323: Remove unused function declarations netfilter: conntrack: Remove unused function declarations netfilter: helper: Remove unused function declarations netfilter: gre: Remove unused function declaration nf_ct_gre_keymap_flush() ==================== Link: https://lore.kernel.org/r/20230808124159.19046-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09net: phy: Remove two unused function declarationsYue Haibing1-4/+0
Commit 1e2dc14509fd ("net: ethtool: Add helpers for reporting test results") declared but never implemented these function. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230808144610.19096-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09io_uring: Add io_uring command support for socketsBreno Leitao1-0/+6
Enable io_uring commands on network sockets. Create two new SOCKET_URING_OP commands that will operate on sockets. In order to call ioctl on sockets, use the file_operations->io_uring_cmd callbacks, and map it to a uring socket function, which handles the SOCKET_URING_OP accordingly, and calls socket ioctls. This patches was tested by creating a new test case in liburing. Link: https://github.com/leitao/liburing/tree/io_uring_cmd Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09tmpfs: track free_ispace instead of free_inodesHugh Dickins1-1/+1
In preparation for assigning some inode space to extended attributes, keep track of free_ispace instead of number of free_inodes: as if one tmpfs inode (and accompanying dentry) occupies very approximately 1KiB. Unsigned long is large enough for free_ispace, on 64-bit and on 32-bit: but take care to enforce the maximum. And fix the nr_blocks maximum on 32-bit: S64_MAX would be too big for it there, so say LONG_MAX instead. Delete the incorrect limited<->unlimited blocks/inodes comment above shmem_reconfigure(): leave it to the error messages below to describe. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Message-Id: <4fe1739-d9e7-8dfd-5bce-12e7339711da@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09xattr: simple_xattr_set() return old_xattr to be freedHugh Dickins1-3/+4
tmpfs wants to support limited user extended attributes, but kernfs (or cgroupfs, the only kernfs with KERNFS_ROOT_SUPPORT_USER_XATTR) already supports user extended attributes through simple xattrs: but limited by a policy (128KiB per inode) too liberal to be used on tmpfs. To allow a different limiting policy for tmpfs, without affecting the policy for kernfs, change simple_xattr_set() to return the replaced or removed xattr (if any), leaving the caller to update their accounting then free the xattr (by simple_xattr_free(), renamed from the static free_simple_xattr()). Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Message-Id: <158c6585-2aa7-d4aa-90ff-f7c3f8fe407c@google.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09shmem: stable directory offsetsChuck Lever1-0/+1
The current cursor-based directory offset mechanism doesn't work when a tmpfs filesystem is exported via NFS. This is because NFS clients do not open directories. Each server-side READDIR operation has to open the directory, read it, then close it. The cursor state for that directory, being associated strictly with the opened struct file, is thus discarded after each NFS READDIR operation. Directory offsets are cached not only by NFS clients, but also by user space libraries on those clients. Essentially there is no way to invalidate those caches when directory offsets have changed on an NFS server after the offset-to-dentry mapping changes. Thus the whole application stack depends on unchanging directory offsets. The solution we've come up with is to make the directory offset for each file in a tmpfs filesystem stable for the life of the directory entry it represents. shmem_readdir() and shmem_dir_llseek() now use an xarray to map each directory offset (an loff_t integer) to the memory address of a struct dentry. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Message-Id: <168814734331.530310.3911190551060453102.stgit@manet.1015granger.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09libfs: Add directory operations for stable offsetsChuck Lever1-0/+18
Create a vector of directory operations in fs/libfs.c that handles directory seeks and readdir via stable offsets instead of the current cursor-based mechanism. For the moment these are unused. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Message-Id: <168814732984.530310.11190772066786107220.stgit@manet.1015granger.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09shmem: Add default quota limit mount optionsLukas Czerner1-0/+8
Allow system administrator to set default global quota limits at tmpfs mount time. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230725144510.253763-7-cem@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>