Age | Commit message (Collapse) | Author | Files | Lines |
|
This commit adds table_size to register_sysctl in preparation for the
removal of the sentinel elements in the ctl_table arrays (last empty
markers). And though we do *not* remove any sentinels in this commit, we
set things up by either passing the table_size explicitly or using
ARRAY_SIZE on the ctl_table arrays.
We replace the register_syctl function with a macro that will add the
ARRAY_SIZE to the new register_sysctl_sz function. In this way the
callers that are already using an array of ctl_table structs do not
change. For the callers that pass a ctl_table array pointer, we pass the
table_size to register_sysctl_sz instead of the macro.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
We make these changes in order to prepare __register_sysctl_table and
its callers for when we remove the sentinel element (empty element at
the end of ctl_table arrays). We don't actually remove any sentinels in
this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
is available when the removal occurs.
We add a table_size argument to __register_sysctl_table and adjust
callers, all of which pass ctl_table pointers and need an explicit call
to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
order to forward the size of the array pointer received from the network
register calls.
The new table_size argument does not yet have any effect in the
init_header call which is still dependent on the sentinel's presence.
table_size *does* however drive the `kzalloc` allocation in
__register_sysctl_table with no adverse effects as the allocated memory
is either one element greater than the calculated ctl_table array (for
the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
of the calculated ctl_table array (for the call from sysctl_net.c and
register_sysctl). This approach will allows us to "just" remove the
sentinel without further changes to __register_sysctl_table as
table_size will represent the exact size for all the callers at that
point.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
The new ctl_table_size element will hold the size of the ctl_table
arrays contained in the ctl_table_header. This value should eventually
be passed by the callers to the sysctl register infrastructure. And
while this commit introduces the variable, it does not set nor use it
because that requires case by case considerations for each caller.
It provides two important things: (1) A place to put the
result of the ctl_table array calculation when it gets introduced for
each caller. And (2) the size that will be used as the additional
stopping criteria in the list_for_each_table_entry macro (to be added
when all the callers are migrated)
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Numerous production kernel configs (see [1, 2]) are choosing to enable
CONFIG_DEBUG_LIST, which is also being recommended by KSPP for hardened
configs [3]. The motivation behind this is that the option can be used
as a security hardening feature (e.g. CVE-2019-2215 and CVE-2019-2025
are mitigated by the option [4]).
The feature has never been designed with performance in mind, yet common
list manipulation is happening across hot paths all over the kernel.
Introduce CONFIG_LIST_HARDENED, which performs list pointer checking
inline, and only upon list corruption calls the reporting slow path.
To generate optimal machine code with CONFIG_LIST_HARDENED:
1. Elide checking for pointer values which upon dereference would
result in an immediate access fault (i.e. minimal hardening
checks). The trade-off is lower-quality error reports.
2. Use the __preserve_most function attribute (available with Clang,
but not yet with GCC) to minimize the code footprint for calling
the reporting slow path. As a result, function size of callers is
reduced by avoiding saving registers before calling the rarely
called reporting slow path.
Note that all TUs in lib/Makefile already disable function tracing,
including list_debug.c, and __preserve_most's implied notrace has
no effect in this case.
3. Because the inline checks are a subset of the full set of checks in
__list_*_valid_or_report(), always return false if the inline
checks failed. This avoids redundant compare and conditional
branch right after return from the slow path.
As a side-effect of the checks being inline, if the compiler can prove
some condition to always be true, it can completely elide some checks.
Since DEBUG_LIST is functionally a superset of LIST_HARDENED, the
Kconfig variables are changed to reflect that: DEBUG_LIST selects
LIST_HARDENED, whereas LIST_HARDENED itself has no dependency on
DEBUG_LIST.
Running netperf with CONFIG_LIST_HARDENED (using a Clang compiler with
"preserve_most") shows throughput improvements, in my case of ~7% on
average (up to 20-30% on some test cases).
Link: https://r.android.com/1266735 [1]
Link: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/blob/main/config [2]
Link: https://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project/Recommended_Settings [3]
Link: https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html [4]
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20230811151847.1594958-3-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Turn the list debug checking functions __list_*_valid() into inline
functions that wrap the out-of-line functions. Care is taken to ensure
the inline wrappers are always inlined, so that additional compiler
instrumentation (such as sanitizers) does not result in redundant
outlining.
This change is preparation for performing checks in the inline wrappers.
No functional change intended.
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20230811151847.1594958-2-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
[1]: "On X86-64 and AArch64 targets, this attribute changes the calling
convention of a function. The preserve_most calling convention attempts
to make the code in the caller as unintrusive as possible. This
convention behaves identically to the C calling convention on how
arguments and return values are passed, but it uses a different set of
caller/callee-saved registers. This alleviates the burden of saving and
recovering a large register set before and after the call in the caller.
If the arguments are passed in callee-saved registers, then they will be
preserved by the callee across the call. This doesn't apply for values
returned in callee-saved registers.
* On X86-64 the callee preserves all general purpose registers, except
for R11. R11 can be used as a scratch register. Floating-point
registers (XMMs/YMMs) are not preserved and need to be saved by the
caller.
* On AArch64 the callee preserve all general purpose registers, except
x0-X8 and X16-X18."
[1] https://clang.llvm.org/docs/AttributeReference.html#preserve-most
Introduce the attribute to compiler_types.h as __preserve_most.
Use of this attribute results in better code generation for calls to
very rarely called functions, such as error-reporting functions, or
rarely executed slow paths.
Beware that the attribute conflicts with instrumentation calls inserted
on function entry which do not use __preserve_most themselves. Notably,
function tracing which assumes the normal C calling convention for the
given architecture. Where the attribute is supported, __preserve_most
will imply notrace. It is recommended to restrict use of the attribute
to functions that should or already disable tracing.
Note: The additional preprocessor check against architecture should not
be necessary if __has_attribute() only returns true where supported;
also see https://github.com/ClangBuiltLinux/linux/issues/1908. But until
__has_attribute() does the right thing, we also guard by known-supported
architectures to avoid build warnings on other architectures.
The attribute may be supported by a future GCC version (see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110899).
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230811151847.1594958-1-elver@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Some HiSilicon SMMU PMCG suffers the erratum 162001900 that the PMU
disable control sometimes fail to disable the counters. This will lead
to error or inaccurate data since before we enable the counters the
counter's still counting for the event used in last perf session.
This patch tries to fix this by hardening the global disable process.
Before disable the PMU, writing an invalid event type (0xffff) to
focibly stop the counters. Correspondingly restore each events on
pmu::pmu_enable().
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20230814124012.58013-1-yangyicong@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Drops a race where 2 threads could spot a positive value and both
proceed to dec to -1, without reporting anything.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Message-Id: <20230811194814.1612336-1-mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
sharing
When NFS superblocks are created by automounting, their LSM parameters
aren't set in the fs_context struct prior to sget_fc() being called,
leading to failure to match existing superblocks.
This bug leads to messages like the following appearing in dmesg when
fscache is enabled:
NFS: Cache volume key already in use (nfs,4.2,2,108,106a8c0,1,,,,100000,100000,2ee,3a98,1d4c,3a98,1)
Fix this by adding a new LSM hook to load fc->security for submount
creation.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/165962680944.3334508.6610023900349142034.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165962729225.3357250.14350728846471527137.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/165970659095.2812394.6868894171102318796.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166133579016.3678898.6283195019480567275.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/217595.1662033775@warthog.procyon.org.uk/ # v5
Fixes: 9bc61ab18b1d ("vfs: Introduce fs_context, switch vfs_kern_mount() to it.")
Fixes: 779df6a5480f ("NFS: Ensure security label is set for root inode")
Tested-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: "Christian Brauner (Microsoft)" <brauner@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230808-master-v9-1-e0ecde888221@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Subsystems that want to implement a struct bpf_struct_ops structure to
enable struct_ops maps must currently reverse engineer how the structure
works. Given that this is meant to be a way for subsystem maintainers to
extend their subsystems using BPF, let's document it to make it a bit
easier on them.
Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20230814185908.700553-3-void@manifault.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
This commit adds a kthread-creation callback to the
_torture_create_kthread() function, which allows callers of a new
torture_create_kthread_cb() macro to specify a function to be invoked
after the kthread is created but before it is awakened for the first time.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: kernel-team@android.com
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: John Stultz <jstultz@google.com>
|
|
This reverts 6f822e1b5d9dda3d20e87365de138046e3baa03a - this helper is
used by bcachefs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20230813182636.2966159-4-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
- blk_status_to_str() - error messages
- bio_add_folio() - this should definitely be exported for everyone,
it's the modern version of bio_add_page()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-block@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20230813182636.2966159-2-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Each device cap has two modes: MAX and CUR. The driver maintains a
cache of both modes of the capabilities. For most device caps, the MAX
cap mode is never used.
Hence, remove all driver queries of the MAX mode of the said caps as
well as their helper MACROs.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but
there isn't any user who requires them.
As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used.
Thus, drop all usages and definitions of the mentioned caps above.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Even if the PF driver had no error on his part of the sync reset flow,
the firmware can see wider picture as it syncs all the PFs in the flow.
So add at end of sync reset flow check with firmware by reading MFRL
register and initialization segment that the flow had no issue from
firmware point of view too.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Summary
=======
This introduces FSCONFIG_CMD_CREATE_EXCL which will allows userspace to
implement something like mount -t ext4 --exclusive /dev/sda /B which
fails if a superblock for the requested filesystem does already exist:
Before this patch
-----------------
$ sudo ./move-mount -f xfs -o source=/dev/sda4 /A
Requesting filesystem type xfs
Mount options requested: source=/dev/sda4
Attaching mount at /A
Moving single attached mount
Setting key(source) with val(/dev/sda4)
$ sudo ./move-mount -f xfs -o source=/dev/sda4 /B
Requesting filesystem type xfs
Mount options requested: source=/dev/sda4
Attaching mount at /B
Moving single attached mount
Setting key(source) with val(/dev/sda4)
After this patch with --exclusive as a switch for FSCONFIG_CMD_CREATE_EXCL
--------------------------------------------------------------------------
$ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /A
Requesting filesystem type xfs
Request exclusive superblock creation
Mount options requested: source=/dev/sda4
Attaching mount at /A
Moving single attached mount
Setting key(source) with val(/dev/sda4)
$ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /B
Requesting filesystem type xfs
Request exclusive superblock creation
Mount options requested: source=/dev/sda4
Attaching mount at /B
Moving single attached mount
Setting key(source) with val(/dev/sda4)
Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed
Details
=======
As mentioned on the list (cf. [1]-[3]) mount requests like
mount -t ext4 /dev/sda /A are ambigous for userspace. Either a new
superblock has been created and mounted or an existing superblock has
been reused and a bind-mount has been created.
This becomes clear in the following example where two processes create
the same mount for the same block device:
P1 P2
fd_fs = fsopen("ext4"); fd_fs = fsopen("ext4");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "dax", "always"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "resuid", "1000");
// wins and creates superblock
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
// finds compatible superblock of P1
// spins until P1 sets SB_BORN and grabs a reference
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
fd_mnt1 = fsmount(fd_fs); fd_mnt2 = fsmount(fd_fs);
move_mount(fd_mnt1, "/A") move_mount(fd_mnt2, "/B")
Not just does P2 get a bind-mount but the mount options that P2
requestes are silently ignored. The VFS itself doesn't, can't and
shouldn't enforce filesystem specific mount option compatibility. It
only enforces incompatibility for read-only <-> read-write transitions:
mount -t ext4 /dev/sda /A
mount -t ext4 -o ro /dev/sda /B
The read-only request will fail with EBUSY as the VFS can't just
silently transition a superblock from read-write to read-only or vica
versa without risking security issues.
To userspace this silent superblock reuse can become a security issue in
because there is currently no straightforward way for userspace to know
that they did indeed manage to create a new superblock and didn't just
reuse an existing one.
This adds a new FSCONFIG_CMD_CREATE_EXCL command to fsconfig() that
returns EBUSY if an existing superblock would be reused. Userspace that
needs to be sure that it did create a new superblock with the requested
mount options can request superblock creation using this command. If the
command succeeds they can be sure that they did create a new superblock
with the requested mount options.
This requires the new mount api. With the old mount api it would be
necessary to plumb this through every legacy filesystem's
file_system_type->mount() method. If they want this feature they are
most welcome to switch to the new mount api.
Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on
each high-level superblock creation helper:
(1) get_tree_nodev()
Always allocate new superblock. Hence, FSCONFIG_CMD_CREATE and
FSCONFIG_CMD_CREATE_EXCL are equivalent.
The binderfs or overlayfs filesystems are examples.
(4) get_tree_keyed()
Finds an existing superblock based on sb->s_fs_info. Hence,
FSCONFIG_CMD_CREATE would reuse an existing superblock whereas
FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY.
The mqueue or nfsd filesystems are examples.
(2) get_tree_bdev()
This effectively works like get_tree_keyed().
The ext4 or xfs filesystems are examples.
(3) get_tree_single()
Only one superblock of this filesystem type can ever exist.
Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock
whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY.
The securityfs or configfs filesystems are examples.
Note that some single-instance filesystems never destroy the
superblock once it has been created during the first mount. For
example, if securityfs has been mounted at least onces then the
created superblock will never be destroyed again as long as there is
still an LSM making use it. Consequently, even if securityfs is
unmounted and the superblock seemingly destroyed it really isn't
which means that FSCONFIG_CMD_CREATE_EXCL will continue rejecting
reusing an existing superblock.
This is acceptable thugh since special purpose filesystems such as
this shouldn't have a need to use FSCONFIG_CMD_CREATE_EXCL anyway
and if they do it's probably to make sure that mount options aren't
ignored.
Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on
filesystems that make use of the low-level sget_fc() helper directly.
They're all effectively variants on get_tree_keyed(), get_tree_bdev(),
or get_tree_nodev():
(5) mtd_get_sb()
Similar logic to get_tree_keyed().
(6) afs_get_tree()
Similar logic to get_tree_keyed().
(7) ceph_get_tree()
Similar logic to get_tree_keyed().
Already explicitly allows forcing the allocation of a new superblock
via CEPH_OPT_NOSHARE. This turns it into get_tree_nodev().
(8) fuse_get_tree_submount()
Similar logic to get_tree_nodev().
(9) fuse_get_tree()
Forces reuse of existing FUSE superblock.
Forces reuse of existing superblock if passed in file refers to an
existing FUSE connection.
If FSCONFIG_CMD_CREATE_EXCL is specified together with an fd
referring to an existing FUSE connections this would cause the
superblock reusal to fail. If reusing is the intent then
FSCONFIG_CMD_CREATE_EXCL shouldn't be specified.
(10) fuse_get_tree()
-> get_tree_nodev()
Same logic as in get_tree_nodev().
(11) fuse_get_tree()
-> get_tree_bdev()
Same logic as in get_tree_bdev().
(12) virtio_fs_get_tree()
Same logic as get_tree_keyed().
(13) gfs2_meta_get_tree()
Forces reuse of existing gfs2 superblock.
Mounting gfs2meta enforces that a gf2s superblock must already
exist. If not, it will error out. Consequently, mounting gfs2meta
with FSCONFIG_CMD_CREATE_EXCL would always fail. If reusing is the
intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified.
(14) kernfs_get_tree()
Similar logic to get_tree_keyed().
(15) nfs_get_tree_common()
Similar logic to get_tree_keyed().
Already explicitly allows forcing the allocation of a new superblock
via NFS_MOUNT_UNSHARED. This effectively turns it into
get_tree_nodev().
Link: [1] https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
Link: [2] https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner
Link: [3] https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230802-vfs-super-exclusive-v2-4-95dc4e41b870@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The get_tree_single_reconf() helper isn't used anywhere. Remove it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230802-vfs-super-exclusive-v2-1-95dc4e41b870@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The PSGMII interface is similar to QSGMII. The main difference
is that the PSGMII interface combines five SGMII lines into a
single link while in QSGMII only four lines are combined.
Similarly to the QSGMII, this interface mode might also needs
special handling within the MAC driver.
It is commonly used by Qualcomm with their QCA807x PHY series and
modern WiSoC-s.
Add definitions for the PHY layer to allow to express this type
of connection between the MAC and PHY.
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The only remaining consumer is new_inode, where it showed up in 2001 as
commit c37fa164f793 ("v2.4.9.9 -> v2.4.9.10") in a historical repo [1]
with a changelog which does not mention it.
Since then the line got only touched up to keep compiling.
While it may have been of benefit back in the day, it is guaranteed to
at best not get in the way in the multicore setting -- as the code
performs *a lot* of work between the prefetch and actual lock acquire,
any contention means the cacheline is already invalid by the time the
routine calls spin_lock(). It adds spurious traffic, for short.
On top of it prefetch is notoriously tricky to use for single-threaded
purposes, making it questionable from the get go.
As such, remove it.
I admit upfront I did not see value in benchmarking this change, but I
can do it if that is deemed appropriate.
Removal from new_inode and of the entire thing are in the same patch as
requested by Linus, so whatever weird looks can be directed at that guy.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/fs/inode.c?id=c37fa164f793735b32aa3f53154ff1a7659e6442 [1]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mitigation fixes from Borislav Petkov:
"The first set of fallout fixes after the embargo madness. There will
be another set next week too.
- A first series of cleanups/unifications and documentation
improvements to the SRSO and GDS mitigations code which got
postponed to after the embargo date
- Fix the SRSO aliasing addresses assertion so that the LLVM linker
can parse it too"
* tag 'x86_bugs_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
driver core: cpu: Fix the fallback cpu_show_gds() name
x86: Move gds_ucode_mitigated() declaration to header
x86/speculation: Add cpu_show_gds() prototype
driver core: cpu: Make cpu_show_not_affected() static
x86/srso: Fix build breakage with the LLVM linker
Documentation/srso: Document IBPB aspect and fix formatting
driver core: cpu: Unify redundant silly stubs
Documentation/hw-vuln: Unify filename specification in index
|
|
Commit a3b80e107894 ("bpf: Allocate ID for bpf_link")
removed the implementation but not the declaration.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20230809140556.45836-1-yuehaibing@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Fixes for request_queue state (Ming)
- Another uuid quirk (August)
- RCU poll fix for NVMe (Ming)
- Fix for an IO stall with polled IO (me)
- Fix for blk-iocost stats enable/disable accounting (Chengming)
- Regression fix for large pages for zram (Christoph)
* tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux:
nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll
blk-iocost: fix queue stats accounting
block: don't make REQ_POLLED imply REQ_NOWAIT
block: get rid of unused plug->nowait flag
zram: take device and not only bvec offset into account
nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G
nvme-rdma: fix potential unbalanced freeze & unfreeze
nvme-tcp: fix potential unbalanced freeze & unfreeze
nvme: fix possible hang when removing a controller during error recovery
|
|
Export fs_holder_ops so that file systems that open additional block
devices can use it as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230802154131.2221419-9-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot metadata updates, down to around 1 per jiffy,
even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g backup
applications).
If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
What we need is a way to only use fine-grained timestamps when they are
being actively queried.
POSIX generally mandates that when the the mtime changes, the ctime must
also change. The kernel always stores normalized ctime values, so only
the first 30 bits of the tv_nsec field are ever used.
Use the 31st bit of the ctime tv_nsec field to indicate that something
has queried the inode for the mtime or ctime. When this flag is set,
on the next mtime or ctime update, the kernel will fetch a fine-grained
timestamp instead of the usual coarse-grained one.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Later patches will convert individual filesystems to use the new
infrastructure.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-9-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Now that all of the update_time operations are prepared for it, we can
drop the timespec64 argument from the update_time operation. Do that and
remove it from some associated functions like inode_update_time and
inode_needs_update_time.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Merge net again, after pulling in x86/bugs fixes to clang
linking errors.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross merge x86 fixes to fix clang linking errors:
ld.lld: error: ./arch/x86/kernel/vmlinux.lds:221: at least one side of the expression must be absolute
These will hopefully be downstream by the time we ship
the next batch of fixes.
* 'x86/bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Move gds_ucode_mitigated() declaration to header
x86/speculation: Add cpu_show_gds() prototype
driver core: cpu: Make cpu_show_not_affected() static
x86/srso: Fix build breakage with the LLVM linker
Documentation/srso: Document IBPB aspect and fix formatting
driver core: cpu: Unify redundant silly stubs
Documentation/hw-vuln: Unify filename specification in index
Link: https://lore.kernel.org/all/CAHk-=wj_b+FGTnevQSBAtCWuhCk=0oQ_THvthBW2hzqpOTLFmg@mail.gmail.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Linux LEDs can be requested to perform hardware accelerated blinking
to indicate link, RX, TX etc. Pass the rules for blinking to the PHY
driver, if it implements the ops needed to determine if a given
pattern can be offloaded, to offload it, and what the current offload
is. Additionally implement the op needed to get what device the LED is
for.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Daniel Golle <daniel@makrotopia.org>
Link: https://lore.kernel.org/r/20230808210436.838995-3-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A mode parameter has been added to the callback function of fix_mac_speed
to indicate the physical layer type.
The mode can be one the following:
MLO_AN_PHY - Conventional PHY
MLO_AN_FIXED - Fixed-link mode
MLO_AN_INBAND - In-band protocol
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
Link: https://lore.kernel.org/r/20230807160716.259072-2-shenwei.wang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2023-08-09
We've added 19 non-merge commits during the last 6 day(s) which contain
a total of 25 files changed, 369 insertions(+), 141 deletions(-).
The main changes are:
1) Fix array-index-out-of-bounds access when detaching from an
already empty mprog entry from Daniel Borkmann.
2) Adjust bpf selftest because of a recent llvm change
related to the cpu-v4 ISA from Eduard Zingerman.
3) Add uprobe support for the bpf_get_func_ip helper from Jiri Olsa.
4) Fix a KASAN splat due to the kernel incorrectly accepted
an invalid program using the recent cpu-v4 instruction from
Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
bpf: btf: Remove two unused function declarations
bpf: lru: Remove unused declaration bpf_lru_promote()
selftests/bpf: relax expected log messages to allow emitting BPF_ST
selftests/bpf: remove duplicated functions
bpf, docs: Fix small typo and define semantics of sign extension
selftests/bpf: Add bpf_get_func_ip test for uprobe inside function
selftests/bpf: Add bpf_get_func_ip tests for uprobe on function entry
bpf: Add support for bpf_get_func_ip helper for uprobe program
selftests/bpf: Add a movsx selftest for sign-extension of R10
bpf: Fix an incorrect verification success with movsx insn
bpf, docs: Formalize type notation and function semantics in ISA standard
bpf: change bpf_alu_sign_string and bpf_movsx_string to static
libbpf: Use local includes inside the library
bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR.
bpf: fix inconsistent return types of bpf_xdp_copy_buf().
selftests/bpf: fix the incorrect verification of port numbers.
selftests/bpf: Add test for detachment on empty mprog entry
bpf: Fix mprog detachment for empty mprog entry
bpf: bpf_struct_ops: Remove unnecessary initial values of variables
====================
Link: https://lore.kernel.org/r/20230810055123.109578-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/intel/igc/igc_main.c
06b412589eef ("igc: Add lock to safeguard global Qbv variables")
d3750076d464 ("igc: Add TransmissionOverrun counter")
drivers/net/ethernet/microsoft/mana/mana_en.c
a7dfeda6fdec ("net: mana: Fix MANA VF unload when hardware is unresponsive")
a9ca9f9ceff3 ("page_pool: split types and declarations from page_pool.h")
92272ec4107e ("eth: add missing xdp.h includes in drivers")
net/mptcp/protocol.h
511b90e39250 ("mptcp: fix disconnect vs accept race")
b8dc6d6ce931 ("mptcp: fix rcv buffer auto-tuning")
tools/testing/selftests/net/mptcp/mptcp_join.sh
c8c101ae390a ("selftests: mptcp: join: fix 'implicit EP' test")
03668c65d153 ("selftests: mptcp: join: rework detailed report")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wireless and bpf.
Still trending up in size but the good news is that the "current"
regressions are resolved, AFAIK.
We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
handling on embedded this week (most not merged yet), not sure why.
But those are all for older bugs.
Current release - regressions:
- tls: set MSG_SPLICE_PAGES consistently when handing encrypted data
over to TCP
Current release - new code bugs:
- eth: mlx5: correct IDs on VFs internal to the device (IPU)
Previous releases - regressions:
- phy: at803x: fix WoL support / reporting on AR8032
- bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from
slaves, leading to BUG_ON()
- tun: prevent tun_build_skb() from exceeding the packet size limit
- wifi: rtw89: fix 8852AE disconnection caused by RX full flags
- eth/PCI: enetc: fix probing after 6fffbc7ae137 ("PCI: Honor
firmware's device disabled status"), keep PCI devices around even
if they are disabled / not going to be probed to be able to apply
quirks on them
- eth: prestera: fix handling IPv4 routes with nexthop IDs
Previous releases - always broken:
- netfilter: re-work garbage collection to avoid races between
user-facing API and timeouts
- tunnels: fix generating ipv4 PMTU error on non-linear skbs
- nexthop: fix infinite nexthop bucket dump when using maximum
nexthop ID
- wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
Misc:
- unix: use consistent error code in SO_PEERPIDFD
- ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for
upcoming IETF RFC"
* tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
net: hns3: fix strscpy causing content truncation issue
net: tls: set MSG_SPLICE_PAGES consistently
ibmvnic: Ensure login failure recovery is safe from other resets
ibmvnic: Do partial reset on login failure
ibmvnic: Handle DMA unmapping of login buffs in release functions
ibmvnic: Unmap DMA login rsp buffer on send login fail
ibmvnic: Enforce stronger sanity checks on login response
net: mana: Fix MANA VF unload when hardware is unresponsive
netfilter: nf_tables: remove busy mark and gc batch API
netfilter: nft_set_hash: mark set element as dead when deleting from packet path
netfilter: nf_tables: adapt set backend to use GC transaction API
netfilter: nf_tables: GC transaction API to avoid race with control plane
selftests/bpf: Add sockmap test for redirecting partial skb data
selftests/bpf: fix a CI failure caused by vsock sockmap test
bpf, sockmap: Fix bug that strp_done cannot be called
bpf, sockmap: Fix map type error in sock_map_del_link
xsk: fix refcount underflow in error path
ipv6: adjust ndisc_is_useropt() to also return true for PIO
selftests: forwarding: bridge_mdb: Make test more robust
selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet
...
|
|
The newly added function has two definitions but no prototypes:
drivers/base/cpu.c:605:16: error: no previous prototype for 'cpu_show_gds' [-Werror=missing-prototypes]
Add a declaration next to the other ones for this file to avoid the
warning.
Fixes: 8974eb588283b ("x86/speculation: Add Gather Data Sampling mitigation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/all/20230809130530.1913368-1-arnd%40kernel.org
|
|
Add new PCI Device IDs required to support AMD's new Family 1Ah-based
models 00h-1Fh, 20h and 40h-4Fh.
[ bp: Zap a useless sentence. ]
Co-developed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230809035244.2722455-2-avadhut.naik@amd.com
|
|
Enable "user." extended attributes on tmpfs, limiting them by tracking
the space they occupy, and deducting that space from the limited ispace
(unless tmpfs mounted with nr_inodes=0 to leave that ispace unlimited).
tmpfs inodes and simple xattrs are both unswappable, and have to be in
lowmem on a 32-bit highmem kernel: so the ispace limit is appropriate
for xattrs, without any need for a further mount option.
Add simple_xattr_space() to give approximate but deterministic estimate
of the space taken up by each xattr: with simple_xattrs_free() outputting
the space freed if required (but kernfs and even some tmpfs usages do not
require that, so don't waste time on strlen'ing if not needed).
Security and trusted xattrs were already supported: for consistency and
simplicity, account them from the same pool; though there's a small risk
that a tmpfs with enough space before would now be considered too small.
When extended attributes are used, "df -i" does show more IUsed and less
IFree than can be explained by the inodes: document that (manpage later).
xfstests tests/generic which were not run on tmpfs before but now pass:
020 037 062 070 077 097 103 117 337 377 454 486 523 533 611 618 728
with no new failures.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Message-Id: <2e63b26e-df46-5baa-c7d6-f9a8dd3282c5@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
We'll want to use setup_bdev_super instead of duplicating it in nilfs2.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230802154131.2221419-2-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
strp_done is only called when psock->progs.stream_parser is not NULL,
but stream_parser was set to NULL by sk_psock_stop_strp(), called
by sk_psock_drop() earlier. So, strp_done can never be called.
Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock.
Change the condition for calling strp_done from judging whether
stream_parser is set to judging whether this flag is set. This flag is
only set once when strp_init() succeeds, and will never be cleared later.
Fixes: c0d95d3380ee ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
There are several cases where virtual net devices may benefit from
having a PTP clock, and these have to do with testing. I can see at
least netdevsim and veth as potential users of a common mock-up PTP
hardware clock driver.
The proposed idea is to create an object which emulates PTP clock
operations on top of the unadjustable CLOCK_MONOTONIC_RAW plus a
software-controlled time domain via a timecounter/cyclecounter and then
link that PHC to the netdevsim device.
The driver is fully functional for its intended purpose, and it
successfully passes the PTP selftests.
$ cd tools/testing/selftests/ptp/
$ ./phc.sh /dev/ptp2
TEST: settime [ OK ]
TEST: adjtime [ OK ]
TEST: adjfreq [ OK ]
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230807193324.4128292-7-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Expose NIC temperature by implementing hwmon kernel API, which turns
current thermal zone kernel API to redundant.
For each one of the supported and exposed thermal diode sensors, expose
the following attributes:
1) Input temperature.
2) Highest temperature.
3) Temperature label:
Depends on the firmware capability, if firmware doesn't support
sensors naming, the fallback naming convention would be: "sensorX",
where X is the HW spec (MTMP register) sensor index.
4) Temperature critical max value:
refers to the high threshold of Warning Event. Will be exposed as
`tempY_crit` hwmon attribute (RO attribute). For example for
ConnectX5 HCA's this temperature value will be 105 Celsius, 10
degrees lower than the HW shutdown temperature).
5) Temperature reset history: resets highest temperature.
For example, for dualport ConnectX5 NIC with a single IC thermal diode
sensor will have 2 hwmon directories (one for each PCI function)
under "/sys/class/hwmon/hwmon[X,Y]".
Listing one of the directories above (hwmonX/Y) generates the
corresponding output below:
$ grep -H -d skip . /sys/class/hwmon/hwmon0/*
Output
=======================================================================
/sys/class/hwmon/hwmon0/name:mlx5
/sys/class/hwmon/hwmon0/temp1_crit:105000
/sys/class/hwmon/hwmon0/temp1_highest:48000
/sys/class/hwmon/hwmon0/temp1_input:46000
/sys/class/hwmon/hwmon0/temp1_label:asic
grep: /sys/class/hwmon/hwmon0/temp1_reset_history: Permission denied
In addition, displaying the sensors data via lm_sensors generates the
corresponding output below:
$ sensors
Output
=======================================================================
mlx5-pci-0800
Adapter: PCI adapter
asic: +46.0°C (crit = +105.0°C, highest = +48.0°C)
mlx5-pci-0801
Adapter: PCI adapter
asic: +46.0°C (crit = +105.0°C, highest = +48.0°C)
CC: Jean Delvare <jdelvare@suse.com>
Signed-off-by: Adham Faris <afaris@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230807180507.22984-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
IPV6_ADDRFORM socket option is evil, because it can change sock->ops
while other threads might read it. Same issue for sk->sk_family
being set to AF_INET.
Adding READ_ONCE() over sock->ops reads is needed for sockets
that might be impacted by IPV6_ADDRFORM.
Note that mptcp_is_tcpsk() can also overwrite sock->ops.
Adding annotations for all sk->sk_family reads will require
more patches :/
BUG: KCSAN: data-race in ____sys_sendmsg / do_ipv6_setsockopt
write to 0xffff888109f24ca0 of 8 bytes by task 4470 on cpu 0:
do_ipv6_setsockopt+0x2c5e/0x2ce0 net/ipv6/ipv6_sockglue.c:491
ipv6_setsockopt+0x57/0x130 net/ipv6/ipv6_sockglue.c:1012
udpv6_setsockopt+0x95/0xa0 net/ipv6/udp.c:1690
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3663
__sys_setsockopt+0x1c3/0x230 net/socket.c:2273
__do_sys_setsockopt net/socket.c:2284 [inline]
__se_sys_setsockopt net/socket.c:2281 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2281
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888109f24ca0 of 8 bytes by task 4469 on cpu 1:
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x349/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmmsg+0x263/0x500 net/socket.c:2643
__do_sys_sendmmsg net/socket.c:2672 [inline]
__se_sys_sendmmsg net/socket.c:2669 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2669
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffffffff850e32b8 -> 0xffffffff850da890
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 4469 Comm: syz-executor.1 Not tainted 6.4.0-rc5-syzkaller-00313-g4c605260bc60 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230808135809.2300241-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Normally these two flags do go together, as the issuer of polled IO
generally cannot wait for resources that will get freed as part of IO
completion. This is because that very task is the one that will complete
the request and free those resources, hence that would introduce a
deadlock.
But it is possible to have someone else issue the polled IO, eg via
io_uring if the request is punted to io-wq. For that case, it's fine to
have the task block on IO submission, as it is not the same task that
will be completing the IO.
It's completely up to the caller to ask for both polled and nowait IO
separately! If we don't allow polled IO where IOCB_NOWAIT isn't set in
the kiocb, then we can run into repeated -EAGAIN submissions and not
make any progress.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter updates for net-next
First 4 Patches, from Yue Haibing, remove unused prototypes in
various netfilter headers.
Last patch makes nfnetlink_log to always include a packet timestamp,
up to now it was only included if the skb had assigned previously.
From Maciej Żenczykowski.
* tag 'nf-next-2023-08-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nfnetlink_log: always add a timestamp
netfilter: h323: Remove unused function declarations
netfilter: conntrack: Remove unused function declarations
netfilter: helper: Remove unused function declarations
netfilter: gre: Remove unused function declaration nf_ct_gre_keymap_flush()
====================
Link: https://lore.kernel.org/r/20230808124159.19046-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 1e2dc14509fd ("net: ethtool: Add helpers for reporting test results")
declared but never implemented these function.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230808144610.19096-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Enable io_uring commands on network sockets. Create two new
SOCKET_URING_OP commands that will operate on sockets.
In order to call ioctl on sockets, use the file_operations->io_uring_cmd
callbacks, and map it to a uring socket function, which handles the
SOCKET_URING_OP accordingly, and calls socket ioctls.
This patches was tested by creating a new test case in liburing.
Link: https://github.com/leitao/liburing/tree/io_uring_cmd
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for assigning some inode space to extended attributes,
keep track of free_ispace instead of number of free_inodes: as if one
tmpfs inode (and accompanying dentry) occupies very approximately 1KiB.
Unsigned long is large enough for free_ispace, on 64-bit and on 32-bit:
but take care to enforce the maximum. And fix the nr_blocks maximum on
32-bit: S64_MAX would be too big for it there, so say LONG_MAX instead.
Delete the incorrect limited<->unlimited blocks/inodes comment above
shmem_reconfigure(): leave it to the error messages below to describe.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Message-Id: <4fe1739-d9e7-8dfd-5bce-12e7339711da@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
tmpfs wants to support limited user extended attributes, but kernfs
(or cgroupfs, the only kernfs with KERNFS_ROOT_SUPPORT_USER_XATTR)
already supports user extended attributes through simple xattrs: but
limited by a policy (128KiB per inode) too liberal to be used on tmpfs.
To allow a different limiting policy for tmpfs, without affecting the
policy for kernfs, change simple_xattr_set() to return the replaced or
removed xattr (if any), leaving the caller to update their accounting
then free the xattr (by simple_xattr_free(), renamed from the static
free_simple_xattr()).
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Message-Id: <158c6585-2aa7-d4aa-90ff-f7c3f8fe407c@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The current cursor-based directory offset mechanism doesn't work
when a tmpfs filesystem is exported via NFS. This is because NFS
clients do not open directories. Each server-side READDIR operation
has to open the directory, read it, then close it. The cursor state
for that directory, being associated strictly with the opened
struct file, is thus discarded after each NFS READDIR operation.
Directory offsets are cached not only by NFS clients, but also by
user space libraries on those clients. Essentially there is no way
to invalidate those caches when directory offsets have changed on
an NFS server after the offset-to-dentry mapping changes. Thus the
whole application stack depends on unchanging directory offsets.
The solution we've come up with is to make the directory offset for
each file in a tmpfs filesystem stable for the life of the directory
entry it represents.
shmem_readdir() and shmem_dir_llseek() now use an xarray to map each
directory offset (an loff_t integer) to the memory address of a
struct dentry.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Message-Id: <168814734331.530310.3911190551060453102.stgit@manet.1015granger.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Create a vector of directory operations in fs/libfs.c that handles
directory seeks and readdir via stable offsets instead of the
current cursor-based mechanism.
For the moment these are unused.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Message-Id: <168814732984.530310.11190772066786107220.stgit@manet.1015granger.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Allow system administrator to set default global quota limits at tmpfs
mount time.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230725144510.253763-7-cem@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|