| Age | Commit message (Collapse) | Author | Files | Lines |
|
AF_XDP bind currently accepts zero-copy pool configurations without
verifying that the device MTU fits into the usable frame space provided
by the UMEM chunk.
This becomes a problem since we started to respect tailroom which is
subtracted from chunk_size (among with headroom). 2k chunk size might
not provide enough space for standard 1500 MTU, so let us catch such
settings at bind time. Furthermore, validate whether underlying HW will
be able to satisfy configured MTU wrt XSK's frame size multiplied by
supported Rx buffer chain length (that is exposed via
net_device::xdp_zc_max_segs).
Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-5-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently xp_assign_dev_shared() is missing XDP_USE_SG being propagated
to flags so set it in order to preserve mtu check that is supposed to be
done only when no multi-buffer setup is in picture.
Also, this flag has the same value as XDP_UMEM_TX_SW_CSUM so we could
get unexpected SG setups for software Tx checksums. Since csum flag is
UAPI, modify value of XDP_UMEM_SG_FLAG.
Fixes: d609f3d228a8 ("xsk: add multi-buffer support for sockets sharing umem")
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-4-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Multi-buffer XDP stores information about frags in skb_shared_info that
sits at the tailroom of a packet. The storage space is reserved via
xdp_data_hard_end():
((xdp)->data_hard_start + (xdp)->frame_sz - \
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
and then we refer to it via macro below:
static inline struct skb_shared_info *
xdp_get_shared_info_from_buff(const struct xdp_buff *xdp)
{
return (struct skb_shared_info *)xdp_data_hard_end(xdp);
}
Currently we do not respect this tailroom space in multi-buffer AF_XDP
ZC scenario. To address this, introduce xsk_pool_get_tailroom() and use
it within xsk_pool_get_rx_frame_size() which is used in ZC drivers to
configure length of HW Rx buffer.
Typically drivers on Rx Hw buffers side work on 128 byte alignment so
let us align the value returned by xsk_pool_get_rx_frame_size() in order
to avoid addressing this on driver's side. This addresses the fact that
idpf uses mentioned function *before* pool->dev being set so we were at
risk that after subtracting tailroom we would not provide 128-byte
aligned value to HW.
Since xsk_pool_get_rx_frame_size() is actively used in xsk_rcv_check()
and __xsk_rcv(), add a variant of this routine that will not include 128
byte alignment and therefore old behavior is preserved.
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-3-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The current headroom validation in xdp_umem_reg() could leave us with
insufficient space dedicated to even receive minimum-sized ethernet
frame. Furthermore if multi-buffer would come to play then
skb_shared_info stored at the end of XSK frame would be corrupted.
HW typically works with 128-aligned sizes so let us provide this value
as bare minimum.
Multi-buffer setting is known later in the configuration process so
besides accounting for 128 bytes, let us also take care of tailroom space
upfront.
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 99e3a236dd43 ("xsk: Add missing check on user supplied headroom size")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260402154958.562179-2-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the locally defined for_each_ip6_tunnel_rcu macro and use
the generic for_each_ip_tunnel_rcu from linux/if_tunnel.h instead.
This eliminates code duplication and ensures consistency across
the kernel tunnel implementations.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260403084619.4107978-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Anton Protopopov says:
====================
Properly load values from insn_arays with non-zero offsets
The PTR_TO_INSN is always loaded via BPF_LDX_MEM instruction.
However, the verifier doesn't properly verify such loads when the
offset is not zero. Fix this and extend selftests with more scenarios.
v2 -> v3:
* Add a C-level selftest which triggers a load with nonzero offset (Alexei)
* Rephrase commit messages a bit
v2: https://lore.kernel.org/bpf/20260402184647.988132-1-a.s.protopopov@gmail.com/
v1: https://lore.kernel.org/bpf/20260401161529.681755-1-a.s.protopopov@gmail.com
====================
Link: https://patch.msgid.link/20260406160141.36943-1-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A `gotox rX` instruction accepts only values of type PTR_TO_INSN.
The only way to create such a value is to load it from a map of
type insn_array:
rX = *(rY + offset) # rY was read from an insn_array
...
gotox rX
Add instruction-level and C-level selftests to validate loads
with nonzero offsets.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260406160141.36943-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When a pointer to PTR_TO_INSN is dereferenced, the offset field
of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier
to not ignore this field.
Reported-by: Jiyong Yang <ksur673@gmail.com>
Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260406160141.36943-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Apparently, struct bpf_empty_prog_array exists entirely to populate a
single element of "items" in a global variable. "null_prog" is only
used during the initializer.
None of this is needed; globals will be correctly sized with an array
initializer of a flexible-array member.
So, remove struct bpf_empty_prog_array and adjust the rest of the code,
accordingly.
With these changes, fix the following warnings:
./include/linux/bpf.h:2369:31: warning: structure containing a flexible
array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When net.core.skb_defer_max is adjusted to zero, napi_consume_skb()
shouldn't go into that deeper in skb_attempt_defer_free() because it adds
an additional pair of local_bh_enable/disable() which is evidently not
needed. Advancing the check of the static key saves more cycles and
benefits non defer case.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Link: https://patch.msgid.link/20260402034114.65766-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Daniel Golle says:
====================
net: dsa: mxl862xx: add support for bridge offloading
As a next step to complete the mxl862xx DSA driver, add support for
offloading forwarding between bridged ports to the switch hardware.
This works pretty much without any big surprises, apart from two
subtleties:
* per-port control over flooding behavior has to be implemented by
(ab)using a 0-rate QoS meters as stopper in lack of any better
option.
* STP state transition unconditionally enables learning on a port
even if it was previously explicitely disabled (a firmware bug)
Note that as the driver is still lacking all VLAN features (which
are going to be added next), at this point some of the
bridge_vlan_aware.sh tests are failing after applying this series.
This is expected and cannot be avoided without implementing
port_vlan_filtering + port_vlan_add/del. And adding both bridge and
VLAN offloading at the same time would be too much for anyone to
review, so VLAN support is going to be submitted in a follow-up
series immediately after this series has been accepted.
All other relevant selftests (including bridge_vlan_unaware.sh) are
still passing.
Inspired by the comments received from Paolo Abeni as reply to v5
the driver now no longer caches bridge port membership in the
driver, but instead imports an existing helper from yt921x.c to dsa.h
in order to allow the driver to easily iterate over bridge members.
The mapping between DSA bridge num and firmware bridge ID is done
using a simple fixed-size array in mxl862xx_priv.
====================
Link: https://patch.msgid.link/cover.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement joining and leaving bridges as well as add, delete and dump
operations on isolated FDBs, port MDB membership management, and
setting a port's STP state.
The switch supports a maximum of 63 bridges, however, up to 12 may
be used as "single-port bridges" to isolate standalone ports.
Allowing up to 48 bridges to be offloaded seems more than enough on
that hardware, hence that is set as max_num_bridges.
A total of 128 bridge ports are supported in the bridge portmap, and
virtual bridge ports have to be used eg. for link-aggregation, hence
potentially exceeding the number of hardware ports.
The firmware-assigned bridge identifier (FID) for each offloaded bridge
is stored in an array used to map DSA bridge num to firmware bridge ID,
avoiding the need for a driver-private bridge tracking structure.
Bridge member portmaps are rebuilt on join/leave using
dsa_switch_for_each_bridge_member().
As there are now more users of the BRIDGEPORT_CONFIG_SET API and the
state of each port is cached locally, introduce a helper function
mxl862xx_set_bridge_port(struct dsa_switch *ds, int port) which
applies the cached per-port state to hardware. For standalone user
ports (dp->bridge == NULL), it additionally resets the port to
single-port bridge state: CPU-only portmap, learning and flooding
disabled. The CPU port path sets its state explicitly before calling
this helper and is therefore not affected by the reset.
Note that MASK_VLAN_BASED_MAC_LEARNING is intentionally absent from
the firmware write mask. After mxl862xx_reset(), the firmware
initialises all VLAN-based MAC learning fields to 0 (disabled), so
SVL is the active mode by default without having to set it explicitly.
Note that there is no convenient way to control flooding on per-port
level, so the driver is using a 0-rate QoS meter setup as a stopper in
lack of any better option. In order to be perfect the firmware-enforced
minimum bucket size is bypassed by directly writing 0s to the relevant
registers -- without that at least one 64-byte packet could still
pass before the meter would change from 'yellow' into 'red' state.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/dd079180e2098e5f9626fcd149b9bad9a1b5a1b2.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The MxL862xx offloads bridge forwarding in hardware, so set
dsa_default_offload_fwd_mark() to avoid duplicate forwarding of
packets of (eg. flooded) frames arriving at the CPU port.
Link-local frames are directly trapped to the CPU port only, so don't
set dsa_default_offload_fwd_mark() on those.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/e1161c90894ddc519c57dc0224b3a0f6bfa1d2d6.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Drivers that offload bridges need to iterate over the ports that are
members of a given bridge, for example to rebuild per-port forwarding
bitmaps when membership changes. Currently drivers typically open-code
this by combining dsa_switch_for_each_user_port() with a
dsa_port_offloads_bridge_dev() check, or cache bridge membership
within the driver.
Add dsa_switch_for_each_bridge_member() macro to express this pattern
directly, and use it for the existing dsa_bridge_ports() inline
helper.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/e7136aaa26773f39e805a00fe4ecf13cd2b83fc0.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The yt921x driver contains a helper to create a bitmap of ports
which are members of a bridge.
Move the helper as static inline function into dsa.h, so other driver
can make use of it as well.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/4f8bbfce3e4e3a02064fc4dc366263136c6e0383.1775049897.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A common pattern in epoll network servers is to eagerly accept all
pending connections from the non-blocking listening socket after
epoll_wait indicates the socket is ready by calling accept in a loop
until EAGAIN is returned indicating that the backlog is empty.
Scheduling a timeout for a non-blocking accept with an empty backlog
meant AF_VSOCK sockets used by epoll network servers incurred hundreds
of microseconds of additional latency per accept loop compared to
AF_INET or AF_UNIX sockets.
Signed-off-by: Laurence Rowe <laurencerowe@gmail.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260402204918.130395-1-laurencerowe@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit f05d26198cf2 ("psp: add stats from psp spec to driver facing
api") added device statistics (rx-packets, rx-bytes, rx-auth-fail,
rx-error, rx-bad, tx-packets, tx-bytes, tx-error) to the stats
attribute-set but did not add them to the get-stats operation reply
attributes. The kernel reports these attributes in the reply, so
list them in the spec to match.
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260403-psp-yaml-fix-v1-1-dacee0663903@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Sashiko reports:
> mctp_dst_from_route() increments the device reference count by calling
> mctp_dev_hold(). When a valid route is found and dst is NULL, the
> structure copy is bypassed and rc is set to 0.
Instead of optimistically creating a dst from the final route (then
releasing it if the saddr is invalid), perform the saddr check first.
This means we don't have an unuecessary hold/release on the dev, which
could leak if the dst pointer is NULL. No caller passes a NULL dst at
present though (so the leak is not possible), but this is an intended
use of mctp_dst_from_route().
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260403-dev-mctp-dst-defer-v1-1-9c2c55faf9e9@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Sashiko reports:
> This isn't a bug in the core networking code, but the addr parameter
> appears to be ignored here.
In mctp_test_create_dev_with_addr(), we are ignoring the addr argument
and just using `8`. Use the passed address instead.
All invocations use 8 anyway, so no effective change at present.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Reviewed-by: Simon Horman <horms@verge.net.au>
Link: https://patch.msgid.link/20260403-dev-mctp-fix-test-addr-v1-1-b7fa789cdd9b@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Sometimes it's hard to spot the ok / not ok lines in the output.
This is especially true for the GRO tests which retries a lot
so there's a wall of non-fatal output printed.
Try to color the crucial lines green / red / yellow when running
in a terminal.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Joe Damato <joe@dama.to>
Link: https://patch.msgid.link/20260402215444.1589893-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Kumar Kartikeya Dwivedi says:
====================
Allow variable offsets for syscall PTR_TO_CTX
Enable pointer modification with variable offsets accumulated in the
register for PTR_TO_CTX for syscall programs where it won't be
rewritten, and the context is user-supplied and checked against the max
offset. See patches for details. Fixed offset support landed in [0].
By combining this set with [0], examples like the one below should
succeed verification now.
SEC("syscall")
int prog(void *ctx) {
int *arr = ctx;
int i;
bpf_for(i, 0, 100)
arr[i] *= i;
return 0;
}
[0]: https://lore.kernel.org/bpf/20260227005725.1247305-1-memxor@gmail.com
Changelog:
----------
v4 -> v5
v4: https://lore.kernel.org/bpf/20260401122818.2240807-1-memxor@gmail.com
* Use is_var_ctx_off_allowed() consistently.
* Add acks. (Emil)
v3 -> v4
v3: https://lore.kernel.org/bpf/20260318103526.2590079-1-memxor@gmail.com
* Drop comment around describing choice of fixed or variable offsets. (Eduard)
* Simplify offset adjustment for different cases. (Eduard)
* Add PTR_TO_CTX case in __check_mem_access(). (Eduard)
* Drop aligned access constraint from syscall_prog_is_valid_access().
* Wrap naked checks for BPF_PROG_TYPE_SYSCALL in a utility function. (Eduard)
* Split tests into separate clean up and addition patches. (Eduard)
* Remove CAP_SYS_ADMIN changes. (Eduard)
* Enable unaligned access to syscall ctx, add tests.
* Add more tests for various corner cases.
* Add acks. (Puranjay, Mykyta)
v2 -> v3
v2: https://lore.kernel.org/bpf/20260318075133.1031781-1-memxor@gmail.com
* Prevent arg_type for KF_ARG_PTR_TO_CTX from applying to other cases
due to preceding fallthrough. (Gemini/Sashiko)
v1 -> v2
v1: https://lore.kernel.org/bpf/20260317111850.2107846-2-memxor@gmail.com
* Harden check_func_arg_reg_off check with ARG_PTR_TO_CTX.
* Add tests for unmodified ctx into tail calls.
* Squash unmodified ctx change into base commit.
* Add Reviewed-by's from Emil.
====================
Link: https://patch.msgid.link/20260406194403.1649608-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Ensure we reject programs that access beyond the maximum syscall ctx
size, i.e. U16_MAX either through direct accesses or helpers/kfuncs.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add coverage for unaligned access with fixed offsets and variable
offsets, and through helpers or kfuncs.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Ensure that global subprogs and tail calls can only accept an unmodified
PTR_TO_CTX for syscall programs. For all other program types, fixed or
variable offsets on PTR_TO_CTX is rejected when passed into an argument
of any call instruction type, through the unified logic of
check_func_arg_reg_off.
Finally, add a positive example of a case that should succeed with all
our previous changes.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add various tests to exercise fixed and variable offsets on PTR_TO_CTX
for syscall programs, and cover disallowed cases for other program types
lacking convert_ctx_access callback. Load verifier_ctx with CAP_SYS_ADMIN
so that kfunc related logic can be tested. While at it, convert assembly
tests to C. Unfortunately, ctx_pointer_to_helper_2's unpriv case conflicts
with usage of kfuncs in the file and cannot be run.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Convert existing tests from ASM to C, in prep for future changes to add
more comprehensive tests.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Don't reject usage of fixed unaligned offsets for syscall ctx. Tests
will be added in later commits. Unaligned offsets already work for
variable offsets.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Allow accessing PTR_TO_CTX with variable offsets in syscall programs.
Fixed offsets are already enabled for all program types that do not
convert their ctx accesses, since the changes we made in the commit
de6c7d99f898 ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note
that we also lift the restriction on passing syscall context into
helpers, which was not permitted before, and passing modified syscall
context into kfuncs.
The structure of check_mem_access can be mostly shared and preserved,
but we must use check_mem_region_access to correctly verify access with
variable offsets.
The check made in check_helper_mem_access is hardened to only allow
PTR_TO_CTX for syscall programs to be passed in as helper memory. This
was the original intention of the existing code anyway, and it makes
little sense for other program types' context to be utilized as a memory
buffer. In case a convincing example presents itself in the future, this
check can be relaxed further.
We also no longer use the last-byte access to simulate helper memory
access, but instead go through check_mem_region_access. Since this no
longer updates our max_ctx_offset, we must do so manually, to keep track
of the maximum offset at which the program ctx may be accessed.
Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not
relax any fixed or variable offset constraints around PTR_TO_CTX even in
syscall programs, and require them to be passed unmodified. There are
several reasons why this is necessary. First, if we pass a modified ctx,
then the global subprog's accesses will not update the max_ctx_offset to
its true maximum offset, and can lead to out of bounds accesses. Second,
tail called program (or extension program replacing global subprog) where
their max_ctx_offset exceeds the program they are being called from can
also cause issues. For the latter, unmodified PTR_TO_CTX is the first
requirement for the fix, the second is ensuring max_ctx_offset >= the
program they are being called from, which has to be a separate change
not made in this commit.
All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and
make our relaxation around offsets conditional on it.
Drop coverage of syscall tests from verifier_ctx.c temporarily for
negative cases until they are updated in subsequent commits.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
[Why]
e1000_set_eeprom() performs a read-modify-write operation when the write
range is not word-aligned. This requires reading the first and last words
of the range from the EEPROM to preserve the unmodified bytes.
However, the code does not check the return value of e1000_read_eeprom().
If the read fails, the operation continues using uninitialized data from
eeprom_buff. This results in corrupted data being written back to the
EEPROM for the boundary words.
Add the missing error checks and abort the operation if reading fails.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Co-developed-by: Iskhakov Daniil <dish@amicon.ru>
Signed-off-by: Iskhakov Daniil <dish@amicon.ru>
Signed-off-by: Agalakov Daniil <ade@amicon.ru>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
When an AF_XDP zero-copy application terminates abruptly (e.g., kill -9),
the XSK buffer pool is destroyed but NAPI polling continues.
igb_clean_rx_irq_zc() repeatedly returns the full budget, preventing
napi_complete_done() from clearing NAPI_STATE_SCHED.
igb_down() calls napi_synchronize() before napi_disable() for each queue
vector. napi_synchronize() spins waiting for NAPI_STATE_SCHED to clear,
which never happens. igb_down() blocks indefinitely, the TX watchdog
fires, and the TX queue remains permanently stalled.
napi_disable() already handles this correctly: it sets NAPI_STATE_DISABLE.
After a full-budget poll, __napi_poll() checks napi_disable_pending(). If
set, it forces completion and clears NAPI_STATE_SCHED, breaking the loop
that napi_synchronize() cannot.
napi_synchronize() was added in commit 41f149a285da ("igb: Fix possible
panic caused by Rx traffic arrival while interface is down").
napi_disable() provides stronger guarantees: it prevents further
scheduling and waits for any active poll to exit.
Other Intel drivers (ixgbe, ice, i40e) use napi_disable() without a
preceding napi_synchronize() in their down paths.
Remove redundant napi_synchronize() call and reorder napi_disable()
before igb_set_queue_napi() so the queue-to-NAPI mapping is only
cleared after polling has fully stopped.
Fixes: 2c6196013f84 ("igb: Add AF_XDP zero-copy Rx support")
Cc: stable@vger.kernel.org
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Alex Dvoretsky <advoretsky@gmail.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Commit a7075f501bd3 ("ixgbevf: fix mailbox API compatibility by
negotiating supported features") added the .negotiate_features callback
to ixgbe_mac_operations and populated it in ixgbevf_mac_ops, but forgot
to add it to ixgbevf_hv_mac_ops. This leaves the function pointer NULL
on Hyper-V VMs.
During probe, ixgbevf_negotiate_api() calls ixgbevf_set_features(),
which unconditionally dereferences hw->mac.ops.negotiate_features().
On Hyper-V this results in a NULL pointer dereference:
BUG: kernel NULL pointer dereference, address: 0000000000000000
[...]
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine [...]
Workqueue: events work_for_cpu_fn
RIP: 0010:0x0
[...]
Call Trace:
ixgbevf_negotiate_api+0x66/0x160 [ixgbevf]
ixgbevf_sw_init+0xe4/0x1f0 [ixgbevf]
ixgbevf_probe+0x20f/0x4a0 [ixgbevf]
local_pci_probe+0x50/0xa0
work_for_cpu_fn+0x1a/0x30
[...]
Add ixgbevf_hv_negotiate_features_vf() that returns -EOPNOTSUPP and
wire it into ixgbevf_hv_mac_ops. The caller already handles -EOPNOTSUPP
gracefully.
Fixes: a7075f501bd3 ("ixgbevf: fix mailbox API compatibility by negotiating supported features")
Reported-by: Xiaoqiang Xiong <xxiong@redhat.com>
Closes: https://issues.redhat.com/browse/RHEL-155455
Assisted-by: Claude:claude-4.6-opus-high Cursor
Tested-by: Xiaoqiang Xiong <xxiong@redhat.com>
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
ixgbe_get_drvinfo() calls ixgbe_refresh_fw_version() on every ethtool
query for e610 adapters. That ends up in ixgbe_discover_flash_size(),
which bisects the full 16 MB NVM space issuing one ACI command per
step (~20 ms each, ~24 steps total = ~500 ms).
Profiling on an idle E610-XAT2 system with telegraf scraping ethtool
stats every 10 seconds:
kretprobe:ixgbe_get_drvinfo took 527603 us
kretprobe:ixgbe_get_drvinfo took 523978 us
kretprobe:ixgbe_get_drvinfo took 552975 us
kretprobe:ice_get_drvinfo took 3 us
kretprobe:igb_get_drvinfo took 2 us
kretprobe:i40e_get_drvinfo took 5 us
The half-second stall happens under the RTNL lock, causing visible
latency on ip-link and friends.
The FW version can only change after an EMPR reset. All flash data is
already populated at probe time and the cached adapter->eeprom_id is
what get_drvinfo should be returning. The only place that needs to
trigger a re-read is ixgbe_devlink_reload_empr_finish(), right after
the EMPR completes and new firmware is running. Additionally, refresh
the FW version in ixgbe_reinit_locked() so that any PF that undergoes a
reinit after an EMPR (e.g. triggered by another PF's devlink reload)
also picks up the new version in adapter->eeprom_id.
ixgbe_devlink_info_get() keeps its refresh call for explicit
"devlink dev info" queries, which is fine given those are user-initiated.
Fixes: c9e563cae19e ("ixgbe: add support for devlink reload")
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The E825C SyncE support added in commit ad1df4f2d591 ("ice: dpll:
Support E825-C SyncE and dynamic pin discovery") introduced a SyncE
reconfiguration block in ice_ptp_link_change() that prevents
ice_ptp_port_phy_restart() from being called in several error paths.
Without the PHY restart, PTP timestamps stop working after any link
change event.
There are three ways the PHY restart gets blocked:
1. When DPLL initialization fails (e.g. missing ACPI firmware node
properties), ICE_FLAG_DPLL is not set and the function returns early
before reaching the PHY restart.
2. When ice_tspll_bypass_mux_active_e825c() fails to read the CGU
register, WARN_ON_ONCE fires and the function returns early.
3. When ice_tspll_cfg_synce_ethdiv_e825c() fails to configure the
clock divider for an active pin, same early return.
SyncE and PTP are independent features. SyncE reconfiguration failures
must not prevent the PTP PHY restart that is essential for timestamp
recovery after link changes.
Fix by making the entire SyncE block conditional on ICE_FLAG_DPLL
without an early return, and replacing the WARN_ON_ONCE + return error
handling inside the loop with dev_err_once + break. The function always
proceeds to ice_ptp_port_phy_restart() regardless of SyncE errors.
Fixes: ad1df4f2d591 ("ice: dpll: Support E825-C SyncE and dynamic pin discovery")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
In VFIO passthrough setups, it is possible to pass through only a PF
which doesn't own the source timer. In that case the PTP controlling PF
(adapter->ctrl_pf) is never initialized in the VM, so ice_get_ctrl_ptp()
returns NULL and triggers WARN_ON() in ice_ptp_setup_pf().
Since this is an expected behavior in that configuration, replace
WARN_ON() with an informational message and return -EOPNOTSUPP.
Fixes: e800654e85b5 ("ice: Use ice_adapter for PTP shared data instead of auxdev")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Set the payload size before forwarding the reply to the async handler.
Without this, xn->reply_sz will be 0 and idpf_mac_filter_async_handler()
will never get past the size check.
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager")
Cc: stable@vger.kernel.org
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Li Li <boolli@google.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Protect the set_bit() operation for the free_xn bitmask in
idpf_vc_xn_push_free(), to make the locking consistent with rest of the
code and avoid potential races in that logic.
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager")
Cc: stable@vger.kernel.org
Reported-by: Ray Zhang <sgzhang@google.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Switch from using the completion's raw spinlock to a local lock in the
idpf_vc_xn struct. The conversion is safe because complete/_all() are
called outside the lock and there is no reason to share the completion
lock in the current logic. This avoids invalid wait context reported by
the kernel due to the async handler taking BH spinlock:
[ 805.726977] =============================
[ 805.726991] [ BUG: Invalid wait context ]
[ 805.727006] 7.0.0-rc2-net-devq-031026+ #28 Tainted: G S OE
[ 805.727026] -----------------------------
[ 805.727038] kworker/u261:0/572 is trying to lock:
[ 805.727051] ff190da6a8dbb6a0 (&vport_config->mac_filter_list_lock){+...}-{3:3}, at: idpf_mac_filter_async_handler+0xe9/0x260 [idpf]
[ 805.727099] other info that might help us debug this:
[ 805.727111] context-{5:5}
[ 805.727119] 3 locks held by kworker/u261:0/572:
[ 805.727132] #0: ff190da6db3e6148 ((wq_completion)idpf-0000:83:00.0-mbx){+.+.}-{0:0}, at: process_one_work+0x4b5/0x730
[ 805.727163] #1: ff3c6f0a6131fe50 ((work_completion)(&(&adapter->mbx_task)->work)){+.+.}-{0:0}, at: process_one_work+0x1e5/0x730
[ 805.727191] #2: ff190da765190020 (&x->wait#34){+.+.}-{2:2}, at: idpf_recv_mb_msg+0xc8/0x710 [idpf]
[ 805.727218] stack backtrace:
...
[ 805.727238] Workqueue: idpf-0000:83:00.0-mbx idpf_mbx_task [idpf]
[ 805.727247] Call Trace:
[ 805.727249] <TASK>
[ 805.727251] dump_stack_lvl+0x77/0xb0
[ 805.727259] __lock_acquire+0xb3b/0x2290
[ 805.727268] ? __irq_work_queue_local+0x59/0x130
[ 805.727275] lock_acquire+0xc6/0x2f0
[ 805.727277] ? idpf_mac_filter_async_handler+0xe9/0x260 [idpf]
[ 805.727284] ? _printk+0x5b/0x80
[ 805.727290] _raw_spin_lock_bh+0x38/0x50
[ 805.727298] ? idpf_mac_filter_async_handler+0xe9/0x260 [idpf]
[ 805.727303] idpf_mac_filter_async_handler+0xe9/0x260 [idpf]
[ 805.727310] idpf_recv_mb_msg+0x1c8/0x710 [idpf]
[ 805.727317] process_one_work+0x226/0x730
[ 805.727322] worker_thread+0x19e/0x340
[ 805.727325] ? __pfx_worker_thread+0x10/0x10
[ 805.727328] kthread+0xf4/0x130
[ 805.727333] ? __pfx_kthread+0x10/0x10
[ 805.727336] ret_from_fork+0x32c/0x410
[ 805.727345] ? __pfx_kthread+0x10/0x10
[ 805.727347] ret_from_fork_asm+0x1a/0x30
[ 805.727354] </TASK>
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager")
Cc: stable@vger.kernel.org
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: Ray Zhang <sgzhang@google.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
kunit.py will attempt to catch SIGINT / ^C in order to ensure the TTY isn't
messed up, but never actually attempts to terminate the running kernel (be
it UML or QEMU). This can lead to a bit of frustration if the kernel has
crashed or hung.
Terminate the kernel process in the signal handler, if it's running. This
requires plumbing through the process handle in a few more places (and
having some checks to see if the kernel is still running in places where it
may have already been killed).
Reported-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Closes: https://lore.kernel.org/all/aaFmiAmg9S18EANA@smile.fi.intel.com/
Signed-off-by: David Gow <david@davidgow.net>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|
|
run_kernel() cleanup and signal_handler() invoke stty unconditionally.
When stdin is not a tty (for example in CI or unit tests), this writes
noise to stderr.
Call stty only when stdin is a tty.
Add regression tests for these paths:
- run_kernel() with non-tty stdin
- signal_handler() with non-tty stdin
- signal_handler() with tty stdin
Signed-off-by: Shuvam Pandey <shuvampandey1@gmail.com>
Reviewed-by: David Gow <david@davidgow.net>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|
|
If no KTAP header is found in the kernel output (e.g., because the kernel
crashed before the KUnit executor was run), it's very useful to re-run the
test with --raw_output=all, as that will show any error output (such as a
stacktrace, log message, BUG, etc). This is not particularly intuitive,
however, as --raw_output=all is not well known.
Add an extra log line to advertise --raw_output=all in this case, as it's
a terrible user experience to just get "Did any KUnit tests run?"
Signed-off-by: David Gow <david@davidgow.net>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|
|
Currently, kunit.py allows listing all individual tests via --list_tests.
However, users often need to see only the available test suites.
Add --list_suites to show suites. This option parses the test list output
from the kernel and prints only the suite names.
Example of the output of --list_suites:
example_init
miscdev_init
printk-ringbuffer
Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com>
Reviewed-by: David Gow <david@davidgow.net>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
|
|
wbt_init_enable_default() uses WARN_ON_ONCE to check for failures from
wbt_alloc() and wbt_init(). However, both are expected failure paths:
- wbt_alloc() can return NULL under memory pressure (-ENOMEM)
- wbt_init() can fail with -EBUSY if wbt is already registered
syzbot triggers this by injecting memory allocation failures during MTD
partition creation via ioctl(BLKPG), causing a spurious warning.
wbt_init_enable_default() is a best-effort initialization called from
blk_register_queue() with a void return type. Failure simply means the
disk operates without writeback throttling, which is harmless.
Replace WARN_ON_ONCE with plain if-checks, consistent with how
wbt_set_lat() in the same file already handles these failures. Add a
pr_warn() for the wbt_init() failure to retain diagnostic information
without triggering a full stack trace.
Reported-by: syzbot+71fcf20f7c1e5043d78c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=71fcf20f7c1e5043d78c
Fixes: 41afaeeda509 ("blk-wbt: fix possible deadlock to nest pcpu_alloc_mutex under q_usage_counter")
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://patch.msgid.link/20260316070358.65225-2-ytohnuki@amazon.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
KASAN reports a use-after-free write of 4086 bytes in
ocfs2_write_end_inline, called from ocfs2_write_end_nolock during a
copy_file_range splice fallback on a corrupted ocfs2 filesystem mounted on
a loop device. The actual bug is an out-of-bounds write past the inode
block buffer, not a true use-after-free. The write overflows into an
adjacent freed page, which KASAN reports as UAF.
The root cause is that ocfs2_try_to_write_inline_data trusts the on-disk
id_count field to determine whether a write fits in inline data. On a
corrupted filesystem, id_count can exceed the physical maximum inline data
capacity, causing writes to overflow the inode block buffer.
Call trace (crash path):
vfs_copy_file_range (fs/read_write.c:1634)
do_splice_direct
splice_direct_to_actor
iter_file_splice_write
ocfs2_file_write_iter
generic_perform_write
ocfs2_write_end
ocfs2_write_end_nolock (fs/ocfs2/aops.c:1949)
ocfs2_write_end_inline (fs/ocfs2/aops.c:1915)
memcpy_from_folio <-- KASAN: write OOB
So add id_count upper bound check in ocfs2_validate_inode_block() to
alongside the existing i_size check to fix it.
Link: https://lkml.kernel.org/r/20260403063830.3662739-1-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: syzbot+62c1793956716ea8b28a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=62c1793956716ea8b28a
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_stat_start() always allocates the module's damon_ctx object
(damon_stat_context). Meanwhile, if damon_call() in the function fails,
the damon_ctx object is not deallocated. Hence, if the damon_call() is
failed, and the user writes Y to “enabled” again, the previously
allocated damon_ctx object is leaked.
This cannot simply be fixed by deallocating the damon_ctx object when
damon_call() fails. That's because damon_call() failure doesn't guarantee
the kdamond main function, which accesses the damon_ctx object, is
completely finished. In other words, if damon_stat_start() deallocates
the damon_ctx object after damon_call() failure, the not-yet-terminated
kdamond could access the freed memory (use-after-free).
Fix the leak while avoiding the use-after-free by keeping returning
damon_stat_start() without deallocating the damon_ctx object after
damon_call() failure, but deallocating it when the function is invoked
again and the kdamond is completely terminated. If the kdamond is not yet
terminated, simply return -EAGAIN, as the kdamond will soon be terminated.
The issue was discovered [1] by sashiko.
Link: https://lkml.kernel.org/r/20260402134418.74121-1-sj@kernel.org
Link: https://lore.kernel.org/20260401012428.86694-1-sj@kernel.org [1]
Fixes: 405f61996d9d ("mm/damon/stat: use damon_call() repeat mode instead of damon_callback")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 6.17.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit 605f6586ecf7 ("mm/vma: do not leak memory when .mmap_prepare
swaps the file") handled the success path by skipping get_file() via
file_doesnt_need_get, but missed the error path.
When /dev/zero is mmap'd with MAP_SHARED, mmap_zero_prepare() calls
shmem_zero_setup_desc() which allocates a new shmem file to back the
mapping. If __mmap_new_vma() subsequently fails, this replacement
file is never fput()'d - the original is released by
ksys_mmap_pgoff(), but nobody releases the new one.
Add fput() for the swapped file in the error path.
Reproducible with fault injection.
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 1
CPU: 2 UID: 0 PID: 366 Comm: syz.7.14 Not tainted 7.0.0-rc6 #2 PREEMPT(full)
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x164/0x1f0
should_fail_ex+0x525/0x650
should_failslab+0xdf/0x140
kmem_cache_alloc_noprof+0x78/0x630
vm_area_alloc+0x24/0x160
__mmap_region+0xf6b/0x2660
mmap_region+0x2eb/0x3a0
do_mmap+0xc79/0x1240
vm_mmap_pgoff+0x252/0x4c0
ksys_mmap_pgoff+0xf8/0x120
__x64_sys_mmap+0x12a/0x190
do_syscall_64+0xa9/0x580
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
BUG: memory leak
unreferenced object 0xffff8881118aca80 (size 360):
comm "syz.7.14", pid 366, jiffies 4294913255
hex dump (first 32 bytes):
00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
ff ff ff ff ff ff ff ff c0 28 4d ae ff ff ff ff .........(M.....
backtrace (crc db0f53bc):
kmem_cache_alloc_noprof+0x3ab/0x630
alloc_empty_file+0x5a/0x1e0
alloc_file_pseudo+0x135/0x220
__shmem_file_setup+0x274/0x420
shmem_zero_setup_desc+0x9c/0x170
mmap_zero_prepare+0x123/0x140
__mmap_region+0xdda/0x2660
mmap_region+0x2eb/0x3a0
do_mmap+0xc79/0x1240
vm_mmap_pgoff+0x252/0x4c0
ksys_mmap_pgoff+0xf8/0x120
__x64_sys_mmap+0x12a/0x190
do_syscall_64+0xa9/0x580
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Found by syzkaller.
Link: https://lkml.kernel.org/r/20260331180811.1333348-1-rhkrqnwk98@gmail.com
Fixes: 605f6586ecf7 ("mm/vma: do not leak memory when .mmap_prepare swaps the file")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
N_NORMAL_MEMORY is initialized from zone population at boot, but memory
hotplug currently only updates N_MEMORY. As a result, a node that gains
normal memory via hotplug can remain invisible to users iterating over
N_NORMAL_MEMORY, while a node that loses its last normal memory can stay
incorrectly marked as such.
The most visible effect is that
/sys/devices/system/node/has_normal_memory does not report a node even
after that node has gained normal memory via hotplug.
Also, list_lru-based shrinkers can undercount objects on such a node
and may skip reclaim on that node entirely, which can lead to a higher
memory footprint than expected.
Restore N_NORMAL_MEMORY maintenance directly in online_pages() and
offline_pages(). Set the bit when a node that currently lacks normal
memory onlines pages into a zone <= ZONE_NORMAL, and clear it when
offlining removes the last present pages from zones <= ZONE_NORMAL.
This restores the intended semantics without bringing back the old
status_change_nid_normal notifier plumbing which was removed in
8d2882a8edb8.
Current users that benefit include list_lru, zswap, nfsd filecache,
hugetlb_cgroup, and has_normal_memory sysfs reporting.
Link: https://lkml.kernel.org/r/20260330035941.518186-1-hao.li@linux.dev
Fixes: 8d2882a8edb8 ("mm,memory_hotplug: remove status_change_nid_normal and update documentation")
Signed-off-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_call() for repeat_call_control of DAMON_SYSFS could fail if somehow
the kdamond is stopped before the damon_call(). It could happen, for
example, when te damon context was made for monitroing of a virtual
address processes, and the process is terminated immediately, before the
damon_call() invocation. In the case, the dyanmically allocated
repeat_call_control is not deallocated and leaked.
Fix the leak by deallocating the repeat_call_control under the
damon_call() failure.
This issue is discovered by sashiko [1].
Link: https://lkml.kernel.org/r/20260327003224.55752-1-sj@kernel.org
Link: https://lore.kernel.org/20260320020630.962-1-sj@kernel.org [1]
Fixes: 04a06b139ec0 ("mm/damon/sysfs: use dynamically allocated repeat mode damon_call_control")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 64dd89ae01f2 ("mm/block/fs: remove laptop_mode") removed this
unconditional writeback start from balance_dirty_pages():
if (unlikely(!writeback_in_progress(wb)))
wb_start_background_writeback(wb);
This logic needs to be reinstated to prevent performance regressions for
strictlimited BDIs and memcg setups. The problem occurs because:
a) For strictlimited BDIs, throttling is calculated using per-wb
thresholds. The per-wb threshold can be exceeded even when the global
dirty threshold was not exceeded (nr_dirty < gdtc->bg_thresh)
b) For memcg-based throttling, memcg uses its own dirty count /
thresholds and can trigger throttling even when the global threshold
isn't exceeded
Without the unconditional writeback start, IO is throttled as it waits for
dirty pages to be written back but there is no writeback running. This
leads to severe stalls. On fuse, buffered write performance dropped from
1400 MiB/s to 2000 KiB/s.
Reinstate the unconditional writeback start so that writeback is
guaranteed to be running whenever IO needs to be throttled.
Link: https://lkml.kernel.org/r/20260326215127.3857682-2-joannelkoong@gmail.com
Fixes: 64dd89ae01f2 ("mm/block/fs: remove laptop_mode")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
luo_session_deserialize() ignored the return value from
luo_file_deserialize(). As a result, a session could be left partially
restored even though the /dev/liveupdate open path treats deserialization
failures as fatal.
Propagate the error so a failed file deserialization aborts session
deserialization instead of silently continuing.
Link: https://lkml.kernel.org/r/20260325044608.8407-1-leotimmins1974@gmail.com
Link: https://lkml.kernel.org/r/20260325044608.8407-2-leotimmins1974@gmail.com
Fixes: 16cec0d26521 ("liveupdate: luo_session: add ioctls for file preservation")
Signed-off-by: Leo Timmins <leotimmins1974@gmail.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When running stress-ng on my Arm64 machine with v7.0-rc3 kernel, I
encountered some very strange crash issues showing up as "Bad page state":
"
[ 734.496287] BUG: Bad page state in process stress-ng-env pfn:415735fb
[ 734.496427] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0x4cf316 pfn:0x415735fb
[ 734.496434] flags: 0x57fffe000000800(owner_2|node=1|zone=2|lastcpupid=0x3ffff)
[ 734.496439] raw: 057fffe000000800 0000000000000000 dead000000000122 0000000000000000
[ 734.496440] raw: 00000000004cf316 0000000000000000 0000000000000000 0000000000000000
[ 734.496442] page dumped because: nonzero mapcount
"
After analyzing this page’s state, it is hard to understand why the
mapcount is not 0 while the refcount is 0, since this page is not where
the issue first occurred. By enabling the CONFIG_DEBUG_VM config, I can
reproduce the crash as well and captured the first warning where the issue
appears:
"
[ 734.469226] page: refcount:33 mapcount:0 mapping:00000000bef2d187 index:0x81a0 pfn:0x415735c0
[ 734.469304] head: order:5 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 734.469315] memcg:ffff000807a8ec00
[ 734.469320] aops:ext4_da_aops ino:100b6f dentry name(?):"stress-ng-mmaptorture-9397-0-2736200540"
[ 734.469335] flags: 0x57fffe400000069(locked|uptodate|lru|head|node=1|zone=2|lastcpupid=0x3ffff)
......
[ 734.469364] page dumped because: VM_WARN_ON_FOLIO((_Generic((page + nr_pages - 1),
const struct page *: (const struct folio *)_compound_head(page + nr_pages - 1), struct page *:
(struct folio *)_compound_head(page + nr_pages - 1))) != folio)
[ 734.469390] ------------[ cut here ]------------
[ 734.469393] WARNING: ./include/linux/rmap.h:351 at folio_add_file_rmap_ptes+0x3b8/0x468,
CPU#90: stress-ng-mlock/9430
[ 734.469551] folio_add_file_rmap_ptes+0x3b8/0x468 (P)
[ 734.469555] set_pte_range+0xd8/0x2f8
[ 734.469566] filemap_map_folio_range+0x190/0x400
[ 734.469579] filemap_map_pages+0x348/0x638
[ 734.469583] do_fault_around+0x140/0x198
......
[ 734.469640] el0t_64_sync+0x184/0x188
"
The code that triggers the warning is: "VM_WARN_ON_FOLIO(page_folio(page +
nr_pages - 1) != folio, folio)", which indicates that set_pte_range()
tried to map beyond the large folio’s size.
By adding more debug information, I found that 'nr_pages' had overflowed
in filemap_map_pages(), causing set_pte_range() to establish mappings for
a range exceeding the folio size, potentially corrupting fields of pages
that do not belong to this folio (e.g., page->_mapcount).
After above analysis, I think the possible race is as follows:
CPU 0 CPU 1
filemap_map_pages() ext4_setattr()
//get and lock folio with old inode->i_size
next_uptodate_folio()
.......
//shrink the inode->i_size
i_size_write(inode, attr->ia_size);
//calculate the end_pgoff with the new inode->i_size
file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
end_pgoff = min(end_pgoff, file_end);
......
//nr_pages can be overflowed, cause xas.xa_index > end_pgoff
end = folio_next_index(folio) - 1;
nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
......
//map large folio
filemap_map_folio_range()
......
//truncate folios
truncate_pagecache(inode, inode->i_size);
To fix this issue, move the 'end_pgoff' calculation before
next_uptodate_folio(), so the retrieved folio stays consistent with the
file end to avoid 'nr_pages' calculation overflow. After this patch, the
crash issue is gone.
Link: https://lkml.kernel.org/r/1cf1ac59018fc647a87b0dad605d4056a71c14e4.1773739704.git.baolin.wang@linux.alibaba.com
Fixes: 743a2753a02e ("filemap: cap PTE range to be created to allowed zero fill in folio_map_range()")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reported-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
Tested-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|