Age | Commit message (Collapse) | Author | Files | Lines |
|
Pavel Begunkov says:
====================
implement io_uring notification (ubuf_info) stacking (net part)
To have per request buffer notifications each zerocopy io_uring send
request allocates a new ubuf_info. However, as an skb can carry only
one uarg, it may force the stack to create many small skbs hurting
performance in many ways.
The patchset implements notification, i.e. an io_uring's ubuf_info
extension, stacking. It attempts to link ubuf_info's into a list,
allowing to have multiple of them per skb.
liburing/examples/send-zerocopy shows up 6 times performance improvement
for TCP with 4KB bytes per send, and levels it with MSG_ZEROCOPY. Without
the patchset it requires much larger sends to utilise all potential.
bytes | before | after (Kqps)
1200 | 195 | 1023
4000 | 193 | 1386
8000 | 154 | 1058
====================
Link: https://lore.kernel.org/all/cover.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
At the moment an skb can only have one ubuf_info associated with it,
which might be a performance problem for zerocopy sends in cases like
TCP via io_uring. Add a callback for assigning ubuf_info to skb, this
way we will implement smarter assignment later like linking ubuf_info
together.
Note, it's an optional callback, which should be compatible with
skb_zcopy_set(), that's because the net stack might potentially decide
to clone an skb and take another reference to ubuf_info whenever it
wishes. Also, a correct implementation should always be able to bind to
an skb without prior ubuf_info, otherwise we could end up in a situation
when the send would not be able to progress.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We'll need to associate additional callbacks with ubuf_info, introduce
a structure holding ubuf_info callbacks. Apart from a more smarter
io_uring notification management introduced in next patches, it can be
used to generalise msg_zerocopy_put_abort() and also store
->sg_from_iter, which is currently passed in struct msghdr.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When CQE mode or DIM state is changed, gracefully reconfigure channels to
handle new configuration. Previously, would create new channels that would
reflect the changes rather than update the original channels.
Co-developed-by: Nabil S. Alramli <dev@nalramli.com>
Signed-off-by: Nabil S. Alramli <dev@nalramli.com>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240419080445.417574-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use core DIM CQ period mode enum values for the CQ parameter for the period
mode. Translate the value to the specific mlx5 device constant for the
selected period mode when creating a CQ. Avoid needing to translate mlx5
device constants to DIM constants for core DIM functionality.
Co-developed-by: Nabil S. Alramli <dev@nalramli.com>
Signed-off-by: Nabil S. Alramli <dev@nalramli.com>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240419080445.417574-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.
Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of relying on RTNL, choke_dump() can use READ_ONCE()
annotations, paired with WRITE_ONCE() ones in choke_change().
v2: added a WRITE_ONCE(p->Scell_log, Scell_log)
per Simon feedback in V1
Removed the READ_ONCE(q->limit) in choke_enqueue()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Integrate the regulator framework to the PSE framework for enhanced
access to features such as voltage, power measurement, and limits, which
are akin to regulators. Additionally, PSE features like port priorities
could potentially enhance the regulator framework. Note that this
integration introduces some implementation complexity, including wrapper
callbacks, but the potential benefits make it worthwhile.
Regulator are using enable counter with specific behavior.
Two calls to regulator_disable will trigger kernel warnings.
If the counter exceeds one, regulator_disable call won't disable the
PSE PI. These behavior isn't suitable for PSE control.
Added a boolean 'enabled' state to prevent multiple calls to
regulator_enable/disable. These calls will only be called from PSE
framework as it won't have any regulator children, therefore no mutex are
needed to safeguards this boolean.
regulator_get needs the consumer device pointer. Use PSE as regulator
provider and consumer device until we have RJ45 ports represented in
the Kernel.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240417-feature_poe-v9-10-242293fd1900@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement setup_pi_matrix callback to configure the PSE PI matrix. This
functionality is invoked before registering the PSE and following the core
parsing of the pse_pis devicetree subnode.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240417-feature_poe-v9-9-242293fd1900@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The Power Sourcing Equipment Power Interface (PSE PI) plays a pivotal role
in the architecture of Power over Ethernet (PoE) systems. It is essentially
a blueprint that outlines how one or multiple power sources are connected
to the eight-pin modular jack, commonly known as the Ethernet RJ45 port.
This connection scheme is crucial for enabling the delivery of power
alongside data over Ethernet cables.
This patch adds support for getting the PSE controller node through PSE PI
device subnode.
This supports adds a way to get the PSE PI id from the pse_pi devicetree
subnode of a PSE controller node simply by reading the reg property.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240417-feature_poe-v9-7-242293fd1900@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce an enumeration to define PSE types (C33 or PoDL),
utilizing a bitfield for potential future support of both types.
Include 'pse_get_types' helper for external access to PSE type info.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240417-feature_poe-v9-2-242293fd1900@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the current PSE interface for Ethernet Power Equipment, support is
limited to PoDL. This patch extends the interface to accommodate the
objects specified in IEEE 802.3-2022 145.2 for Power sourcing
Equipment (PSE).
The following objects are now supported and considered mandatory:
- IEEE 802.3-2022 30.9.1.1.5 aPSEPowerDetectionStatus
- IEEE 802.3-2022 30.9.1.1.2 aPSEAdminState
- IEEE 802.3-2022 30.9.1.2.1 aPSEAdminControl
To avoid confusion between "PoDL PSE" and "PoE PSE", which have similar
names but distinct values, we have followed the suggestion of Oleksij
Rempel and Andrew Lunn to maintain separate naming schemes for each,
using c33 (clause 33) prefix for "PoE PSE".
You can find more details in the discussion threads here:
https://lore.kernel.org/netdev/20230912110637.GI780075@pengutronix.de/
https://lore.kernel.org/netdev/2539b109-72ad-470a-9dae-9f53de4f64ec@lunn.ch/
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240417-feature_poe-v9-1-242293fd1900@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
include/trace/events/rpcgss.h
386f4a737964 ("trace: events: cleanup deprecated strncpy uses")
a4833e3abae1 ("SUNRPC: Fix rpcgss_context trace event acceptor field")
Adjacent changes:
drivers/net/ethernet/intel/ice/ice_tc_lib.c
2cca35f5dd78 ("ice: Fix checking for unsupported keys on non-tunnel device")
784feaa65dfd ("ice: Add support for PFCP hardware offload in switchdev")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"A little calmer than usual, probably just the timing of sub-tree PRs.
Including fixes from netfilter.
Current release - regressions:
- inet: bring NLM_DONE out to a separate recv() again, fix user space
which assumes multiple recv()s will happen and gets blocked forever
- drv: mlx5:
- restore mistakenly dropped parts in register devlink flow
- use channel mdev reference instead of global mdev instance for
coalescing
- acquire RTNL lock before RQs/SQs activation/deactivation
Previous releases - regressions:
- net: change maximum number of UDP segments to 128, fix virtio
compatibility with Windows peers
- usb: ax88179_178a: avoid writing the mac address before first
reading
Previous releases - always broken:
- sched: fix mirred deadlock on device recursion
- netfilter:
- br_netfilter: skip conntrack input hook for promisc packets
- fixes removal of duplicate elements in the pipapo set backend
- various fixes for abort paths and error handling
- af_unix: don't peek OOB data without MSG_OOB
- drv: flower: fix fragment flags handling in multiple drivers
- drv: ravb: fix jumbo frames and packet stats accounting
Misc:
- kselftest_harness: fix Clang warning about zero-length format
- tun: limit printing rate when illegal packet received by tun dev"
* tag 'net-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits)
net: ethernet: ti: am65-cpsw-nuss: cleanup DMA Channels before using them
net: usb: ax88179_178a: avoid writing the mac address before first reading
net: ravb: Fix RX byte accounting for jumbo packets
net: ravb: Fix GbEth jumbo packet RX checksum handling
net: ravb: Allow RX loop to move past DMA mapping errors
net: ravb: Count packets instead of descriptors in R-Car RX path
net: ethernet: mtk_eth_soc: fix WED + wifi reset
net:usb:qmi_wwan: support Rolling modules
selftests: kselftest_harness: fix Clang warning about zero-length format
net/sched: Fix mirred deadlock on device recursion
netfilter: nf_tables: fix memleak in map from abort path
netfilter: nf_tables: restore set elements when delete set fails
netfilter: nf_tables: missing iterator type in lookup walk
s390/ism: Properly fix receive message buffer allocation
net: dsa: mt7530: fix port mirroring for MT7988 SoC switch
net: dsa: mt7530: fix mirroring frames received on local port
tun: limit printing rate when illegal packet received by tun dev
ice: Fix checking for unsupported keys on non-tunnel device
ice: tc: allow zero flags in parsing tc flower
ice: tc: check src_vsi in case of traffic from VF
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- use -ENOTSUPP consistently in Intel GPIO drivers
- don't include dt-bindings headers in gpio-swnode code
- add missing of device table to gpio-lpc32xx and fix autoloading
* tag 'gpio-fixes-for-v6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: swnode: Remove wrong header inclusion
gpio: lpc32xx: fix module autoloading
gpio: crystalcove: Use -ENOTSUPP consistently
gpio: wcove: Use -ENOTSUPP consistently
|
|
When the mirred action is used on a classful egress qdisc and a packet is
mirrored or redirected to self we hit a qdisc lock deadlock.
See trace below.
[..... other info removed for brevity....]
[ 82.890906]
[ 82.890906] ============================================
[ 82.890906] WARNING: possible recursive locking detected
[ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G W
[ 82.890906] --------------------------------------------
[ 82.890906] ping/418 is trying to acquire lock:
[ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[ 82.890906]
[ 82.890906] but task is already holding lock:
[ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[ 82.890906]
[ 82.890906] other info that might help us debug this:
[ 82.890906] Possible unsafe locking scenario:
[ 82.890906]
[ 82.890906] CPU0
[ 82.890906] ----
[ 82.890906] lock(&sch->q.lock);
[ 82.890906] lock(&sch->q.lock);
[ 82.890906]
[ 82.890906] *** DEADLOCK ***
[ 82.890906]
[..... other info removed for brevity....]
Example setup (eth0->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth0
Another example(eth0->eth1->eth0) to recreate
tc qdisc add dev eth0 root handle 1: htb default 30
tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth1
tc qdisc add dev eth1 root handle 1: htb default 30
tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
action mirred egress redirect dev eth0
We fix this by adding an owner field (CPU id) to struct Qdisc set after
root qdisc is entered. When the softirq enters it a second time, if the
qdisc owner is the same CPU, the packet is dropped to break the loop.
Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_action()")
Fixes: e578d9c02587 ("net: sched: use counter to break reclassify loops")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The flags in the software node properties are supposed to be
the GPIO lookup flags, which are provided by gpio/machine.h,
as the software nodes are the kernel internal thing and doesn't
need to rely to any of ABIs.
Fixes: e7f9ff5dc90c ("gpiolib: add support for software nodes")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
pse_control_config
In commit 18ff0bcda6d1 ("ethtool: add interface to interact with Ethernet
Power Equipment"), the 'pse_control_config' structure was introduced,
housing a single member labeled 'admin_cotrol' responsible for maintaining
the operational state of the PoDL PSE functions.
A noticeable typographical error exists in the naming of this field
('cotrol' should be corrected to 'control'), which this commit aims to
rectify.
Furthermore, with upcoming extensions of this structure to encompass PoE
functionalities, the field is being renamed to 'podl_admin_state' to
distinctly indicate that this state is tailored specifically for PoDL."
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://lore.kernel.org/r/20240414-feature_poe-v8-3-e4bf1e860da5@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit dcf70df2048d ("af_unix: Fix up unix_edge.successor for embryo
socket.") added spin_lock(&unix_gc_lock) in accept() path, and it
caused regression in a stress test as reported by kernel test robot.
If the embryo socket is not part of the inflight graph, we need not
hold the lock.
To decide that in O(1) time and avoid the regression in the normal
use case,
1. add a new stat unix_sk(sk)->scm_stat.nr_unix_fds
2. count the number of inflight AF_UNIX sockets in the receive
queue under unix_state_lock()
3. move unix_update_edges() call under unix_state_lock()
4. avoid locking if nr_unix_fds is 0 in unix_update_edges()
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202404101427.92a08551-oliver.sang@intel.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240413021928.20946-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix a potential tracepoint crash
- Fix NFSv4 GETATTR on big-endian platforms
* tag 'nfsd-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: fix endianness issue in nfsd4_encode_fattr4
SUNRPC: Fix rpcgss_context trace event acceptor field
|
|
With the previous change, struct dqs->stall_thrs will be in the hot path
(at queue side), even if DQS is disabled.
The other fields accessed in this function (last_obj_cnt and num_queued)
are in the first cache line, let's move this field (stall_thrs) to the
very first cache line, since there is a hole there.
This does not change the structure size, since it moves an short (2
bytes) to 4-bytes whole in the first cache line.
This is the new structure format now:
struct dql {
unsigned int num_queued;
unsigned int last_obj_cnt;
...
short unsigned int stall_thrs;
/* XXX 2 bytes hole, try to pack */
...
/* --- cacheline 1 boundary (64 bytes) --- */
...
/* Longest stall detected, reported to user */
short unsigned int stall_max;
/* XXX 2 bytes hole, try to pack */
};
Also, read the stall_thrs (now in the very first cache line) earlier,
together with dql->num_queued (also in the first cache line).
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240411192241.2498631-5-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When Dynamic Queue Limit (DQL) is set, it always populate stall
information through dql_queue_stall(). However, this information is
only necessary if a stall threshold is set, stored in struct
dql->stall_thrs.
dql_queue_stall() is cheap, but not free, since it does have memory
barriers and so forth.
Do not call dql_queue_stall() if there is no stall threshold set, and
save some CPU cycles.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240411192241.2498631-4-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The dql_queued() function currently handles both queuing object counts
and populating bitmaps for reporting stalls.
This commit splits the bitmap population into a separate function,
allowing for conditional invocation in scenarios where the feature is
disabled.
This refactor maintains functionality while improving code
organization.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240411192241.2498631-3-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If the dql_queued() function receives an invalid argument, WARN about it
and continue, instead of crashing the kernel.
This was raised by checkpatch, when I am refactoring this code (see
following patch/commit)
WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240411192241.2498631-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
These helpers aim to help drivers, with checking
for the presence of unsupported control flags.
For drivers supporting at least one control flag:
flow_rule_is_supp_control_flags()
For drivers using flow_rule_match_control(), but not using flags:
flow_rule_has_control_flags()
For drivers not using flow_rule_match_control():
flow_rule_match_has_control_flags()
While primarily aimed at FLOW_DISSECTOR_KEY_CONTROL
and flow_rule_match_control(), then the first two
can also be used with FLOW_DISSECTOR_KEY_ENC_CONTROL
and flow_rule_match_enc_control().
These helpers mirrors the existing check done in sfc:
drivers/net/ethernet/sfc/tc.c +276
Only compile-tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
- Address a (valid) W=1 build warning
- Fix timer self-tests
- Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu
global variable
- Address a !CONFIG_BUG build warning
* tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests: kselftest: Fix build failure with NOLIBC
selftests: timers: Fix abs() warning in posix_timers test
selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
selftests: timers: Fix posix_timers ksft_print_msg() warning
selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior
bug: Fix no-return-statement warning with !CONFIG_BUG
timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu
selftests/timers/posix_timers: Reimplement check_timer_distribution()
irqflags: Explicitly ignore lockdep_hrtimer_exit() argument
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix a PREEMPT_RT build bug"
* tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y
|
|
Pull virtio bugfixes from Michael Tsirkin:
"Some small, obvious (in hindsight) bugfixes:
- new ioctl in vhost-vdpa has a wrong # - not too late to fix
- vhost has apparently been lacking an smp_rmb() - due to code
duplication :( The duplication will be fixed in the next merge
cycle, this is a minimal fix
- an error message in vhost talks about guest moving used index -
which of course never happens, guest only ever moves the available
index
- i2c-virtio didn't set the driver owner so it did not get refcounted
correctly"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: correct misleading printing information
vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE
virtio: store owner from modules with register_virtio_driver()
vhost: Add smp_rmb() in vhost_enable_notify()
vhost: Add smp_rmb() in vhost_vq_avail_empty()
|
|
The commit fc8b2a619469
("net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation")
adds check of potential number of UDP segments vs
UDP_MAX_SEGMENTS in linux/virtio_net.h.
After this change certification test of USO guest-to-guest
transmit on Windows driver for virtio-net device fails,
for example with packet size of ~64K and mss of 536 bytes.
In general the USO should not be more restrictive than TSO.
Indeed, in case of unreasonably small mss a lot of segments
can cause queue overflow and packet loss on the destination.
Limit of 128 segments is good for any practical purpose,
with minimal meaningful mss of 536 the maximal UDP packet will
be divided to ~120 segments.
The number of segments for UDP packets is validated vs
UDP_MAX_SEGMENTS also in udp.c (v4,v6), this does not affect
quest-to-guest path but does affect packets sent to host, for
example.
It is important to mention that UDP_MAX_SEGMENTS is kernel-only
define and not available to user mode socket applications.
In order to request MSS smaller than MTU the applications
just uses setsockopt with SOL_UDP and UDP_SEGMENT and there is
no limitations on socket API level.
Fixes: fc8b2a619469 ("net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation")
Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull io_uring fixes from Jens Axboe:
- Fix for sigmask restoring while waiting for events (Alexey)
- Typo fix in comment (Haiyue)
- Fix for a msg_control retstore on SEND_ZC retries (Pavel)
* tag 'io_uring-6.9-20240412' of git://git.kernel.dk/linux:
io-uring: correct typo in comment for IOU_F_TWQ_LAZY_WAKE
io_uring/net: restore msg_control on sendzc retry
io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failure
|
|
Pull drm fixes from Dave Airlie:
"Looks like everyone woke up after holidays, this weeks pull has a
bunch of stuff all over, 2 weeks worth of amdgpu is a lot of it, then
i915/xe have a few, a bunch of msm fixes, then some scattered driver
fixes.
I expect things will settle down for rc5.
client:
- Protect connector modes with mode_config mutex
ast:
- Fix soft lockup
host1x:
- Do not setup DMA for virtual addresses
ivpu:
- Fix deadlock in context_xa
- PCI fixes
- Fixes to error handling
nouveau:
- gsp: Fix OOB access
- Fix casting
panfrost:
- Fix error path in MMU code
qxl:
- Revert "drm/qxl: simplify qxl_fence_wait"
vmwgfx:
- Enable DMA for SEV mappings
i915:
- Couple CDCLK programming fixes
- HDCP related fix
- 4 Bigjoiner related fixes
- Fix for a circular locking around GuC on reset+wedged case
xe:
- Fix double display mutex initializations
- Fix u32 -> u64 implicit conversions
- Fix RING_CONTEXT_CONTROL not marked as masked
msm:
- DP refcount leak fix on disconnect
- Add missing newlines to prints in msm_fb and msm_kms
- fix dpu debugfs entry permissions
- Fix the interface table for the catalog of X1E80100
- fix irq message printing
- Bindings fix to add DP node as child of mdss for mdss node
- Minor typo fix in DP driver API which handles port status change
- fix CHRASHDUMP_READ()
- fix HHB (highest bank bit) for a619 to fix UBWC corruption
amdgpu:
- GPU reset fixes
- Fix some confusing logging
- UMSCH fix
- Aborted suspend fix
- DCN 3.5 fixes
- S4 fix
- MES logging fixes
- SMU 14 fixes
- SDMA 4.4.2 fix
- KASAN fix
- SMU 13.0.10 fix
- VCN partition fix
- GFX11 fixes
- DWB fixes
- Plane handling fix
- FAMS fix
- DCN 3.1.6 fix
- VSC SDP fixes
- OLED panel fix
- GFX 11.5 fix
amdkfd:
- GPU reset fixes
- fix ioctl integer overflow"
* tag 'drm-fixes-2024-04-12' of https://gitlab.freedesktop.org/drm/kernel: (65 commits)
amdkfd: use calloc instead of kzalloc to avoid integer overflow
drm/xe: Label RING_CONTEXT_CONTROL as masked
drm/xe/xe_migrate: Cast to output precision before multiplying operands
drm/xe/hwmon: Cast result to output precision on left shift of operand
drm/xe/display: Fix double mutex initialization
drm/amdgpu: differentiate external rev id for gfx 11.5.0
drm/amd/display: Adjust dprefclk by down spread percentage.
drm/amd/display: Set VSC SDP Colorimetry same way for MST and SST
drm/amd/display: Program VSC SDP colorimetry for all DP sinks >= 1.4
drm/amd/display: fix disable otg wa logic in DCN316
drm/amd/display: Do not recursively call manual trigger programming
drm/amd/display: always reset ODM mode in context when adding first plane
drm/amdgpu: fix incorrect number of active RBs for gfx11
drm/amd/display: Return max resolution supported by DWB
amd/amdkfd: sync all devices to wait all processes being evicted
drm/amdgpu: clear set_q_mode_offs when VM changed
drm/amdgpu: Fix VCN allocation in CPX partition
drm/amd/pm: fix the high voltage issue after unload
drm/amd/display: Skip on writeback when it's not applicable
drm/amdgpu: implement IRQ_STATE_ENABLE for SDMA v4.4.2
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
netfilter pull request 24-04-11
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
Patches #1 and #2 add missing rcu read side lock when iterating over
expression and object type list which could race with module removal.
Patch #3 prevents promisc packet from visiting the bridge/input hook
to amend a recent fix to address conntrack confirmation race
in br_netfilter and nf_conntrack_bridge.
Patch #4 adds and uses iterate decorator type to fetch the current
pipapo set backend datastructure view when netlink dumps the
set elements.
Patch #5 fixes removal of duplicate elements in the pipapo set backend.
Patch #6 flowtable validates pppoe header before accessing it.
Patch #7 fixes flowtable datapath for pppoe packets, otherwise lookup
fails and pppoe packets follow classic path.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add definition and documentation for the new generic
info "board.part_number".
The new one is for part number specific use, and board.id
is modified to match the documentation in devlink-info.
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"),
we noticed an application-level timeout due to reduced throughput.
Before the commit, for a client that sets SO_RCVBUF to 65k, it takes
around 22 seconds to transfer 10M data. After the commit, it takes 40
seconds. Because our application has a 30-second timeout, this
regression broke the application.
The reason that it takes longer to transfer data is that
tp->scaling_ratio is initialized to a value that results in ~0.25 of
rcvbuf. In our case, SO_RCVBUF is set to 65536 by the application, which
translates to 2 * 65536 = 131,072 bytes in rcvbuf and hence a ~28k
initial receive window.
Later, even though the scaling_ratio is updated to a more accurate
skb->len/skb->truesize, which is ~0.66 in our environment, the window
stays at ~0.25 * rcvbuf. This is because tp->window_clamp does not
change together with the tp->scaling_ratio update when autotuning is
disabled due to SO_RCVBUF. As a result, the window size is capped at the
initial window_clamp, which is also ~0.25 * rcvbuf, and never grows
bigger.
Most modern applications let the kernel do autotuning, and benefit from
the increased scaling_ratio. But there are applications such as kafka
that has a default setting of SO_RCVBUF=64k.
This patch increases the initial scaling_ratio from ~25% to 50% in order
to make it backward compatible with the original default
sysctl_tcp_adv_win_scale for applications setting SO_RCVBUF.
Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Hechao Li <hli@netflix.com>
Reviewed-by: Tycho Andersen <tycho@tycho.pizza>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/netdev/20240402215405.432863-1-hli@netflix.com/
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rather than having a shim for each and every phylink MAC operation,
allow DSA switch drivers to provide their own ops structure. When a
DSA driver provides the phylink MAC operations, the shimmed ops must
not be provided, so fail an attempt to register a switch with both
the phylink_mac_ops in struct dsa_switch and the phylink_mac_*
operations populated in dsa_switch_ops populated.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1rudqF-006K9H-Cc@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We convert from a phylink_config struct to a dsa_port struct in many
places, let's provide a helper for this.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1rudqA-006K9B-85@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
RTO_ONLINK was a flag used in ->flowi4_tos that allowed to alter the
scope of an IPv4 route lookup. Setting this flag was equivalent to
specifying RT_SCOPE_LINK in ->flowi4_scope.
With commit ec20b2830093 ("ipv4: Set scope explicitly in
ip_route_output()."), the last users of RTO_ONLINK have been removed.
Therefore, we can now drop the code that checked this bit and stop
modifying ->flowi4_scope in ip_route_output_key_hash().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/57de760565cab55df7b129f523530ac6475865b2.1712754146.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
include/net/flow_offload.h:351: warning:
No description found for return value of 'flow_offload_has_one_action'
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Link: https://lore.kernel.org/r/20240410114718.15145-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Refactor some of the skb frag ref/unref helpers for improved clarity.
Implement napi_pp_get_page() to be the mirror counterpart of
napi_pp_put_page().
Implement skb_page_ref() to be the mirror of skb_page_unref().
Improve __skb_frag_ref() to become a mirror counterpart of
__skb_frag_unref(). Previously unref could handle pp & non-pp pages,
while the ref could only handle non-pp pages. Now both the ref & unref
helpers can correctly handle both pp & non-pp pages.
Now that __skb_frag_ref() can handle both pp & non-pp pages, remove
skb_pp_frag_ref(), and use __skb_frag_ref() instead. This lets us
remove pp specific handling from skb_try_coalesce.
Additionally, since __skb_frag_ref() can now handle both pp & non-pp
pages, a latent issue in skb_shift() should now be fixed. Previously
this function would do a non-pp ref & pp unref on potential pp frags
(fragfrom). After this patch, skb_shift() should correctly do a pp
ref/unref on pp frags.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-3-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new header, linux/skbuff_ref.h, which contains all the skb_*_ref()
helpers. Many of the consumers of skbuff.h do not actually use any of
the skb ref helpers, and we can speed up compilation a bit by minimizing
this header file.
Additionally in the later patch in the series we add page_pool support
to skb_frag_ref(), which requires some page_pool dependencies. We can
now add these dependencies to skbuff_ref.h instead of a very ubiquitous
skbuff.h
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20240410190505.1225848-2-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Some cosmetic changes (Erni Sri Satya Vennela, Li Zhijian)
- Introduce hv_numa_node_to_pxm_info() (Nuno Das Neves)
- Fix KVP daemon to handle IPv4 and IPv6 combination for keyfile format
(Shradha Gupta)
- Avoid freeing decrypted memory in a confidential VM (Rick Edgecombe
and Michael Kelley)
* tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: vmbus: Don't free ring buffers that couldn't be re-encrypted
uio_hv_generic: Don't free decrypted memory
hv_netvsc: Don't free decrypted memory
Drivers: hv: vmbus: Track decrypted status in vmbus_gpadl
Drivers: hv: vmbus: Leak pages if set_memory_encrypted() fails
hv/hv_kvp_daemon: Handle IPv4 and Ipv6 combination for keyfile format
hv: vmbus: Convert sprintf() family to sysfs_emit() family
mshyperv: Introduce hv_numa_node_to_pxm_info()
x86/hyperv: Cosmetic changes for hv_apic.c
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
net/unix/garbage.c
47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()")
4090fa373f0e ("af_unix: Replace garbage collection algorithm.")
Adjacent changes:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
faa12ca24558 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
7ac10c7d728d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
194fad5b2781 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
958f56e48385 ("net/mlx5e: Un-expose functions in en.h")
49e6c9387051 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Short summary of fixes pull:
ast:
- Fix soft lockup
client:
- Protect connector modes with mode_config mutex
host1x:
- Do not setup DMA for virtual addresses
ivpu:
- Fix deadlock in context_xa
- PCI fixes
- Fixes to error handling
nouveau:
- gsp: Fix OOB access
- Fix casting
panfrost:
- Fix error path in MMU code
qxl:
- Revert "drm/qxl: simplify qxl_fence_wait"
vmwgfx:
- Enable DMA for SEV mappings
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20240411073403.GA9895@localhost.localdomain
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix the handling of dependencies between devices in the ACPI
device enumeration code and address a _UID matching regression from
the 6.8 development cycle.
Specifics:
- Modify the ACPI device enumeration code to avoid counting
dependencies that have been met already as unmet (Hans de Goede)
- Make _UID matching take the integer value of 0 into account as
appropriate (Raag Jadav)"
* tag 'acpi-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: bus: allow _UID matching for integer zero
ACPI: scan: Do not increase dep_unmet for already met dependencies
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth.
Current release - new code bugs:
- netfilter: complete validation of user input
- mlx5: disallow SRIOV switchdev mode when in multi-PF netdev
Previous releases - regressions:
- core: fix u64_stats_init() for lockdep when used repeatedly in one
file
- ipv6: fix race condition between ipv6_get_ifaddr and ipv6_del_addr
- bluetooth: fix memory leak in hci_req_sync_complete()
- batman-adv: avoid infinite loop trying to resize local TT
- drv: geneve: fix header validation in geneve[6]_xmit_skb
- drv: bnxt_en: fix possible memory leak in
bnxt_rdma_aux_device_init()
- drv: mlx5: offset comp irq index in name by one
- drv: ena: avoid double-free clearing stale tx_info->xdpf value
- drv: pds_core: fix pdsc_check_pci_health deadlock
Previous releases - always broken:
- xsk: validate user input for XDP_{UMEM|COMPLETION}_FILL_RING
- bluetooth: fix setsockopt not validating user input
- af_unix: clear stale u->oob_skb.
- nfc: llcp: fix nfc_llcp_setsockopt() unsafe copies
- drv: virtio_net: fix guest hangup on invalid RSS update
- drv: mlx5e: Fix mlx5e_priv_init() cleanup flow
- dsa: mt7530: trap link-local frames regardless of ST Port State"
* tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
net: ena: Set tx_info->xdpf value to NULL
net: ena: Fix incorrect descriptor free behavior
net: ena: Wrong missing IO completions check order
net: ena: Fix potential sign extension issue
af_unix: Fix garbage collector racing against connect()
net: dsa: mt7530: trap link-local frames regardless of ST Port State
Revert "s390/ism: fix receive message buffer allocation"
net: sparx5: fix wrong config being used when reconfiguring PCS
net/mlx5: fix possible stack overflows
net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev
net/mlx5e: RSS, Block XOR hash with over 128 channels
net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
net/mlx5e: Fix mlx5e_priv_init() cleanup flow
net/mlx5e: RSS, Block changing channels number when RXFH is configured
net/mlx5: Correctly compare pkt reformat ids
net/mlx5: Properly link new fs rules into the tree
net/mlx5: offset comp irq index in name by one
net/mlx5: Register devlink first under devlink lock
net/mlx5: E-switch, store eswitch pointer before registering devlink_param
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
- make {virt, phys, page, pfn} translation work with KFENCE for
LoongArch (otherwise NVMe and virtio-blk cannot work with KFENCE
enabled)
- update dts files for Loongson-2K series to make devices work
correctly
- fix a build error
* tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: Include linux/sizes.h in addrspace.h to prevent build errors
LoongArch: Update dts for Loongson-2K2000 to support GMAC/GNET
LoongArch: Update dts for Loongson-2K2000 to support PCI-MSI
LoongArch: Update dts for Loongson-2K2000 to support ISA/LPC
LoongArch: Update dts for Loongson-2K1000 to support ISA/LPC
LoongArch: Make virt_addr_valid()/__virt_addr_valid() work with KFENCE
LoongArch: Make {virt, phys, page, pfn} translation work with KFENCE
mm: Move lowmem_page_address() a little later
|
|
This patch adds "last time" fields last_data_sent, last_data_recv and
last_ack_recv in struct mptcp_sock to record the last time data_sent,
data_recv and ack_recv happened. They all are initialized as
tcp_jiffies32 in __mptcp_init_sock(), and updated as tcp_jiffies32 too
when data is sent in __subflow_push_pending(), data is received in
__mptcp_move_skbs_from_subflow(), and ack is received in ack_update_msk().
Similar to tcpi_last_data_sent, tcpi_last_data_recv and tcpi_last_ack_recv
exposed with TCP, this patch exposes the last time "an action happened" for
MPTCP in mptcp_info, named mptcpi_last_data_sent, mptcpi_last_data_recv and
mptcpi_last_ack_recv, calculated in mptcp_diag_fill_info() as the time
deltas between now and the newly added last time fields in mptcp_sock.
Since msk->last_ack_recv needs to be protected by mptcp_data_lock/unlock,
and lock_sock_fast can sleep and be quite slow, move the entire
mptcp_data_lock/unlock block after the lock/unlock_sock_fast block.
Then mptcpi_last_data_sent and mptcpi_last_data_recv are set in
lock/unlock_sock_fast block, while mptcpi_last_ack_recv is set in
mptcp_data_lock/unlock block, which is protected by a spinlock and
should not block for too long.
Also add three reserved bytes in struct mptcp_info not to have holes in
this structure exposed to userspace.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/446
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://lore.kernel.org/r/20240410-upstream-net-next-20240405-mptcp-last-time-info-v2-1-f95bd6b33e51@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
Erick Archer says:
====================
mana: Add flex array to struct mana_cfg_rx_steer_req_v2 (part)
The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].
At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).
Also, avoid the open-coded arithmetic in the memory allocator functions
[2] using the "struct_size" macro.
Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.
Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.
Specifically, the first commit adds the flex member and the patches 2 and
3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2".
This code was detected with the help of Coccinelle, and audited and
modified manually. The Coccinelle script used to detect this code pattern
is the following:
virtual report
@rule1@
type t1;
type t2;
identifier i0;
identifier i1;
identifier i2;
identifier ALLOC =~ "kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc";
position p1;
@@
i0 = sizeof(t1) + sizeof(t2) * i1;
...
i2 = ALLOC@p1(..., i0, ...);
@script:python depends on report@
p1 << rule1.p1;
@@
msg = "WARNING: verify allocation on line %s" % (p1[0].line)
coccilib.report.print_report(p1[0],msg)
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2]
v1: https://lore.kernel.org/linux-hardening/AS8PR02MB7237974EF1B9BAFA618166C38B382@AS8PR02MB7237.eurprd02.prod.outlook.com/
v2: https://lore.kernel.org/linux-hardening/AS8PR02MB723729C5A63F24C312FC9CD18B3F2@AS8PR02MB7237.eurprd02.prod.outlook.com/
====================
Link: https://lore.kernel.org/r/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].
At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).
This is a previous step to refactor the two consumers of this structure.
drivers/infiniband/hw/mana/qp.c
drivers/net/ethernet/microsoft/mana/mana_en.c
The ultimate goal is to avoid the open-coded arithmetic in the memory
allocator functions [2] using the "struct_size" macro.
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2]
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Link: https://lore.kernel.org/r/AS8PR02MB7237E2900247571C9CB84C678B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Ensure there is sufficient room to access the protocol field of the
PPPoe header. Validate it once before the flowtable lookup, then use a
helper function to access protocol field.
Reported-by: syzbot+b6f07e1c07ef40199081@syzkaller.appspotmail.com
Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|