Age | Commit message (Collapse) | Author | Files | Lines |
|
bareudp_exit_batch_rtnl() iterates the dying netns list and performs the
same operation for each.
Let's use ->exit_rtnl().
While at it, we replace unregister_netdevice_queue() with
bareudp_dellink() for better cleanup. It unlinks the device
from net_generic(net, bareudp_net_id)->bareudp_list, but there
is no real issue as both the dev and the list are freed later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-13-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
gtp_net_exit_batch_rtnl() iterates the dying netns list
and performs the same operations for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-12-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
bond_net_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-11-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
br_net_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-10-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
xfrmi_exit_batch_rtnl() iterates the dying netns list and
performs the same operations for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-9-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The following functions iterates the dying netns list and performs
the same operations for each netns.
* ip6gre_exit_batch_rtnl()
* ip6_tnl_exit_batch_rtnl()
* vti6_exit_batch_rtnl()
* sit_exit_batch_rtnl()
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ip_tunnel_delete_nets() iterates the dying netns list and performs the
same operations for each.
Let's export ip_tunnel_destroy() as ip_tunnel_delete_net() and call it
from ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
vxlan_exit_batch_rtnl() iterates the dying netns list and
performs the same operations for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
nexthop_net_exit_batch_rtnl() iterates the dying netns list and
performs the same operation for each.
Let's use ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
struct pernet_operations provides two batching hooks; ->exit_batch()
and ->exit_batch_rtnl().
The batching variant is beneficial if ->exit() meets any of the
following conditions:
1) ->exit() repeatedly acquires a global lock for each netns
2) ->exit() has a time-consuming operation that can be factored
out (e.g. synchronize_rcu(), smp_mb(), etc)
3) ->exit() does not need to repeat the same iterations for each
netns (e.g. inet_twsk_purge())
Currently, none of the ->exit_batch_rtnl() functions satisfy any of
the above conditions because RTNL is factored out and held by the
caller and all of these functions iterate over the dying netns list.
Also, we want to hold per-netns RTNL there but avoid spreading
__rtnl_net_lock() across multiple locations.
Let's add ->exit_rtnl() hook and run it under __rtnl_net_lock().
The following patches will convert all ->exit_batch_rtnl() users
to ->exit_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If ops_init() fails while loading a module or we unload the
module, free_exit_list() rolls back the changes.
The rollback sequence is the same as ops_undo_list().
The ops is already removed from pernet_list before calling
free_exit_list(). If we link the ops to a temporary list,
we can reuse ops_undo_list().
Let's add a wrapper of ops_undo_list() and use it instead
of free_exit_list().
Now, we have the central place to roll back ops_init().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we roll back the changes made by struct pernet_operations.init(),
we execute mostly identical sequences in three places.
* setup_net()
* cleanup_net()
* free_exit_list()
The only difference between the first two is which list and RCU helpers
to use.
In setup_net(), an ops could fail on the way, so we need to perform a
reverse walk from its previous ops in pernet_list. OTOH, in cleanup_net(),
we iterate the full list from tail to head.
The former passes the failed ops to list_for_each_entry_continue_reverse().
It's tricky, but we can reuse it for the latter if we pass list_entry() of
the head node.
Also, synchronize_rcu() and synchronize_rcu_expedited() can be easily
switched by an argument.
Let's factorise the rollback part in setup_net() and cleanup_net().
In the next patch, ops_undo_list() will be reused for free_exit_list(),
and then two arguments (ops_list and hold_rtnl) will differ.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250411205258.63164-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 0e9c127729be ("ethtool: add interface to read Tx hardware
timestamping statistics") added documentation for timestamping
statistics, but added the detailed explanation for this method to
the get_ts_info() rather than get_ts_stats(). Move it to the correct
entry.
Cc: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u3MTz-000Crx-IW@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Toke Høiland-Jørgensen says:
====================
Fix late DMA unmap crash for page pool
This series fixes the late dma_unmap crash for page pool first reported
by Yonglong Liu in [0]. It is an alternative approach to the one
submitted by Yunsheng Lin, most recently in [1]. The first commit just
wraps some tests in a helper function, in preparation of the main change
in patch 2. See the commit message of patch 2 for the details.
[0] https://lore.kernel.org/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org
[1] https://lore.kernel.org/20250307092356.638242-1-linyunsheng@huawei.com
v8: https://lore.kernel.org/20250407-page-pool-track-dma-v8-0-da9500d4ba21@redhat.com
v7: https://lore.kernel.org/20250404-page-pool-track-dma-v7-0-ad34f069bc18@redhat.com
v6: https://lore.kernel.org/20250401-page-pool-track-dma-v6-0-8b83474870d4@redhat.com
v5: https://lore.kernel.org/20250328-page-pool-track-dma-v5-0-55002af683ad@redhat.com
v4: https://lore.kernel.org/20250327-page-pool-track-dma-v4-0-b380dc6706d0@redhat.com
v3: https://lore.kernel.org/20250326-page-pool-track-dma-v3-0-8e464016e0ac@redhat.com
v2: https://lore.kernel.org/20250325-page-pool-track-dma-v2-0-113ebc1946f3@redhat.com
v1: https://lore.kernel.org/20250314-page-pool-track-dma-v1-0-c212e57a74c2@redhat.com
====================
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes resource leaks and/or
crashes when there are pages still outstanding while the device is torn
down, because page_pool will attempt an unmap through a non-existent DMA
device on the subsequent page return.
To fix this, implement a simple tracking of outstanding DMA-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straight forward: We simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.
To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, using the upper bits of the pp_magic field. This
requires a couple of defines to avoid conflicting with the
POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
so does not affect run-time performance. The bitmap calculations in this
patch gives the following number of bits for different architectures:
- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures
Stashing a value into the unused bits of pp_magic does have the effect
that it can make the value stored there lie outside the unmappable
range (as governed by the mmap_min_addr sysctl), for architectures that
don't define ILLEGAL_POINTER_VALUE. This means that if one of the
pointers that is aliased to the pp_magic field (such as page->lru.next)
is dereferenced while the page is owned by page_pool, that could lead to
a dereference into userspace, which is a security concern. The risk of
this is mitigated by the fact that (a) we always clear pp_magic before
releasing a page from page_pool, and (b) this would need a
use-after-free bug for struct page, which can have many other risks
since page->lru.next is used as a generic list pointer in multiple
places in the kernel. As such, with this patch we take the position that
this risk is negligible in practice. For more discussion, see[1].
Since all the tracking added in this patch is performed on DMA
map/unmap, no additional code is needed in the fast path, meaning the
performance overhead of this tracking is negligible there. A
micro-benchmark shows that the total overhead of the tracking itself is
about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
Since this cost is only paid on DMA map and unmap, it seems like an
acceptable cost to fix the late unmap issue. Further optimisation can
narrow the cases where this cost is paid (for instance by eliding the
tracking when DMA map/unmap is a no-op).
The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a
full node occurs only 9 bytes of overhead per slot it tracks (in
practice, it probably won't be this efficient, but in any case it should
be an acceptable overhead).
[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
[2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com
Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Qiuling Ren <qren@redhat.com>
Tested-by: Yuying Ma <yuma@redhat.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since we are about to stash some more information into the pp_magic
field, let's move the magic signature checks into a pair of helper
functions so it can be changed in one place.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
It's not safe to access nla_len(ovs_key) if the data is smaller than
the netlink header. Check that the attribute is OK first.
Fixes: ccb1352e76cf ("net: Add Open vSwitch kernel components.")
Reported-by: syzbot+b07a9da40df1576b8048@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b07a9da40df1576b8048
Tested-by: syzbot+b07a9da40df1576b8048@syzkaller.appspotmail.com
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250412104052.2073688-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-04-11 (ice, i40e, ixgbe, igc, e1000e)
For ice:
Mateusz and Larysa add support for LLDP packets to be received on a VF
and transmitted by a VF in switchdev mode. Additional information:
https://lore.kernel.org/intel-wired-lan/20250214085215.2846063-1-larysa.zaremba@intel.com/
Karol adds timesync support for E825C devices using 2xNAC (Network
Acceleration Complex) configuration. 2xNAC mode is the mode in which
IO die is housing two complexes and each of them has its own PHY
connected to it.
Martyna adds messaging to clarify filter errors when recipe space is
exhausted.
Colin Ian King adds static modifier to a const array to avoid stack
usage.
For i40e:
Kyungwook Boo changes variable declaration types to prevent possible
underflow.
For ixgbe:
Rand Deeb adjusts retry values so that retries are attempted.
For igc:
Rui Salvaterra sets VLAN offloads to be enabled as default.
For e1000e:
Piotr Wejman converts driver to use newer hardware timestamping API.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
net: e1000e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igc: enable HW vlan tag insertion/stripping by default
ixgbe: Fix unreachable retry logic in combined and byte I2C write functions
i40e: fix MMIO write access to an invalid page in i40e_clear_hw
ice: make const read-only array dflt_rules static
ice: improve error message for insufficient filter space
ice: enable timesync operation on 2xNAC E825 devices
ice: refactor ice_sbq_msg_dev enum
ice: remove SW side band access workaround for E825
ice: enable LLDP TX for VFs through tc
ice: support egress drop rules on PF
ice: remove headers argument from ice_tc_count_lkups
ice: receive LLDP on trusted VFs
ice: do not add LLDP-specific filter if not necessary
ice: fix check for existing switch rule
====================
Link: https://patch.msgid.link/20250411204401.3271306-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
igc: Fix PTM timeout
Christopher S M Hall says:
There have been sporadic reports of PTM timeouts using i225/i226 devices
These timeouts have been root caused to:
1) Manipulating the PTM status register while PTM is enabled
and triggered
2) The hardware retrying too quickly when an inappropriate response
is received from the upstream device
The issue can be reproduced with the following:
$ sudo phc2sys -R 1000 -O 0 -i tsn0 -m
Note: 1000 Hz (-R 1000) is unrealistically large, but provides a way to
quickly reproduce the issue.
PHC2SYS exits with:
"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
fails
The first patch in this series also resolves an issue reported by Corinna
Vinschen relating to kdump:
This patch also fixes a hang in igc_probe() when loading the igc
driver in the kdump kernel on systems supporting PTM.
The igc driver running in the base kernel enables PTM trigger in
igc_probe(). Therefore the driver is always in PTM trigger mode,
except in brief periods when manually triggering a PTM cycle.
When a crash occurs, the NIC is reset while PTM trigger is enabled.
Due to a hardware problem, the NIC is subsequently in a bad busmaster
state and doesn't handle register reads/writes. When running
igc_probe() in the kdump kernel, the first register access to a NIC
register hangs driver probing and ultimately breaks kdump.
With this patch, igc has PTM trigger disabled most of the time,
and the trigger is only enabled for very brief (10 - 100 us) periods
when manually triggering a PTM cycle. Chances that a crash occurs
during a PTM trigger are not zero, but extremly reduced.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
igc: add lock preventing multiple simultaneous PTM transactions
igc: cleanup PTP module if probe fails
igc: handle the IGC_PTP_ENABLED flag correctly
igc: move ktime snapshot into PTM retry loop
igc: increase wait time before retrying PTM
igc: fix PTM cycle trigger logic
====================
Link: https://patch.msgid.link/20250411162857.2754883-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Siddharth Vadapalli says:
====================
CPSW Bindings for 5000M Fixed-Link
This series adds 5000M as a valid speed for fixed-link mode of operation
and also updates the CPSW bindings to evaluate fixed-link property. This
series is in the context of the following device-tree overlay which
enables USXGMII 5000M Fixed-link mode of operation with CPSW on TI's
J784S4 SoC:
https://github.com/torvalds/linux/blob/v6.15-rc1/arch/arm64/boot/dts/ti/k3-j784s4-evm-usxgmii-exp1-exp2.dtso
====================
Link: https://patch.msgid.link/20250411060917.633769-1-s-vadapalli@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since the fixed-link (phyless) mode of operation is supported by the
CPSW MAC, include "fixed-link" in the set of properties to be evaluated.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://patch.msgid.link/20250411060917.633769-3-s-vadapalli@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A link speed of 5000 Mbps is a valid speed for a fixed-link mode of
operation. Hence, update the bindings to include the same.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250411060917.633769-2-s-vadapalli@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Joseph Huang says:
====================
Add support for mdb offload failure notification
Currently the bridge does not provide real-time feedback to user space
on whether or not an attempt to offload an mdb entry was successful.
This patch set adds support to notify user space about failed offload
attempts, and is controlled by a new knob mdb_offload_fail_notification.
A break-down of the patches in the series:
Patch 1 adds offload failed flag to indicate that the offload attempt
has failed. The flag is reflected in netlink mdb entry flags.
Patch 2 adds the new bridge bool option mdb_offload_fail_notification.
Patch 3 notifies user space when the result is known, controlled by
mdb_offload_fail_notification setting.
====================
Link: https://patch.msgid.link/20250411150323.1117797-1-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Notify user space on mdb offload failure if
mdb_offload_fail_notification is enabled.
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-4-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add BR_BOOLOPT_MDB_OFFLOAD_FAIL_NOTIFICATION bool option.
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-3-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add MDB_FLAGS_OFFLOAD_FAILED and MDB_PG_FLAGS_OFFLOAD_FAILED to indicate
that an attempt to offload the MDB entry to switchdev has failed.
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250411150323.1117797-2-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 8fa7292fee5c ("treewide: Switch/rename to timer_delete[_sync]()")
switched del_timer_sync to timer_delete_sync, but did not modify the
comment for bnad_dim_timeout(). Now fix it.
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/61DDCE7AB5B6CE82+20250411101736.160981-1-wangyuli@uniontech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I have done quite some review on hugetlb code over the years, and some
work on it as well, the latest being the hugetlb pagewalk unification
which is a work in progress, and touches hugetlb code to great lengths.
HugeTLB does not have many reviewers, so I would like to help out by
offering myself as an official Reviewer.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Link: https://lkml.kernel.org/r/20250409082452.269180-1-osalvador@suse.de
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Peter Seiderer says:
====================
net: pktgen: fix checkpatch code style errors/warnings
Fix checkpatch detected code style errors/warnings detected in
the file net/core/pktgen.c (remaining checkpatch checks will be addressed
in a follow up patch set).
====================
Link: https://patch.msgid.link/20250410071749.30505-1-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix checkpatch code style warnings:
WARNING: quoted string split across lines
#480: FILE: net/core/pktgen.c:480:
+ "Packet Generator for packet performance testing. "
+ "Version: " VERSION "\n";
WARNING: quoted string split across lines
#632: FILE: net/core/pktgen.c:632:
+ " udp_src_min: %d udp_src_max: %d"
+ " udp_dst_min: %d udp_dst_max: %d\n",
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
semicolon)
Fix checkpatch code style warnings:
WARNING: macros should not use a trailing semicolon
#180: FILE: net/core/pktgen.c:180:
+#define func_enter() pr_debug("entering %s\n", __func__);
WARNING: macros should not use a trailing semicolon
#234: FILE: net/core/pktgen.c:234:
+#define if_lock(t) mutex_lock(&(t->if_lock));
CHECK: Unnecessary parentheses around t->if_lock
#235: FILE: net/core/pktgen.c:235:
+#define if_unlock(t) mutex_unlock(&(t->if_lock));
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix checkpatch code style warnings:
WARNING: Missing a blank line after declarations
#761: FILE: net/core/pktgen.c:761:
+ char c;
+ if (get_user(c, &user_buffer[i]))
WARNING: Missing a blank line after declarations
#780: FILE: net/core/pktgen.c:780:
+ char c;
+ if (get_user(c, &user_buffer[i]))
WARNING: Missing a blank line after declarations
#806: FILE: net/core/pktgen.c:806:
+ char c;
+ if (get_user(c, &user_buffer[i]))
WARNING: Missing a blank line after declarations
#823: FILE: net/core/pktgen.c:823:
+ char c;
+ if (get_user(c, &user_buffer[i]))
WARNING: Missing a blank line after declarations
#1968: FILE: net/core/pktgen.c:1968:
+ char f[32];
+ memset(f, 0, 32);
WARNING: Missing a blank line after declarations
#2410: FILE: net/core/pktgen.c:2410:
+ struct pktgen_net *pn = net_generic(dev_net(pkt_dev->odev), pg_net_id);
+ if (!x) {
WARNING: Missing a blank line after declarations
#2442: FILE: net/core/pktgen.c:2442:
+ __u16 t;
+ if (pkt_dev->flags & F_QUEUE_MAP_RND) {
WARNING: Missing a blank line after declarations
#2523: FILE: net/core/pktgen.c:2523:
+ unsigned int i;
+ for (i = 0; i < pkt_dev->nr_labels; i++)
WARNING: Missing a blank line after declarations
#2567: FILE: net/core/pktgen.c:2567:
+ __u32 t;
+ if (pkt_dev->flags & F_IPSRC_RND)
WARNING: Missing a blank line after declarations
#2587: FILE: net/core/pktgen.c:2587:
+ __be32 s;
+ if (pkt_dev->flags & F_IPDST_RND) {
WARNING: Missing a blank line after declarations
#2634: FILE: net/core/pktgen.c:2634:
+ __u32 t;
+ if (pkt_dev->flags & F_TXSIZE_RND) {
WARNING: Missing a blank line after declarations
#2736: FILE: net/core/pktgen.c:2736:
+ int i;
+ for (i = 0; i < pkt_dev->cflows; i++) {
WARNING: Missing a blank line after declarations
#2738: FILE: net/core/pktgen.c:2738:
+ struct xfrm_state *x = pkt_dev->flows[i].x;
+ if (x) {
WARNING: Missing a blank line after declarations
#2752: FILE: net/core/pktgen.c:2752:
+ int nhead = 0;
+ if (x) {
WARNING: Missing a blank line after declarations
#2795: FILE: net/core/pktgen.c:2795:
+ unsigned int i;
+ for (i = 0; i < pkt_dev->nr_labels; i++)
WARNING: Missing a blank line after declarations
#3480: FILE: net/core/pktgen.c:3480:
+ ktime_t idle_start = ktime_get();
+ schedule();
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix checkpatch code style warnings:
WARNING: Block comments use a trailing */ on a separate line
+ * removal by worker thread */
WARNING: Block comments use * on subsequent lines
+ __u8 tos; /* six MSB of (former) IPv4 TOS
+ are for dscp codepoint */
WARNING: Block comments use a trailing */ on a separate line
+ are for dscp codepoint */
WARNING: Block comments use * on subsequent lines
+ __u8 traffic_class; /* ditto for the (former) Traffic Class in IPv6
+ (see RFC 3260, sec. 4) */
WARNING: Block comments use a trailing */ on a separate line
+ (see RFC 3260, sec. 4) */
WARNING: Block comments use * on subsequent lines
+ /* = {
+ 0x00, 0x80, 0xC8, 0x79, 0xB3, 0xCB,
WARNING: Block comments use * on subsequent lines
+ /* Field for thread to receive "posted" events terminate,
+ stop ifs etc. */
WARNING: Block comments use a trailing */ on a separate line
+ stop ifs etc. */
WARNING: Block comments should align the * on each line
+ * we go look for it ...
+*/
WARNING: Block comments use a trailing */ on a separate line
+ * we resolve the dst issue */
WARNING: Block comments use a trailing */ on a separate line
+ * with proc_create_data() */
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
statements)
Fix checkpatch code style warnings:
WARNING: suspect code indent for conditional statements (8, 17)
#2901: FILE: net/core/pktgen.c:2901:
+ } else {
+ skb = __netdev_alloc_skb(dev, size, GFP_NOWAIT);
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix checkpatch code style errors/checks:
CHECK: No space is necessary after a cast
#2984: FILE: net/core/pktgen.c:2984:
+ *(__be16 *) & eth[12] = protocol;
ERROR: space prohibited after that '&' (ctx:WxW)
#2984: FILE: net/core/pktgen.c:2984:
+ *(__be16 *) & eth[12] = protocol;
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix checkpatch code style errors:
ERROR: "foo * bar" should be "foo *bar"
#977: FILE: net/core/pktgen.c:977:
+ const char __user * user_buffer, size_t count,
ERROR: "foo * bar" should be "foo *bar"
#978: FILE: net/core/pktgen.c:978:
+ loff_t * offset)
ERROR: "foo * bar" should be "foo *bar"
#1912: FILE: net/core/pktgen.c:1912:
+ const char __user * user_buffer,
ERROR: "foo * bar" should be "foo *bar"
#1913: FILE: net/core/pktgen.c:1913:
+ size_t count, loff_t * offset)
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
netdevice reg_state was split into two 16 bit enums back in 2010
in commit a2835763e130 ("rtnetlink: handle rtnl_link netlink
notifications manually"). Since the split the fields have been
moved apart, and last year we converted reg_state to a normal
u8 in commit 4d42b37def70 ("net: convert dev->reg_state to u8").
rtnl_link_state being a 16 bitfield makes no sense. Convert it
to a single bool, it seems very unlikely after 15 years that
we'll need more values in it.
We could drop dev->rtnl_link_ops from the conditions but feels
like having it there more clearly points at the reason for this
hack.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250410014246.780885-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
UDP GRO accounting assumes that the GRO receive callback is always
set when the UDP tunnel is enabled, but syzkaller proved otherwise,
leading tot the following splat:
WARNING: CPU: 0 PID: 5837 at net/ipv4/udp_offload.c:123 udp_tunnel_update_gro_rcv+0x28d/0x4c0 net/ipv4/udp_offload.c:123
Modules linked in:
CPU: 0 UID: 0 PID: 5837 Comm: syz-executor850 Not tainted 6.14.0-syzkaller-13320-g420aabef3ab5 #0 PREEMPT(full)
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
RIP: 0010:udp_tunnel_update_gro_rcv+0x28d/0x4c0 net/ipv4/udp_offload.c:123
Code: 00 00 e8 c6 5a 2f f7 48 c1 e5 04 48 8d b5 20 53 c7 9a ba 10
00 00 00 4c 89 ff e8 ce 87 99 f7 e9 ce 00 00 00 e8 a4 5a 2f
f7 90 <0f> 0b 90 e9 de fd ff ff bf 01 00 00 00 89 ee e8 cf
5e 2f f7 85 ed
RSP: 0018:ffffc90003effa88 EFLAGS: 00010293
RAX: ffffffff8a93fc9c RBX: 0000000000000000 RCX: ffff8880306f9e00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff8a93fabe R09: 1ffffffff20bfb2e
R10: dffffc0000000000 R11: fffffbfff20bfb2f R12: ffff88814ef21738
R13: dffffc0000000000 R14: ffff88814ef21778 R15: 1ffff11029de42ef
FS: 0000000000000000(0000) GS:ffff888124f96000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f04eec760d0 CR3: 000000000eb38000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
udp_tunnel_cleanup_gro include/net/udp_tunnel.h:205 [inline]
udpv6_destroy_sock+0x212/0x270 net/ipv6/udp.c:1829
sk_common_release+0x71/0x2e0 net/core/sock.c:3896
inet_release+0x17d/0x200 net/ipv4/af_inet.c:435
__sock_release net/socket.c:647 [inline]
sock_close+0xbc/0x240 net/socket.c:1391
__fput+0x3e9/0x9f0 fs/file_table.c:465
task_work_run+0x251/0x310 kernel/task_work.c:227
exit_task_work include/linux/task_work.h:40 [inline]
do_exit+0xa11/0x27f0 kernel/exit.c:953
do_group_exit+0x207/0x2c0 kernel/exit.c:1102
__do_sys_exit_group kernel/exit.c:1113 [inline]
__se_sys_exit_group kernel/exit.c:1111 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1111
x64_sys_call+0x26c3/0x26d0 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f04eebfac79
Code: Unable to access opcode bytes at 0x7f04eebfac4f.
RSP: 002b:00007fffdcaa34a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f04eebfac79
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 00007f04eec75270 R08: ffffffffffffffb8 R09: 00007fffdcaa36c8
R10: 0000200000000000 R11: 0000000000000246 R12: 00007f04eec75270
R13: 0000000000000000 R14: 00007f04eec75cc0 R15: 00007f04eebcca70
Address the issue moving the accounting hook into
setup_udp_tunnel_sock() and set_xfrm_gro_udp_encap_rcv(), where
the GRO callback is actually set.
set_xfrm_gro_udp_encap_rcv() is prone to races with IPV6_ADDRFORM,
run the relevant setsockopt under the socket lock to ensure using
consistent values of sk_family and up->encap_type.
Refactor the GRO callback selection code, to make it clear that
the function pointer is always initialized.
Reported-by: syzbot+8c469a2260132cd095c1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8c469a2260132cd095c1
Fixes: 172bf009c18d ("xfrm: Support GRO for IPv4 ESP in UDP encapsulation")
Fixes: 5d7f5b2f6b935 ("udp_tunnel: use static call for GRO hooks when possible")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/92bcdb6899145a9a387c8fa9e3ca656642a43634.1744228733.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We started generating C code for OvS a while back, but actually
C codegen only supports fixed headers specified at the family
level right now (schema also allows specifying them per op).
ovs_flow and ovs_datapath already specify the fixed header
at the family level but ovs_vport does it per op.
Move the property, all ops use the same header.
This ensures YNL C sees the correct hdr_len:
const struct ynl_family ynl_ovs_vport_family = {
.name = "ovs_vport",
- .hdr_len = sizeof(struct genlmsghdr),
+ .hdr_len = sizeof(struct genlmsghdr) + sizeof(struct ovs_header),
};
Fixes: 7c59c9c8f202 ("tools: ynl: generate code for ovs families")
Link: https://patch.msgid.link/20250409145541.580674-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Lockdep found the following dependency:
&dev_instance_lock_key#3 -->
&rdev->wiphy.mtx -->
&net->xdp.lock -->
&xs->mutex -->
&dev_instance_lock_key#3
The first dependency is the problem. wiphy mutex should be outside
the instance locks. The problem happens in notifiers (as always)
for CLOSE. We only hold the instance lock for ops locked devices
during CLOSE, and WiFi netdevs are not ops locked. Unfortunately,
when we dev_close_many() during netns dismantle we may be holding
the instance lock of _another_ netdev when issuing a CLOSE for
a WiFi device.
Lockdep's "Possible unsafe locking scenario" only prints 3 locks
and we have 4, plus I think we'd need 3 CPUs, like this:
CPU0 CPU1 CPU2
---- ---- ----
lock(&xs->mutex);
lock(&dev_instance_lock_key#3);
lock(&rdev->wiphy.mtx);
lock(&net->xdp.lock);
lock(&xs->mutex);
lock(&rdev->wiphy.mtx);
lock(&dev_instance_lock_key#3);
Tho, I don't think that's possible as CPU1 and CPU2 would
be under rtnl_lock. Even if we have per-netns rtnl_lock and
wiphy can span network namespaces - CPU0 and CPU1 must be
in the same netns to see dev_instance_lock, so CPU0 can't
be installing a socket as CPU1 is tearing the netns down.
Regardless, our expected lock ordering is that wiphy lock
is taken before instance locks, so let's fix this.
Go over the ops locked and non-locked devices separately.
Note that calling dev_close_many() on an empty list is perfectly
fine. All processing (including RCU syncs) are conditional
on the list not being empty, already.
Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations")
Reported-by: syzbot+6f588c78bf765b62b450@syzkaller.appspotmail.com
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250412233011.309762-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull rdma fixes from Jason Gunthorpe:
- Fix hang in bnxt_re due to miscomputing the budget
- Avoid a -Wformat-security message in dev_set_name()
- Avoid an unused definition warning in fs.c with some kconfigs
- Fix error handling in usnic and remove IS_ERR_OR_NULL() usage
- Regression in RXE support foudn by blktests due to missing ODP
exclusions
- Set the dma_segment_size on HNS so it doesn't corrupt DMA when using
very large IOs
- Move a INIT_WORK to near when the work is allocated in cm.c to fix a
racey crash where work in progress was being init'd
- Use __GFP_NOWARN to not dump in kvcalloc() if userspace requests a
very big MR
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/bnxt_re: Remove unusable nq variable
RDMA/core: Silence oversized kvmalloc() warning
RDMA/cma: Fix workqueue crash in cma_netevent_work_handler
RDMA/hns: Fix wrong maximum DMA segment size
RDMA/rxe: Fix null pointer dereference in ODP MR check
RDMA/mlx5: Fix compilation warning when USER_ACCESS isn't set
RDMA/usnic: Fix passing zero to PTR_ERR in usnic_ib_pci_probe()
RDMA/ucaps: Avoid format-security warning
RDMA/bnxt_re: Fix budget handling of notification queue
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix NULL pointer dereference in virtiofs
- Fix slab OOB access in hfs/hfsplus
- Only create /proc/fs/netfs when CONFIG_PROC_FS is set
- Fix getname_flags() to initialize pointer correctly
- Convert dentry flags to enum
- Don't allow datadir without lowerdir in overlayfs
- Use namespace_{lock,unlock} helpers in dissolve_on_fput() instead of
plain namespace_sem so unmounted mounts are properly cleaned up
- Skip unnecessary ifs_block_is_uptodate check in iomap
- Remove an unused forward declaration in overlayfs
- Fix devpts uid/gid handling after converting to the new mount api
- Fix afs_dynroot_readdir() to not use the RCU read lock
- Fix mount_setattr() and open_tree_attr() to not pointlessly do path
lookup or walk the mount tree if no mount option change has been
requested
* tag 'vfs-6.15-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: use namespace_{lock,unlock} in dissolve_on_fput()
iomap: skip unnecessary ifs_block_is_uptodate check
fs: Fix filename init after recent refactoring
netfs: Only create /proc/fs/netfs with CONFIG_PROC_FS
mount: ensure we don't pointlessly walk the mount tree
dcache: convert dentry flag macros to enum
afs: Fix afs_dynroot_readdir() to not use the RCU read lock
hfs/hfsplus: fix slab-out-of-bounds in hfs_bnode_read_key
virtiofs: add filesystem context source name check
devpts: Fix type for uid and gid params
ovl: remove unused forward declaration
ovl: don't allow datadir only
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools fixes from Namhyung Kim:
"A couple of fixes and the usual tooling header updates:
- fix a build error on ARM64 when libunwind is requested
- fix an infinite loop with branch stack on AMD Zen3
- sync tooling headers with the kernel source"
* tag 'perf-tools-fixes-for-v6.15-2025-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
perf tools: Remove evsel__handle_error_quirks()
perf libunwind arm64: Fix missing close parens in an if statement
tools headers: Update the arch/x86/lib/memset_64.S copy with the kernel sources
tools headers: Update the x86 headers with the kernel sources
tools headers: Update the linux/unaligned.h copy with the kernel sources
tools headers: Update the uapi/asm-generic/mman-common.h copy with the kernel sources
tools headers: Update the uapi/linux/prctl.h copy with the kernel sources
tools headers: Update the syscall table with the kernel sources
tools headers: Update the VFS headers with the kernel sources
tools headers: Update the uapi/linux/perf_event.h copy with the kernel sources
tools headers: Update the socket headers with the kernel sources
tools headers: Update the KVM headers with the kernel sources
|
|
In order to work properly PRP requires slave1 and slave2 to share the
same MAC address. To ease the configuration process on userspace tools,
sync the slave2 MAC address with slave1. In addition, when deleting the
port from the list, restore the original MAC address.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ktest recently reported crashes while running several buffered io tests
with __alloc_tagging_slab_alloc_hook() at the top of the crash call stack.
The signature indicates an invalid address dereference with low bits of
slab->obj_exts being set. The bits were outside of the range used by
page_memcg_data_flags and objext_flags and hence were not masked out
by slab_obj_exts() when obtaining the pointer stored in slab->obj_exts.
The typical crash log looks like this:
00510 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
00510 Mem abort info:
00510 ESR = 0x0000000096000045
00510 EC = 0x25: DABT (current EL), IL = 32 bits
00510 SET = 0, FnV = 0
00510 EA = 0, S1PTW = 0
00510 FSC = 0x05: level 1 translation fault
00510 Data abort info:
00510 ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
00510 CM = 0, WnR = 1, TnD = 0, TagAccess = 0
00510 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
00510 user pgtable: 4k pages, 39-bit VAs, pgdp=0000000104175000
00510 [0000000000000010] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
00510 Internal error: Oops: 0000000096000045 [#1] SMP
00510 Modules linked in:
00510 CPU: 10 UID: 0 PID: 7692 Comm: cat Not tainted 6.15.0-rc1-ktest-g189e17946605 #19327 NONE
00510 Hardware name: linux,dummy-virt (DT)
00510 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00510 pc : __alloc_tagging_slab_alloc_hook+0xe0/0x190
00510 lr : __kmalloc_noprof+0x150/0x310
00510 sp : ffffff80c87df6c0
00510 x29: ffffff80c87df6c0 x28: 000000000013d1ff x27: 000000000013d200
00510 x26: ffffff80c87df9e0 x25: 0000000000000000 x24: 0000000000000001
00510 x23: ffffffc08041953c x22: 000000000000004c x21: ffffff80c0002180
00510 x20: fffffffec3120840 x19: ffffff80c4821000 x18: 0000000000000000
00510 x17: fffffffec3d02f00 x16: fffffffec3d02e00 x15: fffffffec3d00700
00510 x14: fffffffec3d00600 x13: 0000000000000200 x12: 0000000000000006
00510 x11: ffffffc080bb86c0 x10: 0000000000000000 x9 : ffffffc080201e58
00510 x8 : ffffff80c4821060 x7 : 0000000000000000 x6 : 0000000055555556
00510 x5 : 0000000000000001 x4 : 0000000000000010 x3 : 0000000000000060
00510 x2 : 0000000000000000 x1 : ffffffc080f50cf8 x0 : ffffff80d801d000
00510 Call trace:
00510 __alloc_tagging_slab_alloc_hook+0xe0/0x190 (P)
00510 __kmalloc_noprof+0x150/0x310
00510 __bch2_folio_create+0x5c/0xf8
00510 bch2_folio_create+0x2c/0x40
00510 bch2_readahead+0xc0/0x460
00510 read_pages+0x7c/0x230
00510 page_cache_ra_order+0x244/0x3a8
00510 page_cache_async_ra+0x124/0x170
00510 filemap_readahead.isra.0+0x58/0xa0
00510 filemap_get_pages+0x454/0x7b0
00510 filemap_read+0xdc/0x418
00510 bch2_read_iter+0x100/0x1b0
00510 vfs_read+0x214/0x300
00510 ksys_read+0x6c/0x108
00510 __arm64_sys_read+0x20/0x30
00510 invoke_syscall.constprop.0+0x54/0xe8
00510 do_el0_svc+0x44/0xc8
00510 el0_svc+0x18/0x58
00510 el0t_64_sync_handler+0x104/0x130
00510 el0t_64_sync+0x154/0x158
00510 Code: d5384100 f9401c01 b9401aa3 b40002e1 (f8227881)
00510 ---[ end trace 0000000000000000 ]---
00510 Kernel panic - not syncing: Oops: Fatal exception
00510 SMP: stopping secondary CPUs
00510 Kernel Offset: disabled
00510 CPU features: 0x0000,000000e0,00000410,8240500b
00510 Memory Limit: none
Investigation indicates that these bits are already set when we allocate
slab page and are not zeroed out after allocation. We are not yet sure
why these crashes start happening only recently but regardless of the
reason, not initializing a field that gets used later is wrong. Fix it
by initializing slab->obj_exts during slab page allocation.
Fixes: 21c690a349ba ("mm: introduce slabobj_ext to support slab object extensions")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Tested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250411155737.1360746-1-surenb@google.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Prior to commit e614a00117bc2d, xmbuf_map_backing_mem relied on
folio_file_page to return the base page for the xmbuf's loff_t in the
xfile, and set b_addr to the page_address of that base page.
Now that folio_file_page has been removed from xmbuf_map_backing_mem, we
always set b_addr to the folio_address of the folio. This is correct
for the situation where the folio size matches the buffer size, but it's
totally wrong if tmpfs uses large folios. We need to use
offset_in_folio here.
Found via xfs/801, which demonstrated evidence of corruption of an
in-memory rmap btree block right after initializing an adjacent block.
Fixes: e614a00117bc2d ("xfs: cleanup mapping tmpfs folios into the buffer cache")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Presently we start garbage collection late - when we start running
out of free zones to backfill max_open_zones. This is a reasonable
default as it minimizes write amplification. The longer we wait,
the more blocks are invalidated and reclaim cost less in terms
of blocks to relocate.
Starting this late however introduces a risk of GC being outcompeted
by user writes. If GC can't keep up, user writes will be forced to
wait for free zones with high tail latencies as a result.
This is not a problem under normal circumstances, but if fragmentation
is bad and user write pressure is high (multiple full-throttle
writers) we will "bottom out" of free zones.
To mitigate this, introduce a zonegc_low_space tunable that lets the
user specify a percentage of how much of the unused space that GC
should keep available for writing. A high value will reclaim more of
the space occupied by unused blocks, creating a larger buffer against
write bursts.
This comes at a cost as write amplification is increased. To
illustrate this using a sample workload, setting zonegc_low_space to
60% avoids high (500ms) max latencies while increasing write
amplification by 15%.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
xfs_buf_free can call vunmap, which can sleep. The vunmap path is an
unlikely one, so add might_sleep to ensure calling xfs_buf_free from
atomic context gets caught more easily.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Marking a log item as failed kept a buffer reference around for
resubmission of inode and dquote items.
For inode items commit 298f7bec503f3 ("xfs: pin inode backing buffer to
the inode log item") started pinning the inode item buffers
unconditionally and removed the need for this. Later commit acc8f8628c37
("xfs: attach dquot buffer to dquot log item buffer") did the same for
dquot items but didn't fully clean up the xfs_clear_li_failed side
for them. Stop adding the extra pin for dquot items and remove the
helpers.
This happens to fix a call to xfs_buf_free with the AIL lock held,
which would be incorrect for the unlikely case freeing the buffer
ends up calling vfree.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
|