Age | Commit message (Collapse) | Author | Files | Lines |
|
This reverts commit bfaaba99e680bf82bf2cbf69866c3f37434ff766.
Commit bfaaba99e680 ("ice: Hide bus-info in ethtool for PRs in switchdev
mode") was a workaround for lshw tool displaying incorrect
descriptions for port representors and PF in switchdev mode. Now the issue
has been fixed in the lshw tool itself [1].
Removing the workaround can be considered a regression, as the user might
be running older, unpatched lshw version. However, another important change
(ice: link representors to PCI device, which improves port representor
netdev naming with SET_NETDEV_DEV) also causes the same "regression" as
removing the workaround, i.e. unpatched lshw is able to access bus-info
information (this time not via ethtool) and the bug can occur. Therefore,
the workaround no longer prevents the bug and can be removed.
[1] https://ezix.org/src/pkg/lshw/commit/9bf4e4c9c1
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Pull rdma fixes from Jason Gunthorpe:
"A few recent regressions in rxe's multicast code, and some old driver
bugs:
- Error case unwind bug in rxe for rkeys
- Dot not call netdev functions under a spinlock in rxe multicast
code
- Use the proper BH lock type in rxe multicast code
- Fix idrma deadlock and crash
- Add a missing flush to drain irdma QPs when in error
- Fix high userspace latency in irdma during destroy due to
synchronize_rcu()
- Rare race in siw MPA processing"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rxe: Change mcg_lock to a _bh lock
RDMA/rxe: Do not call dev_mc_add/del() under a spinlock
RDMA/siw: Fix a condition race issue in MPA request processing
RDMA/irdma: Fix possible crash due to NULL netdev in notifier
RDMA/irdma: Reduce iWARP QP destroy time
RDMA/irdma: Flush iWARP QP if modified to ERR from RTR state
RDMA/rxe: Recheck the MR in when generating a READ reply
RDMA/irdma: Fix deadlock in irdma_cleanup_cm_core()
RDMA/rxe: Fix "Replace mr by rkey in responder resources"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull mmc fixes from Ulf Hansson:
"MMC core:
- Fix initialization for eMMC's HS200/HS400 mode
MMC host:
- sdhci-msm: Reset GCC_SDCC_BCR register to prevent timeout issues
- sunxi-mmc: Fix DMA descriptors allocated above 32 bits"
* tag 'mmc-v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-msm: Reset GCC_SDCC_BCR register for SDHC
mmc: sunxi-mmc: Fix DMA descriptors allocated above 32 bits
mmc: core: Set HS clock speed before sending HS CMD13
|
|
Link port representors to parent PCI device to benefit from
systemd defined naming scheme.
Example from ip tool:
- without linking:
eth0 ...
- with linking:
eth0 ...
altname enp24s0f0npf0vf0
The port representor name is being shown in altname, because the name is
longer than IFNAMSIZ (16) limit. Altname can be used in ip tool.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Pull drm fixes from Dave Airlie:
"A pretty quiet week, one fbdev, msm, kconfig, and two amdgpu fixes,
about what I'd expect for rc6.
fbdev:
- hotunplugging fix
amdgpu:
- Fix a xen dom0 regression on APUs
- Fix a potential array overflow if a receiver were to send an
erroneous audio channel count
msm:
- lockdep fix.
it6505:
- kconfig fix"
* tag 'drm-fixes-2022-05-06' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: Avoid reading audio pattern past AUDIO_CHANNELS_COUNT
drm/amdgpu: do not use passthrough mode in Xen dom0
drm/bridge: ite-it6505: add missing Kconfig option select
fbdev: Make fb_release() return -ENODEV if fbdev was unregistered
drm/msm/dp: remove fail safe mode related code
|
|
When one port's input state get inverted (eg. from low to hight) after
pca953x_irq_setup but before setting irq_mask (by some other driver such as
"gpio-keys"), the next inversion of this port (eg. from hight to low) will not
be triggered any more (because irq_stat is not updated at the first time). Issue
should be fixed after this commit.
Fixes: 89ea8bbe9c3e ("gpio: pca953x.c: add interrupt handling capability")
Signed-off-by: Puyou Lu <puyou.lu@gmail.com>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
|
|
Jakub Kicinski says:
====================
net: disambiguate the TSO and GSO limits
This series separates the device-reported TSO limitations
from the user space-controlled GSO limits. It used to be that
we only had the former (HW limits) but they were named GSO.
This probably lead to confusion and letting user override them.
The problem came up in the BIG TCP discussion between Eric and
Alex, and seems like something we should address.
Targeting net-next because (a) nobody is reporting problems;
and (b) there is a tiny but non-zero chance that some actually
wants to lift the HW limitations.
====================
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
These are now internal to the core, no need to expose them.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Drivers should call the TSO setting helper, GSO is controllable
by user space.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Up until commit 46e6b992c250 ("rtnetlink: allow GSO maximums to
be set on device creation") the gso_max_segs and gso_max_size
of a device were not controlled from user space.
The quoted commit added the ability to control them because of
the following setup:
netns A | netns B
veth<->veth eth0
If eth0 has TSO limitations and user wants to efficiently forward
traffic between eth0 and the veths they should copy the TSO
limitations of eth0 onto the veths. This would happen automatically
for macvlans or ipvlan but veth users are not so lucky (given the
loose coupling).
Unfortunately the commit in question allowed users to also override
the limits on real HW devices.
It may be useful to control the max GSO size and someone may be using
that ability (not that I know of any user), so create a separate set
of knobs to reliably record the TSO limitations. Validate the user
requests.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To make later patches smaller create a helper for inheriting
the TSO limitations of a lower device. The TSO in the name
is not an accident, subsequent patches will replace GSO
with TSO in more names.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
builtin module
When building the Surface Aggregator Module (SAM) core, registry, and
other SAM client drivers as builtin modules (=y), proper initialization
order is not guaranteed. Due to this, client driver registration
(triggered by device registration in the registry) races against bus
initialization in the core.
If any attempt is made at registering the device driver before the bus
has been initialized (i.e. if bus initialization fails this race) driver
registration will fail with a message similar to:
Driver surface_battery was unable to register with bus_type surface_aggregator because the bus was not initialized
Switch from module_init() to subsys_initcall() to resolve this issue.
Note that the serdev subsystem uses postcore_initcall() so we are still
able to safely register the serdev device driver for the core.
Fixes: c167b9c7e3d6 ("platform/surface: Add Surface Aggregator subsystem")
Reported-by: Blaž Hrastnik <blaz@mxxn.io>
Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Link: https://lore.kernel.org/r/20220429195738.535751-1-luzmaximilian@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
The new Surface Pro 8 uses GPEs for lid events as well. Add an entry for
that so that the lid can be used to wake the device. Note that this is a
device with a keyboard type-cover, where this acts as the "lid".
Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Link: https://lore.kernel.org/r/20220429180049.1282447-1-luzmaximilian@gmail.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
'rmmod pmt_telemetry' panics with:
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 1697 Comm: rmmod Tainted: G S W -------- --- 5.18.0-rc4 #3
Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-P DDR5 RVP, BIOS ADLPFWI1.R00.3056.B00.2201310233 01/31/2022
RIP: 0010:device_del+0x1b/0x3d0
Code: e8 1a d9 e9 ff e9 58 ff ff ff 48 8b 08 eb dc 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d af 80 00 00 00 53 48 89 fb 48 83 ec 18 <4c> 8b 67 40 48 89 ef 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31
RSP: 0018:ffffb520415cfd60 EFLAGS: 00010286
RAX: 0000000000000070 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000080 R08: ffffffffffffffff R09: ffffb520415cfd78
R10: 0000000000000002 R11: ffffb520415cfd78 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f7e198e5740(0000) GS:ffff905c9f700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 000000010782a005 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
? __xa_erase+0x53/0xb0
device_unregister+0x13/0x50
intel_pmt_dev_destroy+0x34/0x60 [pmt_class]
pmt_telem_remove+0x40/0x50 [pmt_telemetry]
auxiliary_bus_remove+0x18/0x30
device_release_driver_internal+0xc1/0x150
driver_detach+0x44/0x90
bus_remove_driver+0x74/0xd0
auxiliary_driver_unregister+0x12/0x20
pmt_telem_exit+0xc/0xe4a [pmt_telemetry]
__x64_sys_delete_module+0x13a/0x250
? syscall_trace_enter.isra.19+0x11e/0x1a0
do_syscall_64+0x58/0x80
? syscall_exit_to_user_mode+0x12/0x30
? do_syscall_64+0x67/0x80
? syscall_exit_to_user_mode+0x12/0x30
? do_syscall_64+0x67/0x80
? syscall_exit_to_user_mode+0x12/0x30
? do_syscall_64+0x67/0x80
? exc_page_fault+0x64/0x140
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7e1803a05b
Code: 73 01 c3 48 8b 0d 2d 4e 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fd 4d 38 00 f7 d8 64 89 01 48
The probe function, pmt_telem_probe(), adds an entry for devices even if
they have not been initialized. This results in the array of initialized
devices containing both initialized and uninitialized entries. This
causes a panic in the remove function, pmt_telem_remove() which expects
the array to only contain initialized entries.
Only use an entry when a device is initialized.
Cc: "David E. Box" <david.e.box@linux.intel.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Mark Gross <markgross@kernel.org>
Cc: platform-driver-x86@vger.kernel.org
Signed-off-by: David Arcari <darcari@redhat.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: David E. Box <david.e.box@linux.intel.com>
Link: https://lore.kernel.org/r/20220429122322.2550003-1-prarit@redhat.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
There was an issue with the dual fan probe whereby the probe was
failing as it assuming that second_fan support was not available.
Corrected the logic so the probe works correctly. Cleaned up so
quirks only used if 2nd fan not detected.
Tested on X1 Carbon 10 (2 fans), X1 Carbon 9 (2 fans) and T490 (1 fan)
Signed-off-by: Mark Pearson <markpearson@lenovo.com>
Link: https://lore.kernel.org/r/20220502191200.63470-1-markpearson@lenovo.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
Lenovo laptops that contain NVME SSDs across a variety of generations have
trouble resuming from suspend to idle when the IOMMU translation layer is
active for the NVME storage device.
This generally manifests as a large resume delay or page faults. These
delays and page faults occur as a result of a Lenovo BIOS specific SMI
that runs during the D3->D0 transition on NVME devices.
This SMI occurs because of a flag that is set during resume by Lenovo
firmware:
```
OperationRegion (PM80, SystemMemory, 0xFED80380, 0x10)
Field (PM80, AnyAcc, NoLock, Preserve)
{
SI3R, 1
}
Method (_ON, 0, NotSerialized) // _ON_: Power On
{
TPST (0x60D0)
If ((DAS3 == 0x00))
{
If (SI3R)
{
TPST (0x60E0)
M020 (NBRI, 0x00, 0x00, 0x04, (NCMD | 0x06))
M020 (NBRI, 0x00, 0x00, 0x10, NBAR)
APMC = HDSI /* \HDSI */
SLPS = 0x01
SI3R = 0x00
TPST (0x60E1)
}
D0NV = 0x01
}
}
```
Create a quirk that will run early in the resume process to prevent this
SMI from running. As any of these machines are fixed, they can be peeled
back from this quirk or narrowed down to individual firmware versions.
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1910
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1689
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mark Pearson <markpearson@lenvo.com>
Link: https://lore.kernel.org/r/20220429030501.1909-3-mario.limonciello@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
DMI matching in thinkpad_acpi happens local to a function meaning
quirks can only match that function.
Future changes to thinkpad_acpi may need to quirk other code, so
change this to use a quirk infrastructure.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mark Pearson <markpearson@lenvo.com>
Link: https://lore.kernel.org/r/20220429030501.1909-2-mario.limonciello@amd.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|
Simon Horman says:
====================
nfp: flower: decap neighbour table rework
Louis Peens says:
This patch series reworks the way in which flow rules that outputs to
OVS internal ports gets handled by the nfp driver.
Previously this made use of a small pre_tun_table, but this only used
destination MAC addresses, and made the implicit assumption that there is
only a single source MAC":"destination MAC" mapping per tunnel. In
hindsight this seems to be a pretty obvious oversight, but this was hidden
in plain sight for quite some time.
This series changes the implementation to make use of the same Neighbour
table for decap that is in use for the tunnel encap solution. It stores
any new Neighbour updates in this table. Previously this path was only
triggered for encapsulation candidates, and the entries were send and
forget, not saved on the host as it is after this series. It also keeps
track of any flow rule that outputs to OVS internal ports (and some
other criteria not worth mentioning here), very similar to how it was
done previously, except now these flows are kept track of in a list.
When a new Neighbour entry gets added this list gets iterated for
potential matches, in which case the table gets updated with a reference
to the flow, and the Neighbour entry on the card gets updated with the
relevant host_ctx. The same happens when a new qualifying flow gets
added - the Neighbour table gets iterated for applicable matches, and
once again the firmware gets updated with the host_ctx when any matches
are found.
Since this also requires a firmware change we add a new capability bit,
and keep the old behaviour in case of older firmware without this bit
set.
This series starts by doing some preparation, then adding the new list
and table entries. Next the functionality to link/unlink these entries
are added, and finally this new functionality is enabled by adding the
DECAP_V2 bit to the driver feature list.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Finally enable the decap_v2 feature bit now that all the
other bits are in place to configure it correctly.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With the neighbour entries now stored in a dedicated table there
is no use to make use of the tunnel route cache anymore, so remove
this.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add helper functions that can create links between flow rules
and cached neighbour entries. Also add the relevant calls to
these functions.
* When a new neighbour entry gets added cycle through the saved
pre_tun flow list and link any relevant matches. Update the
neighbour table on the nfp with this new information.
* When a new pre_tun flow rule gets added iterate through the
save neighbour entries and link any relevant matches. Once
again update the nfp neighbour table with any new links.
* Do the inverse when deleting - remove any created links and
also inform the nfp of this.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch updates the way in which the tunnel neighbour entries
are handled. Previously they were mostly send-and-forget, with
just the destination IP's cached in a list. This update changes
to a scheme where the neighbour entry information is stored in
a hash table.
The reason for this is that the neighbour table will now also
be used on the decapsulation path, whereas previously it was
only used for encapsulation. We need to save more of the neighbour
information in order to link them with flower flows in follow
up patches.
Updating of the neighbour table is now also handled by the same
function, instead of separate *_write_neigh_vX functions.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Prepare for more rework in following patches by updating
the existing nfp_neigh_structs. The update allows for
the same headers to be used for both old and new firmware,
with a slight length adjustment when sending the control message
to the firmware.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a callback is received to invalidate a neighbour entry
there is no need to try and populate any other flow information.
Only the flowX->daddr information is needed as lookup key to delete
an entry from the NFP neighbour table. Fix this by only doing the
lookup if the callback is for a new entry.
As part of this cleanup remove the setting of flow6.flowi6_proto, as
this is not needed either, it looks to be a possible leftover from a
previous implementation.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure that the rule also matches on source MAC address. On top
of that also now save the src and dst MAC addresses similar to how
vlan_tci is saved - this will be used in later comparisons with
neighbour entries. Indicate if the flow matched on ipv4 or ipv6.
Populate the vlan_tpid field that got added to the pre_run_rule
struct as well.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add calls to add and remove flows to the predt_table. This very simply
just allocates and add a new pretun entry if detected as such, and
removes it when encountered on a delete flow.
Compatibility for older firmware is kept in place through the
DECAP_V2 feature bit.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The previous implementation of using a pre_tun_table for decap has
some limitations, causing flows to end up unoffloaded when in fact
we are able to offload them. This is because the pre_tun_table does
not have enough matching resolution. The next step is to instead make
use of the neighbour table which already exists for the encap direction.
This patch prepares for this by:
- Moving nfp_tun_neigh/_v6 to main.h.
- Creating two new "wrapping" structures, one to keep track of neighbour
entries (previously they were send-and-forget), and another to keep
track of pre_tun flows.
- Create a new list in nfp_flower_priv to keep track of pre_tunnel flows
- Create a new table in nfp_flower_priv to keep track of next neighbour
entries
- Initialising and destroying these new list/tables
- Extending nfp_fl_payload->pre_tun_rule to save more information for
future use.
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2022-05-05
This series contains updates to ice driver only.
Wan Jiabing converts an open coded min selection to min_t().
Maciej commonizes on a single find VSI function and removes the
duplicated implementation.
Wojciech adjusts the return value when exceeding ICE_MAX_CHAIN_WORDS to,
a more appropriate, -ENOSPC and allows for the error to be propagated.
Michal adds support for ndo_get_devlink_port().
Jake does some cleanup related to virtualization code. Mainly involving
function header comments and wording changes. NULL checks are added to
ice_get_vf_vsi() calls in order to prevent static analysis tools from
complaining that a NULL value could be dereferenced.
---
v2: Dropped patch 1: "ice: Add support for classid based queue selection"
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With CONFIG_FORTIFY_SOURCE enabled, string functions will also perform
dynamic checks for string size which can panic the kernel, like incase
of overflow detection.
In papr_scm, papr_scm_pmu_check_events function uses stat->stat_id with
string operations, to populate the nvdimm_events_map array. Since
stat_id variable is not NULL terminated, the kernel panics with
CONFIG_FORTIFY_SOURCE enabled at boot time.
Below are the logs of kernel panic:
detected buffer overflow in __fortify_strlen
------------[ cut here ]------------
kernel BUG at lib/string_helpers.c:980!
Oops: Exception in kernel mode, sig: 5 [#1]
NIP [c00000000077dad0] fortify_panic+0x28/0x38
LR [c00000000077dacc] fortify_panic+0x24/0x38
Call Trace:
[c0000022d77836e0] [c00000000077dacc] fortify_panic+0x24/0x38 (unreliable)
[c00800000deb2660] papr_scm_pmu_check_events.constprop.0+0x118/0x220 [papr_scm]
[c00800000deb2cb0] papr_scm_probe+0x288/0x62c [papr_scm]
[c0000000009b46a8] platform_probe+0x98/0x150
Fix this issue by using kmemdup_nul() to copy the content of
stat->stat_id directly to the nvdimm_events_map array.
mpe: stat->stat_id comes from the hypervisor, not userspace, so there is
no security exposure.
Fixes: 4c08d4bbc089 ("powerpc/papr_scm: Add perf interface support")
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220505153451.35503-1-kjain@linux.ibm.com
|
|
Vladimir Oltean says:
====================
Ocelot VCAP fixes
Changes in v2:
fix the NPDs and UAFs caused by filter->trap_list in a more robust way
that actually does not introduce bugs of its own (1/5)
This series fixes issues found while running
tools/testing/selftests/net/forwarding/tc_actions.sh on the ocelot
switch:
- NULL pointer dereference when failing to offload a filter
- NULL pointer dereference after deleting a trap
- filters still having effect after being deleted
- dropped packets still being seen by software
- statistics counters showing double the amount of hits
- statistics counters showing inexistent hits
- invalid configurations not rejected
====================
Link: https://lore.kernel.org/r/20220504235503.4161890-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Given the following order of operations:
(1) we add filter A using tc-flower
(2) we send a packet that matches it
(3) we read the filter's statistics to find a hit count of 1
(4) we add a second filter B with a higher preference than A, and A
moves one position to the right to make room in the TCAM for it
(5) we send another packet, and this matches the second filter B
(6) we read the filter statistics again.
When this happens, the hit count of filter A is 2 and of filter B is 1,
despite a single packet having matched each filter.
Furthermore, in an alternate history, reading the filter stats a second
time between steps (3) and (4) makes the hit count of filter A remain at
1 after step (6), as expected.
The reason why this happens has to do with the filter->stats.pkts field,
which is written to hardware through the call path below:
vcap_entry_set
/ | \
/ | \
/ | \
/ | \
es0_entry_set is1_entry_set is2_entry_set
\ | /
\ | /
\ | /
vcap_data_set(data.counter, ...)
The primary role of filter->stats.pkts is to transport the filter hit
counters from the last readout all the way from vcap_entry_get() ->
ocelot_vcap_filter_stats_update() -> ocelot_cls_flower_stats().
The reason why vcap_entry_set() writes it to hardware is so that the
counters (saturating and having a limited bit width) are cleared
after each user space readout.
The writing of filter->stats.pkts to hardware during the TCAM entry
movement procedure is an unintentional consequence of the code design,
because the hit count isn't up to date at this point.
So at step (4), when filter A is moved by ocelot_vcap_filter_add() to
make room for filter B, the hardware hit count is 0 (no packet matched
on it in the meantime), but filter->stats.pkts is 1, because the last
readout saw the earlier packet. The movement procedure programs the old
hit count back to hardware, so this creates the impression to user space
that more packets have been matched than they really were.
The bug can be seen when running the gact_drop_and_ok_test() from the
tc_actions.sh selftest.
Fix the issue by reading back the hit count to tmp->stats.pkts before
migrating the VCAP filter. Sure, this is a best-effort technique, since
the packets that hit the rule between vcap_entry_get() and
vcap_entry_set() won't be counted, but at least it allows the counters
to be reliably used for selftests where the traffic is under control.
The vcap_entry_get() name is a bit unintuitive, but it only reads back
the counter portion of the TCAM entry, not the entire entry.
The index from which we retrieve the counter is also a bit unintuitive
(i - 1 during add, i + 1 during del), but this is the way in which TCAM
entry movement works. The "entry index" isn't a stored integer for a
TCAM filter, instead it is dynamically computed by
ocelot_vcap_block_get_filter_index() based on the entry's position in
the &block->rules list. That position (as well as block->count) is
automatically updated by ocelot_vcap_filter_add_to_block() on add, and
by ocelot_vcap_block_remove_filter() on del. So "i" is the new filter
index, and "i - 1" or "i + 1" respectively are the old addresses of that
TCAM entry (we only support installing/deleting one filter at a time).
Fixes: b596229448dd ("net: mscc: ocelot: Add support for tcam")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Once the CPU port was added to the destination port mask of a packet, it
can never be cleared, so even packets marked as dropped by the MASK_MODE
of a VCAP IS2 filter will still reach it. This is why we need the
OCELOT_POLICER_DISCARD to "kill dropped packets dead" and make software
stop seeing them.
We disallow policer rules from being put on any other chain than the one
for the first lookup, but we don't do this for "drop" rules, although we
should. This change is merely ascertaining that the rules dont't
(completely) work and letting the user know.
The blamed commit is the one that introduced the multi-chain architecture
in ocelot. Prior to that, we should have always offloaded the filters to
VCAP IS2 lookup 0, where they did work.
Fixes: 1397a2eb52e2 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The VCAP IS2 TCAM is looked up twice per packet, and each filter can be
configured to only match during the first, second lookup, or both, or
none.
The blamed commit wrote the code for making VCAP IS2 filters match only
on the given lookup. But right below that code, there was another line
that explicitly made the lookup a "don't care", and this is overwriting
the lookup we've selected. So the code had no effect.
Some of the more noticeable effects of having filters match on both
lookups:
- in "tc -s filter show dev swp0 ingress", we see each packet matching a
VCAP IS2 filter counted twice. This throws off scripts such as
tools/testing/selftests/net/forwarding/tc_actions.sh and makes them
fail.
- a "tc-drop" action offloaded to VCAP IS2 needs a policer as well,
because once the CPU port becomes a member of the destination port
mask of a packet, nothing removes it, not even a PERMIT/DENY mask mode
with a port mask of 0. But VCAP IS2 rules with the POLICE_ENA bit in
the action vector can only appear in the first lookup. What happens
when a filter matches both lookups is that the action vector is
combined, and this makes the POLICE_ENA bit ineffective, since the
last lookup in which it has appeared is the second one. In other
words, "tc-drop" actions do not drop packets for the CPU port, dropped
packets are still seen by software unless there was an FDB entry that
directed those packets to some other place different from the CPU.
The last bit used to work, because in the initial commit b596229448dd
("net: mscc: ocelot: Add support for tcam"), we were writing the FIRST
field of the VCAP IS2 half key with a 1, not with a "don't care".
The change to "don't care" was made inadvertently by me in commit
c1c3993edb7c ("net: mscc: ocelot: generalize existing code for VCAP"),
which I just realized, and which needs a separate fix from this one,
for "stable" kernels that lack the commit blamed below.
Fixes: 226e9cd82a96 ("net: mscc: ocelot: only install TCAM entries into a specific lookup and PAG")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
deleted
ocelot_vcap_filter_del() works by moving the next filters over the
current one, and then deleting the last filter by calling vcap_entry_set()
with a del_filter which was specially created by memsetting its memory
to zeroes. vcap_entry_set() then programs this to the TCAM and action
RAM via the cache registers.
The problem is that vcap_entry_set() is a dispatch function which looks
at del_filter->block_id. But since del_filter is zeroized memory, the
block_id is 0, or otherwise said, VCAP_ES0. So practically, what we do
is delete the entry at the same TCAM index from VCAP ES0 instead of IS1
or IS2.
The code was not always like this. vcap_entry_set() used to simply be
is2_entry_set(), and then, the logic used to work.
Restore the functionality by populating the block_id of the del_filter
based on the VCAP block of the filter that we're deleting. This makes
vcap_entry_set() know what to do.
Fixes: 1397a2eb52e2 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since the blamed commit, VCAP filters can appear on more than one list.
If their action is "trap", they are chained on ocelot->traps via
filter->trap_list. This is in addition to their normal placement on the
VCAP block->rules list head.
Therefore, when we free a VCAP filter, we must remove it from all lists
it is a member of, including ocelot->traps.
There are at least 2 bugs which are direct consequences of this design
decision.
First is the incorrect usage of list_empty(), meant to denote whether
"filter" is chained into ocelot->traps via filter->trap_list.
This does not do the correct thing, because list_empty() checks whether
"head->next == head", but in our case, head->next == head->prev == NULL.
So we dereference NULL pointers and die when we call list_del().
Second is the fact that not all places that should remove the filter
from ocelot->traps do so. One example is ocelot_vcap_block_remove_filter(),
which is where we have the main kfree(filter). By keeping freed filters
in ocelot->traps we end up in a use-after-free in
felix_update_trapping_destinations().
Attempting to fix all the buggy patterns is a whack-a-mole game which
makes the driver unmaintainable. Actually this is what the previous
patch version attempted to do:
https://patchwork.kernel.org/project/netdevbpf/patch/20220503115728.834457-3-vladimir.oltean@nxp.com/
but it introduced another set of bugs, because there are other places in
which create VCAP filters, not just ocelot_vcap_filter_create():
- ocelot_trap_add()
- felix_tag_8021q_vlan_add_rx()
- felix_tag_8021q_vlan_add_tx()
Relying on the convention that all those code paths must call
INIT_LIST_HEAD(&filter->trap_list) is not going to scale.
So let's do what should have been done in the first place and keep a
bool in struct ocelot_vcap_filter which denotes whether we are looking
at a trapping rule or not. Iterating now happens over the main VCAP IS2
block->rules. The advantage is that we no longer risk having stale
references to a freed filter, since it is only present in that list.
Fixes: e42bd4ed09aa ("net: mscc: ocelot: keep traps in a list")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use kzalloc rather than duplicating its implementation, which
makes code simple and easy to understand.
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-6-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Read requests that return with NRF error are partially completed in
dasd_eckd_ese_read(). The function keeps track of the amount of
processed bytes and the driver will eventually return this information
back to the block layer for further processing via __dasd_cleanup_cqr()
when the request is in the final stage of processing (from the driver's
perspective).
For this, blk_update_request() is used which requires the number of
bytes to complete the request. As per documentation the nr_bytes
parameter is described as follows:
"number of bytes to complete for @req".
This was mistakenly interpreted as "number of bytes _left_ for @req"
leading to new requests with incorrect data length. The consequence are
inconsistent and completely wrong read requests as data from random
memory areas are read back.
Fix this by correctly specifying the amount of bytes that should be used
to complete the request.
Fixes: 5e6bdd37c552 ("s390/dasd: fix data corruption for thin provisioned devices")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-5-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When reading unformatted tracks on ESE devices, the corresponding memory
areas are simply set to zero for each segment. This is done incorrectly
for blocksizes < 4096.
There are two problems. First, the increment of dst is done using the
counter of the loop (off), which is increased by blksize every
iteration. This leads to a much bigger increment for dst as actually
intended. Second, the increment of dst is done before the memory area
is set to 0, skipping a significant amount of bytes of memory.
This leads to illegal overwriting of memory and ultimately to a kernel
panic.
This is not a problem with 4k blocksize because
blk_queue_max_segment_size is set to PAGE_SIZE, always resulting in a
single iteration for the inner segment loop (bv.bv_len == blksize). The
incorrectly used 'off' value to increment dst is 0 and the correct
memory area is used.
In order to fix this for blksize < 4k, increment dst correctly using the
blksize and only do it at the end of the loop.
Fixes: 5e2b17e712cf ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For ESE devices we get an error for write operations on an unformatted
track. Afterwards the track will be formatted and the IO operation
restarted.
When using alias devices a track might be accessed by multiple requests
simultaneously and there is a race window that a track gets formatted
twice resulting in data loss.
Prevent this by remembering the amount of formatted tracks when starting
a request and comparing this number before actually formatting a track
on the fly. If the number has changed there is a chance that the current
track was finally formatted in between. As a result do not format the
track and restart the current IO to check.
The number of formatted tracks does not match the overall number of
formatted tracks on the device and it might wrap around but this is no
problem. It is only needed to recognize that a track has been formatted at
all in between.
Fixes: 5e2b17e712cf ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For ESE devices we get an error when accessing an unformatted track.
The handling of this error will return zero data for read requests and
format the track on demand before writing to it. To do this the code needs
to distinguish between read and write requests. This is done with data from
the blocklayer request. A pointer to the blocklayer request is stored in
the CQR.
If there is an error on the device an ERP request is built to do error
recovery. While the ERP request is mostly a copy of the original CQR the
pointer to the blocklayer request is not copied to not accidentally pass
it back to the blocklayer without cleanup.
This leads to the error that during ESE handling after an ERP request was
built it is not possible to determine the IO direction. This leads to the
formatting of a track for read requests which might in turn lead to data
corruption.
Fixes: 5e2b17e712cf ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Mat Martineau says:
====================
mptcp: Improve MPTCP-level window tracking
This series improves MPTCP receive window compliance with RFC 8684 and
helps increase throughput on high-speed links. Note that patch 3 makes a
change in tcp_output.c
For the details, Paolo says:
I've been chasing bad/unstable performance with multiple subflows
on very high speed links.
It looks like the root cause is due to the current mptcp-level
congestion window handling. There are apparently a few different
sub-issues:
- the rcv_wnd is not effectively shared on the tx side, as each
subflow takes in account only the value received by the underlaying
TCP connection. This is addressed in patch 1/5
- The mptcp-level offered wnd right edge is currently allowed to shrink.
Reading section 3.3.4.:
"""
The receive window is relative to the DATA_ACK. As in TCP, a
receiver MUST NOT shrink the right edge of the receive window (i.e.,
DATA_ACK + receive window). The receiver will use the data sequence
number to tell if a packet should be accepted at the connection
level.
"""
I read the above as we need to reflect window right-edge tracking
on the wire, see patch 4/5.
- The offered window right edge tracking can happen concurrently on
multiple subflows, but there is no mutex protection. We need an
additional atomic operation - still patch 4/5
This series additionally bumps a few new MIBs to track all the above
(ensure/observe that the suspected races actually take place).
I could not access again the host where the issue was so
noticeable, still in the current setup the tput changes from
[6-18] Gbps to 19Gbps very stable.
====================
Link: https://lore.kernel.org/r/20220504215408.349318-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Track the exceptional handling of MPTCP-level offered window
with a few more counters for observability.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As per RFC, the offered MPTCP-level window should never shrink.
While we currently track the right edge, we don't enforce the
above constraint on the wire.
Additionally, concurrent xmit on different subflows can end-up in
erroneous right edge update.
Address the above explicitly updating the announced window and
protecting the update with an additional atomic operation (sic)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The MPTCP RFC requires that the MPTCP-level receive window's
right edge never moves backward. Currently the MPTCP code
enforces such constraint while tracking the right edge, but it
does not reflects it on the wire, as MPTCP lacks a suitable hook
to update accordingly the TCP header.
This change modifies the existing mptcp_write_options() hook,
providing the current packet's TCP header to the MPTCP protocol,
so that the next patch could implement the above mentioned
constraint.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Bump a counter for counter when snd_wnd is shared among subflow,
for observability's sake.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As per RFC, mptcp subflows use a "shared" snd_wnd: the effective
window is the maximum among the current values received on all
subflows. Without such feature a data transfer using multiple
subflows could block.
Window sharing is currently implemented in the RX side:
__tcp_select_window uses the mptcp-level receive buffer to compute
the announced window.
That is not enough: the TCP stack will stick to the window size
received on the given subflow; we need to propagate the msk window
value on each subflow at xmit time.
Change the packet scheduler to ignore the subflow level window
and use instead the msk level one
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The bonding entry did not include additional include files that have
been added nor did it reference the documentation. Add these references
for completeness.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Link: https://lore.kernel.org/r/903ed2906b93628b38a2015664a20d2802042863.1651690748.git.jtoppins@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The find_next_netdev_feature() macro gets the "remaining length",
not bit index.
Passing "bit - 1" for the following iteration is wrong as it skips
the adjacent bit. Pass "bit" instead.
Fixes: 3b89ea9c5902 ("net: Fix for_each_netdev_feature on Big endian")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20220504080914.1918-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://gitlab.freedesktop.org/drm/msm into drm-fixes
single lockdep fix.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGtkzqzxDLp82OaKXVrWd7nWZtkxKsuOK1wOGCDz7qF-dA@mail.gmail.com
|
|
There is export_uuid() function which exports uuid_t to the u8 array.
Use it instead of open coding variant.
This allows to hide the uuid_t internals.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220504091407.70661-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|