Age | Commit message (Collapse) | Author | Files | Lines |
|
mlx5e_close_xdpsq does the cleanup: it calls mlx5e_free_xdpsq_descs to
free the outstanding descriptors, which relies on
mlx5e_page_release_dynamic and page_pool_release_page. However,
page_pool_destroy is already called by this point, because
mlx5e_close_rq runs before mlx5e_close_xdpsq.
This commit fixes the use-after-free by swapping mlx5e_close_xdpsq and
mlx5e_close_rq.
The commit cited below started calling page_pool_destroy directly from
the driver. Previously, the page pool was destroyed under a call_rcu
from xdp_rxq_info_unreg_mem_model, which would defer the deallocation
until after the XDPSQ is cleaned up.
Fixes: 1da4bbeffe41 ("net: core: page_pool: add user refcnt and reintroduce page_pool_destroy")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Ageing time is not converted from clock_t to jiffies which results
incorrect ageing timeout calculation in workqueue update task. Fix it by
applying clock_t_to_jiffies() to provided value.
Fixes: c636a0f0f3f0 ("net/mlx5: Bridge, dynamic entry ageing")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
It could be local and remote are on the same machine and the route
result will be a local route which will result in creating encap id
with src/dst mac address of 0.
Fixes: a54e20b4fcae ("net/mlx5e: Add basic TC tunnel set action for SRIOV offloads")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
While processing encapsulated packet on RX, one of the fields that is
checked is the inner packet length. If the length as specified in the header
doesn't match the actual inner packet length, the packet is invalid
and should be dropped. However, such packet caused the NIC to hang.
This patch turns on a 'fail_on_error' HW bit which allows HW to drop
such an invalid packet while processing RX packet and trying to decap it.
Fixes: ad17dc8cf910 ("net/mlx5: DR, Move STEv0 action apply logic")
Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Clean SF resources if mlx5 eth failed to initialize.
Fixes: 1958fc2f0712 ("net/mlx5: SF, Add auxiliary device driver")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Currently the call to _base_static_config_pages() is assigning the error
return to variable 'rc' but checking the error return in error 'r'. Fix
this by assigning the error return to variable 'r' instead of 'rc'.
Link: https://lore.kernel.org/r/20210804134940.114011-1-colin.king@canonical.com
Fixes: 19a622c39a9d ("scsi: mpt3sas: Handle firmware faults during first half of IOC init")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Addresses-Coverity: ("Unused value")
|
|
Commit 08f76547f08d ("scsi: storvsc: Update error logging") added more
robust logging of errors, particularly those reported as Hyper-V
errors. But this change produces extra logging noise in that
TEST_UNIT_READY may report errors during the normal course of detecting
device adds and removes.
Fix this by logging TEST_UNIT_READY errors as warnings, so that log lines
are produced only if the storvsc log level is changed to WARN level on the
kernel boot line.
Link: https://lore.kernel.org/r/1628269970-87876-1-git-send-email-mikelley@microsoft.com
Fixes: 08f76547f08d ("scsi: storvsc: Update error logging")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
The phba->poll_list is traversed in case of an error in
lpfc_sli4_hba_setup(), so it must be initialized earlier in case the error
path is taken.
[ 490.030738] lpfc 0000:65:00.0: 0:1413 Failed to init iocb list.
[ 490.036661] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 490.044485] PGD 0 P4D 0
[ 490.047027] Oops: 0000 [#1] SMP PTI
[ 490.050518] CPU: 0 PID: 7 Comm: kworker/0:1 Kdump: loaded Tainted: G I --------- - - 4.18.
[ 490.060511] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 1.4.8 05/22/2018
[ 490.067994] Workqueue: events work_for_cpu_fn
[ 490.072371] RIP: 0010:lpfc_sli4_cleanup_poll_list+0x20/0xb0 [lpfc]
[ 490.078546] Code: cf e9 04 f7 fe ff 0f 1f 40 00 0f 1f 44 00 00 41 57 49 89 ff 41 56 41 55 41 54 4d 8d a79
[ 490.097291] RSP: 0018:ffffbd1a463dbcc8 EFLAGS: 00010246
[ 490.102518] RAX: 0000000000008200 RBX: ffff945cdb8c0000 RCX: 0000000000000000
[ 490.109649] RDX: 0000000000018200 RSI: ffff9468d0e16818 RDI: 0000000000000000
[ 490.116783] RBP: ffff945cdb8c1740 R08: 00000000000015c5 R09: 0000000000000042
[ 490.123915] R10: 0000000000000000 R11: ffffbd1a463dbab0 R12: ffff945cdb8c25c0
[ 490.131049] R13: 00000000fffffff4 R14: 0000000000001800 R15: ffff945cdb8c0000
[ 490.138182] FS: 0000000000000000(0000) GS:ffff9468d0e00000(0000) knlGS:0000000000000000
[ 490.146267] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 490.152013] CR2: 0000000000000000 CR3: 000000042ca10002 CR4: 00000000007706f0
[ 490.159146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 490.166277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 490.173409] PKRU: 55555554
[ 490.176123] Call Trace:
[ 490.178598] lpfc_sli4_queue_destroy+0x7f/0x3c0 [lpfc]
[ 490.183745] lpfc_sli4_hba_setup+0x1bc7/0x23e0 [lpfc]
[ 490.188797] ? kernfs_activate+0x63/0x80
[ 490.192721] ? kernfs_add_one+0xe7/0x130
[ 490.196647] ? __kernfs_create_file+0x80/0xb0
[ 490.201020] ? lpfc_pci_probe_one_s4.isra.48+0x46f/0x9e0 [lpfc]
[ 490.206944] lpfc_pci_probe_one_s4.isra.48+0x46f/0x9e0 [lpfc]
[ 490.212697] lpfc_pci_probe_one+0x179/0xb70 [lpfc]
[ 490.217492] local_pci_probe+0x41/0x90
[ 490.221246] work_for_cpu_fn+0x16/0x20
[ 490.224994] process_one_work+0x1a7/0x360
[ 490.229009] ? create_worker+0x1a0/0x1a0
[ 490.232933] worker_thread+0x1cf/0x390
[ 490.236687] ? create_worker+0x1a0/0x1a0
[ 490.240612] kthread+0x116/0x130
[ 490.243846] ? kthread_flush_work_fn+0x10/0x10
[ 490.248293] ret_from_fork+0x35/0x40
[ 490.251869] Modules linked in: lpfc(+) xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4i
[ 490.332609] CR2: 0000000000000000
Link: https://lore.kernel.org/r/20210809150947.18104-1-emilne@redhat.com
Fixes: 93a4d6f40198 ("scsi: lpfc: Add registration for CPU Offline/Online events")
Cc: stable@vger.kernel.org
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
blkcg->lock depends on q->queue_lock which may depend on another driver
lock required in irq context, one example is dm-thin:
Chain exists of:
&pool->lock#3 --> &q->queue_lock --> &blkcg->lock
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&blkcg->lock);
local_irq_disable();
lock(&pool->lock#3);
lock(&q->queue_lock);
<Interrupt>
lock(&pool->lock#3);
Fix the issue by using spin_lock_irq(&blkcg->lock) in ioc_weight_write().
Cc: Tejun Heo <tj@kernel.org>
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Link: https://lore.kernel.org/linux-block/CA+QYu4rzz6079ighEanS3Qq_Dmnczcf45ZoJoHKVLVATTo1e4Q@mail.gmail.com/T/#u
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210803070608.1766400-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__io_rsrc_put_work() might need ->uring_lock, so nobody should wait for
rsrc nodes holding the mutex. However, that's exactly what
io_ring_ctx_free() does with io_wait_rsrc_data().
Split it into rsrc wait + dealloc, and move the first one out of the
lock.
Cc: stable@vger.kernel.org
Fixes: b60c8dce33895 ("io_uring: preparation for rsrc tagging")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0130c5c2693468173ec1afab714e0885d2c9c363.1628559783.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags
from the regression suite:
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G OE
------------------------------------------------------
kworker/2:4/2684 is trying to acquire lock:
ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0
but task is already holding lock:
ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}:
__flush_work+0x31b/0x490
io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0
__do_sys_io_uring_register+0x45b/0x1060
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #0 (&ctx->uring_lock){+.+.}-{3:3}:
__lock_acquire+0x119a/0x1e10
lock_acquire+0xc8/0x2f0
__mutex_lock+0x86/0x740
io_rsrc_put_work+0x13d/0x1a0
process_one_work+0x236/0x530
worker_thread+0x52/0x3b0
kthread+0x135/0x160
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&(&ctx->rsrc_put_work)->work));
lock(&ctx->uring_lock);
lock((work_completion)(&(&ctx->rsrc_put_work)->work));
lock(&ctx->uring_lock);
*** DEADLOCK ***
2 locks held by kworker/2:4/2684:
#0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
#1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530
stack backtrace:
CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G OE 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5
Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015
Workqueue: events io_rsrc_put_work
Call Trace:
dump_stack_lvl+0x6a/0x9a
check_noncircular+0xfe/0x110
__lock_acquire+0x119a/0x1e10
lock_acquire+0xc8/0x2f0
? io_rsrc_put_work+0x13d/0x1a0
__mutex_lock+0x86/0x740
? io_rsrc_put_work+0x13d/0x1a0
? io_rsrc_put_work+0x13d/0x1a0
? io_rsrc_put_work+0x13d/0x1a0
? process_one_work+0x1ce/0x530
io_rsrc_put_work+0x13d/0x1a0
process_one_work+0x236/0x530
worker_thread+0x52/0x3b0
? process_one_work+0x530/0x530
kthread+0x135/0x160
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
which is due to holding the ctx->uring_lock when flushing existing
pending work, while the pending work flushing may need to grab the uring
lock if we're using IOPOLL.
Fix this by dropping the uring_lock a bit earlier as part of the flush.
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/404
Tested-by: Ammar Faizi <ammarfaizi2@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There may be cases like:
A B
spin_lock(wqe->lock)
nr_workers is 0
nr_workers++
spin_unlock(wqe->lock)
spin_lock(wqe->lock)
nr_wokers is 1
nr_workers++
spin_unlock(wqe->lock)
create_io_worker()
acct->worker is 1
create_io_worker()
acct->worker is 1
There should be one worker marked IO_WORKER_F_FIXED, but no one is.
Fix this by introduce a new agrument for create_io_worker() to indicate
if it is the first worker.
Fixes: 3d4e4face9c1 ("io-wq: fix no lock protection of acct->nr_worker")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210808135434.68667-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The former patch to add check between nr_workers and max_workers has a
bug, which will cause unconditionally creating io-workers. That's
because the result of the check doesn't affect the call of
create_io_worker(), fix it by bringing in a boolean value for it.
Fixes: 21698274da5b ("io-wq: fix lack of acct->nr_workers < acct->max_workers judgement")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20210808135434.68667-2-haoxu@linux.alibaba.com
[axboe: drop hunk that isn't strictly needed]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Nadav reports running into the below splat on re-enabling softirqs:
WARNING: CPU: 2 PID: 1777 at kernel/softirq.c:364 __local_bh_enable_ip+0xaa/0xe0
Modules linked in:
CPU: 2 PID: 1777 Comm: umem Not tainted 5.13.1+ #161
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
RIP: 0010:__local_bh_enable_ip+0xaa/0xe0
Code: a9 00 ff ff 00 74 38 65 ff 0d a2 21 8c 7a e8 ed 1a 20 00 fb 66 0f 1f 44 00 00 5b 41 5c 5d c3 65 8b 05 e6 2d 8c 7a 85 c0 75 9a <0f> 0b eb 96 e8 2d 1f 20 00 eb a5 4c 89 e7 e8 73 4f 0c 00 eb ae 65
RSP: 0018:ffff88812e58fcc8 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000201 RCX: dffffc0000000000
RDX: 0000000000000007 RSI: 0000000000000201 RDI: ffffffff8898c5ac
RBP: ffff88812e58fcd8 R08: ffffffff8575dbbf R09: ffffed1028ef14f9
R10: ffff88814778a7c3 R11: ffffed1028ef14f8 R12: ffffffff85c9e9ae
R13: ffff88814778a000 R14: ffff88814778a7b0 R15: ffff8881086db890
FS: 00007fbcfee17700(0000) GS:ffff8881e0300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c0402a5008 CR3: 000000011c1ac003 CR4: 00000000003706e0
Call Trace:
_raw_spin_unlock_bh+0x31/0x40
io_rsrc_node_ref_zero+0x13e/0x190
io_dismantle_req+0x215/0x220
io_req_complete_post+0x1b8/0x720
__io_complete_rw.isra.0+0x16b/0x1f0
io_complete_rw+0x10/0x20
where it's clear we end up calling the percpu count release directly
from the completion path, as it's in atomic mode and we drop the last
ref. For file/block IO, this can be from IRQ context already, and the
softirq locking for rsrc isn't enough.
Just make the lock fully IRQ safe, and ensure we correctly safe state
from the release path as we don't know the full context there.
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Tested-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/io-uring/C187C836-E78B-4A31-B24C-D16919ACA093@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"One commit to fix a possible A-A deadlock around u64_stats_sync on
32bit machines caused by updating it without disabling IRQ when it may
be read from IRQ context"
* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync
|
|
Yunsheng Lin says:
====================
add frag page support in page pool
This patchset adds frag page support in page pool and
enable skb's page frag recycling based on page pool in
hns3 drvier.
====================
Link: https://lore.kernel.org/r/1628217982-53533-1-git-send-email-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch adds skb's frag page recycling support based on
the frag page support in page pool.
The performance improves above 10~20% for single thread iperf
TCP flow with IOMMU disabled when iperf server and irq/NAPI
have a different CPU.
The performance improves about 135%(14Gbit to 33Gbit) for single
thread iperf TCP flow when IOMMU is in strict mode and iperf
server shares the same cpu with irq/NAPI.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently page pool only support page recycling when there
is only one user of the page, and the split page reusing
implemented in the most driver can not use the page pool as
bing-pong way of reusing requires the multi user support in
page pool.
Those reusing or recycling has below limitations:
1. page from page pool can only be used be one user in order
for the page recycling to happen.
2. Bing-pong way of reusing in most driver does not support
multi desc using different part of the same page in order
to save memory.
So add multi-users support and frag page recycling in page
pool to overcome the above limitation.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For 32 bit systems with 64 bit dma, dma_addr[1] is used to
store the upper 32 bit dma addr, those system should be rare
those days.
For normal system, the dma_addr[1] in 'struct page' is not
used, so we can reuse dma_addr[1] for storing frag count,
which means how many frags this page might be splited to.
In order to simplify the page frag support in the page pool,
the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
the 32 bit systems with 64 bit dma, and the page frag support
in page pool is disabled for such system.
The newly added page_pool_set_frag_count() is called to reserve
the maximum frag count before any page frag is passed to the
user. The page_pool_atomic_sub_frag_count_return() is called
when user is done with the page frag.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, page->pp is cleared and set everytime the page
is recycled, which is unnecessary.
So only set the page->pp when the page is added to the page
pool and only clear it when the page is released from the
page pool.
This is also a preparation to support allocating frag page
in page pool.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Data beyond the UDP header might not be part of the skb's linear data.
Use skb_copy_bits() instead of direct access to skb->data+X, so that
we read the correct bytes even on a fragmented skb.
Fixes: 4b5f67232d95 ("net: Special handling for IP & MPLS.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/7741c46545c6ef02e70c80a9b32814b22d9616b3.1628264975.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Repair kernel-doc notation in a few places to make it conform to
the expected format.
Fixes the following kernel-doc warnings:
flow.c:296: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Parse vlan tag from vlan header.
flow.c:296: warning: missing initial short description on line:
* Parse vlan tag from vlan header.
flow.c:537: warning: No description found for return value of 'key_extract_l3l4'
flow.c:769: warning: No description found for return value of 'key_extract'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: dev@openvswitch.org
Link: https://lore.kernel.org/r/20210808190834.23362-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Without this there is a warning if source files include psample.h
before skbuff.h or doesn't include it at all.
Fixes: 6ae0a6286171 ("net: Introduce psample, a new genetlink channel for packet sampling")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Link: https://lore.kernel.org/r/20210808065242.1522535-1-roid@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
I'll be leaving Synopsys shortly, but will continue to handle maintenance
for the transition period.
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
|
|
Add a test suite to test XDP bonding implementation over a pair of
veth devices.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210731055738.16820-8-joamaki@gmail.com
|
|
The program type cannot be deduced from 'tx' which causes an invalid
argument error when trying to load xdp_tx.o using the skeleton.
Rename the section name to "xdp" so that libbpf can deduce the type.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210731055738.16820-7-joamaki@gmail.com
|
|
For the XDP bonding slave lookup to work in the NAPI poll context in which
the redudant rcu_read_lock() has been removed we have to follow the same
approach as in 694cea395fde ("bpf: Allow RCU-protected lookups to happen
from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held().
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com
|
|
If the ingress device is bond slave, do not broadcast back through it or
the bond master.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-5-joamaki@gmail.com
|
|
XDP is implemented in the bonding driver by transparently delegating
the XDP program loading, removal and xmit operations to the bonding
slave devices. The overall goal of this work is that XDP programs
can be attached to a bond device *without* any further changes (or
awareness) necessary to the program itself, meaning the same XDP
program can be attached to a native device but also a bonding device.
Semantics of XDP_TX when attached to a bond are equivalent in such
setting to the case when a tc/BPF program would be attached to the
bond, meaning transmitting the packet out of the bond itself using one
of the bond's configured xmit methods to select a slave device (rather
than XDP_TX on the slave itself). Handling of XDP_TX to transmit
using the configured bonding mechanism is therefore implemented by
rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
performance impact this check is guarded by a static key, which is
incremented when a XDP program is loaded onto a bond device. This
approach was chosen to avoid changes to drivers implementing XDP. If
the slave device does not match the receive device, then XDP_REDIRECT
is transparently used to perform the redirection in order to have
the network driver release the packet from its RX ring. The bonding
driver hashing functions have been refactored to allow reuse with
xdp_buff's to avoid code duplication.
The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters. An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.
Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally the bonding round robin algorithm was modified
to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
of cache misses were caused by the shared rr_tx_counter (see patch
2/3). The statistics were collected using "sar -n dev -u 1 10". On top
of that, for ice, further work is in progress on improving the XDP_TX
numbers [4].
-----------------------| CPU |--| rxpck/s |--| txpck/s |----
without patch (1 dev):
XDP_DROP: 3.15% 48.6Mpps
XDP_TX: 3.12% 18.3Mpps 18.3Mpps
XDP_DROP (RSS): 9.47% 116.5Mpps
XDP_TX (RSS): 9.67% 25.3Mpps 24.2Mpps
-----------------------
with patch, bond (1 dev):
XDP_DROP: 3.14% 46.7Mpps
XDP_TX: 3.15% 13.9Mpps 13.9Mpps
XDP_DROP (RSS): 10.33% 117.2Mpps
XDP_TX (RSS): 10.64% 25.1Mpps 24.0Mpps
-----------------------
with patch, bond (2 devs):
XDP_DROP: 6.27% 92.7Mpps
XDP_TX: 6.26% 17.6Mpps 17.5Mpps
XDP_DROP (RSS): 11.38% 117.2Mpps
XDP_TX (RSS): 14.30% 28.7Mpps 27.4Mpps
--------------------------------------------------------------
RSS: Receive Side Scaling, e.g. the packets were sent to a range of
destination IPs.
[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
[4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
|
|
This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
into XDP_REDIRECT after BPF program run when the ingress device
is a bond slave.
The dev_xdp_prog_count is exposed so that slave devices can be checked
for loaded XDP programs in order to avoid the situation where both
bond master and slave have programs loaded according to xdp_state.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com
|
|
In preparation for adding XDP support to the bonding driver
refactor the packet hashing functions to be able to work with
any linear data buffer without an skb.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-2-joamaki@gmail.com
|
|
commit f9c82a4ea89c3 ("Increase size of ucounts to atomic_long_t")
changed the data type of ucounts/ucounts_max to long, but missed to
adjust a few other places. This is noticeable on big endian platforms
from user space because the /proc/sys/user/max_*_names files all
contain 0.
v4 - Made the min and max constants long so the sysctl values
are actually settable on little endian machines.
-- EWB
Fixes: f9c82a4ea89c ("Increase size of ucounts to atomic_long_t")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Acked-by: Alexey Gladkov <legion@kernel.org>
v1: https://lkml.kernel.org/r/20210721115800.910778-1-svens@linux.ibm.com
v2: https://lkml.kernel.org/r/20210721125233.1041429-1-svens@linux.ibm.com
v3: https://lkml.kernel.org/r/20210730062854.3601635-1-svens@linux.ibm.com
Link: https://lkml.kernel.org/r/8735rijqlv.fsf_-_@disp2133
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
Rename LOCKDOWN_BPF_READ into LOCKDOWN_BPF_READ_KERNEL so we have naming
more consistent with a LOCKDOWN_BPF_WRITE_USER option that we are adding.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
|
|
iavf driver should set RSS LUT and key unconditionally in reset
path. Currently, the driver does not do that. This patch fixes
this issue.
Fixes: 2c86ac3c7079 ("i40evf: create a generic config RSS function")
Signed-off-by: Md Fahad Iqbal Polash <md.fahad.iqbal.polash@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
In some circumstances, such as with bridging, it's possible that the
stack will add the device's own MAC address to its unicast address list.
If, later, the stack deletes this address, the driver will receive a
request to remove this address.
The driver stores its current MAC address as part of the VSI MAC filter
list instead of separately. So, this causes a problem when the device's
MAC address is deleted unexpectedly, which results in traffic failure in
some cases.
The following configuration steps will reproduce the previously
mentioned problem:
> ip link set eth0 up
> ip link add dev br0 type bridge
> ip link set br0 up
> ip addr flush dev eth0
> ip link set eth0 master br0
> echo 1 > /sys/class/net/br0/bridge/vlan_filtering
> modprobe -r veth
> modprobe -r bridge
> ip addr add 192.168.1.100/24 dev eth0
The following ping command fails due to the netdev->dev_addr being
deleted when removing the bridge module.
> ping <link partner>
Fix this by making sure to not delete the netdev->dev_addr during MAC
address sync. After fixing this issue it was noticed that the
netdev_warn() in .set_mac was overly verbose, so make it at
netdev_dbg().
Also, there is a possibility of a race condition between .set_mac and
.set_rx_mode. Fix this by calling netif_addr_lock_bh() and
netif_addr_unlock_bh() on the device's netdev when the netdev->dev_addr
is going to be updated in .set_mac.
Fixes: e94d44786693 ("ice: Implement filter sync, NDO operations and bump version")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Liang Li <liali@redhat.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
When VFs are setup and torn down in quick succession, it is possible
that a VF is torn down by the PF while the VF's virtchnl requests are
still in the PF's mailbox ring. Processing the VF's virtchnl request
when the VF itself doesn't exist results in undefined behavior. Fix
this by adding a check to stop processing virtchnl requests when VF
teardown is in progress.
Fixes: ddf30f7ff840 ("ice: Add handler to configure SR-IOV")
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The userspace utility "driverctl" can be used to change/override the
system's default driver choices. This is useful in some situations
(buggy driver, old driver missing a device ID, trying a workaround,
etc.) where the user needs to load a different driver.
However, this is also prone to user error, where a driver is mapped
to a device it's not designed to drive. For example, if the ice driver
is mapped to driver iavf devices, the ice driver crashes.
Add a check to return an error if the ice driver is being used to
probe a virtual function.
Fixes: 837f08fdecbe ("ice: Add basic driver framework for Intel(R) E800 Series")
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Instead of appending new text attribute data at the offset specified by the
write() system call, only pass the newly written data to the .store()
callback.
Reported-by: Bodo Stroesser <bostroesser@gmail.com>
Tested-by: Bodo Stroesser <bostroesser@gmail.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Fix a typo when checking existence of port_type_set function pointer.
Fixes: 82564f6c706a ("devlink: Simplify devlink port API calls")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
make.cross ARCH=m68k m5272c3_defconfig
make.cross ARCH=m68k
drivers/net/ethernet/freescale/fec_main.c: In function 'fec_enet_eee_mode_set':
>> drivers/net/ethernet/freescale/fec_main.c:2758:33: error: 'FEC_LPI_SLEEP' undeclared (first use in this function); did you mean 'FEC_ECR_SLEEP'?
2758 | writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP);
| ^~~~~~~~~~~~~
arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel'
25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b))
| ^~~~
drivers/net/ethernet/freescale/fec_main.c:2758:2: note: in expansion of macro 'writel'
2758 | writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP);
| ^~~~~~
drivers/net/ethernet/freescale/fec_main.c:2758:33: note: each undeclared identifier is reported only once for each function it appears in
2758 | writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP);
| ^~~~~~~~~~~~~
arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel'
25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b))
| ^~~~
drivers/net/ethernet/freescale/fec_main.c:2758:2: note: in expansion of macro 'writel'
2758 | writel(sleep_cycle, fep->hwp + FEC_LPI_SLEEP);
| ^~~~~~
>> drivers/net/ethernet/freescale/fec_main.c:2759:32: error: 'FEC_LPI_WAKE' undeclared (first use in this function)
2759 | writel(wake_cycle, fep->hwp + FEC_LPI_WAKE);
| ^~~~~~~~~~~~
arch/m68k/include/asm/io_no.h:25:66: note: in definition of macro '__raw_writel'
25 | #define __raw_writel(b, addr) (void)((*(__force volatile u32 *) (addr)) = (b))
| ^~~~
drivers/net/ethernet/freescale/fec_main.c:2759:2: note: in expansion of macro 'writel'
2759 | writel(wake_cycle, fep->hwp + FEC_LPI_WAKE);
| ^~~~~~
This patch adds register definition for M5272 platform to pass build.
Fixes: b82f8c3f1409 ("net: fec: add eee mode tx lpi support")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When mirror/redirect a skb to a different port, the ct info should be reset
for reclassification. Or the pkts will match unexpected rules. For example,
with following topology and commands:
-----------
|
veth0 -+-------
|
veth1 -+-------
|
------------
tc qdisc add dev veth0 clsact
# The same with "action mirred egress mirror dev veth1" or "action mirred ingress redirect dev veth1"
tc filter add dev veth0 egress chain 1 protocol ip flower ct_state +trk action mirred ingress mirror dev veth1
tc filter add dev veth0 egress chain 0 protocol ip flower ct_state -inv action ct commit action goto chain 1
tc qdisc add dev veth1 clsact
tc filter add dev veth1 ingress chain 0 protocol ip flower ct_state +trk action drop
ping <remove ip via veth0> &
tc -s filter show dev veth1 ingress
With command 'tc -s filter show', we can find the pkts were dropped on
veth1.
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit a3fe3d01bd0d7 ("net/smc: introduce sg-logic for RMBs") introduced
a restriction for RMB allocations as used by SMC-R. However, SMC-D does
not use scatter-gather lists to back its DMBs, yet it was limited by
this restriction, still.
This patch exempts SMC, but limits allocations to the maximum RMB/DMB
size respectively.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Guvenc Gulce says:
====================
net/smc: fixes 2021-08-09
please apply the following patch series for smc to netdev's net tree.
One patch fixes invalid connection counting for links and the other
one fixes an access to an already cleared link.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
SMC clients may be assigned to a different link after the initial
connection between two peers was established. In such a case,
the connection counter was not correctly set.
Update the connection counter correctly when a smc client connection
is assigned to a different smc link.
Fixes: 07d51580ff65 ("net/smc: Add connection counters for links")
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Tested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There can be a race between the waiters for a tx work request buffer
and the link down processing that finally clears the link. Although
all waiters are woken up before the link is cleared there might be
waiters which did not yet get back control and are still waiting.
This results in an access to a cleared wait queue head.
Fix this by introducing atomic reference counting around the wait calls,
and wait with the link clear processing until all waiters have finished.
Move the work request layer related calls into smc_wr.c and set the
link state to INACTIVE before calling smcr_link_clear() in
smc_llc_srv_add_link().
Fixes: 15e1b99aadfb ("net/smc: no WR buffer wait for terminating link group")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All kernel devlink implementations call to devlink_alloc() during
initialization routine for specific device which is used later as
a parent device for devlink_register().
Such late device assignment causes to the situation which requires us to
call to device_register() before setting other parameters, but that call
opens devlink to the world and makes accessible for the netlink users.
Any attempt to move devlink_register() to be the last call generates the
following error due to access to the devlink->dev pointer.
[ 8.758862] devlink_nl_param_fill+0x2e8/0xe50
[ 8.760305] devlink_param_notify+0x6d/0x180
[ 8.760435] __devlink_params_register+0x2f1/0x670
[ 8.760558] devlink_params_register+0x1e/0x20
The simple change of API to set devlink device in the devlink_alloc()
instead of devlink_register() fixes all this above and ensures that
prior to call to devlink_register() everything already set.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The driver allocates the spinlock but not initialize it.
Use spin_lock_init() on it to initialize it correctly.
Fixes: aa730a9905b7 ("net: wwan: Add MHI MBIM network driver")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The CPSW switchdev driver inherited fix from commit 9421c9015047 ("net:
ethernet: ti: cpsw: fix min eth packet size") which changes min TX packet
size to 64bytes (VLAN_ETH_ZLEN, excluding ETH_FCS). It was done to fix HW
packed drop issue when packets are sent from Host to the port with PVID and
un-tagging enabled. Unfortunately this breaks some other non-switch
specific use-cases, like:
- [1] CPSW port as DSA CPU port with DSA-tag applied at the end of the
packet
- [2] Some industrial protocols, which expects min TX packet size 60Bytes
(excluding FCS).
Fix it by configuring min TX packet size depending on driver mode
- 60Bytes (ETH_ZLEN) for multi mac (dual-mac) mode
- 64Bytes (VLAN_ETH_ZLEN) for switch mode
and update it during driver mode change and annotate with
READ_ONCE()/WRITE_ONCE() as it can be read by napi while writing.
[1] https://lore.kernel.org/netdev/20210531124051.GA15218@cephalopod/
[2] https://e2e.ti.com/support/arm/sitara_arm/f/791/t/701669
Cc: stable@vger.kernel.org
Fixes: ed3525eda4c4 ("net: ethernet: ti: introduce cpsw switchdev based driver part 1 - dual-emac")
Reported-by: Ben Hutchings <ben.hutchings@essensium.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Karsten Graul says:
====================
net/iucv: updates 2021-08-09
Please apply the following iucv patches to netdev's net-next tree.
Remove the usage of register asm statements and replace deprecated
CPU-hotplug functions with the current version.
Use use consume_skb() instead of kfree_skb() to avoid flooding
dropwatch with false-positives, and 2 patches with cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: Julian Wiedmann <jwi@linux.ibm.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-s390@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|