Age | Commit message (Collapse) | Author | Files | Lines |
|
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.
Programs of this type will be attached to cgroups for network filtering
and accounting.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Also move the inline capablities enum to a shared header vport.h
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some HWs need the VF driver to put part of the packet headers on the
TX descriptor so the e-switch can do proper matching and steering.
The supported modes: none, link, network, transport.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some drivers would need to check few internal matters for
that. To be used in downstream mlx5 commit.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Broadcom's Wirespeed feature allows us to configure how auto-negotiation
should behave with fewer working pairs of wires on a cable. Add support
code for retrieving and setting such downshift counters using the
recently added ethtool downshift tunables.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make it possible to generate trace events for mdio read and write accesses.
Signed-off-by: Uwe Kleine-König <uwe@kleine-koenig.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.
That driver has a change_mtu method explicitly for sending
a message to the hardware. If that fails it returns an
error.
Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.
However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking fixes from David Miller:
1) Clear congestion control state when changing algorithms on an
existing socket, from Florian Westphal.
2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
from Jia Jie Ho.
3) Fix PTP handling in stammc driver for GMAC4, from Giuseppe
CAVALLARO.
4) Fix udplite multicast delivery handling, it ignores the udp_table
parameter passed into the lookups, from Pablo Neira Ayuso.
5) Synchronize the space estimated by rtnl_vfinfo_size and the space
actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.
6) Fix memory leak in fib_info when splitting nodes, from Alexander
Duyck.
7) If a driver does a napi_hash_del() explicitily and not via
netif_napi_del(), it must perform RCU synchronization as needed. Fix
this in virtio-net and bnxt drivers, from Eric Dumazet.
8) Likewise, it is not necessary to invoke napi_hash_del() is we are
also doing neif_napi_del() in the same code path. Remove such calls
from be2net and cxgb4 drivers, also from Eric Dumazet.
9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
from WANG Cong.
10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.
11) We cannot cache routes in ip6_tunnel when using inherited traffic
classes, from Paolo Abeni.
12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.
13) Splice operations cannot use freezable blocking calls in AF_UNIX,
from WANG Cong.
14) Link dump filtering by master device and kind support added an error
in loop index updates during the dump if we actually do filter, fix
from Zhang Shengju.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
tcp: zero ca_priv area when switching cc algorithms
net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
tipc: eliminate obsolete socket locking policy description
rtnl: fix the loop index update error in rtnl_dump_ifinfo()
l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
net: macb: add check for dma mapping error in start_xmit()
rtnetlink: fix FDB size computation
netns: fix get_net_ns_by_fd(int pid) typo
af_unix: conditionally use freezable blocking calls in read
net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
net: ethernet: ti: cpsw: add missing sanity check
net: ethernet: ti: cpsw: fix secondary-emac probe error path
net: ethernet: ti: cpsw: fix of_node and phydev leaks
net: ethernet: ti: cpsw: fix deferred probe
net: ethernet: ti: cpsw: fix mdio device reference leak
net: ethernet: ti: cpsw: fix bad register access in probe error path
net: sky2: Fix shutdown crash
cfg80211: limit scan results cache size
net sched filters: pass netlink message flags in event notification
...
|
|
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
which un-does reno halving behaviour.
It seems more appropriate to let congctl algorithms pair .ssthresh
and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
up for all congestion algorithms that used to rely on the fallback.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds basic support for MLDv2 queries, the default is MLDv1
as before. A new multicast option - multicast_mld_version, adds the
ability to change it between 1 and 2 via netlink and sysfs.
The MLD option is disabled if CONFIG_IPV6 is disabled.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds basic support for IGMPv3 queries, the default is IGMPv2
as before. A new multicast option - multicast_igmp_version, adds the
ability to change it between 2 and 3 via netlink and sysfs. The option
struct member is in a 4 byte hole in net_bridge.
There also a few minor style adjustments in br_multicast_new_group and
br_multicast_add_group.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Helpers like bpf_prog_add(), bpf_prog_inc(), bpf_map_inc() can fail
with an error, so make sure the caller properly checks their return
value and not just ignores it, which could worst-case lead to use
after free.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
1) cast to "int" is unnecessary:
u8 will be promoted to int before decrementing,
small positive numbers fit into "int", so their values won't be changed
during promotion.
Once everything is int including loop counters, signedness doesn't
matter: 32-bit operations will stay 32-bit operations.
But! Someone tried to make this loop smart by making everything of
the same type apparently in an attempt to optimise it.
Do the optimization, just differently.
Do the cast where it matters. :^)
2) frag size is unsigned entity and sum of fragments sizes is also
unsigned.
Make everything unsigned, leave no MOVSX instruction behind.
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
function old new delta
skb_cow_data 835 834 -1
ip_do_fragment 2549 2548 -1
ip6_fragment 3130 3128 -2
Total: Before=154865032, After=154865028, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
->nla_len is unsigned entity (it's length after all) and u16,
thus it can't overflow when being aligned into int/unsigned int.
(nlmsg_next has the same code, but I didn't yet convince myself
it is correct to do so).
There is pointer arithmetic in this function and offset being
unsigned is better:
add/remove: 0/0 grow/shrink: 1/64 up/down: 5/-309 (-304)
function old new delta
nl80211_set_wiphy 1444 1449 +5
team_nl_cmd_options_set 997 995 -2
tcf_em_tree_validate 872 870 -2
switchdev_port_bridge_setlink 352 350 -2
switchdev_port_br_afspec 312 310 -2
rtm_to_fib_config 428 426 -2
qla4xxx_sysfs_ddb_set_param 2193 2191 -2
qla4xxx_iface_set_param 4470 4468 -2
ovs_nla_free_flow_actions 152 150 -2
output_userspace 518 516 -2
...
nl80211_set_reg 654 649 -5
validate_scan_freqs 148 142 -6
validate_linkmsg 288 282 -6
nl80211_parse_connkeys 489 483 -6
nlattr_set 231 224 -7
nf_tables_delsetelem 267 260 -7
do_setlink 3416 3408 -8
netlbl_cipsov4_add_std 1672 1659 -13
nl80211_parse_sched_scan 2902 2888 -14
nl80211_trigger_scan 1738 1720 -18
do_execute_actions 2821 2738 -83
Total: Before=154865355, After=154865051, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull KVM fixes from Radim Krčmář:
"ARM:
- Fix handling of the 32bit cycle counter
- Fix cycle counter filtering
x86:
- Fix a race leading to double unregistering of user notifiers
- Amend oversight in kvm_arch_set_irq that turned Hyper-V code dead
- Use SRCU around kvm_lapic_set_vapic_addr
- Avoid recursive flushing of asynchronous page faults
- Do not rely on deferred update in KVM_GET_CLOCK, which fixes #GP
- Let userspace know that KVM_GET_CLOCK is useful with master clock;
4.9 changed the return value to better match the guest clock, but
didn't provide means to let guests take advantage of it"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: x86: merge kvm_arch_set_irq and kvm_arch_set_irq_inatomic
KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
KVM: async_pf: avoid recursive flushing of work items
kvm: kvmclock: let KVM_GET_CLOCK return whether the master clock is in use
KVM: Disable irq while unregistering user notifier
KVM: x86: do not go through vcpu in __get_kvmclock_ns
KVM: arm64: Fix the issues when guest PMCCFILTR is configured
arm64: KVM: pmu: Fix AArch32 cycle counter access
|
|
Userspace can read the exact value of kvmclock by reading the TSC
and fetching the timekeeping parameters out of guest memory. This
however is brittle and not necessary anymore with KVM 4.11. Provide
a mechanism that lets userspace know if the new KVM_GET_CLOCK
semantics are in effect, and---since we are at it---if the clock
is stable across all VCPUs.
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
virtio_net_hdr_from_skb() clears the memory for the header, so there
is no point for the callers to do the same.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix incorrent comment after the final #endif.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"They fix an ACPI thermal management regression introduced by a recent
FADT handling cleanup, an ACPI tools build issue introduced by a
recent ACPICA commit and a PCC mailbox initialization bug causing
lockdep to complain loudly.
Specifics:
- Revert a recent ACPICA cleanup that attempted to get rid of all
FADT version 2 legacy, but broke ACPI thermal management on at
least one system (Rafael Wysocki).
- Fix cross-compiled builds of ACPI tools that stopped working after
a recent cleanup related to the handling of header files in ACPICA
(Lv Zheng).
- Fix a locking issue in the PCC channel initialization code that
invokes devm_request_irq() under a spinlock (among other things)
and causes lockdep to complain (Hoan Tran)"
* tag 'acpi-4.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
tools/power/acpi: Remove direct kernel source include reference
mailbox: PCC: Fix lockdep warning when request PCC channel
Revert "ACPICA: FADT support cleanup"
|
|
Pull nfsd bugfix from Bruce Fields:
"Just one fix for an NFS/RDMA crash"
* tag 'nfsd-4.9-2' of git://linux-nfs.org/~bfields/linux:
sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports
|
|
* acpica-fixes:
Revert "ACPICA: FADT support cleanup"
* acpi-cppc-fixes:
mailbox: PCC: Fix lockdep warning when request PCC channel
* acpi-tools-fixes:
tools/power/acpi: Remove direct kernel source include reference
|
|
The argument to get_net_ns_by_fd() is a /proc/$PID/ns/net file
descriptor not a pid. Fix the typo.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Rami Rosen <roszenrami@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In preparation for allowing CONFIG_MVNETA_BM to build with COMPILE_TEST,
provide an inline stub for mvebu_mbus_get_dram_win_info().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For operation in cabling environments that are incompatible with
1000BASE-T, PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BASE-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BASE-T. This feature is useful in setting up in
networks using older cable installations that include only pairs A and B,
and not pairs C and D.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adding get_tunable/set_tunable function pointer to the phy_driver
structure, and uses these function pointers to implement the
ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE ioctls.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Defines a generic API to get/set phy tunables. The API is using the
existing ethtool_tunable/tunable_type_id types which is already being used
for mac level tunables.
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Allan W. Nielsen <allan.nielsen@microsemi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the needed infrastructure for future use of MPCNT register.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add driver_version capability bit is enabled, and set driver
version command in mlx5_ifc firmware header. The only purpose
of this command is to store a driver version/OS string in FW
to be reported and displayed in various management systems,
such as IPMI/BMC.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For each asynchronous port module event:
1. print with ratelimit to the dmesg log
2. increment the corresponding event counter
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add hardware structures and constants definitions needed for module
events support.
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add more cache command size sets and more entries for each set based on
the current commands set different sizes and commands frequency.
Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
UDP busy polling is restricted to connected UDP sockets.
This is because sk_busy_loop() only takes care of one NAPI context.
There are cases where it could be extended.
1) Some hosts receive traffic on a single NIC, with one RX queue.
2) Some applications use SO_REUSEPORT and associated BPF filter
to split the incoming traffic on one UDP socket per RX
queue/thread/cpu
3) Some UDP sockets are used to send/receive traffic for one flow, but
they do not bother with connect()
This patch records the napi_id of first received skb, giving more
reach to busy polling.
Tested:
lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; done
Before patch :
27867 28870 37324 41060 41215
36764 36838 44455 41282 43843
After patch :
73920 73213 70147 74845 71697
68315 68028 75219 70082 73707
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Prior to 3.15, there was a race between zap_pte_range() and
page_mkclean() where writes to a page could be lost. Dave Hansen
discovered by inspection that there is a similar race between
move_ptes() and page_mkclean().
We've been able to reproduce the issue by enlarging the race window with
a msleep(), but have not been able to hit it without modifying the code.
So, we think it's a real issue, but is difficult or impossible to hit in
practice.
The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
this patch is to fix the race between page_mkclean() and mremap().
Here is one possible way to hit the race: suppose a process mmapped a
file with READ | WRITE and SHARED, it has two threads and they are bound
to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
1 did a write to addr X so that CPU1 now has a writable TLB for addr X
on it. Thread 2 starts mremaping from addr X to Y while thread 1
cleaned the page and then did another write to the old addr X again.
The 2nd write from thread 1 could succeed but the value will get lost.
thread 1 thread 2
(bound to CPU1) (bound to CPU2)
1: write 1 to addr X to get a
writeable TLB on this CPU
2: mremap starts
3: move_ptes emptied PTE for addr X
and setup new PTE for addr Y and
then dropped PTL for X and Y
4: page laundering for N by doing
fadvise FADV_DONTNEED. When done,
pageframe N is deemed clean.
5: *write 2 to addr X
6: tlb flush for addr X
7: munmap (Y, pagesize) to make the
page unmapped
8: fadvise with FADV_DONTNEED again
to kick the page off the pagecache
9: pread the page from file to verify
the value. If 1 is there, it means
we have lost the written 2.
*the write may or may not cause segmentation fault, it depends on
if the TLB is still on the CPU.
Please note that this is only one specific way of how the race could
occur, it didn't mean that the race could only occur in exact the above
config, e.g. more than 2 threads could be involved and fadvise() could
be done in another thread, etc.
For anonymous pages, they could race between mremap() and page reclaim:
THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
PMD gets unmapped/splitted/pagedout before the flush tlb happened for
the old huge PMD in move_page_tables() and we could still write data to
it. The normal anonymous page has similar situation.
To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
if any, did the flush before dropping the PTL. If we did the flush for
every move_ptes()/move_huge_pmd() call then we do not need to do the
flush in move_pages_tables() for the whole range. But if we didn't, we
still need to do the whole range flush.
Alternatively, we can track which part of the range is flushed in
move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
range in move_page_tables(). But that would require multiple tlb
flushes for the different sub-ranges and should be less efficient than
the single whole range flush.
KBuild test on my Sandybridge desktop doesn't show any noticeable change.
v4.9-rc4:
real 5m14.048s
user 32m19.800s
sys 4m50.320s
With this commit:
real 5m13.888s
user 32m19.330s
sys 4m51.200s
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now sctp transport rhashtable uses hash(lport, dport, daddr) as the key
to hash a node to one chain. If in one host thousands of assocs connect
to one server with the same lport and different laddrs (although it's
not a normal case), all the transports would be hashed into the same
chain.
It may cause to keep returning -EBUSY when inserting a new node, as the
chain is too long and sctp inserts a transport node in a loop, which
could even lead to system hangs there.
The new rhlist interface works for this case that there are many nodes
with the same key in one chain. It puts them into a list then makes this
list be as a node of the chain.
This patch is to replace rhashtable_ interface with rhltable_ interface.
Since a chain would not be too long and it would not return -EBUSY with
this fix when inserting a node, the reinsert loop is also removed here.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Callers of netpoll_poll_lock() own NAPI_STATE_SCHED
Callers of netpoll_poll_unlock() have BH blocked between
the NAPI_STATE_SCHED being cleared and poll_lock is released.
We can avoid the spinlock which has no contention, and use cmpxchg()
on poll_owner which we need to set anyway.
This removes a possible lockdep violation after the cited commit,
since sk_busy_loop() re-enables BH before calling busy_poll_stop()
Fixes: 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch changes the lwtunnel_headroom() function which is called
in ipv4_mtu() and ip6_mtu(), to also return the correct headroom
value when the lwtunnel state is OUTPUT_REDIRECT.
This patch enables e.g. SR-IPv6 encapsulations to work without
manually setting the route mtu.
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Avoid breaking cross-compiled ACPI tools builds by rearranging the
handling of kernel header files.
This patch also contains OUTPUT/srctree cleanups in order to make above fix
working for various build environments.
Fixes: e323c02dee59 (ACPICA: MSVC9: Fix <sys/stat.h> inclusion order issue)
Reported-and-tested-by: Yisheng Xie <xieyisheng1@huawei.com>
Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
NAPI drivers use napi_complete_done() or napi_complete() when
they drained RX ring and right before re-enabling device interrupts.
In busy polling, we can avoid interrupts being delivered since
we are polling RX ring in a controlled loop.
Drivers can chose to use napi_complete_done() return value
to reduce interrupts overhead while busy polling is active.
This is optional, legacy drivers should work fine even
if not updated.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Adam Belay <abelay@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now sk_busy_loop() can schedule by itself, we can remove
need_resched() check from sk_can_busy_loop()
Also add a const to its struct sock parameter.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Adam Belay <abelay@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job"),
sk_busy_loop() needs a bit of care :
softirqs might be delayed since we do not allow preemption yet.
This patch adds preemptiom points in sk_busy_loop(),
and makes sure no unnecessary cache line dirtying
or atomic operations are done while looping.
A new flag is added into napi->state : NAPI_STATE_IN_BUSY_POLL
This prevents napi_complete_done() from clearing NAPIF_STATE_SCHED,
so that sk_busy_loop() does not have to grab it again.
Similarly, netpoll_poll_lock() is done one time.
This gives about 10 to 20 % improvement in various busy polling
tests, especially when many threads are busy polling in
configurations with large number of NIC queues.
This should allow experimenting with bigger delays without
hurting overall latencies.
Tested:
On a 40Gb mlx4 NIC, 32 RX/TX queues.
echo 70 >/proc/sys/net/core/busy_read
for i in `seq 1 40`; do echo -n $i: ; ./super_netperf $i -H lpaa24 -t UDP_RR -- -N -n; done
Before: After:
1: 90072 92819
2: 157289 184007
3: 235772 213504
4: 344074 357513
5: 394755 458267
6: 461151 487819
7: 549116 625963
8: 544423 716219
9: 720460 738446
10: 794686 837612
11: 915998 923960
12: 937507 925107
13: 1019677 971506
14: 1046831 1113650
15: 1114154 1148902
16: 1105221 1179263
17: 1266552 1299585
18: 1258454 1383817
19: 1341453 1312194
20: 1363557 1488487
21: 1387979 1501004
22: 1417552 1601683
23: 1550049 1642002
24: 1568876 1601915
25: 1560239 1683607
26: 1640207 1745211
27: 1706540 1723574
28: 1638518 1722036
29: 1734309 1757447
30: 1782007 1855436
31: 1724806 1888539
32: 1717716 1944297
33: 1778716 1869118
34: 1805738 1983466
35: 1815694 2020758
36: 1893059 2035632
37: 1843406 2034653
38: 1888830 2086580
39: 1972827 2143567
40: 1877729 2181851
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Adam Belay <abelay@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Yuval Mintz <Yuval.Mintz@cavium.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The patch that removed the FIB offload infrastructure was a bit too
aggressive and also removed code needed to clean up us splitting the table
if additional rules were added. Specifically the function
fib_trie_flush_external was called at the end of a new rule being added to
flush the foreign trie entries from the main trie.
I updated the code so that we only call fib_trie_flush_external on the main
table so that we flush the entries for local from main. This way we don't
call it for every rule change which is what was happening previously.
Fixes: 347e3b28c1ba2 ("switchdev: remove FIB offload infrastructure")
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
invalid accesses to bpf map entries. Fix this up by doing a few things
1) Kill BPF_MOD support. This doesn't actually get used by the compiler in real
life and just adds extra complexity.
2) Fix the logic for BPF_AND, don't allow AND of negative numbers and set the
minimum value to 0 for positive AND's.
3) Don't do operations on the ranges if they are set to the limits, as they are
by definition undefined, and allowing arithmetic operations on those values
could make them appear valid when they really aren't.
This fixes the testcase provided by Jann as well as a few other theoretical
problems.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce managed counterparts for alloc_percpu() and free_percpu().
Add devm_alloc_percpu() and devm_free_percpu() into the managed
interfaces list.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Rolf Neugebauer reported very long delays at netns dismantle.
Eric W. Biederman was kind enough to look at this problem
and noticed synchronize_net() occurring from netif_napi_del() that was
added in linux-4.5
Busy polling makes no sense for tunnels NAPI.
If busy poll is used for sessions over tunnels, the poller will need to
poll the physical device queue anyway.
netif_tx_napi_add() could be used here, but function name is misleading,
and renaming it is not stable material, so set NAPI_STATE_NO_BUSY_POLL
bit directly.
This will avoid inserting gro_cells napi structures in napi_hash[]
and avoid the problematic synchronize_net() (per possible cpu) that
Rolf reported.
Fixes: 93d05d4a320c ("net: provide generic busy polling to all NAPI drivers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This function just calls into genphy_restart_aneg() to perform an
autonegotation restart.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
VxLan device does not have special handling for vlan taging on egress.
Therefore it does not make sense to expose vlan offloading feature.
This patch does not change vxlan functinality.
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The commit 850cbaddb52d ("udp: use it's own memory accounting schema")
assumes that the socket proto has memory accounting enabled,
but this is not the case for UDPLITE.
Fix it enabling memory accounting for UDPLITE and performing
fwd allocated memory reclaiming on socket shutdown.
UDP and UDPLITE share now the same memory accounting limits.
Also drop the backlog receive operation, since is no more needed.
Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
Reported-by: Andrei Vagin <avagin@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Provide a LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|