Age | Commit message (Collapse) | Author | Files | Lines |
|
__netpoll_rcu_free is used to free netpoll structures when the rtnl_lock is
already held. The mechanism is used to asynchronously call __netpoll_cleanup
outside of the holding of the rtnl_lock, so as to avoid deadlock.
Unfortunately, __netpoll_cleanup modifies pointers (dev->np), which means the
rtnl_lock must be held while calling it. Further, it cannot be held, because
rcu callbacks may be issued in softirq contexts, which cannot sleep.
Fix this by converting the rcu callback to a work queue that is guaranteed to
get scheduled in process context, so that we can hold the rtnl properly while
calling __netpoll_cleanup
Tested successfully by myself.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Cong Wang <amwang@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
drivers/net/ethernet/intel/e1000e/ethtool.c
drivers/net/vmxnet3/vmxnet3_drv.c
drivers/net/wireless/iwlwifi/dvm/tx.c
net/ipv6/route.c
The ipv6 route.c conflict is simple, just ignore the 'net' side change
as we fixed the same problem in 'net-next' by eliminating cached
neighbours from ipv6 routes.
The e1000e conflict is an addition of a new statistic in the ethtool
code, trivial.
The vmxnet3 conflict is about one change in 'net' removing a guarding
conditional, whilst in 'net-next' we had a netdev_info() conversion.
The iwlwifi conflict is dealing with a WARN_ON() conversion in
'net-next' vs. a revert happening in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
reduce the permission check of bond device's ioctl.
allow the userns root to control the bond device.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Makes more sense to have randomly generated address by default than to
have all zeroes. It also allows user to for example put the bond into
bridge without need to have any slaves in it.
Also note that this changes only behaviour of bonds with no slaves. Once
the first slave device is enslaved, its address will be used (no change
here).
Also, fix dev_assign_type values on the way.
Reported-by: Pavel Šimerda <psimerda@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When bonding module is loaded with primary parameter and one decides to unset
primary slave using sysfs these settings are not preserved during bond device
restart. Primary slave is only unset once and it's not remembered in
bond->params structure. Below is example of recreation.
grep OPTS /etc/sysconfig/network-scripts/ifcfg-bond0
BONDING_OPTS="mode=active-backup miimon=100 primary=eth01"
grep "Primary Slave" /proc/net/bonding/bond0
Primary Slave: eth01 (primary_reselect always)
echo "" > /sys/class/net/bond0/bonding/primary
grep "Primary Slave" /proc/net/bonding/bond0
Primary Slave: None
sed -i -e 's/primary=eth01//' /etc/sysconfig/network-scripts/ifcfg-bond0
grep OPTS /etc/sysconfig/network-scripts/ifcfg-bond
BONDING_OPTS="mode=active-backup miimon=100 "
ifdown bond0 && ifup bond0
without patch:
grep "Primary Slave" /proc/net/bonding/bond0
Primary Slave: eth01 (primary_reselect always)
with patch:
grep "Primary Slave" /proc/net/bonding/bond0
Primary Slave: None
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Milos Vyletel <milos.vyletel@sde.cz>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use strlcpy where possible to ensure the string is \0 terminated.
Use always sizeof(string) instead of 32, ETHTOOL_BUSINFO_LEN
and custom defines.
Use snprintf instead of sprint.
Remove unnecessary inits of ->fw_version
Remove unnecessary inits of drvinfo struct.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Benefit from new upper dev list and free bonding from dev->master usage.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Bonding initializes these works in bond_open() and cancels in bond_close(),
thus in bond_uninit() they are already canceled but may be unitialized yet.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
|
|
Pull networking changes from David Miller:
1) Allow to dump, monitor, and change the bridge multicast database
using netlink. From Cong Wang.
2) RFC 5961 TCP blind data injection attack mitigation, from Eric
Dumazet.
3) Networking user namespace support from Eric W. Biederman.
4) tuntap/virtio-net multiqueue support by Jason Wang.
5) Support for checksum offload of encapsulated packets (basically,
tunneled traffic can still be checksummed by HW). From Joseph
Gasparakis.
6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
Daniel Borkmann.
7) Bridge port parameters over netlink and BPDU blocking support
from Stephen Hemminger.
8) Improve data access patterns during inet socket demux by rearranging
socket layout, from Eric Dumazet.
9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
Jon Maloy.
10) Update TCP socket hash sizing to be more in line with current day
realities. The existing heurstics were choosen a decade ago.
From Eric Dumazet.
11) Fix races, queue bloat, and excessive wakeups in ATM and
associated drivers, from Krzysztof Mazur and David Woodhouse.
12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
in VXLAN driver, from David Stevens.
13) Add "oops_only" mode to netconsole, from Amerigo Wang.
14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
allow DCB netlink to work on namespaces other than the initial
namespace. From John Fastabend.
15) Support PTP in the Tigon3 driver, from Matt Carlson.
16) tun/vhost zero copy fixes and improvements, plus turn it on
by default, from Michael S. Tsirkin.
17) Support per-association statistics in SCTP, from Michele
Baldessari.
And many, many, driver updates, cleanups, and improvements. Too
numerous to mention individually.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
net/mlx4_en: Add support for destination MAC in steering rules
net/mlx4_en: Use generic etherdevice.h functions.
net: ethtool: Add destination MAC address to flow steering API
bridge: add support of adding and deleting mdb entries
bridge: notify mdb changes via netlink
ndisc: Unexport ndisc_{build,send}_skb().
uapi: add missing netconf.h to export list
pkt_sched: avoid requeues if possible
solos-pci: fix double-free of TX skb in DMA mode
bnx2: Fix accidental reversions.
bna: Driver Version Updated to 3.1.2.1
bna: Firmware update
bna: Add RX State
bna: Rx Page Based Allocation
bna: TX Intr Coalescing Fix
bna: Tx and Rx Optimizations
bna: Code Cleanup and Enhancements
ath9k: check pdata variable before dereferencing it
ath5k: RX timestamp is reported at end of frame
ath9k_htc: RX timestamp is reported at end of frame
...
|
|
Since commit 2c60db037034 ('net: provide a default dev->ethtool_ops')
all devices have a non-null ethtool_ops. Test only
dev->ethtool_ops->get_link in both places where we care.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Bonding in balance-alb mode records information from ARP packets
passing through the bond in a hash table (rx_hashtbl).
At certain situations (e.g. link change of a slave),
rlb_update_rx_clients() will send out ARP packets to update ARP
caches of other hosts on the network to achieve RX load
balancing.
The problem is that once an IP address is recorded in the hash
table, it stays there indefinitely. If this IP address is
migrated to a different host in the network, bonding still sends
out ARP packets that poison other systems' ARP caches with
invalid information.
This patch solves this by looking at all incoming ARP packets,
and checking if the source IP address is one of the source
addresses stored in the rx_hashtbl. If it is, but the MAC
addresses differ, the corresponding hash table entries are
removed. Thus, when an IP address is migrated, the first ARP
broadcast by its new owner will purge the offending entries of
rx_hashtbl.
The hash table is hashed by ip_dst. To be able to do the above
check efficiently (not walking the whole hash table), we need a
reverse mapping (by ip_src).
I added three new members in struct rlb_client_info:
rx_hashtbl[x].src_first will point to the start of a list of
entries for which hash(ip_src) == x.
The list is linked with src_next and src_prev.
When an incoming ARP packet arrives at rlb_arp_recv()
rlb_purge_src_ip() can quickly walk only the entries on the
corresponding lists, i.e. the entries that are likely to contain
the offending IP address.
To avoid confusion, I renamed these existing fields of struct
rlb_client_info:
next -> used_next
prev -> used_prev
rx_hashtbl_head -> rx_hashtbl_used_head
(The current linked list is _not_ a list of hash table
entries with colliding ip_dst. It's a list of entries that are
being used; its purpose is to avoid walking the whole hash table
when looking for used entries.)
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Do not modify or load balance ARP packets passing through balance-alb
mode (wherein the ARP did not originate locally, and arrived via a bridge).
Modifying pass-through ARP replies causes an incorrect MAC address
to be placed into the ARP packet, rendering peers unable to communicate
with the actual destination from which the ARP reply originated.
Load balancing pass-through ARP requests causes an entry to be
created for the peer in the rlb table, and bond_alb_monitor will
occasionally issue ARP updates to all peers in the table instrucing them
as to which MAC address they should communicate with; this occurs when
some event sets rx_ntt. In the bridged case, however, the MAC address
used for the update would be the MAC of the slave, not the actual source
MAC of the originating destination. This would render peers unable to
communicate with the destinations beyond the bridge.
Signed-off-by: Zheng Li <zheng.x.li@oracle.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Race between bonding_store_slaves_active() and slave manipulation
functions. The bond_for_each_slave use in bonding_store_slaves_active()
is not protected by any synchronization mechanism.
NULL pointer dereference is easy to reach.
Fixed by acquiring the bond->lock for the slave walk.
v2: Make description text < 75 columns
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The module can be loaded with arp_ip_target="255.255.255.255" which makes
it impossible to remove as the function in sysfs checks for that value,
so we make the parameter checks consistent with sysfs.
v2: Fix formatting
v3: Make description text < 75 columns
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
First I would give three observations which will be used later.
Observation 1: if (delayed_work_pending(wq)) cancel_delayed_work(wq)
This usage is wrong because the pending bit is cleared just before the
work's fn is executed and if the function re-arms itself we might end up
with the work still running. It's safe to call cancel_delayed_work_sync()
even if the work is not queued at all.
Observation 2: Use of INIT_DELAYED_WORK()
Work needs to be initialized only once prior to (de/en)queueing.
Observation 3: IFF_UP is set only after ndo_open is called
Related race conditions:
1. Race between bonding_store_miimon() and bonding_store_arp_interval()
Because of Obs.1 we can end up having both works enqueued.
2. Multiple races with INIT_DELAYED_WORK()
Since the works are not protected by anything between INIT_DELAYED_WORK()
and calls to (en/de)queue it is possible for races between the following
functions:
(races are also possible between the calls to INIT_DELAYED_WORK()
and workqueue code)
bonding_store_miimon() - bonding_store_arp_interval(), bond_close(),
bond_open(), enqueued functions
bonding_store_arp_interval() - bonding_store_miimon(), bond_close(),
bond_open(), enqueued functions
3. By Obs.1 we need to change bond_cancel_all()
Bugs 1 and 2 are fixed by moving all work initializations in bond_open
which by Obs. 2 and Obs. 3 and the fact that we make sure that all works
are cancelled in bond_close(), is guaranteed not to have any work
enqueued.
Also RTNL lock is now acquired in bonding_store_miimon/arp_interval so
they can't race with bond_close and bond_open. The opposing work is
cancelled only if the IFF_UP flag is set and it is cancelled
unconditionally. The opposing work is already cancelled if the interface
is down so no need to cancel it again. This way we don't need new
synchronizations for the bonding workqueue. These bugs (and fixes) are
tied together and belong in the same patch.
Note: I have left 1 line intentionally over 80 characters (84) because I
didn't like how it looks broken down. If you'd prefer it otherwise,
then simply break it.
v2: Make description text < 75 columns
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If all slaves of a balance-rr bond with ARP monitor are enslaved
with down link state, bond keeps down state even after slaves
go up.
This is caused by bond_enslave() setting curr_active_slave to
first slave not taking into account its link state. As
bond_loadbalance_arp_mon() uses curr_active_slave to identify
whether slave's down->up transition should update bond's link
state, bond stays down even if slaves are up (until first slave
goes from up to down at least once).
Before commit f31c7937 "bonding: start slaves with link down for
ARP monitor", this was masked by slaves always starting in UP
state with ARP monitor (and MII monitor not relying on
curr_active_slave being NULL if there is no slave up).
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
setting of slave devices.
Patch sets the lowest gso_max_size and gso_max_segs values of the slave devices during enslave and detach.
Signed-off-by: Sarveshwar Bandi <sarveshwar.bandi@emulex.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
Fix off-by-one error because IFNAMSIZ == 16 and when this
code gets executed we stick a NULL byte where we should not.
How to reproduce:
with CONFIG_CC_STACKPROTECTOR=y (otherwise it may pass by silently)
modprobe bonding; echo 1 > /sys/class/net/bond0/bonding/mode;
echo "AAAAAAAAAAAAAAAA" > /sys/class/net/bond0/bonding/active_slave;
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Note: Sorry for the second patch but I missed this one while checking
the file. You can squash them into one patch.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix off-by-one error because IFNAMSIZ == 16 and when this
code gets executed we stick a NULL byte where we should not.
How to reproduce:
with CONFIG_CC_STACKPROTECTOR=y (otherwise it may pass by silently)
modprobe bonding; echo 1 > /sys/class/net/bond0/bonding/mode;
echo "AAAAAAAAAAAAAAAA" > /sys/class/net/bond0/bonding/primary;
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In vlan_uses_dev() check for number of vlan devs rather than existence
of vlan_info. The reason is that vlan id 0 is there without appropriate
vlan dev on it by default which prevented from enslaving vlan challenged
dev.
Reported-by: Jon Stanley <jstanley@rmrf.net>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If a qdisc is installed on a bonding device, its possible to get
following lockdep splat under stress :
=============================================
[ INFO: possible recursive locking detected ]
3.6.0+ #211 Not tainted
---------------------------------------------
ping/4876 is trying to acquire lock:
(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
but task is already holding lock:
(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
lock(dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
*** DEADLOCK ***
May be due to missing lock nesting notation
6 locks held by ping/4876:
#0: (sk_lock-AF_INET){+.+.+.}, at: [<ffffffff815e5030>] raw_sendmsg+0x600/0xc30
#1: (rcu_read_lock_bh){.+....}, at: [<ffffffff815ba4bd>] ip_finish_output+0x12d/0x870
#2: (rcu_read_lock_bh){.+....}, at: [<ffffffff8157a0b0>] dev_queue_xmit+0x0/0x830
#3: (dev->qdisc_tx_busylock ?: &qdisc_tx_busylock){+.-...}, at: [<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
#4: (&bond->lock){++.?..}, at: [<ffffffffa02128c1>] bond_start_xmit+0x31/0x4b0 [bonding]
#5: (rcu_read_lock_bh){.+....}, at: [<ffffffff8157a0b0>] dev_queue_xmit+0x0/0x830
stack backtrace:
Pid: 4876, comm: ping Not tainted 3.6.0+ #211
Call Trace:
[<ffffffff810a0145>] __lock_acquire+0x715/0x1b80
[<ffffffff810a256b>] ? mark_held_locks+0x9b/0x100
[<ffffffff810a1bf2>] lock_acquire+0x92/0x1d0
[<ffffffff8157a191>] ? dev_queue_xmit+0xe1/0x830
[<ffffffff81726b7c>] _raw_spin_lock+0x3c/0x50
[<ffffffff8157a191>] ? dev_queue_xmit+0xe1/0x830
[<ffffffff8106264d>] ? rcu_read_lock_bh_held+0x5d/0x90
[<ffffffff8157a191>] dev_queue_xmit+0xe1/0x830
[<ffffffff8157a0b0>] ? netdev_pick_tx+0x570/0x570
[<ffffffffa0212a6a>] bond_start_xmit+0x1da/0x4b0 [bonding]
[<ffffffff815796d0>] dev_hard_start_xmit+0x240/0x6b0
[<ffffffff81597c6e>] sch_direct_xmit+0xfe/0x2a0
[<ffffffff8157a249>] dev_queue_xmit+0x199/0x830
[<ffffffff8157a0b0>] ? netdev_pick_tx+0x570/0x570
[<ffffffff815ba96f>] ip_finish_output+0x5df/0x870
[<ffffffff815ba4bd>] ? ip_finish_output+0x12d/0x870
[<ffffffff815bb964>] ip_output+0x54/0xf0
[<ffffffff815bad48>] ip_local_out+0x28/0x90
[<ffffffff815bc444>] ip_send_skb+0x14/0x50
[<ffffffff815bc4b2>] ip_push_pending_frames+0x32/0x40
[<ffffffff815e536a>] raw_sendmsg+0x93a/0xc30
[<ffffffff8128d570>] ? selinux_file_send_sigiotask+0x1f0/0x1f0
[<ffffffff8109ddb4>] ? __lock_is_held+0x54/0x80
[<ffffffff815f6730>] ? inet_recvmsg+0x220/0x220
[<ffffffff8109ddb4>] ? __lock_is_held+0x54/0x80
[<ffffffff815f6855>] inet_sendmsg+0x125/0x240
[<ffffffff815f6730>] ? inet_recvmsg+0x220/0x220
[<ffffffff8155cddb>] sock_sendmsg+0xab/0xe0
[<ffffffff810a1650>] ? lock_release_non_nested+0xa0/0x2e0
[<ffffffff810a1650>] ? lock_release_non_nested+0xa0/0x2e0
[<ffffffff8155d18c>] __sys_sendmsg+0x37c/0x390
[<ffffffff81195b2a>] ? fsnotify+0x2ca/0x7e0
[<ffffffff811958e8>] ? fsnotify+0x88/0x7e0
[<ffffffff81361f36>] ? put_ldisc+0x56/0xd0
[<ffffffff8116f98a>] ? fget_light+0x3da/0x510
[<ffffffff8155f6c4>] sys_sendmsg+0x44/0x80
[<ffffffff8172fc22>] system_call_fastpath+0x16/0x1b
Avoid this problem using a distinct lock_class_key for bonding
devices.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, all the time limits in the bonding ARP monitor are in
multiples of arp_interval -- the time interval at which the ARP
monitor is periodically scheduled.
With a fast network round-trip and a little scheduling latency
of the ARP monitor work, a limit of n*delta_in_ticks may
effectively mean (n-1)*delta_in_ticks.
This is fatal in case of n==1 (the link will stay down
forever) and makes the behaviour non-deterministic in all the
other cases.
Add a delta_in_ticks/2 time slack to all the time limits.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently the "bonding" driver does not support load balancing outgoing
traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
are currently supported; this patch adds transmit hashing for IPv6 (and
TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support in the
bonding driver. In addition, bounds checking has been added to all
transmit hashing functions.
The algorithm chosen (xor'ing the bottom three quads of the source and
destination addresses together, then xor'ing each byte of that result into
the bottom byte, finally xor'ing with the last bytes of the MAC addresses)
was selected after testing almost 400,000 unique IPv6 addresses harvested
from server logs. This algorithm had the most even distribution for both
big- and little-endian architectures while still using few instructions. Its
behavior also attempts to closely match that of the IPv4 algorithm.
The IPv6 flow label was intentionally not included in the hash as it appears
to be unset in the vast majority of IPv6 traffic sampled, and the current
algorithm not using the flow label already offers a very even distribution.
Fragmented IPv6 packets are handled the same way as fragmented IPv4 packets,
ie, they are not balanced based on layer 4 information. Additionally,
IPv6 packets with intermediate headers are not balanced based on layer
4 information. In practice these intermediate headers are not common and
this should not cause any problems, and the alternative (a packet-parsing
loop and look-up table) seemed slow and complicated for little gain.
Tested-by: John Eaglesham <linux@8192.net>
Signed-off-by: John Eaglesham <linux@8192.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
Although this doesn't matter actually, because netpoll_tx_running()
doesn't use the parameter, the code will be more readable.
For team_dev_queue_xmit() we have to move it down to avoid
compile errors.
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Like the previous patch, slave_disable_netpoll() and __netpoll_cleanup()
may be called with read_lock() held too, so we should make them
non-block, by moving the cleanup and kfree() to call_rcu_bh() callbacks.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
slave_enable_netpoll() and __netpoll_setup() may be called
with read_lock() held, so should use GFP_ATOMIC to allocate
memory. Eric suggested to pass gfp flags to __netpoll_setup().
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
I don't see any benifits to use netdev_bonding_change() than
using call_netdevice_notifiers() directly.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since now number of tx queues can be specified during bond instance
creation and therefore it may differ from params.tx_queues, use rather
real_num_tx_queues for boundary check.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As this is going to be used not only by bonding.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Also cut out unused function parameters and possible err in return
value.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some workloads greatly benefit of IFF_XMIT_DST_RELEASE capability
on output net device, avoiding dirtying dst refcount.
bonding currently disables IFF_XMIT_DST_RELEASE unconditionally.
If all slaves have the IFF_XMIT_DST_RELEASE bit set, then
bonding master can also have it in its priv_flags
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
net/batman-adv/bridge_loop_avoidance.c
net/batman-adv/bridge_loop_avoidance.h
net/batman-adv/soft-interface.c
net/mac80211/mlme.c
With merge help from Antonio Quartulli (batman-adv) and
Stephen Rothwell (drivers/net/usb/qmi_wwan.c).
The net/mac80211/mlme.c conflict seemed easy enough, accounting for a
conversion to some new tracing macros.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The bonding debugfs support has been broken in the presence of network
namespaces since it has been added. The debugfs support does not handle
multiple bonding devices with the same name in different network
namespaces.
I haven't had any bug reports, and I'm not interested in getting any.
Disable the debugfs support when network namespaces are enabled.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It was recently reported that moving a bonding device between network
namespaces causes warnings from /proc. It turns out after the move we
were trying to add and to remove the /proc/net/bonding entries from the
wrong network namespace.
Move the bonding /proc registration code into the NETDEV_REGISTER and
NETDEV_UNREGISTER events where the proc registration and unregistration
will always happen at the right time.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
drivers/net/usb/qmi_wwan.c
net/batman-adv/translation-table.c
net/ipv6/route.c
qmi_wwan.c resolution provided by Bjørn Mork.
batman-adv conflict is dealing merely with the changes
of global function names to have a proper subsystem
prefix.
ipv6's route.c conflict is merely two side-by-side additions
of network namespace methods.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are four link statuses of a bonding slave, the procfs
code shows a wrong status when using downdelay/updelay:
(slave->link == BOND_LINK_UP) ? "up" : "down"
It doesn't respect the rest two statuses. This patch fixes it.
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When packets are dropped in TX path, its better to use kfree_skb()
instead of dev_kfree_skb() to give proper drop_monitor events.
Also move the kfree_skb() call after read_unlock() in bond_alb_xmit()
and bond_xmit_activebackup()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
MAINTAINERS
drivers/net/wireless/iwlwifi/pcie/trans.c
The iwlwifi conflict was resolved by keeping the code added
in 'net' that turns off the buggy chip feature.
The MAINTAINERS conflict was merely overlapping changes, one
change updated all the wireless web site URLs and the other
changed some GIT trees to be Johannes's instead of John's.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cloning all packets in input path have a significant cost.
Use skb_header_pointer()/skb_copy_bits() instead of pskb_may_pull() so
that recv_probe handlers (bond_3ad_lacpdu_recv / bond_arp_rcv /
rlb_arp_recv ) dont touch input skb.
bond_handle_frame() can avoid the skb_clone()/dev_kfree_skb()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Cc: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the transmit path of the bonding driver, skb->cb is used to
stash the skb->queue_mapping so that the bonding device can set its
own queue mapping. This value becomes corrupted since the skb->cb is
also used in __dev_xmit_skb.
When transmitting through bonding driver, bond_select_queue is
called from dev_queue_xmit. In bond_select_queue the original
skb->queue_mapping is copied into skb->cb (via bond_queue_mapping)
and skb->queue_mapping is overwritten with the bond driver queue.
Subsequently in dev_queue_xmit, __dev_xmit_skb is called which writes
the packet length into skb->cb, thereby overwriting the stashed
queue mappping. In bond_dev_queue_xmit (called from hard_start_xmit),
the queue mapping for the skb is set to the stashed value which is now
the skb length and hence is an invalid queue for the slave device.
If we want to save skb->queue_mapping into skb->cb[], best place is to
add a field in struct qdisc_skb_cb, to make sure it wont conflict with
other layers (eg : Qdiscc, Infiniband...)
This patchs also makes sure (struct qdisc_skb_cb)->data is aligned on 8
bytes :
netem qdisc for example assumes it can store an u64 in it, without
misalignment penalty.
Note : we only have 20 bytes left in (struct qdisc_skb_cb)->data[].
The largest user is CHOKe and it fills it.
Based on a previous patch from Tom Herbert.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Tom Herbert <therbert@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Roland Dreier <roland@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If we modify primary via sysfs and it is not a valid slave,
we should record it for future use, and this behavior is the same with
bond_check_params().
Signed-off-by: Weiping Pan <wpan@redhat.com>
Acked-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
I applied the wrong version of Jiri's bonding fix in commit
13a8e0c8cdb43982372bd6c65fb26839c8fd8ce9 ("bonding: don't increase
rx_dropped after processing LACPDUs")
I applied v3, which introduces warnings I asked him to fix,
instead of v4 which properly takes care of those issues.
This inter-diffs such that the warnings are now gone.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use the new bool function ether_addr_equal_64bits to add
some clarity and reduce the likelihood for misuse of
compare_ether_addr_64bits for sorting.
Done via cocci script:
$ cat compare_ether_addr_64bits.cocci
@@
expression a,b;
@@
- !compare_ether_addr_64bits(a, b)
+ ether_addr_equal_64bits(a, b)
@@
expression a,b;
@@
- compare_ether_addr_64bits(a, b)
+ !ether_addr_equal_64bits(a, b)
@@
expression a,b;
@@
- !ether_addr_equal_64bits(a, b) == 0
+ ether_addr_equal_64bits(a, b)
@@
expression a,b;
@@
- !ether_addr_equal_64bits(a, b) != 0
+ !ether_addr_equal_64bits(a, b)
@@
expression a,b;
@@
- ether_addr_equal_64bits(a, b) == 0
+ !ether_addr_equal_64bits(a, b)
@@
expression a,b;
@@
- ether_addr_equal_64bits(a, b) != 0
+ ether_addr_equal_64bits(a, b)
@@
expression a,b;
@@
- !!ether_addr_equal_64bits(a, b)
+ ether_addr_equal_64bits(a, b)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use the new bool function ether_addr_equal to add
some clarity and reduce the likelihood for misuse
of compare_ether_addr for sorting.
Done via cocci script:
$ cat compare_ether_addr.cocci
@@
expression a,b;
@@
- !compare_ether_addr(a, b)
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- compare_ether_addr(a, b)
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- !ether_addr_equal(a, b) == 0
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- !ether_addr_equal(a, b) != 0
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- ether_addr_equal(a, b) == 0
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- ether_addr_equal(a, b) != 0
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- !!ether_addr_equal(a, b)
+ ether_addr_equal(a, b)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit 3aba891d, bonding processes LACP frames (802.3ad
mode) with bond_handle_frame(). Currently a copy of the skb is
made and the original is left to be processed by other
rx_handlers and the rest of the network stack by returning
RX_HANDLER_ANOTHER. As there is no protocol handler for
PKT_TYPE_LACPDU, the frame is dropped and dev->rx_dropped
increased.
Fix this by making bond_handle_frame() return RX_HANDLER_CONSUMED
if bonding has processed the LACP frame.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
|