<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/net/core/dst.c, branch linux-4.13.y</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=linux-4.13.y</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=linux-4.13.y'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2017-06-25T15:42:01+00:00</updated>
<entry>
<title>net: store port/representator id in metadata_dst</title>
<updated>2017-06-25T15:42:01+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>jakub.kicinski@netronome.com</email>
</author>
<published>2017-06-23T20:11:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3fcece12bc1b6dcdf0986f2cd9e8f63b1f9b6aa0'/>
<id>urn:sha1:3fcece12bc1b6dcdf0986f2cd9e8f63b1f9b6aa0</id>
<content type='text'>
Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
representators and control messages over single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.

Those requirements make it hard to comfortably use TC infrastructure today
unless we have a way of attaching metadata to skbs at the upper device.
Because single set of queues is used for many netdevs stopping TC/sched
queues of all of them reliably is impossible and lower device has to
retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
the fastpath.

This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id.  This way representatives can be queueless and
all queuing can be performed at the lower netdev in the usual way.

Traffic arriving on the port/representative interfaces will be have
metadata attached and will subsequently be queued to the lower device for
transmission.  The lower device should recognize the metadata and translate
it to HW specific format which is most likely either a special header
inserted before the network headers or descriptor/metadata fields.

Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror the new
netdev will not try to interpret it.

This is mostly for SR-IOV devices since switches don't have lower netdevs
today.

Signed-off-by: Jakub Kicinski &lt;jakub.kicinski@netronome.com&gt;
Signed-off-by: Sridhar Samudrala &lt;sridhar.samudrala@intel.com&gt;
Signed-off-by: Simon Horman &lt;horms@verge.net.au&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: remove DST_NOCACHE flag</title>
<updated>2017-06-18T02:54:01+00:00</updated>
<author>
<name>Wei Wang</name>
<email>weiwan@google.com</email>
</author>
<published>2017-06-17T17:42:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a4c2fd7f78915a0d7c5275e7612e7793157a01f2'/>
<id>urn:sha1:a4c2fd7f78915a0d7c5275e7612e7793157a01f2</id>
<content type='text'>
DST_NOCACHE flag check has been removed from dst_release() and
dst_hold_safe() in a previous patch because all the dst are now ref
counted properly and can be released based on refcnt only.
Looking at the rest of the DST_NOCACHE use, all of them can now be
removed or replaced with other checks.
So this patch gets rid of all the DST_NOCACHE usage and remove this flag
completely.

Signed-off-by: Wei Wang &lt;weiwan@google.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: remove DST_NOGC flag</title>
<updated>2017-06-18T02:54:01+00:00</updated>
<author>
<name>Wei Wang</name>
<email>weiwan@google.com</email>
</author>
<published>2017-06-17T17:42:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b2a9c0ed75a32e788d034a58a18f2fc46396e412'/>
<id>urn:sha1:b2a9c0ed75a32e788d034a58a18f2fc46396e412</id>
<content type='text'>
Now that all the components have been changed to release dst based on
refcnt only and not depend on dst gc anymore, we can remove the
temporary flag DST_NOGC.

Note that we also need to remove the DST_NOCACHE check in dst_release()
and dst_hold_safe() because now all the dst are released based on refcnt
and behaves as DST_NOCACHE.

Signed-off-by: Wei Wang &lt;weiwan@google.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: remove dst gc related code</title>
<updated>2017-06-18T02:54:01+00:00</updated>
<author>
<name>Wei Wang</name>
<email>weiwan@google.com</email>
</author>
<published>2017-06-17T17:42:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5b7c9a8ff828287af5aebe93e707271bf1a82cc3'/>
<id>urn:sha1:5b7c9a8ff828287af5aebe93e707271bf1a82cc3</id>
<content type='text'>
This patch removes all dst gc related code and all the dst free
functions

Signed-off-by: Wei Wang &lt;weiwan@google.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>xfrm: take refcnt of dst when creating struct xfrm_dst bundle</title>
<updated>2017-06-18T02:54:00+00:00</updated>
<author>
<name>Wei Wang</name>
<email>weiwan@google.com</email>
</author>
<published>2017-06-17T17:42:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=52df157f17e564ec22afc3e4a89b21828220f576'/>
<id>urn:sha1:52df157f17e564ec22afc3e4a89b21828220f576</id>
<content type='text'>
During the creation of xfrm_dst bundle, always take ref count when
allocating the dst. This way, xfrm_bundle_create() will form a linked
list of dst with dst-&gt;child pointing to a ref counted dst child. And
the returned dst pointer is also ref counted. This makes the link from
the flow cache to this dst now ref counted properly.
As the dst is always ref counted properly, we can safely mark
DST_NOGC flag so dst_release() will release dst based on refcnt only.
And dst gc is no longer needed and all dst_free() and its related
function calls should be replaced with dst_release() or
dst_release_immediate().

The special handling logic for dst-&gt;child in dst_destroy() can be
replaced with a simple dst_release_immediate() call on the child to
release the whole list linked by dst-&gt;child pointer.
Previously used DST_NOHASH flag is not needed anymore as well. The
reason that DST_NOHASH is used in the existing code is mainly to prevent
the dst inserted in the fib tree to be wrongly destroyed during the
deletion of the xfrm_dst bundle. So in the existing code, DST_NOHASH
flag is marked in all the dst children except the one which is in the
fib tree.
However, with this patch series to remove dst gc logic and release dst
only based on ref count, it is safe to release all the children from a
xfrm_dst bundle as long as the dst children are all ref counted
properly which is already the case in the existing code.
So, this patch removes the use of DST_NOHASH flag.

Signed-off-by: Wei Wang &lt;weiwan@google.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: introduce a new function dst_dev_put()</title>
<updated>2017-06-18T02:53:59+00:00</updated>
<author>
<name>Wei Wang</name>
<email>weiwan@google.com</email>
</author>
<published>2017-06-17T17:42:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36'/>
<id>urn:sha1:4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36</id>
<content type='text'>
This function should be called when removing routes from fib tree after
the dst gc is no longer in use.
We first mark DST_OBSOLETE_DEAD on this dst to make sure next
dst_ops-&gt;check() fails and returns NULL.
Secondly, as we no longer keep the gc_list, we need to properly
release dst-&gt;dev right at the moment when the dst is removed from
the fib/fib6 tree.
It does the following:
1. change dst-&gt;input and output pointers to dst_discard/dst_dscard_out to
   discard all packets
2. replace dst-&gt;dev with loopback interface

Signed-off-by: Wei Wang &lt;weiwan@google.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: introduce DST_NOGC in dst_release() to destroy dst based on refcnt</title>
<updated>2017-06-18T02:53:59+00:00</updated>
<author>
<name>Wei Wang</name>
<email>weiwan@google.com</email>
</author>
<published>2017-06-17T17:42:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5f56f409b5142bb4e88e1f64dedeb4a76c678177'/>
<id>urn:sha1:5f56f409b5142bb4e88e1f64dedeb4a76c678177</id>
<content type='text'>
The current mechanism of freeing dst is a bit complicated. dst has its
ref count and when user grabs the reference to the dst, the ref count is
properly taken in most cases except in IPv4/IPv6/decnet/xfrm routing
code due to some historic reasons.

If the reference to dst is always taken properly, we should be able to
simplify the logic in dst_release() to destroy dst when dst-&gt;__refcnt
drops from 1 to 0. And this should be the only condition to determine
if we can call dst_destroy().
And as dst is always ref counted, there is no need for a dst garbage
list to hold the dst entries that already get removed by the routing
code but are still held by other users. And the task to periodically
check the list to free dst if ref count become 0 is also not needed
anymore.

This patch introduces a temporary flag DST_NOGC(no garbage collector).
If it is set in the dst, dst_release() will call dst_destroy() when
dst-&gt;__refcnt drops to 0. dst_hold_safe() will also check for this flag
and do atomic_inc_not_zero() similar as DST_NOCACHE to avoid double free
issue.
This temporary flag is mainly used so that we can make the transition
component by component without breaking other parts.
This flag will be removed after all components are properly transitioned.

This patch also introduces a new function dst_release_immediate() which
destroys dst without waiting on the rcu when refcnt drops to 0. It will
be used in later patches.

Follow-up patches will correct all the places to properly take ref count
on dst and mark DST_NOGC. dst_release() or dst_release_immediate() will
be used to release the dst instead of dst_free() and its related
functions.
And final clean-up patch will remove the DST_NOGC flag.

Signed-off-by: Wei Wang &lt;weiwan@google.com&gt;
Acked-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Fix an intermittent pr_emerg warning about lo becoming free.</title>
<updated>2017-06-09T16:27:28+00:00</updated>
<author>
<name>Krister Johansen</name>
<email>kjlx@templeofstupid.com</email>
</author>
<published>2017-06-08T20:12:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f186ce61bb8235d80068c390dc2aad7ca427a4c2'/>
<id>urn:sha1:f186ce61bb8235d80068c390dc2aad7ca427a4c2</id>
<content type='text'>
It looks like this:

Message from syslogd@flamingo at Apr 26 00:45:00 ...
 kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4

They seem to coincide with net namespace teardown.

The message is emitted by netdev_wait_allrefs().

Forced a kdump in netdev_run_todo, but found that the refcount on the lo
device was already 0 at the time we got to the panic.

Used bcc to check the blocking in netdev_run_todo.  The only places
where we're off cpu there are in the rcu_barrier() and msleep() calls.
That behavior is expected.  The msleep time coincides with the amount of
time we spend waiting for the refcount to reach zero; the rcu_barrier()
wait times are not excessive.

After looking through the list of callbacks that the netdevice notifiers
invoke in this path, it appears that the dst_dev_event is the most
interesting.  The dst_ifdown path places a hold on the loopback_dev as
part of releasing the dev associated with the original dst cache entry.
Most of our notifier callbacks are straight-forward, but this one a)
looks complex, and b) places a hold on the network interface in
question.

I constructed a new bcc script that watches various events in the
liftime of a dst cache entry.  Note that dst_ifdown will take a hold on
the loopback device until the invalidated dst entry gets freed.

[      __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
    __dst_free
    rcu_nocb_kthread
    kthread
    ret_from_fork
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>ipv4: add reference counting to metrics</title>
<updated>2017-05-26T18:57:07+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-05-25T21:27:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3fb07daff8e99243366a081e5129560734de4ada'/>
<id>urn:sha1:3fb07daff8e99243366a081e5129560734de4ada</id>
<content type='text'>
Andrey Konovalov reported crashes in ipv4_mtu()

I could reproduce the issue with KASAN kernels, between
10.246.7.151 and 10.246.7.152 :

1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &amp;

2) At the same time run following loop :
while :
do
 ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
done

Cong Wang attempted to add back rt-&gt;fi in commit
82486aa6f1b9 ("ipv4: restore rt-&gt;fi for reference counting")
but this proved to add some issues that were complex to solve.

Instead, I suggested to add a refcount to the metrics themselves,
being a standalone object (in particular, no reference to other objects)

I tried to make this patch as small as possible to ease its backport,
instead of being super clean. Note that we believe that only ipv4 dst
need to take care of the metric refcount. But if this is wrong,
this patch adds the basic infrastructure to extend this to other
families.

Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
for his efforts on this problem.

Fixes: 2860583fe840 ("ipv4: Kill rt-&gt;fi")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Reviewed-by: Julian Anastasov &lt;ja@ssi.bg&gt;
Acked-by: Cong Wang &lt;xiyou.wangcong@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: pending_confirm is not used anymore</title>
<updated>2017-02-07T18:07:47+00:00</updated>
<author>
<name>Julian Anastasov</name>
<email>ja@ssi.bg</email>
</author>
<published>2017-02-06T21:14:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=51ce8bd4d17a761e1a90a34a1b5c9b762cce7553'/>
<id>urn:sha1:51ce8bd4d17a761e1a90a34a1b5c9b762cce7553</id>
<content type='text'>
When same struct dst_entry can be used for many different
neighbours we can not use it for pending confirmations.
As last step, we can remove the pending_confirm flag.

Reported-by: YueHaibing &lt;yuehaibing@huawei.com&gt;
Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.")
Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.")
Signed-off-by: Julian Anastasov &lt;ja@ssi.bg&gt;
Acked-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
