<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/net/net_namespace.h, branch v6.6.40</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.40</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.6.40'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2023-08-30T00:39:15+00:00</updated>
<entry>
<title>Merge tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux</title>
<updated>2023-08-30T00:39:15+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-08-30T00:39:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=adfd671676c922bada16477eb68b5eb5f065addc'/>
<id>urn:sha1:adfd671676c922bada16477eb68b5eb5f065addc</id>
<content type='text'>
Pull sysctl updates from Luis Chamberlain:
 "Long ago we set out to remove the kitchen sink on kernel/sysctl.c
  arrays and placings sysctls to their own sybsystem or file to help
  avoid merge conflicts. Matthew Wilcox pointed out though that if we're
  going to do that we might as well also *save* space while at it and
  try to remove the extra last sysctl entry added at the end of each
  array, a sentintel, instead of bloating the kernel by adding a new
  sentinel with each array moved.

  Doing that was not so trivial, and has required slowing down the moves
  of kernel/sysctl.c arrays and measuring the impact on size by each new
  move.

  The complex part of the effort to help reduce the size of each sysctl
  is being done by the patient work of el señor Don Joel Granados. A lot
  of this is truly painful code refactoring and testing and then trying
  to measure the savings of each move and removing the sentinels.
  Although Joel already has code which does most of this work,
  experience with sysctl moves in the past shows is we need to be
  careful due to the slew of odd build failures that are possible due to
  the amount of random Kconfig options sysctls use.

  To that end Joel's work is split by first addressing the major
  housekeeping needed to remove the sentinels, which is part of this
  merge request. The rest of the work to actually remove the sentinels
  will be done later in future kernel releases.

  The preliminary math is showing this will all help reduce the overall
  build time size of the kernel and run time memory consumed by the
  kernel by about ~64 bytes per array where we are able to remove each
  sentinel in the future. That also means there is no more bloating the
  kernel with the extra ~64 bytes per array moved as no new sentinels
  are created"

* tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: Use ctl_table_size as stopping criteria for list macro
  sysctl: SIZE_MAX-&gt;ARRAY_SIZE in register_net_sysctl
  vrf: Update to register_net_sysctl_sz
  networking: Update to register_net_sysctl_sz
  netfilter: Update to register_net_sysctl_sz
  ax.25: Update to register_net_sysctl_sz
  sysctl: Add size to register_net_sysctl function
  sysctl: Add size arg to __register_sysctl_init
  sysctl: Add size to register_sysctl
  sysctl: Add a size arg to __register_sysctl_table
  sysctl: Add size argument to init_header
  sysctl: Add ctl_table_size to ctl_table_header
  sysctl: Use ctl_table_header in list_for_each_table_entry
  sysctl: Prefer ctl_table_header in proc_sysctl
</content>
</entry>
<entry>
<title>sysctl: SIZE_MAX-&gt;ARRAY_SIZE in register_net_sysctl</title>
<updated>2023-08-15T22:26:18+00:00</updated>
<author>
<name>Joel Granados</name>
<email>joel.granados@gmail.com</email>
</author>
<published>2023-08-09T10:50:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e1b41e4f4f44dbdc4fbb8ed37919182104d7699c'/>
<id>urn:sha1:e1b41e4f4f44dbdc4fbb8ed37919182104d7699c</id>
<content type='text'>
Replace SIZE_MAX with ARRAY_SIZE in the register_net_sysctl macro. Now
that all the callers to register_net_sysctl are actual arrays, we can
call ARRAY_SIZE() without any compilation warnings. By calculating the
actual array size, this commit is making sure that register_net_sysctl
and all its callers forward the table_size into sysctl backend for when
the sentinel elements in the ctl_table arrays (last empty markers) are
removed. Without it the removal would fail lacking a stopping criteria
for traversing the ctl_table arrays.

Stopping condition continues to be based on both table size and the
procname null test. This is needed in order to allow for the systematic
removal al the sentinel element in subsequent commits: Before removing
sentinel the stopping criteria will be the last null element. When the
sentinel is removed then the (correct) size will take over.

Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
Suggested-by: Jani Nikula &lt;jani.nikula@linux.intel.com&gt;
Signed-off-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
</content>
</entry>
<entry>
<title>sysctl: Add size to register_net_sysctl function</title>
<updated>2023-08-15T22:26:17+00:00</updated>
<author>
<name>Joel Granados</name>
<email>joel.granados@gmail.com</email>
</author>
<published>2023-08-09T10:50:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=95d4977876d6658fdee58be8c8668aac68c25dd2'/>
<id>urn:sha1:95d4977876d6658fdee58be8c8668aac68c25dd2</id>
<content type='text'>
This commit adds size to the register_net_sysctl indirection function to
facilitate the removal of the sentinel elements (last empty markers)
from the ctl_table arrays. Though we don't actually remove any sentinels
in this commit, register_net_sysctl* now has the capability of
forwarding table_size for when that happens.

We create a new function register_net_sysctl_sz with an extra size
argument. A macro replaces the existing register_net_sysctl. The size in
the macro is SIZE_MAX instead of ARRAY_SIZE to avoid compilation errors
while we systematically migrate to register_net_sysctl_sz. Will change
to ARRAY_SIZE in subsequent commits.

Care is taken to add table_size to the stopping criteria in such a way
that when we remove the empty sentinel element, it will continue
stopping in the last element of the ctl_table array.

Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
Suggested-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: store netdevs in an xarray</title>
<updated>2023-07-28T18:35:58+00:00</updated>
<author>
<name>Jakub Kicinski</name>
<email>kuba@kernel.org</email>
</author>
<published>2023-07-26T18:55:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=759ab1edb56c88906830fd6b2e7b12514dd32758'/>
<id>urn:sha1:759ab1edb56c88906830fd6b2e7b12514dd32758</id>
<content type='text'>
Iterating over the netdev hash table for netlink dumps is hard.
Dumps are done in "chunks" so we need to save the position
after each chunk, so we know where to restart from. Because
netdevs are stored in a hash table we remember which bucket
we were in and how many devices we dumped.

Since we don't hold any locks across the "chunks" - devices may
come and go while we're dumping. If that happens we may miss
a device (if device is deleted from the bucket we were in).
We indicate to user space that this may have happened by setting
NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
if it sees that. Somehow I doubt most user space gets this right..

To illustrate let's look at an example:

               System state:
  start:       # [A, B, C]
  del:  B      # [A, C]

with the hash table we may dump [A, B], missing C completely even
tho it existed both before and after the "del B".

Add an xarray and use it to allocate ifindexes. This way we
can iterate ifindexes in order, without the worry that we'll
skip one. We may still generate a dump of a state which "never
existed", for example for a set of values and sequence of ops:

               System state:
  start:       # [A, B]
  add:  C      # [A, C, B]
  del:  B      # [A, C]

we may generate a dump of [A], if C got an index between A and B.
System has never been in such state. But I'm 90% sure that's perfectly
fine, important part is that we can't _miss_ devices which exist before
and after. User space which wants to mirror kernel's state subscribes
to notifications and does periodic dumps so it will know that C exists
from the notification about its creation or from the next dump
(next dump is _guaranteed_ to include C, if it doesn't get removed).

To avoid any perf regressions keep the hash table for now. Most
net namespaces have very few devices and microbenchmarking 1M lookups
on Skylake I get the following results (not counting loopback
to number of devs):

 #devs | hash |  xa  | delta
    2  | 18.3 | 20.1 | + 9.8%
   16  | 18.3 | 20.1 | + 9.5%
   64  | 18.3 | 26.3 | +43.8%
  128  | 20.4 | 26.3 | +28.6%
  256  | 20.0 | 26.4 | +32.1%
 1024  | 26.6 | 26.7 | + 0.2%
 8192  |541.3 | 33.5 | -93.8%

No surprises since the hash table has 256 entries.
The microbenchmark scans indexes in order, if the pattern is more
random xa starts to win at 512 devices already. But that's a lot
of devices, in practice.

Reviewed-by: Leon Romanovsky &lt;leonro@nvidia.com&gt;
Link: https://lore.kernel.org/r/20230726185530.2247698-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
<entry>
<title>net: add a refcount tracker for kernel sockets</title>
<updated>2022-10-24T10:04:43+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-10-20T23:20:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0cafd77dcd032d1687efaba5598cf07bce85997f'/>
<id>urn:sha1:0cafd77dcd032d1687efaba5598cf07bce85997f</id>
<content type='text'>
Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock")
added a tracker to sockets, but did not track kernel sockets.

We still have syzbot reports hinting about netns being destroyed
while some kernel TCP sockets had not been dismantled.

This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
call to net_free() right before the netns is freed.

Normally, each layer is responsible for properly releasing its
kernel sockets before last call to net_free().

This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reviewed-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Tested-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>netfilter: nf_flow_table: count pending offload workqueue tasks</title>
<updated>2022-07-11T14:25:14+00:00</updated>
<author>
<name>Vlad Buslov</name>
<email>vladbu@nvidia.com</email>
</author>
<published>2022-06-15T10:43:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b038177636f83bbf87c2b238706474145dd2cd04'/>
<id>urn:sha1:b038177636f83bbf87c2b238706474145dd2cd04</id>
<content type='text'>
To improve hardware offload debuggability count pending 'add', 'del' and
'stats' flow_table offload workqueue tasks. Counters are incremented before
scheduling new task and decremented when workqueue handler finishes
executing. These counters allow user to diagnose congestion on hardware
offload workqueues that can happen when either CPU is starved and workqueue
jobs are executed at lower rate than new ones are added or when
hardware/driver can't keep up with the rate.

Implement the described counters as percpu counters inside new struct
netns_ft which is stored inside struct net. Expose them via new procfs file
'/proc/net/stats/nf_flowtable' that is similar to existing 'nf_conntrack'
file.

Signed-off-by: Vlad Buslov &lt;vladbu@nvidia.com&gt;
Signed-off-by: Oz Shlomo &lt;ozsh@nvidia.com&gt;
Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
</content>
</entry>
<entry>
<title>af_unix: Define a per-netns hash table.</title>
<updated>2022-06-22T11:59:43+00:00</updated>
<author>
<name>Kuniyuki Iwashima</name>
<email>kuniyu@amazon.com</email>
</author>
<published>2022-06-21T17:19:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b6e811383062f88212082714db849127fa95142c'/>
<id>urn:sha1:b6e811383062f88212082714db849127fa95142c</id>
<content type='text'>
This commit adds a per netns hash table for AF_UNIX, which size is fixed
as UNIX_HASH_SIZE for now.

The first implementation defines a per-netns hash table as a single array
of lock and list:

	struct unix_hashbucket {
		spinlock_t		lock;
		struct hlist_head	head;
	};

	struct netns_unix {
		struct unix_hashbucket	*hash;
		...
	};

But, Eric pointed out memory cost that the structure has holes because of
sizeof(spinlock_t), which is 4 (or more if LOCKDEP is enabled). [0]  It
could be expensive on a host with thousands of netns and few AF_UNIX
sockets.  For this reason, a per-netns hash table uses two dense arrays.

	struct unix_table {
		spinlock_t		*locks;
		struct hlist_head	*buckets;
	};

	struct netns_unix {
		struct unix_table	table;
		...
	};

Note the length of the list has a significant impact rather than lock
contention, so having shared locks can be an option.  But, per-netns
locks and lists still perform better than the global locks and per-netns
lists. [1]

Also, this patch adds a change so that struct netns_unix disappears from
struct net if CONFIG_UNIX is disabled.

[0]: https://lore.kernel.org/netdev/CANn89iLVxO5aqx16azNU7p7Z-nz5NrnM5QTqOzueVxEnkVTxyg@mail.gmail.com/
[1]: https://lore.kernel.org/netdev/20220617175215.1769-1-kuniyu@amazon.com/

Signed-off-by: Kuniyuki Iwashima &lt;kuniyu@amazon.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: make net-&gt;dev_unreg_count atomic</title>
<updated>2022-02-10T15:30:26+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-02-10T02:59:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ede6c39c4f9068cbeb4036448c45fff5393e0432'/>
<id>urn:sha1:ede6c39c4f9068cbeb4036448c45fff5393e0432</id>
<content type='text'>
Having to acquire rtnl from netdev_run_todo() for every dismantled
device is not desirable when/if rtnl is under stress.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: initialize init_net earlier</title>
<updated>2022-02-06T11:04:29+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2022-02-05T17:01:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9c1be1935fb68b2413796cdc03d019b8cf35ab51'/>
<id>urn:sha1:9c1be1935fb68b2413796cdc03d019b8cf35ab51</id>
<content type='text'>
While testing a patch that will follow later
("net: add netns refcount tracker to struct nsproxy")
I found that devtmpfs_init() was called before init_net
was initialized.

This is a bug, because devtmpfs_setup() calls
ksys_unshare(CLONE_NEWNS);

This has the effect of increasing init_net refcount,
which will be later overwritten to 1, as part of setup_net(&amp;init_net)

We had too many prior patches [1] trying to work around the root cause.

Really, make sure init_net is in BSS section, and that net_ns_init()
is called earlier at boot time.

Note that another patch ("vfs: add netns refcount tracker
to struct fs_context") also will need net_ns_init() being called
before vfs_caches_init()

As a bonus, this patch saves around 4KB in .data section.

[1]

f8c46cb39079 ("netns: do not call pernet ops for not yet set up init_net namespace")
b5082df8019a ("net: Initialise init_net.count to 1")
734b65417b24 ("net: Statically initialize init_net.dev_base_head")

v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: add networking namespace refcount tracker</title>
<updated>2021-12-10T14:38:26+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2021-12-10T07:44:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9ba74e6c9e9d0c5c1e5792a7111fc7d1a0589cb8'/>
<id>urn:sha1:9ba74e6c9e9d0c5c1e5792a7111fc7d1a0589cb8</id>
<content type='text'>
We have 100+ syzbot reports about netns being dismantled too soon,
still unresolved as of today.

We think a missing get_net() or an extra put_net() is the root cause.

In order to find the bug(s), and be able to spot future ones,
this patch adds CONFIG_NET_NS_REFCNT_TRACKER and new helpers
to precisely pair all put_net() with corresponding get_net().

To use these helpers, each data structure owning a refcount
should also use a "netns_tracker" to pair the get and put.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
</content>
</entry>
</feed>
