summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-02-13Merge branch 'mlxsw-IPIP-cleanups'David S. Miller3-91/+73
Jiri Pirko says: ==================== mlxsw: IPIP cleanups In the first patch, a forgotten #include is added. Even though the code compiles as-is, the include is necessary for modules that should include spectrum_ipip.h. The second patch corrects an assumption that IPv6 tunnels use struct ip_tunnel_parm to store tunnel parameters. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13mlxsw: spectrum: Distinguish between IPv4/6 tunnelsPetr Machata3-91/+72
struct ip_tunnel_parm, where GRE and several other tunnel types hold information, is IPv4-specific. The current router / ipip code in mlxsw however uses it as if it were generic. Make it clear that it's not. Rename many functions from _params_ to _params4_. mlxsw_sp_ipip_parms_saddr() and _daddr() take a proto argument to dispatch on it. Move the dispatch logic to mlxsw_sp_ipip_netdev_saddr() and _daddr(), and replace with single-protocol functions. In struct mlxsw_sp_ipip_entry, move the "parms" field to a (for the time being, singleton) union. Update users throughout. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13mlxsw: spectrum_ipip: Add a forgotten includePetr Machata1-0/+1
struct ip_tunnel_parm, which is used in spectrum_ipip.h, is defined in if_tunnel.h. However, the former neglects to include the latter. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13mlxsw: spectrum_router: Fix error path in mlxsw_sp_vr_createJiri Pirko1-14/+18
Since mlxsw_sp_fib_create() and mlxsw_sp_mr_table_create() use ERR_PTR macro to propagate int err through return of a pointer, the return value is not NULL in case of failure. So if one of the calls fails, one of vr->fib4, vr->fib6 or vr->mr4_table is not NULL and mlxsw_sp_vr_is_used wrongly assumes that vr is in use which leads to crash like following one: [ 1293.949291] BUG: unable to handle kernel NULL pointer dereference at 00000000000006c9 [ 1293.952729] IP: mlxsw_sp_mr_table_flush+0x15/0x70 [mlxsw_spectrum] Fix this by using local variables to hold the pointers and set vr->* only in case everything went fine. Fixes: 76610ebbde18 ("mlxsw: spectrum_router: Refactor virtual router handling") Fixes: a3d9bc506d64 ("mlxsw: spectrum_router: Extend virtual routers with IPv6 support") Fixes: d42b0965b1d4 ("mlxsw: spectrum_router: Add multicast routes notification handling functionality") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: af_unix: fix typo in UNIX_SKB_FRAGS_SZ commentTobias Klauser1-1/+1
Change "minimun" to "minimum". Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13dpaa_eth: fix incorrect commentJake Moroni1-1/+1
The comment stated that a thread was started, but that is not the case. Signed-off-by: Jake Moroni <mail@jakemoroni.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13powerpc/macio: set a proper dma_coherent_maskChristoph Hellwig1-0/+1
We have expected busses to set up a coherent mask to properly use the common dma mapping code for a long time, and now that I've added a warning macio turned out to not set one up yet. This sets it to the same value as the dma_mask, which seems to be what the drivers expect. Reported-by: Mathieu Malaterre <malat@debian.org> Tested-by: Mathieu Malaterre <malat@debian.org> Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-02-13uapi/if_ether.h: move __UAPI_DEF_ETHHDR libc defineHauke Mehrtens2-7/+5
This fixes a compile problem of some user space applications by not including linux/libc-compat.h in uapi/if_ether.h. linux/libc-compat.h checks which "features" the header files, included from the libc, provide to make the Linux kernel uapi header files only provide no conflicting structures and enums. If a user application mixes kernel headers and libc headers it could happen that linux/libc-compat.h gets included too early where not all other libc headers are included yet. Then the linux/libc-compat.h would not prevent all the redefinitions and we run into compile problems. This patch removes the include of linux/libc-compat.h from uapi/if_ether.h to fix the recently introduced case, but not all as this is more or less impossible. It is no problem to do the check directly in the if_ether.h file and not in libc-compat.h as this does not need any fancy glibc header detection as glibc never provided struct ethhdr and should define __UAPI_DEF_ETHHDR by them self when they will provide this. The following test program did not compile correctly any more: #include <linux/if_ether.h> #include <netinet/in.h> #include <linux/in.h> int main(void) { return 0; } Fixes: 6926e041a892 ("uapi/if_ether.h: prevent redefinition of struct ethhdr") Reported-by: Guillaume Nault <g.nault@alphalink.fr> Cc: <stable@vger.kernel.org> # 4.15 Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13blk: optimization for classic pollingNitesh Shetty1-0/+1
This removes the dependency on interrupts to wake up task. Set task state as TASK_RUNNING, if need_resched() returns true, while polling for IO completion. Earlier, polling task used to sleep, relying on interrupt to wake it up. This made some IO take very long when interrupt-coalescing is enabled in NVMe. Reference: http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html Changes since v2->v3: -using __set_current_state() instead of set_current_state() Changes since v1->v2: -setting task state once in blk_poll, instead of multiple callers. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-02-13Merge branch 'Replacing-net_mutex-with-rw_semaphore'David S. Miller45-41/+116
Kirill Tkhai says: ==================== Replacing net_mutex with rw_semaphore this is the third version of the patchset introducing net_sem instead of net_mutex. The patchset adds net_sem in addition to net_mutex and allows pernet_operations to be "async". This flag means, the pernet_operations methods are safe to be executed with any other pernet_operations (un)initializing another net. If there are only async pernet_operations in the system, net_mutex is not used either for setup_net() or for cleanup_net(). The pernet_operations converted in this patchset allow to create minimal .config to have network working, and the changes improve the performance like you may see below: %for i in {1..10000}; do unshare -n bash -c exit; done *before* real 1m40,377s user 0m9,672s sys 0m19,928s *after* real 0m17,007s user 0m5,311s sys 0m11,779 (5.8 times faster) In the future, when all pernet_operations become async, we'll just remove this "async" field tree-wide. All the new logic is concentrated in patches [1-5/32]. The rest of patches converts specific operations: review, rationale of they can be converted, and setting of async flag. Kirill v3: Improved patches descriptions. Added comment into [5/32]. Added [32/32] converting netlink_tap_net_ops (new pernet operations introduced in 2018). v2: Single patch -> patchset with rationale of every conversion ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert netlink_tap_net_opsKirill Tkhai1-0/+1
These pernet_operations init just allocated net memory, and they obviously can be executed in parallel in any others. v3: New Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert diag_net_opsKirill Tkhai1-0/+1
These pernet operations just create and destroy netlink socket. The socket is pernet and else operations don't touch it. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert default_device_opsKirill Tkhai1-0/+1
These pernet operations consist of exit() and exit_batch() methods. default_device_exit() moves not-local and virtual devices to init_net. There is nothing exciting, because this may happen in any time on a working system, and rtnl_lock() and synchronize_net() protect us from all cases of external dereference. The same for default_device_exit_batch(). Similar unregisteration may happen in any time on a system. Here several lists (like todo_list), which are accessed under rtnl_lock(). After rtnl_unlock() and netdev_run_todo() all the devices are flushed. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert loopback_net_opsKirill Tkhai1-0/+1
These pernet_operations have only init() method. It allocates memory for net_device, calls register_netdev() and assigns net::loopback_dev. register_netdev() is allowed be used without additional locks, as it's synchronized on rtnl_lock(). There are many examples of using this functon directly from ioctl(). The only difference, compared to ioctl(), is that net is not completely alive at this moment. But it looks like, there is no way for parallel pernet_operations to dereference the net_device, as the most of struct net_device lists, where it's linked, are related to net, and the net is not liked. The exceptions are net_device::unreg_list, close_list, todo_list, used for unregistration, and ::link_watch_list, where net_device may be linked to global lists. Unregistration of loopback_dev obviously can't happen, when loopback_net_init() is executing, as the net as alive. It occurs in default_device_ops, which currently requires net_mutex, and it behaves as a barrier at the moment. It will be considered in next patch. Speaking about link_watch_list, it seems, there is no way for loopback_dev at time of registration to be linked in lweventlist and be available for another pernet_operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert addrconf_opsKirill Tkhai1-0/+1
These pernet_operations (un)register sysctl, which are not touched by anybody else. So, it's safe to make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert ipv4_sysctl_opsKirill Tkhai1-0/+1
These pernet_operations create and destroy sysctl, which are not touched by anybody else. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert packet_net_opsKirill Tkhai1-0/+1
These pernet_operations just create and destroy /proc entry, and another operations do not touch it. Also, nobody else are interested in foreign net::packet::sklist. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert unix_net_opsKirill Tkhai1-0/+1
These pernet_operations are just create and destroy /proc and sysctl entries, and are not touched by foreign pernet_operations. So, we are able to make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert pernet_subsys, registered from inet_init()Kirill Tkhai18-0/+23
arp_net_ops just addr/removes /proc entry. devinet_ops allocates and frees duplicate of init_net tables and (un)registers sysctl entries. fib_net_ops allocates and frees pernet tables, creates/destroys netlink socket and (un)initializes /proc entries. Foreign pernet_operations do not touch them. ip_rt_proc_ops only modifies pernet /proc entries. xfrm_net_ops creates/destroys /proc entries, allocates/frees pernet statistics, hashes and tables, and (un)initializes sysctl files. These are not touched by foreigh pernet_operations xfrm4_net_ops allocates/frees private pernet memory, and configures sysctls. sysctl_route_ops creates/destroys sysctls. rt_genid_ops only initializes fields of just allocated net. ipv4_inetpeer_ops allocated/frees net private memory. igmp_net_ops just creates/destroys /proc files and socket, noone else interested in. tcp_sk_ops seems to be safe, because tcp_sk_init() does not depend on any other pernet_operations modifications. Iteration over hash table in inet_twsk_purge() is made under RCU lock, and it's safe to iterate the table this way. Removing from the table happen from inet_twsk_deschedule_put(), but this function is safe without any extern locks, as it's synchronized inside itself. There are many examples, it's used in different context. So, it's safe to leave tcp_sk_exit_batch() unlocked. tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe. udplite4_net_ops only creates/destroys pernet /proc file. icmp_sk_ops creates percpu sockets, not touched by foreign pernet_operations. ipmr_net_ops creates/destroys pernet fib tables, (un)registers fib rules and /proc files. This seem to be safe to execute in parallel with foreign pernet_operations. af_inet_ops just sets up default parameters of newly created net. ipv4_mib_ops creates and destroys pernet percpu statistics. raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops and ip_proc_ops only create/destroy pernet /proc files. ip4_frags_ops creates and destroys sysctl file. So, it's safe to make the pernet_operations async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert sysctl_core_opsKirill Tkhai1-0/+1
These pernet_operations register and destroy sysctl directory, and it's not interesting for foreign pernet_operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert wext_pernet_opsKirill Tkhai1-0/+1
These pernet_operations initialize and purge net::wext_nlevents queue, and are not touched by foreign pernet_operations. Mark them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert genl_pernet_opsKirill Tkhai1-0/+1
This pernet_operations create and destroy net::genl_sock. Foreign pernet_operations don't touch it. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert subsys_initcall() registered pernet_operations from net/schedKirill Tkhai2-0/+2
psched_net_ops only creates and destroyes /proc entry, and safe to be executed in parallel with any foreigh pernet_operations. tcf_action_net_ops initializes and destructs tcf_action_net::egdev_ht, which is not touched by foreign pernet_operations. So, make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert fib_* pernet_operations, registered via subsys_initcallKirill Tkhai2-0/+2
Both of them create and initialize lists, which are not touched by another foreing pernet_operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert pernet_subsys ops, registered via net_dev_init()Kirill Tkhai2-0/+3
There are: 1)dev_proc_ops and dev_mc_net_ops, which create and destroy pernet proc file and not interesting for another net namespaces; 2)netdev_net_ops, which creates pernet hashes, which are not touched by another pernet_operations. So, make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert proto_net_opsKirill Tkhai1-0/+1
This patch starts to convert pernet_subsys, registered from subsys initcalls. It seems safe to be executed in parallel with others, as it's only creates/destoyes proc entry, which nobody else is not interested in. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert uevent_net_opsKirill Tkhai1-0/+1
uevent_net_init() and uevent_net_exit() create and destroy netlink socket, and these actions serialized in netlink code. Parallel execution with other pernet_operations makes the socket disappear earlier from uevent_sock_list on ->exit. As userspace can't be interested in broadcast messages of dying net, and, as I see, no one in kernel listen them, we may safely make uevent_net_ops async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert audit_net_opsKirill Tkhai1-0/+1
This patch starts to convert pernet_subsys, registered from postcore initcalls. audit_net_init() creates netlink socket, while audit_net_exit() destroys it. The rest of the pernet_list are not interested in the socket, so we make audit_net_ops async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert rtnetlink_net_opsKirill Tkhai1-0/+1
rtnetlink_net_init() and rtnetlink_net_exit() create and destroy netlink socket net::rtnl. The socket is used to send rtnl notification via rtnl_net_notifyid(). There is no a problem to create and destroy it in parallel with other pernet operations, as we link net in setup_net() after the socket is created, and destroy in cleanup_net() after net is unhashed from all the lists and there is no RCU references on it. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert netlink_net_opsKirill Tkhai1-0/+1
The methods of netlink_net_ops create and destroy "netlink" file, which are not interesting for foreigh pernet_operations. So, netlink_net_ops may safely be made async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert net_defaults_opsKirill Tkhai1-0/+1
net_defaults_ops introduce only net_defaults_init_net method, and it acts on net::core::sysctl_somaxconn, which is not interesting for the rest of pernet_subsys and pernet_device lists. Then, make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert net_inuse_opsKirill Tkhai1-0/+1
net_inuse_ops methods expose statistics in /proc. No one from the rest of pernet_subsys or pernet_device lists touch net::core::inuse. So, it's safe to make net_inuse_ops async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert nf_log_net_opsKirill Tkhai1-0/+1
The pernet_operations would have had a problem in parallel execution with others, if init_net had been able to released. But it's not, and the rest is safe for that. There is memory allocation, which nobody else interested in, and sysctl registration. So, we make them async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert netfilter_net_opsKirill Tkhai1-0/+1
Methods netfilter_net_init() and netfilter_net_exit() initialize net::nf::hooks and change net-related proc directory of net. Another pernet_operations are not interested in forein net::nf::hooks or proc entries, so it's safe to make them executed in parallel with methods of other pernet operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert sysctl_pernet_opsKirill Tkhai1-0/+1
This patch starts to convert pernet_subsys, registered from core initcalls. Methods sysctl_net_init() and sysctl_net_exit() initialize net::sysctls table of a namespace. pernet_operations::init()/exit() methods from the rest of the list do not touch net::sysctls of strangers, so it's safe to execute sysctl_pernet_ops's methods in parallel with any other pernet_operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert net_ns_ops methodsKirill Tkhai1-0/+1
This patch starts to convert pernet_subsys, registered from pure initcalls. net_ns_ops::net_ns_net_init/net_ns_net_init, methods use only ida_simple_* functions, which are not need a synchronization. They are synchronized by idr subsystem. So, net_ns_ops methods are able to be executed in parallel with methods of other pernet operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert proc_net_ns_opsKirill Tkhai1-0/+1
This patch starts to convert pernet_subsys, registered before initcalls. proc_net_ns_ops::proc_net_ns_init()/proc_net_ns_exit() {un,}register pernet net->proc_net and ->proc_net_stat. Constructors and destructors of another pernet_operations are not interested in foreign net's proc_net and proc_net_stat. Proc filesystem privitives are synchronized on proc_subdir_lock. So, proc_net_ns_ops methods are able to be executed in parallel with methods of any other pernet operations. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Allow pernet_operations to be executed in parallelKirill Tkhai2-10/+26
This adds new pernet_operations::async flag to indicate operations, which ->init(), ->exit() and ->exit_batch() methods are allowed to be executed in parallel with the methods of any other pernet_operations. When there are only asynchronous pernet_operations in the system, net_mutex won't be taken for a net construction and destruction. Also, remove BUG_ON(mutex_is_locked()) from net_assign_generic() without replacing with the equivalent net_sem check, as there is one more lockdep assert below. v3: Add comment near net_mutex. Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Move mutex_unlock() in cleanup_net() upKirill Tkhai1-1/+2
net_sem protects from pernet_list changing, while ops_free_list() makes simple kfree(), and it can't race with other pernet_operations callbacks. So we may release net_mutex earlier then it was. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Introduce net_sem for protection of pernet_listKirill Tkhai3-15/+29
Currently, the mutex is mostly used to protect pernet operations list. It orders setup_net() and cleanup_net() with parallel {un,}register_pernet_operations() calls, so ->exit{,batch} methods of the same pernet operations are executed for a dying net, as were used to call ->init methods, even after the net namespace is unlinked from net_namespace_list in cleanup_net(). But there are several problems with scalability. The first one is that more than one net can't be created or destroyed at the same moment on the node. For big machines with many cpus running many containers it's very sensitive. The second one is that it's need to synchronize_rcu() after net is removed from net_namespace_list(): Destroy net_ns: cleanup_net() mutex_lock(&net_mutex) list_del_rcu(&net->list) synchronize_rcu() <--- Sleep there for ages list_for_each_entry_reverse(ops, &pernet_list, list) ops_exit_list(ops, &net_exit_list) list_for_each_entry_reverse(ops, &pernet_list, list) ops_free_list(ops, &net_exit_list) mutex_unlock(&net_mutex) This primitive is not fast, especially on the systems with many processors and/or when preemptible RCU is enabled in config. So, all the time, while cleanup_net() is waiting for RCU grace period, creation of new net namespaces is not possible, the tasks, who makes it, are sleeping on the same mutex: Create net_ns: copy_net_ns() mutex_lock_killable(&net_mutex) <--- Sleep there for ages I observed 20-30 seconds hangs of "unshare -n" on ordinary 8-cpu laptop with preemptible RCU enabled after CRIU tests round is finished. The solution is to convert net_mutex to the rw_semaphore and add fine grain locks to really small number of pernet_operations, what really need them. Then, pernet_operations::init/::exit methods, modifying the net-related data, will require down_read() locking only, while down_write() will be used for changing pernet_list (i.e., when modules are being loaded and unloaded). This gives signify performance increase, after all patch set is applied, like you may see here: %for i in {1..10000}; do unshare -n bash -c exit; done *before* real 1m40,377s user 0m9,672s sys 0m19,928s *after* real 0m17,007s user 0m5,311s sys 0m11,779 (5.8 times faster) This patch starts replacing net_mutex to net_sem. It adds rw_semaphore, describes the variables it protects, and makes to use, where appropriate. net_mutex is still present, and next patches will kick it out step-by-step. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Cleanup in copy_net_ns()Kirill Tkhai1-11/+9
Line up destructors actions in the revers order to constructors. Next patches will add more actions, and this will be comfortable, if there is the such order. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Assign net to net_namespace_list in setup_net()Kirill Tkhai1-10/+3
This patch merges two repeating pieces of code in one, and they will live in setup_net() now. The only change is that assignment: init_net_initialized = true; becomes reordered with: list_add_tail_rcu(&net->list, &net_namespace_list); The order does not have visible effect, and it is a simple cleanup because of: init_net_initialized is used in !CONFIG_NET_NS case to order proc_net_ns_ops registration occuring at boot time: start_kernel()->proc_root_init()->proc_net_init(), with net_ns_init()->setup_net(&init_net, &init_user_ns) also occuring in boot time from the same init_task. When there are no another tasks to race with them, for the single task it does not matter, which order two sequential independent loads should be made. So we make them reordered. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13x86/mm, mm/hwpoison: Don't unconditionally unmap kernel 1:1 pagesTony Luck5-18/+26
In the following commit: ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages") ... we added code to memory_failure() to unmap the page from the kernel 1:1 virtual address space to avoid speculative access to the page logging additional errors. But memory_failure() may not always succeed in taking the page offline, especially if the page belongs to the kernel. This can happen if there are too many corrected errors on a page and either mcelog(8) or drivers/ras/cec.c asks to take a page offline. Since we remove the 1:1 mapping early in memory_failure(), we can end up with the page unmapped, but still in use. On the next access the kernel crashes :-( There are also various debug paths that call memory_failure() to simulate occurrence of an error. Since there is no actual error in memory, we don't need to map out the page for those cases. Revert most of the previous attempt and keep the solution local to arch/x86/kernel/cpu/mcheck/mce.c. Unmap the page only when: 1) there is a real error 2) memory_failure() succeeds. All of this only applies to 64-bit systems. 32-bit kernel doesn't map all of memory into kernel space. It isn't worth adding the code to unmap the piece that is mapped because nobody would run a 32-bit kernel on a machine that has recoverable machine checks. Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert (Persistent Memory) <elliott@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org #v4.14 Fixes: ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13locking/semaphore: Update the file path in documentationTycho Andersen1-1/+1
While reading this header I noticed that the locking stuff has moved to kernel/locking/*, so update the path in semaphore.h to point to that. Signed-off-by: Tycho Andersen <tycho@tycho.ws> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180201114119.1090-1-tycho@tycho.ws Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13locking/atomic/bitops: Document and clarify ordering semantics for failed ↵Will Deacon2-2/+8
test_and_{}_bit() A test_and_{}_bit() operation fails if the value of the bit is such that the modification does not take place. For example, if test_and_set_bit() returns 1. In these cases, follow the behaviour of cmpxchg and allow the operation to be unordered. This also applies to test_and_set_bit_lock() if the lock is found to be be taken already. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518528619-20049-1-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13locking/qspinlock: Ensure node->count is updated before initialising nodeWill Deacon1-0/+8
When queuing on the qspinlock, the count field for the current CPU's head node is incremented. This needn't be atomic because locking in e.g. IRQ context is balanced and so an IRQ will return with node->count as it found it. However, the compiler could in theory reorder the initialisation of node[idx] before the increment of the head node->count, causing an IRQ to overwrite the initialised node and potentially corrupt the lock state. Avoid the potential for this harmful compiler reordering by placing a barrier() between the increment of the head node->count and the subsequent node initialisation. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13locking/qspinlock: Ensure node is initialised before updating prev->nextWill Deacon1-6/+7
When a locker ends up queuing on the qspinlock locking slowpath, we initialise the relevant mcs node and publish it indirectly by updating the tail portion of the lock word using xchg_tail. If we find that there was a pre-existing locker in the queue, we subsequently update their ->next field to point at our node so that we are notified when it's our turn to take the lock. This can be roughly illustrated as follows: /* Initialise the fields in node and encode a pointer to node in tail */ tail = initialise_node(node); /* * Exchange tail into the lockword using an atomic read-modify-write * operation with release semantics */ old = xchg_tail(lock, tail); /* If there was a pre-existing waiter ... */ if (old & _Q_TAIL_MASK) { prev = decode_tail(old); smp_read_barrier_depends(); /* ... then update their ->next field to point to node. WRITE_ONCE(prev->next, node); } The conditional update of prev->next therefore relies on the address dependency from the result of xchg_tail ensuring order against the prior initialisation of node. However, since the release semantics of the xchg_tail operation apply only to the write portion of the RmW, then this ordering is not guaranteed and it is possible for the CPU to return old before the writes to node have been published, consequently allowing us to point prev->next to an uninitialised node. This patch fixes the problem by making the update of prev->next a RELEASE operation, which also removes the reliance on dependency ordering. Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13x86/error_inject: Make just_return_func() globally visibleArnd Bergmann1-0/+1
With link time optimizations enabled, I get a link failure: ./ccLbOEHX.ltrans19.ltrans.o: In function `override_function_with_return': <artificial>:(.text+0x7f3): undefined reference to `just_return_func' Marking the symbol .globl makes it work as expected. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolas Pitre <nico@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: 540adea3809f ("error-injection: Separate error-injection from kprobe") Link: http://lkml.kernel.org/r/20180202145634.200291-3-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13x86/platform/UV: Fix GAM Range Table entries less than 1GBmike.travis@hpe.com1-3/+12
The latest UV platforms include the new ApachePass NVDIMMs into the UV address space. This has introduced address ranges in the Global Address Map Table that are less than the previous lowest range, which was 2GB. Fix the address calculation so it accommodates address ranges from bytes to exabytes. Signed-off-by: Mike Travis <mike.travis@hpe.com> Reviewed-by: Andrew Banman <andrew.banman@hpe.com> Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180205221503.190219903@stormcage.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13MIPS: Fix incorrect mem=X@Y handlingMarcin Nowakowski1-4/+12
Commit 73fbc1eba7ff ("MIPS: fix mem=X@Y commandline processing") added a fix to ensure that the memory range between PHYS_OFFSET and low memory address specified by mem= cmdline argument is not later processed by free_all_bootmem. This change was incorrect for systems where the commandline specifies more than 1 mem argument, as it will cause all memory between PHYS_OFFSET and each of the memory offsets to be marked as reserved, which results in parts of the RAM marked as reserved (Creator CI20's u-boot has a default commandline argument 'mem=256M@0x0 mem=768M@0x30000000'). Change the behaviour to ensure that only the range between PHYS_OFFSET and the lowest start address of the memories is marked as protected. This change also ensures that the range is marked protected even if it's only defined through the devicetree and not only via commandline arguments. Reported-by: Mathieu Malaterre <mathieu.malaterre@gmail.com> Signed-off-by: Marcin Nowakowski <marcin.nowakowski@mips.com> Fixes: 73fbc1eba7ff ("MIPS: fix mem=X@Y commandline processing") Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: <stable@vger.kernel.org> # v4.11+ Tested-by: Mathieu Malaterre <malat@debian.org> Patchwork: https://patchwork.linux-mips.org/patch/18562/ Signed-off-by: James Hogan <jhogan@kernel.org>