Age | Commit message (Collapse) | Author | Files | Lines |
|
cleared"
This reverts commit c46bfba1337d ("connector: Fix proc_event_num_listeners
count not cleared").
It is not accurate to reset proc_event_num_listeners according to
cn_netlink_send_mult() return value -ESRCH.
In the case of stress-ng netlink-proc, -ESRCH will always be returned,
because netlink_broadcast_filtered will return -ESRCH,
which may cause stress-ng netlink-proc performance degradation.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401112259.b23a1567-oliver.sang@intel.com
Fixes: c46bfba1337d ("connector: Fix proc_event_num_listeners count not cleared")
Signed-off-by: Keqi Wang <wangkeqi_chris@163.com>
Link: https://lore.kernel.org/r/20240209091659.68723-1-wangkeqi_chris@163.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
e009b2efb7a8 ("bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()")
0f2b21477988 ("bnxt_en: Fix compile error without CONFIG_RFS_ACCEL")
https://lore.kernel.org/all/20240105115509.225aa8a2@canb.auug.org.au/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we register a cn_proc listening event, the proc_event_num_listener
variable will be incremented by one, but if PROC_CN_MCAST_IGNORE is
not called, the count will not decrease.
This will cause the proc_*_connector function to take the wrong path.
It will reappear when the forkstat tool exits via ctrl + c.
We solve this problem by determining whether
there are still listeners to clear proc_event_num_listener.
Signed-off-by: wangkeqi <wangkeqiwang@didiglobal.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make the code using filter function a bit nicer by consolidating the
filter function arguments using typedef.
Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Check that sk_user_data is not NULL, else return from cn_filter().
Could not reproduce this issue, but Oliver Sang verified it has fixed
the "Closes" problem below.
Fixes: 2aa1f7a1f47c ("connector/cn_proc: Add filtering to fix some bugs")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202309201456.84c19e27-oliver.sang@intel.com/
Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Link: https://lore.kernel.org/r/20231020234058.2232347-1-anjali.k.kulkarni@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
There were a couple of reasons for not allowing non-root users access
initially - one is there was some point no proper receive buffer
management in place for netlink multicast. But that should be long
fixed. See link below for more context.
Second is that some of the messages may contain data that is root only. But
this should be handled with a finer granularity, which is being done at the
protocol layer. The only problematic protocols are nf_queue and the
firewall netlink. Hence, this restriction for non-root access was relaxed
for NETLINK_ROUTE initially:
https://lore.kernel.org/all/20020612013101.A22399@wotan.suse.de/
This restriction has also been removed for following protocols:
NETLINK_KOBJECT_UEVENT, NETLINK_AUDIT, NETLINK_SOCK_DIAG,
NETLINK_GENERIC, NETLINK_SELINUX.
Since process connector messages are not sensitive (process fork, exit
notifications etc.), and anyone can read /proc data, we can allow non-root
access here. However, since process event notification is not the only
consumer of NETLINK_CONNECTOR, we can make this change even more
fine grained than the protocol level, by checking for multicast group
within the protocol.
Allow non-root access for NETLINK_CONNECTOR via NL_CFG_F_NONROOT_RECV
but add new bind function cn_bind(), which allows non-root access only
for CN_IDX_PROC multicast group.
Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds the capability to filter messages sent by the proc
connector on the event type supplied in the message from the client
to the connector. The client can register to listen for an event type
given in struct proc_input.
This event based filteting will greatly enhance performance - handling
8K exits takes about 70ms, whereas 8K-forks + 8K-exits takes about 150ms
& handling 8K-forks + 8K-exits + 8K-execs takes 200ms. There are currently
9 different types of events, and we need to listen to all of them. Also,
measuring the time using pidfds for monitoring 8K process exits took
much longer - 200ms, as compared to 70ms using only exit notifications of
proc connector.
We also add a new event type - PROC_EVENT_NONZERO_EXIT, which is
only sent by kernel to a listening application when any process exiting,
has a non-zero exit status. This will help the clients like Oracle DB,
where a monitoring process wants notfications for non-zero process exits
so it can cleanup after them.
This kind of a new event could also be useful to other applications like
Google's lmkd daemon, which needs a killed process's exit notification.
The patch takes care that existing clients using old mechanism of not
sending the event type work without any changes.
cn_filter function checks to see if the event type being notified via
proc connector matches the event type requested by client, before
sending(matches) or dropping(does not match) a packet.
Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The current proc connector code has the foll. bugs - if there are more
than one listeners for the proc connector messages, and one of them
deregisters for listening using PROC_CN_MCAST_IGNORE, they will still get
all proc connector messages, as long as there is another listener.
Another issue is if one client calls PROC_CN_MCAST_LISTEN, and another one
calls PROC_CN_MCAST_IGNORE, then both will end up not getting any messages.
This patch adds filtering and drops packet if client has sent
PROC_CN_MCAST_IGNORE. This data is stored in the client socket's
sk_user_data. In addition, we only increment or decrement
proc_event_num_listeners once per client. This fixes the above issues.
cn_release is the release function added for NETLINK_CONNECTOR. It uses
the newly added netlink_release function added to netlink_sock. It will
free sk_user_data.
Signed-off-by: Anjali Kulkarni <anjali.k.kulkarni@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch replaces open code with task_is_in_init_pid_ns() to check if
a task is in root PID namespace.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The connector driver never modifies any cb_id passed to it, so add a const
qualifier to those arguments so callers can declare their struct cb_id as a
constant object.
Fixes build warnings like these when passing a constant struct cb_id:
warning: passing argument 1 of ‘cn_add_callback’ discards ‘const’ qualifier from pointer target
Signed-off-by: Geoff Levand <geoff@infradead.org>
Link: https://lore.kernel.org/r/a9e49c9e-67fa-16e7-0a6b-72f6bd30c58a@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Simplify the return expression.
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.
This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.
There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'
In order to convert all of them to 1 tab + 'help', I ran the
following commend:
$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
send_msg() disables preemption to avoid out-of-order messages. As the
code inside the preempt disabled section acquires regular spinlocks,
which are converted to 'sleeping' spinlocks on a PREEMPT_RT kernel and
eventually calls into a memory allocator, this conflicts with the RT
semantics.
Convert it to a local_lock which allows RT kernels to substitute them with
a real per CPU lock. On non RT kernels this maps to preempt_disable() as
before. No functional change.
[bigeasy: Patch description]
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200527201119.1692513-6-bigeasy@linutronix.de
|
|
A small cleanup: this callback is never used.
Originally fixed by Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
for OpenVZ7 bug OVZ-6877
cc: stanislav.kinsburskiy@gmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
59 temple place suite 330 boston ma 02111 1307 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 1334 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
proc_exit_connector() uses ->real_parent lockless. This is not
safe that its parent can go away at any moment, so use RCU to
protect it, and ensure that this task is not released.
[ 747.624551] ==================================================================
[ 747.632946] BUG: KASAN: use-after-free in proc_exit_connector+0x1f7/0x310
[ 747.640686] Read of size 4 at addr ffff88a0276988e0 by task sshd/2882
[ 747.648032]
[ 747.649804] CPU: 11 PID: 2882 Comm: sshd Tainted: G E 4.19.26-rc2 #11
[ 747.658629] Hardware name: IBM x3550M4 -[7914OFV]-/00AM544, BIOS -[D7E142BUS-1.71]- 07/31/2014
[ 747.668419] Call Trace:
[ 747.671269] dump_stack+0xf0/0x19b
[ 747.675186] ? show_regs_print_info+0x5/0x5
[ 747.679988] ? kmsg_dump_rewind_nolock+0x59/0x59
[ 747.685302] print_address_description+0x6a/0x270
[ 747.691162] kasan_report+0x258/0x380
[ 747.695835] ? proc_exit_connector+0x1f7/0x310
[ 747.701402] proc_exit_connector+0x1f7/0x310
[ 747.706767] ? proc_coredump_connector+0x2d0/0x2d0
[ 747.712715] ? _raw_write_unlock_irq+0x29/0x50
[ 747.718270] ? _raw_write_unlock_irq+0x29/0x50
[ 747.723820] ? ___preempt_schedule+0x16/0x18
[ 747.729193] ? ___preempt_schedule+0x16/0x18
[ 747.734574] do_exit+0xa11/0x14f0
[ 747.738880] ? mm_update_next_owner+0x590/0x590
[ 747.744525] ? debug_show_all_locks+0x3c0/0x3c0
[ 747.761448] ? ktime_get_coarse_real_ts64+0xeb/0x1c0
[ 747.767589] ? lockdep_hardirqs_on+0x1a6/0x290
[ 747.773154] ? check_chain_key+0x139/0x1f0
[ 747.778345] ? check_flags.part.35+0x240/0x240
[ 747.783908] ? __lock_acquire+0x2300/0x2300
[ 747.789171] ? _raw_spin_unlock_irqrestore+0x59/0x70
[ 747.795316] ? _raw_spin_unlock_irqrestore+0x59/0x70
[ 747.801457] ? do_raw_spin_unlock+0x10f/0x1e0
[ 747.806914] ? do_raw_spin_trylock+0x120/0x120
[ 747.812481] ? preempt_count_sub+0x14/0xc0
[ 747.817645] ? _raw_spin_unlock+0x2e/0x50
[ 747.822708] ? __handle_mm_fault+0x12db/0x1fa0
[ 747.828367] ? __pmd_alloc+0x2d0/0x2d0
[ 747.833143] ? check_noncircular+0x50/0x50
[ 747.838309] ? match_held_lock+0x7f/0x340
[ 747.843380] ? check_noncircular+0x50/0x50
[ 747.848561] ? handle_mm_fault+0x21a/0x5f0
[ 747.853730] ? check_flags.part.35+0x240/0x240
[ 747.859290] ? check_chain_key+0x139/0x1f0
[ 747.864474] ? __do_page_fault+0x40f/0x760
[ 747.869655] ? __audit_syscall_entry+0x4b/0x1f0
[ 747.875319] ? syscall_trace_enter+0x1d5/0x7b0
[ 747.880877] ? trace_raw_output_preemptirq_template+0x90/0x90
[ 747.887895] ? trace_raw_output_sys_exit+0x80/0x80
[ 747.893860] ? up_read+0x3b/0x90
[ 747.898142] ? stop_critical_timings+0x260/0x260
[ 747.903909] do_group_exit+0xe0/0x1c0
[ 747.908591] ? __x64_sys_exit+0x30/0x30
[ 747.913460] ? trace_raw_output_preemptirq_template+0x90/0x90
[ 747.920485] ? tracer_hardirqs_on+0x270/0x270
[ 747.925956] __x64_sys_exit_group+0x28/0x30
[ 747.931214] do_syscall_64+0x117/0x400
[ 747.935988] ? syscall_return_slowpath+0x2f0/0x2f0
[ 747.941931] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 747.947788] ? trace_hardirqs_on_caller+0x1d0/0x1d0
[ 747.953838] ? lockdep_sys_exit+0x16/0x8e
[ 747.958915] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 747.964784] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 747.971021] RIP: 0033:0x7f572f154c68
[ 747.975606] Code: Bad RIP value.
[ 747.979791] RSP: 002b:00007ffed2dfaa58 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 747.989324] RAX: ffffffffffffffda RBX: 00007f572f431840 RCX: 00007f572f154c68
[ 747.997910] RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
[ 748.006495] RBP: 0000000000000001 R08: 00000000000000e7 R09: fffffffffffffee0
[ 748.015079] R10: 00007f572f4387e8 R11: 0000000000000246 R12: 00007f572f431840
[ 748.023664] R13: 000055a7f90f2c50 R14: 000055a7f96e2310 R15: 000055a7f96e2310
[ 748.032287]
[ 748.034509] Allocated by task 2300:
[ 748.038982] kasan_kmalloc+0xa0/0xd0
[ 748.043562] kmem_cache_alloc_node+0xf5/0x2e0
[ 748.049018] copy_process+0x1781/0x4790
[ 748.053884] _do_fork+0x166/0x9a0
[ 748.058163] do_syscall_64+0x117/0x400
[ 748.062943] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 748.069180]
[ 748.071405] Freed by task 15395:
[ 748.075591] __kasan_slab_free+0x130/0x180
[ 748.080752] kmem_cache_free+0xc2/0x310
[ 748.085619] free_task+0xea/0x130
[ 748.089901] __put_task_struct+0x177/0x230
[ 748.095063] finish_task_switch+0x51b/0x5d0
[ 748.100315] __schedule+0x506/0xfa0
[ 748.104791] schedule+0xca/0x260
[ 748.108978] futex_wait_queue_me+0x27e/0x420
[ 748.114333] futex_wait+0x251/0x550
[ 748.118814] do_futex+0x75b/0xf80
[ 748.123097] __x64_sys_futex+0x231/0x2a0
[ 748.128065] do_syscall_64+0x117/0x400
[ 748.132835] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 748.139066]
[ 748.141289] The buggy address belongs to the object at ffff88a027698000
[ 748.141289] which belongs to the cache task_struct of size 12160
[ 748.156589] The buggy address is located 2272 bytes inside of
[ 748.156589] 12160-byte region [ffff88a027698000, ffff88a02769af80)
[ 748.171114] The buggy address belongs to the page:
[ 748.177055] page:ffffea00809da600 count:1 mapcount:0 mapping:ffff888107d01e00 index:0x0 compound_mapcount: 0
[ 748.189136] flags: 0x57ffffc0008100(slab|head)
[ 748.194688] raw: 0057ffffc0008100 ffffea00809a3200 0000000300000003 ffff888107d01e00
[ 748.204424] raw: 0000000000000000 0000000000020002 00000001ffffffff 0000000000000000
[ 748.214146] page dumped because: kasan: bad access detected
[ 748.220976]
[ 748.223197] Memory state around the buggy address:
[ 748.229128] ffff88a027698780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 748.238271] ffff88a027698800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 748.247414] >ffff88a027698880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 748.256564] ^
[ 748.264267] ffff88a027698900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 748.273493] ffff88a027698980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 748.282630] ==================================================================
Fixes: b086ff87251b4a4 ("connector: add parent pid and tgid to coredump and exit events")
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix a build warning in connector.c when CONFIG_PROC_FS is not enabled
by marking the unused function as __maybe_unused.
../drivers/connector/connector.c:242:12: warning: 'cn_proc_show' defined but not used [-Wunused-function]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
|
|
Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.
All trivial callers converted over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The intention is to get notified of process failures as soon
as possible, before a possible core dumping (which could be very long)
(e.g. in some process-manager). Coredump and exit process events
are perfect for such use cases (see 2b5faa4c553f "connector: Added
coredumping event to the process connector").
The problem is that for now the process-manager cannot know the parent
of a dying process using connectors. This could be useful if the
process-manager should monitor for failures only children of certain
parents, so we could filter the coredump and exit events by parent
process and/or thread ID.
Add parent pid and tgid to coredump and exit process connectors event
data.
Signed-off-by: Stefan Strogin <sstrogin@cisco.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable cn_callback_entry.refcnt is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The Kconfig controlling build of this code is currently:
drivers/connector/Kconfig:config PROC_EVENTS
drivers/connector/Kconfig: bool "Report process events to userspace"
...meaning that it currently is not being built as a module by anyone.
Lets remove the two modular references, so that when reading the driver
there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit.
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The proc connector messages include a sequence number, allowing userspace
programs to detect lost messages. However, performing this detection is
currently more difficult than necessary, since netlink messages can be
delivered to the application out-of-order. To fix this, leave pre-emption
disabled during cn_netlink_send(), and use GFP_NOWAIT.
The following was written as a test case. Building the kernel w/ make -j32
proved a reliable way to generate out-of-order cn_proc messages.
int
main(int argc, char *argv[])
{
static uint32_t last_seq[CPU_SETSIZE], seq;
int cpu, fd;
struct sockaddr_nl sa;
struct __attribute__((aligned(NLMSG_ALIGNTO))) {
struct nlmsghdr nl_hdr;
struct __attribute__((__packed__)) {
struct cn_msg cn_msg;
struct proc_event cn_proc;
};
} rmsg;
struct __attribute__((aligned(NLMSG_ALIGNTO))) {
struct nlmsghdr nl_hdr;
struct __attribute__((__packed__)) {
struct cn_msg cn_msg;
enum proc_cn_mcast_op cn_mcast;
};
} smsg;
fd = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
if (fd < 0) {
perror("socket");
}
sa.nl_family = AF_NETLINK;
sa.nl_groups = CN_IDX_PROC;
sa.nl_pid = getpid();
if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
perror("bind");
}
memset(&smsg, 0, sizeof(smsg));
smsg.nl_hdr.nlmsg_len = sizeof(smsg);
smsg.nl_hdr.nlmsg_pid = getpid();
smsg.nl_hdr.nlmsg_type = NLMSG_DONE;
smsg.cn_msg.id.idx = CN_IDX_PROC;
smsg.cn_msg.id.val = CN_VAL_PROC;
smsg.cn_msg.len = sizeof(enum proc_cn_mcast_op);
smsg.cn_mcast = PROC_CN_MCAST_LISTEN;
if (send(fd, &smsg, sizeof(smsg), 0) != sizeof(smsg)) {
perror("send");
}
while (recv(fd, &rmsg, sizeof(rmsg), 0) == sizeof(rmsg)) {
cpu = rmsg.cn_proc.cpu;
if (cpu < 0) {
continue;
}
seq = rmsg.cn_msg.seq;
if ((last_seq[cpu] != 0) && (seq != last_seq[cpu] + 1)) {
printf("out-of-order seq=%d on cpu=%d\n", seq, cpu);
}
last_seq[cpu] = seq;
}
/* NOTREACHED */
perror("recv");
return -1;
}
Signed-off-by: Aaron Campbell <aaron@monkey.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dmitry reports memleak with syskaller program.
Problem is that connector bumps skb usecount but might not invoke callback.
So move skb_get to where we invoke the callback.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Support for keyword 'boolean' will be dropped later on.
No functional change.
Reference: http://lkml.kernel.org/r/cover.1418003065.git.cj@linux.com
Signed-off-by: Christoph Jaeger <cj@linux.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
|
|
The struct cn_msg len field comes from userspace and needs to be
validated. More logical to do so here where the cn_msg pointer is
pulled out of the sk_buff than the callback which is passed cn_msg *
and might assume no validation is needed.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David Fries <David@Fries.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Replace the ever recurring:
ts = ktime_get_ts();
ns = timespec_to_ns(&ts);
with
ns = ktime_get_ns();
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: John Stultz <john.stultz@linaro.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc into next
Pull char/misc driver patches from Greg KH:
"Here is the big char / misc driver update for 3.16-rc1.
Lots of different driver updates for a variety of different drivers
and minor driver subsystems.
All have been in linux-next with no reported issues"
* tag 'char-misc-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (79 commits)
hv: use correct order when freeing monitor_pages
spmi: of: fixup generic SPMI devicetree binding example
applicom: dereferencing NULL on error path
misc: genwqe: fix uninitialized return value in genwqe_free_sync_sgl()
miscdevice.h: Simple syntax fix to make pointers consistent.
MAINTAINERS: Add miscdevice.h to file list for char/misc drivers.
mcb: Add support for shared PCI IRQs
drivers: Remove duplicate conditionally included subdirs
misc: atmel_pwm: only build for supported platforms
mei: me: move probe quirk to cfg structure
mei: add per device configuration
mei: me: read H_CSR after asserting reset
mei: me: drop harmful wait optimization
mei: me: fix hw ready reset flow
mei: fix memory leak of mei_clients array
uio: fix vma io range check in mmap
drivers: uio_dmem_genirq: Fix memory leak in uio_dmem_genirq_probe()
w1: do not unlock unheld list_mutex in __w1_remove_master_device()
w1: optional bundling of netlink kernel replies
connector: allow multiple messages to be sent in one packet
...
|
|
This increases the amount of bundling to reduce the number of packets
sent. For the one wire use there can be multiple struct
w1_netlink_cmd in a struct w1_netlink_msg and multiple of those in
struct cn_msg, and with this change multiple of those in a struct
nlmsghdr, and at each level the len identifies there being multiple of
the next.
Signed-off-by: David Fries <David@Fries.net>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
It is possible by passing a netlink socket to a more privileged
executable and then to fool that executable into writing to the socket
data that happens to be valid netlink message to do something that
privileged executable did not intend to do.
To keep this from happening replace bare capable and ns_capable calls
with netlink_capable, netlink_net_calls and netlink_ns_capable calls.
Which act the same as the previous calls except they verify that the
opener of the socket had the desired permissions as well.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking updates from David Miller:
"Here is my initial pull request for the networking subsystem during
this merge window:
1) Support for ESN in AH (RFC 4302) from Fan Du.
2) Add full kernel doc for ethtool command structures, from Ben
Hutchings.
3) Add BCM7xxx PHY driver, from Florian Fainelli.
4) Export computed TCP rate information in netlink socket dumps, from
Eric Dumazet.
5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas
Dichtel.
6) Convert many drivers to pci_enable_msix_range(), from Alexander
Gordeev.
7) Record SKB timestamps more efficiently, from Eric Dumazet.
8) Switch to microsecond resolution for TCP round trip times, also
from Eric Dumazet.
9) Clean up and fix 6lowpan fragmentation handling by making use of
the existing inet_frag api for it's implementation.
10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss.
11) Auto size SKB lengths when composing netlink messages based upon
past message sizes used, from Eric Dumazet.
12) qdisc dumps can take a long time, add a cond_resched(), From Eric
Dumazet.
13) Sanitize netpoll core and drivers wrt. SKB handling semantics.
Get rid of never-used-in-tree netpoll RX handling. From Eric W
Biederman.
14) Support inter-address-family and namespace changing in VTI tunnel
driver(s). From Steffen Klassert.
15) Add Altera TSE driver, from Vince Bridgers.
16) Optimizing csum_replace2() so that it doesn't adjust the checksum
by checksumming the entire header, from Eric Dumazet.
17) Expand BPF internal implementation for faster interpreting, more
direct translations into JIT'd code, and much cleaner uses of BPF
filtering in non-socket ocntexts. From Daniel Borkmann and Alexei
Starovoitov"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits)
netpoll: Use skb_irq_freeable to make zap_completion_queue safe.
net: Add a test to see if a skb is freeable in irq context
qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port'
net: ptp: move PTP classifier in its own file
net: sxgbe: make "core_ops" static
net: sxgbe: fix logical vs bitwise operation
net: sxgbe: sxgbe_mdio_register() frees the bus
Call efx_set_channels() before efx->type->dimension_resources()
xen-netback: disable rogue vif in kthread context
net/mlx4: Set proper build dependancy with vxlan
be2net: fix build dependency on VxLAN
mac802154: make csma/cca parameters per-wpan
mac802154: allow only one WPAN to be up at any given time
net: filter: minor: fix kdoc in __sk_run_filter
netlink: don't compare the nul-termination in nla_strcmp
can: c_can: Avoid led toggling for every packet.
can: c_can: Simplify TX interrupt cleanup
can: c_can: Store dlc private
can: c_can: Reduce register access
can: c_can: Make the code readable
...
|
|
There were a couple of patches fixing the same bug that
results in duplicated err = 0; assignment.
The patch removes one of them.
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This allows replying only to the requestor portid while still
supporting broadcasting. Pass 0 to portid for the previous behavior.
Signed-off-by: David Fries <David@Fries.net>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In af3e095a1fb4, Erik Jacobsen fixed one type of unaligned access
bug for ia64 by converting a 64-bit write to use put_unaligned().
Unfortunately, since gcc will convert a short memset() to a series
of appropriately-aligned stores, the problem is now visible again
on tilegx, where the memset that zeros out proc_event is converted
to three 64-bit stores, causing an unaligned access panic.
A better fix for the original problem is to ensure that proc_event
is aligned to 8 bytes here. We can do that relatively easily by
arranging to start the struct cn_msg aligned to 8 bytes and then
offset by 4 bytes. Doing so means that the immediately following
proc_event structure is then correctly aligned to 8 bytes.
The result is that the memset() stores are now aligned, and as an
added benefit, we can remove the put_unaligned() calls in the code.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We calculated the size for the netlink message buffer as size. Use size
in the memcpy() call as well instead of recalculating it.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The current code tests the length of the whole netlink message to be
at least as long to fit a cn_msg. This is wrong as nlmsg_len includes
the length of the netlink message header. Use nlmsg_len() instead to
fix this "off-by-NLMSG_HDRLEN" size check.
Cc: stable@vger.kernel.org # v2.6.14+
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Initialize event_data for all possible message types to prevent leaking
kernel stack contents to userland (up to 20 bytes). Also set the flags
member of the connector message to 0 to prevent leaking two more stack
bytes this way.
Cc: stable@vger.kernel.org # v2.6.15+
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Process connector can now also detect coredumping events.
Main aim of patch is get notified at start of coredumping, instead of
having to wait for it to finish and then being notified through EXIT
event.
Could be used for instance by process-managers that want to get
notified as soon as possible about process failures, and not
necessarily beeing notified after coredump, which could be in the
order of minutes depending on size of coredump, piping and so on.
Signed-off-by: Jesper Derehag <jderehag@hotmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While PROC_CN_MCAST_LISTEN/IGNORE is entirely advisory, it was possible
for an unprivileged user to turn off notifications for all listeners by
sending PROC_CN_MCAST_IGNORE. Instead, require the same privileges as
required for a multicast bind.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: stable@vger.kernel.org
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Acked-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.
this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.
It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.
This change removes the use of __devinit, __devexit_p, __devinitdata,
__devinitconst, and __devexit from these drivers.
Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pull networking changes from David Miller:
1) GRE now works over ipv6, from Dmitry Kozlov.
2) Make SCTP more network namespace aware, from Eric Biederman.
3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.
4) Make openvswitch network namespace aware, from Pravin B Shelar.
5) IPV6 NAT implementation, from Patrick McHardy.
6) Server side support for TCP Fast Open, from Jerry Chu and others.
7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
Borkmann.
8) Increate the loopback default MTU to 64K, from Eric Dumazet.
9) Use a per-task rather than per-socket page fragment allocator for
outgoing networking traffic. This benefits processes that have very
many mostly idle sockets, which is quite common.
From Eric Dumazet.
10) Use up to 32K for page fragment allocations, with fallbacks to
smaller sizes when higher order page allocations fail. Benefits are
a) less segments for driver to process b) less calls to page
allocator c) less waste of space.
From Eric Dumazet.
11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.
12) VXLAN device driver, one way to handle VLAN issues such as the
limitation of 4096 VLAN IDs yet still have some level of isolation.
From Stephen Hemminger.
13) As usual there is a large boatload of driver changes, with the scale
perhaps tilted towards the wireless side this time around.
Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
hyperv: Add buffer for extended info after the RNDIS response message.
hyperv: Report actual status in receive completion packet
hyperv: Remove extra allocated space for recv_pkt_list elements
hyperv: Fix page buffer handling in rndis_filter_send_request()
hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
hyperv: Fix the max_xfer_size in RNDIS initialization
vxlan: put UDP socket in correct namespace
vxlan: Depend on CONFIG_INET
sfc: Fix the reported priorities of different filter types
sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
sfc: Fix loopback self-test with separate_tx_channels=1
sfc: Fix MCDI structure field lookup
sfc: Add parentheses around use of bitfield macro arguments
sfc: Fix null function pointer in efx_sriov_channel_type
vxlan: virtual extensible lan
igmp: export symbol ip_mc_leave_group
netlink: add attributes to fdb interface
tg3: unconditionally select HWMON support when tg3 is enabled.
Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
gre: fix sparse warning
...
|
|
This patch defines netlink_kernel_create as a wrapper function of
__netlink_kernel_create to hide the struct module *me parameter
(which seems to be THIS_MODULE in all existing netlink subsystems).
Suggested by David S. Miller.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
- Only allow asking for events from the initial user and pid namespace,
where we generate the events in.
- Convert kuids and kgids into the initial user namespace to report
them via the process event connector.
Cc: David Miller <davem@davemloft.net>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
V2: Replaced assignment in if statement.
Fixed coding style issues.
Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds the following structure:
struct netlink_kernel_cfg {
unsigned int groups;
void (*input)(struct sk_buff *skb);
struct mutex *cb_mutex;
};
That can be passed to netlink_kernel_create to set optional configurations
for netlink kernel sockets.
I've populated this structure by looking for NULL and zero parameters at the
existing code. The remaining parameters that always need to be set are still
left in the original interface.
That includes optional parameters for the netlink socket creation. This allows
easy extensibility of this interface in the future.
This patch also adapts all callers to use this new interface.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|