Age  Commit message  Author  Files  Lines
2015-08-07bpf tools: Introduce bpf_load_program() to bpf.cWang Nan2-0/+41
bpf_load_program() can be used to load bpf program into kernel. To make loading faster, first try to load without logbuf. Try again with logbuf if the first try failed. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-19-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
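The two-step strategy described above can be sketched as follows. This is a minimal illustration, not the code from the patch: `load_fn`, `load_with_retry()` and `fake_load()` are hypothetical names, and the callback stands in for whatever actually issues the bpf syscall.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical loader callback: returns an fd >= 0 on success and a
 * negative value on failure. A NULL log_buf means "no verifier log". */
typedef int (*load_fn)(char *log_buf, size_t log_size, void *ctx);

/* Try the fast path without a log buffer first; only when that fails,
 * retry with the caller's buffer so diagnostics can be captured. */
static int load_with_retry(load_fn load, void *ctx,
			   char *log_buf, size_t log_size)
{
	int fd = load(NULL, 0, ctx);

	if (fd >= 0 || !log_buf || !log_size)
		return fd;

	log_buf[0] = '\0';
	return load(log_buf, log_size, ctx);
}

/* Stand-in loader for demonstration: always fails, counts its calls in
 * *ctx and writes a message when a log buffer is supplied. */
static int fake_load(char *log_buf, size_t log_size, void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	if (log_buf && log_size)
		snprintf(log_buf, log_size, "rejected");
	return -1;
}
```

The successful case pays for one call only; the cost of filling the log is incurred just on the failure path, where the message is actually needed.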
2015-08-07bpf tools: Relocate eBPF programsWang Nan1-0/+52
If an eBPF program accesses a map, LLVM generates a load instruction which loads an absolute address into a register, like this: ld_64 r1, <MCOperand Expr:(mymap)> ... call 2 That ld_64 instruction will be recorded in the relocation section. To enable the usage of that map, relocation must be done by replacing the immediate value with the real map file descriptor so it can be found by eBPF map functions. This patch does the relocation work based on information collected by patches: 'bpf tools: Collect symbol table from SHT_SYMTAB section', 'bpf tools: Collect relocation sections from SHT_REL sections' and 'bpf tools: Record map accessing instructions for each program'. For each instruction which needs relocation, it injects the corresponding file descriptor into the imm field. As a part of the protocol, src_reg is set to BPF_PSEUDO_MAP_FD to notify the kernel that this is a map loading instruction. This is the final part of the map relocation patches. The principle of map relocation is described in the commit message of 'bpf tools: Collect symbol table from SHT_SYMTAB section'. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-18-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
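A rough sketch of the relocation step described above, under stated assumptions: `struct insn` is a local mirror of the kernel's 8-byte instruction layout, `struct reloc` and `relocate()` are hypothetical names for illustration, and the constants are spelled out locally (0x18 is BPF_LD | BPF_IMM | BPF_DW, and 1 is the value of BPF_PSEUDO_MAP_FD) so that the block builds without kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the kernel's instruction layout (assumption: same
 * field order as 'struct bpf_insn' in linux/bpf.h). */
struct insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

#define LD_IMM64	0x18	/* BPF_LD | BPF_IMM | BPF_DW */
#define PSEUDO_MAP_FD	1	/* value of BPF_PSEUDO_MAP_FD */

/* Hypothetical record: which instruction refers to which map. */
struct reloc {
	size_t insn_idx;
	size_t map_idx;
};

/* Patch every recorded instruction: put the map fd into imm and mark
 * the instruction as a map load through src_reg. */
static int relocate(struct insn *insns, size_t nr_insns,
		    const struct reloc *relocs, size_t nr_relocs,
		    const int *map_fds, size_t nr_maps)
{
	for (size_t i = 0; i < nr_relocs; i++) {
		size_t idx = relocs[i].insn_idx;
		size_t map = relocs[i].map_idx;

		if (idx >= nr_insns || map >= nr_maps)
			return -1;
		if (insns[idx].code != LD_IMM64)
			return -1;	/* only 64-bit immediate loads qualify */

		insns[idx].src_reg = PSEUDO_MAP_FD;
		insns[idx].imm = map_fds[map];
	}
	return 0;
}
```

Rejecting anything that is not a 64-bit immediate load keeps a corrupted relocation record from silently rewriting an unrelated instruction.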
2015-08-07bpf tools: Create eBPF maps defined in an object fileWang Nan2-0/+106
This patch creates maps based on the 'maps' section in the object file using bpf_create_map(), and stores the fds into an array in 'struct bpf_object'. Previous patches parse the ELF object file and collect required data, but don't interact with the kernel. They belong to the 'opening' phase. This patch is the first patch in the 'loading' phase. The 'loaded' field is introduced in 'struct bpf_object' to avoid loading an object twice, because the loading phase clears resources collected during opening, which become useless after loading. In this patch, maps_buf is cleared. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-17-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07bpf tools: Add bpf.c/h for common bpf operationsWang Nan3-1/+68
This patch introduces bpf.c and bpf.h, which hold common functions issuing the bpf syscall. The goal of these two files is to hide the syscall completely from the user. Note that bpf.c and bpf.h deal with the kernel interface only. Things like the structure of the 'maps' section in the ELF object are not the concern of bpf.[ch]. We first introduce bpf_create_map(). Note that, since functions in bpf.[ch] are wrappers of sys_bpf, they don't use OO style naming. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-16-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07bpf tools: Record map accessing instructions for each programWang Nan2-0/+137
This patch records the indices of instructions which are needed to be relocated. That information is saved in the 'reloc_desc' field in 'struct bpf_program'. In the loading phase (this patch takes effect in the opening phase), the collected instructions will be replaced by map loading instructions. Since we are going to close the ELF file and clear all data at the end of the 'opening' phase, the ELF information will no longer be valid in the 'loading' phase. We have to locate the instructions before maps are loaded, instead of directly modifying the instruction. 'struct bpf_map_def' is introduced in this patch to let us know how many maps are defined in the object. This is the third part of map relocation. The principle of map relocation is described in commit message of 'bpf tools: Collect symbol table from SHT_SYMTAB section'. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-15-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07bpf tools: Collect relocation sections from SHT_REL sectionsWang Nan1-0/+26
This patch collects relocation sections into 'struct bpf_object'. Such sections are used for connecting maps to bpf programs. The 'reloc' field in 'struct bpf_object' is introduced for storing such information. This patch simply stores the data into the 'reloc' field. A following patch will parse them to find the exact instructions which need to be relocated. Note that the collected data will be invalid after the ELF object file is closed. This is the second patch related to map relocation. The first one is 'bpf tools: Collect symbol table from SHT_SYMTAB section'. The principle of map relocation is described in its commit message. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-14-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07bpf tools: Collect eBPF programs from their own sectionsWang Nan1-0/+114
This patch collects all programs in an object file into an array of 'struct bpf_program' for further processing. That structure represents each eBPF program. 'bpf_prog' would be a better name, but it is already used by linux/filter.h. Although that is a kernel space name, I still prefer to call it 'bpf_program' to prevent possible confusion. bpf_object__add_program() creates a new 'struct bpf_program' object. It first initializes a variable on the stack using bpf_program__init(), then, on success, enlarges the obj->programs array and copies the new object in. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-13-git-send-email-wangnan0@huawei.com [ Made bpf_object__add_program() propagate the error (-EINVAL or -ENOMEM) ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
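The "initialize on the stack, then grow the array and copy in" pattern from this commit message might look roughly like the sketch below. The types and function names (`struct prog`, `struct object`, `prog_init()`, `object_add_prog()`) are simplified stand-ins invented for illustration, not the libbpf definitions; the error codes follow the -EINVAL / -ENOMEM note in the message.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical stand-ins for the structures in the patch. */
struct prog {
	char *section;
	size_t len;
};

struct object {
	struct prog *progs;
	size_t nr_progs;
};

/* Fill a caller-provided struct; nothing is published on failure. */
static int prog_init(struct prog *p, const char *section, size_t len)
{
	size_t n;

	if (!section || !len)
		return -EINVAL;

	n = strlen(section) + 1;
	p->section = malloc(n);
	if (!p->section)
		return -ENOMEM;
	memcpy(p->section, section, n);
	p->len = len;
	return 0;
}

/* Initialize a temporary on the stack, then enlarge the array and copy
 * the finished object in. The array is untouched if any step fails. */
static int object_add_prog(struct object *obj, const char *section,
			   size_t len)
{
	struct prog tmp, *progs;
	int err = prog_init(&tmp, section, len);

	if (err)
		return err;

	progs = realloc(obj->progs, (obj->nr_progs + 1) * sizeof(*progs));
	if (!progs) {
		free(tmp.section);
		return -ENOMEM;
	}

	progs[obj->nr_progs] = tmp;
	obj->progs = progs;
	obj->nr_progs++;
	return 0;
}
```

Building the entry in a temporary first means a failed initialization never leaves a half-constructed element at the end of the array.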
2015-08-07bpf tools: Collect symbol table from SHT_SYMTAB sectionWang Nan1-0/+10
This patch collects the symbol table section. This section is needed when linking BPF maps.

What 'bpf_map_xxx()' functions actually require are the maps' file descriptors (and the internal verifier converts fds into pointers to 'struct bpf_map'), which we don't know when compiling. Therefore, we should make the compiler generate a 'ldr_64 r1, <imm>' instruction, and fill the 'imm' field with the actual file descriptor when loading in libbpf.

BPF programs should be written in this way:

    struct bpf_map_def SEC("maps") my_map = {
        .type = BPF_MAP_TYPE_HASH,
        .key_size = sizeof(unsigned long),
        .value_size = sizeof(unsigned long),
        .max_entries = 1000000,
    };

    SEC("my_func=sys_write")
    int my_func(void *ctx)
    {
        ...
        bpf_map_update_elem(&my_map, &key, &value, BPF_ANY);
        ...
    }

The compiler should convert '&my_map' into a 'ldr_64 r1, <imm>' instruction, where imm should be the address of 'my_map'. From that address, libbpf knows which map is actually referenced, and then fills the imm field with the 'fd' of that map created by it.

However, since we never really 'link' the object file, the imm field is only a record in the relocation section. Therefore libbpf should do the relocation:

 1. In the relocation section (type == SHT_REL), the position of each such 'ldr_64' instruction is recorded with a reference to an entry in the symbol table (SHT_SYMTAB);

 2. From records in the symbol table we can find the indices of map variables.

Libbpf first records SHT_SYMTAB and the position of each instruction which requires such an operation. Then it creates the file descriptors. Finally, after map creation completes, it replaces the imm field.

This is the first patch of the BPF map related work. It records SHT_SYMTAB into the object's efile field for further use.
Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-12-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07bpf tools: Collect map definitions from 'maps' sectionWang Nan1-0/+29
If maps are used by eBPF programs, the corresponding object file(s) should contain a section named 'maps', which contains map definitions. This patch copies the data of the whole section. Map data parsing should be done just before map loading. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-11-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07bpf tools: Collect version and license from ELF sectionsWang Nan1-0/+53
Expand bpf_obj_elf_collect() to collect license and kernel version information from the eBPF object file. An eBPF object file should have a section named 'license', which contains a string. It should also have a section named 'version', which contains a u32 LINUX_VERSION_CODE. bpf_obj_validate() is introduced to validate the object file after it is loaded. Currently it only checks the existence of the 'version' section. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-10-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
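For readers unfamiliar with the 'version' section contents, here is a small sketch of how a u32 version code can be formed and read back. `KVER()` mirrors the classic KERNEL_VERSION(a, b, c) arithmetic, and `parse_version()` is a hypothetical helper; the strict size check is an assumption for illustration, not something this commit message specifies.

```c
#include <stdint.h>
#include <string.h>

/* Same arithmetic as the classic KERNEL_VERSION(a, b, c) macro. */
#define KVER(a, b, c) \
	(((uint32_t)(a) << 16) + ((uint32_t)(b) << 8) + (uint32_t)(c))

/* Hypothetical helper: accept the section only if it holds exactly one
 * u32, and copy it out with memcpy to avoid unaligned access. */
static int parse_version(const void *data, size_t size, uint32_t *out)
{
	if (size != sizeof(uint32_t))
		return -1;
	memcpy(out, data, sizeof(*out));
	return 0;
}
```

With this encoding, kernel 4.2.0 becomes 0x040200, so an object file built for v4.2 would carry those four bytes in host byte order.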
2015-08-07bpf tools: Iterate over ELF sections to collect informationWang Nan1-0/+53
bpf_obj_elf_collect() is introduced to iterate over each ELF section to collect information in eBPF object files. This function will be further enhanced to collect license, kernel version, programs, configs and map information. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-9-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07bpf tools: Check endianness and make libbpf fail earlyWang Nan1-0/+30
Check endianness according to EHDR. Code is taken from tools/perf/util/symbol-elf.c. Libbpf doesn't magically convert mismatched endianness. Even if we swap eBPF instructions to the correct byte order, we are unable to deal with endianness in the code logic generated by LLVM. Therefore, libbpf should simply reject a mismatched ELF object, and let LLVM create good code. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-8-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
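One way to express the "reject on mismatch" check is sketched below. It is not the code borrowed from tools/perf/util/symbol-elf.c: the constants are written out locally (they correspond to EI_DATA, ELFDATA2LSB and ELFDATA2MSB in the ELF specification) so that the block does not depend on <elf.h>, and the function names are hypothetical.

```c
#include <stdint.h>

/* Values from the ELF specification, named locally for this sketch. */
enum {
	IDX_DATA = 5,	/* EI_DATA: index of the data encoding byte */
	DATA_LSB = 1,	/* ELFDATA2LSB: little endian */
	DATA_MSB = 2,	/* ELFDATA2MSB: big endian */
};

/* Probe the host byte order at run time. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const unsigned char *)&probe == 1;
}

/* Return 0 when the object's byte order matches the host, -1 when it
 * is mismatched or the encoding byte is invalid. No conversion is
 * attempted. */
static int check_endianness(const unsigned char *e_ident)
{
	switch (e_ident[IDX_DATA]) {
	case DATA_LSB:
		return host_is_little_endian() ? 0 : -1;
	case DATA_MSB:
		return host_is_little_endian() ? -1 : 0;
	default:
		return -1;
	}
}
```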
2015-08-07bpf tools: Read eBPF object from bufferWang Nan2-12/+52
To support dynamic compiling, this patch allows the caller to pass an in-memory buffer to libbpf via bpf_object__open_buffer(). libbpf calls elf_memory() to open it as an ELF object file. Because __bpf_object__open() collects all required data and won't need that buffer anymore, libbpf uses that buffer directly instead of cloning a new buffer. The caller of libbpf can free that buffer or use it for other things after bpf_object__open_buffer() returns. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-7-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07bpf tools: Open eBPF object file and do basic validationWang Nan2-0/+166
This patch defines the basic interface of libbpf. 'struct bpf_object' will be the handle of each object file. Its internal structure is hidden from the user. eBPF object files are compiled by LLVM into ELF format. In this patch, libelf is used to open those files, read EHDR and do basic validation according to e_type and e_machine. All ELF related state is grouped together and resides in the efile field of 'struct bpf_object'. bpf_object__elf_finish() is introduced to clear it. After all eBPF programs in an object file are loaded, the related ELF information is useless. Close the object file and free that memory. The zfree() and zclose() functions are introduced to ensure that pointers are set to NULL and file descriptors are set to negative values after resources are released. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-6-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
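The idea behind zfree() and zclose() can be sketched as below. These definitions are an approximation written for this note, not the ones in the patch: here zfree() is a plain macro taking a pointer to the pointer, and zclose() is a function taking a pointer to the descriptor, which is an assumption about the calling convention.

```c
#include <stdlib.h>
#include <unistd.h>

/* Free *ptr and reset it to NULL so that a later use or a second
 * release is harmless (free(NULL) is a no-op). */
#define zfree(ptr) do { free(*(ptr)); *(ptr) = NULL; } while (0)

/* Close *fd if it is valid, then mark it invalid. Returns the result
 * of close(), or 0 when there was nothing to close. */
static int zclose(int *fd)
{
	int err = 0;

	if (*fd >= 0)
		err = close(*fd);
	*fd = -1;
	return err;
}
```

Resetting the handle at the point of release makes cleanup paths idempotent, which matters when the same teardown function can be reached from several error branches.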
2015-08-07bpf tools: Allow caller to set printing functionWang Nan2-0/+52
With libbpf_set_print(), users of libbpf can register their own debug, info and warning printing functions. Libbpf will use those functions to print messages. If not provided, the default info and warning printing functions are fprintf(stderr, ...); the default debug printing function is NULL. This API is designed to be used by perf, enabling it to register its own logging functions to make all logs uniform, instead of having separate logging level control. Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Cc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-5-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
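A callback registration scheme of this kind could be structured as in the sketch below. All names here (`print_fn`, `set_print()`, the `pr_*` macros, `counting_print()`) are hypothetical and chosen for illustration; the sketch only mirrors the behaviour stated in the message: warning and info default to stderr, debug defaults to NULL and is therefore silent.

```c
#include <stdarg.h>
#include <stdio.h>

typedef int (*print_fn)(const char *fmt, ...);

/* Default sink for warning and info messages: print to stderr. */
static int default_print(const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vfprintf(stderr, fmt, ap);
	va_end(ap);
	return n;
}

static print_fn pr_warning_fn = default_print;
static print_fn pr_info_fn = default_print;
static print_fn pr_debug_fn;	/* NULL: debug output is off by default */

/* Let the caller install its own sinks for all three levels. */
static void set_print(print_fn warn, print_fn info, print_fn debug)
{
	pr_warning_fn = warn;
	pr_info_fn = info;
	pr_debug_fn = debug;
}

/* Call a sink only if one is registered. */
#define PR(fn, ...) do { if (fn) (fn)(__VA_ARGS__); } while (0)
#define pr_warning(...)	PR(pr_warning_fn, __VA_ARGS__)
#define pr_info(...)	PR(pr_info_fn, __VA_ARGS__)
#define pr_debug(...)	PR(pr_debug_fn, __VA_ARGS__)

/* Example user-supplied sink: counts messages instead of printing. */
static int nr_msgs;

static int counting_print(const char *fmt, ...)
{
	(void)fmt;
	return ++nr_msgs;
}
```

A tool such as perf would pass its own logging functions to the registration call once at start-up, so that library messages follow the tool's verbosity settings.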
2015-08-07bpf tools: Introduce 'bpf' library and add bpf feature checkWang Nan7-1/+246
This is the first patch of libbpf. The goal of libbpf is to create a standard way of accessing eBPF object files. This patch creates 'Makefile' and 'Build' for it, allowing 'make' to build libbpf.a and libbpf.so, and 'make install' to put them into the proper directories. Most of the Makefile is borrowed from traceevent. Before building, it checks for the existence of libelf in the Makefile, and refuses to build if it is not found. Instead of throwing an error if libelf is not found, the error is raised in a phony target "elfdep". This design ensures that 'make clean' still works even if libelf is not found. Because libbpf requires the 'kern_version' field to be set for 'union bpf_attr' ("bpfdep" is used for that dependency), the kernel BPF API is also checked by introducing a new feature check 'bpf' into tools/build/feature, which checks the existence and version of linux/bpf.h. When building libbpf, it searches for that file in include/uapi/linux in the kernel source tree (controlled by FEATURE_CHECK_CFLAGS-bpf). Since it searches the kernel source tree it resides in, installing the newest kernel headers is not required, unless we are trying to port these files to an old kernel. To avoid checking that file when building perf, the newly introduced 'bpf' feature check isn't added to FEATURE_TESTS and FEATURE_DISPLAY by default in tools/build/Makefile.feature, but is added specifically for libbpf.
Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Cc: He Kuang <hekuang@huawei.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Zefan Li <lizefan@huawei.com> Bcc: pi3orama@163.com Link: http://lkml.kernel.org/r/1435716878-189507-4-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-07drm/vblank: Use u32 consistently for vblank countersDaniel Vetter2-2/+2
In commit 99264a61dfcda41d86d0960cf2d4c0fc2758a773 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Apr 15 19:34:43 2015 +0200 drm/vblank: Fixup and document timestamp update/read barriers I've switched vblank->count from atomic_t to unsigned long and accidentally created an integer comparison bug in drm_vblank_count_and_time since vblank->count might overflow the u32 local copy and hence the retry loop never succeeds. Fix this by consistently using u32. Cc: Michel Dänzer <michel@daenzer.net> Reported-by: Michel Dänzer <michel@daenzer.net> Reviewed-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
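The comparison bug can be reproduced in isolation. The sketch below is not the DRM code: the function names are made up, and uint64_t stands in for 'unsigned long' on a 64-bit kernel. It models the "read, then re-read and compare" check of a retry loop with mismatched widths versus consistent u32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mixed widths: the counter is 64-bit but the local copy is 32-bit.
 * Once the counter exceeds UINT32_MAX, the truncated copy can never
 * compare equal, so a loop retrying until "stable" would spin forever. */
static bool stable_read_mixed(uint64_t count_before, uint64_t count_after)
{
	uint32_t cur = (uint32_t)count_before;	/* truncating copy */

	return cur == count_after;	/* compared at 64-bit width */
}

/* Consistent width: both sides are u32, so an unchanged counter always
 * compares equal, including after wraparound. */
static bool stable_read_u32(uint32_t count_before, uint32_t count_after)
{
	return count_before == count_after;
}
```

With a counter value of 2^32 that does not change between the two reads, the mixed-width check reports "changed" while the u32 check correctly reports "stable".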
2015-08-07Merge tag 'asoc-fix-v4.2-rc5' of ↵Takashi Iwai530-3506/+6319
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v4.2 There are a couple of small driver specific fixes here but the overwhelming bulk of these changes are fixes to the topology ABI that has been newly introduced in v4.2. Once this makes it into a release we will have to firm this up but for now getting enhancements in before they've made it into a release is the most expedient thing.
2015-08-07KVM: x86: Use adjustment in guest cycles when handling MSR_IA32_TSC_ADJUSTHaozhong Zhang1-1/+1
When kvm_set_msr_common() handles a guest's write to MSR_IA32_TSC_ADJUST, it will calculate an adjustment based on the data written by guest and then use it to adjust TSC offset by calling a call-back adjust_tsc_offset(). The 3rd parameter of adjust_tsc_offset() indicates whether the adjustment is in host TSC cycles or in guest TSC cycles. If SVM TSC scaling is enabled, adjust_tsc_offset() [i.e. svm_adjust_tsc_offset()] will first scale the adjustment; otherwise, it will just use the unscaled one. As the MSR write here comes from the guest, the adjustment is in guest TSC cycles. However, the current kvm_set_msr_common() uses it as a value in host TSC cycles (by using true as the 3rd parameter of adjust_tsc_offset()), which can result in an incorrect adjustment of TSC offset if SVM TSC scaling is enabled. This patch fixes this problem. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Cc: stable@vger.linux.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-07KVM: x86: zero IDT limit on entry to SMMPaolo Bonzini1-0/+5
The recent BlackHat 2015 presentation "The Memory Sinkhole" mentions that the IDT limit is zeroed on entry to SMM. This is not documented, and must have changed some time after 2010 (see http://www.ssi.gouv.fr/uploads/IMG/pdf/IT_Defense_2010_final.pdf). KVM was not doing it, but the fix is easy. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-08-07ARCv2: spinlock/rwlock/atomics: reduce 1 instruction in exponential backoffVineet Gupta2-4/+2
The increment of the delay counter was 2 instructions: Arithmetic Shift Left (ASL) + set to 1 on overflow. This can be done in 1 using ROtate Left (ROL). Suggested-by: Nigel Topham <ntopham@synopsys.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
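In C terms, the equivalence relied on here can be illustrated as follows. This is a model of the arithmetic only, not the ARC assembly from the patch, and the function names are hypothetical. It assumes the delay counter always holds a single set bit (it starts at 1 and doubles), which is what makes the two forms agree.

```c
#include <stdint.h>

/* Two-step form: shift left, and restart at 1 when the set bit is
 * shifted out of the 32-bit register. */
static uint32_t next_delay_asl(uint32_t delay)
{
	uint32_t carry = delay >> 31;	/* bit about to be shifted out */

	delay <<= 1;
	if (carry)
		delay = 1;
	return delay;
}

/* One-step form: rotate left by one. For a value with a single set
 * bit, the bit wraps from position 31 back to position 0. */
static uint32_t next_delay_rol(uint32_t delay)
{
	return (delay << 1) | (delay >> 31);
}
```

For values with more than one bit set the two functions differ, so the substitution is only valid because of the power-of-two invariant.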
2015-08-07virtio-net: drop NETIF_F_FRAGLISTJason Wang1-2/+2
virtio declares support for NETIF_F_FRAGLIST, but assumes that there are at most MAX_SKB_FRAGS + 2 fragments which isn't always true with a fraglist. A longer fraglist in the skb will make the call to skb_to_sgvec overflow the sg array, leading to memory corruption. Drop NETIF_F_FRAGLIST so we only get what we can handle. Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07stmmac: dwmac-ipq806x: fix static checker warningMathieu Olivari1-2/+2
The patch b1c17215d718: "stmmac: add ipq806x glue layer", leads to the following static checker warning: .../stmmac/dwmac-ipq806x.c:314 ipq806x_gmac_probe() warn: double left shift '1 << (1 << gmac->id)' The NSS_COMMON_CLK_SRC_CTRL_OFFSET macro is used once as an offset, and once as a mask, which is a bug indeed. We'll fix it by defining the offset as the real offset value and computing the mask from it when required. Tested on IPQ806x ref designs AP148 & DB149. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Mathieu Olivari <mathieu@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
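The offset-versus-mask confusion can be shown with a few macros. The names below are shortened, hypothetical versions invented for this note (the driver's macro is NSS_COMMON_CLK_SRC_CTRL_OFFSET), and `set_clk_src()` is an illustrative helper, not a function from the driver.

```c
#include <stdint.h>

/* Problematic shape: named "offset" but already expands to a mask, so
 * "1 << OFFSET(id)" at a call site shifts twice. */
#define CLK_SRC_CTRL_OFFSET_BUGGY(x)	(1u << (x))

/* Fixed shape: the offset is the real bit position and the mask is
 * computed from it when required. */
#define CLK_SRC_CTRL_OFFSET(x)		(x)
#define CLK_SRC_CTRL_MASK(x)		(1u << CLK_SRC_CTRL_OFFSET(x))

/* Hypothetical helper: set or clear the one-bit field for port 'id'. */
static uint32_t set_clk_src(uint32_t reg, unsigned int id, unsigned int val)
{
	reg &= ~CLK_SRC_CTRL_MASK(id);
	reg |= (val & 1u) << CLK_SRC_CTRL_OFFSET(id);
	return reg;
}
```

For id = 2, the double shift yields 1 << 4 = 16 where the intended mask is 1 << 2 = 4, which is the kind of expression the static checker flagged.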
2015-08-07Merge tag 'perf-core-for-mingo' of ↵Ingo Molnar39-225/+716
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: User visible changes: - IPC and cycle accounting in 'perf annotate'. (Andi Kleen) - Display cycles in branch sort mode in 'perf report'. (Andi Kleen) - Add total time column to 'perf trace' syscall stats summary. (Milian Woff) Infrastructure changes: - PMU helpers to use in Intel PT. (Adrian Hunter) - Fix perf-with-kcore script not to split args with spaces. (Adrian Hunter) - Add empty Build files for some more architectures. (Ben Hutchings) - Move 'perf stat' config variables to a struct to allow using some of its functions in more places. (Jiri Olsa) - Add DWARF register names for 'xtensa' arch. (Max Filippov) - Implement BPF programs attached to uprobes. (Wang Nan) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-07net: netcp: fix unused interface rx buffer size configurationWingMan Kwok2-23/+13
Prior to this patch, rx buffer size for each rx queue of an interface is configurable through dts bindings. But for an interface, the first rx queue's rx buffer size is always the usual MTU size (plus usual overhead) and page size for the remaining rx queues (if they are enabled by specifying a non-zero rx queue depth dts binding of the corresponding interface). This patch removes the rx buffer size configuration capability. Signed-off-by: WingMan Kwok <w-kwok2@ti.com> Acked-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07net: thunderx: remove effective "default y" from Kconfig if ARCH_THUNDER=yIan Campbell1-3/+0
As well as for kernels built only for ThunderX ARCH_THUNDERX is also enabled for kernels which support multiple platforms (such as distro kernels). Thus "default ARCH_THUNDER" is inappropriate. I believe default m is equally frowned upon, so remove the line completely rather than "default m if ARCH_THUNDER". Signed-off-by: Ian Campbell <ijc@hellion.org.uk> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Robert Richter <rric@kernel.org> Cc: Derek Chickles <derek.chickles@caviumnetworks.com> Cc: Satanand Burla <satananda.burla@caviumnetworks.com> Cc: Felix Manlunas <felix.manlunas@caviumnetworks.com> Cc: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-arm-kernel@lists.infradead.org Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07r8169: enforce RX_MULTI_EN on rtl8168ep/8111ep chipsIvan Vecera1-1/+3
Enforcing this flag in RxConfig for the mentioned chips fixes netdev watchdog issues prepended with AMD IOMMU message(s) like: AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x001d address=0x0000000000003000 flags=0x0050] Note that this flag is also set in Realtek's own driver for these chips. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Tested-by: Alexander Lindqvist <alexander@bitspace.se> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07bridge: netlink: account for the IFLA_BRPORT_PROXYARP_WIFI attribute size ↵Nikolay Aleksandrov1-0/+2
and policy The attribute size wasn't accounted for in the get_slave_size() callback (br_port_get_slave_size) when it was introduced, so fix it now. Also add a policy entry for it in br_port_policy. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: 842a9ae08a25 ("bridge: Extend Proxy ARP design to allow optional rules for Wi-Fi") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07bridge: netlink: account for the IFLA_BRPORT_PROXYARP attribute size and policyNikolay Aleksandrov1-0/+2
The attribute size wasn't accounted for in the get_slave_size() callback (br_port_get_slave_size) when it was introduced, so fix it now. Also add a policy entry for it in br_port_policy. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: 958501163ddd ("bridge: Add support for IEEE 802.11 Proxy ARP") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07Merge tag 'wireless-drivers-for-davem-2015-08-04' of ↵David S. Miller8-10/+51
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers Kalle Valo says: ==================== iwlwifi: * a fix for the stuck TFD queue mechanism - it was producing noisy false alarms * a fix for the NIC prepare flow that prevented the driver from being able to access the device on certain systems * a fix for the scan priority handling which allows the regular scan to run even if a scheduled scan is already running rsi: * fix firmware load DMA regression b43: * fix extpa_gain check for 2GHz rtlwifi: * fix NULL dereference when PCI driver used as an AP * add missing module parameter declaration for rtl8723be_mod_params.msi_support ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07net: pktgen: don't abuse current->state in pktgen_thread_worker()Oleg Nesterov1-3/+0
Commit 1fbe4b46caca "net: pktgen: kill the Wait for kthread_stop code in pktgen_thread_worker()" removed (in particular) the final __set_current_state(TASK_RUNNING) and I didn't notice the previous set_current_state(TASK_INTERRUPTIBLE). This triggers the warning in __might_sleep() after return. Afaics, we can simply remove both set_current_state()'s, and we could do this a long ago right after ef87979c273a2 "pktgen: better scheduler friendliness" which changed pktgen_thread_worker() to use wait_event_interruptible_timeout(). Reported-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07xen/netback: Wake dealloc thread after completing zerocopy workRoss Lagerwall2-1/+6
Waking the dealloc thread before decrementing inflight_packets is racy because it means the thread may go to sleep before inflight_packets is decremented. If kthread_stop() has already been called, the dealloc thread may wait forever with nothing to wake it. Instead, wake the thread only after decrementing inflight_packets. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07net: Fix skb_set_peeked use-after-free bugHerbert Xu1-6/+7
The commit 738ac1ebb96d02e0d23bc320302a6ea94c612dec ("net: Clone skb before setting peeked flag") introduced a use-after-free bug in skb_recv_datagram. This is because skb_set_peeked may create a new skb and free the existing one. As it stands the caller will continue to use the old freed skb. This patch fixes it by making skb_set_peeked return the new skb (or the old one if unchanged). Fixes: 738ac1ebb96d ("net: Clone skb before setting peeked flag") Reported-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Brenden Blanco <bblanco@plumgrid.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
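The ownership rule described above (a helper that may free its argument must hand the replacement back to the caller) can be sketched in plain userspace C. This is only an illustration under assumptions: `struct buf`, `buf_set_peeked()` and `peek_example()` are hypothetical stand-ins, not the kernel's `struct sk_buff` or `skb_set_peeked()`, and the error convention (NULL on failure, original left intact) is chosen for the sketch rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a packet buffer; not the kernel's sk_buff. */
struct buf {
	bool shared;	/* other users hold a reference, so no in-place edits */
	bool peeked;
	int payload;
};

/*
 * Mark a buffer as peeked. A shared buffer is copied and the original is
 * freed, so the caller must continue with the returned pointer.
 * Returns NULL on allocation failure, leaving the original untouched.
 */
static struct buf *buf_set_peeked(struct buf *b)
{
	struct buf *copy;

	if (b->peeked)
		return b;
	if (!b->shared) {
		b->peeked = true;
		return b;
	}

	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;
	*copy = *b;
	copy->shared = false;
	copy->peeked = true;
	free(b);		/* the old pointer is dead from here on */
	return copy;
}

/* Caller side: never touch the old pointer after the call succeeds. */
int peek_example(void)
{
	struct buf *b = malloc(sizeof(*b));
	struct buf *res;
	int payload;

	assert(b != NULL);
	b->shared = true;
	b->peeked = false;
	b->payload = 7;

	res = buf_set_peeked(b);
	if (!res) {
		free(b);	/* failure path: we still own the original */
		return -1;
	}
	b = res;		/* switch to the returned buffer */
	payload = b->payload;
	free(b);
	return payload;
}
```

Returning the possibly-new pointer keeps the replacement visible at every call site, which is harder to get wrong than updating the buffer through a side channel.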
2015-08-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds4-81/+11
Pull sparc fix from David Miller: "FPU register corruption bug fix" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc64: Fix userspace FPU register corruptions.
2015-08-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds27-102/+153
Merge fixes from Andrew Morton: "21 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits) writeback: fix initial dirty limit mm/memory-failure: set PageHWPoison before migrate_pages() mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_* mm/memory-failure: give up error handling for non-tail-refcounted thp mm/memory-failure: fix race in counting num_poisoned_pages mm/memory-failure: unlock_page before put_page ipc: use private shmem or hugetlbfs inodes for shm segments. mm: initialize hotplugged pages as reserved ocfs2: fix shift left overflow kthread: export kthread functions fsnotify: fix oops in fsnotify_clear_marks_by_group_flags() lib/iommu-common.c: do not use 0xffffffffffffffffl for computing align_mask mm/slub: allow merging when SLAB_DEBUG_FREE is set signalfd: fix information leak in signalfd_copyinfo signal: fix information leak in copy_siginfo_to_user signal: fix information leak in copy_siginfo_from_user32 ocfs2: fix BUG in ocfs2_downconvert_thread_do_work() fs, file table: reinit files_stat.max_files after deferred memory initialisation mm, meminit: replace rwsem with completion mm, meminit: allow early_pfn_to_nid to be used during runtime ...
2015-08-07sparc64: Fix userspace FPU register corruptions.David S. Miller4-81/+11
If we have a series of events from userspace, with %fprs=FPRS_FEF, as follows: ETRAP ETRAP VIS_ENTRY(fprs=0x4) VIS_EXIT RTRAP (kernel FPU restore with fpu_saved=0x4) RTRAP We will not restore the user registers that were clobbered by the FPU using kernel code in the inner-most trap. Traps allocate FPU save slots in the thread struct, and FPU using sequences save the "dirty" FPU registers only. This works at the initial trap level because all of the registers get recorded into the top-level FPU save area, and we'll return to userspace with the FPU disabled so that any FPU use by the user will take an FPU disabled trap wherein we'll load the registers back up properly. But this is not how trap returns from kernel to kernel operate. The simplest fix for this bug is to always save all FPU register state for anything other than the top-most FPU save area. Getting rid of the optimized inner-slot FPU saving code ends up making VISEntryHalf degenerate into plain VISEntry. Longer term we need to do something smarter to reinstate the partial save optimizations. Perhaps the fundamental error is having trap entry and exit allocate FPU save slots and restore register state. Instead, the VISEntry et al. calls should be doing that work. This bug is about two decades old. Reported-by: James Y Knight <jyknight@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07net: fec: fix initial runtime PM refcountLucas Stach1-0/+1
The clocks are initially active and thus the device is marked active. This still keeps the PM refcount at 0, so the pm_runtime_put_autosuspend() call at the end of probe then leaves us with an invalid refcount of -1, which in turn leads to the device staying in suspended state even though netdev open had been called. Fix this by initializing the refcount to be coherent with the initial device status. Fixes: 8fff755e9f8 (net: fec: Ensure clocks are enabled while using mdio bus) Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07Merge branch 'drm-fixes-4.2' of git://people.freedesktop.org/~agd5f/linuxLinus Torvalds7-24/+50
Pull amdgpu fixes from Alex Deucher: "Just a few amdgpu fixes to make sure we report the proper firmware information and number of render buffers to userspace and a typo in a debugging function" [ Pulling directly from Alex since Dave Airlie is on vacation - Linus ] * 'drm-fixes-4.2' of git://people.freedesktop.org/~agd5f/linux: drm/amdgpu: set fw_version and feature_version for smu fw loading drm/amdgpu: add feature version for SDMA ucode drm/amdgpu: add feature version for RLC and MEC v2 drm/amdgpu: increment queue when iterating on this variable. drm/amdgpu: fix rb setting for CZ
2015-08-07Merge branch 'drm-tda998x-fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-armLinus Torvalds1-2/+2
Pull TDA998x i2c driver fixes from Russell King: "This fixes the double-checksumming of the AVI infoframe which was resulting in the checksum always being zero. It went unnoticed as none of my HDMI devices had a problem with this" [ Pulling directly from rmk since Dave Airlie is on vacation - Linus ] * 'drm-tda998x-fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: drm/i2c: tda998x: fix bad checksum of the HDMI AVI infoframe
2015-08-07writeback: fix initial dirty limitRabin Vincent1-2/+2
The initial value of global_wb_domain.dirty_limit set by writeback_set_ratelimit() is zeroed out by the memset in wb_domain_init(). Signed-off-by: Rabin Vincent <rabin.vincent@axis.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
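The ordering problem described above is easy to reproduce in isolation: a value stored before a memset-based init is silently lost. The sketch below is a userspace illustration only; `struct domain`, `domain_init()` and the other names are hypothetical, and it does not claim to show how the actual patch reorders the calls.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a writeback domain; field names are illustrative. */
struct domain {
	unsigned long dirty_limit;
	unsigned long period;
};

/* Init helper that clears the whole structure before setting defaults. */
static void domain_init(struct domain *d)
{
	assert(d != NULL);
	memset(d, 0, sizeof(*d));
	d->period = 1;
}

static void domain_set_limit(struct domain *d, unsigned long limit)
{
	assert(d != NULL);
	d->dirty_limit = limit;
}

/* Wrong order: the limit is stored first and then wiped by domain_init(). */
unsigned long setup_limit_then_init(unsigned long limit)
{
	struct domain d;

	domain_set_limit(&d, limit);
	domain_init(&d);
	return d.dirty_limit;
}

/* Right order: initialise first, then store the limit. */
unsigned long setup_init_then_limit(unsigned long limit)
{
	struct domain d;

	domain_init(&d);
	domain_set_limit(&d, limit);
	return d.dirty_limit;
}
```

With a limit of 100, the first variant ends up with 0 and the second keeps 100.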
2015-08-07mm/memory-failure: set PageHWPoison before migrate_pages()Naoya Horiguchi2-4/+6
Now page freeing code doesn't consider PageHWPoison as a bad page, so by setting it before completing the page containment, we can prevent the error page from being reused just after successful page migration. I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the page table entry is transformed into a migration entry, not into a hwpoison entry. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_*Naoya Horiguchi4-10/+16
The race condition addressed in commit add05cecef80 ("mm: soft-offline: don't free target page in successful page migration") was not closed completely, because that can happen not only for soft-offline, but also for hard-offline. Consider that a slab page is about to be freed into buddy pool, and then an uncorrected memory error hits the page just after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not necessary because the data on the affected page is not consumed. To solve it, this patch drops __PG_HWPOISON from page flag checks at allocation/free time. I think it's justified because the __PG_HWPOISON flag is defined to prevent the page from being reused, and setting it outside the page's alloc-free cycle is a designed behavior (not a bug). In recent months, I was annoyed by a BUG_ON when a soft-offlined page remains on the lru cache list for a while, which is avoided by calling put_page() instead of putback_lru_page() in page migration's success path. This means that this patch reverts a major change from commit add05cecef80 about the new refcounting rule of soft-offlined pages, so the "reuse window" is revived. This will be closed by a subsequent patch. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07mm/memory-failure: give up error handling for non-tail-refcounted thpNaoya Horiguchi1-9/+12
"non anonymous thp" case is still racy with freeing thp, which causes panic due to put_page() for refcount-0 page. It seems that closing up this race might be hard (and/or not worth doing,) so let's give up the error handling for this case. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07mm/memory-failure: fix race in counting num_poisoned_pagesNaoya Horiguchi1-2/+2
When memory_failure() is called on a page which was just freed after page migration from soft offlining, the counter num_poisoned_pages is raised twice. So let's fix it by using TestSetPageHWPoison. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
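The idea behind a test-and-set flag is that only the caller that actually flips the bit gets to bump the counter, so a page reported twice is counted once. A minimal userspace sketch using C11 atomics follows; `struct page_like`, `FLAG_POISON`, `test_and_set_poison()` and `report_poison()` are hypothetical names for illustration and are not the kernel's page flag helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FLAG_POISON 0x1u	/* hypothetical flag bit */

/* Hypothetical stand-in for a page with an atomic flags word. */
struct page_like {
	atomic_uint flags;
};

static atomic_long num_poisoned;

/* Atomically set the flag and report whether it was already set. */
static bool test_and_set_poison(struct page_like *p)
{
	unsigned int old = atomic_fetch_or(&p->flags, FLAG_POISON);

	return (old & FLAG_POISON) != 0;
}

/*
 * Count a page at most once, however many times it is reported.
 * Returns the current value of the counter.
 */
long report_poison(struct page_like *p)
{
	assert(p != NULL);
	if (!test_and_set_poison(p))
		atomic_fetch_add(&num_poisoned, 1);
	return atomic_load(&num_poisoned);
}
```

A separate "test, then set, then increment" sequence would leave a window in which two reporters both see the flag clear and both increment.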
2015-08-07mm/memory-failure: unlock_page before put_pageNaoya Horiguchi1-2/+2
Recently I addressed a few hwpoison race problems and the patches were merged in v4.2-rc1. It made progress, but unfortunately some problems still remain due to insufficient coverage in my testing. So I'm trying to fix or avoid them in this series. One point I'm expecting to discuss is that patch 4/5 changes the page flag set to be checked at free time. In current behavior, __PG_HWPOISON is not supposed to be set when the page is freed. I think that there is no strong reason for this behavior, and it causes a problem that is hard to fix only on the error handler side (because __PG_HWPOISON could be set at arbitrary timing). So I suggest changing it. With this patchset, hwpoison stress testing in official mce-test testsuite (which previously failed) passes. This patch (of 5): In "just unpoisoned" path, we do put_page and then unlock_page, which is a wrong order and causes "freeing locked page" bug. So let's fix it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dean Nelson <dnelson@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07ipc: use private shmem or hugetlbfs inodes for shm segments.Stephen Smalley3-3/+5
The shm implementation internally uses shmem or hugetlbfs inodes for shm segments. As these inodes are never directly exposed to userspace and only accessed through the shm operations which are already hooked by security modules, mark the inodes with the S_PRIVATE flag so that inode security initialization and permission checking is skipped. This was motivated by the following lockdep warning: ====================================================== [ INFO: possible circular locking dependency detected ] 4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W ------------------------------------------------------- httpd/1597 is trying to acquire lock: (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130 but task is already holding lock: (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){++++++}: lock_acquire+0xc7/0x270 __might_fault+0x7a/0xa0 filldir+0x9e/0x130 xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs] xfs_readdir+0x1b4/0x330 [xfs] xfs_file_readdir+0x2b/0x30 [xfs] iterate_dir+0x97/0x130 SyS_getdents+0x91/0x120 entry_SYSCALL_64_fastpath+0x12/0x76 -> #2 (&xfs_dir_ilock_class){++++.+}: lock_acquire+0xc7/0x270 down_read_nested+0x57/0xa0 xfs_ilock+0x167/0x350 [xfs] xfs_ilock_attr_map_shared+0x38/0x50 [xfs] xfs_attr_get+0xbd/0x190 [xfs] xfs_xattr_get+0x3d/0x70 [xfs] generic_getxattr+0x4f/0x70 inode_doinit_with_dentry+0x162/0x670 sb_finish_set_opts+0xd9/0x230 selinux_set_mnt_opts+0x35c/0x660 superblock_doinit+0x77/0xf0 delayed_superblock_init+0x10/0x20 iterate_supers+0xb3/0x110 selinux_complete_init+0x2f/0x40 security_load_policy+0x103/0x600 sel_write_load+0xc1/0x750 __vfs_write+0x37/0x100 vfs_write+0xa9/0x1a0 SyS_write+0x58/0xd0 entry_SYSCALL_64_fastpath+0x12/0x76 ... 
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Reported-by: Morten Stevens <mstevens@fedoraproject.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07mm: initialize hotplugged pages as reservedMel Gorman1-1/+9
Commit 92923ca3aace ("mm: meminit: only set page reserved in the memblock region") broke memory hotplug which expects the memmap for newly added sections to be reserved until onlined by online_pages_range(). This patch marks hotplugged pages as reserved when adding new zones. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: David Vrabel <david.vrabel@citrix.com> Tested-by: David Vrabel <david.vrabel@citrix.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07ocfs2: fix shift left overflowJoseph Qi1-2/+2
When using a large volume, for example 9T volume with 2T already used, frequent creation of small files with O_DIRECT when the IO is not cluster aligned may clear sectors in the wrong place. This will cause filesystem corruption. This is because p_cpos is a u32. When calculating the corresponding sector it should be converted to u64 first, otherwise it may overflow. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
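The overflow described above comes from doing the shift in 32 bits and widening only afterwards. The following sketch shows the difference in plain C; the function names are hypothetical, the shift amount is illustrative, and a 32-bit `unsigned int` is assumed, as on common Linux targets.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy: the shift is evaluated in 32 bits and wraps before widening. */
uint64_t cluster_to_sector_narrow(uint32_t cpos, unsigned int bits)
{
	assert(bits < 32);
	return cpos << bits;
}

/* Fixed: widen to 64 bits first, then shift. */
uint64_t cluster_to_sector_wide(uint32_t cpos, unsigned int bits)
{
	assert(bits < 32);
	return (uint64_t)cpos << bits;
}
```

For example, with a cluster offset of 0x20000000 and a shift of 3, the narrow version wraps to 0 while the widened version yields 0x100000000, which is the kind of wrong sector address that can lead to clearing data in the wrong place.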
2015-08-07kthread: export kthread functionsDavid Kershner1-0/+4
The s-Par visornic driver, currently in staging, processes a queue being serviced by an s-Par service partition. We can get a message that something has happened with the Service Partition. When that happens, we must not access the channel until we get a message that the service partition is back again. The visornic driver has a thread for processing the channel. When we get the message, we need to be able to park the thread and then resume it when the problem clears. We can do this with kthread_park and unpark, but they are not exported from the kernel, so this patch exports the needed functions. Signed-off-by: David Kershner <david.kershner@unisys.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()Jan Kara1-5/+25
fsnotify_clear_marks_by_group_flags() can race with fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked() drops mark_mutex, a mark from the list iterated by fsnotify_clear_marks_by_group_flags() can be freed and thus the next entry pointer we have cached may become stale and we dereference free memory. Fix the problem by first moving marks to free to a special private list and then always free the first entry in the special list. This method is safe even when entries from the list can disappear once we drop the lock. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Ashish Sangwan <a.sangwan@samsung.com> Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
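The "move to a private list, then always free the first entry" approach described above can be sketched with a simple singly linked list. This is a single-threaded illustration under assumptions: `struct mark`, `struct group`, `add_mark()` and `clear_marks_by_flags()` are hypothetical, and `group_lock()`/`group_unlock()` are no-op placeholders marking where a real implementation would hold its mutex.

```c
#include <assert.h>
#include <stdlib.h>

struct mark {
	struct mark *next;
	unsigned int flags;
};

struct group {
	struct mark *marks;	/* would be protected by the group lock */
};

/* Placeholders for the group mutex; a real version would lock here. */
static void group_lock(struct group *g)   { (void)g; }
static void group_unlock(struct group *g) { (void)g; }

/* Push a new mark onto the group's list. Returns 0 on success. */
int add_mark(struct group *g, unsigned int flags)
{
	struct mark *m = malloc(sizeof(*m));

	if (!m)
		return -1;
	m->flags = flags;
	m->next = g->marks;
	g->marks = m;
	return 0;
}

/*
 * Unlink every mark matching 'flags' onto a private list while holding the
 * lock, then free the private list head-first. Returns the number freed.
 */
int clear_marks_by_flags(struct group *g, unsigned int flags)
{
	struct mark *to_free = NULL, **pp, *m;
	int freed = 0;

	assert(g != NULL);

	group_lock(g);
	pp = &g->marks;
	while ((m = *pp) != NULL) {
		if (m->flags & flags) {
			*pp = m->next;		/* unlink from the shared list */
			m->next = to_free;	/* push onto the private list */
			to_free = m;
		} else {
			pp = &m->next;
		}
	}
	group_unlock(g);

	/*
	 * No "next" pointer is cached across the unlocked region: the head
	 * of the private list is re-read under the lock on every pass.
	 */
	for (;;) {
		group_lock(g);
		m = to_free;
		if (m)
			to_free = m->next;
		group_unlock(g);
		if (!m)
			break;
		free(m);	/* destruction happens outside the lock */
		freed++;
	}
	return freed;
}
```

Because each pass takes a fresh head, an entry that disappears while the lock is dropped cannot leave the loop holding a stale pointer, which is the property the commit message relies on.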