summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-04-11lib: add testing module for UBSANJinbum Park3-0/+153
This is a test module for UBSAN. It triggers all undefined behaviors that linux supports now, and detect them. All test-cases have passed by compiling with gcc-5.5.0. If use gcc-4.9.x, misaligned, out-of-bounds, object-size-mismatch will not be detected. Because gcc-4.9.x doesn't support them. Link: http://lkml.kernel.org/r/20180309102247.GA2944@pjb1027-Latitude-E5410 Signed-off-by: Jinbum Park <jinb.park7@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11lib/test_bitmap.c: do not accidentally use stack VLAKees Cook1-4/+6
This avoids an accidental stack VLA (since the compiler thinks the value of "len" can change, even when marked "const"). This just replaces it with a #define so it will DTRT. Seen with -Wvla. Fixed as part of the directive to remove all VLAs from the kernel: https://lkml.org/lkml/2018/3/7/621 Link: http://lkml.kernel.org/r/20180307212555.GA17927@beast Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Yury Norov <ynorov@caviumnetworks.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11lib/Kconfig.debug: Debug Lockups and Hangs: keep SOFTLOCKUP options togetherRandy Dunlap1-24/+24
Keep all of the SOFTLOCKUP kconfig symbols together (instead of injecting the HARDLOCKUP symbols in the midst of them) so that the config tools display them with their dependencies. Tested with 'make {menuconfig/nconfig/gconfig/xconfig}'. Link: http://lkml.kernel.org/r/6be2d9ed-4656-5b94-460d-7f051e2c7570@infradead.org Fixes: 05a4a9527931 ("kernel/watchdog: split up config options") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11MAINTAINERS: update email address for Alexandre BounineAlexandre Bounine1-1/+1
Link: http://lkml.kernel.org/r/1522958149-6157-1-git-send-email-alex.bou9@gmail.com Signed-off-by: Alexandre Bounine <alex.bou9@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Barry Wood <barry.wood@idt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11task_struct: only use anon struct under randstruct pluginKees Cook2-12/+3
The original intent for always adding the anonymous struct in task_struct was to make sure we had compiler coverage. However, this caused pathological padding of 40 bytes at the start of task_struct. Instead, move the anonymous struct to being only used when struct layout randomization is enabled. Link: http://lkml.kernel.org/r/20180327213609.GA2964@beast Fixes: 29e48ce87f1e ("task_struct: Allow randomized") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Peter Zijlstra <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11clang-format: add configuration fileMiguel Ojeda5-0/+629
clang-format is a tool to format C/C++/... code according to a set of rules and heuristics. Like most tools, it is not perfect nor covers every single case, but it is good enough to be helpful. In particular, it is useful for quickly re-formatting blocks of code automatically, for reviewing full files in order to spot coding style mistakes, typos and possible improvements. It is also handy for sorting ``#includes``, for aligning variables and macros, for reflowing text and other similar tasks. It also serves as a teaching tool/guide for newcomers. The tool itself has been already included in the repositories of popular Linux distributions for a long time. The rules in this file are intended for clang-format >= 4, which is easily available in most distributions. This commit adds the configuration file that contains the rules that the tool uses to know how to format the code according to the kernel coding style. This gives us several advantages: * clang-format works out of the box with reasonable defaults; avoiding that everyone has to re-do the configuration. * Everyone agrees (eventually) on what is the most useful default configuration for most of the kernel. * If it becomes commonplace among kernel developers, clang-format may feel compelled to support us better. They already recognize the Linux kernel and its style in their documentation and in one of the style sub-options. Some of clang-format's features relevant for the kernel are: * Uses clang's tooling support behind the scenes to parse and rewrite the code. It is not based on ad-hoc regexps. * Supports reasonably well the Linux kernel coding style. * Fast enough to be used at the press of a key. * There are already integrations (either built-in or third-party) for many common editors used by kernel developers (e.g. vim, emacs, Sublime, Atom...) that allow you to format an entire file or, more usefully, just your selection. * Able to parse unified diffs -- you can, for instance, reformat only the lines changed by a git commit. * Able to reflow text comments as well. * Widely supported and used by hundreds of developers in highly complex projects and organizations (e.g. the LLVM project itself, Chromium, WebKit, Google, Mozilla...). Therefore, it will be supported for a long time. See more information about the tool at: https://clang.llvm.org/docs/ClangFormat.html https://clang.llvm.org/docs/ClangFormatStyleOptions.html Link: http://lkml.kernel.org/r/20180318171632.qfkemw3mwbcukth6@gmail.com Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andy Whitcroft <apw@canonical.com> Cc: Joe Perches <joe@perches.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11uts: create "struct uts_namespace" from kmem_cacheAlexey Dobriyan3-4/+24
So "struct uts_namespace" can enjoy fine-grained SLAB debugging and usercopy protection. I'd prefer shorter name "utsns" but there is "user_namespace" already. Link: http://lkml.kernel.org/r/20180228215158.GA23146@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11taint: add taint for randstructKees Cook3-2/+6
Since the randstruct plugin can intentionally produce extremely unusual kernel structure layouts (even performance pathological ones), some maintainers want to be able to trivially determine if an Oops is coming from a randstruct-built kernel, so as to keep their sanity when debugging. This adds the new flag and initializes taint_mask immediately when built with randstruct. Link: http://lkml.kernel.org/r/1519084390-43867-4-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11taint: consolidate documentationKees Cook2-45/+31
This consolidates the taint bit documentation into a single place with both numeric and letter values. Additionally adds the missing TAINT_AUX documentation. Link: http://lkml.kernel.org/r/1519084390-43867-3-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11taint: convert to indexed initializationKees Cook2-17/+20
This converts to using indexed initializers instead of comments, adds a comment on why the taint flags can't be an enum, and make sure that no one forgets to update the taint_flags when adding new bits. Link: http://lkml.kernel.org/r/1519084390-43867-2-git-send-email-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: selftests: test /proc/uptimeAlexey Dobriyan5-0/+202
The only tests I could come up with for /proc/uptime are: - test that values increase monotonically for 1 second, - bounce around CPUs and test the same thing. Avoid glibc like plague for affinity given patches like this: https://marc.info/?l=linux-kernel&m=152130031912594&w=4 Link: http://lkml.kernel.org/r/20180317165235.GB3445@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: use slower rb_first()Alexey Dobriyan4-19/+17
In a typical for /proc "open+read+close" usecase, dentry is looked up successfully on open only to be killed in dput() on close. In fact dentries which aren't /proc/*/... and /proc/sys/* were almost NEVER CACHED. Simple printk in proc_lookup_de() shows that. Now that ->delete hook intelligently picks which dentries should live in dcache and which should not, rbtree caching is not necessary as dcache does it job, at last! As a side effect, struct proc_dir_entry shrinks by one pointer which can go into inline name. Link: http://lkml.kernel.org/r/20180314231032.GA15854@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: selftests: shotgun testing of read/readdir/readlink/writeAlexey Dobriyan3-0/+149
Perform reads with nearly everything in /proc, and some writing as well. Hopefully memleak checkers and KASAN will find something. [adobriyan@gmail.com: /proc/kmsg can and will block if read under root] Link: http://lkml.kernel.org/r/20180316232147.GA20146@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> [adobriyan@gmail.com: /proc/sysrq-trigger lives on the ground floor] Link: http://lkml.kernel.org/r/20180317164911.GA3445@avx2 Link: http://lkml.kernel.org/r/20180315201251.GA12396@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: switch struct proc_dir_entry::count to refcountAlexey Dobriyan3-5/+6
->count is honest reference count unlike ->in_use. Link: http://lkml.kernel.org/r/20180313174550.GA4332@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: reject "." and ".." as filenamesAlexey Dobriyan1-0/+8
Various subsystems can create files and directories in /proc with names directly controlled by userspace. Which means "/", "." and ".." are no-no. "/" split is already taken care of, do the other 2 prohibited names. Link: http://lkml.kernel.org/r/20180310001223.GB12443@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: add selftest for last field of /proc/loadavgAlexey Dobriyan3-0/+85
Test fork counter formerly known as ->last_pid, the only part of /proc/loadavg which can be tested. Testing in init pid namespace is not reliable because of background activity. Link: http://lkml.kernel.org/r/20180311152241.GA26247@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: do mmput ASAP for /proc/*/map_filesAlexey Dobriyan1-1/+1
mm_struct is not needed while printing as all the data was already extracted. Link: http://lkml.kernel.org/r/20180309223120.GC3843@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: faster /proc/cmdlineAlexey Dobriyan1-1/+2
Use seq_puts() and skip format string processing. Link: http://lkml.kernel.org/r/20180309222948.GB3843@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: register filesystem lastAlexey Dobriyan1-6/+2
As soon as register_filesystem() exits, filesystem can be mounted. It is better to present fully operational /proc. Of course it doesn't matter because /proc is not modular but do it anyway. Drop error check, it should be handled by panicking. Link: http://lkml.kernel.org/r/20180309222709.GA3843@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: fix /proc/*/map_files lookup some moreAlexey Dobriyan5-1/+176
I totally forgot that _parse_integer() accepts arbitrary amount of leading zeroes leading to the following lookups: OK # readlink /proc/1/map_files/56427ecba000-56427eddc000 /lib/systemd/systemd bogus # readlink /proc/1/map_files/00000000000056427ecba000-56427eddc000 /lib/systemd/systemd # readlink /proc/1/map_files/56427ecba000-00000000000056427eddc000 /lib/systemd/systemd Link: http://lkml.kernel.org/r/20180303215130.GA23480@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: move "struct proc_dir_entry" into kmem cacheAlexey Dobriyan5-23/+52
"struct proc_dir_entry" is variable sized because of 0-length trailing array for name, however, because of SLAB padding allocations it is possible to make "struct proc_dir_entry" fixed sized and allocate same amount of memory. It buys fine-grained debugging with poisoning and usercopy protection which is not possible with kmalloc-* caches. Currently, on 32-bit 91+ byte allocations go into kmalloc-128 and on 64-bit 147+ byte allocations go to kmalloc-192 anyway. Additional memory is allocated only for 38/46+ byte long names which are rare or may not even exist in the wild. Link: http://lkml.kernel.org/r/20180223205504.GA17139@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: test /proc/self/syscallAlexey Dobriyan3-1/+48
Read from /proc/self/syscall should yield read system call and correct args in the output as current is reading /proc/self/syscall. Link: http://lkml.kernel.org/r/20180226212145.GB742@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: test /proc/self/wchanAlexey Dobriyan5-0/+34
This patch starts testing /proc. Many more tests to come (I promise). Read from /proc/self/wchan should always return "0" as current is in TASK_RUNNING state while reading /proc/self/wchan. Link: http://lkml.kernel.org/r/20180226212006.GA742@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11fs/proc/proc_sysctl.c: remove redundant link check in proc_sys_link_fill_cache()Danilo Krummrich1-6/+3
proc_sys_link_fill_cache() does not need to check whether we're called for a link - it's already done by scan(). Link: http://lkml.kernel.org/r/20180228013506.4915-2-danilokrummrich@dk-develop.de Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl tableDanilo Krummrich1-0/+3
proc_sys_link_fill_cache() does not take currently unregistering sysctl tables into account, which might result into a page fault in sysctl_follow_link() - add a check to fix it. This bug has been present since v3.4. Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets") Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: use set_puts() at /proc/*/wchanAlexey Dobriyan1-1/+1
Link: http://lkml.kernel.org/r/20180217072011.GB16074@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: check permissions earlier for /proc/*/wchanAlexey Dobriyan1-5/+8
get_wchan() accesses stack page before permissions are checked, let's not play this game. Link: http://lkml.kernel.org/r/20180217071923.GA16074@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: replace seq_printf by seq_put_smth to speed up /proc/pid/statusAndrei Vagin1-5/+11
seq_printf() works slower than seq_puts, seq_puts, etc. == test_proc.c int main(int argc, char **argv) { int n, i, fd; char buf[16384]; n = atoi(argv[1]); for (i = 0; i < n; i++) { fd = open(argv[2], O_RDONLY); if (fd < 0) return 1; if (read(fd, buf, sizeof(buf)) <= 0) return 1; close(fd); } return 0; } == $ time ./test_proc 1000000 /proc/1/status == Before path == real 0m5.171s user 0m0.328s sys 0m4.783s == After patch == real 0m4.761s user 0m0.334s sys 0m4.366s Link: http://lkml.kernel.org/r/20180212074931.7227-4-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: optimize single-symbol delimiters to spead up seq_put_decimal_ullAndrei Vagin1-12/+12
A delimiter is a string which is printed before a number. A syngle-symbol delimiters can be printed by set_putc() and this works faster than printing by set_puts(). == test_proc.c int main(int argc, char **argv) { int n, i, fd; char buf[16384]; n = atoi(argv[1]); for (i = 0; i < n; i++) { fd = open(argv[2], O_RDONLY); if (fd < 0) return 1; if (read(fd, buf, sizeof(buf)) <= 0) return 1; close(fd); } return 0; } == $ time ./test_proc 1000000 /proc/1/stat == Before patch == real 0m3.820s user 0m0.337s sys 0m3.394s == After patch == real 0m3.110s user 0m0.324s sys 0m2.700s Link: http://lkml.kernel.org/r/20180212074931.7227-3-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: replace seq_printf on seq_putc to speed up /proc/pid/smapsAndrei Vagin1-2/+3
seq_putc() works much faster than seq_printf() == Before patch == $ time python test_smaps.py real 0m3.828s user 0m0.413s sys 0m3.408s == After patch == $ time python test_smaps.py real 0m3.405s user 0m0.401s sys 0m3.003s == Before patch == - 75.51% 4.62% python [kernel.kallsyms] [k] show_smap.isra.33 - 70.88% show_smap.isra.33 + 24.82% seq_put_decimal_ull_aligned + 19.78% __walk_page_range + 12.74% seq_printf + 11.08% show_map_vma.isra.23 + 1.68% seq_puts == After patch == - 69.16% 5.70% python [kernel.kallsyms] [k] show_smap.isra.33 - 63.46% show_smap.isra.33 + 25.98% seq_put_decimal_ull_aligned + 20.90% __walk_page_range + 12.60% show_map_vma.isra.23 1.56% seq_putc + 1.55% seq_puts Link: http://lkml.kernel.org/r/20180212074931.7227-2-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: add seq_put_decimal_ull_width to speed up /proc/pid/smapsAndrei Vagin7-102/+93
seq_put_decimal_ull_w(m, str, val, width) prints a decimal number with a specified minimal field width. It is equivalent of seq_printf(m, "%s%*d", str, width, val), but it works much faster. == test_smaps.py num = 0 with open("/proc/1/smaps") as f: for x in xrange(10000): data = f.read() f.seek(0, 0) == == Before patch == $ time python test_smaps.py real 0m4.593s user 0m0.398s sys 0m4.158s == After patch == $ time python test_smaps.py real 0m3.828s user 0m0.413s sys 0m3.408s $ perf -g record python test_smaps.py == Before patch == - 79.01% 3.36% python [kernel.kallsyms] [k] show_smap.isra.33 - 75.65% show_smap.isra.33 + 48.85% seq_printf + 15.75% __walk_page_range + 9.70% show_map_vma.isra.23 0.61% seq_puts == After patch == - 75.51% 4.62% python [kernel.kallsyms] [k] show_smap.isra.33 - 70.88% show_smap.isra.33 + 24.82% seq_put_decimal_ull_w + 19.78% __walk_page_range + 12.74% seq_printf + 11.08% show_map_vma.isra.23 + 1.68% seq_puts [akpm@linux-foundation.org: fix drivers/of/unittest.c build] Link: http://lkml.kernel.org/r/20180212074931.7227-1-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: account "struct pde_opener"Alexey Dobriyan1-1/+1
The allocation is persistent in fact as any fool can open a file in /proc and sit on it. Link: http://lkml.kernel.org/r/20180214082409.GC17157@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: move "struct pde_opener" to kmem cacheAlexey Dobriyan3-6/+10
"struct pde_opener" is fixed size and we can have more granular approach to debugging. For those who don't know, per cache SLUB poisoning and red zoning don't work if there is at least one object allocated which is hopeless in case of kmalloc-64 but not in case of standalone cache. Although systemd opens 2 files from the get go, so it is hopeless after all. Link: http://lkml.kernel.org/r/20180214082306.GB17157@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: randomize "struct pde_opener"Alexey Dobriyan1-1/+1
The more the merrier. Link: http://lkml.kernel.org/r/20180214081935.GA17157@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: faster open/close of files without ->release hookAlexey Dobriyan1-18/+23
The whole point of code in fs/proc/inode.c is to make sure ->release hook is called either at close() or at rmmod time. All if it is unnecessary if there is no ->release hook. Save allocation+list manipulations under spinlock in that case. Link: http://lkml.kernel.org/r/20180214063033.GA15579@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: move /proc/sysvipc creation to where it belongsAlexey Dobriyan2-4/+1
Move the proc_mkdir() call within the sysvipc subsystem such that we avoid polluting proc_root_init() with petty cpp. [dave@stgolabs.net: contributed changelog] Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: do less stuff under ->pde_unload_lockAlexey Dobriyan1-5/+9
Commit ca469f35a8e9ef ("deal with races between remove_proc_entry() and proc_reg_release()") moved too much stuff under ->pde_unload_lock making a problem described at series "[PATCH v5] procfs: Improve Scaling in proc" worse. While RCU is being figured out, move kfree() out of ->pde_unload_lock. On my potato, difference is only 0.5% speedup with concurrent open+read+close of /proc/cmdline, but the effect should be more noticeable on more capable machines. $ perf stat -r 16 -- ./proc-j 16 Performance counter stats for './proc-j 16' (16 runs): 130569.502377 task-clock (msec) # 15.872 CPUs utilized ( +- 0.05% ) 19,169 context-switches # 0.147 K/sec ( +- 0.18% ) 15 cpu-migrations # 0.000 K/sec ( +- 3.27% ) 437 page-faults # 0.003 K/sec ( +- 1.25% ) 300,172,097,675 cycles # 2.299 GHz ( +- 0.05% ) 96,793,267,308 instructions # 0.32 insn per cycle ( +- 0.04% ) 22,798,342,298 branches # 174.607 M/sec ( +- 0.04% ) 111,764,687 branch-misses # 0.49% of all branches ( +- 0.47% ) 8.226574400 seconds time elapsed ( +- 0.05% ) ^^^^^^^^^^^ $ perf stat -r 16 -- ./proc-j 16 Performance counter stats for './proc-j 16' (16 runs): 129866.777392 task-clock (msec) # 15.869 CPUs utilized ( +- 0.04% ) 19,154 context-switches # 0.147 K/sec ( +- 0.66% ) 14 cpu-migrations # 0.000 K/sec ( +- 1.73% ) 431 page-faults # 0.003 K/sec ( +- 1.09% ) 298,556,520,546 cycles # 2.299 GHz ( +- 0.04% ) 96,525,366,833 instructions # 0.32 insn per cycle ( +- 0.04% ) 22,730,194,043 branches # 175.027 M/sec ( +- 0.04% ) 111,506,074 branch-misses # 0.49% of all branches ( +- 0.18% ) 8.183629778 seconds time elapsed ( +- 0.04% ) ^^^^^^^^^^^ Link: http://lkml.kernel.org/r/20180213132911.GA24298@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: get rid of task lock/unlock pair to read umask for the "status" fileMateusz Guzik1-18/+5
get_task_umask locks/unlocks the task on its own. The only caller does the same thing immediately after. Utilize the fact the task has to be locked anyway and just do it once. Since there are no other users and the code is short, fold it in. Link: http://lkml.kernel.org/r/1517995608-23683-1-git-send-email-mguzik@redhat.com Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11procfs: optimize seq_pad() to speed up /proc/pid/mapsAndrei Vagin1-2/+8
seq_printf() is slow and it can be replaced by memset() in this case. == test.py num = 0 with open("/proc/1/maps") as f: while num < 10000 : data = f.read() f.seek(0, 0) num = num + 1 == == Before patch == $ time python test.py real 0m0.986s user 0m0.279s sys 0m0.707s == After patch == $ time python test.py real 0m0.932s user 0m0.261s sys 0m0.669s $ perf record -g python test.py == Before patch == - 47.35% 3.38% python [kernel.kallsyms] [k] show_map_vma.isra.23 - 43.97% show_map_vma.isra.23 + 20.84% seq_path - 15.73% show_vma_header_prefix + 6.96% seq_pad + 2.94% __GI___libc_read == After patch == - 44.01% 0.34% python [kernel.kallsyms] [k] show_pid_map - 43.67% show_pid_map - 42.91% show_map_vma.isra.23 + 21.55% seq_path - 15.68% show_vma_header_prefix + 2.08% seq_pad 0.55% seq_putc Link: http://lkml.kernel.org/r/20180112185812.7710-2-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11procfs: add seq_put_hex_ll to speed up /proc/pid/mapsAndrei Vagin3-9/+61
seq_put_hex_ll() prints a number in hexadecimal notation and works faster than seq_printf(). == test.py num = 0 with open("/proc/1/maps") as f: while num < 10000 : data = f.read() f.seek(0, 0) num = num + 1 == == Before patch == $ time python test.py real 0m1.561s user 0m0.257s sys 0m1.302s == After patch == $ time python test.py real 0m0.986s user 0m0.279s sys 0m0.707s $ perf -g record python test.py: == Before patch == - 67.42% 2.82% python [kernel.kallsyms] [k] show_map_vma.isra.22 - 64.60% show_map_vma.isra.22 - 44.98% seq_printf - seq_vprintf - vsnprintf + 14.85% number + 12.22% format_decode 5.56% memcpy_erms + 15.06% seq_path + 4.42% seq_pad + 2.45% __GI___libc_read == After patch == - 47.35% 3.38% python [kernel.kallsyms] [k] show_map_vma.isra.23 - 43.97% show_map_vma.isra.23 + 20.84% seq_path - 15.73% show_vma_header_prefix 10.55% seq_put_hex_ll + 2.65% seq_put_decimal_ull 0.95% seq_putc + 6.96% seq_pad + 2.94% __GI___libc_read [avagin@openvz.org: use unsigned int instead of int where it is suitable] Link: http://lkml.kernel.org/r/20180214025619.4005-1-avagin@openvz.org [avagin@openvz.org: v2] Link: http://lkml.kernel.org/r/20180117082050.25406-1-avagin@openvz.org Link: http://lkml.kernel.org/r/20180112185812.7710-1-avagin@openvz.org Signed-off-by: Andrei Vagin <avagin@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11kasan: prevent compiler from optimizing away memset in testsAndrey Konovalov1-0/+1
A compiler can optimize away memset calls by replacing them with mov instructions. There are KASAN tests that specifically test that KASAN correctly handles memset calls so we don't want this optimization to happen. The solution is to add -fno-builtin flag to test_kasan.ko Link: http://lkml.kernel.org/r/105ec9a308b2abedb1a0d1fdced0c22d765e4732.1519924383.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Chris Mason <clm@fb.com> Cc: Yury Norov <ynorov@caviumnetworks.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Jeff Layton <jlayton@redhat.com> Cc: "Jason A . Donenfeld" <Jason@zx2c4.com> Cc: Kostya Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11kasan: fix invalid-free test crashing the kernelAndrey Konovalov1-0/+8
When an invalid-free is triggered by one of the KASAN tests, the object doesn't actually get freed. This later leads to a BUG failure in kmem_cache_destroy that checks that there are no allocated objects in the cache that is being destroyed. Fix this by calling kmem_cache_free with the proper object address after the call that triggers invalid-free. Link: http://lkml.kernel.org/r/286eaefc0a6c3fa9b83b87e7d6dc0fbb5b5c9926.1519924383.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Chris Mason <clm@fb.com> Cc: Yury Norov <ynorov@caviumnetworks.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Jeff Layton <jlayton@redhat.com> Cc: "Jason A . Donenfeld" <Jason@zx2c4.com> Cc: Kostya Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11kasan, slub: fix handling of kasan_slab_free hookAndrey Konovalov1-23/+34
The kasan_slab_free hook's return value denotes whether the reuse of a slab object must be delayed (e.g. when the object is put into memory qurantine). The current way SLUB handles this hook is by ignoring its return value and hardcoding checks similar (but not exactly the same) to the ones performed in kasan_slab_free, which is prone to making mistakes. The main difference between the hardcoded checks and the ones in kasan_slab_free is whether we want to perform a free in case when an invalid-free or a double-free was detected (we don't). This patch changes the way SLUB handles this by: 1. taking into account the return value of kasan_slab_free for each of the objects, that are being freed; 2. reconstructing the freelist of objects to exclude the ones, whose reuse must be delayed. [andreyknvl@google.com: eliminate unnecessary branch in slab_free] Link: http://lkml.kernel.org/r/a62759a2545fddf69b0c034547212ca1eb1b3ce2.1520359686.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/083f58501e54731203801d899632d76175868e97.1519400992.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kostya Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11mm/thp: don't count ZONE_MOVABLE as the target for freepage reservingJoonsoo Kim1-1/+9
There was a regression report for "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE" [1] and I think that it is related to this problem. CMA patchset makes the system use one more zone (ZONE_MOVABLE) and then increases min_free_kbytes. It reduces usable memory and it could cause regression. ZONE_MOVABLE only has movable pages so we don't need to keep enough freepages to avoid or deal with fragmentation. So, don't count it. This changes min_free_kbytes and thus min_watermark greatly if ZONE_MOVABLE is used. It will make the user uses more memory. System: 22GB ram, fakenuma, 2 nodes. 5 zones are used. Before: min_free_kbytes: 112640 zone_info (min_watermark): Node 0, zone DMA min 19 Node 0, zone DMA32 min 3778 Node 0, zone Normal min 10191 Node 0, zone Movable min 0 Node 0, zone Device min 0 Node 1, zone DMA min 0 Node 1, zone DMA32 min 0 Node 1, zone Normal min 14043 Node 1, zone Movable min 127 Node 1, zone Device min 0 After: min_free_kbytes: 90112 zone_info (min_watermark): Node 0, zone DMA min 15 Node 0, zone DMA32 min 3022 Node 0, zone Normal min 8152 Node 0, zone Movable min 0 Node 0, zone Device min 0 Node 1, zone DMA min 0 Node 1, zone DMA32 min 0 Node 1, zone Normal min 11234 Node 1, zone Movable min 102 Node 1, zone Device min 0 [1] (lkml.kernel.org/r/20180102063528.GG30397%20()%20yexl-desktop) Link: http://lkml.kernel.org/r/1522913236-15776-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=yJoonsoo Kim1-1/+15
CMA area is now managed by the separate zone, ZONE_MOVABLE, to fix many MM related problems. In this implementation, if CONFIG_HIGHMEM = y, then ZONE_MOVABLE is considered as HIGHMEM and the memory of the CMA area is also considered as HIGHMEM. That means that they are considered as the page without direct mapping. However, CMA area could be in a lowmem and the memory could have direct mapping. In ARM, when establishing a new mapping for DMA, direct mapping should be cleared since two mapping with different cache policy could cause unknown problem. With this patch, PageHighmem() for the CMA memory located in lowmem returns true so that the function for DMA mapping cannot notice whether it needs to clear direct mapping or not, correctly. To handle this situation, this patch always clears direct mapping for such CMA memory. Link: http://lkml.kernel.org/r/1512114786-5085-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Tony Lindgren <tony@atomide.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11mm/cma: remove ALLOC_CMAJoonsoo Kim3-29/+4
Now, all reserved pages for CMA region are belong to the ZONE_MOVABLE and it only serves for a request with GFP_HIGHMEM && GFP_MOVABLE. Therefore, we don't need to maintain ALLOC_CMA at all. Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Tony Lindgren <tony@atomide.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLEJoonsoo Kim5-19/+126
Patch series "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE", v2. 0. History This patchset is the follow-up of the discussion about the "Introduce ZONE_CMA (v7)" [1]. Please reference it if more information is needed. 1. What does this patch do? This patch changes the management way for the memory of the CMA area in the MM subsystem. Currently the memory of the CMA area is managed by the zone where their pfn is belong to. However, this approach has some problems since MM subsystem doesn't have enough logic to handle the situation that different characteristic memories are in a single zone. To solve this issue, this patch try to manage all the memory of the CMA area by using the MOVABLE zone. In MM subsystem's point of view, characteristic of the memory on the MOVABLE zone and the memory of the CMA area are the same. So, managing the memory of the CMA area by using the MOVABLE zone will not have any problem. 2. Motivation There are some problems with current approach. See following. Although these problem would not be inherent and it could be fixed without this conception change, it requires many hooks addition in various code path and it would be intrusive to core MM and would be really error-prone. Therefore, I try to solve them with this new approach. Anyway, following is the problems of the current implementation. o CMA memory utilization First, following is the freepage calculation logic in MM. - For movable allocation: freepage = total freepage - For unmovable allocation: freepage = total freepage - CMA freepage Freepages on the CMA area is used after the normal freepages in the zone where the memory of the CMA area is belong to are exhausted. At that moment that the number of the normal freepages is zero, so - For movable allocation: freepage = total freepage = CMA freepage - For unmovable allocation: freepage = 0 If unmovable allocation comes at this moment, allocation request would fail to pass the watermark check and reclaim is started. After reclaim, there would exist the normal freepages so freepages on the CMA areas would not be used. FYI, there is another attempt [2] trying to solve this problem in lkml. And, as far as I know, Qualcomm also has out-of-tree solution for this problem. Useless reclaim: There is no logic to distinguish CMA pages in the reclaim path. Hence, CMA page is reclaimed even if the system just needs the page that can be usable for the kernel allocation. Atomic allocation failure: This is also related to the fallback allocation policy for the memory of the CMA area. Consider the situation that the number of the normal freepages is *zero* since the bunch of the movable allocation requests come. Kswapd would not be woken up due to following freepage calculation logic. - For movable allocation: freepage = total freepage = CMA freepage If atomic unmovable allocation request comes at this moment, it would fails due to following logic. - For unmovable allocation: freepage = total freepage - CMA freepage = 0 It was reported by Aneesh [3]. Useless compaction: Usual high-order allocation request is unmovable allocation request and it cannot be served from the memory of the CMA area. In compaction, migration scanner try to migrate the page in the CMA area and make high-order page there. As mentioned above, it cannot be usable for the unmovable allocation request so it's just waste. 3. Current approach and new approach Current approach is that the memory of the CMA area is managed by the zone where their pfn is belong to. However, these memory should be distinguishable since they have a strong limitation. So, they are marked as MIGRATE_CMA in pageblock flag and handled specially. However, as mentioned in section 2, the MM subsystem doesn't have enough logic to deal with this special pageblock so many problems raised. New approach is that the memory of the CMA area is managed by the MOVABLE zone. MM already have enough logic to deal with special zone like as HIGHMEM and MOVABLE zone. So, managing the memory of the CMA area by the MOVABLE zone just naturally work well because constraints for the memory of the CMA area that the memory should always be migratable is the same with the constraint for the MOVABLE zone. There is one side-effect for the usability of the memory of the CMA area. The use of MOVABLE zone is only allowed for a request with GFP_HIGHMEM && GFP_MOVABLE so now the memory of the CMA area is also only allowed for this gfp flag. Before this patchset, a request with GFP_MOVABLE can use them. IMO, It would not be a big issue since most of GFP_MOVABLE request also has GFP_HIGHMEM flag. For example, file cache page and anonymous page. However, file cache page for blockdev file is an exception. Request for it has no GFP_HIGHMEM flag. There is pros and cons on this exception. In my experience, blockdev file cache pages are one of the top reason that causes cma_alloc() to fail temporarily. So, we can get more guarantee of cma_alloc() success by discarding this case. Note that there is no change in admin POV since this patchset is just for internal implementation change in MM subsystem. Just one minor difference for admin is that the memory stat for CMA area will be printed in the MOVABLE zone. That's all. 4. Result Following is the experimental result related to utilization problem. 8 CPUs, 1024 MB, VIRTUAL MACHINE make -j16 <Before> CMA area: 0 MB 512 MB Elapsed-time: 92.4 186.5 pswpin: 82 18647 pswpout: 160 69839 <After> CMA : 0 MB 512 MB Elapsed-time: 93.1 93.4 pswpin: 84 46 pswpout: 183 92 akpm: "kernel test robot" reported a 26% improvement in vm-scalability.throughput: http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com [2]: https://lkml.org/lkml/2014/10/15/623 [3]: http://www.spinics.net/lists/linux-mm/msg100562.html Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Tony Lindgren <tony@atomide.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE requestJoonsoo Kim3-15/+17
Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that important to reserve. When ZONE_MOVABLE is used, this problem would theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE allocation request which is mainly used for page cache and anon page allocation. So, fix it by setting 0 to sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM]. And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size makes code complex. For example, if there is highmem system, following reserve ratio is activated for *NORMAL ZONE* which would be easyily misleading people. #ifdef CONFIG_HIGHMEM 32 #endif This patch also fixes this situation by defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES and place "#ifdef" to right place. Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Tony Lindgren <tony@atomide.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11mm: unclutter THP migrationMichal Hocko5-41/+36
THP migration is hacked into the generic migration with rather surprising semantic. The migration allocation callback is supposed to check whether the THP can be migrated at once and if that is not the case then it allocates a simple page to migrate. unmap_and_move then fixes that up by spliting the THP into small pages while moving the head page to the newly allocated order-0 page. Remaning pages are moved to the LRU list by split_huge_page. The same happens if the THP allocation fails. This is really ugly and error prone [1]. I also believe that split_huge_page to the LRU lists is inherently wrong because all tail pages are not migrated. Some callers will just work around that by retrying (e.g. memory hotplug). There are other pfn walkers which are simply broken though. e.g. madvise_inject_error will migrate head and then advances next pfn by the huge page size. do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind), will simply split the THP before migration if the THP migration is not supported then falls back to single page migration but it doesn't handle tail pages if the THP migration path is not able to allocate a fresh THP so we end up with ENOMEM and fail the whole migration which is a questionable behavior. Page compaction doesn't try to migrate large pages so it should be immune. This patch tries to unclutter the situation by moving the special THP handling up to the migrate_pages layer where it actually belongs. We simply split the THP page into the existing list if unmap_and_move fails with ENOMEM and retry. So we will _always_ migrate all THP subpages and specific migrate_pages users do not have to deal with this case in a special way. [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11mm, migrate: remove reason argument from new_page_tMichal Hocko10-35/+14
No allocation callback is using this argument anymore. new_page_node used to use this parameter to convey node_id resp. migration error up to move_pages code (do_move_page_to_node_array). The error status never made it into the final status field and we have a better way to communicate node id to the status field now. All other allocation callbacks simply ignored the argument so we can drop it finally. [mhocko@suse.com: fix migration callback] Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()] [mhocko@kernel.org: fix build] Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>