From 9cfe015aa424b3c003baba3841a60dd9b5ad319b Mon Sep 17 00:00:00 2001
From: Eric Dumazet
Date: Wed, 6 Feb 2008 01:37:16 -0800
Subject: get rid of NR_OPEN and introduce a sysctl_nr_open

NR_OPEN (historically set to 1024*1024) actually forbids processes to open
more than 1024*1024 handles.

Unfortunately some production servers hit the not so 'ridiculously high
value' of 1024*1024 file descriptors per process.

Changing NR_OPEN is not considered safe because of potential vmalloc space
exhaustion.

This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults to
1024*1024, so that admins can decide to change this limit if their workload
needs it.

[akpm@linux-foundation.org: export it for sparc64]
Signed-off-by: Eric Dumazet
Cc: Alan Cox
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: "David S. Miller"
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/sysctl/fs.txt | 10 ++++++++++
 1 file changed, 10 insertions(+)

(limited to 'Documentation/sysctl')

diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
index aa986a35e994..f99254327ae5 100644
--- a/Documentation/sysctl/fs.txt
+++ b/Documentation/sysctl/fs.txt
@@ -23,6 +23,7 @@ Currently, these files are in /proc/sys/fs:
 - inode-max
 - inode-nr
 - inode-state
+- nr_open
 - overflowuid
 - overflowgid
 - suid_dumpable
@@ -91,6 +92,15 @@ usage of file handles and you don't need to increase the maximum.
 
 ==============================================================
 
+nr_open:
+
+This denotes the maximum number of file-handles a process can
+allocate. Default value is 1024*1024 (1048576) which should be
+enough for most machines. Actual limit depends on RLIMIT_NOFILE
+resource limit.
+
+==============================================================
+
 inode-max, inode-nr & inode-state:
 
 As with file handles, the kernel allocates the inode structures
--
cgit v1.2.3


From fef1bdd68c81b71882ccb6f47c70980a03182063 Mon Sep 17 00:00:00 2001
From: David Rientjes
Date: Thu, 7 Feb 2008 00:14:07 -0800
Subject: oom: add sysctl to enable task memory dump

Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
dump of all system tasks (excluding kernel threads) when performing an
OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
oom_adj score, and name.

This is helpful for determining why there was an OOM condition and which
rogue task caused it.

It is configurable so that large systems, such as those with several
thousand tasks, do not incur a performance penalty associated with dumping
data they may not desire.

If an OOM was triggered as a result of a memory controller, the tasklist
shall be filtered to exclude tasks that are not a member of the same
cgroup.

Cc: Andrea Arcangeli
Cc: Christoph Lameter
Cc: Balbir Singh
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/sysctl/vm.txt | 22 ++++++++++++++++++++
 kernel/sysctl.c             |  9 +++++++++
 mm/oom_kill.c               | 49 ++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 75 insertions(+), 5 deletions(-)

(limited to 'Documentation/sysctl')

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 24eac1bc735d..8a4863c4edd4 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -32,6 +32,7 @@ Currently, these files are in /proc/sys/vm:
 - min_unmapped_ratio
 - min_slab_ratio
 - panic_on_oom
+- oom_dump_tasks
 - oom_kill_allocating_task
 - mmap_min_address
 - numa_zonelist_order
@@ -232,6 +233,27 @@ according to your policy of failover.
 
 =============================================================
 
+oom_dump_tasks
+
+Enables a system-wide task dump (excluding kernel threads) to be
+produced when the kernel performs an OOM-killing and includes such
+information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
+name. This is helpful to determine why the OOM killer was invoked
+and to identify the rogue task that caused it.
+
+If this is set to zero, this information is suppressed. On very
+large systems with thousands of tasks it may not be feasible to dump
+the memory state information for each one. Such systems should not
+be forced to incur a performance penalty in OOM conditions when the
+information may not be desired.
+
+If this is set to non-zero, this information is shown whenever the
+OOM killer actually kills a memory-hogging task.
+
+The default value is 0.
+
+=============================================================
+
 oom_kill_allocating_task
 
 This enables or disables killing the OOM-triggering task in
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 86daaa26d120..8c98d8147d88 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@ extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern int sysctl_panic_on_oom;
 extern int sysctl_oom_kill_allocating_task;
+extern int sysctl_oom_dump_tasks;
 extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
@@ -870,6 +871,14 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "oom_dump_tasks",
+		.data		= &sysctl_oom_dump_tasks,
+		.maxlen		= sizeof(sysctl_oom_dump_tasks),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{
 		.ctl_name	= VM_OVERCOMMIT_RATIO,
 		.procname	= "overcommit_ratio",
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ef5084dbc793..4194b9db0104 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -29,6 +29,7 @@
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
+int sysctl_oom_dump_tasks;
 static DEFINE_SPINLOCK(zone_scan_mutex);
 /* #define DEBUG */
 
@@ -262,6 +263,41 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	return chosen;
 }
 
+/**
+ * Dumps the current memory state of all system tasks, excluding kernel threads.
+ * State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
+ * score, and name.
+ *
+ * If mem is non-NULL, only tasks that are a member of the mem_cgroup are
+ * shown.
+ *
+ * Call with tasklist_lock read-locked.
+ */
+static void dump_tasks(const struct mem_cgroup *mem)
+{
+	struct task_struct *g, *p;
+
+	printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj "
+	       "name\n");
+	do_each_thread(g, p) {
+		/*
+		 * total_vm and rss sizes do not exist for tasks with a
+		 * detached mm so there's no need to report them.
+		 */
+		if (!p->mm)
+			continue;
+		if (mem && !task_in_mem_cgroup(p, mem))
+			continue;
+
+		task_lock(p);
+		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n",
+		       p->pid, p->uid, p->tgid, p->mm->total_vm,
+		       get_mm_rss(p->mm), (int)task_cpu(p), p->oomkilladj,
+		       p->comm);
+		task_unlock(p);
+	} while_each_thread(g, p);
+}
+
 /**
  * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO
  * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
@@ -339,7 +375,8 @@ static int oom_kill_task(struct task_struct *p)
 }
 
 static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
-			    unsigned long points, const char *message)
+			    unsigned long points, struct mem_cgroup *mem,
+			    const char *message)
 {
 	struct task_struct *c;
 
@@ -349,6 +386,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			current->comm, gfp_mask, order, current->oomkilladj);
 		dump_stack();
 		show_mem();
+		if (sysctl_oom_dump_tasks)
+			dump_tasks(mem);
 	}
 
 	/*
@@ -389,7 +428,7 @@ retry:
 	if (!p)
 		p = current;
 
-	if (oom_kill_process(p, gfp_mask, 0, points,
+	if (oom_kill_process(p, gfp_mask, 0, points, mem,
 				"Memory cgroup out of memory"))
 		goto retry;
 out:
@@ -495,7 +534,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, points,
+		oom_kill_process(current, gfp_mask, order, points, NULL,
 				"No available memory (MPOL_BIND)");
 		break;
 
@@ -505,7 +544,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
 		if (sysctl_oom_kill_allocating_task) {
-			oom_kill_process(current, gfp_mask, order, points,
+			oom_kill_process(current, gfp_mask, order, points, NULL,
 					"Out of memory (oom_kill_allocating_task)");
 			break;
 		}
@@ -525,7 +564,7 @@ retry:
 			panic("Out of memory and no killable processes...\n");
 		}
 
-		if (oom_kill_process(p, gfp_mask, order, points,
+		if (oom_kill_process(p, gfp_mask, order, points, NULL,
 				     "Out of memory"))
 			goto retry;
 
--
cgit v1.2.3


From 1ec7fd50ba4f845d1cf6b67acabd774577ef13b6 Mon Sep 17 00:00:00 2001
From: Jiri Kosina
Date: Sat, 9 Feb 2008 23:24:08 +0100
Subject: brk: document randomize_va_space and CONFIG_COMPAT_BRK (was Re:

Document randomize_va_space and CONFIG_COMPAT_BRK.

Signed-off-by: Jiri Kosina
Signed-off-by: Ingo Molnar
Signed-off-by: Thomas Gleixner
---
 Documentation/sysctl/kernel.txt | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

(limited to 'Documentation/sysctl')

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 8984a5396271..dc8801d4e944 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -41,6 +41,7 @@ show up in /proc/sys/kernel:
 - pid_max
 - powersave-nap               [ PPC only ]
 - printk
+- randomize_va_space
 - real-root-dev               ==> Documentation/initrd.txt
 - reboot-cmd                  [ SPARC only ]
 - rtsig-max
@@ -280,6 +281,34 @@ send before ratelimiting kicks in.
 
 ==============================================================
 
+randomize_va_space:
+
+This option can be used to select the type of process address
+space randomization that is used in the system, for architectures
+that support this feature.
+
+0 - Turn the process address space randomization off by default.
+
+1 - Make the addresses of mmap base, stack and VDSO page randomized.
+    This, among other things, implies that shared libraries will be
+    loaded to random addresses. Also for PIE-linked binaries, the location
+    of code start is randomized.
+
+    With heap randomization, the situation is a little bit more
+    complicated.
+    There are a few legacy applications out there (such as some ancient
+    versions of libc.so.5 from 1996) that assume that brk area starts
+    just after the end of the code+bss. These applications break when
+    start of the brk area is randomized. There are however no known
+    non-legacy applications that would be broken this way, so for most
+    systems it is safe to choose full randomization. However there is
+    a CONFIG_COMPAT_BRK option for systems with ancient and/or broken
+    binaries, that makes heap non-randomized, but keeps all other
+    parts of process address space randomized if randomize_va_space
+    sysctl is turned on.
+
+==============================================================
+
 reboot-cmd: (Sparc only)
 
 ??? This seems to be a way to give an argument to the Sparc
--
cgit v1.2.3


From ac76cff2ecd73944473a437cd87770f812635025 Mon Sep 17 00:00:00 2001
From: Michael Opdenacker
Date: Wed, 13 Feb 2008 15:03:32 -0800
Subject: Documentation: sysctl/kernel.txt: fix documentation reference

This patch fixes a reference to Documentation/kmod.txt which was apparently
renamed to Documentation/debugging-modules.txt.

Signed-off-by: Michael Opdenacker
Cc: "Randy.Dunlap"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/sysctl/kernel.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'Documentation/sysctl')

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index dc8801d4e944..276a7e637822 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -29,7 +29,7 @@ show up in /proc/sys/kernel:
 - java-interpreter            [ binfmt_java, obsolete ]
 - kstack_depth_to_print       [ X86 only ]
 - l2cr                        [ PPC only ]
-- modprobe                    ==> Documentation/kmod.txt
+- modprobe                    ==> Documentation/debugging-modules.txt
 - msgmax
 - msgmnb
 - msgmni
--
cgit v1.2.3