From 8c9a134cae6f8d66f35321b07baa202f75ef2199 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 21 Aug 2018 21:53:10 -0700 Subject: mm: clarify CONFIG_PAGE_POISONING and usage The Kconfig text for CONFIG_PAGE_POISONING doesn't mention that it has to be enabled explicitly. This updates the documentation for that and adds a note about CONFIG_PAGE_POISONING to the "page_poison" command line docs. While here, change the description of CONFIG_PAGE_POISONING_ZERO too, as it's not "random" data, but rather the fixed debugging value that would be used when not zeroing. This also removes a stray "bool" in the Kconfig. Link: http://lkml.kernel.org/r/20180725223832.GA43733@beast Signed-off-by: Kees Cook Reviewed-by: Andrew Morton Cc: Jonathan Corbet Cc: Laura Abbott Cc: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bffb0caa3693..970d837bd57f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3041,8 +3041,9 @@ on: enable the feature page_poison= [KNL] Boot-time parameter changing the state of - poisoning on the buddy allocator. - off: turn off poisoning + poisoning on the buddy allocator, available with + CONFIG_PAGE_POISONING=y. + off: turn off poisoning (default) on: turn on poisoning panic= [KNL] Kernel behaviour on panic: delay -- cgit v1.2.3 From 85f237a57f143c0c895dcb7cc53fa0174522ce07 Mon Sep 17 00:00:00 2001 From: juviliu Date: Tue, 21 Aug 2018 21:53:20 -0700 Subject: Documentation/sysctl/vm.txt: update __vm_enough_memory()'s path __vm_enough_memory has moved to mm/util.c. Link: http://lkml.kernel.org/r/E18EDF4A4FA4A04BBFA824B6D7699E532A7E5913@EXMBX-SZMAIL013.tencent.com Signed-off-by: Juvi Liu Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index e72853b6d725..7d73882e2c27 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -691,7 +691,7 @@ and don't use much of it. The default value is 0. See Documentation/vm/overcommit-accounting.rst and -mm/mmap.c::__vm_enough_memory() for more information. +mm/util.c::__vm_enough_memory() for more information. ============================================================== -- cgit v1.2.3 From 3d8b38eb81cac81395f6a823f6bf401b327268e6 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Tue, 21 Aug 2018 21:53:54 -0700 Subject: mm, oom: introduce memory.oom.group For some workloads an intervention from the OOM killer can be painful. Killing a random task can bring the workload into an inconsistent state. Historically, there have been two common solutions to this problem: 1) enabling panic_on_oom, 2) using a userspace daemon to monitor OOMs and kill all outstanding processes. Both approaches have their downsides: rebooting on each OOM is an obvious waste of capacity, and handling it all in userspace is tricky and requires a userspace agent that monitors all cgroups for OOMs. In most cases an in-kernel after-OOM cleanup mechanism can eliminate the need to enable panic_on_oom. Also, it can simplify cgroup management for userspace applications.
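As a minimal usage sketch (an editorial illustration, not part of this patch): assuming cgroup v2 is mounted at /sys/fs/cgroup and a cgroup named "workload" already exists (both are assumptions), a workload manager could opt into the group-kill behavior introduced below like this:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/*
	 * Sketch: ask the OOM killer to treat the "workload" cgroup as an
	 * indivisible unit. The cgroup path is assumed for illustration.
	 */
	int main(void)
	{
		int fd = open("/sys/fs/cgroup/workload/memory.oom.group", O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* the file accepts only "0" or "1"; see memory_oom_group_write() */
		if (write(fd, "1", 1) != 1)
			perror("write");
		close(fd);
		return 0;
	}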
This commit introduces a new knob for the cgroup v2 memory controller: memory.oom.group. The knob determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup or to its descendants (if the memory cgroup is not a leaf cgroup) are killed together or not at all. To determine which cgroup has to be killed, we traverse the cgroup hierarchy from the victim task's cgroup up to the OOMing cgroup (or root), looking for the highest-level cgroup with memory.oom.group set. Tasks with OOM protection (oom_score_adj set to -1000) are treated as an exception and are never killed. This patch doesn't change the OOM victim selection algorithm. Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com Signed-off-by: Roman Gushchin Acked-by: Michal Hocko Acked-by: Johannes Weiner Cc: David Rientjes Cc: Tetsuo Handa Cc: Tejun Heo Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 18 +++++++ include/linux/memcontrol.h | 18 +++++++ mm/memcontrol.c | 93 +++++++++++++++++++++++++++++++++ mm/oom_kill.c | 30 +++++++++++ 4 files changed, 159 insertions(+) (limited to 'Documentation') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 1746131bc9cb..184193bcb262 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1072,6 +1072,24 @@ PAGE_SIZE multiple when read back. high limit is used and monitored properly, this limit's utility is limited to providing the final safety net. + memory.oom.group + A read-write single value file which exists on non-root + cgroups. The default value is "0". + + Determines whether the cgroup should be treated as + an indivisible workload by the OOM killer. If set, + all tasks belonging to the cgroup or to its descendants + (if the memory cgroup is not a leaf cgroup) are killed + together or not at all. This can be used to avoid + partial kills to guarantee workload integrity. + + Tasks with OOM protection (oom_score_adj set to -1000) + are treated as an exception and are never killed. + + If the OOM killer is invoked in a cgroup, it's not going + to kill any tasks outside of this cgroup, regardless of + the memory.oom.group values of ancestor cgroups. + memory.events A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0e6c515fb698..652f602167df 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -225,6 +225,11 @@ struct mem_cgroup { */ bool use_hierarchy; + /* + * Should the OOM killer kill all of the cgroup's tasks, if it kills one of them?
+ */ + bool oom_group; + /* protected by memcg_oom_lock */ bool oom_lock; int under_oom; @@ -542,6 +547,9 @@ static inline bool task_in_memcg_oom(struct task_struct *p) } bool mem_cgroup_oom_synchronize(bool wait); +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim, + struct mem_cgroup *oom_domain); +void mem_cgroup_print_oom_group(struct mem_cgroup *memcg); #ifdef CONFIG_MEMCG_SWAP extern int do_swap_account; @@ -1001,6 +1009,16 @@ static inline bool mem_cgroup_oom_synchronize(bool wait) return false; } +static inline struct mem_cgroup *mem_cgroup_get_oom_group( + struct task_struct *victim, struct mem_cgroup *oom_domain) +{ + return NULL; +} + +static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) +{ +} + static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 59c14c988143..4ead5a4817de 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1776,6 +1776,62 @@ cleanup: return true; } +/** + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM + * @victim: task to be killed by the OOM killer + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM + * + * Returns a pointer to a memory cgroup, which has to be cleaned up + * by killing all of its OOM-killable tasks. + * + * The caller has to call mem_cgroup_put() on the returned non-NULL memcg. + */ +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim, + struct mem_cgroup *oom_domain) +{ + struct mem_cgroup *oom_group = NULL; + struct mem_cgroup *memcg; + + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) + return NULL; + + if (!oom_domain) + oom_domain = root_mem_cgroup; + + rcu_read_lock(); + + memcg = mem_cgroup_from_task(victim); + if (memcg == root_mem_cgroup) + goto out; + + /* + * Traverse the memory cgroup hierarchy from the victim task's + * cgroup up to the OOMing cgroup (or root) to find the + * highest-level memory cgroup with oom.group set.
+ */ + for (; memcg; memcg = parent_mem_cgroup(memcg)) { + if (memcg->oom_group) + oom_group = memcg; + + if (memcg == oom_domain) + break; + } + + if (oom_group) + css_get(&oom_group->css); +out: + rcu_read_unlock(); + + return oom_group; +} + +void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) +{ + pr_info("Tasks in "); + pr_cont_cgroup_path(memcg->css.cgroup); + pr_cont(" are going to be killed due to memory.oom.group set\n"); +} + /** * lock_page_memcg - lock a page->mem_cgroup binding * @page: the page @@ -5561,6 +5617,37 @@ static int memory_stat_show(struct seq_file *m, void *v) return 0; } +static int memory_oom_group_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + + seq_printf(m, "%d\n", memcg->oom_group); + + return 0; +} + +static ssize_t memory_oom_group_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int ret, oom_group; + + buf = strstrip(buf); + if (!buf) + return -EINVAL; + + ret = kstrtoint(buf, 0, &oom_group); + if (ret) + return ret; + + if (oom_group != 0 && oom_group != 1) + return -EINVAL; + + memcg->oom_group = oom_group; + + return nbytes; +} + static struct cftype memory_files[] = { { .name = "current", @@ -5602,6 +5689,12 @@ static struct cftype memory_files[] = { .flags = CFTYPE_NOT_ON_ROOT, .seq_show = memory_stat_show, }, + { + .name = "oom.group", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_oom_group_show, + .write = memory_oom_group_write, + }, { } /* terminate */ }; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 330416c67ce5..0e10b864e074 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -908,6 +908,19 @@ static void __oom_kill_process(struct task_struct *victim) } #undef K +/* + * Kill provided task unless it's secured by setting + * oom_score_adj to OOM_SCORE_ADJ_MIN. + */ +static int oom_kill_memcg_member(struct task_struct *task, void *unused) +{ + if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { + get_task_struct(task); + __oom_kill_process(task); + } + return 0; +} + static void oom_kill_process(struct oom_control *oc, const char *message) { struct task_struct *p = oc->chosen; @@ -915,6 +928,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message) struct task_struct *victim = p; struct task_struct *child; struct task_struct *t; + struct mem_cgroup *oom_group; unsigned int victim_points = 0; static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST); @@ -968,7 +982,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message) } read_unlock(&tasklist_lock); + /* + * Do we need to kill the entire memory cgroup? + * Or even one of the ancestor memory cgroups? + * Check this out before killing the victim task. + */ + oom_group = mem_cgroup_get_oom_group(victim, oc->memcg); + __oom_kill_process(victim); + + /* + * If necessary, kill all tasks in the selected memory cgroup. + */ + if (oom_group) { + mem_cgroup_print_oom_group(oom_group); + mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL); + mem_cgroup_put(oom_group); + } } /* -- cgit v1.2.3 From 7e8a6304d5419cbf056a59de92939e5eef039c57 Mon Sep 17 00:00:00 2001 From: "Dennis Zhou (Facebook)" Date: Tue, 21 Aug 2018 21:53:58 -0700 Subject: /proc/meminfo: add percpu populated pages count Currently, percpu memory only exposes allocation and utilization information via debugfs. 
That debugfs view is mostly useful for understanding fragmentation and allocation behavior at a per-chunk level, along with a few global counters. It is also gated behind a config option. BPF and cgroup, for example, have seen an increase in use, causing increased use of percpu memory. Let's make it easier for someone to identify how much memory is being used. This patch adds the "Percpu" stat to meminfo to make it easy to look up how much percpu memory is in use. This number includes the cost of all allocated backing pages, not just a per-unit, per-chunk view. Metadata is excluded. I think excluding metadata is fair because the backing memory scales with the number of cpus and can quickly outweigh the metadata. It also makes this calculation light. Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com Signed-off-by: Dennis Zhou Acked-by: Tejun Heo Acked-by: Roman Gushchin Reviewed-by: Andrew Morton Acked-by: David Rientjes Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Christoph Lameter Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 3 +++ fs/proc/meminfo.c | 2 ++ include/linux/percpu.h | 2 ++ mm/percpu.c | 29 +++++++++++++++++++++++++++++ 4 files changed, 36 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 1605acbb23b6..22b4b00dee31 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -870,6 +870,7 @@ Committed_AS: 100056 kB VmallocTotal: 112216 kB VmallocUsed: 428 kB VmallocChunk: 111088 kB +Percpu: 62080 kB HardwareCorrupted: 0 kB AnonHugePages: 49152 kB ShmemHugePages: 0 kB @@ -962,6 +963,8 @@ Committed_AS: The amount of memory presently allocated on the system. VmallocTotal: total size of vmalloc memory area VmallocUsed: amount of vmalloc area which is used VmallocChunk: largest contiguous block of vmalloc area which is free + Percpu: Memory allocated to the percpu allocator used to back percpu + allocations. This stat excludes the cost of metadata. .............................................................................. diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 2fb04846ed11..edda898714eb 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -7,6 +7,7 @@ #include #include #include +#include <linux/percpu.h> #include #include #include @@ -121,6 +122,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) (unsigned long)VMALLOC_TOTAL >> 10); show_val_kb(m, "VmallocUsed: ", 0ul); show_val_kb(m, "VmallocChunk: ", 0ul); + show_val_kb(m, "Percpu: ", pcpu_nr_pages()); #ifdef CONFIG_MEMORY_FAILURE seq_printf(m, "HardwareCorrupted: %5lu kB\n", diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 296bbe49d5d1..70b7123f38c7 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -149,4 +149,6 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr); (typeof(type) __percpu *)__alloc_percpu(sizeof(type), \ __alignof__(type)) +extern unsigned long pcpu_nr_pages(void); + #endif /* __LINUX_PERCPU_H */ diff --git a/mm/percpu.c b/mm/percpu.c index 0b6480979ac7..a749d4d96e3e 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -169,6 +169,14 @@ static LIST_HEAD(pcpu_map_extend_chunks); */ int pcpu_nr_empty_pop_pages; +/* + * The number of populated pages in use by the allocator, protected by + * pcpu_lock. This number is kept per unit per chunk (i.e.
when a page gets + * allocated/deallocated, it is allocated/deallocated in all units of a chunk + * and increments/decrements this count by 1). + */ +static unsigned long pcpu_nr_populated; + /* * Balance work is used to populate or destroy chunks asynchronously. We * try to keep the number of populated free pages between @@ -1232,6 +1240,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start, bitmap_set(chunk->populated, page_start, nr); chunk->nr_populated += nr; + pcpu_nr_populated += nr; if (!for_alloc) { chunk->nr_empty_pop_pages += nr; @@ -1260,6 +1269,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk, chunk->nr_populated -= nr; chunk->nr_empty_pop_pages -= nr; pcpu_nr_empty_pop_pages -= nr; + pcpu_nr_populated -= nr; } /* @@ -2176,6 +2186,9 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages; pcpu_chunk_relocate(pcpu_first_chunk, -1); + /* include all regions of the first chunk */ + pcpu_nr_populated += PFN_DOWN(size_sum); + pcpu_stats_chunk_alloc(); trace_percpu_create_chunk(base_addr); @@ -2745,6 +2758,22 @@ void __init setup_per_cpu_areas(void) #endif /* CONFIG_SMP */ +/* + * pcpu_nr_pages - calculate total number of populated backing pages + * + * This reflects the number of pages populated to back chunks. Metadata is + * excluded in the number exposed in meminfo as the number of backing pages + * scales with the number of cpus and can quickly outweigh the memory used for + * metadata. It also keeps this calculation nice and simple. + * + * RETURNS: + * Total number of populated backing pages in use by the allocator. + */ +unsigned long pcpu_nr_pages(void) +{ + return pcpu_nr_populated * pcpu_nr_units; +} + /* * Percpu allocator is initialized early during boot when neither slab nor * workqueue is available. Plug async management until everything is up -- cgit v1.2.3 From a2e514453861dd39b53b7a50b6771bd3f9852078 Mon Sep 17 00:00:00 2001 From: Dmitry Vyukov Date: Tue, 21 Aug 2018 21:55:52 -0700 Subject: kernel/hung_task.c: allow to set checking interval separately from timeout Currently the task hung checking interval is equal to the timeout; as a result, a hang is detected anywhere between timeout and 2*timeout. This is fine for most interactive environments, but it hurts automated testing setups (syzbot). In an automated setup we need to strictly order CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so that an RCU stall is not detected as a task hang and a task hang is not detected as silent machine loss. The large variance in task hang detection time requires setting the silent machine loss timeout to a very large value (e.g. if the task hang timeout is 3 mins, then the silent loss timeout needs to be set to ~7 mins). The additional 3 minutes significantly reduce testing efficiency because we usually crash the kernel within a minute, and this can add hours to the bug localization process as it needs to do dozens of tests. Allow setting the checking interval separately from the timeout. This allows setting the timeout to, say, 3 minutes, but the checking interval to 10 secs. The interval is controlled via a new hung_task_check_interval_secs sysctl, similar to the existing hung_task_timeout_secs sysctl. The default value of 0 results in the current behavior: the checking interval is equal to the timeout. [akpm@linux-foundation.org: update hung_task_timeout_max's comment] Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov Cc: Paul E.
McKenney Cc: Tetsuo Handa Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/kernel.txt | 15 ++++++++++++++- include/linux/sched.h | 1 + include/linux/sched/sysctl.h | 1 + kernel/fork.c | 1 + kernel/hung_task.c | 15 ++++++++++++++- kernel/sysctl.c | 13 ++++++++++++- 6 files changed, 43 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 59585030cbaf..9d38e75b19a0 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -38,6 +38,7 @@ show up in /proc/sys/kernel: - hung_task_panic - hung_task_check_count - hung_task_timeout_secs +- hung_task_check_interval_secs - hung_task_warnings - hyperv_record_panic_msg - kexec_load_disabled @@ -355,7 +356,7 @@ This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. hung_task_timeout_secs: -Check interval. When a task in D state did not get scheduled +When a task in D state did not get scheduled for more than this value report a warning. This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. @@ -364,6 +365,18 @@ Possible values to set are in range {0..LONG_MAX/HZ}. ============================================================== +hung_task_check_interval_secs: + +Hung task check interval. If hung task checking is enabled +(see hung_task_timeout_secs), the check is done every +hung_task_check_interval_secs seconds. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0 (default): means use hung_task_timeout_secs as checking interval. +Possible values to set are in range {0..LONG_MAX/HZ}. + +============================================================== + hung_task_warnings: The maximum number of warnings to report. 
During a check interval diff --git a/include/linux/sched.h b/include/linux/sched.h index 789923fbee3a..58eb3a2bc695 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -853,6 +853,7 @@ struct task_struct { #endif #ifdef CONFIG_DETECT_HUNG_TASK unsigned long last_switch_count; + unsigned long last_switch_time; #endif /* Filesystem information: */ struct fs_struct *fs; diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 913488d828cb..a9c32daeb9d8 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -10,6 +10,7 @@ struct ctl_table; extern int sysctl_hung_task_check_count; extern unsigned int sysctl_hung_task_panic; extern unsigned long sysctl_hung_task_timeout_secs; +extern unsigned long sysctl_hung_task_check_interval_secs; extern int sysctl_hung_task_warnings; extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write, void __user *buffer, diff --git a/kernel/fork.c b/kernel/fork.c index 8c760effa42e..8bcef2413f1b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1302,6 +1302,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk) tsk->nvcsw = tsk->nivcsw = 0; #ifdef CONFIG_DETECT_HUNG_TASK tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw; + tsk->last_switch_time = 0; #endif tsk->mm = NULL; diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 32b479468e4d..b9132d1269ef 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -40,6 +40,11 @@ int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT; */ unsigned long __read_mostly sysctl_hung_task_timeout_secs = CONFIG_DEFAULT_HUNG_TASK_TIMEOUT; +/* + * Zero (default value) means use sysctl_hung_task_timeout_secs: + */ +unsigned long __read_mostly sysctl_hung_task_check_interval_secs; + int __read_mostly sysctl_hung_task_warnings = 10; static int __read_mostly did_panic; @@ -98,8 +103,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) if (switch_count != t->last_switch_count) { t->last_switch_count = switch_count; + t->last_switch_time = jiffies; return; } + if (time_is_after_jiffies(t->last_switch_time + timeout * HZ)) + return; trace_sched_process_hang(t); @@ -245,8 +253,13 @@ static int watchdog(void *dummy) for ( ; ; ) { unsigned long timeout = sysctl_hung_task_timeout_secs; - long t = hung_timeout_jiffies(hung_last_checked, timeout); + unsigned long interval = sysctl_hung_task_check_interval_secs; + long t; + if (interval == 0) + interval = timeout; + interval = min_t(unsigned long, interval, timeout); + t = hung_timeout_jiffies(hung_last_checked, interval); if (t <= 0) { if (!atomic_xchg(&reset_hung_task, 0)) check_hung_uninterruptible_tasks(timeout); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index f22f76b7a138..dbdcb6673b27 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -145,7 +145,10 @@ static int minolduid; static int ngroups_max = NGROUPS_MAX; static const int cap_last_cap = CAP_LAST_CAP; -/*this is needed for proc_doulongvec_minmax of sysctl_hung_task_timeout_secs */ +/* + * This is needed for proc_doulongvec_minmax of sysctl_hung_task_timeout_secs + * and hung_task_check_interval_secs + */ #ifdef CONFIG_DETECT_HUNG_TASK static unsigned long hung_task_timeout_max = (LONG_MAX/HZ); #endif @@ -1090,6 +1093,14 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dohung_task_timeout_secs, .extra2 = &hung_task_timeout_max, }, + { + .procname = "hung_task_check_interval_secs", + .data = &sysctl_hung_task_check_interval_secs, + .maxlen = sizeof(unsigned long), + .mode 
= 0644, + .proc_handler = proc_dohung_task_timeout_secs, + .extra2 = &hung_task_timeout_max, + }, { .procname = "hung_task_warnings", .data = &sysctl_hung_task_warnings, -- cgit v1.2.3 From e2652ae6bd7492cdc0436817cbcd09282eb0bb03 Mon Sep 17 00:00:00 2001 From: Manfred Spraul Date: Tue, 21 Aug 2018 22:01:25 -0700 Subject: ipc: reorganize initialization of kern_ipc_perm.seq ipc_addid() initializes kern_ipc_perm.seq after having called idr_alloc() (within ipc_idr_alloc()). Thus a parallel semop() or msgrcv() that uses ipc_obtain_object_check() may see an uninitialized value. The patch moves the initialization of kern_ipc_perm.seq before the calls to idr_alloc(). Notes: 1) This patch has a user space visible side effect: If /proc/sys/kernel/*_next_id is used (i.e.: checkpoint/restore) and if semget()/msgget()/shmget() fails in the final step of adding the id to the rhash tree, then .._next_id is cleared. Before the patch, it remained unmodified. There is no change in behavior after a successful ..get() call: It always clears .._next_id, and there is no impact on non-checkpoint/restore code, as that code does not use .._next_id. 2) The patch correctly documents that after a call to ipc_idr_alloc(), the full tear-down sequence must be used. The callers of ipc_addid() do not fulfill that, i.e. more bugfixes are required. The patch is a squash of a patch from Dmitry and my own changes. Link: http://lkml.kernel.org/r/20180712185241.4017-3-manfred@colorfullife.com Reported-by: syzbot+2827ef6b3385deb07eaf@syzkaller.appspotmail.com Signed-off-by: Manfred Spraul Cc: Dmitry Vyukov Cc: Kees Cook Cc: Davidlohr Bueso Cc: Michael Kerrisk Cc: Davidlohr Bueso Cc: Herbert Xu Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/kernel.txt | 3 +- ipc/util.c | 91 ++++++++++++++++++++++------------------- 2 files changed, 50 insertions(+), 44 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 9d38e75b19a0..37a679501ddc 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -464,7 +464,8 @@ Notes: 1) kernel doesn't guarantee, that new object will have desired id. So, it's up to userspace, how to handle an object with "wrong" id. 2) Toggle with non-default value will be set back to -1 by kernel after -successful IPC object allocation. +successful IPC object allocation. If an IPC object allocation syscall +fails, it is undefined if the value remains unmodified or is reset to -1. ============================================================== diff --git a/ipc/util.c b/ipc/util.c index fdffff41f65b..e5c9e2b2e4c4 100644 --- a/ipc/util.c +++ b/ipc/util.c @@ -194,46 +194,54 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key) return NULL; } -#ifdef CONFIG_CHECKPOINT_RESTORE /* - * Specify desired id for next allocated IPC object. + * Insert new IPC object into idr tree, and set sequence number and id + * in the correct order. + * Especially: + * - the sequence number must be set before inserting the object into the idr, + * because the sequence number is accessed without a lock. + * - the id can/must be set after inserting the object into the idr. + * All accesses must be done after getting kern_ipc_perm.lock. + * + * The caller must own kern_ipc_perm.lock of the new object. + * On error, the function returns a (negative) error code. */ -#define ipc_idr_alloc(ids, new) \ - idr_alloc(&(ids)->ipcs_idr, (new), \ - (ids)->next_id < 0 ?
0 : ipcid_to_idx((ids)->next_id),\ - 0, GFP_NOWAIT) - -static inline int ipc_buildid(int id, struct ipc_ids *ids, - struct kern_ipc_perm *new) +static inline int ipc_idr_alloc(struct ipc_ids *ids, struct kern_ipc_perm *new) { - if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */ + int idx, next_id = -1; + +#ifdef CONFIG_CHECKPOINT_RESTORE + next_id = ids->next_id; + ids->next_id = -1; +#endif + + /* + * As soon as a new object is inserted into the idr, + * ipc_obtain_object_idr() or ipc_obtain_object_check() can find it, + * and the lockless preparations for ipc operations can start. + * This means especially: permission checks, audit calls, allocation + * of undo structures, ... + * + * Thus the object must be fully initialized, and if something fails, + * then the full tear-down sequence must be followed. + * (i.e.: set new->deleted, reduce refcount, call_rcu()) + */ + + if (next_id < 0) { /* !CHECKPOINT_RESTORE or next_id is unset */ new->seq = ids->seq++; if (ids->seq > IPCID_SEQ_MAX) ids->seq = 0; + idx = idr_alloc(&ids->ipcs_idr, new, 0, 0, GFP_NOWAIT); } else { - new->seq = ipcid_to_seqx(ids->next_id); - ids->next_id = -1; + new->seq = ipcid_to_seqx(next_id); + idx = idr_alloc(&ids->ipcs_idr, new, ipcid_to_idx(next_id), + 0, GFP_NOWAIT); } - - return SEQ_MULTIPLIER * new->seq + id; + if (idx >= 0) + new->id = SEQ_MULTIPLIER * new->seq + idx; + return idx; } -#else -#define ipc_idr_alloc(ids, new) \ - idr_alloc(&(ids)->ipcs_idr, (new), 0, 0, GFP_NOWAIT) - -static inline int ipc_buildid(int id, struct ipc_ids *ids, - struct kern_ipc_perm *new) -{ - new->seq = ids->seq++; - if (ids->seq > IPCID_SEQ_MAX) - ids->seq = 0; - - return SEQ_MULTIPLIER * new->seq + id; -} - -#endif /* CONFIG_CHECKPOINT_RESTORE */ - /** * ipc_addid - add an ipc identifier * @ids: ipc identifier set @@ -251,7 +259,7 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit) { kuid_t euid; kgid_t egid; - int id, err; + int idx, err; if (limit > IPCMNI) limit = IPCMNI; @@ -271,30 +279,27 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int limit) new->cuid = new->uid = euid; new->gid = new->cgid = egid; - id = ipc_idr_alloc(ids, new); + idx = ipc_idr_alloc(ids, new); idr_preload_end(); - if (id >= 0 && new->key != IPC_PRIVATE) { + if (idx >= 0 && new->key != IPC_PRIVATE) { err = rhashtable_insert_fast(&ids->key_ht, &new->khtnode, ipc_kht_params); if (err < 0) { - idr_remove(&ids->ipcs_idr, id); - id = err; + idr_remove(&ids->ipcs_idr, idx); + idx = err; } } - if (id < 0) { + if (idx < 0) { spin_unlock(&new->lock); rcu_read_unlock(); - return id; + return idx; } ids->in_use++; - if (id > ids->max_id) - ids->max_id = id; - - new->id = ipc_buildid(id, ids, new); - - return id; + if (idx > ids->max_id) + ids->max_id = idx; + return idx; } /** -- cgit v1.2.3