From 314e51b9851b4f4e8ab302243ff5a6fc6147f379 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Mon, 8 Oct 2012 16:29:02 -0700 Subject: mm: kill vma flag VM_RESERVED and mm->reserved_vm counter A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA, currently it lost original meaning but still has some effects: | effect | alternative flags -+------------------------+--------------------------------------------- 1| account as reserved_vm | VM_IO 2| skip in core dump | VM_IO, VM_DONTDUMP 3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP 4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP This patch removes reserved_vm counter from mm_struct. Seems like nobody cares about it, it does not exported into userspace directly, it only reduces total_vm showed in proc. Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP. remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: Konstantin Khlebnikov Cc: Alexander Viro Cc: Carsten Otte Cc: Chris Metcalf Cc: Cyrill Gorcunov Cc: Eric Paris Cc: H. Peter Anvin Cc: Hugh Dickins Cc: Ingo Molnar Cc: James Morris Cc: Jason Baron Cc: Kentaro Takeda Cc: Matt Helsley Cc: Nick Piggin Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Robert Richter Cc: Suresh Siddha Cc: Tetsuo Handa Cc: Venkatesh Pallipadi Acked-by: Linus Torvalds Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/unevictable-lru.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index fa206cccf89f..323ff5dba1cc 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -371,8 +371,8 @@ mlock_fixup() filters several classes of "special" VMAs: mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to allocate the huge pages and populate the ptes. -3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of - kernel pages, such as the VDSO page, relay channel pages, etc. These pages +3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages, + such as the VDSO page, relay channel pages, etc. These pages are inherently unevictable and are not managed on the LRU lists. mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls make_pages_present() to populate the ptes. -- cgit v1.2.3 From 01dc52ebdf472f77cca623ca693ca24cfc0f1bbe Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Mon, 8 Oct 2012 16:29:30 -0700 Subject: oom: remove deprecated oom_adj The deprecated /proc//oom_adj is scheduled for removal this month. Signed-off-by: Davidlohr Bueso Acked-by: David Rientjes Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/obsolete/proc-pid-oom_adj | 22 ------ Documentation/filesystems/proc.txt | 22 +----- fs/proc/base.c | 117 +--------------------------- include/linux/oom.h | 11 --- include/linux/sched.h | 1 - kernel/fork.c | 1 - mm/oom_kill.c | 4 +- 7 files changed, 7 insertions(+), 171 deletions(-) delete mode 100644 Documentation/ABI/obsolete/proc-pid-oom_adj (limited to 'Documentation') diff --git a/Documentation/ABI/obsolete/proc-pid-oom_adj b/Documentation/ABI/obsolete/proc-pid-oom_adj deleted file mode 100644 index 9a3cb88ade47..000000000000 --- a/Documentation/ABI/obsolete/proc-pid-oom_adj +++ /dev/null @@ -1,22 +0,0 @@ -What: /proc//oom_adj -When: August 2012 -Why: /proc//oom_adj allows userspace to influence the oom killer's - badness heuristic used to determine which task to kill when the kernel - is out of memory. - - The badness heuristic has since been rewritten since the introduction of - this tunable such that its meaning is deprecated. The value was - implemented as a bitshift on a score generated by the badness() - function that did not have any precise units of measure. With the - rewrite, the score is given as a proportion of available memory to the - task allocating pages, so using a bitshift which grows the score - exponentially is, thus, impossible to tune with fine granularity. - - A much more powerful interface, /proc//oom_score_adj, was - introduced with the oom killer rewrite that allows users to increase or - decrease the badness score linearly. This interface will replace - /proc//oom_adj. - - A warning will be emitted to the kernel log if an application uses this - deprecated interface. After it is printed once, future warnings will be - suppressed until the kernel is rebooted. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index fb0a6aeb936c..a1793d670cd0 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -33,7 +33,7 @@ Table of Contents 2 Modifying System Parameters 3 Per-Process Parameters - 3.1 /proc//oom_adj & /proc//oom_score_adj - Adjust the oom-killer + 3.1 /proc//oom_score_adj - Adjust the oom-killer score 3.2 /proc//oom_score - Display current oom-killer score 3.3 /proc//io - Display the IO accounting fields @@ -1320,10 +1320,10 @@ of the kernel. CHAPTER 3: PER-PROCESS PARAMETERS ------------------------------------------------------------------------------ -3.1 /proc//oom_adj & /proc//oom_score_adj- Adjust the oom-killer score +3.1 /proc//oom_score_adj- Adjust the oom-killer score -------------------------------------------------------------------------------- -These file can be used to adjust the badness heuristic used to select which +This file can be used to adjust the badness heuristic used to select which process gets killed in out of memory conditions. The badness heuristic assigns a value to each candidate task ranging from 0 @@ -1361,22 +1361,10 @@ same system, cpuset, mempolicy, or memory controller resources to use at least equivalent to discounting 50% of the task's allowed memory from being considered as scoring against the task. -For backwards compatibility with previous kernels, /proc//oom_adj may also -be used to tune the badness score. Its acceptable values range from -16 -(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 -(OOM_DISABLE) to disable oom killing entirely for that task. Its value is -scaled linearly with /proc//oom_score_adj. - -Writing to /proc//oom_score_adj or /proc//oom_adj will change the -other with its scaled value. - The value of /proc//oom_score_adj may be reduced no lower than the last value set by a CAP_SYS_RESOURCE process. To reduce the value any lower requires CAP_SYS_RESOURCE. -NOTICE: /proc//oom_adj is deprecated and will be removed, please see -Documentation/feature-removal-schedule.txt. - Caveat: when a parent task is selected, the oom killer will sacrifice any first generation children with separate address spaces instead, if possible. This avoids servers and important system daemons from being killed and loses the @@ -1387,9 +1375,7 @@ minimal amount of work. ------------------------------------------------------------- This file can be used to check the current score used by the oom-killer is for -any given . Use it together with /proc//oom_adj to tune which -process should be killed in an out-of-memory situation. - +any given . 3.3 /proc//io - Display the IO accounting fields ------------------------------------------------------- diff --git a/fs/proc/base.c b/fs/proc/base.c index d295af993677..ef5c84be66f9 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -873,111 +873,6 @@ static const struct file_operations proc_environ_operations = { .release = mem_release, }; -static ssize_t oom_adjust_read(struct file *file, char __user *buf, - size_t count, loff_t *ppos) -{ - struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode); - char buffer[PROC_NUMBUF]; - size_t len; - int oom_adjust = OOM_DISABLE; - unsigned long flags; - - if (!task) - return -ESRCH; - - if (lock_task_sighand(task, &flags)) { - oom_adjust = task->signal->oom_adj; - unlock_task_sighand(task, &flags); - } - - put_task_struct(task); - - len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust); - - return simple_read_from_buffer(buf, count, ppos, buffer, len); -} - -static ssize_t oom_adjust_write(struct file *file, const char __user *buf, - size_t count, loff_t *ppos) -{ - struct task_struct *task; - char buffer[PROC_NUMBUF]; - int oom_adjust; - unsigned long flags; - int err; - - memset(buffer, 0, sizeof(buffer)); - if (count > sizeof(buffer) - 1) - count = sizeof(buffer) - 1; - if (copy_from_user(buffer, buf, count)) { - err = -EFAULT; - goto out; - } - - err = kstrtoint(strstrip(buffer), 0, &oom_adjust); - if (err) - goto out; - if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) && - oom_adjust != OOM_DISABLE) { - err = -EINVAL; - goto out; - } - - task = get_proc_task(file->f_path.dentry->d_inode); - if (!task) { - err = -ESRCH; - goto out; - } - - task_lock(task); - if (!task->mm) { - err = -EINVAL; - goto err_task_lock; - } - - if (!lock_task_sighand(task, &flags)) { - err = -ESRCH; - goto err_task_lock; - } - - if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) { - err = -EACCES; - goto err_sighand; - } - - /* - * Warn that /proc/pid/oom_adj is deprecated, see - * Documentation/feature-removal-schedule.txt. - */ - printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, please use /proc/%d/oom_score_adj instead.\n", - current->comm, task_pid_nr(current), task_pid_nr(task), - task_pid_nr(task)); - task->signal->oom_adj = oom_adjust; - /* - * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum - * value is always attainable. - */ - if (task->signal->oom_adj == OOM_ADJUST_MAX) - task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX; - else - task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) / - -OOM_DISABLE; - trace_oom_score_adj_update(task); -err_sighand: - unlock_task_sighand(task, &flags); -err_task_lock: - task_unlock(task); - put_task_struct(task); -out: - return err < 0 ? err : count; -} - -static const struct file_operations proc_oom_adjust_operations = { - .read = oom_adjust_read, - .write = oom_adjust_write, - .llseek = generic_file_llseek, -}; - static ssize_t oom_score_adj_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { @@ -1051,15 +946,7 @@ static ssize_t oom_score_adj_write(struct file *file, const char __user *buf, if (has_capability_noaudit(current, CAP_SYS_RESOURCE)) task->signal->oom_score_adj_min = oom_score_adj; trace_oom_score_adj_update(task); - /* - * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is - * always attainable. - */ - if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) - task->signal->oom_adj = OOM_DISABLE; - else - task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) / - OOM_SCORE_ADJ_MAX; + err_sighand: unlock_task_sighand(task, &flags); err_task_lock: @@ -2710,7 +2597,6 @@ static const struct pid_entry tgid_base_stuff[] = { REG("cgroup", S_IRUGO, proc_cgroup_operations), #endif INF("oom_score", S_IRUGO, proc_oom_score), - REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), #ifdef CONFIG_AUDITSYSCALL REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), @@ -3077,7 +2963,6 @@ static const struct pid_entry tid_base_stuff[] = { REG("cgroup", S_IRUGO, proc_cgroup_operations), #endif INF("oom_score", S_IRUGO, proc_oom_score), - REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations), REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations), #ifdef CONFIG_AUDITSYSCALL REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations), diff --git a/include/linux/oom.h b/include/linux/oom.h index 49a3031fda50..d36a8221f58b 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -1,17 +1,6 @@ #ifndef __INCLUDE_LINUX_OOM_H #define __INCLUDE_LINUX_OOM_H -/* - * /proc//oom_adj is deprecated, see - * Documentation/feature-removal-schedule.txt. - * - * /proc//oom_adj set to -17 protects from the oom-killer - */ -#define OOM_DISABLE (-17) -/* inclusive */ -#define OOM_ADJUST_MIN (-16) -#define OOM_ADJUST_MAX 15 - /* * /proc//oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for * pid. diff --git a/include/linux/sched.h b/include/linux/sched.h index 9c5612f0374b..c2070e92a9d6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -671,7 +671,6 @@ struct signal_struct { struct rw_semaphore group_rwsem; #endif - int oom_adj; /* OOM kill score adjustment (bit shift) */ int oom_score_adj; /* OOM kill score adjustment */ int oom_score_adj_min; /* OOM kill score adjustment minimum value. * Only settable by CAP_SYS_RESOURCE. */ diff --git a/kernel/fork.c b/kernel/fork.c index ec667f797af3..972762e01024 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1056,7 +1056,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) init_rwsem(&sig->group_rwsem); #endif - sig->oom_adj = current->signal->oom_adj; sig->oom_score_adj = current->signal->oom_score_adj; sig->oom_score_adj_min = current->signal->oom_score_adj_min; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 198600861638..79e0f3e24831 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -428,8 +428,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, { task_lock(current); pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, " - "oom_adj=%d, oom_score_adj=%d\n", - current->comm, gfp_mask, order, current->signal->oom_adj, + "oom_score_adj=%d\n", + current->comm, gfp_mask, order, current->signal->oom_score_adj); cpuset_print_task_mems_allowed(current); task_unlock(current); -- cgit v1.2.3 From 14b94af0b251a2c80885b60538166fb7d04a642e Mon Sep 17 00:00:00 2001 From: Michel Lespinasse Date: Mon, 8 Oct 2012 16:31:17 -0700 Subject: rbtree: faster augmented rbtree manipulation Introduce new augmented rbtree APIs that allow minimal recalculation of augmented node information. A new callback is added to the rbtree insertion and erase rebalancing functions, to be called on each tree rotations. Such rotations preserve the subtree's root augmented value, but require recalculation of the one child that was previously located at the subtree root. In the insertion case, the handcoded search phase must be updated to maintain the augmented information on insertion, and then the rbtree coloring/rebalancing algorithms keep it up to date. In the erase case, things are more complicated since it is library code that manipulates the rbtree in order to remove internal nodes. This requires a couple additional callbacks to copy a subtree's augmented value when a new root is stitched in, and to recompute augmented values down the ancestry path when a node is removed from the tree. In order to preserve maximum speed for the non-augmented case, we provide two versions of each tree manipulation function. rb_insert_augmented() is the augmented equivalent of rb_insert_color(), and rb_erase_augmented() is the augmented equivalent of rb_erase(). Signed-off-by: Michel Lespinasse Acked-by: Rik van Riel Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: David Woodhouse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/rbtree.txt | 190 +++++++++++++++++++++++++++++++++++++++-------- include/linux/rbtree.h | 19 +++++ lib/rbtree.c | 83 +++++++++++++++++++-- lib/rbtree_test.c | 58 +++++++++++---- 4 files changed, 296 insertions(+), 54 deletions(-) (limited to 'Documentation') diff --git a/Documentation/rbtree.txt b/Documentation/rbtree.txt index 8d32d85a5234..0a0b6dce3e08 100644 --- a/Documentation/rbtree.txt +++ b/Documentation/rbtree.txt @@ -193,24 +193,42 @@ Example: Support for Augmented rbtrees ----------------------------- -Augmented rbtree is an rbtree with "some" additional data stored in each node. -This data can be used to augment some new functionality to rbtree. -Augmented rbtree is an optional feature built on top of basic rbtree -infrastructure. An rbtree user who wants this feature will have to call the -augmentation functions with the user provided augmentation callback -when inserting and erasing nodes. +Augmented rbtree is an rbtree with "some" additional data stored in +each node, where the additional data for node N must be a function of +the contents of all nodes in the subtree rooted at N. This data can +be used to augment some new functionality to rbtree. Augmented rbtree +is an optional feature built on top of basic rbtree infrastructure. +An rbtree user who wants this feature will have to call the augmentation +functions with the user provided augmentation callback when inserting +and erasing nodes. -On insertion, the user must call rb_augment_insert() once the new node is in -place. This will cause the augmentation function callback to be called for -each node between the new node and the root which has been affected by the -insertion. +On insertion, the user must update the augmented information on the path +leading to the inserted node, then call rb_link_node() as usual and +rb_augment_inserted() instead of the usual rb_insert_color() call. +If rb_augment_inserted() rebalances the rbtree, it will callback into +a user provided function to update the augmented information on the +affected subtrees. -When erasing a node, the user must call rb_augment_erase_begin() first to -retrieve the deepest node on the rebalance path. Then, after erasing the -original node, the user must call rb_augment_erase_end() with the deepest -node found earlier. This will cause the augmentation function to be called -for each affected node between the deepest node and the root. +When erasing a node, the user must call rb_erase_augmented() instead of +rb_erase(). rb_erase_augmented() calls back into user provided functions +to updated the augmented information on affected subtrees. +In both cases, the callbacks are provided through struct rb_augment_callbacks. +3 callbacks must be defined: + +- A propagation callback, which updates the augmented value for a given + node and its ancestors, up to a given stop point (or NULL to update + all the way to the root). + +- A copy callback, which copies the augmented value for a given subtree + to a newly assigned subtree root. + +- A tree rotation callback, which copies the augmented value for a given + subtree to a newly assigned subtree root AND recomputes the augmented + information for the former subtree root. + + +Sample usage: Interval tree is an example of augmented rb tree. Reference - "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein. @@ -230,26 +248,132 @@ and its immediate children. And this will be used in O(log n) lookup for lowest match (lowest start address among all possible matches) with something like: -find_lowest_match(lo, hi, node) +struct interval_tree_node * +interval_tree_first_match(struct rb_root *root, + unsigned long start, unsigned long last) { - lowest_match = NULL; - while (node) { - if (max_hi(node->left) > lo) { - // Lowest overlap if any must be on left side - node = node->left; - } else if (overlap(lo, hi, node)) { - lowest_match = node; - break; - } else if (lo > node->lo) { - // Lowest overlap if any must be on right side - node = node->right; - } else { - break; + struct interval_tree_node *node; + + if (!root->rb_node) + return NULL; + node = rb_entry(root->rb_node, struct interval_tree_node, rb); + + while (true) { + if (node->rb.rb_left) { + struct interval_tree_node *left = + rb_entry(node->rb.rb_left, + struct interval_tree_node, rb); + if (left->__subtree_last >= start) { + /* + * Some nodes in left subtree satisfy Cond2. + * Iterate to find the leftmost such node N. + * If it also satisfies Cond1, that's the match + * we are looking for. Otherwise, there is no + * matching interval as nodes to the right of N + * can't satisfy Cond1 either. + */ + node = left; + continue; + } } + if (node->start <= last) { /* Cond1 */ + if (node->last >= start) /* Cond2 */ + return node; /* node is leftmost match */ + if (node->rb.rb_right) { + node = rb_entry(node->rb.rb_right, + struct interval_tree_node, rb); + if (node->__subtree_last >= start) + continue; + } + } + return NULL; /* No match */ + } +} + +Insertion/removal are defined using the following augmented callbacks: + +static inline unsigned long +compute_subtree_last(struct interval_tree_node *node) +{ + unsigned long max = node->last, subtree_last; + if (node->rb.rb_left) { + subtree_last = rb_entry(node->rb.rb_left, + struct interval_tree_node, rb)->__subtree_last; + if (max < subtree_last) + max = subtree_last; + } + if (node->rb.rb_right) { + subtree_last = rb_entry(node->rb.rb_right, + struct interval_tree_node, rb)->__subtree_last; + if (max < subtree_last) + max = subtree_last; + } + return max; +} + +static void augment_propagate(struct rb_node *rb, struct rb_node *stop) +{ + while (rb != stop) { + struct interval_tree_node *node = + rb_entry(rb, struct interval_tree_node, rb); + unsigned long subtree_last = compute_subtree_last(node); + if (node->__subtree_last == subtree_last) + break; + node->__subtree_last = subtree_last; + rb = rb_parent(&node->rb); + } +} + +static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new) +{ + struct interval_tree_node *old = + rb_entry(rb_old, struct interval_tree_node, rb); + struct interval_tree_node *new = + rb_entry(rb_new, struct interval_tree_node, rb); + + new->__subtree_last = old->__subtree_last; +} + +static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new) +{ + struct interval_tree_node *old = + rb_entry(rb_old, struct interval_tree_node, rb); + struct interval_tree_node *new = + rb_entry(rb_new, struct interval_tree_node, rb); + + new->__subtree_last = old->__subtree_last; + old->__subtree_last = compute_subtree_last(old); +} + +static const struct rb_augment_callbacks augment_callbacks = { + augment_propagate, augment_copy, augment_rotate +}; + +void interval_tree_insert(struct interval_tree_node *node, + struct rb_root *root) +{ + struct rb_node **link = &root->rb_node, *rb_parent = NULL; + unsigned long start = node->start, last = node->last; + struct interval_tree_node *parent; + + while (*link) { + rb_parent = *link; + parent = rb_entry(rb_parent, struct interval_tree_node, rb); + if (parent->__subtree_last < last) + parent->__subtree_last = last; + if (start < parent->start) + link = &parent->rb.rb_left; + else + link = &parent->rb.rb_right; } - return lowest_match; + + node->__subtree_last = last; + rb_link_node(&node->rb, rb_parent, link); + rb_insert_augmented(&node->rb, root, &augment_callbacks); } -Finding exact match will be to first find lowest match and then to follow -successor nodes looking for exact match, until the start of a node is beyond -the hi value we are looking for. +void interval_tree_remove(struct interval_tree_node *node, + struct rb_root *root) +{ + rb_erase_augmented(&node->rb, root, &augment_callbacks); +} diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index bf836a2c6a32..c902eb9d6506 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -61,6 +61,25 @@ struct rb_root { extern void rb_insert_color(struct rb_node *, struct rb_root *); extern void rb_erase(struct rb_node *, struct rb_root *); + +struct rb_augment_callbacks { + void (*propagate)(struct rb_node *node, struct rb_node *stop); + void (*copy)(struct rb_node *old, struct rb_node *new); + void (*rotate)(struct rb_node *old, struct rb_node *new); +}; + +extern void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)); +extern void rb_erase_augmented(struct rb_node *node, struct rb_root *root, + const struct rb_augment_callbacks *augment); +static inline void +rb_insert_augmented(struct rb_node *node, struct rb_root *root, + const struct rb_augment_callbacks *augment) +{ + __rb_insert_augmented(node, root, augment->rotate); +} + + typedef void (*rb_augment_f)(struct rb_node *node, void *data); extern void rb_augment_insert(struct rb_node *node, diff --git a/lib/rbtree.c b/lib/rbtree.c index 938061ecbe61..a37ee7954b8f 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -105,7 +105,9 @@ __rb_rotate_set_parents(struct rb_node *old, struct rb_node *new, __rb_change_child(old, new, parent, root); } -void rb_insert_color(struct rb_node *node, struct rb_root *root) +static __always_inline void +__rb_insert(struct rb_node *node, struct rb_root *root, + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)) { struct rb_node *parent = rb_red_parent(node), *gparent, *tmp; @@ -169,6 +171,7 @@ void rb_insert_color(struct rb_node *node, struct rb_root *root) rb_set_parent_color(tmp, parent, RB_BLACK); rb_set_parent_color(parent, node, RB_RED); + augment_rotate(parent, node); parent = node; tmp = node->rb_right; } @@ -187,6 +190,7 @@ void rb_insert_color(struct rb_node *node, struct rb_root *root) if (tmp) rb_set_parent_color(tmp, gparent, RB_BLACK); __rb_rotate_set_parents(gparent, parent, root, RB_RED); + augment_rotate(gparent, parent); break; } else { tmp = gparent->rb_left; @@ -209,6 +213,7 @@ void rb_insert_color(struct rb_node *node, struct rb_root *root) rb_set_parent_color(tmp, parent, RB_BLACK); rb_set_parent_color(parent, node, RB_RED); + augment_rotate(parent, node); parent = node; tmp = node->rb_left; } @@ -219,13 +224,15 @@ void rb_insert_color(struct rb_node *node, struct rb_root *root) if (tmp) rb_set_parent_color(tmp, gparent, RB_BLACK); __rb_rotate_set_parents(gparent, parent, root, RB_RED); + augment_rotate(gparent, parent); break; } } } -EXPORT_SYMBOL(rb_insert_color); -static void __rb_erase_color(struct rb_node *parent, struct rb_root *root) +static __always_inline void +__rb_erase_color(struct rb_node *parent, struct rb_root *root, + const struct rb_augment_callbacks *augment) { struct rb_node *node = NULL, *sibling, *tmp1, *tmp2; @@ -254,6 +261,7 @@ static void __rb_erase_color(struct rb_node *parent, struct rb_root *root) rb_set_parent_color(tmp1, parent, RB_BLACK); __rb_rotate_set_parents(parent, sibling, root, RB_RED); + augment->rotate(parent, sibling); sibling = tmp1; } tmp1 = sibling->rb_right; @@ -305,6 +313,7 @@ static void __rb_erase_color(struct rb_node *parent, struct rb_root *root) if (tmp1) rb_set_parent_color(tmp1, sibling, RB_BLACK); + augment->rotate(sibling, tmp2); tmp1 = sibling; sibling = tmp2; } @@ -327,6 +336,7 @@ static void __rb_erase_color(struct rb_node *parent, struct rb_root *root) rb_set_parent(tmp2, parent); __rb_rotate_set_parents(parent, sibling, root, RB_BLACK); + augment->rotate(parent, sibling); break; } else { sibling = parent->rb_left; @@ -337,6 +347,7 @@ static void __rb_erase_color(struct rb_node *parent, struct rb_root *root) rb_set_parent_color(tmp1, parent, RB_BLACK); __rb_rotate_set_parents(parent, sibling, root, RB_RED); + augment->rotate(parent, sibling); sibling = tmp1; } tmp1 = sibling->rb_left; @@ -363,6 +374,7 @@ static void __rb_erase_color(struct rb_node *parent, struct rb_root *root) if (tmp1) rb_set_parent_color(tmp1, sibling, RB_BLACK); + augment->rotate(sibling, tmp2); tmp1 = sibling; sibling = tmp2; } @@ -374,12 +386,15 @@ static void __rb_erase_color(struct rb_node *parent, struct rb_root *root) rb_set_parent(tmp2, parent); __rb_rotate_set_parents(parent, sibling, root, RB_BLACK); + augment->rotate(parent, sibling); break; } } } -void rb_erase(struct rb_node *node, struct rb_root *root) +static __always_inline void +__rb_erase(struct rb_node *node, struct rb_root *root, + const struct rb_augment_callbacks *augment) { struct rb_node *child = node->rb_right, *tmp = node->rb_left; struct rb_node *parent, *rebalance; @@ -401,12 +416,14 @@ void rb_erase(struct rb_node *node, struct rb_root *root) rebalance = NULL; } else rebalance = __rb_is_black(pc) ? parent : NULL; + tmp = parent; } else if (!child) { /* Still case 1, but this time the child is node->rb_left */ tmp->__rb_parent_color = pc = node->__rb_parent_color; parent = __rb_parent(pc); __rb_change_child(node, tmp, parent, root); rebalance = NULL; + tmp = parent; } else { struct rb_node *successor = child, *child2; tmp = child->rb_left; @@ -420,8 +437,9 @@ void rb_erase(struct rb_node *node, struct rb_root *root) * \ * (c) */ - parent = child; - child2 = child->rb_right; + parent = successor; + child2 = successor->rb_right; + augment->copy(node, successor); } else { /* * Case 3: node's successor is leftmost under @@ -445,6 +463,8 @@ void rb_erase(struct rb_node *node, struct rb_root *root) parent->rb_left = child2 = successor->rb_right; successor->rb_right = child; rb_set_parent(child, successor); + augment->copy(node, successor); + augment->propagate(parent, successor); } successor->rb_left = tmp = node->rb_left; @@ -462,13 +482,62 @@ void rb_erase(struct rb_node *node, struct rb_root *root) successor->__rb_parent_color = pc; rebalance = __rb_is_black(pc2) ? parent : NULL; } + tmp = successor; } + augment->propagate(tmp, NULL); if (rebalance) - __rb_erase_color(rebalance, root); + __rb_erase_color(rebalance, root, augment); +} + +/* + * Non-augmented rbtree manipulation functions. + * + * We use dummy augmented callbacks here, and have the compiler optimize them + * out of the rb_insert_color() and rb_erase() function definitions. + */ + +static inline void dummy_propagate(struct rb_node *node, struct rb_node *stop) {} +static inline void dummy_copy(struct rb_node *old, struct rb_node *new) {} +static inline void dummy_rotate(struct rb_node *old, struct rb_node *new) {} + +static const struct rb_augment_callbacks dummy_callbacks = { + dummy_propagate, dummy_copy, dummy_rotate +}; + +void rb_insert_color(struct rb_node *node, struct rb_root *root) +{ + __rb_insert(node, root, dummy_rotate); +} +EXPORT_SYMBOL(rb_insert_color); + +void rb_erase(struct rb_node *node, struct rb_root *root) +{ + __rb_erase(node, root, &dummy_callbacks); } EXPORT_SYMBOL(rb_erase); +/* + * Augmented rbtree manipulation functions. + * + * This instantiates the same __always_inline functions as in the non-augmented + * case, but this time with user-defined callbacks. + */ + +void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)) +{ + __rb_insert(node, root, augment_rotate); +} +EXPORT_SYMBOL(__rb_insert_augmented); + +void rb_erase_augmented(struct rb_node *node, struct rb_root *root, + const struct rb_augment_callbacks *augment) +{ + __rb_erase(node, root, augment); +} +EXPORT_SYMBOL(rb_erase_augmented); + static void rb_augment_path(struct rb_node *node, rb_augment_f func, void *data) { struct rb_node *parent; diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c index 66e41d4bfc39..e28345df09bf 100644 --- a/lib/rbtree_test.c +++ b/lib/rbtree_test.c @@ -61,35 +61,65 @@ static inline u32 augment_recompute(struct test_node *node) return max; } -static void augment_callback(struct rb_node *rb, void *unused) +static void augment_propagate(struct rb_node *rb, struct rb_node *stop) { - struct test_node *node = rb_entry(rb, struct test_node, rb); - node->augmented = augment_recompute(node); + while (rb != stop) { + struct test_node *node = rb_entry(rb, struct test_node, rb); + u32 augmented = augment_recompute(node); + if (node->augmented == augmented) + break; + node->augmented = augmented; + rb = rb_parent(&node->rb); + } +} + +static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new) +{ + struct test_node *old = rb_entry(rb_old, struct test_node, rb); + struct test_node *new = rb_entry(rb_new, struct test_node, rb); + new->augmented = old->augmented; } +static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new) +{ + struct test_node *old = rb_entry(rb_old, struct test_node, rb); + struct test_node *new = rb_entry(rb_new, struct test_node, rb); + + /* Rotation doesn't change subtree's augmented value */ + new->augmented = old->augmented; + old->augmented = augment_recompute(old); +} + +static const struct rb_augment_callbacks augment_callbacks = { + augment_propagate, augment_copy, augment_rotate +}; + static void insert_augmented(struct test_node *node, struct rb_root *root) { - struct rb_node **new = &root->rb_node, *parent = NULL; + struct rb_node **new = &root->rb_node, *rb_parent = NULL; u32 key = node->key; + u32 val = node->val; + struct test_node *parent; while (*new) { - parent = *new; - if (key < rb_entry(parent, struct test_node, rb)->key) - new = &parent->rb_left; + rb_parent = *new; + parent = rb_entry(rb_parent, struct test_node, rb); + if (parent->augmented < val) + parent->augmented = val; + if (key < parent->key) + new = &parent->rb.rb_left; else - new = &parent->rb_right; + new = &parent->rb.rb_right; } - rb_link_node(&node->rb, parent, new); - rb_insert_color(&node->rb, root); - rb_augment_insert(&node->rb, augment_callback, NULL); + node->augmented = val; + rb_link_node(&node->rb, rb_parent, new); + rb_insert_augmented(&node->rb, root, &augment_callbacks); } static void erase_augmented(struct test_node *node, struct rb_root *root) { - struct rb_node *deepest = rb_augment_erase_begin(&node->rb); - rb_erase(&node->rb, root); - rb_augment_erase_end(deepest, augment_callback, NULL); + rb_erase_augmented(&node->rb, root, &augment_callbacks); } static void init(void) -- cgit v1.2.3 From 147e615f83c2c36caf89e7a3bf78090ade6f266c Mon Sep 17 00:00:00 2001 From: Michel Lespinasse Date: Mon, 8 Oct 2012 16:31:30 -0700 Subject: prio_tree: remove After both prio_tree users have been converted to use red-black trees, there is no need to keep around the prio tree library anymore. Signed-off-by: Michel Lespinasse Cc: Rik van Riel Cc: Hillf Danton Cc: Peter Zijlstra Cc: Catalin Marinas Cc: Andrea Arcangeli Cc: David Woodhouse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/00-INDEX | 2 - Documentation/prio_tree.txt | 107 ----------- include/linux/prio_tree.h | 120 ------------ init/main.c | 2 - lib/Kconfig.debug | 6 - lib/Makefile | 3 +- lib/prio_tree.c | 455 -------------------------------------------- lib/prio_tree_test.c | 106 ----------- 8 files changed, 1 insertion(+), 800 deletions(-) delete mode 100644 Documentation/prio_tree.txt delete mode 100644 include/linux/prio_tree.h delete mode 100644 lib/prio_tree.c delete mode 100644 lib/prio_tree_test.c (limited to 'Documentation') diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 49c051380daf..f54273e2ac97 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -270,8 +270,6 @@ preempt-locking.txt - info on locking under a preemptive kernel. printk-formats.txt - how to get printk format specifiers right -prio_tree.txt - - info on radix-priority-search-tree use for indexing vmas. ramoops.txt - documentation of the ramoops oops/panic logging module. rbtree.txt diff --git a/Documentation/prio_tree.txt b/Documentation/prio_tree.txt deleted file mode 100644 index 3aa68f9a117b..000000000000 --- a/Documentation/prio_tree.txt +++ /dev/null @@ -1,107 +0,0 @@ -The prio_tree.c code indexes vmas using 3 different indexes: - * heap_index = vm_pgoff + vm_size_in_pages : end_vm_pgoff - * radix_index = vm_pgoff : start_vm_pgoff - * size_index = vm_size_in_pages - -A regular radix-priority-search-tree indexes vmas using only heap_index and -radix_index. The conditions for indexing are: - * ->heap_index >= ->left->heap_index && - ->heap_index >= ->right->heap_index - * if (->heap_index == ->left->heap_index) - then ->radix_index < ->left->radix_index; - * if (->heap_index == ->right->heap_index) - then ->radix_index < ->right->radix_index; - * nodes are hashed to left or right subtree using radix_index - similar to a pure binary radix tree. - -A regular radix-priority-search-tree helps to store and query -intervals (vmas). However, a regular radix-priority-search-tree is only -suitable for storing vmas with different radix indices (vm_pgoff). - -Therefore, the prio_tree.c extends the regular radix-priority-search-tree -to handle many vmas with the same vm_pgoff. Such vmas are handled in -2 different ways: 1) All vmas with the same radix _and_ heap indices are -linked using vm_set.list, 2) if there are many vmas with the same radix -index, but different heap indices and if the regular radix-priority-search -tree cannot index them all, we build an overflow-sub-tree that indexes such -vmas using heap and size indices instead of heap and radix indices. For -example, in the figure below some vmas with vm_pgoff = 0 (zero) are -indexed by regular radix-priority-search-tree whereas others are pushed -into an overflow-subtree. Note that all vmas in an overflow-sub-tree have -the same vm_pgoff (radix_index) and if necessary we build different -overflow-sub-trees to handle each possible radix_index. For example, -in figure we have 3 overflow-sub-trees corresponding to radix indices -0, 2, and 4. - -In the final tree the first few (prio_tree_root->index_bits) levels -are indexed using heap and radix indices whereas the overflow-sub-trees below -those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are -indexed using heap and size indices. In overflow-sub-trees the size_index -is used for hashing the nodes to appropriate places. - -Now, an example prio_tree: - - vmas are represented [radix_index, size_index, heap_index] - i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff] - -level prio_tree_root->index_bits = 3 ------ - _ - 0 [0,7,7] | - / \ | - ------------------ ------------ | Regular - / \ | radix priority - 1 [1,6,7] [4,3,7] | search tree - / \ / \ | - ------- ----- ------ ----- | heap-and-radix - / \ / \ | indexed - 2 [0,6,6] [2,5,7] [5,2,7] [6,1,7] | - / \ / \ / \ / \ | - 3 [0,5,5] [1,5,6] [2,4,6] [3,4,7] [4,2,6] [5,1,6] [6,0,6] [7,0,7] | - / / / _ - / / / _ - 4 [0,4,4] [2,3,5] [4,1,5] | - / / / | - 5 [0,3,3] [2,2,4] [4,0,4] | Overflow-sub-trees - / / | - 6 [0,2,2] [2,1,3] | heap-and-size - / / | indexed - 7 [0,1,1] [2,0,2] | - / | - 8 [0,0,0] | - _ - -Note that we use prio_tree_root->index_bits to optimize the height -of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is -set according to the maximum end_vm_pgoff mapped, we are sure that all -bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore, -we only use the first prio_tree_root->index_bits as radix_index. -Whenever index_bits is increased in prio_tree_expand, we shuffle the tree -to make sure that the first prio_tree_root->index_bits levels of the tree -is indexed properly using heap and radix indices. - -We do not optimize the height of overflow-sub-trees using index_bits. -The reason is: there can be many such overflow-sub-trees and all of -them have to be suffled whenever the index_bits increases. This may involve -walking the whole prio_tree in prio_tree_insert->prio_tree_expand code -path which is not desirable. Hence, we do not optimize the height of the -heap-and-size indexed overflow-sub-trees using prio_tree->index_bits. -Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits -of size_index. This may lead to skewed sub-trees because most of the -higher significant bits of the size_index are likely to be 0 (zero). In -the example above, all 3 overflow-sub-trees are skewed. This may marginally -affect the performance. However, processes rarely map many vmas with the -same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally -do not require overflow-sub-trees to index all vmas. - -From the above discussion it is clear that the maximum height of -a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG. -However, in most of the common cases we do not need overflow-sub-trees, -so the tree height in the common cases will be prio_tree_root->index_bits. - -It is fair to mention here that the prio_tree_root->index_bits -is increased on demand, however, the index_bits is not decreased when -vmas are removed from the prio_tree. That's tricky to do. Hence, it's -left as a home work problem. - - diff --git a/include/linux/prio_tree.h b/include/linux/prio_tree.h deleted file mode 100644 index db04abb557e0..000000000000 --- a/include/linux/prio_tree.h +++ /dev/null @@ -1,120 +0,0 @@ -#ifndef _LINUX_PRIO_TREE_H -#define _LINUX_PRIO_TREE_H - -/* - * K&R 2nd ed. A8.3 somewhat obliquely hints that initial sequences of struct - * fields with identical types should end up at the same location. We'll use - * this until we can scrap struct raw_prio_tree_node. - * - * Note: all this could be done more elegantly by using unnamed union/struct - * fields. However, gcc 2.95.3 and apparently also gcc 3.0.4 don't support this - * language extension. - */ - -struct raw_prio_tree_node { - struct prio_tree_node *left; - struct prio_tree_node *right; - struct prio_tree_node *parent; -}; - -struct prio_tree_node { - struct prio_tree_node *left; - struct prio_tree_node *right; - struct prio_tree_node *parent; - unsigned long start; - unsigned long last; /* last location _in_ interval */ -}; - -struct prio_tree_root { - struct prio_tree_node *prio_tree_node; - unsigned short index_bits; - unsigned short raw; - /* - * 0: nodes are of type struct prio_tree_node - * 1: nodes are of type raw_prio_tree_node - */ -}; - -struct prio_tree_iter { - struct prio_tree_node *cur; - unsigned long mask; - unsigned long value; - int size_level; - - struct prio_tree_root *root; - pgoff_t r_index; - pgoff_t h_index; -}; - -static inline void prio_tree_iter_init(struct prio_tree_iter *iter, - struct prio_tree_root *root, pgoff_t r_index, pgoff_t h_index) -{ - iter->root = root; - iter->r_index = r_index; - iter->h_index = h_index; - iter->cur = NULL; -} - -#define __INIT_PRIO_TREE_ROOT(ptr, _raw) \ -do { \ - (ptr)->prio_tree_node = NULL; \ - (ptr)->index_bits = 1; \ - (ptr)->raw = (_raw); \ -} while (0) - -#define INIT_PRIO_TREE_ROOT(ptr) __INIT_PRIO_TREE_ROOT(ptr, 0) -#define INIT_RAW_PRIO_TREE_ROOT(ptr) __INIT_PRIO_TREE_ROOT(ptr, 1) - -#define INIT_PRIO_TREE_NODE(ptr) \ -do { \ - (ptr)->left = (ptr)->right = (ptr)->parent = (ptr); \ -} while (0) - -#define INIT_PRIO_TREE_ITER(ptr) \ -do { \ - (ptr)->cur = NULL; \ - (ptr)->mask = 0UL; \ - (ptr)->value = 0UL; \ - (ptr)->size_level = 0; \ -} while (0) - -#define prio_tree_entry(ptr, type, member) \ - ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member))) - -static inline int prio_tree_empty(const struct prio_tree_root *root) -{ - return root->prio_tree_node == NULL; -} - -static inline int prio_tree_root(const struct prio_tree_node *node) -{ - return node->parent == node; -} - -static inline int prio_tree_left_empty(const struct prio_tree_node *node) -{ - return node->left == node; -} - -static inline int prio_tree_right_empty(const struct prio_tree_node *node) -{ - return node->right == node; -} - - -struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root, - struct prio_tree_node *old, struct prio_tree_node *node); -struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root, - struct prio_tree_node *node); -void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node); -struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter); - -#define raw_prio_tree_replace(root, old, node) \ - prio_tree_replace(root, (struct prio_tree_node *) (old), \ - (struct prio_tree_node *) (node)) -#define raw_prio_tree_insert(root, node) \ - prio_tree_insert(root, (struct prio_tree_node *) (node)) -#define raw_prio_tree_remove(root, node) \ - prio_tree_remove(root, (struct prio_tree_node *) (node)) - -#endif /* _LINUX_PRIO_TREE_H */ diff --git a/init/main.c b/init/main.c index db34c0ec4711..313360fe1118 100644 --- a/init/main.c +++ b/init/main.c @@ -86,7 +86,6 @@ extern void init_IRQ(void); extern void fork_init(unsigned long); extern void mca_init(void); extern void sbus_init(void); -extern void prio_tree_init(void); extern void radix_tree_init(void); #ifndef CONFIG_DEBUG_RODATA static inline void mark_rodata_ro(void) { } @@ -547,7 +546,6 @@ asmlinkage void __init start_kernel(void) /* init some links before init_ISA_irqs() */ early_irq_init(); init_IRQ(); - prio_tree_init(); init_timers(); hrtimers_init(); softirq_init(); diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ee9f030b6951..a6e7e7741523 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1289,12 +1289,6 @@ config RBTREE_TEST A benchmark measuring the performance of the rbtree library. Also includes rbtree invariant checks. -config PRIO_TREE_TEST - tristate "Prio tree test" - depends on m && DEBUG_KERNEL - help - A benchmark measuring the performance of the prio tree library - config INTERVAL_TREE_TEST tristate "Interval tree test" depends on m && DEBUG_KERNEL diff --git a/lib/Makefile b/lib/Makefile index 26f578bf616a..3128e357e286 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -9,7 +9,7 @@ endif lib-y := ctype.o string.o vsprintf.o cmdline.o \ rbtree.o radix-tree.o dump_stack.o timerqueue.o\ - idr.o int_sqrt.o extable.o prio_tree.o \ + idr.o int_sqrt.o extable.o \ sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \ is_single_threaded.o plist.o decompress.o @@ -141,7 +141,6 @@ $(foreach file, $(libfdt_files), \ lib-$(CONFIG_LIBFDT) += $(libfdt_files) obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o -obj-$(CONFIG_PRIO_TREE_TEST) += prio_tree_test.o obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o interval_tree_test-objs := interval_tree_test_main.o interval_tree.o diff --git a/lib/prio_tree.c b/lib/prio_tree.c deleted file mode 100644 index bba37148c15e..000000000000 --- a/lib/prio_tree.c +++ /dev/null @@ -1,455 +0,0 @@ -/* - * lib/prio_tree.c - priority search tree - * - * Copyright (C) 2004, Rajesh Venkatasubramanian - * - * This file is released under the GPL v2. - * - * Based on the radix priority search tree proposed by Edward M. McCreight - * SIAM Journal of Computing, vol. 14, no.2, pages 257-276, May 1985 - * - * 02Feb2004 Initial version - */ - -#include -#include -#include -#include - -/* - * A clever mix of heap and radix trees forms a radix priority search tree (PST) - * which is useful for storing intervals, e.g, we can consider a vma as a closed - * interval of file pages [offset_begin, offset_end], and store all vmas that - * map a file in a PST. Then, using the PST, we can answer a stabbing query, - * i.e., selecting a set of stored intervals (vmas) that overlap with (map) a - * given input interval X (a set of consecutive file pages), in "O(log n + m)" - * time where 'log n' is the height of the PST, and 'm' is the number of stored - * intervals (vmas) that overlap (map) with the input interval X (the set of - * consecutive file pages). - * - * In our implementation, we store closed intervals of the form [radix_index, - * heap_index]. We assume that always radix_index <= heap_index. McCreight's PST - * is designed for storing intervals with unique radix indices, i.e., each - * interval have different radix_index. However, this limitation can be easily - * overcome by using the size, i.e., heap_index - radix_index, as part of the - * index, so we index the tree using [(radix_index,size), heap_index]. - * - * When the above-mentioned indexing scheme is used, theoretically, in a 32 bit - * machine, the maximum height of a PST can be 64. We can use a balanced version - * of the priority search tree to optimize the tree height, but the balanced - * tree proposed by McCreight is too complex and memory-hungry for our purpose. - */ - -/* - * The following macros are used for implementing prio_tree for i_mmap - */ - -static void get_index(const struct prio_tree_root *root, - const struct prio_tree_node *node, - unsigned long *radix, unsigned long *heap) -{ - *radix = node->start; - *heap = node->last; -} - -static unsigned long index_bits_to_maxindex[BITS_PER_LONG]; - -void __init prio_tree_init(void) -{ - unsigned int i; - - for (i = 0; i < ARRAY_SIZE(index_bits_to_maxindex) - 1; i++) - index_bits_to_maxindex[i] = (1UL << (i + 1)) - 1; - index_bits_to_maxindex[ARRAY_SIZE(index_bits_to_maxindex) - 1] = ~0UL; -} - -/* - * Maximum heap_index that can be stored in a PST with index_bits bits - */ -static inline unsigned long prio_tree_maxindex(unsigned int bits) -{ - return index_bits_to_maxindex[bits - 1]; -} - -static void prio_set_parent(struct prio_tree_node *parent, - struct prio_tree_node *child, bool left) -{ - if (left) - parent->left = child; - else - parent->right = child; - - child->parent = parent; -} - -/* - * Extend a priority search tree so that it can store a node with heap_index - * max_heap_index. In the worst case, this algorithm takes O((log n)^2). - * However, this function is used rarely and the common case performance is - * not bad. - */ -static struct prio_tree_node *prio_tree_expand(struct prio_tree_root *root, - struct prio_tree_node *node, unsigned long max_heap_index) -{ - struct prio_tree_node *prev; - - if (max_heap_index > prio_tree_maxindex(root->index_bits)) - root->index_bits++; - - prev = node; - INIT_PRIO_TREE_NODE(node); - - while (max_heap_index > prio_tree_maxindex(root->index_bits)) { - struct prio_tree_node *tmp = root->prio_tree_node; - - root->index_bits++; - - if (prio_tree_empty(root)) - continue; - - prio_tree_remove(root, root->prio_tree_node); - INIT_PRIO_TREE_NODE(tmp); - - prio_set_parent(prev, tmp, true); - prev = tmp; - } - - if (!prio_tree_empty(root)) - prio_set_parent(prev, root->prio_tree_node, true); - - root->prio_tree_node = node; - return node; -} - -/* - * Replace a prio_tree_node with a new node and return the old node - */ -struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root, - struct prio_tree_node *old, struct prio_tree_node *node) -{ - INIT_PRIO_TREE_NODE(node); - - if (prio_tree_root(old)) { - BUG_ON(root->prio_tree_node != old); - /* - * We can reduce root->index_bits here. However, it is complex - * and does not help much to improve performance (IMO). - */ - root->prio_tree_node = node; - } else - prio_set_parent(old->parent, node, old->parent->left == old); - - if (!prio_tree_left_empty(old)) - prio_set_parent(node, old->left, true); - - if (!prio_tree_right_empty(old)) - prio_set_parent(node, old->right, false); - - return old; -} - -/* - * Insert a prio_tree_node @node into a radix priority search tree @root. The - * algorithm typically takes O(log n) time where 'log n' is the number of bits - * required to represent the maximum heap_index. In the worst case, the algo - * can take O((log n)^2) - check prio_tree_expand. - * - * If a prior node with same radix_index and heap_index is already found in - * the tree, then returns the address of the prior node. Otherwise, inserts - * @node into the tree and returns @node. - */ -struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root, - struct prio_tree_node *node) -{ - struct prio_tree_node *cur, *res = node; - unsigned long radix_index, heap_index; - unsigned long r_index, h_index, index, mask; - int size_flag = 0; - - get_index(root, node, &radix_index, &heap_index); - - if (prio_tree_empty(root) || - heap_index > prio_tree_maxindex(root->index_bits)) - return prio_tree_expand(root, node, heap_index); - - cur = root->prio_tree_node; - mask = 1UL << (root->index_bits - 1); - - while (mask) { - get_index(root, cur, &r_index, &h_index); - - if (r_index == radix_index && h_index == heap_index) - return cur; - - if (h_index < heap_index || - (h_index == heap_index && r_index > radix_index)) { - struct prio_tree_node *tmp = node; - node = prio_tree_replace(root, cur, node); - cur = tmp; - /* swap indices */ - index = r_index; - r_index = radix_index; - radix_index = index; - index = h_index; - h_index = heap_index; - heap_index = index; - } - - if (size_flag) - index = heap_index - radix_index; - else - index = radix_index; - - if (index & mask) { - if (prio_tree_right_empty(cur)) { - INIT_PRIO_TREE_NODE(node); - prio_set_parent(cur, node, false); - return res; - } else - cur = cur->right; - } else { - if (prio_tree_left_empty(cur)) { - INIT_PRIO_TREE_NODE(node); - prio_set_parent(cur, node, true); - return res; - } else - cur = cur->left; - } - - mask >>= 1; - - if (!mask) { - mask = 1UL << (BITS_PER_LONG - 1); - size_flag = 1; - } - } - /* Should not reach here */ - BUG(); - return NULL; -} -EXPORT_SYMBOL(prio_tree_insert); - -/* - * Remove a prio_tree_node @node from a radix priority search tree @root. The - * algorithm takes O(log n) time where 'log n' is the number of bits required - * to represent the maximum heap_index. - */ -void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node) -{ - struct prio_tree_node *cur; - unsigned long r_index, h_index_right, h_index_left; - - cur = node; - - while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) { - if (!prio_tree_left_empty(cur)) - get_index(root, cur->left, &r_index, &h_index_left); - else { - cur = cur->right; - continue; - } - - if (!prio_tree_right_empty(cur)) - get_index(root, cur->right, &r_index, &h_index_right); - else { - cur = cur->left; - continue; - } - - /* both h_index_left and h_index_right cannot be 0 */ - if (h_index_left >= h_index_right) - cur = cur->left; - else - cur = cur->right; - } - - if (prio_tree_root(cur)) { - BUG_ON(root->prio_tree_node != cur); - __INIT_PRIO_TREE_ROOT(root, root->raw); - return; - } - - if (cur->parent->right == cur) - cur->parent->right = cur->parent; - else - cur->parent->left = cur->parent; - - while (cur != node) - cur = prio_tree_replace(root, cur->parent, cur); -} -EXPORT_SYMBOL(prio_tree_remove); - -static void iter_walk_down(struct prio_tree_iter *iter) -{ - iter->mask >>= 1; - if (iter->mask) { - if (iter->size_level) - iter->size_level++; - return; - } - - if (iter->size_level) { - BUG_ON(!prio_tree_left_empty(iter->cur)); - BUG_ON(!prio_tree_right_empty(iter->cur)); - iter->size_level++; - iter->mask = ULONG_MAX; - } else { - iter->size_level = 1; - iter->mask = 1UL << (BITS_PER_LONG - 1); - } -} - -static void iter_walk_up(struct prio_tree_iter *iter) -{ - if (iter->mask == ULONG_MAX) - iter->mask = 1UL; - else if (iter->size_level == 1) - iter->mask = 1UL; - else - iter->mask <<= 1; - if (iter->size_level) - iter->size_level--; - if (!iter->size_level && (iter->value & iter->mask)) - iter->value ^= iter->mask; -} - -/* - * Following functions help to enumerate all prio_tree_nodes in the tree that - * overlap with the input interval X [radix_index, heap_index]. The enumeration - * takes O(log n + m) time where 'log n' is the height of the tree (which is - * proportional to # of bits required to represent the maximum heap_index) and - * 'm' is the number of prio_tree_nodes that overlap the interval X. - */ - -static struct prio_tree_node *prio_tree_left(struct prio_tree_iter *iter, - unsigned long *r_index, unsigned long *h_index) -{ - if (prio_tree_left_empty(iter->cur)) - return NULL; - - get_index(iter->root, iter->cur->left, r_index, h_index); - - if (iter->r_index <= *h_index) { - iter->cur = iter->cur->left; - iter_walk_down(iter); - return iter->cur; - } - - return NULL; -} - -static struct prio_tree_node *prio_tree_right(struct prio_tree_iter *iter, - unsigned long *r_index, unsigned long *h_index) -{ - unsigned long value; - - if (prio_tree_right_empty(iter->cur)) - return NULL; - - if (iter->size_level) - value = iter->value; - else - value = iter->value | iter->mask; - - if (iter->h_index < value) - return NULL; - - get_index(iter->root, iter->cur->right, r_index, h_index); - - if (iter->r_index <= *h_index) { - iter->cur = iter->cur->right; - iter_walk_down(iter); - return iter->cur; - } - - return NULL; -} - -static struct prio_tree_node *prio_tree_parent(struct prio_tree_iter *iter) -{ - iter->cur = iter->cur->parent; - iter_walk_up(iter); - return iter->cur; -} - -static inline int overlap(struct prio_tree_iter *iter, - unsigned long r_index, unsigned long h_index) -{ - return iter->h_index >= r_index && iter->r_index <= h_index; -} - -/* - * prio_tree_first: - * - * Get the first prio_tree_node that overlaps with the interval [radix_index, - * heap_index]. Note that always radix_index <= heap_index. We do a pre-order - * traversal of the tree. - */ -static struct prio_tree_node *prio_tree_first(struct prio_tree_iter *iter) -{ - struct prio_tree_root *root; - unsigned long r_index, h_index; - - INIT_PRIO_TREE_ITER(iter); - - root = iter->root; - if (prio_tree_empty(root)) - return NULL; - - get_index(root, root->prio_tree_node, &r_index, &h_index); - - if (iter->r_index > h_index) - return NULL; - - iter->mask = 1UL << (root->index_bits - 1); - iter->cur = root->prio_tree_node; - - while (1) { - if (overlap(iter, r_index, h_index)) - return iter->cur; - - if (prio_tree_left(iter, &r_index, &h_index)) - continue; - - if (prio_tree_right(iter, &r_index, &h_index)) - continue; - - break; - } - return NULL; -} - -/* - * prio_tree_next: - * - * Get the next prio_tree_node that overlaps with the input interval in iter - */ -struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter) -{ - unsigned long r_index, h_index; - - if (iter->cur == NULL) - return prio_tree_first(iter); - -repeat: - while (prio_tree_left(iter, &r_index, &h_index)) - if (overlap(iter, r_index, h_index)) - return iter->cur; - - while (!prio_tree_right(iter, &r_index, &h_index)) { - while (!prio_tree_root(iter->cur) && - iter->cur->parent->right == iter->cur) - prio_tree_parent(iter); - - if (prio_tree_root(iter->cur)) - return NULL; - - prio_tree_parent(iter); - } - - if (overlap(iter, r_index, h_index)) - return iter->cur; - - goto repeat; -} -EXPORT_SYMBOL(prio_tree_next); diff --git a/lib/prio_tree_test.c b/lib/prio_tree_test.c deleted file mode 100644 index c26084ddc6a4..000000000000 --- a/lib/prio_tree_test.c +++ /dev/null @@ -1,106 +0,0 @@ -#include -#include -#include -#include - -#define NODES 100 -#define PERF_LOOPS 100000 -#define SEARCHES 100 -#define SEARCH_LOOPS 10000 - -static struct prio_tree_root root; -static struct prio_tree_node nodes[NODES]; -static u32 queries[SEARCHES]; - -static struct rnd_state rnd; - -static inline unsigned long -search(unsigned long query, struct prio_tree_root *root) -{ - struct prio_tree_iter iter; - unsigned long results = 0; - - prio_tree_iter_init(&iter, root, query, query); - while (prio_tree_next(&iter)) - results++; - return results; -} - -static void init(void) -{ - int i; - for (i = 0; i < NODES; i++) { - u32 a = prandom32(&rnd), b = prandom32(&rnd); - if (a <= b) { - nodes[i].start = a; - nodes[i].last = b; - } else { - nodes[i].start = b; - nodes[i].last = a; - } - } - for (i = 0; i < SEARCHES; i++) - queries[i] = prandom32(&rnd); -} - -static int prio_tree_test_init(void) -{ - int i, j; - unsigned long results; - cycles_t time1, time2, time; - - printk(KERN_ALERT "prio tree insert/remove"); - - prandom32_seed(&rnd, 3141592653589793238ULL); - INIT_PRIO_TREE_ROOT(&root); - init(); - - time1 = get_cycles(); - - for (i = 0; i < PERF_LOOPS; i++) { - for (j = 0; j < NODES; j++) - prio_tree_insert(&root, nodes + j); - for (j = 0; j < NODES; j++) - prio_tree_remove(&root, nodes + j); - } - - time2 = get_cycles(); - time = time2 - time1; - - time = div_u64(time, PERF_LOOPS); - printk(" -> %llu cycles\n", (unsigned long long)time); - - printk(KERN_ALERT "prio tree search"); - - for (j = 0; j < NODES; j++) - prio_tree_insert(&root, nodes + j); - - time1 = get_cycles(); - - results = 0; - for (i = 0; i < SEARCH_LOOPS; i++) - for (j = 0; j < SEARCHES; j++) - results += search(queries[j], &root); - - time2 = get_cycles(); - time = time2 - time1; - - time = div_u64(time, SEARCH_LOOPS); - results = div_u64(results, SEARCH_LOOPS); - printk(" -> %llu cycles (%lu results)\n", - (unsigned long long)time, results); - - return -EAGAIN; /* Fail will directly unload the module */ -} - -static void prio_tree_test_exit(void) -{ - printk(KERN_ALERT "test exit\n"); -} - -module_init(prio_tree_test_init) -module_exit(prio_tree_test_exit) - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Michel Lespinasse"); -MODULE_DESCRIPTION("Prio Tree test"); -- cgit v1.2.3 From 9c079add0d0f45220f4bb37febf0621137ec2d38 Mon Sep 17 00:00:00 2001 From: Michel Lespinasse Date: Mon, 8 Oct 2012 16:31:33 -0700 Subject: rbtree: move augmented rbtree functionality to rbtree_augmented.h Provide rb_insert_augmented() and rb_erase_augmented() through a new rbtree_augmented.h include file. rb_erase_augmented() is defined there as an __always_inline function, in order to allow inlining of augmented rbtree callbacks into it. Since this generates a relatively large function, each augmented rbtree user should make sure to have a single call site. Signed-off-by: Michel Lespinasse Cc: Rik van Riel Cc: Hillf Danton Cc: Peter Zijlstra Cc: Catalin Marinas Cc: Andrea Arcangeli Cc: David Woodhouse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/rbtree.txt | 13 +++ arch/x86/mm/pat_rbtree.c | 2 +- include/linux/interval_tree_tmpl.h | 8 +- include/linux/rbtree.h | 48 -------- include/linux/rbtree_augmented.h | 223 +++++++++++++++++++++++++++++++++++++ lib/rbtree.c | 162 ++------------------------- lib/rbtree_test.c | 2 +- 7 files changed, 255 insertions(+), 203 deletions(-) create mode 100644 include/linux/rbtree_augmented.h (limited to 'Documentation') diff --git a/Documentation/rbtree.txt b/Documentation/rbtree.txt index 0a0b6dce3e08..61b6c48871a0 100644 --- a/Documentation/rbtree.txt +++ b/Documentation/rbtree.txt @@ -202,6 +202,14 @@ An rbtree user who wants this feature will have to call the augmentation functions with the user provided augmentation callback when inserting and erasing nodes. +C files implementing augmented rbtree manipulation must include + instead of . Note that +linux/rbtree_augmented.h exposes some rbtree implementations details +you are not expected to rely on; please stick to the documented APIs +there and do not include from header files +either so as to minimize chances of your users accidentally relying on +such implementation details. + On insertion, the user must update the augmented information on the path leading to the inserted node, then call rb_link_node() as usual and rb_augment_inserted() instead of the usual rb_insert_color() call. @@ -227,6 +235,11 @@ In both cases, the callbacks are provided through struct rb_augment_callbacks. subtree to a newly assigned subtree root AND recomputes the augmented information for the former subtree root. +The compiled code for rb_erase_augmented() may inline the propagation and +copy callbacks, which results in a large function, so each augmented rbtree +user should have a single rb_erase_augmented() call site in order to limit +compiled code size. + Sample usage: diff --git a/arch/x86/mm/pat_rbtree.c b/arch/x86/mm/pat_rbtree.c index 4d116959075d..415f6c4ced36 100644 --- a/arch/x86/mm/pat_rbtree.c +++ b/arch/x86/mm/pat_rbtree.c @@ -12,7 +12,7 @@ #include #include #include -#include +#include #include #include diff --git a/include/linux/interval_tree_tmpl.h b/include/linux/interval_tree_tmpl.h index c65deda31413..c1aeb922d65f 100644 --- a/include/linux/interval_tree_tmpl.h +++ b/include/linux/interval_tree_tmpl.h @@ -19,6 +19,8 @@ include/linux/interval_tree_tmpl.h */ +#include + /* * Template for implementing interval trees * @@ -57,7 +59,8 @@ static inline ITTYPE IT(compute_subtree_last)(ITSTRUCT *node) return max; } -static void IT(augment_propagate)(struct rb_node *rb, struct rb_node *stop) +static inline void +IT(augment_propagate)(struct rb_node *rb, struct rb_node *stop) { while (rb != stop) { ITSTRUCT *node = rb_entry(rb, ITSTRUCT, ITRB); @@ -69,7 +72,8 @@ static void IT(augment_propagate)(struct rb_node *rb, struct rb_node *stop) } } -static void IT(augment_copy)(struct rb_node *rb_old, struct rb_node *rb_new) +static inline void +IT(augment_copy)(struct rb_node *rb_old, struct rb_node *rb_new) { ITSTRUCT *old = rb_entry(rb_old, ITSTRUCT, ITRB); ITSTRUCT *new = rb_entry(rb_new, ITSTRUCT, ITRB); diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h index 8d1e83b1c87b..0022c1bb1e26 100644 --- a/include/linux/rbtree.h +++ b/include/linux/rbtree.h @@ -62,54 +62,6 @@ extern void rb_insert_color(struct rb_node *, struct rb_root *); extern void rb_erase(struct rb_node *, struct rb_root *); -struct rb_augment_callbacks { - void (*propagate)(struct rb_node *node, struct rb_node *stop); - void (*copy)(struct rb_node *old, struct rb_node *new); - void (*rotate)(struct rb_node *old, struct rb_node *new); -}; - -extern void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, - void (*augment_rotate)(struct rb_node *old, struct rb_node *new)); -extern void rb_erase_augmented(struct rb_node *node, struct rb_root *root, - const struct rb_augment_callbacks *augment); -static inline void -rb_insert_augmented(struct rb_node *node, struct rb_root *root, - const struct rb_augment_callbacks *augment) -{ - __rb_insert_augmented(node, root, augment->rotate); -} - -#define RB_DECLARE_CALLBACKS(rbstatic, rbname, rbstruct, rbfield, \ - rbtype, rbaugmented, rbcompute) \ -static void rbname ## _propagate(struct rb_node *rb, struct rb_node *stop) \ -{ \ - while (rb != stop) { \ - rbstruct *node = rb_entry(rb, rbstruct, rbfield); \ - rbtype augmented = rbcompute(node); \ - if (node->rbaugmented == augmented) \ - break; \ - node->rbaugmented = augmented; \ - rb = rb_parent(&node->rbfield); \ - } \ -} \ -static void rbname ## _copy(struct rb_node *rb_old, struct rb_node *rb_new) \ -{ \ - rbstruct *old = rb_entry(rb_old, rbstruct, rbfield); \ - rbstruct *new = rb_entry(rb_new, rbstruct, rbfield); \ - new->rbaugmented = old->rbaugmented; \ -} \ -static void rbname ## _rotate(struct rb_node *rb_old, struct rb_node *rb_new) \ -{ \ - rbstruct *old = rb_entry(rb_old, rbstruct, rbfield); \ - rbstruct *new = rb_entry(rb_new, rbstruct, rbfield); \ - new->rbaugmented = old->rbaugmented; \ - old->rbaugmented = rbcompute(old); \ -} \ -rbstatic const struct rb_augment_callbacks rbname = { \ - rbname ## _propagate, rbname ## _copy, rbname ## _rotate \ -}; - - /* Find logical next and previous nodes in a tree */ extern struct rb_node *rb_next(const struct rb_node *); extern struct rb_node *rb_prev(const struct rb_node *); diff --git a/include/linux/rbtree_augmented.h b/include/linux/rbtree_augmented.h new file mode 100644 index 000000000000..214caa33433b --- /dev/null +++ b/include/linux/rbtree_augmented.h @@ -0,0 +1,223 @@ +/* + Red Black Trees + (C) 1999 Andrea Arcangeli + (C) 2002 David Woodhouse + (C) 2012 Michel Lespinasse + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + linux/include/linux/rbtree_augmented.h +*/ + +#ifndef _LINUX_RBTREE_AUGMENTED_H +#define _LINUX_RBTREE_AUGMENTED_H + +#include + +/* + * Please note - only struct rb_augment_callbacks and the prototypes for + * rb_insert_augmented() and rb_erase_augmented() are intended to be public. + * The rest are implementation details you are not expected to depend on. + * + * See Documentation/rbtree.txt for documentation and samples. + */ + +struct rb_augment_callbacks { + void (*propagate)(struct rb_node *node, struct rb_node *stop); + void (*copy)(struct rb_node *old, struct rb_node *new); + void (*rotate)(struct rb_node *old, struct rb_node *new); +}; + +extern void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)); +static inline void +rb_insert_augmented(struct rb_node *node, struct rb_root *root, + const struct rb_augment_callbacks *augment) +{ + __rb_insert_augmented(node, root, augment->rotate); +} + +#define RB_DECLARE_CALLBACKS(rbstatic, rbname, rbstruct, rbfield, \ + rbtype, rbaugmented, rbcompute) \ +static inline void \ +rbname ## _propagate(struct rb_node *rb, struct rb_node *stop) \ +{ \ + while (rb != stop) { \ + rbstruct *node = rb_entry(rb, rbstruct, rbfield); \ + rbtype augmented = rbcompute(node); \ + if (node->rbaugmented == augmented) \ + break; \ + node->rbaugmented = augmented; \ + rb = rb_parent(&node->rbfield); \ + } \ +} \ +static inline void \ +rbname ## _copy(struct rb_node *rb_old, struct rb_node *rb_new) \ +{ \ + rbstruct *old = rb_entry(rb_old, rbstruct, rbfield); \ + rbstruct *new = rb_entry(rb_new, rbstruct, rbfield); \ + new->rbaugmented = old->rbaugmented; \ +} \ +static void \ +rbname ## _rotate(struct rb_node *rb_old, struct rb_node *rb_new) \ +{ \ + rbstruct *old = rb_entry(rb_old, rbstruct, rbfield); \ + rbstruct *new = rb_entry(rb_new, rbstruct, rbfield); \ + new->rbaugmented = old->rbaugmented; \ + old->rbaugmented = rbcompute(old); \ +} \ +rbstatic const struct rb_augment_callbacks rbname = { \ + rbname ## _propagate, rbname ## _copy, rbname ## _rotate \ +}; + + +#define RB_RED 0 +#define RB_BLACK 1 + +#define __rb_parent(pc) ((struct rb_node *)(pc & ~3)) + +#define __rb_color(pc) ((pc) & 1) +#define __rb_is_black(pc) __rb_color(pc) +#define __rb_is_red(pc) (!__rb_color(pc)) +#define rb_color(rb) __rb_color((rb)->__rb_parent_color) +#define rb_is_red(rb) __rb_is_red((rb)->__rb_parent_color) +#define rb_is_black(rb) __rb_is_black((rb)->__rb_parent_color) + +static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p) +{ + rb->__rb_parent_color = rb_color(rb) | (unsigned long)p; +} + +static inline void rb_set_parent_color(struct rb_node *rb, + struct rb_node *p, int color) +{ + rb->__rb_parent_color = (unsigned long)p | color; +} + +static inline void +__rb_change_child(struct rb_node *old, struct rb_node *new, + struct rb_node *parent, struct rb_root *root) +{ + if (parent) { + if (parent->rb_left == old) + parent->rb_left = new; + else + parent->rb_right = new; + } else + root->rb_node = new; +} + +extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root, + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)); + +static __always_inline void +rb_erase_augmented(struct rb_node *node, struct rb_root *root, + const struct rb_augment_callbacks *augment) +{ + struct rb_node *child = node->rb_right, *tmp = node->rb_left; + struct rb_node *parent, *rebalance; + unsigned long pc; + + if (!tmp) { + /* + * Case 1: node to erase has no more than 1 child (easy!) + * + * Note that if there is one child it must be red due to 5) + * and node must be black due to 4). We adjust colors locally + * so as to bypass __rb_erase_color() later on. + */ + pc = node->__rb_parent_color; + parent = __rb_parent(pc); + __rb_change_child(node, child, parent, root); + if (child) { + child->__rb_parent_color = pc; + rebalance = NULL; + } else + rebalance = __rb_is_black(pc) ? parent : NULL; + tmp = parent; + } else if (!child) { + /* Still case 1, but this time the child is node->rb_left */ + tmp->__rb_parent_color = pc = node->__rb_parent_color; + parent = __rb_parent(pc); + __rb_change_child(node, tmp, parent, root); + rebalance = NULL; + tmp = parent; + } else { + struct rb_node *successor = child, *child2; + tmp = child->rb_left; + if (!tmp) { + /* + * Case 2: node's successor is its right child + * + * (n) (s) + * / \ / \ + * (x) (s) -> (x) (c) + * \ + * (c) + */ + parent = successor; + child2 = successor->rb_right; + augment->copy(node, successor); + } else { + /* + * Case 3: node's successor is leftmost under + * node's right child subtree + * + * (n) (s) + * / \ / \ + * (x) (y) -> (x) (y) + * / / + * (p) (p) + * / / + * (s) (c) + * \ + * (c) + */ + do { + parent = successor; + successor = tmp; + tmp = tmp->rb_left; + } while (tmp); + parent->rb_left = child2 = successor->rb_right; + successor->rb_right = child; + rb_set_parent(child, successor); + augment->copy(node, successor); + augment->propagate(parent, successor); + } + + successor->rb_left = tmp = node->rb_left; + rb_set_parent(tmp, successor); + + pc = node->__rb_parent_color; + tmp = __rb_parent(pc); + __rb_change_child(node, successor, tmp, root); + if (child2) { + successor->__rb_parent_color = pc; + rb_set_parent_color(child2, parent, RB_BLACK); + rebalance = NULL; + } else { + unsigned long pc2 = successor->__rb_parent_color; + successor->__rb_parent_color = pc; + rebalance = __rb_is_black(pc2) ? parent : NULL; + } + tmp = successor; + } + + augment->propagate(tmp, NULL); + if (rebalance) + __rb_erase_color(rebalance, root, augment->rotate); +} + +#endif /* _LINUX_RBTREE_AUGMENTED_H */ diff --git a/lib/rbtree.c b/lib/rbtree.c index c0088ca345f9..4f56a11d67fa 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -21,7 +21,7 @@ linux/lib/rbtree.c */ -#include +#include #include /* @@ -44,52 +44,16 @@ * parentheses and have some accompanying text comment. */ -#define RB_RED 0 -#define RB_BLACK 1 - -#define __rb_parent(pc) ((struct rb_node *)(pc & ~3)) - -#define __rb_color(pc) ((pc) & 1) -#define __rb_is_black(pc) __rb_color(pc) -#define __rb_is_red(pc) (!__rb_color(pc)) -#define rb_color(rb) __rb_color((rb)->__rb_parent_color) -#define rb_is_red(rb) __rb_is_red((rb)->__rb_parent_color) -#define rb_is_black(rb) __rb_is_black((rb)->__rb_parent_color) - static inline void rb_set_black(struct rb_node *rb) { rb->__rb_parent_color |= RB_BLACK; } -static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p) -{ - rb->__rb_parent_color = rb_color(rb) | (unsigned long)p; -} - -static inline void rb_set_parent_color(struct rb_node *rb, - struct rb_node *p, int color) -{ - rb->__rb_parent_color = (unsigned long)p | color; -} - static inline struct rb_node *rb_red_parent(struct rb_node *red) { return (struct rb_node *)red->__rb_parent_color; } -static inline void -__rb_change_child(struct rb_node *old, struct rb_node *new, - struct rb_node *parent, struct rb_root *root) -{ - if (parent) { - if (parent->rb_left == old) - parent->rb_left = new; - else - parent->rb_right = new; - } else - root->rb_node = new; -} - /* * Helper function for rotations: * - old's parent and color get assigned to new @@ -230,9 +194,9 @@ __rb_insert(struct rb_node *node, struct rb_root *root, } } -static __always_inline void +__always_inline void __rb_erase_color(struct rb_node *parent, struct rb_root *root, - const struct rb_augment_callbacks *augment) + void (*augment_rotate)(struct rb_node *old, struct rb_node *new)) { struct rb_node *node = NULL, *sibling, *tmp1, *tmp2; @@ -261,7 +225,7 @@ __rb_erase_color(struct rb_node *parent, struct rb_root *root, rb_set_parent_color(tmp1, parent, RB_BLACK); __rb_rotate_set_parents(parent, sibling, root, RB_RED); - augment->rotate(parent, sibling); + augment_rotate(parent, sibling); sibling = tmp1; } tmp1 = sibling->rb_right; @@ -313,7 +277,7 @@ __rb_erase_color(struct rb_node *parent, struct rb_root *root, if (tmp1) rb_set_parent_color(tmp1, sibling, RB_BLACK); - augment->rotate(sibling, tmp2); + augment_rotate(sibling, tmp2); tmp1 = sibling; sibling = tmp2; } @@ -336,7 +300,7 @@ __rb_erase_color(struct rb_node *parent, struct rb_root *root, rb_set_parent(tmp2, parent); __rb_rotate_set_parents(parent, sibling, root, RB_BLACK); - augment->rotate(parent, sibling); + augment_rotate(parent, sibling); break; } else { sibling = parent->rb_left; @@ -347,7 +311,7 @@ __rb_erase_color(struct rb_node *parent, struct rb_root *root, rb_set_parent_color(tmp1, parent, RB_BLACK); __rb_rotate_set_parents(parent, sibling, root, RB_RED); - augment->rotate(parent, sibling); + augment_rotate(parent, sibling); sibling = tmp1; } tmp1 = sibling->rb_left; @@ -374,7 +338,7 @@ __rb_erase_color(struct rb_node *parent, struct rb_root *root, if (tmp1) rb_set_parent_color(tmp1, sibling, RB_BLACK); - augment->rotate(sibling, tmp2); + augment_rotate(sibling, tmp2); tmp1 = sibling; sibling = tmp2; } @@ -386,109 +350,12 @@ __rb_erase_color(struct rb_node *parent, struct rb_root *root, rb_set_parent(tmp2, parent); __rb_rotate_set_parents(parent, sibling, root, RB_BLACK); - augment->rotate(parent, sibling); + augment_rotate(parent, sibling); break; } } } - -static __always_inline void -__rb_erase(struct rb_node *node, struct rb_root *root, - const struct rb_augment_callbacks *augment) -{ - struct rb_node *child = node->rb_right, *tmp = node->rb_left; - struct rb_node *parent, *rebalance; - unsigned long pc; - - if (!tmp) { - /* - * Case 1: node to erase has no more than 1 child (easy!) - * - * Note that if there is one child it must be red due to 5) - * and node must be black due to 4). We adjust colors locally - * so as to bypass __rb_erase_color() later on. - */ - pc = node->__rb_parent_color; - parent = __rb_parent(pc); - __rb_change_child(node, child, parent, root); - if (child) { - child->__rb_parent_color = pc; - rebalance = NULL; - } else - rebalance = __rb_is_black(pc) ? parent : NULL; - tmp = parent; - } else if (!child) { - /* Still case 1, but this time the child is node->rb_left */ - tmp->__rb_parent_color = pc = node->__rb_parent_color; - parent = __rb_parent(pc); - __rb_change_child(node, tmp, parent, root); - rebalance = NULL; - tmp = parent; - } else { - struct rb_node *successor = child, *child2; - tmp = child->rb_left; - if (!tmp) { - /* - * Case 2: node's successor is its right child - * - * (n) (s) - * / \ / \ - * (x) (s) -> (x) (c) - * \ - * (c) - */ - parent = successor; - child2 = successor->rb_right; - augment->copy(node, successor); - } else { - /* - * Case 3: node's successor is leftmost under - * node's right child subtree - * - * (n) (s) - * / \ / \ - * (x) (y) -> (x) (y) - * / / - * (p) (p) - * / / - * (s) (c) - * \ - * (c) - */ - do { - parent = successor; - successor = tmp; - tmp = tmp->rb_left; - } while (tmp); - parent->rb_left = child2 = successor->rb_right; - successor->rb_right = child; - rb_set_parent(child, successor); - augment->copy(node, successor); - augment->propagate(parent, successor); - } - - successor->rb_left = tmp = node->rb_left; - rb_set_parent(tmp, successor); - - pc = node->__rb_parent_color; - tmp = __rb_parent(pc); - __rb_change_child(node, successor, tmp, root); - if (child2) { - successor->__rb_parent_color = pc; - rb_set_parent_color(child2, parent, RB_BLACK); - rebalance = NULL; - } else { - unsigned long pc2 = successor->__rb_parent_color; - successor->__rb_parent_color = pc; - rebalance = __rb_is_black(pc2) ? parent : NULL; - } - tmp = successor; - } - - augment->propagate(tmp, NULL); - if (rebalance) - __rb_erase_color(rebalance, root, augment); -} +EXPORT_SYMBOL(__rb_erase_color); /* * Non-augmented rbtree manipulation functions. @@ -513,7 +380,7 @@ EXPORT_SYMBOL(rb_insert_color); void rb_erase(struct rb_node *node, struct rb_root *root) { - __rb_erase(node, root, &dummy_callbacks); + rb_erase_augmented(node, root, &dummy_callbacks); } EXPORT_SYMBOL(rb_erase); @@ -531,13 +398,6 @@ void __rb_insert_augmented(struct rb_node *node, struct rb_root *root, } EXPORT_SYMBOL(__rb_insert_augmented); -void rb_erase_augmented(struct rb_node *node, struct rb_root *root, - const struct rb_augment_callbacks *augment) -{ - __rb_erase(node, root, augment); -} -EXPORT_SYMBOL(rb_erase_augmented); - /* * This function returns the first node (in sort order) of the tree. */ diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c index b20e99969b0f..268b23951fec 100644 --- a/lib/rbtree_test.c +++ b/lib/rbtree_test.c @@ -1,5 +1,5 @@ #include -#include +#include #include #include -- cgit v1.2.3 From 1939c557b5c3c0e800328e3589ae3d27fdfea29e Mon Sep 17 00:00:00 2001 From: Michael Kerrisk Date: Mon, 8 Oct 2012 16:33:09 -0700 Subject: memcg: trivial fixes for Documentation/cgroups/memory.txt While reading through Documentation/cgroups/memory.txt, I found a number of minor wordos and typos. The patch below is a conservative handling of some of these: it provides just a number of "obviously correct" fixes to the English that improve the readability of the document somewhat. Obviously some more significant fixes need to be made to the document, but some of those may not be in the "obvious correct" category. Signed-off-by: Michael Kerrisk Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memory.txt | 90 ++++++++++++++++++++-------------------- 1 file changed, 45 insertions(+), 45 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 4372e6b8a353..c07f7b4fb88d 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -18,16 +18,16 @@ from the rest of the system. The article on LWN [12] mentions some probable uses of the memory controller. The memory controller can be used to a. Isolate an application or a group of applications - Memory hungry applications can be isolated and limited to a smaller + Memory-hungry applications can be isolated and limited to a smaller amount of memory. -b. Create a cgroup with limited amount of memory, this can be used +b. Create a cgroup with a limited amount of memory; this can be used as a good alternative to booting with mem=XXXX. c. Virtualization solutions can control the amount of memory they want to assign to a virtual machine instance. d. A CD/DVD burner could control the amount of memory used by the rest of the system to ensure that burning does not fail due to lack of available memory. -e. There are several other use cases, find one or use the controller just +e. There are several other use cases; find one or use the controller just for fun (to learn and hack on the VM subsystem). Current Status: linux-2.6.34-mmotm(development version of 2010/April) @@ -38,12 +38,12 @@ Features: - optionally, memory+swap usage can be accounted and limited. - hierarchical accounting - soft limit - - moving(recharging) account at moving a task is selectable. + - moving (recharging) account at moving a task is selectable. - usage threshold notifier - oom-killer disable knob and oom-notifier - Root cgroup has no limit controls. - Kernel memory support is work in progress, and the current version provides + Kernel memory support is a work in progress, and the current version provides basically functionality. (See Section 2.7) Brief summary of control files. @@ -144,9 +144,9 @@ Figure 1 shows the important aspects of the controller 3. Each page has a pointer to the page_cgroup, which in turn knows the cgroup it belongs to -The accounting is done as follows: mem_cgroup_charge() is invoked to setup +The accounting is done as follows: mem_cgroup_charge() is invoked to set up the necessary data structures and check if the cgroup that is being charged -is over its limit. If it is then reclaim is invoked on the cgroup. +is over its limit. If it is, then reclaim is invoked on the cgroup. More details can be found in the reclaim section of this document. If everything goes well, a page meta-data-structure called page_cgroup is updated. page_cgroup has its own LRU on cgroup. @@ -163,13 +163,13 @@ for earlier. A file page will be accounted for as Page Cache when it's inserted into inode (radix-tree). While it's mapped into the page tables of processes, duplicate accounting is carefully avoided. -A RSS page is unaccounted when it's fully unmapped. A PageCache page is +An RSS page is unaccounted when it's fully unmapped. A PageCache page is unaccounted when it's removed from radix-tree. Even if RSS pages are fully unmapped (by kswapd), they may exist as SwapCache in the system until they -are really freed. Such SwapCaches also also accounted. +are really freed. Such SwapCaches are also accounted. A swapped-in page is not accounted until it's mapped. -Note: The kernel does swapin-readahead and read multiple swaps at once. +Note: The kernel does swapin-readahead and reads multiple swaps at once. This means swapped-in pages may contain pages for other tasks than a task causing page fault. So, we avoid accounting at swap-in I/O. @@ -209,7 +209,7 @@ memsw.limit_in_bytes. Example: Assume a system with 4G of swap. A task which allocates 6G of memory (by mistake) under 2G memory limitation will use all swap. In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. -By using memsw limit, you can avoid system OOM which can be caused by swap +By using the memsw limit, you can avoid system OOM which can be caused by swap shortage. * why 'memory+swap' rather than swap. @@ -217,7 +217,7 @@ The global LRU(kswapd) can swap out arbitrary pages. Swap-out means to move account from memory to swap...there is no change in usage of memory+swap. In other words, when we want to limit the usage of swap without affecting global LRU, memory+swap limit is better than just limiting swap from -OS point of view. +an OS point of view. * What happens when a cgroup hits memory.memsw.limit_in_bytes When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out @@ -236,7 +236,7 @@ an OOM routine is invoked to select and kill the bulkiest task in the cgroup. (See 10. OOM Control below.) The reclaim algorithm has not been modified for cgroups, except that -pages that are selected for reclaiming come from the per cgroup LRU +pages that are selected for reclaiming come from the per-cgroup LRU list. NOTE: Reclaim does not work for the root cgroup, since we cannot set any @@ -316,7 +316,7 @@ We can check the usage: # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes 1216512 -A successful write to this file does not guarantee a successful set of +A successful write to this file does not guarantee a successful setting of this limit to the value written into the file. This can be due to a number of factors, such as rounding up to page boundaries or the total availability of memory on the system. The user is required to re-read @@ -350,7 +350,7 @@ Trying usual test under memory controller is always helpful. 4.1 Troubleshooting Sometimes a user might find that the application under a cgroup is -terminated by OOM killer. There are several causes for this: +terminated by the OOM killer. There are several causes for this: 1. The cgroup limit is too low (just too low to do anything useful) 2. The user is using anonymous memory and swap is turned off or too low @@ -358,7 +358,7 @@ terminated by OOM killer. There are several causes for this: A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of some of the pages cached in the cgroup (page cache pages). -To know what happens, disable OOM_Kill by 10. OOM Control(see below) and +To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and seeing what happens will be helpful. 4.2 Task migration @@ -399,10 +399,10 @@ About use_hierarchy, see Section 6. Almost all pages tracked by this memory cgroup will be unmapped and freed. Some pages cannot be freed because they are locked or in-use. Such pages are - moved to parent(if use_hierarchy==1) or root (if use_hierarchy==0) and this + moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this cgroup will be empty. - Typical use case of this interface is that calling this before rmdir(). + The typical use case for this interface is before calling rmdir(). Because rmdir() moves all pages to parent, some out-of-use page caches can be moved to the parent. If you want to avoid that, force_empty will be useful. @@ -486,7 +486,7 @@ You can reset failcnt by writing 0 to failcnt file. For efficiency, as other kernel components, memory cgroup uses some optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the -method and doesn't show 'exact' value of memory(and swap) usage, it's an fuzz +method and doesn't show 'exact' value of memory (and swap) usage, it's a fuzz value for efficient access. (Of course, when necessary, it's synchronized.) If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP) value in memory.stat(see 5.2). @@ -496,8 +496,8 @@ value in memory.stat(see 5.2). This is similar to numa_maps but operates on a per-memcg basis. This is useful for providing visibility into the numa locality information within an memcg since the pages are allowed to be allocated from any physical -node. One of the usecases is evaluating application performance by -combining this information with the application's cpu allocation. +node. One of the use cases is evaluating application performance by +combining this information with the application's CPU allocation. We export "total", "file", "anon" and "unevictable" pages per-node for each memcg. The ouput format of memory.numa_stat is: @@ -561,10 +561,10 @@ are pushed back to their soft limits. If the soft limit of each control group is very high, they are pushed back as much as possible to make sure that one control group does not starve the others of memory. -Please note that soft limits is a best effort feature, it comes with +Please note that soft limits is a best-effort feature; it comes with no guarantees, but it does its best to make sure that when memory is heavily contended for, memory is allocated based on the soft limit -hints/setup. Currently soft limit based reclaim is setup such that +hints/setup. Currently soft limit based reclaim is set up such that it gets invoked from balance_pgdat (kswapd). 7.1 Interface @@ -592,7 +592,7 @@ page tables. 8.1 Interface -This feature is disabled by default. It can be enabled(and disabled again) by +This feature is disabled by default. It can be enabledi (and disabled again) by writing to memory.move_charge_at_immigrate of the destination cgroup. If you want to enable it: @@ -601,8 +601,8 @@ If you want to enable it: Note: Each bits of move_charge_at_immigrate has its own meaning about what type of charges should be moved. See 8.2 for details. -Note: Charges are moved only when you move mm->owner, IOW, a leader of a thread - group. +Note: Charges are moved only when you move mm->owner, in other words, + a leader of a thread group. Note: If we cannot find enough space for the task in the destination cgroup, we try to make space by reclaiming memory. Task migration may fail if we cannot make enough space. @@ -612,25 +612,25 @@ And if you want disable it again: # echo 0 > memory.move_charge_at_immigrate -8.2 Type of charges which can be move +8.2 Type of charges which can be moved -Each bits of move_charge_at_immigrate has its own meaning about what type of -charges should be moved. But in any cases, it must be noted that an account of -a page or a swap can be moved only when it is charged to the task's current(old) -memory cgroup. +Each bit in move_charge_at_immigrate has its own meaning about what type of +charges should be moved. But in any case, it must be noted that an account of +a page or a swap can be moved only when it is charged to the task's current +(old) memory cgroup. bit | what type of charges would be moved ? -----+------------------------------------------------------------------------ - 0 | A charge of an anonymous page(or swap of it) used by the target task. - | You must enable Swap Extension(see 2.4) to enable move of swap charges. + 0 | A charge of an anonymous page (or swap of it) used by the target task. + | You must enable Swap Extension (see 2.4) to enable move of swap charges. -----+------------------------------------------------------------------------ - 1 | A charge of file pages(normal file, tmpfs file(e.g. ipc shared memory) + 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) | and swaps of tmpfs file) mmapped by the target task. Unlike the case of - | anonymous pages, file pages(and swaps) in the range mmapped by the task + | anonymous pages, file pages (and swaps) in the range mmapped by the task | will be moved even if the task hasn't done page fault, i.e. they might | not be the task's "RSS", but other task's "RSS" that maps the same file. - | And mapcount of the page is ignored(the page can be moved even if - | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to + | And mapcount of the page is ignored (the page can be moved even if + | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | enable move of swap charges. 8.3 TODO @@ -640,11 +640,11 @@ memory cgroup. 9. Memory thresholds -Memory cgroup implements memory thresholds using cgroups notification +Memory cgroup implements memory thresholds using the cgroups notification API (see cgroups.txt). It allows to register multiple memory and memsw thresholds and gets notifications when it crosses. -To register a threshold application need: +To register a threshold, an application must: - create an eventfd using eventfd(2); - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; - write string like " " to @@ -659,24 +659,24 @@ It's applicable for root and non-root cgroup. memory.oom_control file is for OOM notification and other controls. -Memory cgroup implements OOM notifier using cgroup notification +Memory cgroup implements OOM notifier using the cgroup notification API (See cgroups.txt). It allows to register multiple OOM notification delivery and gets notification when OOM happens. -To register a notifier, application need: +To register a notifier, an application must: - create an eventfd using eventfd(2) - open memory.oom_control file - write string like " " to cgroup.event_control -Application will be notified through eventfd when OOM happens. -OOM notification doesn't work for root cgroup. +The application will be notified through eventfd when OOM happens. +OOM notification doesn't work for the root cgroup. -You can disable OOM-killer by writing "1" to memory.oom_control file, as: +You can disable the OOM-killer by writing "1" to memory.oom_control file, as: #echo 1 > memory.oom_control -This operation is only allowed to the top cgroup of sub-hierarchy. +This operation is only allowed to the top cgroup of a sub-hierarchy. If OOM-killer is disabled, tasks under cgroup will hang/sleep in memory cgroup's OOM-waitqueue when they request accountable memory. -- cgit v1.2.3 From 39b5f29ac1f988c1615fbc9c69f6651ab0d0c3c7 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 8 Oct 2012 16:33:18 -0700 Subject: mm: remove vma arg from page_evictable page_evictable(page, vma) is an irritant: almost all its callers pass NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page) explicitly in the couple of places it's needed. But in those places we don't even need page_evictable() itself! They're dealing with a freshly allocated anonymous page, which has no "mapping" and cannot be mlocked yet. Signed-off-by: Hugh Dickins Acked-by: Mel Gorman Cc: Rik van Riel Acked-by: Johannes Weiner Cc: Michel Lespinasse Cc: Ying Han Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/unevictable-lru.txt | 10 +++------- include/linux/swap.h | 2 +- mm/internal.h | 5 ++--- mm/ksm.c | 2 +- mm/rmap.c | 2 +- mm/swap.c | 2 +- mm/vmscan.c | 27 +++++++++------------------ 7 files changed, 18 insertions(+), 32 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 323ff5dba1cc..a68db7692ee8 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -197,12 +197,8 @@ the pages are also "rescued" from the unevictable list in the process of freeing them. page_evictable() also checks for mlocked pages by testing an additional page -flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked, -and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is -VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and -update the appropriate statistics if the vma is VM_LOCKED. This method allows -efficient "culling" of pages in the fault path that are being faulted in to -VM_LOCKED VMAs. +flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is +faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED. VMSCAN'S HANDLING OF UNEVICTABLE PAGES @@ -651,7 +647,7 @@ PAGE RECLAIM IN shrink_*_list() ------------------------------- shrink_active_list() culls any obviously unevictable pages - i.e. -!page_evictable(page, NULL) - diverting these to the unevictable list. +!page_evictable(page) - diverting these to the unevictable list. However, shrink_active_list() only sees unevictable pages that made it onto the active/inactive lru lists. Note that these pages do not have PageUnevictable set - otherwise they would be on the unevictable list and shrink_active_list diff --git a/include/linux/swap.h b/include/linux/swap.h index 388e70601413..68df9c17fbbb 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -281,7 +281,7 @@ static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) } #endif -extern int page_evictable(struct page *page, struct vm_area_struct *vma); +extern int page_evictable(struct page *page); extern void check_move_unevictable_pages(struct page **, int nr_pages); extern unsigned long scan_unevictable_pages; diff --git a/mm/internal.h b/mm/internal.h index 7f72f249bc29..78f25d6cc6a7 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -168,9 +168,8 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) } /* - * Called only in fault path via page_evictable() for a new page - * to determine if it's being mapped into a LOCKED vma. - * If so, mark page as mlocked. + * Called only in fault path, to determine if a new page is being + * mapped into a LOCKED vma. If it is, mark page as mlocked. */ static inline int mlocked_vma_newpage(struct vm_area_struct *vma, struct page *page) diff --git a/mm/ksm.c b/mm/ksm.c index 14ee5cf8a513..ecbc090cdaad 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1586,7 +1586,7 @@ struct page *ksm_does_need_to_copy(struct page *page, SetPageSwapBacked(new_page); __set_page_locked(new_page); - if (page_evictable(new_page, vma)) + if (!mlocked_vma_newpage(vma, new_page)) lru_cache_add_lru(new_page, LRU_ACTIVE_ANON); else add_page_to_unevictable_list(new_page); diff --git a/mm/rmap.c b/mm/rmap.c index 28777412de62..0d86433e42d7 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1080,7 +1080,7 @@ void page_add_new_anon_rmap(struct page *page, else __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); __page_set_anon_rmap(page, vma, address, 1); - if (page_evictable(page, vma)) + if (!mlocked_vma_newpage(vma, page)) lru_cache_add_lru(page, LRU_ACTIVE_ANON); else add_page_to_unevictable_list(page); diff --git a/mm/swap.c b/mm/swap.c index f76c76c7501b..6310dc2008ff 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -751,7 +751,7 @@ void lru_add_page_tail(struct page *page, struct page *page_tail, SetPageLRU(page_tail); - if (page_evictable(page_tail, NULL)) { + if (page_evictable(page_tail)) { if (PageActive(page)) { SetPageActive(page_tail); active = 1; diff --git a/mm/vmscan.c b/mm/vmscan.c index b010efc43891..8b627309dd44 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -553,7 +553,7 @@ void putback_lru_page(struct page *page) redo: ClearPageUnevictable(page); - if (page_evictable(page, NULL)) { + if (page_evictable(page)) { /* * For evictable pages, we can use the cache. * In event of a race, worst case is we end up with an @@ -587,7 +587,7 @@ redo: * page is on unevictable list, it never be freed. To avoid that, * check after we added it to the list, again. */ - if (lru == LRU_UNEVICTABLE && page_evictable(page, NULL)) { + if (lru == LRU_UNEVICTABLE && page_evictable(page)) { if (!isolate_lru_page(page)) { put_page(page); goto redo; @@ -709,7 +709,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, sc->nr_scanned++; - if (unlikely(!page_evictable(page, NULL))) + if (unlikely(!page_evictable(page))) goto cull_mlocked; if (!sc->may_unmap && page_mapped(page)) @@ -1217,7 +1217,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list) VM_BUG_ON(PageLRU(page)); list_del(&page->lru); - if (unlikely(!page_evictable(page, NULL))) { + if (unlikely(!page_evictable(page))) { spin_unlock_irq(&zone->lru_lock); putback_lru_page(page); spin_lock_irq(&zone->lru_lock); @@ -1470,7 +1470,7 @@ static void shrink_active_list(unsigned long nr_to_scan, page = lru_to_page(&l_hold); list_del(&page->lru); - if (unlikely(!page_evictable(page, NULL))) { + if (unlikely(!page_evictable(page))) { putback_lru_page(page); continue; } @@ -3414,27 +3414,18 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) /* * page_evictable - test whether a page is evictable * @page: the page to test - * @vma: the VMA in which the page is or will be mapped, may be NULL * * Test whether page is evictable--i.e., should be placed on active/inactive - * lists vs unevictable list. The vma argument is !NULL when called from the - * fault path to determine how to instantate a new page. + * lists vs unevictable list. * * Reasons page might not be evictable: * (1) page's mapping marked unevictable * (2) page is part of an mlocked VMA * */ -int page_evictable(struct page *page, struct vm_area_struct *vma) +int page_evictable(struct page *page) { - - if (mapping_unevictable(page_mapping(page))) - return 0; - - if (PageMlocked(page) || (vma && mlocked_vma_newpage(vma, page))) - return 0; - - return 1; + return !mapping_unevictable(page_mapping(page)) && !PageMlocked(page); } #ifdef CONFIG_SHMEM @@ -3472,7 +3463,7 @@ void check_move_unevictable_pages(struct page **pages, int nr_pages) if (!PageLRU(page) || !PageUnevictable(page)) continue; - if (page_evictable(page, NULL)) { + if (page_evictable(page)) { enum lru_list lru = page_lru_base_type(page); VM_BUG_ON(PageActive(page)); -- cgit v1.2.3 From 00ea8990aadf754bbbb46ae85d4f2afba19508d8 Mon Sep 17 00:00:00 2001 From: Jiri Kosina Date: Mon, 8 Oct 2012 16:33:25 -0700 Subject: memory.txt: remove stray information Andi removed some outedated documentation from Documentation/memory.txt back in 2009 by commit 3b2b9a875ddc ("Documentation/memory.txt: remove some very outdated recommendations"), but the resulting document is not in a nice shape either. It seems to me like we are not losing anything by completely removing the file now. Signed-off-by: Jiri Kosina Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/memory.txt | 33 --------------------------------- 1 file changed, 33 deletions(-) delete mode 100644 Documentation/memory.txt (limited to 'Documentation') diff --git a/Documentation/memory.txt b/Documentation/memory.txt deleted file mode 100644 index 802efe58647c..000000000000 --- a/Documentation/memory.txt +++ /dev/null @@ -1,33 +0,0 @@ -There are several classic problems related to memory on Linux -systems. - - 1) There are some motherboards that will not cache above - a certain quantity of memory. If you have one of these - motherboards, your system will be SLOWER, not faster - as you add more memory. Consider exchanging your - motherboard. - -All of these problems can be addressed with the "mem=XXXM" boot option -(where XXX is the size of RAM to use in megabytes). -It can also tell Linux to use less memory than is actually installed. -If you use "mem=" on a machine with PCI, consider using "memmap=" to avoid -physical address space collisions. - -See the documentation of your boot loader (LILO, grub, loadlin, etc.) about -how to pass options to the kernel. - -There are other memory problems which Linux cannot deal with. Random -corruption of memory is usually a sign of serious hardware trouble. -Try: - - * Reducing memory settings in the BIOS to the most conservative - timings. - - * Adding a cooling fan. - - * Not overclocking your CPU. - - * Having the memory tested in a memory tester or exchanged - with the vendor. Consider testing it with memtest86 yourself. - - * Exchanging your CPU, cache, or motherboard for one that works. -- cgit v1.2.3