diff options
author | J. Bruce Fields <bfields@citi.umich.edu> | 2008-02-07 11:13:37 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@woody.linux-foundation.org> | 2008-02-07 19:42:17 +0300 |
commit | 9b8eae7248dad42091204f83ed3448e661456af1 (patch) | |
tree | 1e300d41f8aaa9c258c179024ba63799a79f5a6f /Documentation/sched-design.txt | |
parent | d3cf91d0e201962a6367191e5926f5b0920b0339 (diff) | |
download | linux-9b8eae7248dad42091204f83ed3448e661456af1.tar.xz |
Documentation: create new scheduler/ subdirectory
The top-level Documentation/ directory is unmanageably large, so we
should take any obvious opportunities to move stuff into subdirectories.
These sched-*.txt files seem an obvious easy case.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation/sched-design.txt')
-rw-r--r-- | Documentation/sched-design.txt | 165 |
1 files changed, 0 insertions, 165 deletions
diff --git a/Documentation/sched-design.txt b/Documentation/sched-design.txt deleted file mode 100644 index 1605bf0cba8b..000000000000 --- a/Documentation/sched-design.txt +++ /dev/null @@ -1,165 +0,0 @@ - Goals, Design and Implementation of the - new ultra-scalable O(1) scheduler - - - This is an edited version of an email Ingo Molnar sent to - lkml on 4 Jan 2002. It describes the goals, design, and - implementation of Ingo's new ultra-scalable O(1) scheduler. - Last Updated: 18 April 2002. - - -Goal -==== - -The main goal of the new scheduler is to keep all the good things we know -and love about the current Linux scheduler: - - - good interactive performance even during high load: if the user - types or clicks then the system must react instantly and must execute - the user tasks smoothly, even during considerable background load. - - - good scheduling/wakeup performance with 1-2 runnable processes. - - - fairness: no process should stay without any timeslice for any - unreasonable amount of time. No process should get an unjustly high - amount of CPU time. - - - priorities: less important tasks can be started with lower priority, - more important tasks with higher priority. - - - SMP efficiency: no CPU should stay idle if there is work to do. - - - SMP affinity: processes which run on one CPU should stay affine to - that CPU. Processes should not bounce between CPUs too frequently. - - - plus additional scheduler features: RT scheduling, CPU binding. - -and the goal is also to add a few new things: - - - fully O(1) scheduling. Are you tired of the recalculation loop - blowing the L1 cache away every now and then? Do you think the goodness - loop is taking a bit too long to finish if there are lots of runnable - processes? This new scheduler takes no prisoners: wakeup(), schedule(), - the timer interrupt are all O(1) algorithms. There is no recalculation - loop. There is no goodness loop either. - - - 'perfect' SMP scalability. With the new scheduler there is no 'big' - runqueue_lock anymore - it's all per-CPU runqueues and locks - two - tasks on two separate CPUs can wake up, schedule and context-switch - completely in parallel, without any interlocking. All - scheduling-relevant data is structured for maximum scalability. - - - better SMP affinity. The old scheduler has a particular weakness that - causes the random bouncing of tasks between CPUs if/when higher - priority/interactive tasks, this was observed and reported by many - people. The reason is that the timeslice recalculation loop first needs - every currently running task to consume its timeslice. But when this - happens on eg. an 8-way system, then this property starves an - increasing number of CPUs from executing any process. Once the last - task that has a timeslice left has finished using up that timeslice, - the recalculation loop is triggered and other CPUs can start executing - tasks again - after having idled around for a number of timer ticks. - The more CPUs, the worse this effect. - - Furthermore, this same effect causes the bouncing effect as well: - whenever there is such a 'timeslice squeeze' of the global runqueue, - idle processors start executing tasks which are not affine to that CPU. - (because the affine tasks have finished off their timeslices already.) - - The new scheduler solves this problem by distributing timeslices on a - per-CPU basis, without having any global synchronization or - recalculation. - - - batch scheduling. A significant proportion of computing-intensive tasks - benefit from batch-scheduling, where timeslices are long and processes - are roundrobin scheduled. The new scheduler does such batch-scheduling - of the lowest priority tasks - so nice +19 jobs will get - 'batch-scheduled' automatically. With this scheduler, nice +19 jobs are - in essence SCHED_IDLE, from an interactiveness point of view. - - - handle extreme loads more smoothly, without breakdown and scheduling - storms. - - - O(1) RT scheduling. For those RT folks who are paranoid about the - O(nr_running) property of the goodness loop and the recalculation loop. - - - run fork()ed children before the parent. Andrea has pointed out the - advantages of this a few months ago, but patches for this feature - do not work with the old scheduler as well as they should, - because idle processes often steal the new child before the fork()ing - CPU gets to execute it. - - -Design -====== - -The core of the new scheduler contains the following mechanisms: - - - *two* priority-ordered 'priority arrays' per CPU. There is an 'active' - array and an 'expired' array. The active array contains all tasks that - are affine to this CPU and have timeslices left. The expired array - contains all tasks which have used up their timeslices - but this array - is kept sorted as well. The active and expired array is not accessed - directly, it's accessed through two pointers in the per-CPU runqueue - structure. If all active tasks are used up then we 'switch' the two - pointers and from now on the ready-to-go (former-) expired array is the - active array - and the empty active array serves as the new collector - for expired tasks. - - - there is a 64-bit bitmap cache for array indices. Finding the highest - priority task is thus a matter of two x86 BSFL bit-search instructions. - -the split-array solution enables us to have an arbitrary number of active -and expired tasks, and the recalculation of timeslices can be done -immediately when the timeslice expires. Because the arrays are always -access through the pointers in the runqueue, switching the two arrays can -be done very quickly. - -this is a hybride priority-list approach coupled with roundrobin -scheduling and the array-switch method of distributing timeslices. - - - there is a per-task 'load estimator'. - -one of the toughest things to get right is good interactive feel during -heavy system load. While playing with various scheduler variants i found -that the best interactive feel is achieved not by 'boosting' interactive -tasks, but by 'punishing' tasks that want to use more CPU time than there -is available. This method is also much easier to do in an O(1) fashion. - -to establish the actual 'load' the task contributes to the system, a -complex-looking but pretty accurate method is used: there is a 4-entry -'history' ringbuffer of the task's activities during the last 4 seconds. -This ringbuffer is operated without much overhead. The entries tell the -scheduler a pretty accurate load-history of the task: has it used up more -CPU time or less during the past N seconds. [the size '4' and the interval -of 4x 1 seconds was found by lots of experimentation - this part is -flexible and can be changed in both directions.] - -the penalty a task gets for generating more load than the CPU can handle -is a priority decrease - there is a maximum amount to this penalty -relative to their static priority, so even fully CPU-bound tasks will -observe each other's priorities, and will share the CPU accordingly. - -the SMP load-balancer can be extended/switched with additional parallel -computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs -can be supported easily by changing the load-balancer. Right now it's -tuned for my SMP systems. - -i skipped the prev->mm == next->mm advantage - no workload i know of shows -any sensitivity to this. It can be added back by sacrificing O(1) -schedule() [the current and one-lower priority list can be searched for a -that->mm == current->mm condition], but costs a fair number of cycles -during a number of important workloads, so i wanted to avoid this as much -as possible. - -- the SMP idle-task startup code was still racy and the new scheduler -triggered this. So i streamlined the idle-setup code a bit. We do not call -into schedule() before all processors have started up fully and all idle -threads are in place. - -- the patch also cleans up a number of aspects of sched.c - moves code -into other areas of the kernel where it's appropriate, and simplifies -certain code paths and data constructs. As a result, the new scheduler's -code is smaller than the old one. - - Ingo |