Today's plan
- Linux O(1) scheduler
- Sam Joseph presentation on semantic file systems
- course evaluation
Linux O(1) Scheduler
- Linux kernels up through 2.4 had a relatively inefficient
scheduler
- goal for 2.5/2.6 was an O(1) scheduler, meaning that each scheduling
decision takes constant time no matter how many processes or threads
(tasks in Linux) are runnable
- another goal was to have a scalable SMP scheduler: give each processor
its own runqueue so one processor does not (usually) have to wait for
locks held by other processors
- also, scheduling should be fair to equal-priority processes,
and should favor interactive processes (processes that block on system calls)
Linux O(1) algorithms
- each runqueue has two priority arrays, one active and one expired
- Linux has (by default) 140 priority levels
- a bitmap (with 140 bits, stored in 5 32-bit words) records which priority
levels have waiting tasks, making it fast (small constant time, since the
number of priorities is fixed) to find the highest priority level with
a waiting task
- an array of queues holds pointers to the next runnable process at
each priority level
- each process has a timeslice (what Minix calls a quantum)
- recomputing the timeslice for all processes would be O(n), so the
scheduler does it when the timeslice for a task has reached 0 and the
task is moved to the expired queue array -- O(1) per scheduling operation
- note that doing this calculation on each process's timeslice
expiration does not improve the total time needed for the computation,
but it does substantially improve the worst-case latency of scheduling
- once all tasks have moved from the active to the expired queue,
the two arrays are swapped, again in O(1)
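
The data structures above can be sketched in a few lines of C. This is a simplified illustration, not the kernel's code: the names (struct prio_array, enqueue, find_first_prio, pick_next) are chosen for this sketch, it removes a task from its queue when picking it, and it scans the bitmap with a portable loop where a real kernel would use a find-first-bit instruction. The point is that every operation is bounded by the fixed number of priority levels, never by the number of tasks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_PRIO     140                     /* priority levels, as in the notes */
#define BITMAP_WORDS ((NUM_PRIO + 31) / 32)  /* 5 words of 32 bits */

/* Hypothetical, simplified task: only what the sketch needs. */
struct task {
    int prio;              /* 0 (best) .. 139 (worst) */
    struct task *next;     /* next task queued at the same priority */
};

struct prio_array {
    int nr_active;                   /* tasks queued in this array */
    uint32_t bitmap[BITMAP_WORDS];   /* bit p set => queue p is non-empty */
    struct task *head[NUM_PRIO];
    struct task *tail[NUM_PRIO];
};

struct runqueue {
    struct prio_array arrays[2];
    struct prio_array *active, *expired;
};

static void rq_init(struct runqueue *rq)
{
    memset(rq, 0, sizeof *rq);
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

/* Append a task to the tail of its priority queue: O(1). */
static void enqueue(struct prio_array *a, struct task *t)
{
    assert(t->prio >= 0 && t->prio < NUM_PRIO);
    t->next = NULL;
    if (a->tail[t->prio])
        a->tail[t->prio]->next = t;
    else
        a->head[t->prio] = t;
    a->tail[t->prio] = t;
    a->bitmap[t->prio / 32] |= (uint32_t)1 << (t->prio % 32);
    a->nr_active++;
}

/* Best (lowest-numbered) non-empty priority, or -1 if the array is empty.
 * At most 5 words and 32 bits are examined, however many tasks exist. */
static int find_first_prio(const struct prio_array *a)
{
    for (int w = 0; w < BITMAP_WORDS; w++) {
        uint32_t bits = a->bitmap[w];
        if (bits) {
            int b = 0;
            while (!(bits & 1)) {
                bits >>= 1;
                b++;
            }
            return w * 32 + b;
        }
    }
    return -1;
}

/* Remove and return the next task to run. If the active array is empty,
 * swap the active and expired pointers first: an O(1) operation. */
static struct task *pick_next(struct runqueue *rq)
{
    if (rq->active->nr_active == 0) {
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    struct prio_array *a = rq->active;
    int p = find_first_prio(a);
    if (p < 0)
        return NULL;                 /* nothing runnable in either array */
    struct task *t = a->head[p];
    a->head[p] = t->next;
    if (!a->head[p]) {               /* queue drained: clear its bit */
        a->tail[p] = NULL;
        a->bitmap[p / 32] &= ~((uint32_t)1 << (p % 32));
    }
    a->nr_active--;
    return t;
}
```

A task whose timeslice runs out would be passed to enqueue(rq->expired, t); once the active array drains, the next call to pick_next swaps the two pointers and continues from what used to be the expired array.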
Linux Priority Scheduling
- processes have an initial priority -- the nice value, between -20
(highest priority) and 19 (lowest priority)
- the dynamic priority computation adds or removes up to 5 points from
this nice value
- the dynamic priority computation keeps track of how much time a
task spends sleeping, and lowers the task's dynamic priority value
(raising its priority) the more the task sleeps; the nice value itself
is not changed
- the timeslice is computed based on the priority, from 10ms (low priority)
to 100ms (typical priority) to 200ms (highest priority)
- highly interactive high priority tasks are typically reinserted
back into the active array rather than the expired array, but only if
the last array switch was not too far in the past
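
The priority and timeslice rules above can be sketched as follows. This is an illustration under stated assumptions, not the kernel's code: it works in milliseconds rather than timer ticks, it assumes the timeslice is a linear interpolation between the 10ms and 200ms endpoints, and sleep_pct is a hypothetical 0..100 measure of how much of its recent time a task spent sleeping. All function names are made up for this sketch.

```c
#include <assert.h>

#define MAX_RT_PRIO       100   /* priorities 0..99 are real-time */
#define MAX_PRIO          140   /* priorities 100..139 map to nice -20..19 */
#define MAX_USER_PRIO     (MAX_PRIO - MAX_RT_PRIO)   /* 40 */
#define MAX_BONUS         10    /* bonus spans -5..+5 around static priority */
#define MIN_TIMESLICE_MS  10
#define MAX_TIMESLICE_MS  200

/* Static priority for a nice value: -20 -> 100, 0 -> 120, 19 -> 139. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* Assumed linear interpolation: priority 139 -> 10 ms, 100 -> 200 ms. */
static int timeslice_ms(int static_prio)
{
    return MIN_TIMESLICE_MS +
           (MAX_TIMESLICE_MS - MIN_TIMESLICE_MS) *
           (MAX_PRIO - 1 - static_prio) / (MAX_USER_PRIO - 1);
}

/* Dynamic priority: a task that sleeps a lot gets up to 5 points better
 * (numerically lower), a CPU hog up to 5 points worse. The result is
 * clamped so it never enters the real-time range. */
static int dynamic_prio(int static_prio, int sleep_pct)
{
    int bonus = sleep_pct * MAX_BONUS / 100 - MAX_BONUS / 2;   /* -5..+5 */
    int prio = static_prio - bonus;
    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
```

Under these assumptions, timeslice_ms(nice_to_prio(0)) is 102, close to the 100ms typical value, and a nice 0 task that sleeps all the time gets dynamic priority 115 while one that never sleeps gets 125.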
Linux Scheduler Examples
from kernel/sched.c
/*
* These are the 'tuning knobs' of the scheduler:
*
* Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
* maximum timeslice is 200 msecs. Timeslices get refilled after
* they expire.
*/
#define MIN_TIMESLICE ( 10 * HZ / 1000)
#define MAX_TIMESLICE (200 * HZ / 1000)
#define ON_RUNQUEUE_WEIGHT 30
#define CHILD_PENALTY 95
#define PARENT_PENALTY 100
#define EXIT_WEIGHT 3
#define PRIO_BONUS_RATIO 25
#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
#define INTERACTIVE_DELTA 2
#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS)
#define STARVATION_LIMIT (MAX_SLEEP_AVG)
#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
#define NODE_THRESHOLD 125
#define CREDIT_LIMIT 100
/*
* If a task is 'interactive' then we reinsert it in the active
* array after it has expired its current timeslice. (it will not
* continue to run immediately, it will still roundrobin with
* other interactive tasks.)
*
* This part scales the interactivity limit depending on niceness.
*
* We scale it linearly, offset by the INTERACTIVE_DELTA delta.
* Here are a few examples of different nice levels:
*
* TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
* TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
* TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0]
* TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
* TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
*
* (the X axis represents the possible -5 ... 0 ... +5 dynamic
* priority range a task can explore, a value of '1' means the
* task is rated interactive.)
*
* Ie. nice +19 tasks can never get 'interactive' enough to be
* reinserted into the active array. And only heavily CPU-hog nice -20
* tasks will be expired. Default nice 0 tasks are somewhere between,
* it takes some effort for them to get interactive, but it's not
* too hard.
*/
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)
#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)
#define SCALE(v1,v1_max,v2_max) \
    (v1) * (v2_max) / (v1_max)
#define DELTA(p) \
    (SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \
        INTERACTIVE_DELTA)
#define TASK_INTERACTIVE(p) \
    ((p)->prio <= (p)->static_prio - DELTA(p))
- e.g. a task with nice value 0 has static priority 120 (139 is the worst).
SCALE contributes 0 for nice 0, so DELTA is just INTERACTIVE_DELTA, which
is 2; the task is therefore interactive if (p)->prio <= 120 - 2 = 118
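
The same arithmetic can be checked with a small standalone version of the macros. This is a sketch: the functions take a nice value and a dynamic priority directly instead of a task pointer, and MAX_RT_PRIO = 100 and MAX_USER_PRIO = 40 are assumed from the 140 priority levels described earlier. Integer division in C truncates toward zero, which matters for negative nice values.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RT_PRIO        100
#define MAX_USER_PRIO       40   /* assumed: 140 levels minus 100 real-time */
#define PRIO_BONUS_RATIO    25
#define INTERACTIVE_DELTA    2

/* NICE_TO_PRIO as a function. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* DELTA(p), written in terms of the nice value. */
static int interactive_delta(int nice)
{
    return nice * (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) / 40
           + INTERACTIVE_DELTA;
}

/* TASK_INTERACTIVE(p), given the nice value and the dynamic priority. */
static bool task_interactive(int nice, int dynamic_prio)
{
    return dynamic_prio <= nice_to_prio(nice) - interactive_delta(nice);
}
```

For nice 0 the delta is 2, so task_interactive(0, 118) is true and task_interactive(0, 119) is false. For nice 19 the delta is 6, but the dynamic priority can improve by at most 5 points, so such a task is never rated interactive.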
Linux Real-Time scheduling
- all tasks with static priority less than 100 are real-time tasks
- a FIFO task runs until it suspends or a higher priority real-time task
becomes runnable -- this prevents all other (lower priority) tasks from
running
- an RR task is scheduled the same way, except that when its timeslice
expires it moves to the back of the queue for its priority level
- priority for real-time tasks is not adjusted dynamically
- this is not hard real time -- the kernel knows nothing of deadlines,
requirements, etc, and simply gives priority to the "real-time" tasks
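
The difference between the two real-time policies shows up only at the timer tick. A minimal sketch, with hypothetical names (struct rt_task, rt_tick, full_slice) that are not taken from the kernel source:

```c
#include <assert.h>
#include <stdbool.h>

enum policy { POLICY_NORMAL, POLICY_FIFO, POLICY_RR };

/* Hypothetical, simplified real-time task. */
struct rt_task {
    enum policy policy;
    int time_slice;    /* remaining ticks; ignored for FIFO tasks */
    int full_slice;    /* value used to refill time_slice */
};

/* Called once per timer tick for the running real-time task.
 * Returns true if the task should move to the tail of the queue for
 * its priority level so that another task at that level can run. */
static bool rt_tick(struct rt_task *t)
{
    assert(t->policy == POLICY_FIFO || t->policy == POLICY_RR);
    if (t->policy == POLICY_FIFO)
        return false;                /* FIFO: the tick never preempts it */
    if (--t->time_slice > 0)
        return false;                /* RR: timeslice not yet used up */
    t->time_slice = t->full_slice;   /* refill, then round-robin */
    return true;
}
```

In neither case is the priority recomputed: an RR task that uses up its timeslice stays at the same priority level and only changes its position in that level's queue.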