Today's plan

Linux O(1) scheduler
Sam Joseph presentation on semantic file systems
course evaluation

Linux O(1) Scheduler

Linux kernels up through 2.4 had a relatively inefficient scheduler
goal for 2.5/2.6 was an O(1) scheduler, meaning each scheduling decision takes constant time no matter how many processes or threads (tasks in Linux) are runnable
another goal was to have a scalable SMP scheduler: have individual runqueues so one processor does not (usually) have to wait for other processors holding locks
also, scheduling should be fair to equal-priority processes, and should favor interactive processes (processes that block on system calls)

Linux O(1) algorithms

each runqueue has two priority arrays, one active and one expired
Linux has (by default) 140 priority levels
a bitmap (140 bits, which fits in five 32-bit words) records which priority levels have runnable tasks, so finding the highest priority level with a runnable task takes a small constant time (constant because the number of priorities is fixed)
an array of queues holds pointers to the next runnable process at each priority level
each process has a timeslice (what Minix calls a quantum)
recomputing the timeslice for all processes would be O(n), so the scheduler does it when the timeslice for a task has reached 0 and the task is moved to the expired queue array -- O(1) per scheduling operation
note that doing this calculation on each process's timeslice expiration does not improve the total time needed for the computation, but it does substantially improve the worst-case latency of scheduling
once all tasks have moved from the active to the expired queue, the two arrays are swapped, again in O(1)
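
a minimal sketch of the data structures above, to make the O(1) claims concrete. This is not the kernel's code: the struct and function names (task, prio_array, runqueue, enqueue, find_first_prio, pick_next) are hypothetical, and a plain loop stands in for a find-first-set instruction. The point is that every operation touches a bounded number of words, independent of the number of tasks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_PRIO     140                     /* priority levels, as in the notes */
#define BITMAP_WORDS ((NUM_PRIO + 31) / 32)  /* five 32-bit words */

struct task {                   /* hypothetical, heavily simplified task */
    int prio;                   /* 0 = highest priority, 139 = lowest */
    struct task *next;          /* next task queued at the same priority */
};

struct prio_array {
    uint32_t bitmap[BITMAP_WORDS];  /* bit p is set iff queue p is non-empty */
    struct task *head[NUM_PRIO];
    struct task *tail[NUM_PRIO];
    int nr_active;                  /* total tasks in this array */
};

struct runqueue {
    struct prio_array arrays[2];
    struct prio_array *active, *expired;
};

static void rq_init(struct runqueue *rq) {
    memset(rq, 0, sizeof *rq);
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

/* Append a task to the tail of its priority queue: O(1). */
static void enqueue(struct prio_array *a, struct task *t) {
    t->next = NULL;
    if (a->tail[t->prio])
        a->tail[t->prio]->next = t;
    else
        a->head[t->prio] = t;
    a->tail[t->prio] = t;
    a->bitmap[t->prio / 32] |= UINT32_C(1) << (t->prio % 32);
    a->nr_active++;
}

/* Lowest set bit = best priority with a runnable task, or -1 if none.
 * At most BITMAP_WORDS word tests plus 32 bit tests, whatever the task count. */
static int find_first_prio(const struct prio_array *a) {
    for (int w = 0; w < BITMAP_WORDS; w++) {
        if (a->bitmap[w]) {
            int b = 0;
            while (!((a->bitmap[w] >> b) & 1))
                b++;
            return w * 32 + b;
        }
    }
    return -1;
}

/* Remove and return the best runnable task. If the active array is empty,
 * swap the two array pointers first: O(1), no tasks are copied. */
static struct task *pick_next(struct runqueue *rq) {
    if (rq->active->nr_active == 0) {
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    struct prio_array *a = rq->active;
    int p = find_first_prio(a);
    if (p < 0)
        return NULL;                /* nothing runnable at all */
    struct task *t = a->head[p];
    a->head[p] = t->next;
    if (!a->head[p]) {              /* queue drained: clear its bitmap bit */
        a->tail[p] = NULL;
        a->bitmap[p / 32] &= ~(UINT32_C(1) << (p % 32));
    }
    a->nr_active--;
    return t;
}
```

in this sketch a task whose timeslice has run out would be passed to enqueue(rq->expired, t) after its timeslice is recomputed, which is where the per-task O(1) recomputation described above would happen.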

Linux Priority Scheduling

processes have an initial priority -- the nice value, between -20 (highest priority) and 19 (lowest priority)
the dynamic priority computation moves a task up to 5 priority levels above or below the static priority given by its nice value
the dynamic priority computation keeps track of how much time a task spends sleeping, and lowers the priority number (raises the priority) the more a task sleeps; the nice value itself does not change
the timeslice is computed based on the priority, from 10ms (low priority) to 100ms (typical priority) to 200ms (highest priority)
highly interactive tasks are typically reinserted into the active array rather than moved to the expired array, but only if the last array switch was not too far in the past (otherwise tasks in the expired array could starve)
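
a rough sketch of the two computations above, under stated assumptions. The timeslice is assumed to be a linear interpolation between the 10ms and 200ms endpoints, and sleep_pct is a hypothetical 0..100 measure of how much the task has recently slept (the kernel tracks a sleep average in its own units). The function names are made up for illustration.

```c
#define MAX_RT_PRIO      100   /* priorities 0..99 are real-time */
#define MAX_PRIO         140   /* priorities 100..139 are for ordinary tasks */
#define MAX_USER_PRIO    (MAX_PRIO - MAX_RT_PRIO)   /* 40 */
#define MAX_BONUS        10    /* total dynamic range: -5 .. +5 */
#define MIN_TIMESLICE_MS 10
#define MAX_TIMESLICE_MS 200

/* nice -20..19 maps to static priority 100..139 */
static int nice_to_prio(int nice) {
    return MAX_RT_PRIO + nice + 20;
}

/* Assumed linear scaling: static priority 139 gets the minimum timeslice,
 * static priority 100 gets the maximum. */
static int timeslice_ms(int static_prio) {
    return MIN_TIMESLICE_MS +
           (MAX_TIMESLICE_MS - MIN_TIMESLICE_MS) *
           (MAX_PRIO - 1 - static_prio) / (MAX_USER_PRIO - 1);
}

/* sleep_pct (hypothetical): 0 = never sleeps, 100 = sleeps almost always.
 * More sleep gives a larger bonus, i.e. a smaller priority number. */
static int dynamic_prio(int static_prio, int sleep_pct) {
    int bonus = sleep_pct * MAX_BONUS / 100 - MAX_BONUS / 2;   /* -5 .. +5 */
    int prio = static_prio - bonus;
    if (prio < MAX_RT_PRIO)          /* never enter the real-time range */
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
```

with these assumptions a nice 0 task gets a timeslice of about 100ms, and a nice 0 task that sleeps nearly all the time runs at priority 115 instead of 120.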

Linux Scheduler Examples

from kernel/sched.c

/*
 * These are the 'tuning knobs' of the scheduler:
 *
 * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
 * maximum timeslice is 200 msecs. Timeslices get refilled after
 * they expire.
 */
#define MIN_TIMESLICE		( 10 * HZ / 1000)
#define MAX_TIMESLICE		(200 * HZ / 1000)
#define ON_RUNQUEUE_WEIGHT	 30
#define CHILD_PENALTY		 95
#define PARENT_PENALTY		100
#define EXIT_WEIGHT		  3
#define PRIO_BONUS_RATIO	 25
#define MAX_BONUS		(MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
#define INTERACTIVE_DELTA	  2
#define MAX_SLEEP_AVG		(AVG_TIMESLICE * MAX_BONUS)
#define STARVATION_LIMIT	(MAX_SLEEP_AVG)
#define NS_MAX_SLEEP_AVG	(JIFFIES_TO_NS(MAX_SLEEP_AVG))
#define NODE_THRESHOLD		125
#define CREDIT_LIMIT		100

/*
 * If a task is 'interactive' then we reinsert it in the active
 * array after it has expired its current timeslice. (it will not
 * continue to run immediately, it will still roundrobin with
 * other interactive tasks.)
 *
 * This part scales the interactivity limit depending on niceness.
 *
 * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
 * Here are a few examples of different nice levels:
 *
 *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
 *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
 *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
 *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
 *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
 *
 * (the X axis represents the possible -5 ... 0 ... +5 dynamic
 *  priority range a task can explore, a value of '1' means the
 *  task is rated interactive.)
 *
 * Ie. nice +19 tasks can never get 'interactive' enough to be
 * reinserted into the active array. And only heavily CPU-hog nice -20
 * tasks will be expired. Default nice 0 tasks are somewhere between,
 * it takes some effort for them to get interactive, but it's not
 * too hard.
 */

#define NICE_TO_PRIO(nice)	(MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio)	((prio) - MAX_RT_PRIO - 20)
#define TASK_NICE(p)		PRIO_TO_NICE((p)->static_prio)

#define SCALE(v1,v1_max,v2_max) \
	(v1) * (v2_max) / (v1_max)

#define DELTA(p) \
	(SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \
		INTERACTIVE_DELTA)

#define TASK_INTERACTIVE(p) \
	((p)->prio <= (p)->static_prio - DELTA(p))

e.g. a task with nice value 0 has static priority 120 (139 is the worst); SCALE contributes 0 for nice 0, so DELTA(p) is just INTERACTIVE_DELTA, which is 2, and the task is interactive if (p)->prio <= 118
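
the same arithmetic as plain C functions, as a sketch for checking other nice values by hand. The function names interactive_delta and task_interactive are hypothetical; the constants assume MAX_RT_PRIO is 100 and MAX_USER_PRIO is 40, which is consistent with the 140 priority levels and the real-time range described in these notes.

```c
#define MAX_RT_PRIO       100
#define MAX_USER_PRIO      40
#define PRIO_BONUS_RATIO   25
#define INTERACTIVE_DELTA   2

/* Mirrors DELTA(p): scale the nice value from a range of 40 down to a
 * range of MAX_USER_PRIO * PRIO_BONUS_RATIO / 100 = 10, then add the offset.
 * C integer division truncates toward zero. */
static int interactive_delta(int nice) {
    return nice * (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) / 40
           + INTERACTIVE_DELTA;
}

/* Mirrors TASK_INTERACTIVE(p) for a task with the given dynamic priority
 * and nice value. Returns 1 if the task is rated interactive. */
static int task_interactive(int prio, int nice) {
    int static_prio = MAX_RT_PRIO + nice + 20;   /* NICE_TO_PRIO(nice) */
    return prio <= static_prio - interactive_delta(nice);
}
```

for example, interactive_delta(19) is 6, more than the 5 levels a task can gain, so a nice +19 task is never rated interactive; interactive_delta(-20) is -3, so a nice -20 task is rated interactive unless its dynamic priority is more than 3 levels worse than its static priority. Because the division truncates toward zero, this sketch gives a delta of 0 for nice -10.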

Linux Real-Time scheduling

all tasks with static priority less than 100 are real-time tasks
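a FIFO real-time task runs until it blocks, yields, or is preempted by a higher-priority real-time task -- while it is runnable, no lower-priority task can run
a RR real-time task is scheduled the same way, except that it also has a timeslice: when the timeslice expires, the task goes to the back of the queue for its priority level
FIFO and RR tasks share the same real-time priority range, so neither policy by itself outranks the other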
priority for real-time tasks is not adjusted dynamically
this is not hard real time -- the kernel knows nothing of deadlines, requirements, etc, and simply gives priority to the "real-time" tasks
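
from user space, a process asks for one of these policies with the POSIX call sched_setscheduler. The sketch below is one way to wrap it; the helper name make_realtime is hypothetical, and the call normally fails with EPERM unless the process has the needed privileges. Note that at this interface larger real-time priority numbers mean higher priority.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sched.h>

/* Try to switch the calling process to policy SCHED_FIFO or SCHED_RR.
 * rt_priority is clamped to the range the system reports for that policy.
 * Returns 0 on success, or the errno value on failure. */
static int make_realtime(int policy, int rt_priority) {
    struct sched_param sp = {0};
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo < 0 || hi < 0)
        return errno;               /* unknown policy */
    if (rt_priority < lo)
        rt_priority = lo;
    if (rt_priority > hi)
        rt_priority = hi;

    sp.sched_priority = rt_priority;
    if (sched_setscheduler(0, policy, &sp) != 0)   /* pid 0 = the caller */
        return errno;
    return 0;
}
```

a caller might use make_realtime(SCHED_RR, 10) and fall back to ordinary scheduling if the result is nonzero. A runaway SCHED_FIFO task can lock out every ordinary task on its processor, so this is worth trying only on a test machine.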