ICS 311 #22: Multithreaded Algorithms


Outline

  1. Concepts of Dynamic Multithreading
  2. Modeling Dynamic Multithreading
  3. Measuring Dynamic Multithreading
  4. Analysis of Multithreaded Algorithms
  5. Example: Matrix Multiplication
  6. Example: Merge Sort
  7. Scheduling

Readings


Concepts of Dynamic Multithreading

Multithreading is a crucial topic for modern computing. Parallel machines are getting cheaper and in fact are now ubiquitous ...

Our emphasis here will be parallel algorithms, that is, multithreading a single algorithm so that some of its instructions may be executed simultaneously. Parallelism can also be applied to scheduling and managing multiple algorithms, each running concurrently in their own thread and possibly sharing resources, as studied in courses on operating systems and concurrent and high performance computing.

Static and Dynamic Multithreading

Static threading provides the programmer with an abstraction of virtual processors that are managed explicitly. It's "static" because the programmer must specify in advance how many processors to use at each point. This can be difficult and inflexible with respect to evolving conditions.

Rather than managing threads explicitly, our model is dynamic multithreading in which programmers specify opportunities for parallelism, and a concurrency platform manages the decisions of mapping these opportunities to actual static threads.

Concurrency Constructs:

We will use three keywords in our pseudocode, reflecting current parallel-computing practice:

  1. spawn: the procedure call that follows it may run in parallel with the rest of the calling procedure.
  2. sync: wait at this point until all procedures spawned by this procedure have finished.
  3. parallel: an addition to loop constructs such as for, indicating that the iterations of the loop may run concurrently.

These keywords specify opportunities for parallelism without affecting whether the corresponding sequential program, obtained by removing them, is correct. In other words, if we ignore the parallel keywords the program can be analyzed as an ordinary single-threaded program. We exploit this in analysis.

Logical Parallelism

The parallel and spawn keywords do not force parallelism: they just say that it is permissible. This is logical parallelism. A scheduler will make the decision concerning allocation to processors. We return to the question of scheduling at the end of this document, after appropriate concepts have been introduced.

However, if parallelism is used, sync must be respected. For safety, there is an implicit sync at the end of every procedure.

Example: Parallel Fibonacci

For illustration, we take a really slow algorithm and make it parallel. (There are much better ways to compute Fibonacci numbers using dynamic programming; this is just for illustration.) Here is the definition of Fibonacci numbers:

F_0 = 0.
F_1 = 1.
F_i = F_{i−1} + F_{i−2}, for i ≥ 2.

Here is a recursive non-parallel algorithm for computing Fibonacci numbers modeled on the above definition, along with its recursion tree:
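For concreteness, here is that serial algorithm rendered as a Go function (my rendering; the text shows it as pseudocode):

    package main

    import "fmt"

    // fib is the direct serial transcription of the definition above.
    // It recomputes subproblems, so it takes time proportional to F_n.
    func fib(n int) int {
        if n < 2 {
            return n
        }
        return fib(n-1) + fib(n-2)
    }

    func main() {
        fmt.Println(fib(10)) // prints 55
    }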

Fib has the recurrence relation T(n) = T(n − 1) + T(n − 2) + Θ(1), which has the solution T(n) = Θ(F_n) (see the text for a substitution-method proof). This grows exponentially in n, so it's not very efficient. (A straightforward iterative algorithm is much better.)

Noticing that the recursive calls operate independently of each other, let's see what improvement we can get by computing the two recursive calls in parallel. This will illustrate the concurrency keywords and also be an example for analysis:

Notice that without the parallel keywords it is the same as the serial program above.
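To make the keywords concrete, here is a sketch of P-Fib in Go (my rendering, not the text's): the go statement plays the role of spawn and the channel receive plays the role of sync. Deleting the goroutine and the channel recovers the serial function above.

    package main

    import "fmt"

    // pFib mirrors P-Fib: the first recursive call is "spawned" as a goroutine
    // (logical parallelism: the runtime decides whether it actually runs in
    // parallel), the second call runs in the caller, and the channel receive
    // acts as the sync before the two results are combined.
    func pFib(n int) int {
        if n < 2 {
            return n
        }
        ch := make(chan int, 1)
        go func() { ch <- pFib(n - 1) }() // "spawn"
        y := pFib(n - 2)                  // the caller keeps working
        x := <-ch                         // "sync": wait for the spawned call
        return x + y
    }

    func main() {
        fmt.Println(pFib(10)) // prints 55
    }

Of course, spawning a goroutine for every recursive call has real overhead; the point is only to illustrate the keywords.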

We will return to this example when we analyze multithreading.


Modeling Dynamic Multithreading

First we need a formal model to describe parallel computations.

A Model of Multithreaded Execution

We will model a multithreaded computation as a computation DAG (directed acyclic graph) G = (V, E):

  1. The vertices V are strands: maximal sequences of instructions containing no parallel control (no spawn, sync, parallel, or return from a spawned procedure).
  2. The edges E represent dependencies between strands: an edge (u, v) means strand u must complete before strand v can begin.

We assume an ideal parallel computer with sequentially consistent memory, meaning it behaves as if the instructions were executed sequentially in some full ordering consistent with orderings within each thread (i.e., consistent with the partial ordering of the computation DAG).

Visualizing the Model

The model can be visualized as exemplified below for the computation DAG for P-Fib(4):

Vertices (strands) are visualized as circles in the figure.

Edges are categorized and visualized as follows:

  1. Continuation edges, drawn horizontally, connect a strand to its successor within the same procedure instance.
  2. Spawn edges, pointing downward, connect a strand to the first strand of a procedure it spawns.
  3. Call edges, also pointing downward, connect a strand to the first strand of a procedure it calls normally.
  4. Return edges, pointing upward, connect the final strand of a procedure back to the strand following the call or sync in its caller.


Measuring Dynamic Multithreading

We write T_P to indicate the running time of an algorithm on P processors. Then we define these measures and laws:

Work

T_1 = the total time to execute an algorithm on one processor. This is called work in analogy to work in physics: the total amount of computational work that gets done.

An ideal parallel computer with P processors can do at most P units of work in one time step. So, in T_P time it can do at most P·T_P work. Since the total work is T_1, we must have P·T_P ≥ T_1; dividing by P gives the work law:

T_P   ≥   T_1 / P

The work law can be read as saying that the speedup for P processors can be no better than the time with one processor divided by P. That is,

parallelism on P processors reduces the running time by at most a constant factor of 1/P; equivalently, the speedup is at most P.

Parallelism will not change the asymptotic class of an algorithm: it's not a substitute for careful design of asymptotically fast algorithms.

Span

The span of a multithreaded computation is the longest time to execute the strands along any path in the computation DAG. If each strand (represented by vertices) takes a unit of time, then this will be the number of vertices on the longest path in the DAG, which we call the critical path. If strands take different amounts of time then the critical path will be the path with the greatest cost, summing the costs associated with the vertices.

(Readers in our classes may recall the class exercise on finding the shortest time in which you can complete a set of interdependent jobs by finding the longest path in the job DAG: the concept here is similar.)

We can also define span as T_∞ = the total time to execute an algorithm on an infinite number of processors, or, more practically speaking, on as many processors as are needed to allow parallelism wherever it is possible. It is the fastest we can possibly expect, an Ω bound: no matter how many processors you have, the algorithm must take at least this long.

The critical path in our P-Fib example is represented by the shaded edges in the figure. Notice that the span is not simply the cost of the path from the root to the leaves of the recursion tree: once the recursion has hit the base case, execution still needs to proceed as the recursion unwinds.

The span law states that a P-processor ideal parallel computer cannot run faster than one with an infinite number of processors:

T_P   ≥   T_∞

This is because at some point the span will limit the speedup possible: no matter how many processors you have, the strands on the critical path must still be executed in sequence, taking the time they require.

Exercise: If we count each vertex as one unit of work, what is the work and span of the computation DAG for P-Fib shown?

Speedup

The ratio T_1 / T_P defines how much speedup you get with P processors as compared to one.

By the work law,

T_P   ≥   T_1 / P,     so     T_1 / T_P   ≤   P:

one cannot have any more speedup than the number of processors.

This is important enough to repeat: parallelism provides only a constant-factor improvement (the constant being the number of processors) to any algorithm! Parallelism cannot move an algorithm from a higher to a lower complexity class (e.g., exponential to polynomial, or quadratic to linear). Parallelism is not a silver bullet: good algorithm design and analysis is still needed.

When the speedup T_1 / T_P = Θ(P) we have linear speedup: the speedup is linear in the number of processors.

When T_1 / T_P = P we have perfect linear speedup: we got the maximum amount of speedup possible from each processor.

Parallelism

The ratio T_1 / T_∞ of the work to the span gives the (potential) parallelism of the computation. It can be interpreted in three ways:

  1. As a ratio: the average amount of work that can be performed in parallel for each step along the critical path.
  2. As an upper bound: the maximum possible speedup that can be achieved on any number of processors.
  3. As a limit on perfect linear speedup: once the number of processors exceeds the parallelism, perfect linear speedup is no longer possible.

This latter way of looking at T_1 / T_∞ leads to the concept of parallel slackness:

(T_1 / T_∞) / P   =   T_1 / (P·T_∞),

the factor by which the parallelism of the computation exceeds the number of processors in the machine. We have three cases:

  1. If the slackness is less than 1, perfect linear speedup is impossible: there is not enough parallelism to keep all P processors busy.
  2. If the slackness is exactly 1, there is just enough parallelism to hope for perfect linear speedup, with nothing to spare.
  3. If the slackness is greater than 1, the work per processor T_1/P is the limiting constraint, and (as discussed under Scheduling below) near-perfect linear speedup becomes achievable as the slackness grows.

Exercise: What is the parallelism of the computation DAG for P-Fib shown previously? What does this parallelism say about the prospects for speedup at *this* n? What happens to work and span as n grows?


Analysis of Multithreaded Algorithms

Analyzing work is simple: ignore the parallel constructs and analyze the serial algorithm.

For example, we noted previously that the work of P-Fib(n) is given by the recurrence

T(n) = T(n − 1) + T(n − 2) + Θ(1),

which has the solution T(n) = Θ(F_n), the work T_1(n) of P-Fib(n).

Analyzing span requires a different approach. (I hope you did the exercises above: they will make you appreciate the following all the more.)

Analyzing Span

If a set of subcomputations (or the vertices representing them) are in series, the span is the sum of the spans of the subcomputations. This is like normal sequential analysis (as was just exemplified above with the sum T(n − 1) + T(n − 2)).

If a set of subcomputations (or the vertices representing them) are in parallel, the span is the maximum of the spans of the computations. This is where analysis of multithreaded algorithms differs.

Returning to our example, the span of the parallel recursive calls of P-Fib(n) is computed by taking the max rather than the sum:

T_∞(n)   =   max(T_∞(n−1), T_∞(n−2)) + Θ(1)
              =   T_∞(n−1) + Θ(1).

The recurrence T_∞(n) = T_∞(n−1) + Θ(1) has solution Θ(n). So the span of P-Fib(n) is Θ(n).

We can now compute the parallelism of P-Fib(n) in general (not just the specific case of n = 4 that we computed earlier) by dividing its work Θ(F_n) by its span Θ(n):

T_1(n) / T_∞(n)   =   Θ(F_n) / Θ(n)   =   Θ(F_n / n).
This grows dramatically, as F_n grows much faster than n.

For any given number of processors P, there is considerable parallel slackness Θ(F_n/n)/P: for any fixed P, once n is large enough there is plenty for additional processors to do. Thus there is potential for near-perfect linear speedup as n grows.

Of course in this example it's because we chose an inefficient way to compute Fibonacci numbers, but this was only for illustration. These ideas apply to other well-designed algorithms.

Parallel Loops

So far we have used spawn, but not the parallel keyword, which is used with loop constructs such as for. Here is an example.

Suppose we want to multiply an n × n matrix A = (a_ij) by an n-vector x = (x_j). This yields an n-vector y = (y_i) where each y_i = Σ_{j=1..n} a_ij·x_j:

The following algorithm does this in parallel:

The parallel for keywords indicate that each iteration of the loop can be executed concurrently. (Notice that the inner for loop is not parallel; this is a possible point of improvement, discussed below.)
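As a concrete sketch (mine, not the text's), the parallel for over i can be expressed in Go with one goroutine per iteration and a WaitGroup playing the role of the implicit sync at the end of the loop; the inner loop over j stays serial:

    package main

    import (
        "fmt"
        "sync"
    )

    // matVec computes y = A·x with the outer loop over rows run in parallel:
    // each row i gets its own goroutine, and wg.Wait() plays the role of the
    // implicit sync at the end of the parallel for. The inner loop over j is
    // serial, as in Mat-Vec.
    func matVec(A [][]float64, x []float64) []float64 {
        n := len(A)
        y := make([]float64, n) // each y_i starts at 0
        var wg sync.WaitGroup
        for i := 0; i < n; i++ {
            wg.Add(1)
            go func(i int) { // one goroutine per iteration of the parallel for
                defer wg.Done()
                for j := 0; j < n; j++ { // serial inner loop
                    y[i] += A[i][j] * x[j]
                }
            }(i)
        }
        wg.Wait() // implicit sync at the end of the parallel loop
        return y
    }

    func main() {
        A := [][]float64{{1, 2}, {3, 4}}
        x := []float64{1, 1}
        fmt.Println(matVec(A, x)) // [3 7]
    }

Each goroutine writes only its own y[i], so parallel iterations never write the same location.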

Implementing Parallel Loops

It is not realistic to think that all n subcomputations in these loops (lines 3 and 5) can be spawned simultaneously with no extra work. (For some operations, on hardware designed for them, such as hardware for matrix operations, this may be possible up to some fixed n; but we are concerned with the general case.) How might this parallel spawning be done, and how does this affect the analysis?

Parallel for spawning can be accomplished by a compiler with a divide and conquer approach, itself implemented with parallelism!

Consider how to implement parallelism in lines 5-7 of Mat-Vec above. The concurrency platform's compiler can arrange for the procedure shown below, Mat-Vec-Main-Loop(A, x, y, n, 1, n), to be called instead. Lines 2 and 3 below are the lines originally within the loop.

The computation DAG is also shown. It appears that a lot of work is being done to spawn the n leaf node computations, but the increase is not asymptotically significant.
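Here is a sketch of that recursive spawning in Go (my illustration of the idea, not the platform's actual code): the iteration range lo..hi is split in half, one half is spawned, the caller handles the other half, and single iterations are executed at the leaves.

    package main

    import "fmt"

    // mainLoop plays the role of Mat-Vec-Main-Loop: it covers iterations
    // lo..hi of the parallel for by splitting the range in half, spawning
    // one half as a goroutine, handling the other half in the caller, and
    // syncing before returning.
    func mainLoop(A [][]float64, x, y []float64, lo, hi int) {
        if lo == hi { // leaf: one iteration of the original loop body
            for j := 0; j < len(x); j++ {
                y[lo] += A[lo][j] * x[j]
            }
            return
        }
        mid := (lo + hi) / 2
        done := make(chan struct{})
        go func() { // "spawn" the lower half of the iteration range
            mainLoop(A, x, y, lo, mid)
            close(done)
        }()
        mainLoop(A, x, y, mid+1, hi) // the caller handles the upper half
        <-done                       // "sync": wait for the spawned half
    }

    func main() {
        A := [][]float64{{1, 2}, {3, 4}}
        x := []float64{1, 1}
        y := make([]float64, 2)
        mainLoop(A, x, y, 0, 1)
        fmt.Println(y) // [3 7]
    }

The halving continues until single iterations remain at the leaves, mirroring the recursion tree analyzed next.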

The work of Mat-Vec is T_1(n) = Θ(n²) due to the nested loops in lines 5-7.

Since the computation DAG is a full binary tree, the number of internal nodes is 1 fewer than the n leaf nodes (Topic 8), so this extra work is Θ(n).

Each leaf node corresponds to one iteration of the loop, and the extra work of recursive spawning can be amortized across the work of the iterations, so that it contributes only constant work per iteration.

Concurrency platforms sometimes coarsen the recursion tree by executing several iterations in each leaf, reducing the amount of recursive spawning.

The span is increased by Θ(lg n) due to the addition of the recursion tree for Mat-Vec-Main-Loop, which is of height Θ(lg n). In some cases (such as this one), this increase is washed out by other dominating factors (e.g., the span in this example is dominated by the nested loops).

Nested Parallelism

Continuing with our example, the span is Θ(n) because even with full utilization of parallelism the inner for loop still requires Θ(n). Since the work is Θ(n²), the parallelism is Θ(n²)/Θ(n) = Θ(n). Can we improve on this?

Perhaps we could make the inner for loop parallel as well? Compare the original to the revised version Mat-Vec':

Would it work? We need to introduce a new concept ...

Race Conditions

Deterministic algorithms do the same thing on the same input, while nondeterministic algorithms may give different results on different runs.

The above Mat-Vec' algorithm is subject to a potential problem called a determinacy race: the outcome of a computation may be nondeterministic (unpredictable). In general, this can only happen when two logically parallel computations access the same memory location and at least one of them performs a write.

Determinacy races are hard to detect with empirical testing: many execution sequences would give correct results. This kind of software bug is consequential: race condition bugs caused the Therac-25 radiation machine to overdose patients, killing three, and caused the North American Blackout of 2003.

For example, the code shown below might output 1 or 2 depending on the order in which access to x is interleaved by the two threads:


The value of x must first be read into a register before it can be operated on; here each thread has its own register. The value is incremented in the register and then written back out to memory. The table indicates one possible interleaving of these steps that gives the unexpected result.
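Here is a Go version of that two-thread race (illustrative code, not from the text): each goroutine reads x into a local variable, which plays the role of its register, increments the local copy, and writes it back.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        x := 0
        var wg sync.WaitGroup
        for i := 0; i < 2; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                r := x    // read x into a local "register"
                r = r + 1 // increment the private copy
                x = r     // write it back: the racing write
            }()
        }
        wg.Wait()
        fmt.Println(x) // usually 2, but 1 if both reads happen before either write
    }

Go's race detector (go run -race) flags this access pattern as a data race even on runs that happen to print 2.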

After you understand that simple example, let's look at our (renamed) matrix-vector example again:

Exercise: Do you see how y_i might be updated differently depending on the order in which parallel invocations of line 7 (each reading the current value of y_i and writing a new one) are executed? Keep in mind that race conditions can only happen when two logically parallel computations access the same memory location and at least one performs a write.


Example: Matrix Multiplication

Multithreading the basic algorithm

Here is an algorithm for multithreaded matrix multiplication, based on the T_1(n) = Θ(n³) algorithm:
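As a Go sketch of the same structure (my rendering): the two outer loops over i and j run in parallel, while the innermost loop over k, which accumulates into the single cell C[i][j], remains serial.

    package main

    import (
        "fmt"
        "sync"
    )

    // pMatMul parallelizes the two outer loops of the Θ(n³) algorithm.
    // Each (i, j) pair gets its own goroutine and owns the single cell
    // C[i][j], so the serial inner loop over k involves no shared writes.
    func pMatMul(A, B [][]float64) [][]float64 {
        n := len(A)
        C := make([][]float64, n)
        for i := range C {
            C[i] = make([]float64, n)
        }
        var wg sync.WaitGroup
        for i := 0; i < n; i++ {
            for j := 0; j < n; j++ {
                wg.Add(1)
                go func(i, j int) { // parallel over both i and j
                    defer wg.Done()
                    for k := 0; k < n; k++ { // serial inner loop
                        C[i][j] += A[i][k] * B[k][j]
                    }
                }(i, j)
            }
        }
        wg.Wait()
        return C
    }

    func main() {
        A := [][]float64{{1, 2}, {3, 4}}
        B := [][]float64{{5, 6}, {7, 8}}
        fmt.Println(pMatMul(A, B)) // [[19 22] [43 50]]
    }

Note which memory locations each goroutine writes; this is relevant to the exercise that follows.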

Exercise: How does this procedure compare to MAT-VEC-WRONG? Both of them have nested parallel for loops: Is P-SQUARE-MATRIX-MULTIPLY also subject to a race condition? Why or why not?

The span of this algorithm is T_∞(n) = Θ(n), due to the Θ(lg n) paths that spawn the outer and inner parallel loop executions, followed by the n executions of the innermost for loop. So the parallelism is T_1(n) / T_∞(n)   =   Θ(n³) / Θ(n)   =   Θ(n²).

Exercise: Could we get the span down to Θ(1) if we parallelized the inner for with parallel for? You should be able to answer this based on the previous exercise.

Multithreading a divide and conquer algorithm

Here is a parallel version of the divide and conquer algorithm from Chapter 4 of CLRS (not in these web notes):

See the text for analysis, which concludes that the work is still Θ(n³) but the span is reduced to Θ(lg² n). Thus the parallelism is Θ(n³) / Θ(lg² n) = Θ(n³ / lg² n), which makes good use of parallel resources.


Example: Merge Sort

Divide and conquer algorithms are good candidates for parallelism, because they break the problem into independent subproblems that can be solved separately. We look briefly at merge sort.

Parallelizing Merge-Sort

The dividing is in the main procedure MERGE-SORT, and we can parallelize it by spawning the first recursive call:

MERGE remains a serial algorithm, so its work and span are Θ(n) as before.
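Here is a Go sketch of MERGE-SORT' (my rendering, with an explicitly serial merge helper): the first recursive call is spawned, the second runs in the caller, and the serial merge happens after the sync.

    package main

    import "fmt"

    // merge is the ordinary serial MERGE: Θ(n) work and Θ(n) span.
    func merge(a, b []int) []int {
        out := make([]int, 0, len(a)+len(b))
        i, j := 0, 0
        for i < len(a) && j < len(b) {
            if a[i] <= b[j] { // take from the left run first: keeps the sort stable
                out = append(out, a[i])
                i++
            } else {
                out = append(out, b[j])
                j++
            }
        }
        out = append(out, a[i:]...)
        return append(out, b[j:]...)
    }

    // mergeSort spawns the first recursive call, runs the second in the
    // caller, syncs, and then merges serially, as in MERGE-SORT'.
    func mergeSort(s []int) []int {
        if len(s) <= 1 {
            return s
        }
        mid := len(s) / 2
        ch := make(chan []int, 1)
        go func() { ch <- mergeSort(s[:mid]) }() // spawn the first half
        right := mergeSort(s[mid:])              // sort the second half in the caller
        left := <-ch                             // sync
        return merge(left, right)
    }

    func main() {
        fmt.Println(mergeSort([]int{5, 2, 9, 1, 5, 6})) // [1 2 5 5 6 9]
    }

Because the merge itself is still serial, this sketch has only the Θ(lg n) parallelism derived below.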

The recurrence for the work MS'_1(n) of MERGE-SORT' is the same as for the serial version: MS'_1(n) = 2 MS'_1(n/2) + Θ(n), which has solution Θ(n lg n).

The recurrence for the span MS'_∞(n) of MERGE-SORT' is based on the fact that the recursive calls run in parallel, so there is only one n/2 term (the two spans are equal, so the max is just one of them): MS'_∞(n) = MS'_∞(n/2) + Θ(n), which has solution Θ(n).

The parallelism is thus MS'_1(n) / MS'_∞(n)   =   Θ(n lg n / n)   =   Θ(lg n).

This is low parallelism, meaning that even for large input we would not benefit from having hundreds of processors. How about speeding up the serial MERGE?

Parallelizing Merge

MERGE takes two sorted lists and steps through them together to construct a single sorted list. This seems intrinsically serial, but there is a clever way to make it parallel.

A divide-and-conquer strategy can rely on the fact that the lists are sorted to break the lists into four lists, two of which will be merged to form the head of the final list and the other two merged to form the tail.

To find the four lists for which this works, we

  1. Choose the longer list to be the first list, T[p1 .. r1] in the figure below.
  2. Find the middle element (median) of the first list (x at q1).
  3. Use binary search to find the position (q2) at which this element would be inserted in the second list T[p2 .. r2] (a sketch of such a search follows this list).
  4. Recursively merge, in parallel, the part of the first list before the median (T[p1 .. q1−1]) with the part of the second list before the insertion point (T[p2 .. q2−1]), and the part of the first list after the median (T[q1+1 .. r1]) with the rest of the second list (T[q2 .. r2]).
  5. Assemble the results with the median element placed between them, as shown below.
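The binary search in step 3 is a lower-bound search; here is a Go sketch of one common convention (mine; the text's BINARY-SEARCH may draw the boundary slightly differently): it returns the first index whose element is at least x, which is where x could be inserted while keeping the list sorted.

    package main

    import "fmt"

    // searchInsert returns the first index q in the sorted slice t with
    // t[q] >= x (equivalently, the number of elements strictly less than x).
    func searchInsert(t []int, x int) int {
        lo, hi := 0, len(t) // the answer lies in [lo, hi]
        for lo < hi {
            mid := (lo + hi) / 2
            if t[mid] < x {
                lo = mid + 1 // everything up to mid is < x
            } else {
                hi = mid // t[mid] >= x, so mid is still a candidate
            }
        }
        return lo
    }

    func main() {
        fmt.Println(searchInsert([]int{2, 4, 4, 7, 9}, 5)) // 3
    }

Finding q2 this way costs Θ(lg n) comparisons in the worst case, which is where the lg n terms in the analysis below come from.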

The text presents the BINARY-SEARCH pseudocode and analysis of Θ(lg n) worst case; this should be review for you. It then assembles these ideas into a parallel merge procedure that merges into a second array Z at location p3 (r3 is not provided as it can be computed from the other parameters):

Analysis

My main purpose in showing this to you is so that you see that even apparently serial algorithms sometimes have a parallel alternative, so we won't get into details; but here is an outline of the analysis:

The span of P-MERGE is the maximum span of a parallel recursive call. Notice that although we divide the first list in half, it could turn out that x's insertion point q2 is at the beginning or end of the second list. Thus (informally), a recursive call operates on at most 3n/4 of the elements: in the worst case we have only "chopped off" half of the first (longer) list, which accounts for at least 1/4 of the n elements.

The text derives the span recurrence shown below, PM_∞(n) = PM_∞(3n/4) + Θ(lg n); it does not fit the form required by the Master Theorem, so an approach from a prior exercise is used to solve it, giving PM_∞(n) = Θ(lg² n):

Given 1/4 ≤ α ≤ 3/4 for the unknown split of elements between the two recursive calls, the work recurrence turns out to be PM_1(n) = PM_1(αn) + PM_1((1−α)n) + O(lg n):

With some more work, PM_1(n) = Θ(n) is derived. Thus the parallelism of P-MERGE is PM_1(n) / PM_∞(n) = Θ(n / lg² n).

Some adjustment to the MERGE-SORT' code is needed to use this P-MERGE; see the text. Further analysis shows that the work for the new sort, P-MERGE-SORT, is PMS_1(n) = Θ(n lg n), and the span is PMS_∞(n) = Θ(lg³ n). This gives parallelism of Θ(n / lg² n), which is much better than the Θ(lg n) of MERGE-SORT' in terms of the potential use of additional processors as n grows.

The chapter ends with a comment on coarsening the parallelism by using an ordinary serial sort once the lists get small. One might consider whether P-MERGE-SORT is still a stable sort, and choose the serial sort to retain this property if it is desirable.


Scheduling

At the beginning, we noted that we rely on a concurrency platform to determine how to allocate potentially parallel threads of computation to available processors. This is the scheduling problem. Scheduling parallel computations is a complex problem, and sophisticated schedulers have been designed that are beyond what we can discuss here.

Centralized schedulers are those that have information on the global state of the computation, but they must make decisions in real time rather than in batch. A simple approach to centralized scheduling is a greedy scheduler, which assigns as many strands to available processors as possible at each time step. The CLRS text proves a theorem concerning the performance of a greedy scheduler, with interesting corollaries:

Theorem: On an ideal parallel computer with P processors, a greedy scheduler executes a multithreaded computation with work T_1 and span T_∞ in time T_P ≤ T_1/P + T_∞.
Corollary: The running time T_P of any multithreaded computation scheduled by a greedy scheduler on an ideal parallel computer with P processors is within a factor of 2 of optimal.
Corollary: As the slackness grows, a greedy scheduler achieves near-perfect linear speedup on any multithreaded computation.

The proofs are not difficult to understand: see the text if you are interested. I think we have said enough here to introduce the concepts of multithreading.


Final Note

Professor Henri Casanova does research on scheduling, and Professor Nodari Sitchinava does research on parallel algorithms. They would be happy to talk to interested students.


Dan Suthers
Last modified: Sun Nov 29 04:51:33 HST 2020
Images are from the instructor's material for Cormen et al. Introduction to Algorithms, Third Edition.