This presentation follows the CLRS reading fairly closely, selecting out the most relevant parts and explaining a few things in more detail. (The associated videos change the ordering somewhat: 13A provides a conceptual introduction, leaving the activity selection example for 13B.)
Both Dynamic Programming and Greedy Algorithms are ways of solving optimization problems: a solution is sought that optimizes (minimizes or maximizes) an objective function.
Dynamic Programming:
Greedy Algorithms "greedily" take the choice with the most immediate gain.
For some problems, but not all, local optimization actually results in global optimization.
We'll use an example to simultaneously review dynamic programming and motivate greedy algorithms, as the two approaches are related (but distinct).
Suppose that activities require exclusive use of a common resource, and you want to schedule as many as possible.
Let S = {a_{1}, ..., a_{n}} be a set of n activities.
Each activity a_{i} needs the resource during a time period starting at s_{i} and finishing before f_{i}, i.e., during [s_{i}, f_{i}).
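(Why not the closed interval [s_{i}, f_{i}]? With closed intervals, an activity that finishes at exactly the time the next one starts, f_{i} = s_{i+1}, would count as overlapping it.)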
The optimization problem is to select the largest set of non-overlapping (mutually compatible) activities from S.
We assume that activities are sorted by finish time f_{1} ≤ f_{2} ≤ ... ≤ f_{n-1} ≤ f_{n} (this can be done in O(n lg n) time).
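As a small illustration of the problem statement, the sketch below checks compatibility of half-open intervals and sorts activities by finish time. The function names and the sample times are illustrative assumptions, not part of the notes.

```python
from itertools import combinations

def compatible(a, b):
    """True if half-open intervals a = (s, f) and b = (s, f) do not overlap."""
    (s1, f1), (s2, f2) = a, b
    return f1 <= s2 or f2 <= s1

def mutually_compatible(activities):
    """True if every pair of activities in the collection is compatible."""
    return all(compatible(a, b) for a, b in combinations(activities, 2))

def sort_by_finish(activities):
    """Return the activities ordered by finish time (O(n lg n))."""
    return sorted(activities, key=lambda act: act[1])

# Hypothetical (start, finish) pairs.
sample = [(5, 9), (1, 3), (3, 5)]
print(sort_by_finish(sample))        # [(1, 3), (3, 5), (5, 9)]
print(mutually_compatible(sample))   # True: touching endpoints do not overlap
print(compatible((1, 4), (3, 5)))    # False: [1, 4) and [3, 5) share [3, 4)
```

Because the intervals are half-open, an activity may start at exactly the moment another one finishes.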
Consider these activities:
Here is a graphic representation:
Suppose we choose one of the activities that start first, and then look for the next activity that starts after it is done. This could result in {a_{4}, a_{7}, a_{8}}, but this solution is not optimal.
An optimal solution is {a_{1}, a_{3}, a_{6}, a_{8}}. (It maximizes the objective function of number of activities scheduled.)
Another one is {a_{2}, a_{5}, a_{7}, a_{9}}. (Optimal solutions are not necessarily unique.)
How do we find (one of) these optimal solutions? Let's consider it as a dynamic programming problem ...
A dynamic programming analysis begins by identifying the choices to be made, and assuming that you can make an optimal choice (without yet specifying what that choice is) that will be part of an optimal solution.
It then specifies the possible subproblems that result in the most general way (to ensure that possible components of optimal solutions are not excluded), and shows that an optimal solution must recursively include optimal solutions to the subproblems. (This is done by reasoning about the value of the solutions according to the objective function.)
We'll approach Activity Selection similarly. I'll try to clarify the reasoning in the text ...
For generality, we define the problem in a way that applies both to the original problem and subproblems.
Suppose that due to prior choices we are working on a time interval from i to j. This could be after some already-scheduled activity a_{i} and before some already-scheduled activity a_{j}, or for the original problem we can define i and j to bound the full set of activities to be considered.
Then the candidate activities S_{ij} to consider are those that start after a_{i} finishes and end before a_{j} starts:
S_{ij} = {a_{k} ∈ S : f_{i} ≤ s_{k} < f_{k} ≤ s_{j}}
(Notice that we use < for open interval endpoints and ≤ for closed interval endpoints.)
Now let's define A_{ij} to be an optimal solution, i.e., a maximum-size set of mutually compatible activities in S_{ij}. What is the structure of this solution?
At some point we will need to make a choice to include some activity a_{k}, with start time s_{k} and finish time f_{k}, in this solution. This choice will leave two sets of compatible candidates after a_{k} is taken out:
S_{ik}: the activities that start after a_{i} finishes and finish before a_{k} starts.
S_{kj}: the activities that start after a_{k} finishes and finish before a_{j} starts.
(Note that S_{ij} may be a proper superset of S_{ik} ∪ {a_{k}} ∪ S_{kj}, as activities incompatible with a_{k} are excluded.)
Using the same notation as above, define the optimal solutions to these subproblems to be A_{ik} for S_{ik} and A_{kj} for S_{kj}.
So the structure of an optimal solution A_{ij} is:
A_{ij} = A_{ik} ∪ {a_{k}} ∪ A_{kj}
and the number of activities is:
|A_{ij}| = |A_{ik}| + 1 + |A_{kj}|
By the "cut and paste argument", an optimal solution A_{ij} for S_{ij} must include optimal solutions A_{ik} for S_{ik} and A_{kj} for S_{kj}. If some suboptimal solution A'_{ik} with |A'_{ik}| < |A_{ik}| were used for S_{ik} (or similarly A'_{kj} for S_{kj}), we could substitute A_{ik} for it and increase the number of activities, contradicting the optimality of A_{ij}.
Therefore the Activity Selection problem exhibits optimal substructure.
Since the optimal solution A_{ij} must include optimal solutions to the subproblems for S_{ik} and S_{kj}, we could solve by dynamic programming.
Let c[i, j] = size of optimal solution for S_{ij} (c[i, j] has the same value as |A_{ij}|, but apparently CLRS are switching notation to indicate that this is for any optimal solution). Then
c[i, j] = c[i, k] + c[k, j] + 1 (the 1 is to count a_{k}).
We don't know which activity a_{k} to choose for an optimal solution, so we could try them all:
c[i, j] = 0 if S_{ij} = ∅
c[i, j] = max {c[i, k] + c[k, j] + 1 : a_{k} ∈ S_{ij}} if S_{ij} ≠ ∅
This suggests a recursive algorithm that can be memoized, or we could develop an equivalent bottom-up approach, filling in tables in either case.
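A memoized version of this idea might look like the following sketch, which tries every candidate a_{k} between a_{i} and a_{j} and caches c[i, j]. The function name, the use of two sentinel activities to bound the full input, and the sample times are assumptions made for illustration; the times are hypothetical values, not taken from the example above.

```python
import math
from functools import lru_cache

def max_compatible_dp(s, f):
    """Size of a largest compatible subset; s and f are sorted by finish time."""
    n = len(s)
    # Sentinels (an assumption of this sketch): a_0 ends before everything and
    # a_{n+1} starts after everything, so S_{0,n+1} is the whole input.
    start = [-math.inf] + list(s) + [math.inf]
    finish = [-math.inf] + list(f) + [math.inf]

    @lru_cache(maxsize=None)
    def c(i, j):
        best = 0  # value when S_ij is empty
        # Finish times are sorted, so every member a_k of S_ij has i < k < j.
        for k in range(i + 1, j):
            if finish[i] <= start[k] and finish[k] <= start[j]:
                best = max(best, c(i, k) + c(k, j) + 1)
        return best

    return c(0, n + 1)

# Hypothetical start and finish times, already sorted by finish time.
s = [1, 2, 4, 1, 5, 8, 9, 11, 13]
f = [3, 5, 7, 8, 9, 10, 11, 14, 16]
print(max_compatible_dp(s, f))  # 4
```

There are O(n^2) subproblems and each one scans O(n) candidates, so this sketch takes O(n^3) time, which is the cost the greedy approach avoids.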
But it turns out we can solve this without considering multiple subproblems.
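We are trying to optimize the number of activities. Let's be greedy! Intuitively, the activity that finishes first leaves the resource free for as much of the remaining time as possible. Since the activities are sorted by finish time, that activity is a_{1}, and choosing it leaves a single subproblem: the activities that start after a_{1} finishes.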
Since there is only a single subproblem, the S_{ij} notation, bounding the set at both ends, is more complex than we need. We'll simplify the notation to indicate the activities that start after a_{k} finishes:
S_{k} = {a_{i} ∈ S : s_{i} ≥ f_{k}}
So, after choosing a_{1} we just have S_{1} to solve (and so on after choices in recursive subproblems).
By optimal substructure, if a_{1} is part of an optimal solution, then an optimal solution to the original problem consists of a_{1} plus all activities in an optimal solution to S_{1}.
But we need to prove that a_{1} is always part of some optimal solution (i.e., to prove our original intuition).
Theorem: If S_{k} is nonempty and a_{m} has the earliest finish time in S_{k}, then a_{m} is included in some optimal solution.
Proof: Let A_{k} be an optimal solution to S_{k}, and let a_{j} ∈ A_{k} have the earliest finish time in A_{k}. If a_{j} = a_{m} we are done. Otherwise, let A'_{k} = (A_{k} - {a_{j}}) ∪ {a_{m}} (substitute a_{m} for a_{j}).
Claim: Activities in A'_{k} are disjoint.
Proof of Claim: Activities in A_{k} are disjoint because it was a solution.
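Since a_{j} is the first activity in A_{k} to finish, every other activity in A_{k} starts at or after f_{j}. Because f_{m} ≤ f_{j} (a_{m} has the earliest finish time in S_{k}), a_{m} finishes before any of them starts, so it cannot overlap with any other activity in A'_{k}.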
No other changes were made to A_{k}, so A'_{k} must consist of disjoint activities.
Since |A'_{k}| = |A_{k}| we can conclude that A'_{k} is also an optimal solution to S_{k}, and it includes a_{m}. QED
Therefore we don't need the full power of dynamic programming: we can just repeatedly choose the activity that finishes first, remove any activities that are incompatible with it, and repeat on the remaining activities until no activities remain.
Let the start and finish times be represented by arrays s and f, where f is assumed to be sorted in monotonically increasing order.
Add a fictitious activity a_{0} with f_{0} = 0, so S_{0} = S (i.e., the entire input sequence).
Our initial call will be RECURSIVE-ACTIVITY-SELECTOR(s, f, 0, n).
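A minimal Python sketch of such a recursive selector follows. It assumes the arrays carry a dummy entry at index 0 for the fictitious activity a_{0}; the sample times are hypothetical values chosen for illustration.

```python
def recursive_activity_selector(s, f, k, n):
    """Return 1-based indices of a maximum-size compatible subset of S_k.

    s and f are indexed 1..n, with index 0 holding the fictitious
    activity a_0 (f[0] = 0). f must be sorted in increasing order.
    """
    m = k + 1
    # Skip activities that start before a_k finishes.
    while m <= n and s[m] < f[k]:
        m += 1
    if m <= n:
        # a_m is the first compatible activity to finish: the greedy choice.
        # (List concatenation is used for readability, not efficiency.)
        return [m] + recursive_activity_selector(s, f, m, n)
    return []

# Hypothetical data; index 0 is the fictitious a_0.
s = [0, 1, 2, 4, 1, 5, 8, 9, 11, 13]
f = [0, 3, 5, 7, 8, 9, 10, 11, 14, 16]
print(recursive_activity_selector(s, f, 0, 9))  # [1, 3, 6, 8]
```

Each call resumes scanning at index k + 1, where the previous call stopped, so no activity is examined twice.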
The algorithm is Θ(n) because each activity is examined exactly once across all calls: each recursive call starts at m, where the previous call left off. (Another example of aggregate analysis.)
If the activities need to be sorted, the overall problem can be solved in O(n lg n) time.
This algorithm is nearly tail recursive, and can easily be converted to an iterative version:
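One way the iterative version might be written in Python is sketched below. It uses 0-based indices and needs no fictitious activity; the function name and the sample times are assumptions for illustration.

```python
def greedy_activity_selector(s, f):
    """Return 0-based indices of a maximum-size compatible subset.

    s and f are start and finish times, with f sorted in increasing order.
    """
    if not s:
        return []
    selected = [0]   # the first activity to finish is always chosen
    k = 0            # index of the most recently selected activity
    for m in range(1, len(s)):
        if s[m] >= f[k]:      # a_m starts after a_k finishes
            selected.append(m)
            k = m
    return selected

# Hypothetical start and finish times, sorted by finish time.
s = [1, 2, 4, 1, 5, 8, 9, 11, 13]
f = [3, 5, 7, 8, 9, 10, 11, 14, 16]
print(greedy_activity_selector(s, f))  # [0, 2, 5, 7]
```

The single pass over the sorted activities makes the Θ(n) bound easy to see.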
Let's trace the algorithm on this:
Instead of starting with the more elaborate dynamic programming analysis, we could have gone directly to the greedy approach.
Typical steps for designing a solution with the greedy strategy (and two properties that are key to determining whether it might apply to a problem):
Then we can construct an algorithm that combines the greedy choice with an optimal solution to the remaining problem.
Both require optimal substructure, but ...
Dynamic Programming
Greedy Strategy
These two problems demonstrate that the two strategies do not solve the same problems. Suppose a thief has a knapsack of fixed carrying capacity, and wants to optimize the value of what he takes.
There are n items. Item i is worth $v_{i} and weighs w_{i} pounds. The thief wants to take the most valuable subset of items with total weight not exceeding W pounds. This is the 0-1 knapsack problem, so called because the thief must either take each item whole or leave it behind (the items are discrete objects, like gold ingots).
In the example, item 1 is worth $6/pound, item 2 $5/pound and item 3 $4/pound.
The greedy strategy of optimizing value per unit of weight would take item 1 first.
The fractional knapsack problem is the same as the 0-1 knapsack problem except that the thief can take a fraction of each item (they are divisible substances, like gold powder).
Both have optimal substructure (why?).
Only the fractional knapsack problem has the greedy choice property:
Fractional: One can take as much as possible of the substance with the highest value per pound, then as much as possible of the next most valuable substance, and so on, until the weight limit W is reached:
0-1: A greedy strategy could result in empty space, reducing the overall dollar density of the knapsack. After choosing item 1, the optimal solution (shown third) cannot be achieved:
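The contrast can be checked with a short sketch. The per-pound values ($6, $5, $4) come from the text; the individual weights and the capacity W = 50 are assumptions chosen to match those densities, and the brute-force search is only a convenient way to find the 0-1 optimum for three items.

```python
from itertools import combinations

def by_density(items):
    """Items as (value, weight), ordered by decreasing value per pound."""
    return sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)

def fractional_knapsack(items, capacity):
    """Greedy: take as much as fits of the densest remaining item."""
    total, remaining = 0.0, capacity
    for value, weight in by_density(items):
        take = min(weight, remaining)
        total += value * take / weight
        remaining -= take
    return total

def greedy_01_knapsack(items, capacity):
    """Same greedy order, but each item is taken whole or not at all."""
    total, remaining = 0, capacity
    for value, weight in by_density(items):
        if weight <= remaining:
            total += value
            remaining -= weight
    return total

def optimal_01_knapsack(items, capacity):
    """Brute force over all subsets; suitable only for tiny inputs."""
    best = 0
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            if sum(w for _, w in subset) <= capacity:
                best = max(best, sum(v for v, _ in subset))
    return best

# Assumed (value, weight) pairs: $6, $5 and $4 per pound.
items = [(60, 10), (100, 20), (120, 30)]
W = 50
print(fractional_knapsack(items, W))  # 240.0
print(greedy_01_knapsack(items, W))   # 160
print(optimal_01_knapsack(items, W))  # 220
```

Under these assumed weights, the greedy 0-1 choice leaves 20 pounds of capacity unused, while the best 0-1 solution skips item 1 entirely.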
The activity scheduler was good for illustration, but is not important in practice. We will see several useful greedy algorithms throughout the semester, starting with the following.
Huffman codes provide an efficient way to compress text, and are constructed using a greedy algorithm. These notes only review how this important algorithm works; see the text for analysis.
Fixed-length binary codes (e.g., ASCII) represent each character with a fixed number of bits (a binary string of fixed length called a codeword).
Variable-length binary codes can vary the number of bits allocated to each character. This opens the possibility of space efficiency by using fewer bits for frequent characters.
Example: Suppose we want to encode documents with these characters:
With a 3-bit fixed-length code it would take 300,000 bits to encode a file of 100,000 characters, but the variable-length code shown requires only 224,000 bits.
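The two totals can be reproduced with a few lines of arithmetic. The frequencies and codewords below are an assumption: they are the standard CLRS example values, used here because they give exactly the 300,000 and 224,000 bits quoted above.

```python
# Assumed character frequencies for a 100,000-character file.
freq = {"a": 45_000, "b": 13_000, "c": 12_000,
        "d": 16_000, "e": 9_000, "f": 5_000}

# Assumed codewords: a 3-bit fixed-length code and a variable-length code.
fixed = {"a": "000", "b": "001", "c": "010",
         "d": "011", "e": "100", "f": "101"}
variable = {"a": "0", "b": "101", "c": "100",
            "d": "111", "e": "1101", "f": "1100"}

def encoded_bits(freq, code):
    """Total bits needed: sum of frequency times codeword length."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)

print(encoded_bits(freq, fixed))     # 300000
print(encoded_bits(freq, variable))  # 224000
```

The saving comes almost entirely from giving the most frequent character a one-bit codeword.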
Prefix codes are codes in which no codeword is a prefix of another. (It would be more accurate to call them prefix-free codes, but the literature has settled on prefix codes.)
For any data, it is always possible to construct a prefix code that is optimal (though not all prefix codes are optimal, as we will see below).
Prefix codes also have the advantage that each character in an input file can be "consumed" unambiguously: since no codeword is a prefix of another, the codeword that begins the remaining input cannot be confused with any other.
We can think of the 0 and 1 in a prefix code as directions for traversing a binary tree: 0 for left and 1 for right. The leaves store the coded character. For example, here is the fixed-length prefix code from the table above represented as a binary tree (the numbers inside the nodes are the sum of frequencies for characters in each subtree):
Consuming bits from an input file, we traverse the tree until the character is identified, and then start over at the top of the tree for the next character.
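The traversal just described can be sketched in Python with the tree stored as nested dictionaries. The three-character code used here is a hypothetical example, not the code from the table, and the function names are assumptions.

```python
def build_tree(code):
    """Build a decoding tree (nested dicts keyed by '0'/'1') from a prefix code."""
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = ch   # a leaf stores the coded character
    return root

def decode(bits, root):
    """Follow bits down the tree, emitting a character at each leaf."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = root             # start over at the top of the tree
    if node is not root:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)

# Hypothetical prefix code for illustration.
code = {"a": "0", "b": "10", "c": "11"}
tree = build_tree(code)
print(decode("010110", tree))  # abca
```

No separator between codewords is needed, because reaching a leaf marks the end of a codeword.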
Exercise: Decode 001100000011 (12 bits)
But the above tree uses three bits per character: it is not optimal. It can be shown that an optimal code is always represented by a full binary tree (every non-leaf node has two children).
For example, an optimal prefix code (from the table reproduced again here) is represented by this tree:
Exercise: Decode 10111010111 (11 bits)
A one-bit difference isn't much, but the gains will be magnified in larger texts where the high frequency character 'a' can occur more often.
Huffman's greedy algorithm constructs optimal prefix codes called Huffman Codes.
It is given a set C of n characters, where each character has frequency c.freq in the "text" to be encoded.
The optimality of a code is relative to a "text", which can be what we normally think of as texts, or can be other data encoded as sequences of bits, such as images.
The algorithm creates a binary tree leaf node for each character, annotated with its frequency, and the tree nodes are then put on a min-priority queue (this is only implied in line 2 below).
Then the first two subtrees on the queue (those with minimum frequency) are dequeued with Extract-Min, merged into a single tree, annotated with the sum of their frequencies, and this single node is re-queued.
This process is repeated until only one tree node remains on the queue (the root). Each iteration replaces two nodes on the queue with one, so starting from n leaves we can simply run the loop n−1 times and know that exactly one node will be left at that point.
Here is the algorithm:
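A minimal Python sketch of the procedure just described, using the standard-library heapq module as the min-priority queue. The function names, the tuple representation of tree nodes, the tie-breaking counter, and the sample frequencies (the standard CLRS example, in thousands) are assumptions of this sketch rather than the notes' pseudocode. It assumes at least one character.

```python
import heapq
from itertools import count

def huffman(freq):
    """Build a Huffman tree from a non-empty dict {character: frequency}.

    A leaf is a character; an internal node is a (left, right) tuple.
    """
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(fr, next(tiebreak), ch) for ch, fr in freq.items()]
    heapq.heapify(heap)                      # O(n) build of the min-heap
    for _step in range(len(freq) - 1):       # n - 1 merges
        fx, _, x = heapq.heappop(heap)       # two minimum-frequency subtrees
        fy, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))
    return heap[0][2]

def codewords(tree, prefix=""):
    """Read codewords off the tree: 0 for left, 1 for right."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}   # a lone character still needs one bit
    left, right = tree
    codes = codewords(left, prefix + "0")
    codes.update(codewords(right, prefix + "1"))
    return codes

# Assumed frequencies, in thousands of occurrences.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = codewords(huffman(freq))
print(codes)
print(sum(freq[ch] * len(codes[ch]) for ch in freq))  # 224
```

With these assumed frequencies the total cost is 224 (thousand) bits. Other tie-breaking rules can yield different trees with the same cost.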
The "greedy" aspect is the choice to merge min-frequency nodes first, and assume that this local minimization will result in an optimal global solution.
Intuitively, this approach will result in an optimal solution because the lowest frequency items will be "pushed down" deeper in the tree, and hence have longer codes; while higher frequency items will end up nearer the root, and hence have the shortest codes.
Cormen et al. prove correctness with two Lemmas for the two properties:
The initial BUILD-MIN-HEAP implied by line 2 requires O(n) time.
The loop executes n−1 times, and each iteration performs a constant number of heap operations, each requiring O(lg n) time.
The latter term dominates, so HUFFMAN is O(n lg n).
The characters are in a min priority queue by frequency:
Take out the two lowest frequency items and make a subtree that is put back on the queue as if it is a combined character:
Combine the next lowest frequency characters:
Continuing, tree fragments themselves become subtrees:
Two subtrees are merged next:
The highest frequency character gets added to the tree last, so it will have a code of length 1:
One might wonder why the second most frequent character does not have a code of length 2. This would force the other characters to be deeper in the tree, giving them excessively long codes.
We will encounter several examples of greedy algorithms later in the course, including classic algorithms for finding minimum spanning trees (Topic 17) and shortest paths in graphs (Topics 18 and 19).