# ICS 311 #16: Disjoint Sets and Union-Find

## Outline

1. A Brief Note on Amortized Analysis
2. Dynamic Disjoint Sets
3. Finding Connected Components with Disjoint Sets
4. Linked List Representations of Disjoint Sets (optional this semester)
5. Forest Representations of Disjoint Sets

• Chapter 21 of CLRS sections 21.1 - 21.3, with focus on 21.3. (You need only know the result of the analysis in 21.4, not the actual analysis)
• Screencast 16A (also in Laulima)

## A Brief Note on Amortized Analysis

This topic relies on amortized analysis of algorithms rather than worst-case analysis. This semester we are not covering amortized analysis. However, for the purposes of this topic, you can think of the amortized cost of an operation as follows: given a sequence of n operations, each taking a variable amount of time to execute, the amortized cost of an operation is the average time per operation, that is, the total time of the sequence divided by n.

For example, consider an implementation of a stack using a simple array. If we don't know how many elements we will push onto the stack, then no matter how large an array we allocate, at some point it will be full. To push an additional element onto the stack, we will have to allocate a new, larger array and copy the contents of the old array into the new one. Note that if the array is not full, pushing an element onto the stack takes O(1) time: simply write the element into the first empty space. However, if the array can fit only k elements, then the (k+1)-st push operation will take O(k+1) = O(k) time: copying the k entries of the old array plus inserting the new element into the new array. Thus, sometimes a push operation takes O(1) time, and sometimes O(k) time. By picking a good strategy for resizing the array, we can make copying arrays very infrequent. For example, one resizing strategy is to make the new array double the size of the old one. One can then show that resizing happens so infrequently that performing n push operations onto the stack takes at most about 3n = O(n) time for any n, i.e., the amortized (average) cost of each push operation is O(1) time. For details, if you are interested, you can read CLRS chapter 17 or these notes.
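The doubling argument can be checked with a small experiment. The following is a minimal sketch, assuming a hypothetical `ArrayStack` class (not from the course materials) that counts one unit of cost per element written or copied; the initial capacity of 1 is also an assumption.

```python
class ArrayStack:
    """Stack backed by a fixed-capacity array that doubles when full."""

    def __init__(self):
        self.capacity = 1                 # assumed starting capacity
        self.data = [None] * self.capacity
        self.size = 0
        self.cost = 0                     # element writes: pushes plus copies

    def push(self, item):
        if self.size == self.capacity:
            # Array is full: allocate one twice as large and copy everything.
            new_data = [None] * (2 * self.capacity)
            for i in range(self.size):
                new_data[i] = self.data[i]
                self.cost += 1
            self.data = new_data
            self.capacity *= 2
        # Write the new element into the first empty slot.
        self.data[self.size] = item
        self.size += 1
        self.cost += 1


def total_push_cost(n):
    """Total cost of n pushes onto an initially empty stack."""
    stack = ArrayStack()
    for i in range(n):
        stack.push(i)
    return stack.cost


for n in (1, 2, 10, 1000, 4097):
    print(n, total_push_cost(n), total_push_cost(n) <= 3 * n)
```

Copies happen only when the size reaches a power of two, so all copies together cost less than 2n, and the n writes bring the total to under 3n.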

## Dynamic Disjoint Sets (Union Find)

Two sets A and B are disjoint if they have no element in common.

Sometimes we need to group n distinct elements into a collection Š of disjoint sets Š = {S1, ..., Sk} that may change over time.

• Š is a set of sets: {{x, ... }, ..., {y, ... }}
• Each set Si ∈ Š is identified by a representative, which is some member of the set (e.g., x and y).
• It does not matter which member is the representative, as long as the representative remains the same while the set is not modified.

Disjoint set data structures are also known as Union-Find data structures, after the two operations in addition to creation. (Applications often involve a mixture of searching for set membership and merging sets.)

### Operations

Make-Set(x): make a new set Si = {x} (x will be its representative) and add Si to Š.
Union(x, y): if x ∈ Sx and y ∈ Sy, then Š ← Š − Sx − Sy ∪ {Sx ∪ Sy} (that is, combine the two sets Sx and Sy).
• The representative of Sx ∪ Sy is any member of that new set (implementations often use the representative of one of Sx or Sy).
• Destroys Sx and Sy, since the sets must be disjoint (they cannot co-exist with Sx ∪ Sy).
Find-Set(x): return the representative of the set containing x.
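To make the meaning of the three operations concrete, here is a deliberately naive sketch. The class name `NaiveDisjointSets` and its dictionary layout are assumptions chosen for illustration only; this is a specification-level model, not one of the efficient representations discussed later.

```python
class NaiveDisjointSets:
    """Naive model of a collection of disjoint sets (illustration only)."""

    def __init__(self):
        self.rep = {}       # element -> representative of its set
        self.members = {}   # representative -> list of members of that set

    def make_set(self, x):
        # New singleton set {x}; x is its own representative.
        self.rep[x] = x
        self.members[x] = [x]

    def find_set(self, x):
        return self.rep[x]

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return          # already in the same set
        # Sy is destroyed: its members now belong to the combined set.
        for z in self.members[ry]:
            self.rep[z] = rx
        self.members[rx].extend(self.members.pop(ry))


sets = NaiveDisjointSets()
for x in "abc":
    sets.make_set(x)
sets.union("a", "b")
print(sets.find_set("b"), sets.find_set("c"))
```

Here the representative of the combined set is the representative of x's set, which is one of the choices the definition allows.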

### Analysis

We analyze in terms of:

• n = number of Make-Set operations, i.e., the number of sets initially involved
• m = total number of operations

Some facts we can rely on:

• m ≥ n
• Can have at most n−1 Union operations, since after n−1 Unions, only 1 set remains.
• It can be helpful for analysis to assume that the first n operations are Make-Set operations (put all the elements we will be working with in singleton sets to start with).

## Applications of Disjoint Sets

Union-Find on disjoint sets is used to find structure in other data structures, such as a graph. We initially assume that all the elements are distinct by putting them in singleton sets, and then we merge sets as we discover the structure by which the elements are related.

### Finding Connected Components

Recall from Topic 14 that for a graph G = (V, E), vertices u and v are in the same connected component if and only if there is a path between them.

The algorithms for computing connected components, and then for testing whether two items are in the same component, are given in CLRS section 21.1 (Connected-Components and Same-Component).

Would they work with a directed graph?
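The idea can be sketched as follows: put every vertex in its own set, then for each edge take the union of the sets of its endpoints if they differ. This is a minimal sketch under assumptions: the `parent` dictionary is a plain forest with no heuristics, and the example graph and all function names are illustrative rather than taken from the text.

```python
def find_set(parent, x):
    """Follow parent pointers until reaching a root (the representative)."""
    while parent[x] != x:
        x = parent[x]
    return x


def connected_components(vertices, edges):
    """Return a parent map in which each tree is one connected component."""
    parent = {v: v for v in vertices}      # Make-Set for every vertex
    for u, v in edges:
        ru, rv = find_set(parent, u), find_set(parent, v)
        if ru != rv:
            parent[rv] = ru                # Union of the two components
    return parent


def same_component(parent, u, v):
    return find_set(parent, u) == find_set(parent, v)


# Hypothetical example graph with four components.
vertices = list("abcdefghij")
edges = [("b", "d"), ("e", "g"), ("a", "c"), ("h", "i"),
         ("a", "b"), ("e", "f"), ("b", "c")]
parent = connected_components(vertices, edges)
print(same_component(parent, "a", "d"))
print(same_component(parent, "a", "e"))
```

Each edge is treated as unordered, which is why the result describes connectivity in an undirected graph.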

#### Example

Although it is easy to see the connected components in a small example, the utility of the algorithm becomes more obvious when we deal with large graphs!

#### Alternatives

In a static undirected graph, it is faster to run Depth-First Search (exercise 22.3-12); for static directed graphs, use the strongly connected components algorithm of Topic 14 (section 22.5), which consists of two DFS passes. But in some applications edges may be added to the graph over time. In this case, union-find with disjoint sets is faster than re-running the DFS after each change.

### Minimum Spanning Trees

Next week we cover algorithms to find minimum spanning trees of graphs. Kruskal's algorithm will use Union-Find operations.

## Linked List Representations of Disjoint Sets (optional this semester)

One might think that lists are the simplest approach, but there is a better approach that is not any more complex. This section is mainly for comparison purposes.

### Representation

Each set is represented using an unordered singly linked list. The list object has attributes:

• head: pointing to the first element in the list, the set's representative.
• tail: pointing to the last element in the list.

Each object in the list has attributes for:

• next
• The set member (e.g., the vertex in the graph being analyzed)
• A pointer to the list object that represents the set

### Operations

First try:

• Make-Set(x): create a singleton list containing x
• Find-Set(x): follow the pointer back to the list object, and then follow the head pointer to the representative
• Union(x, y): append y's list onto the end of x's list.
  • Use x's tail pointer to find the end.
  • Need to update the pointer back to the set object for every node on y's list.
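The first-try operations above can be sketched as follows. The class names `SetList` and `Node`, and the choice to have `union` return the number of pointer updates it performed, are assumptions made for illustration.

```python
class SetList:
    """List object for one set; head is the set's representative."""

    def __init__(self):
        self.head = None
        self.tail = None


class Node:
    def __init__(self, member):
        self.member = member   # the set member (e.g., a graph vertex)
        self.next = None
        self.set = None        # pointer back to the list object


def make_set(x):
    node = Node(x)
    s = SetList()
    s.head = s.tail = node
    node.set = s
    return node


def find_set(node):
    # Back to the list object, then to its head.
    return node.set.head


def union(x, y):
    """Append y's list onto x's list; return the number of pointer updates."""
    sx, sy = x.set, y.set
    if sx is sy:
        return 0
    sx.tail.next = sy.head     # use x's tail pointer to find the end
    sx.tail = sy.tail
    updates = 0
    node = sy.head
    while node is not None:    # every node on y's list must now point to sx
        node.set = sx
        updates += 1
        node = node.next
    return updates


a, b, c = make_set("a"), make_set("b"), make_set("c")
print(union(a, b), find_set(b).member)
print(union(c, a), find_set(b).member)
```

The second call appends the two-element list onto a singleton, so it pays for two pointer updates; this is the behavior that the weighted-union heuristic is designed to avoid.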

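For example, taking the union of S1 and S2, replacing S1, appends S2's list to S1's list and redirects every node of S2 to point to S1's list object.

This can be slow for large data sets. For example, suppose we start with n singletons and always happen to append the larger list onto the smaller one in a sequence of merges. The i-th Union then updates i pointers, so n − 1 Unions perform 1 + 2 + ... + (n − 1) = Θ(n²) updates. With n Make-Sets and n − 1 Unions, the amortized time per operation is Θ(n)!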

A weighted-union heuristic speeds things up: always append the smaller list to the larger list (so we update fewer set object pointers). Although a single union can still take Ω(n) time (e.g., when both sets have n/2 members), a sequence of m operations on n elements takes O(m + n lg n) time.

Sketch of proof: Each Make-Set and Find-Set still takes O(1). Consider how many times a given object's pointer to its set object must be updated during a sequence of Union operations. The object must be in the smaller set each time its pointer is updated, and after each such Union the resulting set is at least double the size of that smaller set. Since a set can never have more than n members, each element's pointer is updated ≤ lg n times. With n elements, all Unions together cost O(n lg n), and the m operations contribute O(1) each beyond that, giving O(m + n lg n). However, we can do better!
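The effect of the heuristic can be seen by counting pointer updates. The sketch below is a simplified simulation under assumptions: Python lists stand in for the linked lists, the function name `count_pointer_updates` is hypothetical, and the merge sequence is the bad case described earlier, in which the long list is always the one being appended.

```python
def count_pointer_updates(n, weighted):
    """Merge n singletons into one set; return total set-pointer updates."""
    owner = list(range(n))                  # owner[i] = id of list holding i
    members = {i: [i] for i in range(n)}    # list id -> elements on that list
    updates = 0
    for i in range(1, n):
        # Union(x_i, x_{i-1}): append the list holding i-1 onto i's list.
        dst, src = owner[i], owner[i - 1]
        if weighted and len(members[src]) > len(members[dst]):
            dst, src = src, dst             # append the smaller onto the larger
        for z in members[src]:
            owner[z] = dst                  # one pointer update per moved node
            updates += 1
        members[dst].extend(members.pop(src))
    return updates


print(count_pointer_updates(8, weighted=False))   # 1 + 2 + ... + 7
print(count_pointer_updates(8, weighted=True))    # one update per union
```

On this particular sequence the weighted version does only n − 1 updates in total. Other sequences, such as repeatedly merging equal-sized sets, come closer to the n lg n bound.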

## Forest Representations of Disjoint Sets

The following is a classic representation of Union-Find, whose analysis is due to Tarjan (1975). The set of sets is represented by a forest of trees. The code is as simple as the analysis of runtime is complex.

### Representation

• Each tree represents a set.
• The root of the tree is the set representative.
• Each node points only to its parent (no child pointers needed).
• The root points to itself as parent.

### Operations

• Make-Set(x): create a single node tree with x at the root
• Find-Set(x): follow parent pointers back to the root
• Union(x, y): make one root a child of the other. (This in itself could degenerate to a linear list-like tree, but we will fix this below.)

#### Heuristics

In order to avoid degeneration to linear trees, and achieve amazing amortized performance, these two heuristics are applied:

Union by Rank: make the root of the "smaller" tree a child of the root of the "larger" tree. But rather than size we use rank, an upper bound on the height of each node (stored in the node).

• Rank of singleton sets is 0.
• When taking the Union of two trees of equal rank, choose one arbitrarily to be the parent and increment its rank by one. (Why is it incremented?)
• When taking the Union of two trees of unequal rank, the tree with lower rank becomes the child, and ranks are unchanged. (Why does this make sense?)

Path Compression: When running Find-Set(x), make all nodes on the path from x to the root direct children of the root, as in a call such as Find-Set(a).

### Algorithms

The algorithms are very simple! (But their analysis is complex!) We assume that nodes x and y are tree nodes with the client's element data already initialized. Link implements the union by rank heuristic.

Find-Set implements the path compression heuristic. It makes a recursive pass up the tree to find the path to the root, and as recursion unwinds it updates each node on the path to point directly to the root. (This means it is not tail recursive, but as the analysis shows, the paths are very unlikely to be long.)
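The operations described above can be sketched in Python as follows. The attribute names `p` and `rank` follow the CLRS convention; the `Node` class and the guard in `union` against linking a root to itself are assumptions added for this sketch.

```python
class Node:
    """Tree node holding the client's element data."""

    def __init__(self, data):
        self.data = data
        self.p = None
        self.rank = 0


def make_set(x):
    x.p = x          # the root points to itself as parent
    x.rank = 0


def find_set(x):
    # Recursive pass up to the root; as the recursion unwinds,
    # each node on the path is pointed directly at the root.
    if x.p is not x:
        x.p = find_set(x.p)
    return x.p


def link(x, y):
    """Union by rank on two distinct roots x and y."""
    if x.rank > y.rank:
        y.p = x
    else:
        x.p = y
        if x.rank == y.rank:
            y.rank += 1


def union(x, y):
    rx, ry = find_set(x), find_set(y)
    if rx is not ry:             # added guard: skip if already in one set
        link(rx, ry)


a, b, c, d = (Node(ch) for ch in "abcd")
for node in (a, b, c, d):
    make_set(node)
union(a, b)      # b becomes the root, rank 1
union(c, d)      # d becomes the root, rank 1
union(a, c)      # equal ranks: b is linked under d, d gets rank 2
print(find_set(a).data, a.p.data, d.rank)
```

After the last call to `find_set(a)`, node a points directly at the root d rather than at b, which is the path compression heuristic at work.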

### Time Complexity

The analysis can be found in section 21.4. It is very involved, and I only expect you to know what is discussed below. It is based on a very fast growing function, Ak(j). Ak(j) is a variation of Ackermann's function, which is what you will find in most classic texts on the subject. The function grows so fast that A4(1) is far larger than 16⁵¹², which is itself much larger than the number of atoms in the observable universe (10⁸⁰)!

The result uses α(n), a single parameter inverse of Ak(j) defined as the lowest k for which Ak(1) is at least n: α(n) = min{k : Ak(1) ≥ n}

α(n) grows very slowly: α(n) = 0 for 0 ≤ n ≤ 2, 1 for n = 3, 2 for 4 ≤ n ≤ 7, 3 for 8 ≤ n ≤ 2047, and 4 for 2048 ≤ n ≤ A4(1). We are highly unlikely to ever encounter α(n) > 4 (we would need input size much greater than the number of atoms in the universe). Although its growth is strictly larger than a constant, for all practical purposes we can treat α(n) as a constant.
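For the curious, small values can be computed directly. This is a sketch assuming the CLRS section 21.4 definition, in which A0(j) = j + 1 and, for k ≥ 1, Ak(j) is the result of applying A(k−1) to j a total of j + 1 times. The function names `A` and `alpha` are illustrative, and `alpha` simply returns 4 for any n above A3(1) = 2047, which is only valid for n ≤ A4(1).

```python
def A(k, j):
    """A_k(j): A_0(j) = j + 1; A_k(j) applies A_{k-1} to j, j + 1 times."""
    if k == 0:
        return j + 1
    result = j
    for _ in range(j + 1):
        result = A(k - 1, result)
    return result


def alpha(n):
    """Smallest k with A_k(1) >= n, assuming n <= A_4(1)."""
    for k in range(4):
        if A(k, 1) >= n:
            return k
    return 4    # A_4(1) is far too large to compute; see assumption above


print([A(k, 1) for k in range(4)])
print([alpha(n) for n in (2, 3, 7, 8, 2047, 2048, 10**80)])
```

Do not try `A(4, 1)`: it equals A3(2047), which cannot be computed in practice.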

The analysis of section 21.4 shows that the running time is O(m α(n)) for a sequence of m Make-Set, Find-Set and Union operations, or O(α(n)) per operation. Since α(n) > 4 is highly unlikely, for all practical purposes the cost of a sequence of m such operations is O(m), or O(1) amortized cost per operation!!

## Wrapup

Next, in Topic 17, we will see the application of the Union-Find data structure to finding minimum spanning trees.

Nodari Sitchinava (based on material by Dan Suthers)
Images are from the instructor's material for Cormen et al. Introduction to Algorithms, Third Edition.