- Disjoint Dynamic Sets
- Finding Connected Components with Disjoint Sets
- Linked List Representations of Disjoint Sets
  *Skip in 2017: this is not as good as the forest representation and is presented mainly to motivate the latter.*
- Forest Representations of Disjoint Sets

- Chapter 21 of CLRS sections 21.1 and 21.3, with focus on 21.3. (You need only know the result of the analysis in 21.4.)
- Screencast 16A (also in Laulima)

Two sets *A* and *B* are **disjoint** if they have no element in common.

Sometimes we need to group *n* distinct elements into a collection 𝒮 of disjoint sets
𝒮 = {*S*_{1}, ..., *S*_{k}} that may change over time.

- 𝒮 is a set of sets: {{*x*, ... }, ..., {*y*, ... }}
- Each set *S*_{i} ∈ 𝒮 is identified by a **representative**, which is some member of the set (e.g., *x* and *y*).
- It does not matter which member is the representative, as long as the representative remains the same while the set is not modified.

Disjoint set data structures are also known as **Union-Find** data structures, after the two
operations in addition to creation: Applications often involve a mixture of searching for set
membership (Find) and merging sets (Union).

Make-Set(*x*): make a new set *S*_{i} = {*x*} (*x* will be its representative) and add *S*_{i} to 𝒮.

Union(*x*, *y*): if *x* ∈ *S*_{x} and *y* ∈ *S*_{y}, then 𝒮 ← 𝒮 − *S*_{x} − *S*_{y} ∪ {*S*_{x} ∪ *S*_{y}} (that is, combine the two sets *S*_{x} and *S*_{y}).

- The representative of *S*_{x} ∪ *S*_{y} is any member of that new set (implementations often use the representative of one of *S*_{x} or *S*_{y}).
- Union destroys *S*_{x} and *S*_{y}, since the sets must be disjoint (they cannot co-exist with *S*_{x} ∪ *S*_{y}).

Find-Set(*x*): return the representative of the set containing *x*.

- We don't return the set itself because the representative is *how* we reference the set.
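To make the semantics of the three operations concrete, here is a minimal sketch in Python. It is deliberately naive (each element maps directly to its representative, and Union rewrites every affected mapping); efficient representations are the subject of the rest of these notes. The class name and structure are illustrative, not from CLRS.

```python
# A naive illustration of the Union-Find ADT semantics (NOT an efficient
# implementation): each element maps directly to its set's representative.
class NaiveDisjointSets:
    def __init__(self):
        self.rep = {}  # element -> representative of its set

    def make_set(self, x):
        self.rep[x] = x  # x is the representative of the new singleton {x}

    def find_set(self, x):
        return self.rep[x]

    def union(self, x, y):
        rx, ry = self.rep[x], self.rep[y]
        if rx == ry:
            return  # already in the same set
        # Merge y's set into x's set: every member of S_y now has
        # representative rx; S_x and S_y no longer exist separately.
        for e, r in self.rep.items():
            if r == ry:
                self.rep[e] = rx

S = NaiveDisjointSets()
for v in "abc":
    S.make_set(v)
S.union("a", "b")
print(S.find_set("b") == S.find_set("a"))  # True: same set
print(S.find_set("c") == S.find_set("a"))  # False: still disjoint
```

Note that Union here scans every element, so it can cost Θ(*n*) per call; the representations below do much better.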

We analyze in terms of:

- *n* = number of Make-Set operations, i.e., the number of sets initially involved
- *m* = total number of operations

Students commonly get confused by *n* and *m*: be clear that for this topic *n* is
the number of items in the data structure and *m* is the number of operations on the data
structure.

Some facts we can rely on:

- *m* ≥ *n* (we cannot have more items in the sets than operations to put them in).
- We can have at most *n* − 1 Union operations, since after *n* − 1 Unions only 1 set remains.
- It can be helpful for analysis to assume that the first *n* operations are Make-Set operations (put all the elements we will be working with in singleton sets to start with).

Union-Find on disjoint sets is used to find structure in other data structures, such as a graph. We initially assume that all the elements are distinct by putting them in singleton sets, and then we merge sets as we discover the structure by which the elements are related.

In the next topic we cover algorithms to find *minimum spanning trees* of graphs. Kruskal's
algorithm will use Union-Find operations.

Recall from Topic 14 that for a graph *G* = (*V*,
*E*), vertices *u* and *v* are in the same **connected component** if and only if
there is a path between them.

Here is the algorithm for computing connected components in an undirected graph:

*Would that work with a directed graph? Why or why not?*

Once the above has run, we can use this algorithm for testing whether two vertices are in the same component:
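Both procedures can be sketched in Python, following the structure of CLRS's CONNECTED-COMPONENTS and SAME-COMPONENT, with a minimal forest-based union-find supplied for illustration:

```python
# Connected components via union-find, following CLRS's
# CONNECTED-COMPONENTS and SAME-COMPONENT procedures.
parent = {}

def make_set(x):
    parent[x] = x  # each vertex starts in its own singleton set

def find_set(x):
    if parent[x] != x:
        parent[x] = find_set(parent[x])  # path compression
    return parent[x]

def union(x, y):
    parent[find_set(x)] = find_set(y)

def connected_components(vertices, edges):
    for v in vertices:
        make_set(v)
    for (u, v) in edges:
        if find_set(u) != find_set(v):
            union(u, v)  # u and v are connected: merge their sets

def same_component(u, v):
    return find_set(u) == find_set(v)

# Example graph with two components: {a, b, c} and {d, e}
connected_components("abcde", [("a", "b"), ("b", "c"), ("d", "e")])
print(same_component("a", "c"))  # True
print(same_component("a", "d"))  # False
```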

Although it is easy to see the connected components above, the utility of the algorithm becomes more obvious when we deal with large graphs (such as the one pictured)!

In a *static* undirected graph, it is faster to run Depth-First Search (exercise 22.3-12),
or for static directed graphs the strongly connected components algorithm of Topic 14 (section 22.5), which consists of two DFS passes. But in some
applications edges may be added to the graph. In this case, union-find with disjoint sets is faster
than re-running the DFS after each edge is added.

One might think that lists are the simplest approach, but there is a better approach that is not any more complex: this section is mainly for comparison purposes.

Each set is represented using an unordered singly linked list. The list object has attributes:

- **head**: points to the first element in the list, the set's representative.
- **tail**: points to the last element in the list.

Each object in the list has attributes for:

- **next**: a pointer to the next object in the list
- The **set member** (e.g., the vertex in the graph being analyzed)
- A pointer to the list object that represents the **set**

First try:

- Make-Set(*x*): create a singleton list containing *x*
- Find-Set(*x*): follow the pointer back to the list object, then follow the `head` pointer to the representative
- Union(*x*, *y*): append *y*'s list onto the end of *x*'s list.
  - Use *x*'s tail pointer to find the end.
  - Need to update the pointer back to the set object for every node on *y*'s list.
For example, let's take the union of *S*_{1} and *S*_{2}, replacing
*S*_{1}:

This can be slow for large data sets. For example, suppose we start with *n* singletons and
always happen to append the larger list onto the smaller one in a sequence of merges:

If there are *n* Make-Sets and *n* − 1 such Unions, the pointer updates cost
1 + 2 + ⋯ + (*n* − 1) = Θ(*n*²) in total, so the amortized time per operation is Θ(*n*)!

A **weighted-union heuristic** speeds things up: always append the smaller list to the larger
list (so we update fewer set object pointers). Although a single union can still take
Ω(*n*) time (e.g., when both sets have *n*/2 members), a sequence of *m*
operations on *n* elements takes O(*m* + *n* lg *n*) time.
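The linked-list representation with the weighted-union heuristic can be sketched as follows. The class and attribute names are illustrative; each set object tracks its size so Union can append the smaller list to the larger one:

```python
# Linked-list representation of disjoint sets with the weighted-union
# heuristic: Union appends the smaller list to the larger one, so fewer
# set back-pointers need updating.
class Node:
    def __init__(self, member):
        self.member = member
        self.next = None
        self.set = None  # back-pointer to the list (set) object

class LinkedListSet:
    def __init__(self, x):
        node = Node(x)
        node.set = self
        self.head = node  # the head node holds the representative
        self.tail = node
        self.size = 1

def make_set(x):
    return LinkedListSet(x).head  # hand the caller the node for x

def find_set(node):
    return node.set.head.member  # representative = member at the head

def union(node_x, node_y):
    sx, sy = node_x.set, node_y.set
    if sx is sy:
        return
    if sx.size < sy.size:    # weighted union: append smaller onto larger
        sx, sy = sy, sx
    n = sy.head
    while n is not None:     # update set pointers on the smaller list
        n.set = sx
        n = n.next
    sx.tail.next = sy.head   # splice the smaller list onto the larger
    sx.tail = sy.tail
    sx.size += sy.size

a, b, c = make_set("a"), make_set("b"), make_set("c")
union(a, b)
union(a, c)
print(find_set(c) == find_set(a))  # True: all three are in one set
```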

**Sketch of proof:** Each Make-Set and Find-Set still takes O(1). Consider how many
times each object's set representative pointer must be updated during a sequence of *m*
operations. Each time an object's pointer is updated, its list has just been appended to a list
at least as large, so the set containing the object at least doubles in size. Therefore each
object's pointer is updated ≤ lg *n* times, and there are *n* elements plus *m* operations.
However, we can do better!

The following is a classic representation of Union-Find, due to Tarjan (1975). The set of sets is represented by a forest of trees. The code is simple, but the analysis of runtime is complex!

- Each tree represents a set.
- The root of the tree is the set representative.
- Each node points only to its parent (no child pointers needed).
- The root points to itself as parent.

The ADT operations correspond to tree operations as follows:

- Make-Set(
*x*): create a single node tree with*x*at the root - Find-Set(
*x*): follow parent pointers back to the root - Union(
*x*,*y*): make one root a child of the other. (This in itself could degenerate to a linear list-like tree, but we will fix this below.)

Here's an example. Note that this example does NOT include the rank and path compression heuristics to be discussed below. (After you read about path compression, see whether you can identify which path should be compressed.)

In order to avoid degeneration to linear trees, and achieve amazing amortized performance, these two heuristics are applied:

**Union by Rank**: make the root of the "smaller" tree a child of the root of the "larger"
tree. But rather than size we use **rank**, an upper bound on the height of each node (stored in
the node).

- Rank of singleton sets is 0.
- When taking the Union of two trees of equal rank, choose one arbitrarily to be the parent
and increment its rank by one.
*(Why is it incremented?)* - When taking the Union of two trees of unequal rank, the tree with lower rank becomes the
child, and ranks are unchanged.
*(Why is it NOT incremented?)*

**Path Compression**: When running Find-Set(*x*), make all nodes on the path from
*x* to the root direct children of the root. For example, suppose we call Find-Set(a) on the
lefthand tree (the triangles represent arbitrary subtrees):

The algorithms are very simple! (But their analysis is complex!) We assume that nodes *x*
and *y* are tree nodes with the client's element data already initialized.

Link implements the union by rank heuristic.

Find-Set implements the path compression
heuristic. It makes a recursive pass up the tree to find the path to the root, and as recursion
unwinds it updates each node on the path to point directly to the root. (This means it is not tail
recursive, but as the analysis shows, the paths are *extremely* unlikely to be long.)
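The forest implementation with both heuristics can be sketched in Python, following the structure of CLRS's MAKE-SET, UNION, LINK, and FIND-SET (the dict-based storage is an implementation choice for illustration):

```python
# Forest representation of disjoint sets with union by rank and
# path compression, following CLRS's MAKE-SET, UNION, LINK, FIND-SET.
parent = {}
rank = {}

def make_set(x):
    parent[x] = x  # the root points to itself as parent
    rank[x] = 0    # singleton trees have rank 0

def find_set(x):
    # Path compression: as the recursion unwinds, every node on the
    # path is pointed directly at the root.
    if parent[x] != x:
        parent[x] = find_set(parent[x])
    return parent[x]

def link(x, y):
    # Union by rank: the root of lower rank becomes the child,
    # and ranks are unchanged.
    if rank[x] > rank[y]:
        parent[y] = x
    else:
        parent[x] = y
        if rank[x] == rank[y]:
            rank[y] += 1  # equal ranks: the new root's rank grows by one

def union(x, y):
    link(find_set(x), find_set(y))

for v in range(6):
    make_set(v)
union(0, 1)
union(2, 3)
union(0, 2)
print(find_set(3) == find_set(1))  # True: merged into one tree
print(rank[find_set(0)])           # 2: rank is an upper bound on height
```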

The analysis can be found in section 21.4. It is very involved, and I only expect you to know what is discussed below. It is based on a very fast growing function:

*A*_{k}(*j*) is defined recursively: *A*_{0}(*j*) = *j* + 1, and for *k* ≥ 1,
*A*_{k}(*j*) = *A*_{k−1}^{(*j*+1)}(*j*), i.e., *A*_{k−1} iterated *j* + 1 times starting from *j*.

The result uses **α(*n*)**, a single-parameter inverse of *A*_{k}(*n*):

α(*n*) = min{*k* : *A*_{k}(1) ≥ *n*}

α(*n*) grows *very* slowly, as shown in the table. We are highly unlikely to
ever encounter α(*n*) > 4 (we would need input size much greater than the number of
atoms in the universe). Although its growth is strictly larger than a constant, for all practical
purposes we can treat α(*n*) as a constant.
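To see why, here is a small sketch that computes *A*_{k}(1) for *k* ≤ 3, assuming the recursive definition above (*A*_{0}(*j*) = *j* + 1, and *A*_{k}(*j*) is *A*_{k−1} iterated *j* + 1 times on *j*):

```python
# Computes A_k(j), assuming the CLRS definition:
# A_0(j) = j + 1, and A_k(j) = A_{k-1} iterated j+1 times starting at j.
def A(k, j):
    if k == 0:
        return j + 1
    for _ in range(j + 1):  # apply A_{k-1} a total of j+1 times
        j = A(k - 1, j)
    return j

for k in range(4):
    print(k, A(k, 1))  # prints: 0 2, 1 3, 2 7, 3 2047
# A_4(1) is already astronomically large (do not try to compute it!),
# so alpha(n) <= 4 for any input size we could ever encounter.
```

Since *A*_{3}(1) = 2047, α(*n*) ≤ 3 for all *n* ≤ 2047, and α(*n*) only reaches 5 for inputs vastly exceeding *A*_{4}(1).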

The analysis of section 21.4 shows that the running time is **O(*m* α(*n*))** for a sequence of
*m* Make-Set, Union, and Find-Set operations, *n* of which are Make-Set operations.

*Note:* If asked on a quiz or exam what the amortized time complexity of a union-find
operation in the forest with heuristics implementation is, write **O(α(*n*))**, not O(1),
even though α(*n*) is effectively constant in practice.

We now return to Graphs, where we will see Union-Find used to compute minimum spanning trees.

Dan Suthers Last modified: Sun Oct 29 02:05:13 HST 2017

Images are from the instructor's material for Cormen et al. Introduction to Algorithms, Third Edition.