Outline
- Huffman coding
- Huffman trees
- Implementation of Huffman coding
- hash tables
- hash functions
- open addressing
- chained hashing
Huffman Trees
- trees seen so far:
- binary trees
- binary search trees
- in these trees, each node has data and zero, one, or two children
- Huffman trees are trees where each node has zero or two children,
and only leaf nodes have data
- the data in each leaf node is unique
Huffman coding
- suppose you wanted to compress a text file, that is, represent it
using the least possible number of bits
- suppose also you want to use a unique string of bits for each
character -- that is, we are not going to encode entire words
- it would be good to use fewer bits for characters that occur frequently,
and as many bits as needed for other characters
- for example, in the string "hello world", I could use the following
encoding:
letter | bit string |
h | 11000 |
e | 11001 |
l | 0 |
o | 10 |
' ' (space) | 11010 |
w | 11011 |
r | 1110 |
d | 1111 |
- in-class exercise: what does the bit string (29 bits)
"1100 0100 1111 1101 0111 1101 0111 0"
represent?
- note that the string decoded in the exercise above takes 45 bits if
  each character is represented with a constant 5 bits/character, which is
  the smallest fixed number of bits able to represent all 26 letters and space
- so Huffman coding can save space!
- it does this by using shorter bit strings to encode more commonly-used
characters, and longer bit strings to encode less frequently used
characters
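- as a quick check, a few lines of Python can count the bits used by the
  table above (the table is from the notes; only the counting is new):

    # the code table for "hello world" given above
    code = {'h': "11000", 'e': "11001", 'l': "0", 'o': "10",
            ' ': "11010", 'w': "11011", 'r': "1110", 'd': "1111"}

    text = "hello world"
    encoded = "".join(code[c] for c in text)
    print(len(encoded))    # 35 bits with the variable-length code
    print(5 * len(text))   # 55 bits at a constant 5 bits/character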
algorithm for building a Huffman coding tree
- step 1: make a list of all symbols with their frequencies
- step 2: sort the list so the symbols with the lowest frequency are in front
- step 3: if the list only has one element, that element is the root of the
  tree and we are done
- step 4: remove the first two elements from the list and make them the two
  subtrees of a new binary tree
- step 5: add the frequencies of the two subtrees to give the frequency of
  this new tree
- step 6: insert this new tree in the right place in the sorted list
- step 7: return to step 3
  (a Python sketch of these steps follows the exercise below)
- in-class exercise: do this for the characters in "hello world"
letter | frequency |
h | 1 |
e | 1 |
l | 3 |
o | 2 |
' ' (space) | 1 |
w | 1 |
r | 1 |
d | 1 |
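- one possible Python sketch of the list-based algorithm above (not the
  textbook's code): a tree is either a symbol (a leaf) or a pair of
  subtrees, and a counter breaks frequency ties so that trees themselves
  are never compared

    import bisect
    from itertools import count

    def build_huffman_tree(frequencies):
        tiebreak = count()
        # steps 1-2: a sorted list of (frequency, tiebreak, tree) entries
        trees = sorted((f, next(tiebreak), s) for s, f in frequencies.items())
        while len(trees) > 1:                 # step 3: stop at one element
            fa, _, a = trees.pop(0)           # step 4: remove the two
            fb, _, b = trees.pop(0)           #   lowest-frequency trees
            merged = (fa + fb, next(tiebreak), (a, b))  # step 5
            bisect.insort(trees, merged)      # step 6: keep the list sorted
        return trees[0][2]                    # step 7 is the while loop

    # the exercise data; ties are broken arbitrarily, so the resulting
    # codes may differ from the table above, but are equally good
    freq = {'h': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'w': 1, 'r': 1, 'd': 1}
    tree = build_huffman_tree(freq)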
using a Huffman coding tree
- to encode:
- find the leaf with the next character to encode
- following the path from the root to this leaf, output a 0 bit each
time you take the left subtree, and a 1 bit each time you take the
right subtree
- in-class exercise: use this algorithm and the tree from the
last exercise to encode the string "wow"
- can do the above search once for each character, put the results in a
table, then use the table to do the actual encoding quickly
- to decode:
- use each bit to go left (0 bit) or right (1 bit) in the tree
- when you reach a leaf, put its character into the result string, then
start again at the root of the tree with the next bit
(both directions are sketched in code after the exercise below)
- in-class exercise: use this algorithm and the above tree to
decode as many bits as possible of the bit string
"1100 0100 1101 0111 1101 0111 0"
compression using Huffman coding
- an encoding based on letter frequencies in one string
(or a large sample) can be used for encoding many different strings
- if so, a single copy of the table (tree) can be kept, and Huffman
coding is guaranteed to do no worse than fixed-length encoding
- otherwise, a separate table (tree) is needed for each compression,
and the table has to be included in the count of bytes to be stored
or transmitted after compression
- in such cases, Huffman coding might actually give a somewhat
larger size than the original
- in practice, even including the table, Huffman coding is usually
worthwhile for sufficiently long strings in natural languages, which
have lots of redundancy and different letter frequencies
an implementation for Huffman trees
- many possible implementations, this is one (textbook, chapter 6.6/8.6)
- use several data structures:
- a priority queue implemented by a min heap to hold the sorted data
- each heap element refers to a tree node
- an interior tree node has no value, but has two children
- a leaf node has a value, but has no children
- each node has a frequency of occurrence, which is used as a
priority in the queue (low priority returned first)
- begin by computing the frequency of each value, perhaps by using
a hash table -- explained later
- once the frequency of values is known, insert each value into the
priority queue, using its frequency as a priority
- remove the front two elements from the queue, create an interior
node to refer to these two elements, and insert the interior node back
into the queue with, as priority, the sum of the priorities of the two nodes
- once there is only one element left in the queue, it is the Huffman
tree (a sketch using Python's heapq follows the exercise below)
- a further step would be to build an encoding table from the tree,
as in the earlier encode/decode sketch
- in-class exercise: do this given the string "there is no place like home"
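- a sketch of this implementation using Python's heapq module as the min
  heap and collections.Counter (itself a hash table) for the frequencies;
  the node classes here are assumptions, not the textbook's exact code:

    import heapq
    from collections import Counter
    from itertools import count

    class Leaf:                       # has a value, but no children
        def __init__(self, value):
            self.value = value

    class Interior:                   # has two children, but no value
        def __init__(self, left, right):
            self.left, self.right = left, right

    def huffman_tree(text):
        freq = Counter(text)          # frequency of each value (hash table)
        tiebreak = count()            # keeps heap entries comparable
        heap = [(f, next(tiebreak), Leaf(v)) for v, f in freq.items()]
        heapq.heapify(heap)           # priority queue, low priority first
        while len(heap) > 1:
            fa, _, a = heapq.heappop(heap)   # the two lowest-priority nodes
            fb, _, b = heapq.heappop(heap)
            heapq.heappush(heap, (fa + fb, next(tiebreak), Interior(a, b)))
        return heap[0][2]             # the one element left is the tree

    tree = huffman_tree("there is no place like home")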
hashing
- hash browns: mixed-up bits of potatoes, cooked together
- hashing: slicing up and mixing together
- a hash function takes a larger, perhaps variable amount of data
and turns it into a fixed-size integer
- in-class exercise: why is this useful?
hash tables
- a collection class with O(1) (constant) access time
to elements in the collection
- simply put the elements in an array
- but this only works if the key is a small non-negative integer,
less than the size of the array
- hash functions can take arbitrary keys and turn them into small integers
- so we can (a) use the hash of the key to (b) index the array and
(c) find out in constant time if the element is present, and if so,
(d) get its value
- a perfect hash function maps each key to a different array
location
- in real life, perfect hash functions are hard to find, in part
because the data may not be known in advance
hash table example
- use the sum of the characters in a string as its hash
- use 1 for "a", 2 for "b", etc
- so the string "edo" hashes to 5 + 4 + 15 = 24
- the string "hello" hashes to 8 + 5 + 12 + 12 + 15 = 52
- the string "world" hashes to 23 + 15 + 18 + 12 + 4 = 72
- if the hash table has size 11, this gives a perfect hash
function for these three strings:
- 24 modulo 11 = 2, so "edo" is stored at index 2
- 52 modulo 11 = 8, so "hello" is stored at index 8
- 72 modulo 11 = 6, so "world" is stored at index 6
- in each case, computing the index takes time independent of
both the table size and the number of elements in the table:
O(1) or, to be accurate, O(key size)
- supposing I wanted to use the same hash function on a table
of size 3,
- 24 modulo 3 = 0, so "edo" is stored at index 0
- 52 modulo 3 = 1, so "hello" is stored at index 1
- 72 modulo 3 = 0, so "world" is stored at index 0
- the first and last string need to be stored in the same location,
which is a collision
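- the letter-sum hash above, as a Python sketch (this assumes lowercase
  letters only; other characters would need their own values):

    def letter_sum_hash(s):
        # 'a' counts 1, 'b' counts 2, ..., 'z' counts 26
        return sum(ord(c) - ord('a') + 1 for c in s)

    for s in ["edo", "hello", "world"]:
        h = letter_sum_hash(s)
        print(s, h, h % 11, h % 3)   # the indices worked out above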
hash table collisions
- an array location can only hold one item, so what to do in case
of a collision?
- a number of solutions:
- can store colliding elements (and only those elements) in a
separate data structure (e.g. a linked list), which is searched when needed
- can increase (e.g. double plus one) the size of the array until
the collision goes away
- can look for another place in the same array that is available.
This is called open address hashing
- can have each array element refer to a linked list rather than a
single element: chained hashing
- or combinations of the above
- in-class exercise: assuming few collisions, what is the average runtime
of each of the above strategies?
- in-class exercise: assuming many collisions, what is the worst-case
runtime of each of the above strategies?
Hash functions
- finding a perfect hash function is not practical unless all keys
are known in advance
- for random, evenly distributed keys, any hash function should
produce random, evenly distributed hash codes, with few collisions
- for non-random keys that resemble one another, a good hash
function should still produce random, evenly distributed hash codes
- so for example, using the first three digits of a telephone number
(the area code) as a hash key would give many collisions if all the
keys are from the same geographic area
- in-class exercise: using h(key) = key mod table size, insert
elements with key 99, 43, 14, 77 into a table of size 10
- for a much more in-depth explanation of hash functions,
see here, which includes this link to more effective
(and more complex) hash functions
Open addressing
- when inserting a value in a hash table, if the slot indicated
by the hash function is full, insert it into another slot
- this works until the hash table is full (100% load factor), i.e.
it works as long as there are open slots
- the probe sequence determines where to look next when there
is a collision
- when looking up a value in a hash table, the same probe sequence
must be followed as when inserting
- when removing a value from a hash table,
the hash table must record that the element was removed, so future
searches can keep looking when they reach the slot of a deleted element
- each slot must record whether it is empty, full, or deleted
- a slot can only be empty until the first time a value is inserted,
after which it can only be full or deleted
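- a sketch of an open-addressing table with the empty/full/deleted
  markers described above, using linear probing (defined in the next
  section) as the probe sequence:

    EMPTY, DELETED = object(), object()    # unique marker values

    class OpenAddressTable:
        def __init__(self, size):
            self.slots = [EMPTY] * size

        def _probe(self, key):
            # yield slot indices in this key's probe sequence (linear)
            start = hash(key) % len(self.slots)
            for i in range(len(self.slots)):
                yield (start + i) % len(self.slots)

        def insert(self, key):
            for i in self._probe(key):
                # a deleted slot may be reused on insertion
                if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                    self.slots[i] = key
                    return
            raise RuntimeError("table is full")   # 100% load factor

        def contains(self, key):
            for i in self._probe(key):
                if self.slots[i] is EMPTY:
                    return False   # the key was never inserted past here
                if self.slots[i] == key:   # a DELETED marker compares
                    return True            # unequal, so we keep looking
            return False

        def remove(self, key):
            for i in self._probe(key):
                if self.slots[i] is EMPTY:
                    return
                if self.slots[i] == key:
                    self.slots[i] = DELETED   # so later searches go on
                    return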
probing in open addressing
- linear probing:
- increment the index (modulo the array size) until a free slot is
found
- if many keys hash to nearby values, the occupied slots form
clusters, which may lead to long search times
- quadratic probing:
- add the square of the probe number (modulo the array size) to get the
next index
- avoids the problem with similar key hashes
- for example,
hash 200 would probe locations 200, 201, 204, 209, etc
hash 201 would probe locations 201, 202, 205, 210, etc, with overlap
at only one index (201)
- works well unless many keys have identical hash values, since
those keys all follow the same probe sequence
- double hashing
- define two hash functions h1 and h2
- h1(key) determines the initial slot to look into
- h2(key) determines the step size for the next probe
- h2(key) != 0 (otherwise the probe sequence would never advance)
- even if h1(key1) == h1(key2),
hopefully h2(key1) != h2(key2)
- for example, h1(key) could be the sum of the letters
in the string, and h2(key) could be the product of the letters
in the string plus 1
- even if two strings sum to the same number, the product would almost
certainly be different, as long as the two keys are different
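- the three probe sequences as Python functions (a sketch; probe is the
  probe number 0, 1, 2, ...):

    def linear(h, probe, size):
        return (h + probe) % size

    def quadratic(h, probe, size):
        return (h + probe * probe) % size

    def double(h1, h2, probe, size):
        return (h1 + probe * h2) % size   # requires h2 % size != 0

    # the quadratic example above: hashes 200 and 201 in a large table
    print([quadratic(200, i, 1000) for i in range(4)])  # [200, 201, 204, 209]
    print([quadratic(201, i, 1000) for i in range(4)])  # [201, 202, 205, 210]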
open addressing table size
- load factor cannot exceed 100%, so the table must have at least as
many slots as the number of stored elements
- linear probing always visits every slot; if the table size is a prime
number, double hashing will also visit the entire table before giving up,
since any nonzero step size is then relatively prime to the table size
- otherwise, for example, double hashing in a table of size
100 with step size 10 can only visit 1/10th of the slots
- if the table size is ever changed, each element must be reinserted
using the hash function and the new table size, since just copying
the old array would map an element to the wrong index:
24 modulo 11 = 2, but 24 modulo 23 = 1 -- no simple relationship
without recomputing the hash value
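- a sketch of the reinsertion, using linear probing and assuming the
  table stores plain keys:

    def rehash(old_slots, new_size):
        new_slots = [None] * new_size
        for key in old_slots:
            if key is not None:
                i = hash(key) % new_size        # recompute with new size
                while new_slots[i] is not None:
                    i = (i + 1) % new_size      # linear probing on collision
                new_slots[i] = key
        return new_slots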
in-class exercises
- what is the probing sequence if the hash table size is 11,
and h(key) = 4? answer for each of linear probing,
quadratic probing, and double hashing where h2(key) = 5
- using h(key) = key mod table size, insert
elements with key 56, 48, 40, 13 into a table of size 7
- remove the element with key 48
- locate the element with key 13
Alternatives to open addressing
- chaining or chained hashing: each array element refers to a
linked list of elements
- buckets: each array element can store up to a fixed number of elements
- either can accommodate load factors greater than 100%
- chaining is more flexible, but requires dynamic memory allocation
- in-class exercise: repeat the previous exercise using chained hashing
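- a sketch of chained hashing in Python, using a list per slot in place
  of a linked list:

    class ChainedTable:
        def __init__(self, size):
            self.chains = [[] for _ in range(size)]

        def _chain(self, key):
            # the linked list for this key's array element
            return self.chains[hash(key) % len(self.chains)]

        def insert(self, key):
            chain = self._chain(key)
            if key not in chain:     # search the chain, then append
                chain.append(key)

        def contains(self, key):
            return key in self._chain(key)

        def remove(self, key):
            chain = self._chain(key)
            if key in chain:
                chain.remove(key)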