Outline
- Huffman coding
- Huffman trees
- Implementation of Huffman coding
- hash tables
- hash functions
- open addressing
- chained hashing
Huffman Trees
- trees seen so far:
- binary trees
- binary search trees
- in these trees, each node has data and zero, one, or two children
- Huffman trees are trees where each node has zero or two children,
and only leaf nodes have data
- the data in each leaf node is unique
Huffman coding
- suppose you wanted to compress a text file, that is, represent it
using the least possible number of bits
- suppose also you want to use a unique string of bits for each
character -- that is, we are not going to encode entire words
- it would be good to use fewer bits for characters that occur frequently,
and as many bits as needed for other characters
- for example, in the string "hello world", I could use the following
encoding:
letter | bit string |
h | 11000 |
e | 11001 |
l | 0 |
o | 10 |
' ' (space) | 11010 |
w | 11011 |
r | 1110 |
d | 1111 |
- in-class exercise: what does the bit string (29 bits)
"1100 0100 1111 1101 0111 1101 0111 0"
represent?
- note that the string decoded in the exercise above takes 45 bits if
  each character is represented with a constant 5 bits/character, which is
  the smallest fixed number of bits able to represent all 26 letters and space
- so Huffman coding can save space!
- it does this by using shorter bit strings to encode more commonly-used
characters, and longer bit strings to encode less frequently used
characters
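- as a quick check, a few lines of Python can count the bits used by the
  table above (the table is from the notes; only the counting is new):

    # the code table for "hello world" given above
    code = {'h': "11000", 'e': "11001", 'l': "0", 'o': "10",
            ' ': "11010", 'w': "11011", 'r': "1110", 'd': "1111"}

    text = "hello world"
    encoded = "".join(code[c] for c in text)
    print(len(encoded))    # 35 bits with the variable-length code
    print(5 * len(text))   # 55 bits at a constant 5 bits/character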
algorithm for building a Huffman coding tree
- step 1: make a list of all symbols with their frequencies
- step 2: sort the list so the symbols with the lowest frequency are in front
- step 3: if the list only has one element, that element is the root of the
  tree and we are done
- step 4: remove the first two elements from the list and make them the two
  subtrees of a new binary tree
- step 5: add the frequencies of the two subtrees to give the frequency of
  this new tree
- step 6: insert this new tree in the right place in the sorted list
- step 7: return to step 3
  (a Python sketch of these steps follows the exercise below)
- in-class exercise: do this for the characters in "hello world"
letter | frequency |
h | 1 |
e | 1 |
l | 3 |
o | 2 |
' ' (space) | 1 |
w | 1 |
r | 1 |
d | 1 |
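- one possible Python sketch of the list-based algorithm above (not the
  textbook's code): a tree is either a symbol (a leaf) or a pair of
  subtrees, and a counter breaks frequency ties so that trees themselves
  are never compared

    import bisect
    from itertools import count

    def build_huffman_tree(frequencies):
        tiebreak = count()
        # steps 1-2: a sorted list of (frequency, tiebreak, tree) entries
        trees = sorted((f, next(tiebreak), s) for s, f in frequencies.items())
        while len(trees) > 1:                 # step 3: stop at one element
            fa, _, a = trees.pop(0)           # step 4: remove the two
            fb, _, b = trees.pop(0)           #   lowest-frequency trees
            merged = (fa + fb, next(tiebreak), (a, b))  # step 5
            bisect.insort(trees, merged)      # step 6: keep the list sorted
        return trees[0][2]                    # step 7 is the while loop

    # the exercise data; ties are broken arbitrarily, so the resulting
    # codes may differ from the table above, but are equally good
    freq = {'h': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'w': 1, 'r': 1, 'd': 1}
    tree = build_huffman_tree(freq)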
using a Huffman coding tree
- to encode:
- find the leaf with the next character to encode
- following the path from the root to this leaf, output a 0 bit each
time you take the left subtree, and a 1 bit each time you take the
right subtree
- in-class exercise: use this algorithm and the tree from the
last exercise to encode the string "wow"
- can do the above search once for each character, put the results in a
table, then use the table to do the actual encoding quickly
- to decode:
- use each bit to go left (0 bit) or right (1 bit) in the tree
- when you reach a leaf, put its character into the result string, then
start again at the root of the tree with the next bit
(both directions are sketched in code after the exercise below)
- in-class exercise: use this algorithm and the above tree to
decode as many bits as possible of the bit string
"1100 0100 1101 0111 1101 0111 0"
compression using Huffman coding
- an encoding based on letter frequencies in one string
(or a large sample) can be used for encoding many different strings
- if so, a single copy of the table (tree) can be kept, and Huffman
coding is guaranteed to do no worse than fixed-length encoding
- otherwise, a separate table (tree) is needed for each compression,
and the table has to be included in the count of bytes to be stored
or transmitted after compression
- in such cases, Huffman coding might actually give a somewhat
larger size than the original
- in practice, even including the table, Huffman coding is usually
worthwhile for sufficiently long strings in natural languages, which
have lots of redundancy and different letter frequencies
an implementation for Huffman trees
- many possible implementations, this is one (textbook, chapter 6.6/8.6)
- use several data structures:
- a priority queue implemented by a min heap to hold the sorted data
- each heap element refers to a tree node
- an interior tree node has no value, but has two children
- a leaf node has a value, but has no children
- each node has a frequency of occurrence, which is used as a
priority in the queue (low priority returned first)
- begin by computing the frequency of each value, perhaps by using
a hash table -- explained later
- once the frequency of values is known, insert each value into the
priority queue, using its frequency as a priority
- remove the front two elements from the queue, create an interior
node to refer to these two elements, and insert the interior node back
into the queue with, as priority, the sum of the priorities of the two nodes
- once there is only one element left in the queue, it is the Huffman
tree (a sketch using Python's heapq follows the exercise below)
- a further step would be to build an encoding table from the tree,
as in the earlier encode/decode sketch
- in-class exercise: do this given the string "there is no place like home"
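- a sketch of this implementation using Python's heapq module as the min
  heap and collections.Counter (itself a hash table) for the frequencies;
  the node classes here are assumptions, not the textbook's exact code:

    import heapq
    from collections import Counter
    from itertools import count

    class Leaf:                       # has a value, but no children
        def __init__(self, value):
            self.value = value

    class Interior:                   # has two children, but no value
        def __init__(self, left, right):
            self.left, self.right = left, right

    def huffman_tree(text):
        freq = Counter(text)          # frequency of each value (hash table)
        tiebreak = count()            # keeps heap entries comparable
        heap = [(f, next(tiebreak), Leaf(v)) for v, f in freq.items()]
        heapq.heapify(heap)           # priority queue, low priority first
        while len(heap) > 1:
            fa, _, a = heapq.heappop(heap)   # the two lowest-priority nodes
            fb, _, b = heapq.heappop(heap)
            heapq.heappush(heap, (fa + fb, next(tiebreak), Interior(a, b)))
        return heap[0][2]             # the one element left is the tree

    tree = huffman_tree("there is no place like home")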
hashing
- hash browns: mixed-up bits of potatoes, cooked together
- hashing: slicing up and mixing together
- a hash function takes a larger, perhaps variable amount of data
and turns it into a fixed-size integer
- in-class exercise: why is this useful?
hash tables
- a collection class with O(1) (constant) access time
to elements in the collection
- simply put the elements in an array
- but this only works if the key is a small non-negative integer,
less than the size of the array
- hash functions can take arbitrary keys and turn them into small integers
- so we can (a) use the hash of the key to (b) index the array and
(c) find out in constant time if the element is present, and if so,
(d) get its value
- a perfect hash function maps each key to a different array
location
- in real life, perfect hash functions are hard to find, in part
because the data may not be known in advance
hash table example
- use the sum of the characters in a string as its hash
- use 1 for "a", 2 for "b", etc
- so the string "edo" hashes to 5 + 4 + 15 = 24
- the string "hello" hashes to 8 + 5 + 12 + 12 + 15 = 52
- the string "world" hashes to 23 + 15 + 18 + 12 + 4 = 72
- if the hash table has size 11, this gives a perfect hash
function for these three strings:
- 24 modulo 11 = 2, so "edo" is stored at index 2
- 52 modulo 11 = 8, so "hello" is stored at index 8
- 72 modulo 11 = 6, so "world" is stored at index 6
- in each case, computing the index takes time independent of
both the table size and the number of elements in the table:
O(1) or, to be accurate, O(key size)
- supposing I wanted to use the same hash function on a table
of size 3,
- 24 modulo 3 = 0, so "edo" is stored at index 0
- 52 modulo 3 = 1, so "hello" is stored at index 1
- 72 modulo 3 = 0, so "world" is stored at index 0
- the first and last string need to be stored in the same location,
which is a collision
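- the letter-sum hash above, as a Python sketch (this assumes lowercase
  letters only; other characters would need their own values):

    def letter_sum_hash(s):
        # 'a' counts 1, 'b' counts 2, ..., 'z' counts 26
        return sum(ord(c) - ord('a') + 1 for c in s)

    for s in ["edo", "hello", "world"]:
        h = letter_sum_hash(s)
        print(s, h, h % 11, h % 3)   # the indices worked out above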
hash table collisions
- an array location can only hold one item, so what to do in case
of a collision?
- a number of solutions:
- can store colliding elements (and only those elements) in a
separate data structure (e.g. a linked list), which is searched when needed
- can increase (e.g. double plus one) the size of the array until
the collision goes away
- can look for another place in the same array that is available.
This is called open address hashing
- can have each array element refer to a linked list rather than a
single element: chained hashing
- or combinations of the above
- in-class exercise: assuming few collisions, what is the average runtime
of each of the above strategies?
- in-class exercise: assuming many collisions, what is the worst-case
runtime of each of the above strategies?
Hash functions
- finding a perfect hash function is not practical unless all keys
are known in advance
- for random, evenly distributed keys, any hash function should
produce random, evenly distributed hash codes, with few collisions
- for non-random keys that resemble one another, a good hash
function should still produce random, evenly distributed hash codes
- so for example, using the first three digits of a telephone number
(the area code) as a hash key would give many collisions if all the
keys are from the same geographic area
- in-class exercise: using h(key) = key mod table size, insert
elements with key 99, 43, 14, 77 into a table of size 10
- for a much more in-depth explanation of hash functions,
see here, which includes this link to more effective
(and more complex) hash functions
Open addressing
- when inserting a value in a hash table, if the slot indicated
by the hash function is full, insert it into another slot
- this works until the hash table is full (100% load factor), i.e.
it works as long as there are open slots
- the probe sequence determines where to look next when there
is a collision
- when looking up a value in a hash table, the same probe sequence
must be followed as when inserting
- when removing a value from a hash table,
the hash table must record that the element was removed, so future
searches can keep looking when they reach the slot of a deleted element
- each slot must record whether it is empty, full, or deleted
- a slot can only be empty until the first time a value is inserted,
after which it can only be full or deleted
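- a sketch of an open-addressing table with the empty/full/deleted
  markers described above, using linear probing (defined in the next
  section) as the probe sequence:

    EMPTY, DELETED = object(), object()    # unique marker values

    class OpenAddressTable:
        def __init__(self, size):
            self.slots = [EMPTY] * size

        def _probe(self, key):
            # yield slot indices in this key's probe sequence (linear)
            start = hash(key) % len(self.slots)
            for i in range(len(self.slots)):
                yield (start + i) % len(self.slots)

        def insert(self, key):
            for i in self._probe(key):
                # a deleted slot may be reused on insertion
                if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                    self.slots[i] = key
                    return
            raise RuntimeError("table is full")   # 100% load factor

        def contains(self, key):
            for i in self._probe(key):
                if self.slots[i] is EMPTY:
                    return False   # the key was never inserted past here
                if self.slots[i] == key:   # a DELETED marker compares
                    return True            # unequal, so we keep looking
            return False

        def remove(self, key):
            for i in self._probe(key):
                if self.slots[i] is EMPTY:
                    return
                if self.slots[i] == key:
                    self.slots[i] = DELETED   # so later searches go on
                    return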
probing in open addressing
- linear probing:
- increment the index (modulo the array size) until a free slot is
found
- if many keys hash to nearby values, the occupied slots form
clusters, which may lead to long search times
- quadratic probing:
- add the square of the probe number (modulo the array size) to get the
next index
- avoids the problem with similar key hashes
- for example,
hash 200 would probe locations 200, 201, 204, 209, etc
hash 201 would probe locations 201, 202, 205, 210, etc, with overlap
at only one index (201)
- works well unless many keys have identical hash values, since
those keys all follow the same probe sequence
- double hashing
- define two hash functions h1 and h2
- h1(key) determines the initial slot to look into
- h2(key) determines the step size for the next probe
- h2(key) != 0 (otherwise the probe sequence would never advance)
- even if h1(key1) == h1(key2),
hopefully h2(key1) != h2(key2)
- for example, h1(key) could be the sum of the letters
in the string, and h2(key) could be the product of the letters
in the string plus 1
- even if two strings sum to the same number, the product would almost
certainly be different, as long as the two keys are different
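- the three probe sequences as Python functions (a sketch; probe is the
  probe number 0, 1, 2, ...):

    def linear(h, probe, size):
        return (h + probe) % size

    def quadratic(h, probe, size):
        return (h + probe * probe) % size

    def double(h1, h2, probe, size):
        return (h1 + probe * h2) % size   # requires h2 % size != 0

    # the quadratic example above: hashes 200 and 201 in a large table
    print([quadratic(200, i, 1000) for i in range(4)])  # [200, 201, 204, 209]
    print([quadratic(201, i, 1000) for i in range(4)])  # [201, 202, 205, 210]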
open addressing table size
- load factor cannot exceed 100%, so the table must have at least as
many slots as the number of stored elements
- linear probing always visits every slot; if the table size is a prime
number, double hashing will also visit the entire table before giving up,
since any nonzero step size is then relatively prime to the table size
- otherwise, for example, double hashing in a table of size
100 with step size 10 can only visit 1/10th of the slots
- if the table size is ever changed, each element must be reinserted
using the hash function and the new table size, since just copying
the old array would map an element to the wrong index:
24 modulo 11 = 2, but 24 modulo 23 = 1 -- no simple relationship
without recomputing the hash value
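- a sketch of the reinsertion, using linear probing and assuming the
  table stores plain keys:

    def rehash(old_slots, new_size):
        new_slots = [None] * new_size
        for key in old_slots:
            if key is not None:
                i = hash(key) % new_size        # recompute with new size
                while new_slots[i] is not None:
                    i = (i + 1) % new_size      # linear probing on collision
                new_slots[i] = key
        return new_slots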
in-class exercises
- what is the probing sequence if the hash table size is 11,
and h(key) = 4? answer for each of linear probing,
quadratic probing, and double hashing where h2(key) = 5
- using h(key) = key mod table size, insert
elements with key 56, 48, 40, 13 into a table of size 7
- remove the element with key 48
- locate the element with key 13
Alternatives to open addressing
- chaining or chained hashing: each array element refers to a
linked list of elements
- buckets: each array element can store up to a fixed number of elements
- either can accommodate load factors greater than 100%
- chaining is more flexible, but requires dynamic memory allocation
- in-class exercise: repeat the previous exercise using chained hashing
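- a sketch of chained hashing in Python, using a list per slot in place
  of a linked list:

    class ChainedTable:
        def __init__(self, size):
            self.chains = [[] for _ in range(size)]

        def _chain(self, key):
            # the linked list for this key's array element
            return self.chains[hash(key) % len(self.chains)]

        def insert(self, key):
            chain = self._chain(key)
            if key not in chain:     # search the chain, then append
                chain.append(key)

        def contains(self, key):
            return key in self._chain(key)

        def remove(self, key):
            chain = self._chain(key)
            if key in chain:
                chain.remove(key)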