Compression: basics
- wish to use as few bits as possible to transmit information
- lossless compression: decompressed information must be identical to
the information before compression
- lossy compression: decompressed information may be "slightly"
different from original information (e.g. degraded but acceptable image
quality)
- K is the compression function; K(F) is the compressed form of F, with |K(F)| < |F|
- X is the decompression function,
with either X(K(F)) =~ F (lossy) or X(K(F)) = F (lossless)
- lossy compression tries to minimize d = |X(K(F)) - F|
- compression ratio: compressed size / uncompressed size
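
The definitions above can be checked on a small lossless round trip. This is a minimal sketch that uses Python's zlib module (DEFLATE) as a stand-in for K and X; the notes do not name a particular compressor, and the sample data and the helper name compression_ratio are assumptions for illustration.

```python
import zlib

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Compressed size divided by uncompressed size (smaller is better)."""
    return len(compressed) / len(original)

data = b"abcabcabc" * 1000          # F: highly repetitive sample input
packed = zlib.compress(data)        # K(F)
restored = zlib.decompress(packed)  # X(K(F))

ratio = compression_ratio(data, packed)
lossless = restored == data         # lossless means X(K(F)) = F exactly
print(f"ratio = {ratio:.4f}, lossless = {lossless}")
```

For lossy schemes the same ratio applies, but the equality check would be replaced by a distortion measure d.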
Compression modes
- batch compression compresses an entire file (data) at once
- stream compression compresses data incrementally (e.g. telephone,
interactive video) -- mostly used for real-time and near-real-time applications
- progressive decompression provides essential elements of the data
earlier, and less essential elements later -- used for web browser images
- multilayer compression provides different versions in one package,
e.g. photo CD-ROMs
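
The difference between batch and stream compression can be sketched with zlib's incremental interface. This is an illustration only: the chunk contents and the helper name compress_stream are assumptions, and a real-time codec would normally be a purpose-built one rather than DEFLATE.

```python
import zlib

def compress_stream(chunks):
    """Yield compressed bytes as soon as each input chunk arrives."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        # Z_SYNC_FLUSH pushes out everything seen so far, so the receiver
        # can decode this chunk without waiting for the rest of the stream.
        yield compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
    yield compressor.flush()  # finish the stream

chunks = [b"hello " * 50, b"world " * 50, b"again " * 50]

# Batch mode: the whole input is available before compression starts.
batch = zlib.compress(b"".join(chunks))

# Stream mode: the receiver decodes piece by piece.
decompressor = zlib.decompressobj()
received = [decompressor.decompress(piece) for piece in compress_stream(chunks)]
```

Flushing after every chunk costs some compression efficiency, which is the usual price of low latency.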
Source Coding
- source entropy rate H determines number of bits required to encode
output of source
- e.g. coin toss requires one bit per toss
- e.g. one million ones requires the string "1 million 1s" (fewer
than one million bits)
- every digital transmission has a finite set of symbols k
- if symbols are independent and identically distributed, can
characterize source by probabilities p(k_i), where \sum_i p(k_i) = 1
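
For an independent, identically distributed source, the entropy is given by the standard Shannon formula H = -\sum_i p(k_i) log2 p(k_i), in bits per symbol. The formula is not spelled out in these notes; the sketch below assumes it and uses made-up probabilities to reproduce the coin-toss case.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy H = -sum p * log2(p), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy_bits([0.5, 0.5])    # fair coin: one bit per toss
biased_coin = entropy_bits([0.9, 0.1])  # predictable source: under one bit
certain = entropy_bits([1.0])           # no uncertainty: zero bits
```

The more predictable the source, the fewer bits per symbol are needed, which is why a long run of ones can be described so briefly.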
Compression Techniques
- short codes for frequent symbols (Huffman -- lossless)
- fit model parameters, send parameters (lossy)
- predict next symbol, send error (lossless/lossy)
- send difference with previous symbol (lossless/lossy)
- send lengths of runs of equal symbols (e.g. white space in a fax)
(lossless)
- build a dictionary, send pointers into dictionary (Lempel-Ziv) (lossless)
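
As one illustration of the list above, "send difference with previous symbol" can be sketched in its lossless form as delta coding. The sample values and function names are assumptions chosen for the example.

```python
def delta_encode(samples):
    """Replace each sample by its difference from the previous one."""
    previous = 0
    deltas = []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas

def delta_decode(deltas):
    """Rebuild the samples by accumulating the differences."""
    total = 0
    samples = []
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples

samples = [100, 102, 103, 103, 101, 104]
deltas = delta_encode(samples)  # [100, 2, 1, 0, -2, 3]
```

After the first value the differences are small, so they can be given short codes. Quantizing the differences before sending them would make the scheme lossy.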
Huffman Coding
- prefix-free codes:
- one code per input symbol
- no code is a prefix of another
- a Huffman code is an optimal prefix-free code (minimum expected code length for known symbol probabilities)
- short codes for frequent symbols, longer codes for infrequent symbols
- figures 8.5, 8.6
- algorithm:
- step 1: take the two symbols with the lowest probability
- step 2: join them into a new symbol whose probability is the sum of theirs
- step 3: return to step 1 (unless only one symbol is left)
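
The merging loop above can be sketched with a priority queue. This is a minimal illustration, not the textbook's figures: the symbol probabilities are assumed, and the assignment of 0 and 1 to the two merged branches is an arbitrary choice.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Return {symbol: bit string} for a dict {symbol: probability}."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Take the two least probable (possibly merged) symbols.
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Join them: prefix one branch with 0 and the other with 1.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
average_length = sum(probs[s] * len(code[s]) for s in probs)
```

With these probabilities the code lengths come out as 1, 2, 3 and 3 bits, for an average of 1.75 bits per symbol instead of the 2 bits a fixed-length code would need.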
Lempel-Ziv Coding
- dictionary encoding
- near-optimal: bits per symbol approach the source entropy rate for long inputs
- find the shortest sequence s of bits (length l > 0) not yet in the dictionary
- its prefix of length l - 1 must already be in the dictionary (or be empty)
- encode s as the dictionary position of that prefix, plus the new final bit, then add s to the dictionary
- number of bits to encode the position is ceil(log2(|dictionary|))
- decoding consists of following the pointer and appending the new bit
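
The parsing rule can be sketched as follows. This is a hedged illustration in the style of LZ78: index 0 stands for the empty prefix, the input bit string is made up for the example, and the function name lz_parse is an assumption.

```python
def lz_parse(bits):
    """Split a bit string into (prefix_index, new_bit) pairs.

    Index 0 is the empty prefix; index i > 0 points at the i-th phrase.
    Returns the pairs plus any trailing bits that did not complete a phrase.
    """
    dictionary = {"": 0}
    pairs = []
    phrase = ""
    for bit in bits:
        if phrase + bit in dictionary:
            phrase += bit  # still a known sequence, keep extending
        else:
            pairs.append((dictionary[phrase], bit))
            dictionary[phrase + bit] = len(dictionary)
            phrase = ""
    return pairs, phrase

# Parses as the phrases 1, 0, 11, 01, 010, 00, 10.
pairs, leftover = lz_parse("1011010100010")
```

Each pair costs one pointer plus one bit, and the pointer width grows only logarithmically with the dictionary size.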
Lempel-Ziv Example
1110 1100 1000 11 ...
- 1
- 1, pointer to 1 (a) /1
- 1, a/1, 0
- 1, a/1, 0, pointer to 11 (b) /0
- 1, a/1, 0, b/0, c/1
- 1, a/1, 0, b/0, c/1, c/0
- 1, a/1, 0, b/0, c/1, c/0, e/1
- ...
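
Decoding the pointer list from the steps above can be sketched by following each pointer and appending the new bit. The sketch assumes the dictionary entries are labelled a, b, c, ... in the order they are created (here numbered 1, 2, 3, ...), with 0 meaning "no prefix"; the function name lz_decode is an assumption.

```python
def lz_decode(pairs):
    """Expand (prefix_index, new_bit) pairs back into phrases."""
    phrases = [""]  # index 0 is the empty prefix
    for index, bit in pairs:
        phrases.append(phrases[index] + bit)
    return phrases[1:]

# 1, a/1, 0, b/0, c/1, c/0, e/1 with a=1, b=2, c=3, e=5
pairs = [(0, "1"), (1, "1"), (0, "0"), (2, "0"), (3, "1"), (3, "0"), (5, "1")]
phrases = lz_decode(pairs)
decoded = "".join(phrases)
```

The phrases come out as 1, 11, 0, 110, 01, 00, 011, which is the dictionary built up step by step in the example.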
Run-Length Encoding (RLE)
- Replace each run of identical bits (or digits,
or letters) by the symbol and a count of its repetitions.
- Example: data (24 bits) is
0000 0011 1111 1111 1000 0000
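- The compressed data is 6 zeros, 11 ones, and 7 zeros.
- Encoding this with a 1-bit value and a 3-bit count (maximum run 7, so the 11 ones split into 7 + 4), the encoding is
0-110 1-111 1-100 0-111
- The compressed size is 16 bits, so the compression ratio is 16 / 24 = 2/3 (a saving of (24 - 16) / 24 = 1/3).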
- RLE is good for compressing images with large uniform
areas (scanned text: 8-to-1 compression).
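
The worked example can be reproduced with a short sketch. It assumes the token layout used above (one value bit, then a fixed-width binary count) and splits any run longer than the count field allows; the function name rle_encode is an assumption.

```python
def rle_encode(bits, count_bits=3):
    """Encode a bit string as 'value-count' tokens with a fixed-width count."""
    max_run = (1 << count_bits) - 1  # longest run one token can hold
    tokens = []
    i = 0
    while i < len(bits):
        value = bits[i]
        run = 1
        while i + run < len(bits) and bits[i + run] == value and run < max_run:
            run += 1
        tokens.append(f"{value}-{run:0{count_bits}b}")
        i += run
    return tokens

data = "0000 0011 1111 1111 1000 0000".replace(" ", "")
tokens = rle_encode(data)                     # 0-110 1-111 1-100 0-111
compressed_bits = len(tokens) * (1 + 3)       # 1 value bit + 3 count bits
ratio = compressed_bits / len(data)
```

A wider count field would avoid splitting the run of ones but would cost an extra bit on every token, so the best width depends on typical run lengths.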