# Compression Codes

Created:
2023-07-23

Updated:
2024-01-18

## Codes Introduction

Given a finite alphabet `X`, a **code** for `X` is a function:

```
c: X → {0,1}*
```

We can extend the code to arbitrary strings of symbols in `X` by concatenation:

```
c(x₁..xₖ) = c(x₁)..c(xₖ)
```

A code is **uniquely decodable** if its extended encoding function (the one defined on strings) is injective.

Note that for all the entropy-related formulas we take as the base of the logarithm the cardinality of the code’s alphabet (in our case 2).

Given `x ∈ X`, the binary string `c(x)` is called a **codeword**, while the set `C = {c(x): x ∈ X}` is called a **codebook**.

Given a probability distribution defined over the alphabet symbols, we want to find the code which minimizes the average length of the codewords.

From now on, for the i-th symbol of the alphabet `xᵢ ∈ X`, we're going to alias `c(xᵢ)` with `cᵢ` and the probability associated with the symbol `p(xᵢ)` with `pᵢ`.

### Entropy Formulas Recap

- Entropy: `H(p) = ∑ₓ p(x)·log₂(1/p(x))`

- Cross Entropy: `H(p||q) = ∑ₓ p(x)·log₂(1/q(x))`

- Kullback-Leibler Divergence: `D(p||q) = ∑ₓ p(x)·log₂(p(x)/q(x))`

- Gibbs Inequality: `D(p||q) ≥ 0`

- Linking Identity: `H(p||q) = H(p) + D(p||q)`
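The formulas above can be checked numerically. The following is a minimal sketch; the function names and the two distributions `p` and `q` are illustrative assumptions, not part of the original notes.

```python
import math

def entropy(p):
    """H(p) = sum of p(x) * log2(1/p(x)), skipping zero-probability symbols."""
    return sum(px * math.log2(1 / px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p||q) = sum of p(x) * log2(1/q(x)); assumes q(x) > 0 where p(x) > 0."""
    return sum(px * math.log2(1 / qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(p||q) = sum of p(x) * log2(p(x)/q(x))."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Illustrative distributions (assumed for this example).
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(entropy(p))            # 1.5
print(cross_entropy(p, q))   # 1.75
print(kl_divergence(p, q))   # 0.25
```

Here the linking identity reads `1.75 = 1.5 + 0.25`, and the divergence is non-negative as the Gibbs inequality requires.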

## Prefix-Free Codes

A code is **prefix-free** (or **instantaneous**) if no codeword `cᵢ` is a prefix of another codeword `cⱼ`, with `i ≠ j`.

Every prefix-free code is uniquely decodable, but the converse is not true.

Fixed-length codes with distinct codewords are prefix-free by definition, but not all prefix-free codes are fixed length.

Decoding a string encoded with a prefix-free code proceeds from left to right. As soon as a codeword is found, we emit the corresponding symbol, remove the codeword from the encoded string, and restart decoding from the next bit.
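The left-to-right decoding procedure can be sketched as follows. This is a minimal illustration, assuming the codebook is given as a dictionary from symbols to bit strings; the function name and the sample codebook are hypothetical.

```python
def decode_prefix_free(bits, codebook):
    """Decode a bit string left to right with a prefix-free codebook.

    codebook maps each symbol to its codeword (a string of '0' and '1').
    """
    inverse = {codeword: symbol for symbol, codeword in codebook.items()}
    decoded = []
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:        # a complete codeword has been read
            decoded.append(inverse[current])
            current = ""              # restart with the next bit
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(decoded)

# Illustrative prefix-free codebook (an assumption for this sketch).
codebook = {"a": "0", "b": "10", "c": "110", "d": "111"}
encoded = "".join(codebook[s] for s in "abacd")
print(encoded)                                # 0100110111
print(decode_prefix_free(encoded, codebook))  # abacd
```

No look-ahead is needed: because no codeword is a prefix of another, the first match is always the right one.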

*Examples*

Given the alphabet `X = {a, b, c, d}`, the following codebooks can be associated with the symbols of `X`:

```
C₁ = {0, 111, 110, 101} (prefix-free, variable length)
C₂ = {00, 01, 10, 11} (prefix-free, fixed length)
C₃ = {0, 01, 011, 0111} (not prefix-free, uniquely decodable)
C₄ = {0, 1, 01, 10} (not uniquely decodable)
```
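The prefix-free property of the four codebooks can be checked mechanically. The sketch below is only an illustration; `is_prefix_free` is a hypothetical helper name, and the codewords are listed in the order of the symbols `a, b, c, d`.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of a different codeword."""
    return not any(
        a != b and b.startswith(a)
        for a in codewords
        for b in codewords
    )

codebooks = {
    "C1": ["0", "111", "110", "101"],
    "C2": ["00", "01", "10", "11"],
    "C3": ["0", "01", "011", "0111"],
    "C4": ["0", "1", "01", "10"],
}
for name, codewords in codebooks.items():
    print(name, is_prefix_free(codewords))
# C1 True
# C2 True
# C3 False
# C4 False

# In C4 the strings "ab" and "c" share the same encoding.
c4 = dict(zip("abcd", codebooks["C4"]))
print(c4["a"] + c4["b"] == c4["c"])  # True
```

Note that the check only tells prefix-free codes apart from the rest: it cannot distinguish `C3` (still uniquely decodable) from `C4` (ambiguous).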

In `C₃` the start of a codeword is marked by the symbol `0`, and thus the code is unambiguous. However, before decoding an alphabet symbol, we need to check whether the next bit is a `0` (*look-ahead*).

In `C₄` the string `01` can be decoded both as `ab` and as `c`.

We will prove that for every uniquely decodable code there exists a prefix-free code with the same codeword lengths.

### Binary-Tree mapping

A prefix-free code can always be mapped to a binary tree by associating each codeword with a leaf.

The mapping uses the bits of each codeword to describe a path from the root down to a node. Because no codeword is a prefix of another, no codeword's node lies on the path to another codeword's node, so after pruning every codeword ends at a leaf.

In practice, starting from the tree root, the bits of each codeword are read from left to right. If a bit is `0` we proceed to the left child of the current node; if it is `1` we proceed to the right child.

Once all the bits of a codeword are read, the current node is bound to that codeword.

When all the codewords are processed, all the subtrees whose nodes are not associated with any codeword are pruned.
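A minimal sketch of this mapping, assuming the tree is represented with nested dictionaries (an implementation choice made here, not something the notes prescribe). Nodes are created only along codeword paths, so no explicit pruning step is needed. The function names are hypothetical.

```python
def build_code_tree(codebook):
    """Build a binary tree (nested dicts) from a prefix-free codebook.

    Internal nodes are dicts keyed by '0' (left child) and '1' (right child);
    leaves are the symbols themselves. Codewords are assumed non-empty.
    """
    root = {}
    for symbol, codeword in codebook.items():
        node = root
        for bit in codeword[:-1]:
            node = node.setdefault(bit, {})
            if not isinstance(node, dict):   # we walked through a leaf
                raise ValueError("codebook is not prefix-free")
        if codeword[-1] in node:             # slot already used by another codeword
            raise ValueError("codebook is not prefix-free")
        node[codeword[-1]] = symbol
    return root

def decode_with_tree(bits, root):
    """Walk the tree bit by bit, emitting a symbol at every leaf."""
    decoded, node = [], root
    for bit in bits:
        node = node[bit]
        if not isinstance(node, dict):       # reached a leaf
            decoded.append(node)
            node = root
    return "".join(decoded)

# Codebook C1 from the examples, with symbols in the order a, b, c, d.
codebook = {"a": "0", "b": "111", "c": "110", "d": "101"}
tree = build_code_tree(codebook)
print(decode_with_tree("01111100101", tree))  # abcad
```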

### Kraft Inequality

**Lemma**. Given `n` integers `{l₁,..,lₙ}` with `lᵢ ≥ 1`, there exists a prefix-free code with codeword lengths `{l₁,..,lₙ}` if and only if `∑ᵢ 2^(-lᵢ) ≤ 1`.

*Proof Idea*.

Each codeword of length `lᵢ` must be mapped to a leaf at depth `lᵢ` in a binary tree. We can assign to each such leaf a weight equal to `2^(-lᵢ)`, which is the fraction of the full binary tree lying below it. Since the leaves of a prefix-free code cover disjoint parts of the tree, the weights sum to at most `1`, and the thesis follows.

Note that the smaller `lᵢ` is, the bigger `2^(-lᵢ)` is. If the lengths are `lᵢ = log₂(1/pᵢ)`, then `2^(-lᵢ) = 2^(-log₂(1/pᵢ)) = pᵢ`, so the terms `2^(-lᵢ)` sum exactly to `1`.
When the sum is exactly `1`, we can still permute the lengths among the symbols, but we cannot reduce one of them without increasing another.

*Examples*

Given the lengths `{1,2,3,3}` we try to map them to a binary tree. We draw a path of length `1` in the tree, and we prune what is below. Then we take the second length, and we draw a path of length `2` using what is left of the tree. Again we prune what is below. We proceed with the other lengths using the same strategy. The Kraft sum is: `1/2 + 1/4 + 1/8 + 1/8 = 1`.

If we try with `{1,2,2,3}` we can see that it is not possible to map the lengths to a binary tree, and thus not possible to create a prefix-free code. The Kraft sum is `1/2 + 1/4 + 1/4 + 1/8 = 9/8 > 1`.
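The two examples can be reproduced with a short sketch. The construction below assigns codewords in order of increasing length using a running counter, which is one common way to realize the "draw a path, then prune" strategy; it is an assumption of this example rather than the only possible construction, and the function names are hypothetical.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Return the Kraft sum, the sum of 2^(-l), as an exact fraction."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build a prefix-free code with the given lengths (each >= 1).

    Codewords are assigned in order of increasing length. Each codeword is
    the binary expansion of a running counter, which mirrors walking the
    tree left to right and pruning below every assigned leaf.
    """
    if kraft_sum(lengths) > 1:
        raise ValueError("Kraft sum exceeds 1: no prefix-free code exists")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, previous = 0, 0
    for i in order:
        value <<= lengths[i] - previous    # descend to the new depth
        codewords[i] = format(value, f"0{lengths[i]}b")
        value += 1                         # move to the next free node
        previous = lengths[i]
    return codewords

print(kraft_sum([1, 2, 3, 3]))                 # 1
print(prefix_code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
print(kraft_sum([1, 2, 2, 3]))                 # 9/8
```

Calling `prefix_code_from_lengths([1, 2, 2, 3])` raises a `ValueError`, matching the second example.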

### McMillan Inequality

**Lemma**. If `{l₁,..,lₙ}` are the codeword lengths of a non-ambiguous (uniquely decodable) code, then `∑ᵢ 2^(-lᵢ) ≤ 1`.

**Corollary**. The McMillan inequality holds in both directions: if `∑ᵢ 2^(-lᵢ) ≤ 1`, then by Kraft's inequality there exists a prefix-free code with the given codeword lengths, and a prefix-free code is unambiguous by construction.

**Corollary**. There is a decodable code with codeword lengths `{l₁,..,lₙ}` if and only if there exists a prefix-free code with the same codeword lengths.

```
∃ Not-Ambiguous c₁(x) ↔ ∑ᵢ 2^(-lᵢ) ≤ 1 ↔ ∃ Prefix Free c₂(x)
                      (⬑ McMillan)     (⬑ Kraft)
```

## Code Average Length

The code average length is defined as:

```
L(c) = ∑ᵢ pᵢ·|cᵢ|
```

*Example*:

```
P = {0.9, 0.05, 0.025, 0.025}
C₁ = {0, 111, 110, 101} → L(C₁) = 0.9·1 + 0.05·3 + ... = 1.2
C₂ = {00, 01, 10, 11} → L(C₂) = 0.9·2 + 0.05·2 + ... = 2.0
H(p) ≈ 0.619
```
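The numbers in the example can be reproduced with a few lines. This is a minimal sketch; the helper names are assumptions made for illustration.

```python
import math

def average_length(p, codewords):
    """L(c) = sum of p_i * |c_i|."""
    return sum(pi * len(ci) for pi, ci in zip(p, codewords))

def entropy(p):
    """H(p) in bits, skipping zero-probability symbols."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

P = [0.9, 0.05, 0.025, 0.025]
C1 = ["0", "111", "110", "101"]
C2 = ["00", "01", "10", "11"]

print(round(average_length(P, C1), 3))  # 1.2
print(round(average_length(P, C2), 3))  # 2.0
print(round(entropy(P), 2))             # 0.62
```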

Since, as explained later, the average length of any uniquely decodable code is bounded below by the entropy of the alphabet probability distribution, and `L(C₁) = 1.2` is still well above `H(p)`, we can strive for a better codebook.

### Language Redundancy

In this context, redundancy is a synonym for **compressibility**.

Imagine that an alphabet has `N` possible symbols. Suppose each symbol is encoded using a fixed number of bits `⌈log₂N⌉` (trivial encoding) instead of the optimal encoding, where on average we use `H(p)` bits per symbol. Then the number of extra bits used per symbol is:

```
Δ = ⌈log₂N⌉ - H(p) ≈ log₂N·(1 - H(p)/log₂N) = log₂N·R
R := 1 - H(p)/log₂N
```

`R` is defined as the language **redundancy** and has a value between `0` and `1`.

When `H(p)` is close to `log₂N`, `R` is close to `0` and thus the language is not very compressible. Conversely, the smaller `H(p)` is compared to `log₂N`, the closer `R` is to `1`.

For example, for the English language `R ≈ 0.62`: it has a redundancy of roughly `60%`, so under ideal conditions an optimal compressor can reduce a text to roughly `40%` of its original size.
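As a small illustration, the redundancy of a distribution can be computed directly from its definition. The function name and the two sample distributions are assumptions of this sketch.

```python
import math

def redundancy(p):
    """R = 1 - H(p)/log2(N) for a distribution over N >= 2 symbols."""
    h = sum(pi * math.log2(1 / pi) for pi in p if pi > 0)
    return 1 - h / math.log2(len(p))

# A uniform distribution has no redundancy.
print(redundancy([0.25, 0.25, 0.25, 0.25]))              # 0.0

# A very skewed distribution is highly redundant.
print(round(redundancy([0.9, 0.05, 0.025, 0.025]), 2))   # 0.69
```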

### Source Coding Theorem

The theorem (Shannon 1948) synthesizes all the results we’ve proven so far with respect to optimal encoding.

**Theorem**. If `c` is an unambiguous code with codeword lengths `L = {l₁,..,lₙ}` then `L(c) ≥ H(p)`, with equality if and only if `lᵢ = log₂(1/pᵢ)` for every `pᵢ > 0`.

*Proof*

If `lᵢ = log₂(1/pᵢ)` then `L(c) = ∑ᵢ pᵢ·log₂(1/pᵢ) = H(p)`.

In general, any length can be written as `lᵢ = log₂(2^lᵢ) = log₂(1/2^(-lᵢ))`.

When `∑ᵢ 2^(-lᵢ) = 1`, the values `2^(-lᵢ)` can be interpreted as a probability distribution `q` with `qᵢ = 2^(-lᵢ)`.

Then `lᵢ = log₂(1/qᵢ)` and thus, by the linking identity: `L(c) = ∑ᵢ pᵢ·log₂(1/qᵢ) = H(p||q) = H(p) + D(p||q)`

Unfortunately, in general the values `2^(-lᵢ)` don't sum exactly to `1`. To map them to a probability distribution we need to normalize them so that they sum to `1`. Thus, we define:

```
k = ∑ᵢ 2^(-lᵢ) and qᵢ = 2^(-lᵢ)/k → ∑ᵢ qᵢ = 1
```

The lengths can now be written as:

```
lᵢ = log₂(1/2^(-lᵢ)) = log₂(1/(k·2^(-lᵢ)/k)) = log₂(1/(k·qᵢ))
= log₂(1/qᵢ) + log₂(1/k)
```

The average code length is thus:

```
L(c) = ∑ᵢ pᵢ·lᵢ = ∑ᵢ pᵢ·log₂(1/qᵢ) + ∑ᵢ pᵢ·log₂(1/k) = H(p||q) + log₂(1/k)
```

Since `c` is unambiguous, by the McMillan inequality: `0 < k ≤ 1 → 1 ≤ 1/k → 0 ≤ log₂(1/k)`

```
L(c) = H(p||q) + log₂(1/k) ≥ H(p||q)
```

Using the linking identity and the Gibbs inequality:

```
L(c) ≥ H(p||q) = H(p) + D(p||q) ≥ H(p)
```

We have equality when both of the following hold:

`D(p||q) = 0 ↔ p = q`

`log₂(1/k) = 0 ↔ k = 1 ↔ qᵢ = 2^(-lᵢ)`

Together they give `pᵢ = qᵢ = 2^(-lᵢ)`, that is `lᵢ = log₂(1/pᵢ)`.

∎
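The decomposition used in the proof can be verified numerically. The sketch below reuses the distribution and the codeword lengths of `C₁` from the average-length example; the variable names are illustrative assumptions.

```python
import math

p = [0.9, 0.05, 0.025, 0.025]   # distribution from the average-length example
lengths = [1, 3, 3, 3]          # codeword lengths of C1

k = sum(2 ** -l for l in lengths)        # Kraft sum, here 0.875
q = [2 ** -l / k for l in lengths]       # normalized distribution q

avg_len = sum(pi * li for pi, li in zip(p, lengths))
entropy = sum(pi * math.log2(1 / pi) for pi in p)
cross_entropy = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))

# L(c) = H(p||q) + log2(1/k)
print(math.isclose(avg_len, cross_entropy + math.log2(1 / k)))  # True
# L(c) >= H(p||q) >= H(p)
print(avg_len >= cross_entropy >= entropy)                       # True
```

In this instance both inequalities are strict: `k < 1` wastes part of the tree, and `q` differs from `p`.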

The theorem suggests that for an optimal non-ambiguous code each codeword length should be `lᵢ = log₂(1/pᵢ)`.

The optimal values satisfy the McMillan inequality:

```
∑ᵢ 2^(-lᵢ) = ∑ᵢ 2^(-log₂(1/pᵢ)) = ∑ᵢ 2^(log₂(pᵢ)) = ∑ᵢ pᵢ = 1
```

**Corollary**. Codeword lengths must be integers, so for a given distribution `p` the ideal values `log₂(1/pᵢ)` are rounded up to `lᵢ = ⌈log₂(1/pᵢ)⌉`, which are bounded by:

```
log₂(1/pᵢ) ≤ lᵢ = ⌈log₂(1/pᵢ)⌉ < log₂(1/pᵢ) + 1
```

This potential extra bit per codeword may look innocuous, but on a large amount of data it may end up having a non-negligible impact.

`log₂(1/pᵢ) = ⌈log₂(1/pᵢ)⌉` if and only if `1/pᵢ` is a power of two.

Multiplying the per-codeword bound by `pᵢ` and summing over `i` gives an estimate for the optimal `L(c)`:

```
H(p) ≤ L(c) < H(p) + 1
```

With `H(p) = L(c)` if and only if all the `log₂(1/pᵢ)` are integers.

Note that if a symbol `x` has `p(x) = 0`, we can assign to it an arbitrary codeword, as it will never be encoded.

## Practical Codes Design

### Shannon-Fano Coding

Given the probability distribution of the symbols `p = {p₁,..,pₙ}`, we set the codeword length of each symbol `xᵢ` to `lᵢ = ⌈log₂(1/pᵢ)⌉`.

The lengths are then used to construct the binary tree associated with the prefix-free encoding.

Problems with this approach:

- The *true* `pᵢ` values are unknown; we know just the approximations given by the relative frequencies. Thus, `p` can't be used to create an optimal and generic code for all the possible strings.
- In general, `⌈log₂(1/pᵢ)⌉ ≠ log₂(1/pᵢ)`, thus we are subject to the extra bit penalty described in the previous paragraph.
- Sub-optimal encoding of symbols with very low probabilities (see the example).

*Optimal* encoding example:

```
p = {1/2, 1/4, 1/8, 1/8}
l = {1, 2, 3, 3}
C = {0, 10, 110, 111}
```

Note that in this case `⌈log₂(1/pᵢ)⌉ = log₂(1/pᵢ)` and thus `L(c) = H(p)`.

*Sub-optimal* encoding example:

```
p = {1/2, 1/2 - ε, ε}, with ε = 2^-5
l₂ = ⌈log₂(1/(1/2 - ε))⌉ = ⌈1.09⌉ = 2
l₃ = log₂(1/ε) = 5
l = {1, 2, 5}
C = {0, 10, 11000 }
```

This code is evidently suboptimal, since a prefix-free code for such an alphabet can be constructed with `l = {1, 2, 2}`. This is a case where `H(p) < L(c)`.
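Both examples can be reproduced with a short sketch that computes the rounded-up lengths and compares the resulting average length with the entropy. The helper names are assumptions made for illustration.

```python
import math

def shannon_lengths(p):
    """Codeword lengths l_i = ceil(log2(1/p_i)), for strictly positive p_i."""
    return [math.ceil(-math.log2(pi)) for pi in p]

def average_length(p, lengths):
    return sum(pi * li for pi, li in zip(p, lengths))

def entropy(p):
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

eps = 2 ** -5
optimal = [1/2, 1/4, 1/8, 1/8]
skewed = [1/2, 1/2 - eps, eps]

print(shannon_lengths(optimal))   # [1, 2, 3, 3]
print(shannon_lengths(skewed))    # [1, 2, 5]

# Average length with the rounded-up lengths versus the shorter {1, 2, 2}.
print(average_length(skewed, shannon_lengths(skewed)))  # 1.59375
print(average_length(skewed, [1, 2, 2]))                # 1.5
```

For the first distribution the average length equals the entropy (1.75 bits); for the second one the rare symbol is charged five bits even though two would suffice.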

### Huffman Coding

The prefix-free tree is progressively constructed *bottom-up*, starting with one leaf per symbol, weighted by the symbol's probability.

The two parent-less nodes with the smallest probabilities are merged to create a parent node with probability equal to the sum of the children's probabilities.

The procedure is iterated until we reach the root.

The resulting tree defines the associated prefix-free code.

*Example*.

Given `p = {0.025, 0.025, 0.05, 0.9}`.

1. Sort the probabilities: `p₀ = {0.025, 0.025, 0.05, 0.9}`.
2. Merge the two elements with the smallest probabilities `{0.025, 0.025}` to create a new node with associated probability `0.05`.
3. Update the probabilities set: `p₁ = {0.05, 0.05, 0.9}`.
4. Go to step 2 until we are left with a set with one single element `pₙ = {1}`.

```
(0.025)-+
(0.025)-+-(0.05)-+
(0.05)-----------+-(0.1)-+
(0.9)--------------------+-(1.0)
```

**Theorem**. The Huffman algorithm constructs an optimal prefix-free code.

The only practical problem is that `p` can be unknown.
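The merge procedure can be sketched with a priority queue. This is a minimal illustration using Python's `heapq`; the function name, the tie-breaking counter, and the choice of `0` for the first popped node and `1` for the second are assumptions of this sketch (any consistent labeling gives the same codeword lengths for this example).

```python
import heapq
import itertools

def huffman_code(p):
    """Return a list of codewords, one per probability in p (len(p) >= 2)."""
    counter = itertools.count()   # tie-breaker, so the heap never compares lists
    # Heap entries: (probability, tie-breaker, indices of the symbols below the node)
    heap = [(pi, next(counter), [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    codewords = [""] * len(p)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)    # smallest probability
        p1, _, right = heapq.heappop(heap)   # second smallest probability
        for i in left:
            codewords[i] = "0" + codewords[i]
        for i in right:
            codewords[i] = "1" + codewords[i]
        heapq.heappush(heap, (p0 + p1, next(counter), left + right))
    return codewords

p = [0.025, 0.025, 0.05, 0.9]
code = huffman_code(p)
print([len(c) for c in code])   # [3, 3, 2, 1]
print(round(sum(pi * len(ci) for pi, ci in zip(p, code)), 3))  # 1.15
```

The codeword lengths match the depth of each leaf in the tree drawn above.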

### Lempel-Ziv Coding

This algorithm is the foundation of the algorithms used by the *zip/gzip* applications and of the *Deflate* algorithm used by *zlib* (LZ77 combined with Huffman codes).

LZ algorithm is not optimal in terms of minimizing the average codeword length for a given probability distribution of source symbols. Instead, it is asymptotically optimal in terms of compression ratio for long sequences of data where the statistical properties are not known in advance or are changing.

The algorithm works by scanning the input data from left to right and building a **dictionary** of encountered substrings. Each time a known substring is followed by a letter that turns it into a new, not yet seen substring, the algorithm outputs a pair of values: a reference to the known substring and the new letter.

It can be proved that, for long sequences, the number of bits used by LZ77 tends towards the empirical entropy of the sequence.

In practice the dictionary is a table with the following columns:

- *index*: number used to reference back to a row instance;
- *reference*: number pointing to another row whose substring is used as prefix;
- *symbol*: symbol introduced by the substring.

Example:

```
"AABABBBABAABABBBABBABB"
```

Block 0 is the empty block.

```
A AB ABB B ABA ABAB BB ABBA BB

idx | ref | sym | substring
----|-----|-----|----------
1   | 0   | A   | A
2   | 1   | B   | AB
3   | 2   | B   | ABB
4   | 0   | B   | B
5   | 2   | A   | ABA
6   | 5   | B   | ABAB
7   | 4   | B   | BB
8   | 3   | A   | ABBA
9   | 4   | B   | BB
```

Note that the last block is equal to block 7.
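The table can be reproduced with a short sketch of the dictionary-based parsing (this `(reference, symbol)` scheme is commonly known as LZ78-style parsing). The function names are hypothetical, and the handling of an input that ends inside an already-known block is an assumption chosen to match the last row of the table.

```python
def lz_parse(text):
    """Split text into blocks and return (reference, symbol) pairs.

    Block 0 is the empty block. Each pair says: take the block at index
    `reference` and append `symbol` to obtain the current block.
    """
    dictionary = {"": 0}           # substring -> block index
    pairs = []
    current = ""
    for symbol in text:
        if current + symbol in dictionary:
            current += symbol      # keep extending a known block
        else:
            pairs.append((dictionary[current], symbol))
            dictionary[current + symbol] = len(dictionary)
            current = ""
    if current:
        # Input ended inside a known block: emit it as prefix + last symbol.
        pairs.append((dictionary[current[:-1]], current[-1]))
    return pairs

def lz_unparse(pairs):
    """Rebuild the text from the (reference, symbol) pairs."""
    blocks = [""]
    for reference, symbol in pairs:
        blocks.append(blocks[reference] + symbol)
    return "".join(blocks)

text = "AABABBBABAABABBBABBABB"
pairs = lz_parse(text)
print(pairs)
print(lz_unparse(pairs) == text)  # True
```

The decoder never needs the dictionary to be transmitted: it rebuilds the same blocks, in the same order, from the pairs alone.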

During encoding, the references and the letters are converted to binary.

The number of bits used to encode a reference is driven by the number of references already encoded: we start using `n+1` bits after we have encoded `2ⁿ` references.

In the following example, the binary string outside the parentheses is the binary representation of the reference, while the binary digit within the parentheses represents the character: `A = 0` and `B = 1`.

```
0(0) 1(1) 10(1) 00(1) 010(0) 101(1) 100(1) 011(0) 0100(1)
```
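The binary string above can be regenerated from the `(reference, symbol)` pairs of the table. This is a minimal sketch: the function name is hypothetical, and using one bit for the very first reference (when nothing has been encoded yet) is an assumption made to match the example.

```python
def encode_pairs(pairs, symbol_bits):
    """Render (reference, symbol) pairs as 'reference(symbol)' bit groups.

    The i-th pair (counting from 0) uses max(1, i.bit_length()) bits for the
    reference, i.e. n+1 bits once 2**n references have been encoded.
    """
    groups = []
    for i, (reference, symbol) in enumerate(pairs):
        width = max(1, i.bit_length())
        groups.append(f"{reference:0{width}b}({symbol_bits[symbol]})")
    return " ".join(groups)

# Pairs taken from the dictionary table of the example.
pairs = [(0, "A"), (1, "B"), (2, "B"), (0, "B"), (2, "A"),
         (5, "B"), (4, "B"), (3, "A"), (4, "B")]
print(encode_pairs(pairs, {"A": "0", "B": "1"}))
# 0(0) 1(1) 10(1) 00(1) 010(0) 101(1) 100(1) 011(0) 0100(1)
```

The parentheses are only a visual aid; in an actual bit stream the groups are concatenated, and the decoder recovers the boundaries because it knows how many references it has already read.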

Decoding consists of reconstructing the dictionary from the binary string and recovering the substrings.

In this specific example there is no compression, but in a longer text with more redundancy many substrings repeat, and a repeated substring is then merely represented as a pointer to some other block.