# Basic Information Theory

### How do you efficiently encode messages?

• How would we encode English with bits?
• Or, how could we encode the most information with the fewest bits?
• We should use shorter codes for more frequent letters and longer codes for less frequent letters
• The most common characters are:
1. e
2. t
3. a
• Z, X, and Q are rare
(Figure: frequency of English characters. Source: Wikimedia Commons)

• Morse code takes this into account
• Here's another view, similar to a Huffman code
• Notice that our code length for each letter is around $\log\left(\frac{1}{p(x)}\right)$ (see the sketch below)

(Figure source: Wikimedia Commons)
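
As a rough illustration (a minimal sketch; the letter frequencies below are approximate values for demonstration, not read off the figure), we can compute the surprisal $\log_2\left(\frac{1}{p(x)}\right)$ of a few letters and see why frequent letters deserve shorter codes:

```python
import math

# Approximate English letter frequencies (illustrative values only)
freqs = {"e": 0.127, "t": 0.091, "a": 0.082, "q": 0.001, "z": 0.0007}

for letter, p in freqs.items():
    # Ideal code length in bits is the surprisal: log2(1/p) = -log2(p)
    print(f"{letter}: p = {p:.4f}, ideal code length = {-math.log2(p):.2f} bits")
```

A frequent letter like "e" comes out to about 3 bits, while a rare letter like "z" costs more than 10, which is exactly the shorter-codes-for-frequent-letters pattern Morse code follows.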

## What is entropy?

$H(p) = \sum_x p(x) \underbrace{\log\left( \frac{1}{p(x)} \right)}_{\textrm{"Surprisal"}}$
We can think of surprisal as the length of the message that encodes outcome $x$, so the entropy of a distribution is just the expected message length.
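
A minimal sketch of this definition, assuming a discrete distribution given as an array of probabilities (base-2 logs, so the result is in bits):

```python
import numpy as np

def entropy(p):
    """H(p) = sum_x p(x) * log2(1/p(x)), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # outcomes with p(x) = 0 contribute nothing
    return np.sum(p * np.log2(1.0 / p))

print(entropy([0.5, 0.5]))  # 1.0: a fair coin takes one bit per flip
print(entropy([0.9, 0.1]))  # ~0.47: a biased coin is less surprising on average
```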

### Entropy of a beta distribution

(Interactive demo: adjust $\alpha$ and $\beta$ to see the entropy of a beta distribution.)
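
The demo itself isn't reproduced here, but SciPy can compute the same quantity; a quick sketch (note that for continuous distributions this is *differential* entropy, reported in nats, which can be negative):

```python
from scipy.stats import beta

# Differential entropy of Beta(alpha, beta), in nats
for a, b in [(1, 1), (2, 2), (5, 5), (0.5, 0.5)]:
    print(f"alpha = {a}, beta = {b}: entropy = {float(beta(a, b).entropy()):.4f}")
# Beta(1, 1) is uniform on [0, 1], so its differential entropy is 0;
# more concentrated choices of (alpha, beta) give lower entropy
```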

## What is cross entropy?

$H(p, q) = \sum_x p(x) \log\left( \frac{1}{q(x)} \right)$
Cross entropy is the expected message length when events are drawn from $p$ but encoded with the code that is optimal for $q$.
(Interactive demo: adjust $p$ and $q$ to see the cross entropy.)
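
A sketch for two discrete distributions, following the formula above (assumes $q(x) > 0$ wherever $p(x) > 0$):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = sum_x p(x) * log2(1/q(x)), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # only outcomes that p actually assigns mass to
    return np.sum(p[mask] * np.log2(1.0 / q[mask]))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(cross_entropy(p, p))  # 1.5: encoding p with its own code gives H(p)
print(cross_entropy(p, q))  # ~1.58: using q's code for data from p costs more
```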

## What is Kullback-Leibler divergence?

$D_{KL}(p || q) = H(p, q) - H(p) = \sum_x p(x) \log\left( \frac{p(x)}{q(x)} \right)$
This is the expected number of extra bits paid when data from $p$ is encoded with the code optimized for $q$.
(Interactive demo: adjust $p$ and $q$ to see the Kullback-Leibler divergence.)
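
A sketch checking that the difference $H(p, q) - H(p)$ matches the direct formula:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log2(p(x)/q(x)), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # 0 * log(0/q) is taken to be 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])

h_p = np.sum(p * np.log2(1 / p))   # entropy H(p)
h_pq = np.sum(p * np.log2(1 / q))  # cross entropy H(p, q)

print(kl_divergence(p, q))  # ~0.085
print(h_pq - h_p)           # same value: H(p, q) - H(p)
```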

## Mutual Information

Mutual information can be thought of as $I(X;Y) = H(X) - H(X|Y)$, the reduction in uncertainty about $X$ from observing $Y$, or equivalently as $I(X;Y) = \mathbb{E}_Y\left[D_{KL}(p(x|y) || p(x))\right]$.
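
A sketch computing both forms from a small joint distribution table (the numbers are made up for illustration):

```python
import numpy as np

def H(p):
    """Entropy of a discrete distribution, in bits."""
    p = p[p > 0]
    return np.sum(p * np.log2(1 / p))

# Joint distribution p(x, y): rows index x, columns index y (illustrative values)
joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])
p_x = joint.sum(axis=1)  # marginal p(x)
p_y = joint.sum(axis=0)  # marginal p(y)

# I(X;Y) = H(X) - H(X|Y), where H(X|Y) = sum_y p(y) * H(X | Y=y)
h_x_given_y = sum(p_y[j] * H(joint[:, j] / p_y[j]) for j in range(len(p_y)))
print(H(p_x) - h_x_given_y)

# I(X;Y) = E_Y[ D_KL( p(x|y) || p(x) ) ] gives the same value
print(sum(p_y[j] * np.sum((joint[:, j] / p_y[j]) *
                          np.log2((joint[:, j] / p_y[j]) / p_x))
          for j in range(len(p_y))))
```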

### Uses of Information Theory

• Used for compression (e.g., Huffman coding)
• Can be used to measure information gain: $D_{KL}\left(p(x \mid \textrm{split}) || p(x)\right)$
• Mutual information is used in decision trees and feature selection; decision trees choose splits that reduce entropy, i.e., maximize information gain (see the sketch below)
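
As a sketch of information gain for a candidate decision-tree split (the class counts below are hypothetical):

```python
import numpy as np

def entropy(counts):
    """Entropy of a label distribution given raw class counts, in bits."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return np.sum(p * np.log2(1 / p))

# Hypothetical node with 10 positive / 10 negative labels, split by some feature
parent = [10, 10]
left, right = [9, 1], [1, 9]

n = sum(parent)
children = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)

# Information gain = reduction in entropy from making the split
print(entropy(parent) - children)  # ~0.53 bits
```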