Basic Information Theory



Josh Moller-Mara

How do you efficiently encode messages?

  • How would we encode English with bits?
  • Or, how could we encode the most information with the fewest bits?
  • We should use shorter codes for more frequent letters
    and longer codes for less frequent letters
  • The most common characters are:
    1. e
    2. t
    3. a
  • Z, X, and Q are rare
[Figure: frequency of English characters (Source: Wikimedia Commons)]
Morse code takes this into account
Here's another view, similar to a Huffman code
Notice that our code length for each letter is around $\log\left(\frac{1}{p(x)}\right)$
[Figure (Source: Wikimedia Commons)]
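As a rough illustration of that claim, here is a minimal Python sketch (not from the slides): it builds a Huffman code for a handful of letters, using approximate, purely illustrative frequency values, and compares each code length to $\log_2\left(\frac{1}{p(x)}\right)$.

```python
# A minimal sketch: Huffman code lengths roughly track log2(1/p(x)).
# The frequency values below are approximate and only for illustration.
import heapq
import math

freqs = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
         "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060, "q": 0.001}
total = sum(freqs.values())
probs = {c: f / total for c, f in freqs.items()}

# Huffman coding: repeatedly merge the two least probable nodes.
heap = [(p, i, {c: ""}) for i, (c, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p1, _, codes1 = heapq.heappop(heap)
    p2, _, codes2 = heapq.heappop(heap)
    merged = {c: "0" + code for c, code in codes1.items()}
    merged.update({c: "1" + code for c, code in codes2.items()})
    heapq.heappush(heap, (p1 + p2, counter, merged))
    counter += 1
codes = heap[0][2]

for c in sorted(probs, key=probs.get, reverse=True):
    ideal = math.log2(1 / probs[c])
    print(f"{c}: p={probs[c]:.3f}  code={codes[c]:>6}  "
          f"len={len(codes[c])}  log2(1/p)={ideal:.2f}")
```

Frequent letters like "e" end up with short codewords and rare letters like "q" with long ones, close to their ideal lengths.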

What is entropy?

\[H(p) = \sum p(x) \underbrace{\log\left( \frac{1}{p(x)} \right)}_{\textrm{"Surprisal"}}\]
We can think of surprisal as a message length
So the entropy of a distribution is just the expected message length
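A minimal sketch of that reading (the distribution is made up for illustration): entropy as the expected surprisal, in bits.

```python
# A minimal sketch: entropy = expected "surprisal" log2(1/p(x)), in bits.
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # illustrative distribution

surprisal = np.log2(1 / p)                # "message length" per outcome
entropy = np.sum(p * surprisal)           # expected message length
print(entropy)                            # 1.75 bits
```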

Entropy of a beta distribution

[Interactive demo: choose $\alpha$ and $\beta$; the entropy of the resulting beta distribution is shown]
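A minimal sketch of the same computation using `scipy.stats.beta`; for a continuous distribution this is the differential entropy (in nats), which can be negative.

```python
# A minimal sketch: differential entropy of Beta(alpha, beta) via scipy.
from scipy.stats import beta

for a, b in [(1, 1), (2, 2), (0.5, 0.5), (10, 2)]:
    # Beta(1, 1) is uniform on [0, 1], so its differential entropy is 0.
    print(f"alpha={a}, beta={b}: entropy = {beta(a, b).entropy():.3f}")
```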

What is cross entropy?

\[H(p, q) = \sum p(x) \log\left( \frac{1}{q(x)} \right)\]
[Interactive demo: specify distributions $p$ and $q$; the cross entropy $H(p, q)$ is computed]
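A minimal sketch with two small, illustrative discrete distributions: the cross entropy $H(p, q)$ compared against $H(p)$.

```python
# A minimal sketch: cross entropy H(p, q) for two discrete distributions.
# H(p, q) >= H(p), with equality only when p == q.
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # illustrative values
q = np.array([0.25, 0.25, 0.5])

cross_entropy = np.sum(p * np.log2(1 / q))   # H(p, q) = 1.75 bits
entropy_p = np.sum(p * np.log2(1 / p))       # H(p)    = 1.5 bits
print(cross_entropy, entropy_p)
```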

What is Kullback-Leibler divergence?

\[D_{KL}(p || q) = H(p, q) - H(p)\]
[Interactive demo: specify distributions $p$ and $q$; the Kullback-Leibler divergence $D_{KL}(p \| q)$ is computed]
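A minimal sketch (same illustrative distributions as above) computing $D_{KL}(p \| q)$ both directly and as $H(p, q) - H(p)$; the two agree.

```python
# A minimal sketch: D_KL(p || q) directly and as H(p, q) - H(p).
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # illustrative values
q = np.array([0.25, 0.25, 0.5])

kl_direct = np.sum(p * np.log2(p / q))
kl_via_entropies = np.sum(p * np.log2(1 / q)) - np.sum(p * np.log2(1 / p))
print(kl_direct, kl_via_entropies)   # both 0.25 bits here
```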

Mutual Information

Mutual Information can be thought of as \[I(X;Y) = H(X) - H(X|Y)\] or \[I(X;Y) = \mathbb{E}_Y\left[D_{\mathrm{KL}}(p(x|y)\|p(x))\right]\]
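A minimal sketch with a toy joint distribution (my own example, not from the slides) checking that the two formulations above agree.

```python
# A minimal sketch: mutual information from a small joint distribution
# p(x, y), as H(X) - H(X|Y) and as E_Y[ D_KL(p(x|y) || p(x)) ].
import numpy as np

# Illustrative joint distribution: rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def entropy(p):
    p = p[p > 0]
    return np.sum(p * np.log2(1 / p))

# H(X) - H(X|Y), where H(X|Y) = sum_y p(y) H(X | Y=y)
h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j])
                  for j in range(len(p_y)))
mi_1 = entropy(p_x) - h_x_given_y

# E_Y[ D_KL(p(x|y) || p(x)) ]
mi_2 = sum(p_y[j] * np.sum((p_xy[:, j] / p_y[j]) *
                           np.log2((p_xy[:, j] / p_y[j]) / p_x))
           for j in range(len(p_y)))
print(mi_1, mi_2)   # the two formulations agree (about 0.26 bits)
```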

Uses of Information Theory

  • Used for compression, like Huffman coding, etc.
  • Can be used to measure information gain: $\mathbb{E}_{\text{split}}\left[D_{KL}\left(p(x \mid \text{split}) \,\|\, p(x)\right)\right]$, i.e. in expectation over the split outcomes (a sketch follows this list)
  • Mutual information is used in decision trees and for feature selection: decision trees choose splits that reduce the entropy of the labels
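A minimal sketch with toy labels and a toy binary split (my own example, not from the slides): information gain computed as $H(\text{labels}) - H(\text{labels} \mid \text{split})$, which is the expected KL divergence form above.

```python
# A minimal sketch: information gain of a candidate split on toy data,
# computed as H(labels) - H(labels | split).
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return np.sum(p * np.log2(1 / p))

# Toy binary labels and a binary split over the same 10 samples.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
split  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # side each sample falls on

h_parent = entropy(np.bincount(labels))
h_children = sum((split == s).mean() * entropy(np.bincount(labels[split == s]))
                 for s in np.unique(split))
print("information gain:", h_parent - h_children)   # about 0.61 bits
```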

End